
Add additional histogram to MarkDuplicatesSpark to match Picard version #6155

Open
droazen opened this issue Sep 11, 2019 · 2 comments
droazen commented Sep 11, 2019

A new histogram was added to the MarkDuplicates output in Picard in PR broadinstitute/picard#569. We should add this histogram to the Spark version as well.

@jamesemery (Collaborator)
Based on my understanding of the changes in broadinstitute/picard#569, this might actually be easier than I first expected. Unfortunately, the metrics collection code is one of the parts of MarkDuplicatesSpark that cannot depend on Picard classes, so we will have to reimplement this histogram ourselves. It looks like it would require serializing one additional integer for every duplicate set (the count of the total number of elements in the set, as opposed to the count of optical duplicates in the set). Since it involves changing the Spark driving code for the tool, it is somewhat non-trivial.
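One way the "one additional integer per duplicate set" could be kept cheap during the shuffle is to pack the per-set total alongside the optical-duplicate count in a single long. This is purely an illustrative sketch, not GATK or Picard code; the class DuplicateSetCounts and its methods are hypothetical.

```java
/**
 * Hypothetical helper (not part of GATK/Picard): packs the two per-duplicate-set
 * counts -- total set size and optical-duplicate count -- into one long so that
 * only a single extra primitive travels through the Spark shuffle.
 */
public class DuplicateSetCounts {
    /** High 32 bits: optical-duplicate count; low 32 bits: total set size. */
    public static long pack(final int opticalDuplicates, final int setSize) {
        return ((long) opticalDuplicates << 32) | (setSize & 0xFFFFFFFFL);
    }

    public static int opticalDuplicates(final long packed) {
        return (int) (packed >>> 32);
    }

    public static int setSize(final long packed) {
        return (int) packed;
    }
}
```

A round trip preserves both counts, so the existing per-set payload would only grow by one long rather than a boxed object.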

@droazen droazen modified the milestones: Engine-Q3-2019, GATK-Triaged-Issues Sep 23, 2019
@droazen droazen assigned tomwhite and unassigned jamesemery Sep 23, 2019
tomwhite commented Oct 7, 2019

I started investigating how to do this; here are a few notes:

  • The change in Picard's MarkDuplicates added a histogram for counts of (all) duplicates, optical duplicates, and non-optical duplicates.
  • The histogram is serialized to text in the metrics file. (I believe there was no histogram before this change.)
  • The duplicates are found by sorting the file and breaking reads into chunks, where each chunk contains reads that are duplicates. (See MarkDuplicates#generateDuplicateIndexes)

It's not clear to me where the equivalent code would live in the GATK Spark implementation. It looks like MarkDuplicatesSparkUtils#markDuplicateRecords is where the duplicate counts can be obtained, but I'm not sure whether the code that uses this method (MarkDuplicatesSpark#mark) can piece together the counts for the histogram. Even if it could, the return type of MarkDuplicatesSpark#mark is JavaRDD<GATKRead>, which would need altering to incorporate the three extra int fields for the counts. Thoughts @jamesemery?
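To make the notes above concrete, here is a rough sketch of how the three histograms could be accumulated once a per-set (total size, optical count) pair is available. The class name and bucketing are assumptions; Picard's actual histogram semantics in broadinstitute/picard#569 should be checked before reusing any of this.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Illustrative sketch (not the Picard/GATK implementation): builds histograms of
 * duplicate set sizes, split into all / optical / non-optical counts, from
 * per-set totals. Keys are set sizes; values are the number of sets of that size.
 */
public class DuplicateSetSizeHistograms {
    public final Map<Integer, Long> all = new TreeMap<>();
    public final Map<Integer, Long> optical = new TreeMap<>();
    public final Map<Integer, Long> nonOptical = new TreeMap<>();

    /** Record one duplicate set of totalReads reads, opticalDuplicates of which are optical. */
    public void addSet(final int totalReads, final int opticalDuplicates) {
        all.merge(totalReads, 1L, Long::sum);
        optical.merge(opticalDuplicates, 1L, Long::sum);
        nonOptical.merge(totalReads - opticalDuplicates, 1L, Long::sum);
    }
}
```

In a Spark setting this would presumably be an aggregation over the RDD of duplicate sets (e.g. via aggregate or an accumulator) rather than a mutable driver-side object, which is where the MarkDuplicatesSpark#mark return-type question bites.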

@tomwhite tomwhite assigned jamesemery and unassigned tomwhite Oct 28, 2019
@droazen droazen modified the milestones: GATK-Triaged-Issues, GATK-Priority-Backlog Oct 28, 2019
@droazen droazen removed this from the GATK-Priority-Backlog milestone Jun 22, 2020