
Add additional histogram to MarkDuplicatesSpark to match Picard version #6155

Open
droazen opened this issue Sep 11, 2019 · 2 comments
droazen commented Sep 11, 2019

A new histogram was added to the MarkDuplicates output in Picard in PR broadinstitute/picard#569. We should add this histogram to the Spark version as well.

@jamesemery (Collaborator)
Based on my understanding of the changes in broadinstitute/picard#569, this might actually be easier than I first expected. Unfortunately, the metrics collection code is one of the parts of MarkDuplicatesSpark that cannot depend on Picard classes, so we will have to reimplement this histogram ourselves. It looks like it would require serializing one additional integer for every duplicate set (the count of the total number of elements in the set, as opposed to the count of optical duplicates in the set). Since it involves changing the Spark driving code for the tool, it is somewhat non-trivial.
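One way the "one additional integer per duplicate set" could be kept cheap during the shuffle is to pack the per-set total alongside the optical-duplicate count in a single long. This is purely an illustrative sketch, not GATK or Picard code; the class DuplicateSetCounts and its methods are hypothetical.

```java
/**
 * Hypothetical helper (not part of GATK/Picard): packs the two per-duplicate-set
 * counts -- total set size and optical-duplicate count -- into one long so that
 * only a single extra primitive travels through the Spark shuffle.
 */
public class DuplicateSetCounts {
    /** High 32 bits: optical-duplicate count; low 32 bits: total set size. */
    public static long pack(final int opticalDuplicates, final int setSize) {
        return ((long) opticalDuplicates << 32) | (setSize & 0xFFFFFFFFL);
    }

    public static int opticalDuplicates(final long packed) {
        return (int) (packed >>> 32);
    }

    public static int setSize(final long packed) {
        return (int) packed;
    }
}
```

A round trip preserves both counts, so the existing per-set payload would only grow by one long rather than a boxed object.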

@droazen droazen modified the milestones: Engine-Q3-2019, GATK-Triaged-Issues Sep 23, 2019
@droazen droazen assigned tomwhite and unassigned jamesemery Sep 23, 2019
tomwhite commented Oct 7, 2019

I started investigating how to do this; here are a few notes:

  • The change in Picard's MarkDuplicates added a histogram for counts of (all) duplicates, optical duplicates, and non-optical duplicates.
  • The histogram is serialized to text in the metrics file. (I believe there was no histogram before this change.)
  • The duplicates are found by sorting the file and breaking reads into chunks, where each chunk contains reads that are duplicates. (See MarkDuplicates#generateDuplicateIndexes)

It's not clear to me where the equivalent code would live in the GATK Spark implementation. It looks like MarkDuplicatesSparkUtils#markDuplicateRecords is where the duplicate counts can be obtained, but I'm not sure whether the code that uses this method (MarkDuplicatesSpark#mark) can piece together the counts for the histogram. Even if it could, the return type of MarkDuplicatesSpark#mark is JavaRDD<GATKRead>, which would need altering to incorporate the three extra int fields for the counts. Thoughts @jamesemery?
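To make the notes above concrete, here is a rough sketch of how the three histograms could be accumulated once a per-set (total size, optical count) pair is available. The class name and bucketing are assumptions; Picard's actual histogram semantics in broadinstitute/picard#569 should be checked before reusing any of this.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Illustrative sketch (not the Picard/GATK implementation): builds histograms of
 * duplicate set sizes, split into all / optical / non-optical counts, from
 * per-set totals. Keys are set sizes; values are the number of sets of that size.
 */
public class DuplicateSetSizeHistograms {
    public final Map<Integer, Long> all = new TreeMap<>();
    public final Map<Integer, Long> optical = new TreeMap<>();
    public final Map<Integer, Long> nonOptical = new TreeMap<>();

    /** Record one duplicate set of totalReads reads, opticalDuplicates of which are optical. */
    public void addSet(final int totalReads, final int opticalDuplicates) {
        all.merge(totalReads, 1L, Long::sum);
        optical.merge(opticalDuplicates, 1L, Long::sum);
        nonOptical.merge(totalReads - opticalDuplicates, 1L, Long::sum);
    }
}
```

In a Spark setting this would presumably be an aggregation over the RDD of duplicate sets (e.g. via aggregate or an accumulator) rather than a mutable driver-side object, which is where the MarkDuplicatesSpark#mark return-type question bites.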

@tomwhite tomwhite assigned jamesemery and unassigned tomwhite Oct 28, 2019
@droazen droazen modified the milestones: GATK-Triaged-Issues, GATK-Priority-Backlog Oct 28, 2019
@droazen droazen removed this from the GATK-Priority-Backlog milestone Jun 22, 2020