Add additional histogram to MarkDuplicatesSpark to match Picard version #6155
Comments
Based on my understanding of the changes in broadinstitute/picard#569, this might actually be easier than I first expected. Unfortunately, the metrics collection code is one of the parts of MarkDuplicatesSpark that cannot depend on Picard classes, so we are going to have to reimplement this histogram ourselves. It looks like it would require that we serialize one additional integer for every duplicate set: the total number of elements in the set, as opposed to the number of optical duplicates in the set. Since this involves changing the Spark driving code for the tool, it is somewhat non-trivial.
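To make the idea concrete, here is a minimal sketch of carrying that extra integer and tallying it. It assumes the histogram simply counts duplicate sets by their total size. Every name in it (`DuplicateSetSizeHistogramSketch`, `DuplicateSetSummary`, `buildHistogram`) is hypothetical and does not come from GATK or Picard, and it ignores how the values would actually be serialized through Spark.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: none of these names exist in GATK or Picard.
final class DuplicateSetSizeHistogramSketch {

    /** Hypothetical per-set summary; totalSize is the one additional integer. */
    static final class DuplicateSetSummary {
        final int totalSize;          // total number of elements in the duplicate set
        final int opticalDuplicates;  // number of optical duplicates in the set

        DuplicateSetSummary(int totalSize, int opticalDuplicates) {
            this.totalSize = totalSize;
            this.opticalDuplicates = opticalDuplicates;
        }
    }

    /** Maps each duplicate set size to the number of sets with that size (assumed histogram shape). */
    static Map<Integer, Long> buildHistogram(List<DuplicateSetSummary> summaries) {
        // TreeMap keeps bins sorted by set size, which is convenient when writing a metrics file.
        Map<Integer, Long> histogram = new TreeMap<>();
        for (DuplicateSetSummary summary : summaries) {
            histogram.merge(summary.totalSize, 1L, Long::sum);
        }
        return histogram;
    }
}
```

In the real tool the per-set values would presumably be collected on the executors and combined on the driver, which is the part that touches the Spark driving code.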
I started investigating how to do this. Here are a few notes:
It's not clear to me where the equivalent code would live in the GATK Spark implementation. It looks like
A new histogram was added to the MarkDuplicates output in Picard in PR broadinstitute/picard#569. We should add this histogram to the Spark version as well.