-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate de novo release #654
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some questions and minor suggestions
Co-authored-by: Katherine Chao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple more minor comments, then waiting on the other PR
@@ -135,34 +135,50 @@ def get_releasable_de_novo_calls_ht( | |||
) | |||
mt = annotate_adj(mt) | |||
|
|||
# Approximate the AD and PL fields when missing. | |||
# Many of our larger datasets have the PL and AD fields for homref genotypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Many of our larger datasets have the PL and AD fields for homref genotypes | |
# Many of our larger datasets have the PL and AD fields for homref genotypes |
can you check if this is true for v3 as well? if yes, it's best to say v3 and v4 have the PL and ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are also NA in v3 VDS. FYI, I used this code to check:
from gnomad_qc.v3.resources.basics import get_gnomad_v3_vds
from gnomad.sample_qc.relatedness import filter_to_trios
vds = get_gnomad_v3_vds(split=True,
filter_intervals=['chr11:113409605-113475691'],
)
trios = hl.import_fam("gs://gnomad-qin/g1k.trios.fam", delimiter='\t')
vds = filter_to_trios(vds, trios)
mt = hl.vds.to_dense_mt(vds)
mt.entries().show(20)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for looking! this is good to know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few minor suggestions and questions
gnomad_qc/v4/resources/sample_qc.py
Outdated
|
||
|
||
def trio_denovo_ht( | ||
releasable: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does a non-releasable version of this HT exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, will remove.
gnomad_qc/v4/resources/release.py
Outdated
".all_confidences.filtered" | ||
if by_confidence == "all" | ||
else ".high_confidence.filtered" | ||
) | ||
postfix = f".{datetime.today().strftime('%Y-%m-%d')}" if test else "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basic question, is this how we've started naming test files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, Julia just added this to have different versions to inspect. I could remove it.
|
||
return selects | ||
ht = ht.filter(ht.de_novo_call_info.is_de_novo).naive_coalesce(1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this 1000 be an argument rather than hard coded? not sure how strict we've been about this in gnomad_qc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think I will make it an argument default to 1000 and use 1/10 for test.
:return: Select expression dict. | ||
""" | ||
return {"allele_info": ht.allele_info.drop("nonsplit_alleles")} | ||
ht = ht.filter(~hl.is_missing(ht.de_novo_call_info.confidence)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this filter is in place to remove variants that were candidate de novos (per ht.de_novo_call_info.is_de_novo
) but failed for another reason, right? do you think users would find it useful to see these variants in the Hail Table with all de novos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, those are the failed de_novos, it will cause some kind of AC problem if we keep them, don't you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, if we keep them, we'd have to calculate the AC across these sites (sites that were candidate de novos but failed for some other quality reason) and subtract them to report the various AC (AC_all, high confidence AC, etc).
could you ask in your ATGU presentation whether people think it would be useful to retain these variants? I suspect the ATGU community will include some users of this new feature, so they would be the best people to ask about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added two plots in my slides that will show you how many failed per chromosome, I think they're disproportionately too many, what do you think?
ht = process_consequences(ht, has_polyphen=False) | ||
|
||
ht = ht.annotate( | ||
alt_is_star=ht.alleles[1] == "*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than alt_is_star
, the annotation added earlier (mixed_site
) might be more valuable. users can pretty easily check for themselves whether the alt is a star allele
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, they're different annotations -- I suggested keeping mixed_site
in case you were trying to retain information about the locus for our users.
for alt_is_star
specifically, it feels cleaner to filter directly using the logic here:
ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))
as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star
annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this, and I also moved the all confidence table after it.
gene_id=ht.vep.worst_csq_for_variant.gene_id, | ||
transcript_id=ht.vep.worst_csq_for_variant.transcript_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that I see this again, I wonder if these should be named worst_csq_gene_id
and worst_csq_transcript_id
. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I could change that.
"splice_polypyrimidine_tract_variant", | ||
"splice_donor_5th_base_variant", | ||
"splice_region_variant", | ||
"splice_donor_region_variant", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you should also remove incomplete_terminal_codon_variant
("A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed"), coding_sequence_variant
("A sequence variant that changes the coding sequence"), and coding_transcript_variant
("A transcript variant of a protein coding gene") here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added them, but coding_sequence_variant
was in Kaitlin's and Sarah's list, and incomplete_terminal_codon_variant
was in Sarah's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for consistency then, let's keep coding_sequence_variant
# the fields are present in the HT. | ||
final_fields = get_final_ht_fields(ht, schema=FINALIZED_SCHEMA) | ||
return ht.select(*final_fields["rows"]).select_globals(*final_fields["globals"]) | ||
return ht_all_conf, ht | ||
|
||
|
||
def restructure_for_tsv(ht: hl.Table) -> hl.Table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this docstring should note that the TSV only contains high confidence de novos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few more comments
@@ -16,6 +16,7 @@ | |||
CSQ_CODING_MEDIUM_IMPACT, | |||
process_consequences, | |||
) | |||
from hail.ggplot.utils import n_partitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, cool -- I didn't know this existed. how did you find it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, this import doesn't appear to be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Julia added it, I never used it.
ht = process_consequences(ht, has_polyphen=False) | ||
|
||
ht = ht.annotate( | ||
alt_is_star=ht.alleles[1] == "*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, they're different annotations -- I suggested keeping mixed_site
in case you were trying to retain information about the locus for our users.
for alt_is_star
specifically, it feels cleaner to filter directly using the logic here:
ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))
as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star
annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well
:return: Select expression dict. | ||
""" | ||
return {"allele_info": ht.allele_info.drop("nonsplit_alleles")} | ||
ht = ht.filter(~hl.is_missing(ht.de_novo_call_info.confidence)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, if we keep them, we'd have to calculate the AC across these sites (sites that were candidate de novos but failed for some other quality reason) and subtract them to report the various AC (AC_all, high confidence AC, etc).
could you ask in your ATGU presentation whether people think it would be useful to retain these variants? I suspect the ATGU community will include some users of this new feature, so they would be the best people to ask about this
This cleaned up some redundant dense mt functions and arguments since we had PR #651. Julia created the dense MT in job 0a3e98e75ba14b2f8341012515d11f8b before the PR was merged.
This is waiting on PR #760 in gnomad_methods.
Test run on chr20: b0729d42ea77466c8cc1fe9697e73af1