Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate de novo release #654

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

Generate de novo release #654

wants to merge 22 commits into from

Conversation

KoalaQin
Copy link
Contributor

@KoalaQin KoalaQin commented Jan 29, 2025

This cleaned up some redundant dense mt functions and arguments since we had PR #651. Julia created the dense MT in job 0a3e98e75ba14b2f8341012515d11f8b before the PR was merged.

This is waiting on PR #760 in gnomad_methods.

Test run on chr20: b0729d42ea77466c8cc1fe9697e73af1

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some questions and minor suggestions

@KoalaQin KoalaQin requested a review from ch-kr February 3, 2025 15:48
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple more minor comments, then waiting on the other PR

@@ -135,34 +135,50 @@ def get_releasable_de_novo_calls_ht(
)
mt = annotate_adj(mt)

# Approximate the AD and PL fields when missing.
# Many of our larger datasets have the PL and AD fields for homref genotypes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Many of our larger datasets have the PL and AD fields for homref genotypes
# Many of our larger datasets have the PL and AD fields for homref genotypes

can you check if this is true for v3 as well? if yes, it's best to say v3 and v4 have the PL and ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are also NA in v3 VDS. FYI, I used this code to check:

from gnomad_qc.v3.resources.basics import get_gnomad_v3_vds
from gnomad.sample_qc.relatedness import filter_to_trios

vds = get_gnomad_v3_vds(split=True, 
                       filter_intervals=['chr11:113409605-113475691'],
                       )

trios = hl.import_fam("gs://gnomad-qin/g1k.trios.fam", delimiter='\t')
vds = filter_to_trios(vds, trios)

mt = hl.vds.to_dense_mt(vds)
mt.entries().show(20)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for looking! this is good to know

@KoalaQin KoalaQin requested a review from ch-kr February 21, 2025 16:04
@KoalaQin KoalaQin changed the title Generate de novo calls Generate de novo release Feb 21, 2025
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor suggestions and questions



def trio_denovo_ht(
releasable: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does a non-releasable version of this HT exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, will remove.

".all_confidences.filtered"
if by_confidence == "all"
else ".high_confidence.filtered"
)
postfix = f".{datetime.today().strftime('%Y-%m-%d')}" if test else ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basic question, is this how we've started naming test files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, Julia just added this to have different versions to inspect. I could remove it.


return selects
ht = ht.filter(ht.de_novo_call_info.is_de_novo).naive_coalesce(1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this 1000 be an argument rather than hard coded? not sure how strict we've been about this in gnomad_qc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think I will make it an argument default to 1000 and use 1/10 for test.

:return: Select expression dict.
"""
return {"allele_info": ht.allele_info.drop("nonsplit_alleles")}
ht = ht.filter(~hl.is_missing(ht.de_novo_call_info.confidence))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this filter is in place to remove variants that were candidate de novos (per ht.de_novo_call_info.is_de_novo) but failed for another reason, right? do you think users would find it useful to see these variants in the Hail Table with all de novos?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those are the failed de_novos, it will cause some kind of AC problem if we keep them, don't you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, if we keep them, we'd have to calculate the AC across these sites (sites that were candidate de novos but failed for some other quality reason) and subtract them to report the various AC (AC_all, high confidence AC, etc).

could you ask in your ATGU presentation whether people think it would be useful to retain these variants? I suspect the ATGU community will include some users of this new feature, so they would be the best people to ask about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added two plots in my slides that will show you how many failed per chromosome, I think they're disproportionately too many, what do you think?

ht = process_consequences(ht, has_polyphen=False)

ht = ht.annotate(
alt_is_star=ht.alleles[1] == "*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than alt_is_star, the annotation added earlier (mixed_site) might be more valuable. users can pretty easily check for themselves whether the alt is a star allele

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they are the same thing:
Screenshot 2025-02-21 at 1 34 10 PM
Maybe we could keep both? I need to use alt_is_star to filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, they're different annotations -- I suggested keeping mixed_site in case you were trying to retain information about the locus for our users.

for alt_is_star specifically, it feels cleaner to filter directly using the logic here:

ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))

as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this, and I also moved the all confidence table after it.

Comment on lines 180 to 181
gene_id=ht.vep.worst_csq_for_variant.gene_id,
transcript_id=ht.vep.worst_csq_for_variant.transcript_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that I see this again, I wonder if these should be named worst_csq_gene_id and worst_csq_transcript_id. what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I could change that.

"splice_polypyrimidine_tract_variant",
"splice_donor_5th_base_variant",
"splice_region_variant",
"splice_donor_region_variant",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you should also remove incomplete_terminal_codon_variant ("A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed"), coding_sequence_variant ("A sequence variant that changes the coding sequence"), and coding_transcript_variant ("A transcript variant of a protein coding gene") here

Copy link
Contributor Author

@KoalaQin KoalaQin Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added them, but coding_sequence_variant was in Kaitlin's and Sarah's list, and incomplete_terminal_codon_variant was in Sarah's.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency then, let's keep coding_sequence_variant

# the fields are present in the HT.
final_fields = get_final_ht_fields(ht, schema=FINALIZED_SCHEMA)
return ht.select(*final_fields["rows"]).select_globals(*final_fields["globals"])
return ht_all_conf, ht


def restructure_for_tsv(ht: hl.Table) -> hl.Table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this docstring should note that the TSV only contains high confidence de novos

@KoalaQin KoalaQin requested a review from ch-kr February 21, 2025 19:21
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more comments

@@ -16,6 +16,7 @@
CSQ_CODING_MEDIUM_IMPACT,
process_consequences,
)
from hail.ggplot.utils import n_partitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, cool -- I didn't know this existed. how did you find it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, this import doesn't appear to be used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Julia added it, I never used it.

ht = process_consequences(ht, has_polyphen=False)

ht = ht.annotate(
alt_is_star=ht.alleles[1] == "*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, they're different annotations -- I suggested keeping mixed_site in case you were trying to retain information about the locus for our users.

for alt_is_star specifically, it feels cleaner to filter directly using the logic here:

ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))

as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well

:return: Select expression dict.
"""
return {"allele_info": ht.allele_info.drop("nonsplit_alleles")}
ht = ht.filter(~hl.is_missing(ht.de_novo_call_info.confidence))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, if we keep them, we'd have to calculate the AC across these sites (sites that were candidate de novos but failed for some other quality reason) and subtract them to report the various AC (AC_all, high confidence AC, etc).

could you ask in your ATGU presentation whether people think it would be useful to retain these variants? I suspect the ATGU community will include some users of this new feature, so they would be the best people to ask about this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants