Generate de novo release #654

KoalaQin · 2025-01-29T17:47:08Z

This cleans up some redundant dense MT functions and arguments, since we now have PR #651. Julia created the dense MT in job 0a3e98e75ba14b2f8341012515d11f8b before the PR was merged.

This is waiting on PR #760 in gnomad_methods.

Test run on chr20: b0729d42ea77466c8cc1fe9697e73af1

ch-kr

some questions and minor suggestions

ch-kr

just a couple more minor comments, then waiting on the other PR

ch-kr · 2025-02-03T21:03:39Z

gnomad_qc/v4/create_release/create_de_novo_release.py

@@ -135,34 +135,50 @@ def get_releasable_de_novo_calls_ht(
    )
    mt = annotate_adj(mt)

-    # Approximate the AD and PL fields when missing.
+    # Many of our larger datasets have the PL and AD fields for homref genotypes


Suggested change

# Many of our larger datasets have the PL and AD fields for homref genotypes

can you check if this is true for v3 as well? if yes, it's best to say v3 and v4 have the PL and ...

Yes, they are also NA in v3 VDS. FYI, I used this code to check:

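import hail as hl
from gnomad_qc.v3.resources.basics import get_gnomad_v3_vds
from gnomad.sample_qc.relatedness import filter_to_trios

# Load the v3 VDS restricted to a single interval.
vds = get_gnomad_v3_vds(
    split=True,
    filter_intervals=['chr11:113409605-113475691'],
)
# Keep only samples that belong to the trios in the fam file.
trios = hl.import_fam("gs://gnomad-qin/g1k.trios.fam", delimiter='\t')
vds = filter_to_trios(vds, trios)
# Densify and inspect the entry fields (PL and AD for homref genotypes).
mt = hl.vds.to_dense_mt(vds)
mt.entries().show(20)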

thank you for looking! this is good to know

ch-kr

a few minor suggestions and questions

ch-kr · 2025-02-21T16:15:12Z

gnomad_qc/v4/resources/sample_qc.py

+
+
+def trio_denovo_ht(
+    releasable: bool = True,


does a non-releasable version of this HT exist?

No, will remove.

ch-kr · 2025-02-21T16:16:46Z

gnomad_qc/v4/resources/release.py

+        ".all_confidences.filtered"
+        if by_confidence == "all"
+        else ".high_confidence.filtered"
+    )
    postfix = f".{datetime.today().strftime('%Y-%m-%d')}" if test else ""


basic question, is this how we've started naming test files?

I don't think so, Julia just added this to have different versions to inspect. I could remove it.
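For reference, the `postfix` line in the hunk above only changes the file name when `test` is true. A minimal sketch of that behavior follows; `de_novo_path_suffix` is a hypothetical helper, not the function in `release.py`, and the order in which the two pieces are joined is an assumption.

```python
from datetime import datetime


def de_novo_path_suffix(by_confidence: str = "high", test: bool = False) -> str:
    """Hypothetical helper mirroring the naming logic shown in the diff."""
    confidence = (
        ".all_confidences.filtered"
        if by_confidence == "all"
        else ".high_confidence.filtered"
    )
    # The date postfix is appended only for test outputs, so repeated test
    # runs on different days write to different paths.
    postfix = f".{datetime.today().strftime('%Y-%m-%d')}" if test else ""
    # Assumption: the postfix follows the confidence label in the file name.
    return f"{confidence}{postfix}"
```

With `test=False` the suffix carries no date, which is the behavior left once the postfix is removed.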

ch-kr · 2025-02-21T18:03:04Z

gnomad_qc/v4/create_release/create_de_novo_release.py


-    return selects
+    ht = ht.filter(ht.de_novo_call_info.is_de_novo).naive_coalesce(1000)


should this 1000 be an argument rather than hard coded? not sure how strict we've been about this in gnomad_qc

Yeah, I think I will make it an argument with a default of 1000 and use 1/10 for test.
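A rough sketch of what that argument handling could look like. `resolve_n_partitions` is a hypothetical name, and reading "1/10 for test" as one tenth of the requested value is an assumption, not something confirmed in this thread.

```python
def resolve_n_partitions(n_partitions: int = 1000, test: bool = False) -> int:
    """Return the partition count to hand to `Table.naive_coalesce`.

    Assumption: test runs use one tenth of the requested partitions,
    with a floor of one partition.
    """
    if n_partitions < 1:
        raise ValueError("n_partitions must be a positive integer")
    return max(1, n_partitions // 10) if test else n_partitions


# Intended use in the pipeline (not executed here, requires Hail):
# ht = ht.filter(ht.de_novo_call_info.is_de_novo)
# ht = ht.naive_coalesce(resolve_n_partitions(n_partitions, test))
```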

ch-kr · 2025-02-21T18:04:59Z

gnomad_qc/v4/create_release/create_de_novo_release.py

-    :return: Select expression dict.
-    """
-    return {"allele_info": ht.allele_info.drop("nonsplit_alleles")}
+    ht = ht.filter(~hl.is_missing(ht.de_novo_call_info.confidence))


this filter is in place to remove variants that were candidate de novos (per ht.de_novo_call_info.is_de_novo) but failed for another reason, right? do you think users would find it useful to see these variants in the Hail Table with all de novos?

Yes, those are the failed de novos. It will cause some kind of AC problem if we keep them, don't you think?

yeah, if we keep them, we'd have to calculate the AC across these sites (sites that were candidate de novos but failed for some other quality reason) and subtract them to report the various AC (AC_all, high confidence AC, etc).

could you ask in your ATGU presentation whether people think it would be useful to retain these variants? I suspect the ATGU community will include some users of this new feature, so they would be the best people to ask about this

I added two plots to my slides that show how many failed per chromosome. I think there are disproportionately many of them; what do you think?
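To make the subtraction described in this thread concrete, here is a small sketch on made-up data. The variant keys and confidence labels are hypothetical, and each call is assumed to contribute one allele to the count.

```python
from collections import Counter

# Hypothetical candidate de novo calls as (variant, confidence) pairs.
# A confidence of None stands for a candidate that failed for another reason.
calls = [
    ("chr20:100:A:T", "HIGH"),
    ("chr20:100:A:T", None),
    ("chr20:100:A:T", "MEDIUM"),
    ("chr20:200:G:C", None),
]


def de_novo_ac(calls):
    """Return per-variant AC after subtracting failed candidates."""
    ac_candidates = Counter(variant for variant, _ in calls)
    ac_failed = Counter(variant for variant, conf in calls if conf is None)
    return {v: ac_candidates[v] - ac_failed[v] for v in ac_candidates}


ac = de_novo_ac(calls)
```

Filtering out rows with a missing confidence before aggregating gives the same counts without the extra subtraction step, which is the trade-off being weighed here.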

ch-kr · 2025-02-21T18:07:58Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+    ht = process_consequences(ht, has_polyphen=False)
+
+    ht = ht.annotate(
+        alt_is_star=ht.alleles[1] == "*",


rather than alt_is_star, the annotation added earlier (mixed_site) might be more valuable. users can pretty easily check for themselves whether the alt is a star allele

I don't think they are the same thing:

Maybe we could keep both? I need to use alt_is_star to filter.

yep, they're different annotations -- I suggested keeping mixed_site in case you were trying to retain information about the locus for our users.

for alt_is_star specifically, it feels cleaner to filter directly using the logic here:

ht = ht.filter(ht.alleles[1] != "*").checkpoint(new_temp_file("denovo_no_star", "ht"))

as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well

Changed this, and I also moved the all-confidence table after it.
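A note on writing a negated equality filter like the one discussed in this thread: Python binds unary `~` more tightly than `==`, so the negation needs parentheses, or the condition can be written with `!=`. The sketch below only inspects how Python parses the expressions; `alt` is a placeholder name standing in for `ht.alleles[1]`.

```python
import ast


def top_level_op(expr: str) -> str:
    """Return the class name of the outermost node of a Python expression."""
    return type(ast.parse(expr, mode="eval").body).__name__


# Without parentheses the outermost operation is the comparison,
# i.e. the expression is read as (~alt) == "*".
unparenthesized = top_level_op('~alt == "*"')

# With parentheses the outermost operation is the negation.
parenthesized = top_level_op('~(alt == "*")')

# Using != avoids the question entirely.
not_equal = top_level_op('alt != "*"')
```

In Hail terms, `ht.filter(ht.alleles[1] != "*")` or `ht.filter(~(ht.alleles[1] == "*"))` both express "keep rows whose alt allele is not a star".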

ch-kr · 2025-02-21T18:10:06Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+        gene_id=ht.vep.worst_csq_for_variant.gene_id,
+        transcript_id=ht.vep.worst_csq_for_variant.transcript_id,


now that I see this again, I wonder if these should be named worst_csq_gene_id and worst_csq_transcript_id. what do you think?

Yeah, I could change that.

ch-kr · 2025-02-21T18:15:07Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+            "splice_polypyrimidine_tract_variant",
+            "splice_donor_5th_base_variant",
+            "splice_region_variant",
+            "splice_donor_region_variant",


maybe you should also remove incomplete_terminal_codon_variant ("A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed"), coding_sequence_variant ("A sequence variant that changes the coding sequence"), and coding_transcript_variant ("A transcript variant of a protein coding gene") here

I added them, but coding_sequence_variant was in Kaitlin's and Sarah's list, and incomplete_terminal_codon_variant was in Sarah's.

for consistency then, let's keep coding_sequence_variant
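The outcome of this thread can be sketched as a plain set difference. The excluded set below contains only the terms named in the hunk and in the comments above; the real list in `create_de_novo_release.py` may contain more, and `drop_excluded_terms` is a hypothetical helper.

```python
# Terms to drop, as named in this thread. coding_sequence_variant is
# deliberately absent because it is kept for consistency with other lists.
EXCLUDED_TERMS = {
    "splice_polypyrimidine_tract_variant",
    "splice_donor_5th_base_variant",
    "splice_region_variant",
    "splice_donor_region_variant",
    "incomplete_terminal_codon_variant",
    "coding_transcript_variant",
}


def drop_excluded_terms(terms):
    """Return the input terms, in order, minus the excluded ones."""
    return [t for t in terms if t not in EXCLUDED_TERMS]


# Hypothetical input list, for illustration only.
kept = drop_excluded_terms(
    ["missense_variant", "splice_region_variant", "coding_sequence_variant"]
)
```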

ch-kr · 2025-02-21T18:15:54Z

gnomad_qc/v4/create_release/create_de_novo_release.py

-    # the fields are present in the HT.
-    final_fields = get_final_ht_fields(ht, schema=FINALIZED_SCHEMA)
-    return ht.select(*final_fields["rows"]).select_globals(*final_fields["globals"])
+    return ht_all_conf, ht


 def restructure_for_tsv(ht: hl.Table) -> hl.Table:


this docstring should note that the TSV only contains high confidence de novos

ch-kr

a few more comments

ch-kr · 2025-02-21T19:32:52Z

gnomad_qc/v4/create_release/create_de_novo_release.py

@@ -16,6 +16,7 @@
    CSQ_CODING_MEDIUM_IMPACT,
    process_consequences,
 )
+from hail.ggplot.utils import n_partitions


wow, cool -- I didn't know this existed. how did you find it?

also, this import doesn't appear to be used

Julia added it, I never used it.

KoalaQin added 3 commits January 28, 2025 16:02

Generate de novo calls with new function

7812365

Add Julia's changes for dense MT and clean up

4b1355a

revert a comment

277e2fc

KoalaQin assigned KoalaQin and ch-kr Jan 29, 2025

KoalaQin requested a review from ch-kr January 29, 2025 17:54

ch-kr requested changes Jan 29, 2025

KoalaQin and others added 5 commits January 30, 2025 17:13

Apply suggestions from code review

d3a6d3d

Co-authored-by: Katherine Chao <[email protected]>

Address review comments

da2af71

Merge branch 'qh/denovo' of https://github.com/broadinstitute/gnomad_qc…

e157ed3

… into qh/denovo

Change to use new functions from gnomad_methods

dba9f98

import new function

73fd7da

KoalaQin requested a review from ch-kr February 3, 2025 15:48

Select entries and transmute colnames

7c8bc8b

ch-kr reviewed Feb 3, 2025

KoalaQin added 6 commits February 6, 2025 14:10

Change the filter step

d6fbcb4

Modify comment about missing AD

0966825

Reflect gnomad_methods changes

af47133

Add aggregate and annnotate function

db83d47

Clean up the aggregate function

79b5ae2

Clean up the functions

aa66a8a

KoalaQin requested a review from ch-kr February 21, 2025 16:04

Change docstring

f3f67f3

KoalaQin changed the title ~~Generate de novo calls~~ Generate de novo release Feb 21, 2025

ch-kr requested changes Feb 21, 2025

KoalaQin added 4 commits February 21, 2025 14:04

Address review comments

d3a15dc

remove postfix

d793b96

Correct errors

bb179fd

Add mixed_site

fb279ed

KoalaQin requested a review from ch-kr February 21, 2025 19:21

Put coding_sequence_variant back

8f5c4ff

ch-kr reviewed Feb 21, 2025

Address review comments

def7de5
