-
Notifications
You must be signed in to change notification settings - Fork 31
calculate constraint metrics per gen_anc #722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs a little bit more thought. I think instead of skipping downsampling, there should just be an option to split by pop in addition. Then we should really modify the downsampling and/or pop filtering to use the filter_arrays_by_meta
, which was specifically designed to do this sort of metadata filtering. I added a little bit of a suggestion for how to do this, but it could be improved. After we decide what functions we are keeping/adding/modifying, we should add tests for those functions
) | ||
|
||
|
||
def get_downsampling_freq_indices( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we really should be making use of filter_arrays_by_meta
, which didn't exist when we added this in. We can remove this function since it's likely not used much outside this module, or if we are worried about doing that, we can use from deprecated import deprecated
See below for the suggested use of filter_arrays_by_meta
|
||
|
||
def downsampling_counts_expr( | ||
def pop_counts_expr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just change to get_counts_expr
:return: Aggregation Expression for an array of the variant counts in downsamplings | ||
for specified population. | ||
""" | ||
# Get an array of indices sorted by "downsampling" key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would change these lines to this to use filter_arrays_by_meta
might need to add an option for sex if you don't want to include it, or something to direct match.
Could still split out the filtering/sorting portion from the counts function if wanted, but then I think the count function should really just take the already filtered function.
# Get an array of indices sorted by "downsampling" key. | |
# Determine the genetic ancestry label to use for filtering by the | |
# `genetic_ancestry_label` key in `freq_meta_expr`. | |
genetic_ancestry_label = genetic_ancestry_label or hl.eval( | |
hl.literal(["gen_anc", "pop"]).find( | |
lambda c: freq_meta_expr.flatmap(lambda x: x.keys()).contains(c) | |
) | |
) | |
# Build filters to apply to `freq_meta_expr` to get the desired metadata groups for | |
# the aggregate count expression. | |
filters = { | |
"group": [variant_quality], | |
genetic_ancestry_label: [pop], | |
**({"subset": [subset]} if subset else {}), | |
**( | |
{"downsampling": list(map(str, downsamplings))} | |
if not skip_downsamplings and downsamplings is not None | |
else {} | |
) | |
} | |
filters = [(filters, {})] | |
if subset is None: | |
filters.append((["subset"], {"keep": False})) | |
if skip_downsamplings: | |
filters.append((["downsampling"], {"keep": False})) | |
elif downsamplings is None: | |
filters.append((["downsampling"], {})) | |
# Apply filters to `freq_meta_expr` and it's indices to get the indices of the | |
# desired metadata groups. | |
filtered_meta = freq_meta_expr | |
indices = hl.enumerate(freq_meta_expr) | |
for f, params in filters: | |
filtered_meta, filtered_idx = filter_arrays_by_meta( | |
filtered_meta, | |
indices, | |
items_to_filter=f, | |
**params | |
) | |
# Get an array of indices and meta dictionaries sorted by "downsampling" key if | |
# downsamplings are not skipped. | |
if not skip_downsamplings: | |
indices = hl.sorted(indices, key=lambda f: hl.int(f[1]["downsampling"])) | |
def _get_criteria(i: hl.expr.Int32Expression) -> hl.expr.Int32Expression: | |
""" | |
Return 1 when variant meets specified criteria (`singleton` or `max_af`), if requested, or with an AC > 0. | |
:param i: The index of a downsampling. | |
:return: Returns 1 if the variant in the downsampling with specified index met | |
the criteria. Otherwise, returns 0. | |
""" | |
if singleton: | |
return hl.int(freq_expr[i].AC == 1) | |
elif max_af: | |
return hl.int((freq_expr[i].AC > 0) & (freq_expr[i].AF <= max_af)) | |
else: | |
return hl.int(freq_expr[i].AC > 0) | |
# Map `_get_criteria` function to each downsampling indexed by `sorted_indices` to | |
# generate a list of 1's and 0's for each variant, where the length of the array is | |
# the total number of downsamplings for the specified population and each element | |
# in the array indicates if the variant in the downsampling indexed by | |
# `sorted_indices` meets the specified criteria. | |
# Return an array sum aggregation that aggregates arrays generated from mapping. | |
return hl.agg.array_sum(hl.map(_get_criteria, indices.map(lambda x: x[0]))) |
pop, | ||
) | ||
agg[f"downsampling_counts_{pop}"] = downsampling_counts_expr( | ||
agg[f"downsampling_counts_{pop}"] = pop_counts_expr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If skipping downsamplings, we probably don't want to name it downsampling_counts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what if both downsamplings and a split by pop is wanted? It seems like this should be an addition rather than a skip
will be replaced by #725 |
Calculate LOEUF and raw z-score per genetic ancestry group