Skip to content

Conversation

klaricch
Copy link
Contributor

@klaricch klaricch commented Aug 5, 2024

Calculate LOEUF and raw z-score per genetic ancestry group

@klaricch klaricch requested a review from jkgoodrich August 5, 2024 14:43
@klaricch klaricch assigned klaricch and jkgoodrich and unassigned jkgoodrich Aug 8, 2024
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a little bit more thought. I think instead of skipping downsampling, there should just be an option to split by pop in addition. Then we should really modify the downsampling and/or pop filtering to use the filter_arrays_by_meta, which was specifically designed to do this sort of metadata filtering. I added a little bit of a suggestion for how to do this, but it could be improved. After we decide what functions we are keeping/adding/modifying, we should add tests for those functions

)


def get_downsampling_freq_indices(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we really should be making use of filter_arrays_by_meta, which didn't exist when we added this in. We can remove this function since it's likely not used much outside this module, or if we are worried about doing that, we can use from deprecated import deprecated See below for the suggested use of filter_arrays_by_meta



def downsampling_counts_expr(
def pop_counts_expr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just change to get_counts_expr

:return: Aggregation Expression for an array of the variant counts in downsamplings
for specified population.
"""
# Get an array of indices sorted by "downsampling" key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would change these lines to this to use filter_arrays_by_meta

might need to add an option for sex if you don't want to include it, or something to direct match.

Could still split out the filtering/sorting portion from the counts function if wanted, but then I think the count function should really just take the already filtered function.

Suggested change
# Get an array of indices sorted by "downsampling" key.
# Determine the genetic ancestry label to use for filtering by the
# `genetic_ancestry_label` key in `freq_meta_expr`.
genetic_ancestry_label = genetic_ancestry_label or hl.eval(
hl.literal(["gen_anc", "pop"]).find(
lambda c: freq_meta_expr.flatmap(lambda x: x.keys()).contains(c)
)
)
# Build filters to apply to `freq_meta_expr` to get the desired metadata groups for
# the aggregate count expression.
filters = {
"group": [variant_quality],
genetic_ancestry_label: [pop],
**({"subset": [subset]} if subset else {}),
**(
{"downsampling": list(map(str, downsamplings))}
if not skip_downsamplings and downsamplings is not None
else {}
)
}
filters = [(filters, {})]
if subset is None:
filters.append((["subset"], {"keep": False}))
if skip_downsamplings:
filters.append((["downsampling"], {"keep": False}))
elif downsamplings is None:
filters.append((["downsampling"], {}))
# Apply filters to `freq_meta_expr` and it's indices to get the indices of the
# desired metadata groups.
filtered_meta = freq_meta_expr
indices = hl.enumerate(freq_meta_expr)
for f, params in filters:
filtered_meta, filtered_idx = filter_arrays_by_meta(
filtered_meta,
indices,
items_to_filter=f,
**params
)
# Get an array of indices and meta dictionaries sorted by "downsampling" key if
# downsamplings are not skipped.
if not skip_downsamplings:
indices = hl.sorted(indices, key=lambda f: hl.int(f[1]["downsampling"]))
def _get_criteria(i: hl.expr.Int32Expression) -> hl.expr.Int32Expression:
"""
Return 1 when variant meets specified criteria (`singleton` or `max_af`), if requested, or with an AC > 0.
:param i: The index of a downsampling.
:return: Returns 1 if the variant in the downsampling with specified index met
the criteria. Otherwise, returns 0.
"""
if singleton:
return hl.int(freq_expr[i].AC == 1)
elif max_af:
return hl.int((freq_expr[i].AC > 0) & (freq_expr[i].AF <= max_af))
else:
return hl.int(freq_expr[i].AC > 0)
# Map `_get_criteria` function to each downsampling indexed by `sorted_indices` to
# generate a list of 1's and 0's for each variant, where the length of the array is
# the total number of downsamplings for the specified population and each element
# in the array indicates if the variant in the downsampling indexed by
# `sorted_indices` meets the specified criteria.
# Return an array sum aggregation that aggregates arrays generated from mapping.
return hl.agg.array_sum(hl.map(_get_criteria, indices.map(lambda x: x[0])))

pop,
)
agg[f"downsampling_counts_{pop}"] = downsampling_counts_expr(
agg[f"downsampling_counts_{pop}"] = pop_counts_expr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If skipping downsamplings, we probably don't want to name it downsampling_counts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what if both downsamplings and a split by pop is wanted? It seems like this should be an addition rather than a skip

@jkgoodrich jkgoodrich self-assigned this Aug 12, 2024
@klaricch
Copy link
Contributor Author

will be replaced by #725

@klaricch klaricch closed this Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants