(feat): read_lazy for whole AnnData lazy-loading + xarray reading + read_elem_as_dask -> read_elem_lazy #1247

Open
wants to merge 471 commits into
base: main


ilan-gold
Contributor

@ilan-gold ilan-gold commented Nov 30, 2023

This PR is a lighter weight version of #947 that involves using the original AnnData object as the class to hold obs and var xr.Dataset.

codecov bot commented Dec 7, 2023

Codecov Report

Attention: Patch coverage is 92.60780% with 36 lines in your changes missing coverage. Please review.

Project coverage is 84.26%. Comparing base (b2c7a21) to head (04dc77e).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/anndata/experimental/backed/_lazy_arrays.py | 94.17% | 6 Missing ⚠️ |
| src/anndata/tests/helpers.py | 73.91% | 6 Missing ⚠️ |
| src/anndata/_core/storage.py | 50.00% | 5 Missing ⚠️ |
| src/anndata/experimental/backed/_xarray.py | 92.53% | 5 Missing ⚠️ |
| src/anndata/_io/specs/lazy_methods.py | 93.93% | 4 Missing ⚠️ |
| src/anndata/_core/index.py | 66.66% | 3 Missing ⚠️ |
| src/anndata/experimental/backed/_compat.py | 84.21% | 3 Missing ⚠️ |
| src/anndata/_io/specs/registry.py | 88.23% | 2 Missing ⚠️ |
| src/anndata/experimental/backed/_io.py | 96.15% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1247      +/-   ##
==========================================
- Coverage   86.11%   84.26%   -1.85%     
==========================================
  Files          40       45       +5     
  Lines        6242     6673     +431     
==========================================
+ Hits         5375     5623     +248     
- Misses        867     1050     +183     

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/anndata/_core/aligned_df.py | 95.83% <100.00%> (+0.18%) ⬆️ |
| src/anndata/_core/anndata.py | 82.65% <100.00%> (+0.04%) ⬆️ |
| src/anndata/_core/merge.py | 85.44% <100.00%> (-8.46%) ⬇️ |
| src/anndata/_core/views.py | 85.40% <100.00%> (-5.35%) ⬇️ |
| src/anndata/_io/specs/__init__.py | 100.00% <ø> (ø) |
| src/anndata/_io/specs/methods.py | 88.36% <100.00%> (-0.41%) ⬇️ |
| src/anndata/_io/zarr.py | 83.75% <100.00%> (+0.20%) ⬆️ |
| src/anndata/_types.py | 86.11% <100.00%> (+0.81%) ⬆️ |
| src/anndata/experimental/__init__.py | 100.00% <100.00%> (ø) |
| src/anndata/experimental/backed/__init__.py | 100.00% <100.00%> (ø) |

... and 10 more

... and 3 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.11.0 milestone Jul 2, 2024
@ilan-gold ilan-gold self-assigned this Jul 2, 2024
@ilan-gold ilan-gold changed the base branch from main to ig/read_dask_elem July 9, 2024 15:44
@ilan-gold ilan-gold mentioned this pull request Jul 10, 2024
3 tasks
Base automatically changed from ig/read_dask_elem to main July 23, 2024 08:39
@ilan-gold
Contributor Author

ilan-gold commented Jul 23, 2024

@ivirshup @flying-sheep Not really looking for a thorough code review at the moment, more of a look at the structure of what we are exporting. The big changes are

  1. read_elem_as_dask->read_elem_lazy becomes a more general method focused on reading obs and var lazily as xarray objects (although it can now also read categoricals, nullables, etc. as a side effect)
  2. read_backed is exported as a one-stop shop for reading everything at once, backed where possible

Do we want this way of doing things? Or is there some other route?

Separately, are the changes made to the core acceptable? After that, I think we can look into the specifics of the code I added. Or you can review that now, but I'd rather get big changes out of the way first.
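
For orientation, here is a minimal sketch of how the two exported entry points might be called. The names `read_lazy` and `read_elem_lazy` come from this PR's title and test imports; the exact call signatures, the hypothetical `load_lazily` helper, and the tiny dataset are assumptions for illustration only. Running it needs anndata with the optional xarray and dask dependencies and zarr.

```python
from pathlib import Path


def load_lazily(store: Path):
    """Write a tiny AnnData to zarr, then reopen it lazily (illustrative only)."""
    # Imports are kept local so that defining this helper does not require
    # the optional dependencies to be installed.
    import numpy as np
    import pandas as pd
    import zarr
    import anndata as ad
    from anndata.experimental import read_elem_lazy, read_lazy

    adata = ad.AnnData(
        X=np.ones((3, 2), dtype="float32"),
        obs=pd.DataFrame(
            {"group": pd.Categorical(["a", "b", "a"])},
            index=["c0", "c1", "c2"],
        ),
    )
    adata.write_zarr(store)

    # Whole-object entry point: obs/var are expected to come back as
    # xarray-backed objects and X as a dask array (assumption).
    lazy = read_lazy(store)

    # Element-level entry point: read only `obs` from the open store.
    group = zarr.open(store, mode="r")
    lazy_obs = read_elem_lazy(group["obs"])

    return lazy, lazy_obs
```

Nothing should be pulled into memory until a materializing call such as `lazy.to_memory()` is made on the returned object.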

@ilan-gold
Contributor Author

I will continue to make little changes to clean things up (this is still a draft!) but I think this structure is the way I would go. But maybe you have different ideas!

Member

@flying-sheep flying-sheep left a comment

I really like the approach.

It would of course be better if that stuff got upstreamed, but with only the category and mask handling being done by us, I think this is feasible, though I don’t have a lot of xarray experience.

There’s one hack in there that I really don’t want us to leave in; otherwise it already looks quite clean.

I’ll take a deeper look once you’re done.

from anndata.experimental import read_lazy
from anndata.tests.helpers import assert_equal, gen_adata

from .conftest import ANNDATA_ELEMS, get_key_trackers_for_columns_on_axis
Member

wait, I thought this doesn’t work. did they change that?

Contributor Author

What didn't work? Importing from `conftest`?

Member

yeah

]
dev-doc = ["towncrier>=24.8.0"] # release notes tool
test-full = ["anndata[test,lazy]"]
Contributor Author

@ilan-gold ilan-gold Feb 25, 2025

Maybe a different name? Not sure about test-full

Member

in scanpy we have

  • test-min which is used in the minimum deps job,
  • test which is a healthy subset of functionality, and
  • test-full, which is everything (except for external I think)

@flying-sheep
Member

flying-sheep commented Feb 25, 2025

OK, so I went through all open conversations and new commits, and there’s almost nothing left:

@ilan-gold
Contributor Author

> pytestmark should work and is less complex.

Putting it at the top level of the conftest didn't work, so I could have added it to the files individually, but that seems anti-scalable.

> I’m also still not happy about the deps here, it’s pretty unclear what causes them to be there. Maybe annotate non-obvious ones so we can check if we need them and remove them at some point instead of cargo-culting them into infinity?

That's fair - maybe let's wait until zarr v3. The reason is simply that "using remote data with zarr requires these, and otherwise you will get a RuntimeError"

> I’d still like to see a more local/explicit version of the `"experimental/backed" in str(request.node.path)`

I can redo this
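
One possible shape for a more explicit version of that path check, as a sketch only: compare path components instead of searching the stringified path. The helper name `is_under` and the constant `LAZY_TEST_DIR` are hypothetical and not taken from the PR.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath

# Hypothetical constant: the directory pair that identifies the lazy tests.
LAZY_TEST_DIR = ("experimental", "backed")


def is_under(path: PurePath, parts: tuple[str, ...] = LAZY_TEST_DIR) -> bool:
    """Return True if `parts` occur as consecutive components of `path`."""
    components = path.parts
    n = len(parts)
    # Slide a window of length n over the path components.
    return any(
        components[i : i + n] == parts
        for i in range(len(components) - n + 1)
    )


posix = PurePosixPath("tests/experimental/backed/test_x.py")
windows = PureWindowsPath(r"tests\experimental\backed\test_x.py")
lookalike = PurePosixPath("tests/not_experimental/backed_up/test_x.py")

# The component check is separator-independent and ignores partial names.
print(is_under(posix), is_under(windows), is_under(lookalike))
```

The substring check behaves differently on the last two paths: it misses the Windows path, whose string form uses backslashes, and it matches the lookalike directory names. In a fixture the helper could be called as `is_under(request.node.path)`, assuming a pytest version where `node.path` is a `pathlib.Path`.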

Member

@flying-sheep flying-sheep left a comment

Looks great! One last thing:

@ilan-gold
Contributor Author

I will open follow-up PRs after this one to account for a few things, but for now I am going to leave this unmerged because the zarr v3 PR should go in first.

I am very happy with the state of things :)

5 participants