(feat): `read_lazy` for whole `AnnData` lazy-loading + `xarray` reading + `read_elem_as_dask` -> `read_elem_lazy` #1247

ilan-gold · 2023-11-30T12:51:54Z

This PR is a lighter-weight version of #947: it uses the original AnnData object as the class that holds obs and var as xr.Dataset objects.

Closes #951 (Dask and Zarr not loading obsp and obsm from remote s3) and closes #981 (lazy dataframes in .obs and .var with backed="r" mode).

- Tests added
- Release note added (or unnecessary)

codecov · 2023-12-07T16:24:55Z

Codecov Report

Attention: Patch coverage is 92.60780% with 36 lines in your changes missing coverage. Please review.

Project coverage is 84.26%. Comparing base (b2c7a21) to head (04dc77e).

Files with missing lines	Patch %	Lines
src/anndata/experimental/backed/_lazy_arrays.py	94.17%	6 Missing ⚠️
src/anndata/tests/helpers.py	73.91%	6 Missing ⚠️
src/anndata/_core/storage.py	50.00%	5 Missing ⚠️
src/anndata/experimental/backed/_xarray.py	92.53%	5 Missing ⚠️
src/anndata/_io/specs/lazy_methods.py	93.93%	4 Missing ⚠️
src/anndata/_core/index.py	66.66%	3 Missing ⚠️
src/anndata/experimental/backed/_compat.py	84.21%	3 Missing ⚠️
src/anndata/_io/specs/registry.py	88.23%	2 Missing ⚠️
src/anndata/experimental/backed/_io.py	96.15%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1247      +/-   ##
==========================================
- Coverage   86.11%   84.26%   -1.85%     
==========================================
  Files          40       45       +5     
  Lines        6242     6673     +431     
==========================================
+ Hits         5375     5623     +248     
- Misses        867     1050     +183
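
As a sanity check on the diff above, each percentage is hits divided by lines. The sketch below recomputes them from the counts in the report. The patch size of 487 lines is not stated by Codecov; it is inferred here from the 36 missing lines and the 92.60780% patch coverage.

```python
# Counts copied from the Codecov diff above: (hits, lines) for base and head.
BASE = (5375, 6242)
HEAD = (5623, 6673)


def coverage(hits: int, lines: int) -> float:
    """Coverage in percent: covered lines over total lines."""
    return 100 * hits / lines


base_pct = coverage(*BASE)
head_pct = coverage(*HEAD)
delta = head_pct - base_pct

# Assumption: the patch size is inferred, not reported. 36 missing lines at
# 92.60780% patch coverage implies 36 / (1 - 0.9260780) changed lines.
PATCH_MISSING = 36
patch_lines = round(PATCH_MISSING / (1 - 0.9260780))
patch_pct = coverage(patch_lines - PATCH_MISSING, patch_lines)

print(f"base {base_pct:.2f}%  head {head_pct:.2f}%  delta {delta:+.2f}%")
print(f"patch {patch_pct:.5f}% over {patch_lines} lines")
```

The miss counts in the diff follow the same way: lines minus hits for base and for head.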

Files with missing lines	Coverage Δ
src/anndata/_core/aligned_df.py	`95.83% <100.00%> (+0.18%)`	⬆️
src/anndata/_core/anndata.py	`82.65% <100.00%> (+0.04%)`	⬆️
src/anndata/_core/merge.py	`85.44% <100.00%> (-8.46%)`	⬇️
src/anndata/_core/views.py	`85.40% <100.00%> (-5.35%)`	⬇️
src/anndata/_io/specs/__init__.py	`100.00% <ø> (ø)`
src/anndata/_io/specs/methods.py	`88.36% <100.00%> (-0.41%)`	⬇️
src/anndata/_io/zarr.py	`83.75% <100.00%> (+0.20%)`	⬆️
src/anndata/_types.py	`86.11% <100.00%> (+0.81%)`	⬆️
src/anndata/experimental/__init__.py	`100.00% <100.00%> (ø)`
src/anndata/experimental/backed/__init__.py	`100.00% <100.00%> (ø)`
... and 10 more

... and 3 files with indirect coverage changes

ilan-gold · 2024-07-23T13:05:18Z

@ivirshup @flying-sheep Not really looking for a thorough code review at the moment, more of a look at the structure of what we are exporting. The big changes are

- read_elem_as_dask -> read_elem_lazy becomes a more general method focused on reading obs and var lazily as xarray objects (although, as a side effect, it can now also read categoricals, nullables, etc.)
- read_backed is exported as a one-stop shop for reading everything backed at once, as far as possible

Do we want this way of doing things? Or is there some other route?

Separately, are the changes made to the core acceptable? After that, I think we can look into the specifics of the code I added. Or you can review that now, but I'd rather get big changes out of the way first.
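
To make the point about reading categoricals lazily concrete, here is a toy sketch. It is not the code in this PR: `LazyCategorical` and its two loader callables are hypothetical, and plain Python sequences stand in for the on-disk arrays. It only illustrates the general idea that nothing is read at construction, categories are read once on first use, and codes are read per slice.

```python
from __future__ import annotations

from functools import cached_property
from typing import Callable, Sequence


class LazyCategorical:
    """Toy stand-in for a lazily loaded categorical column (not anndata's class)."""

    def __init__(
        self,
        load_codes: Callable[[], Sequence[int]],
        load_categories: Callable[[], Sequence[str]],
    ) -> None:
        # Only the loader callables are stored; nothing is read yet.
        self._load_codes = load_codes
        self._load_categories = load_categories

    @cached_property
    def categories(self) -> tuple[str, ...]:
        # Runs once, on first access; the result is cached on the instance.
        return tuple(self._load_categories())

    def __getitem__(self, index: slice) -> list[str | None]:
        # Codes are read on every slice; a code of -1 marks a missing value.
        codes = self._load_codes()[index]
        return [self.categories[c] if c >= 0 else None for c in codes]
```

Constructing the object triggers no reads; the first slice reads the codes and then the categories, and later slices reuse the cached categories.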

ilan-gold · 2024-07-23T13:05:59Z

I will continue to make little changes to clean things up (this is still a draft!) but I think this structure is the way I would go. But maybe you have different ideas!

flying-sheep

I really like the approach.

Would of course be better if that stuff got upstreamed, but with only the category and mask handling being done by us, this is feasible I think, but I don’t have a lot of xarray experience.

There’s one hack in there that I really don’t want us to leave in, otherwise already looks quite clean.

I’ll take a deeper look once you’re done.

flying-sheep · 2025-02-20T12:41:53Z

tests/lazy/test_concat.py

+from anndata.experimental import read_lazy
+from anndata.tests.helpers import assert_equal, gen_adata
+
+from .conftest import ANNDATA_ELEMS, get_key_trackers_for_columns_on_axis


wait, I thought this doesn’t work. did they change that?

What didn't work? Importing from `conftest`?

ilan-gold · 2025-02-25T10:24:38Z

pyproject.toml

 ]
 dev-doc = ["towncrier>=24.8.0"] # release notes tool
+test-full = ["anndata[test,lazy]"]


Maybe a different name? Not sure about test-full

in scanpy we have

- test-min, which is used in the minimum deps job,
- test, which is a healthy subset of functionality, and
- test-full, which is everything (except for external I think)
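
For readers unfamiliar with self-referential extras such as `anndata[test,lazy]`, the sketch below shows how such an entry expands to the union of the extras it names. Only the `test-full` entry mirrors the diff above; the other table entries and the `resolve` helper are hypothetical and do not reflect the real contents of pyproject.toml or how pip resolves dependencies internally.

```python
from __future__ import annotations

import re

# Hypothetical extras table; only `test-full` is taken from the diff above.
EXTRAS = {
    "test-min": ["pytest"],
    "test": ["anndata[test-min]", "pytest-cov"],
    "lazy": ["xarray", "aiohttp"],
    "test-full": ["anndata[test,lazy]"],
}

# Matches a requirement that points back at the package's own extras.
SELF_REF = re.compile(r"^anndata\[(?P<extras>[^\]]+)\]$")


def resolve(extra: str, table: dict[str, list[str]]) -> set[str]:
    """Expand self-referential extras into a flat set of requirements."""
    resolved: set[str] = set()
    for requirement in table[extra]:
        match = SELF_REF.match(requirement)
        if match is None:
            resolved.add(requirement)
        else:
            # Recurse into each extra named inside the brackets.
            for nested in match["extras"].split(","):
                resolved |= resolve(nested.strip(), table)
    return resolved


print(sorted(resolve("test-full", EXTRAS)))
```

With this layering, test-full stays a one-line alias and picks up any change to the test or lazy extras automatically.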

flying-sheep · 2025-02-25T14:31:27Z

OK, so I went through all open conversations and new commits, and there’s almost nothing left:

- 4e6bb60 looks like the opposite of a change I would make. How were the tests not marked “properly” before? pytestmark should work and is less complex.
- I’d still like to see a more local/explicit version of the "experimental/backed" in str(request.node.path) check: #1247 (comment)
- I’m also still not happy about the deps here, it’s pretty unclear what causes them to be there. Maybe annotate non-obvious ones so we can check if we need them and remove them at some point instead of cargo-culting them into infinity? #1247 (comment)

ilan-gold · 2025-02-25T14:58:26Z

> pytestmark should work and is less complex.

Putting it at the top level of the conftest didn't work. I could have added it to each file individually, but that doesn't scale.

> I’m also still not happy about the deps here, it’s pretty unclear what causes them to be there. Maybe annotate non-obvious ones so we can check if we need them and remove them at some point instead of cargo-culting them into infinity?

That's fair - maybe let's wait until zarr v3. The reason is simply that "using remote data with zarr requires these, and otherwise you will get a RuntimeError".

> I’d still like to see a more local/explicit version of the "experimental/backed" in str(request.node.path)

I can redo this
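
For context on the pytestmark point: `pytestmark` is a module-level variable, so it applies to the tests collected from the module that defines it, and a `conftest.py` is not a test module. One way to mark a whole directory from `conftest.py` is a collection hook. The sketch below is one option, not necessarily what 4e6bb60 does; the marker name and the directory are hypothetical, and `item.path` assumes pytest 7 or newer.

```python
from pathlib import Path

# Hypothetical marker name and directory; neither is taken from the PR.
LAZY_MARKER = "needs_lazy_deps"
LAZY_TEST_DIR = Path("tests") / "lazy"


def is_lazy_test(path: Path, root: Path) -> bool:
    """Return True if `path` lies under `root / LAZY_TEST_DIR`."""
    try:
        path.relative_to(root / LAZY_TEST_DIR)
    except ValueError:
        return False
    return True


def pytest_collection_modifyitems(config, items):
    """conftest.py hook: mark every collected test below the lazy test directory."""
    for item in items:
        if is_lazy_test(Path(item.path), Path(config.rootpath)):
            # Node.add_marker accepts a marker name as a plain string.
            item.add_marker(LAZY_MARKER)
```

Compared with a substring test on the node path, resolving the path against the root directory keeps the check explicit about which directory is meant.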

flying-sheep

Looks great! One last thing:

src/testing/anndata/_pytest.py

ilan-gold · 2025-02-26T12:14:29Z

I will open follow-up PRs after this one to account for a few things but, for now, I am going to leave this unmerged because the zarr v3 PR should go in first.

I am very happy with the state of things :)

ilan-gold mentioned this pull request Nov 30, 2023

Dask and Zarr not loading obsp and obsm from remote s3 #951

Open

ilan-gold mentioned this pull request Jan 31, 2024

lazy dataframes in .obs and .var with backed="r" mode #981

Open

ilan-gold added this to the 0.11.0 milestone Jul 2, 2024

ilan-gold self-assigned this Jul 2, 2024

ilan-gold added the skip-gpu-ci label Jul 5, 2024

ilan-gold force-pushed the ig/xarray_compat branch from 68fcd2b to 6165f07 Compare July 5, 2024 14:16

ilan-gold changed the base branch from main to ig/read_dask_elem July 9, 2024 15:44

ilan-gold mentioned this pull request Jul 10, 2024

(feat): read_elem_as_dask method #1469

Merged

3 tasks

ilan-gold added 2 commits July 23, 2024 10:27

(fix): migrate to use read_elem infrastructure

fcb1763

Merge branch 'ig/read_dask_elem' into ig/xarray_compat

adcd48a

Base automatically changed from ig/read_dask_elem to main July 23, 2024 08:39

ilan-gold added 5 commits July 23, 2024 10:45

Merge branch 'main' into ig/xarray_compat

2a72ec0

(fix): no first access of categories

4c659a1

(fix): last small cleanups

d3a811a

(fix): try not runnign xarray tests

e852a74

(fix): oops! forgot one test to mark!

8c92a41

ilan-gold requested review from ivirshup and flying-sheep July 23, 2024 13:03

Merge branch 'main' into ig/xarray_compat

47be954

flying-sheep requested changes Aug 6, 2024

ci/scripts/min-deps.py (outdated, resolved)

pyproject.toml (outdated, resolved)

src/anndata/_core/anndata.py (outdated, resolved)

ilan-gold and others added 3 commits August 6, 2024 08:09

Update pyproject.toml

55f706f

Co-authored-by: Philipp A. <[email protected]>

(fix): change unused category function from method to function

6fa97f0

Merge branch 'main' into ig/xarray_compat

9e2e21d

flying-sheep reviewed Aug 6, 2024

pyproject.toml (outdated, resolved)

ilan-gold and others added 3 commits August 6, 2024 10:56

(fix): actually track keys instead of relying on deafultdict behavior

eb1237c

(chore): test unconsolidated warning

6724c62

Update pyproject.toml

53796a0

Co-authored-by: Philipp A. <[email protected]>

ilan-gold added 10 commits February 19, 2025 13:49

(fix): materialize dask array

3733dba

(feat): finish handling of nullable-string-array

c58f0cc

(fix): silence warning by using StringDtype directly instead of `string`

19973b0

(refactor): simplify xarray dim -- pandas index interplay

d7435c0

(chore): rename test files

5317718

(refactor): clarify test_concat_to_memory_var

ac144f9

(fix): throw away zeros for numpy-backed dask array

7728063

Merge branch 'main' into ig/xarray_compat

29b5914

(chore): loosen restriction on merged part

b8080f9

Merge branch 'ig/xarray_compat' of github.com:scverse/anndata into ig/xarray_compat

99acfc5

flying-sheep reviewed Feb 20, 2025

ilan-gold added 5 commits February 20, 2025 14:18

(feat): drop requirement that indices must match

1b9fab3

(fix): mark tests properly

4e6bb60

(fix): string dtype issues

f25afa1

(fix): ensure dummy indices are string typed instead of range for indexing ops

b46ae71

(chore): raise TypeError not ValueError

f0c182d

ilan-gold commented Feb 25, 2025

Merge branch 'main' into ig/xarray_compat

51021dd

Merge branch 'main' into ig/xarray_compat

ebbd43a

ilan-gold added 3 commits February 25, 2025 16:22

(chore): pyproject.toml comment

070aab9

(refactor): use doctest_needs marker

20e2d1b

(fix): remove runtime pytest dep

0b52461

flying-sheep approved these changes Feb 25, 2025

src/testing/anndata/_pytest.py (outdated, resolved)

flying-sheep and others added 3 commits February 25, 2025 18:02

skip conditionally

2d0b4f7

[pre-commit.ci] auto fixes from pre-commit.com hooks

bc1c599

for more information, see https://pre-commit.ci

(fix): pytest skipping

04dc77e

ilan-gold mentioned this pull request Feb 27, 2025

initial support for Dask DataFrames in obsm/varm #1880

Open

3 tasks
