Reading remote .zarr store has issue with reading shapes.parquet file #878

Open
@adkinsrs

Below is the representation of the SpatialData object when it is read from a local copy of the store. This is a Visium HD dataset that I originally created with spatialdata_io.visium_hd, followed by some post-processing.

SpatialData object, with associated Zarr store: /<path>/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr
├── Images
│     ├── 'spatialdata_hires_image': DataArray[cyx] (3, 4352, 6000)
│     └── 'spatialdata_lowres_image': DataArray[cyx] (3, 435, 600)
├── Shapes
│     └── 'spatialdata_square_008um': GeoDataFrame shape: (127839, 1) (2D shapes)
└── Tables
      ├── 'square_008um': AnnData (127839, 19059)
      └── 'table': AnnData (127839, 19059)
with coordinate systems:
    ▸ 'downscaled_hires', with elements:
        spatialdata_hires_image (Images), spatialdata_square_008um (Shapes)
    ▸ 'downscaled_lowres', with elements:
        spatialdata_lowres_image (Images), spatialdata_square_008um (Shapes)
    ▸ 'global', with elements:
        spatialdata_square_008um (Shapes)

Reproducible example

This is a public dataset, and the Zarr store should be downloadable.

import spatialdata as sd
rem_path = "https://devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr"
sdata = sd.read_zarr(rem_path)
# ERROR

The following call reads in fine, but has other issues (which I will document in separate tickets):

sdata = sd.read_zarr(rem_path, selection=["images", "tables"])

Describe the bug
When I attempt to read a publicly accessible remote Zarr dataset, it seems that PyArrow is dropping one of the "/" characters in the https URI for the "shapes.parquet" file (the traceback shows "https:/" instead of "https://"). I'm not sure whether this is an issue in that package or somewhere else in the stack (including something on my end).

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 653, in _read_parquet_schema_and_metadata
    schema = parquet.ParquetDataset(path, filesystem=filesystem, **kwargs).schema
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1348, in __init__
    finfo = filesystem.get_file_info(path_or_paths)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 590, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected a local filesystem path, got a URI: 'https:/devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr/shapes/spatialdata_square_008um/shapes.parquet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_core/spatialdata.py", line 1850, in read
    return read_zarr(file_path, selection=selection)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_io/io_zarr.py", line 121, in read_zarr
    shapes[subgroup_name] = _read_shapes(f_elem_store)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/spatialdata/_io/io_shapes.py", line 54, in _read_shapes
    geo_df = read_parquet(path)
             ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 751, in _read_parquet
    schema, metadata = _read_parquet_schema_and_metadata(path, filesystem)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/geopandas/io/arrow.py", line 655, in _read_parquet_schema_and_metadata
    schema = parquet.read_schema(path, filesystem=filesystem)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 2339, in read_schema
    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pyarrow/fs.py", line 179, in _resolve_filesystem_and_path
    filesystem, path = FileSystem.from_uri(path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 477, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: https:/devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr/shapes/spatialdata_square_008um/shapes.parquet
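
One possible explanation for the "https:/" in both error messages, offered as a guess rather than a confirmed diagnosis: if the URL is handled as a filesystem path anywhere before it reaches PyArrow, path normalization collapses the "//" after the scheme. The sketch below reproduces that effect with the standard library only. Whether spatialdata actually builds the path this way is an assumption, and `join_url` is a hypothetical helper, not part of any library.

```python
from pathlib import PurePosixPath

url = "https://devel.umgear.org/datasets/spatial/11692b64-b34a-4dbe-adc9-784a87a7a856.zarr"

# Treating the URL as a POSIX path drops empty components, so the "//"
# that follows "https:" collapses into a single "/".
as_path = PurePosixPath(url) / "shapes" / "spatialdata_square_008um" / "shapes.parquet"
collapsed = str(as_path)


def join_url(base: str, *parts: str) -> str:
    """Hypothetical helper: append segments with plain string handling,
    leaving the scheme separator untouched."""
    return "/".join([base.rstrip("/"), *(p.strip("/") for p in parts)])


joined = join_url(url, "shapes", "spatialdata_square_008um", "shapes.parquet")

print(collapsed)  # starts with "https:/devel.umgear.org", as in the traceback
print(joined)     # starts with "https://devel.umgear.org"
```

If this is the cause, the fix would belong wherever the element path is assembled, not in PyArrow itself.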

Expected behavior
The SpatialData object is created successfully, including the shapes element.

Desktop (optional):

  • Tested on macOS Sequoia 15.3 as well as in a Dockerized Ubuntu:jammy image

Additional context
Relevant package versions are listed below. If you need me to dig deeper, let me know.

Python 3.12.7

spatialdata==0.3.0
spatialdata_io==0.1.6
pandas==2.2.1
anndata==0.10.6
