Proposal: Add Pangeo/ESGF Cloud Zarr data as search-index. #44

Open
jbusecke opened this issue Apr 24, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@jbusecke

I loved discovering this tool at the ESGF meeting.

I would like to follow up on @nocollier's suggestion to add the data ingested by the Pangeo/ESGF Cloud working group as a search index.

What we have.

  • Basically an intake-esm CSV file that has all the facets and a store (pointing to a GCS URL). The file is fully public.
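
To make the shape of such a file concrete, here is a minimal sketch of a facet search over a CSV catalog that returns the matching store URLs. The column names (`source_id`, `experiment_id`, `variable_id`, `zstore`), the rows, and the `gs://example-bucket/...` URLs are all placeholders chosen for illustration, not the contents of the actual file.

```python
import csv
import io

# Hypothetical catalog excerpt: column names and gs:// URLs are placeholders.
CATALOG_CSV = """\
source_id,experiment_id,variable_id,zstore
MODEL-A,historical,tas,gs://example-bucket/MODEL-A/historical/tas/
MODEL-A,ssp585,tas,gs://example-bucket/MODEL-A/ssp585/tas/
MODEL-B,historical,pr,gs://example-bucket/MODEL-B/historical/pr/
"""


def search_stores(csv_text, store_column="zstore", **facets):
    """Return the store URLs of all rows whose facets equal the given values."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row[store_column]
        for row in rows
        if all(row.get(key) == value for key, value in facets.items())
    ]


stores = search_stores(CATALOG_CSV, experiment_id="historical", variable_id="tas")
print(stores)
```

Each returned URL could then be handed to something like `xarray.open_zarr` for lazy access, assuming a filesystem backend for the bucket is installed.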

Requirements:

  • Hearing @nocollier speak about this, I realized that you currently (always?) cache to a local disk. Can we make this optional and allow opening a dataset lazily?
@nocollier
Member

At the moment I am only downloading, but that was not my long-term intention. In my mind we would have an interface such as to_dataset_dict(prefer_stream=True) or something similar. Then, as we look at the responses for the various access methods, instead of triggering the download we would just pass the appropriate handle to xarray. I say "prefer" because you may have datasets in your catalog without stream options and I wouldn't want things to fail. If you wanted to be sure you only ever streamed data, you could add access='kerchunk' or similar to your search to get only records with a particular access method. What do you think?

@nocollier nocollier added the enhancement New feature or request label Apr 30, 2024
@jbusecke
Author

jbusecke commented May 1, 2024

Could we just selectively bypass the _move_data step? This could be paired with passing some kwargs to the xarray open logic.
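
A rough sketch of what I mean, written without reference to the actual internals: the move/download step becomes an optional callable, and any extra keyword arguments are forwarded to whatever opens the data. `open_all`, `mover`, `opener`, and the two stand-in functions are hypothetical names for illustration, not intake-esgf code; in practice the opener might be `xarray.open_dataset`.

```python
def open_all(links, opener, mover=None, **open_kwargs):
    """Open each link with `opener`; if `mover` is given, localize the link first."""
    datasets = {}
    for key, link in links.items():
        # Skipping the mover means the remote link is passed straight through.
        target = mover(link) if mover is not None else link
        datasets[key] = opener(target, **open_kwargs)
    return datasets


# Stand-ins so the sketch runs without network access or xarray.
def fake_download(link):
    return "/tmp/cache/" + link.rsplit("/", 1)[-1]


def fake_open(target, **kwargs):
    return {"opened": target, "kwargs": kwargs}


links = {"tas": "https://example.org/data/tas.nc"}
streamed = open_all(links, fake_open, chunks={})          # no local copy
cached = open_all(links, fake_open, mover=fake_download)  # current behavior
```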

@nocollier
Member

Sorry we are dragging our heels a bit. The problem is that we need to refactor to_dataset_dict() and don't have time just now. I just need time with some other projects I have been ignoring prior to the ESGF meeting and then this is top priority.

I wrote the package without looking too deeply at intake-esm and also mainly thinking about download. So to_dataset_dict() grew into something that is messy and harder to refactor. In my mind a rework is coming that would:

  • Allow for the intake-esm indices to be configured as any other index currently supported. You may wish to include them as options, you may wish to have those be the only index you use.
  • Provide options to allow streaming. Maybe to_dataset_dict(prefer_streaming=True) with a configuration intake_esgf.conf.set(streaming_preference=['zarr','kerchunk','opendap']). I say prefer because some datasets in your index records may not have streaming access. Perhaps there is another option to set if you only ever want to stream and error if no streaming link is available.
  • Provide a cat.to_local_file_list() interface for those who need to download but either don't want to use xarray or need the local file locations for another reason. Our IPSL collaborators have no internet access on their HPC and so they must download and then use a special transfer method to get files where they need to be. This would let them use this and push data to their resources.
  • Provide a cat.to_link_list() for users like yourself, who don't like whatever default we embed in to_dataset_dict() and may just want the link to the kerchunk file (or whatever, I am still green with these technologies) so they can handle the call to xarray as they wish.
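
To make the streaming options above concrete, here is a minimal sketch of how one record's link might be chosen. It is only an illustration of the intended behavior: `pick_link`, `stream_only`, `NoStreamingLinkError`, and the record layout (a dict from access method to link, with `"https"` standing for plain download) are assumptions, not existing intake-esgf names.

```python
class NoStreamingLinkError(LookupError):
    """Raised in strict mode when a record offers no streaming access."""


def pick_link(record, streaming_preference=("zarr", "kerchunk", "opendap"),
              prefer_streaming=True, stream_only=False):
    """Return (method, link) for one record.

    `record` maps access-method names to links, e.g. {"opendap": ..., "https": ...}.
    """
    if prefer_streaming or stream_only:
        # Walk the preference list in order and take the first available method.
        for method in streaming_preference:
            if method in record:
                return method, record[method]
        if stream_only:
            raise NoStreamingLinkError(f"no streaming link among {sorted(record)}")
    # Fall back to download so that "prefer" never fails on download-only records.
    if "https" not in record:
        raise LookupError("record has neither a streaming nor a download link")
    return "https", record["https"]


print(pick_link({"opendap": "dap-link", "https": "file-link"}))
print(pick_link({"https": "file-link"}))
```

Returning the method alongside the link would also serve a to_link_list()-style interface, since the caller can then decide how to hand each link to xarray.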

That is the plan in my mind. I just need to get clear of some other things to work on this, and I need Max to help me reconcile what intake-esm is doing. If you have suggestions/comments, particularly on the interface, I would welcome them.

@nocollier
Member

It also strikes me that, while I see no compelling reason not to include all indices as options, ESGF project managers may push back because data hosted elsewhere is not quality controlled. Of course, even our own holdings have lots of issues, but I think they are more concerned with version updates. The messaging queue in the works will help with this. However, I would counter that by including these community-built indices as options, we give their maintainers a way to compare with other data in the ESGF index and stay better up to date. Just another concern to think about.

@jbusecke
Author

jbusecke commented May 2, 2024

This all sounds great! And for my part this is not immediately time sensitive! I would also be happy with providing the index as a plugin (and warning the user accordingly that this is not 'official' ESGF data).

In pseudocode:

# Proposed interface only: CSVIndex and custom_index are suggested names.
from intake_esgf.index import CSVIndex
from intake_esgf import ESGFCatalog

csv_index = CSVIndex('s3://path/to/csv_file', ...)
cat = ESGFCatalog(custom_index=csv_index, ...)
# this will always issue a warning when using non-official indices
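
The warning part could be as simple as the following sketch, which uses the standard `warnings` module. `Catalog` and `UnofficialIndexWarning` are hypothetical stand-ins made up for this example, not part of intake-esgf.

```python
import warnings


class UnofficialIndexWarning(UserWarning):
    """Signals that search results may come from a community-built index."""


class Catalog:
    """Hypothetical stand-in for a catalog that accepts an extra index."""

    def __init__(self, custom_index=None):
        self.indices = ["official"]
        if custom_index is not None:
            warnings.warn(
                f"Index {custom_index!r} is not an official ESGF index; "
                "its data is not quality controlled by ESGF.",
                UnofficialIndexWarning,
                stacklevel=2,
            )
            self.indices.append(custom_index)


# Capture the warning here only to show that it is raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cat = Catalog(custom_index="community-csv")

print(cat.indices)
print(caught[0].category.__name__)
```

A dedicated warning class would let users who know what they are doing silence it with `warnings.filterwarnings("ignore", category=UnofficialIndexWarning)`.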

Separately from that, it would be nice to get our data 'checked and approved' so that we can fix all the things necessary and then maybe become an official index, but that is a secondary goal IMO.

Please feel free to ping me if you find some cycles; I'd be happy to help with this if that would be useful.

@nocollier
Member

I just wanted to ping this to let you know we are alive, and still want to make this a priority. I have had a huge family problem that has delayed me, but we have a first step complete. In #68 we now allow to_dataset_dict(prefer_streaming=True). At the moment, since we are querying our Solr/Globus indices, this will return OPeNDAP links, but it lays the groundwork to accept other links as well. In my mind we would have a configuration variable streaming_priority=['VirtualiZarr','OPENDAP'] or something similar. Also note that we provide a to_dataset_path() which will return just the paths and works with the streaming option. This way if our constructors don't pass the options you need, you can do what you want with the links.

This is in main, but not a release yet. We just need to work on deciding how we want to ingest CSV/JSON style "indices". Stay tuned.
