Enable shallow and deep queries #53

nocollier · 2024-04-30T14:19:34Z

When looking for data in ESGF, a common workflow is to first search on a few facets and then use the unique column values to refine the search iteratively. This is currently very slow when the initial query returns many records. This is what we currently implement when you call search():

1. Query each index node with your search and build a pandas dataframe from the responses.
2. Merge this information into a single dataframe, removing older versions and populating lists of dataset_ids which contain location information.
3. Collect the unique values of each facet column (df[col].unique()) and return them as the __repr__ of the catalog.
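
As a rough illustration of the steps listed above, the merge and the unique-facet summary could look something like the sketch below. This is not the package's actual code: the column names (`version`, `id`), the `FACETS` list, and both function names are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical facet columns; the real catalog uses whatever the project defines.
FACETS = ["source_id", "experiment_id", "variable_id"]


def merge_responses(frames):
    """Combine per-index results, drop older versions, collect dataset ids."""
    df = pd.concat(frames, ignore_index=True)
    # Keep only rows carrying the newest version seen for each facet combination.
    newest = df.groupby(FACETS)["version"].transform("max")
    df = df[df["version"] == newest]
    # One row per facet combination, holding every dataset id that provides it.
    return df.groupby(FACETS + ["version"])["id"].agg(list).reset_index()


def unique_facets(df):
    """Unique values per facet column, i.e. what the catalog repr would show."""
    return {col: sorted(df[col].unique()) for col in FACETS}
```

Note that every record has to be downloaded before `unique_facets` can run, which is where the cost described in this issue comes from.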

The Solr indices take a long time to return the complete response, and even if Globus is faster, the full query consumes a lot of resources for information we don't really need in the early stages of a search.

Instead, search could perform what I will call a shallow query: we request 0 records but ask the index for the unique facet values that match the search. We use this response to build the unique facet columns by hand, and the underlying dataframe remains empty initially.
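
A minimal sketch of what a shallow query against a Solr-style index might involve, assuming the index accepts `limit` and `facets` parameters and returns Solr's usual `facet_counts.facet_fields` layout (a flat list alternating value, count). The helper names and the sample response are made up for illustration, and no request is actually sent here.

```python
from urllib.parse import urlencode


def shallow_query_params(facets, **search):
    """Parameters asking for zero records but facet counts for `facets`."""
    return {
        "type": "Dataset",
        "format": "application/solr+json",
        "limit": 0,  # no records, only facet information
        "facets": ",".join(facets),
        **search,
    }


def parse_facet_counts(response):
    """Turn Solr's flat [value, count, value, count, ...] lists into unique values."""
    fields = response["facet_counts"]["facet_fields"]
    return {
        name: sorted(v for v, n in zip(flat[0::2], flat[1::2]) if n > 0)
        for name, flat in fields.items()
    }


# Query string that would be appended to the index node's search endpoint.
query = urlencode(shallow_query_params(["source_id", "variable_id"], experiment_id="historical"))

# Made-up response in the shape Solr uses for facet counts.
sample = {
    "response": {"numFound": 1234, "docs": []},
    "facet_counts": {
        "facet_fields": {
            "source_id": ["CESM2", 10, "UKESM1-0-LL", 4],
            "variable_id": ["tas", 14, "pr", 0],
        }
    },
}
facet_values = parse_facet_counts(sample)
```

The result of `parse_facet_counts` is enough to fill in the catalog's repr without a single record having been transferred.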

When the user references cat.df (either directly or indirectly, by calling something that uses it such as to_dataset_dict()), we then pay the price of the full search, by which point they hopefully have a better idea of what they need.
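
One way to defer the full search is a lazily evaluated property. The sketch below is only an assumption about how this could be wired up: `LazyCatalog`, `shallow_search`, and `deep_search` are hypothetical names, and the two callables stand in for the real index queries.

```python
class LazyCatalog:
    """Shows facet summaries immediately; runs the full search on first use of .df."""

    def __init__(self, shallow_search, deep_search):
        self._deep_search = deep_search
        self.facets = shallow_search()  # cheap: unique facet values only
        self._df = None  # stays empty until someone asks for it

    @property
    def df(self):
        # Pay for the full (deep) query once, on first access, then reuse the result.
        if self._df is None:
            self._df = self._deep_search()
        return self._df

    def __repr__(self):
        # The repr relies only on the shallow result, so printing stays cheap.
        return "\n".join(f"{name}: {values}" for name, values in self.facets.items())
```

Anything that needs the records, such as to_dataset_dict(), would go through `df` and therefore trigger the deep query at most once.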


nocollier added the enhancement label (New feature or request) · Apr 30, 2024
