SOLR search index results not filtering as expected #253

Closed
mwengren opened this issue Apr 8, 2024 · 3 comments

@mwengren
Member

mwengren commented Apr 8, 2024

Separating this issue out from #252

We're seeing poor filtering results from the Solr index.

If I search for an individual GCOOS dataset id (see this search for 'Data for ioos-station-wmo-42400'), essentially the full list of datasets is returned (~76,068 total datasets). The results do appear to be sorted by relevance (most relevant results at top), but the search does essentially nothing to narrow the number of results.

Testing today on the simple search string: 'M01':

Without quotes: M01 yields ~60,590 results: https://data.ioos.us/dataset/?q=M01&sort=score+desc%2C+metadata_modified+desc&ext_timerange_start=&ext_timerange_end=&ext_min_depth=&ext_max_depth=&ext_bbox=

With quotes: "M01" yields: 33 results: https://data.ioos.us/dataset/?q=%22M01%22&sort=score+desc%2C+metadata_modified+desc&ext_timerange_start=&ext_timerange_end=&ext_min_depth=&ext_max_depth=&ext_bbox=

Another example: searching for osu592-20230524T1813-delayed with org=Glider DAC returns 6877 datasets without quotes and 2 datasets with quotes.
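
For anyone reproducing the comparison, here is a minimal Python sketch that builds the unquoted and quoted search URLs. The helper name `search_url` is hypothetical, and the empty `ext_*` parameters from the links above are left out on the assumption that they do not affect the result.

```python
from urllib.parse import urlencode

# Catalog search endpoint, taken from the links above.
BASE_URL = "https://data.ioos.us/dataset/"


def search_url(term: str, phrase: bool = False) -> str:
    """Build a catalog search URL (hypothetical helper).

    With phrase=True the term is wrapped in double quotes,
    which urlencode percent-encodes as %22.
    """
    q = f'"{term}"' if phrase else term
    params = {"q": q, "sort": "score desc, metadata_modified desc"}
    return f"{BASE_URL}?{urlencode(params)}"


print(search_url("M01"))               # unquoted search
print(search_url("M01", phrase=True))  # quoted (phrase) search
```

The only difference between the two URLs is the `%22` pair around the term in the `q` parameter.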

Results are more reasonable for other simple phrase searches like 'Mote' or 'NERACOOS':

Search for 'NERACOOS' ~397 results: https://data.ioos.us/dataset/?q=NERACOOS&sort=score+desc%2C+metadata_modified+desc&ext_timerange_start=&ext_timerange_end=&ext_min_depth=&ext_max_depth=&ext_bbox=

@mwengren
Member Author

mwengren commented Apr 8, 2024

@benjwadams says it might be a query syntax issue or a proximity (like phrases) issue with how Solr is configured.

@benjwadams
Contributor

It's very likely caused by how the free-text search is configured in the stock CKAN schema.

If you remove the "T" from the ISO 8601 date strings in the search, you will get much more reasonable results. Letters adjacent to numbers appear to be getting tokenized separately. Quoting the search term also works, but this may not be immediately obvious to users.
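
To illustrate the suspected behavior, here is a rough Python sketch. It is not the actual Solr analyzer: it assumes tokens are split on non-alphanumeric characters and on letter/digit boundaries, and that an unquoted query matches any dataset sharing at least one token. The function names and the three sample dataset names in `DOCS` are made up for the example.

```python
import re


def split_tokens(text):
    """Split on non-alphanumerics and letter/digit boundaries, then lowercase.

    Assumption: this only approximates the analyzer behavior described above.
    """
    return [t.lower() for t in re.findall(r"[A-Za-z]+|[0-9]+", text)]


def matches_any(query, doc):
    """Unquoted search (assumed): match if any query token appears in the doc."""
    return bool(set(split_tokens(query)) & set(split_tokens(doc)))


def matches_phrase(query, doc):
    """Quoted search (assumed): query tokens must appear contiguously, in order."""
    q, d = split_tokens(query), split_tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Hypothetical dataset names, for illustration only.
DOCS = [
    "osu592-20230524T1813-delayed",
    "ru29-20230101T0000-delayed",
    "NERACOOS buoy M01",
]

query = "osu592-20230524T1813-delayed"
print(split_tokens(query))
# ['osu', '592', '20230524', 't', '1813', 'delayed']

loose = [d for d in DOCS if matches_any(query, d)]
strict = [d for d in DOCS if matches_phrase(query, d)]
print(len(loose), len(strict))  # 2 1
```

Under these assumptions the lone "t" and the common word "delayed" are enough to match an unrelated glider deployment, which would explain why unquoted glider names return so many results.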

@benjwadams
Contributor

Addressed in ioos/catalog-docker-base@61ad2bc. Issues with glider names returning exorbitant numbers of search results have been fixed.
