Long score set search result optimization #524

Score set search has needed to be faster for some time, and some work has already been done to address it (e.g. #343). That work may prove useful, but the biggest problem with search speed is the handling of large result sets, particularly when a user visits the search page with no pre-populated search.

In the latter case, the search returns every published, non-superseded score set. The database query itself is reasonably fast here; it is the response size that causes problems. This could be handled either by (a) not running empty searches or (b) limiting the number of results. Option (a) will not help when a nonempty query still returns a large number of data sets, and it would also require changing the way search filter options are populated, since they are currently derived from the data sets in the current search result.

So I propose adopting option (b). This entails the following:

- Modify the score set search endpoint:
  - Accept a limit parameter. Default to a reasonable number (say, 100), and ignore limit parameters above this.
  - Instead of just returning a list of SavedScoreSet, return a new score set search response that includes the list together with the number of matching score sets.
  - Use limit+1 in the query, and if more than limit results are returned, run a second count query to obtain the total result count. The two queries' "where" clauses will be identical, but the count query does not need to eagerly fetch joined tables.
- Add a new score set search filters endpoint. It should accept the same search query as the search endpoint, and it should return lists of target names, types, organisms, and accessions, as well as publication authors, databases, and journals.
- Modify the search screen to query both endpoints.
  - Display the total number of results, with a suggestion to narrow the search if not all results can be shown.
  - Use the new filters endpoint to populate options in the filter boxes.
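
To make the limit+1 idea concrete, here is a minimal sketch using Python's standard-library `sqlite3` rather than the project's actual models and ORM. The table layout, the `MAX_LIMIT` constant, the `make_demo_db` helper, and the response keys `score_sets` and `num_score_sets` are all hypothetical names chosen for illustration, and treating an over-large limit as a fallback to the default is one possible reading of "ignore".

```python
import sqlite3

MAX_LIMIT = 100  # assumed cap, following the "say, 100" suggestion above


def make_demo_db(n):
    """Build an in-memory table of n published score sets (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE score_sets (urn TEXT, title TEXT, published INTEGER)")
    conn.executemany(
        "INSERT INTO score_sets VALUES (?, ?, 1)",
        [(f"urn:demo:{i:06d}", f"Score set {i}") for i in range(1, n + 1)],
    )
    return conn


def search_score_sets(conn, text="", limit=None):
    """Return up to `limit` matches together with the total number of matches."""
    # A missing, invalid, or over-large limit falls back to the cap.
    if limit is None or limit < 1 or limit > MAX_LIMIT:
        limit = MAX_LIMIT

    # Both queries share the same WHERE clause and parameters.
    where = "published = 1 AND title LIKE ?"
    params = (f"%{text}%",)

    # Ask for one extra row so we can tell whether the result was truncated.
    rows = conn.execute(
        f"SELECT urn, title FROM score_sets WHERE {where} ORDER BY urn LIMIT ?",
        params + (limit + 1,),
    ).fetchall()

    if len(rows) > limit:
        # Truncated: only now pay for the separate count query.
        total = conn.execute(
            f"SELECT COUNT(*) FROM score_sets WHERE {where}", params
        ).fetchone()[0]
        rows = rows[:limit]
    else:
        # Everything fit, so the row count is already the total.
        total = len(rows)

    return {
        "score_sets": [{"urn": urn, "title": title} for urn, title in rows],
        "num_score_sets": total,
    }
```

With this shape, the second query runs only when the result was actually truncated, so small searches cost a single round trip. The search screen could compare `num_score_sets` with the length of `score_sets` to decide whether to suggest narrowing the search.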

In order to be able to run count and filter option queries efficiently, a few other changes may be necessary:
- Replace ScoreSet.superseded_score_set_id with ScoreSet.superseded_by_score_set_id. This may not be necessary, since we produce a join by using `ScoreSet.superseding_score_set.is_(None)` in queries, and perhaps this is efficient enough.
- Stop fetching superseding score sets after running the main search query. That is, modify the main search query to only return non-superseded score sets.
- Do something to handle the case where a superseding score set is not yet published. In this case, the score set it will supersede should still be returned. It seems to me that a good option is to move superseded_score_set_id into a different property for unpublished data and only set it when the score set is published.
- Some changes may be needed in the way we search for published or unpublished score sets depending on whether `_search_score_sets` is called from `search_score_sets` or `search_my_score_sets`. The goal should be to continue to apply access restrictions in the query itself, instead of calling `has_permission` on the search results, as we do in other endpoints.
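
As a rough illustration of the last three points, the sketch below filters out superseded score sets inside the query itself, treats a score set as superseded only when its superseding score set is published, and applies the access restriction in the WHERE clause. It uses standard-library `sqlite3` with a made-up schema; the `owner_id` column, the `make_demo_db` and `search_urns` helpers, and the rule that the "my score sets" search shows only the caller's own score sets are assumptions, not the project's actual behavior.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory table with one published and one unpublished superseder."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE score_sets ("
        " id INTEGER PRIMARY KEY, urn TEXT, published INTEGER,"
        " owner_id INTEGER, superseded_score_set_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO score_sets VALUES (?, ?, ?, ?, ?)",
        [
            (1, "A", 1, 1, None),  # superseded by B, which is published
            (2, "B", 1, 1, 1),
            (3, "C", 1, 2, None),  # superseded by D, which is NOT yet published
            (4, "D", 0, 2, 3),
        ],
    )
    return conn


def search_urns(conn, owner_id=None):
    """Return URNs of non-superseded score sets visible to the caller."""
    if owner_id is None:
        # Public search: only published score sets are visible.
        access, params = "s.published = 1", {}
    else:
        # "My score sets" search: restrict to the caller's own score sets.
        access, params = "s.owner_id = :owner_id", {"owner_id": owner_id}

    # A score set counts as superseded only if a *published* score set points at it.
    sql = f"""
        SELECT s.urn
        FROM score_sets AS s
        WHERE {access}
          AND NOT EXISTS (
            SELECT 1 FROM score_sets AS n
            WHERE n.superseded_score_set_id = s.id AND n.published = 1
          )
        ORDER BY s.urn
    """
    return [row[0] for row in conn.execute(sql, params)]
```

Because the superseded check and the access restriction both live in the WHERE clause, the same clause could be reused unchanged by a count query or a filter options query, with no post-filtering of results.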
