add aggregation commands to queries:run endpoint #447

Closed

PeopleMakeCulture opened this issue Jan 22, 2024 · 5 comments
Labels: unplanned-task (not accounted for during sprint planning, but time sensitive)

Comments

PeopleMakeCulture (Collaborator) commented Jan 22, 2024

Split from: microbiomedata/issues#496

From @aclum:

Allowing aggregation commands, either through queries:run or a new endpoint, would be very helpful towards this milestone, and as an interim solution until we can have graph-based endpoints to do the traversing for users.

As things stand now, you have to 1) download collections, 2) figure out the next field you want to query, and then 3) run another API query.

This came up when discussing how to get some of the annotation files and the proteomics raw data that come from a matched biosample, the combination of which is required as input to the proteomics pipeline.

A typical starting identifier would be a biosample_set id, an omics_processing_set id where omics_type.has_raw_value=Proteomics, or a metaproteomics_analysis_activity_set id. Michal wrote a bit of Python code to do this, but it would be nice to be able to do it in a single API query.
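
For illustration, such a traversal could collapse into a single aggregation pipeline. The sketch below is hypothetical: the $lookup join fields (has_input referencing biosample ids) and the local mongo connection are assumptions, not confirmed schema details.

from pymongo import MongoClient

mdb = MongoClient()["nmdc"]  # assumes a local mongod holding the nmdc database

# Start from proteomics omics_processing records and pull in matched biosamples.
pipeline = [
    {"$match": {"omics_type.has_raw_value": "Proteomics"}},
    {"$lookup": {
        "from": "biosample_set",
        "localField": "has_input",    # assumed: references biosample ids
        "foreignField": "id",
        "as": "matched_biosamples",
    }},
]
rv = mdb.command({"aggregate": "omics_processing_set", "pipeline": pipeline, "cursor": {"batchSize": 10}})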

From @dwinston:

There is some silliness re: mongo find command cursors not being valid for the mongo getMore command, but for some reason aggregate command cursors work fine. See below. So, I think we can allow aggregation commands with proper paging via POST /queries:run.

# mdb: a pymongo Database handle, e.g. MongoClient()["nmdc"]
rv = mdb.command({"aggregate": "biosample_set", "pipeline": [{"$match": {}}], "cursor": {"batchSize": 10}})
rv = mdb.command({"getMore": rv["cursor"]["id"], "collection": "biosample_set", "batchSize": 10})
# etc., until cursor id is `0`.
PeopleMakeCulture self-assigned this and added the unplanned-task label Jan 22, 2024
PeopleMakeCulture (Collaborator, Author) commented
@aclum We have started developing this endpoint and can currently return the results of a single aggregate command, i.e., up to 16MB of results (the BSON document size limit). Would it be helpful for you to have access to this interim queries:run endpoint now, as we build out the paging functionality?
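
For reference, calling the interim endpoint could look like the sketch below. The base URL, auth header, and exact request schema are assumptions here; the body mirrors the raw mongo aggregate command shown earlier, and responses at this stage are limited to a single command's results.

import requests

# Sketch only: base URL, auth, and request schema are assumptions.
resp = requests.post(
    "https://api.microbiomedata.org/queries:run",  # assumed base URL
    headers={"Authorization": "Bearer <token>"},
    json={"aggregate": "biosample_set", "pipeline": [{"$match": {}}]},
)
print(resp.json())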

dwinston (Collaborator) commented Jan 24, 2024

So, it turns out that cursors for aggregate commands don't persist either -- my best guess at why it worked for me via pymongo in a Python shell session is that pymongo still starts an implicit session for commands, even though I thought that had been discontinued.

The approach I think we'll take for this now is (see the sketch after this list):

  1. Append an $out stage to the user-supplied aggregation pipeline, to send results to a temporary mongo collection.
  2. Call nmdc_runtime.api.endpoints.util.find_resources to reuse the custom cursor functionality currently serving the find endpoints, so that one can retrieve all aggregation results even if they exceed 16MB (the MongoDB BSON document size limit).
  3. Ensure the temporary collection is cleaned up (e.g., via a Dagster schedule).
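
A minimal sketch of that approach, assuming a hypothetical naming scheme for the temporary collection; the actual find_resources wiring and the Dagster cleanup schedule are elided:

import time

from pymongo import MongoClient

mdb = MongoClient()["nmdc"]  # assumes a local mongod holding the nmdc database

def run_paged_aggregation(collection: str, pipeline: list) -> str:
    """Stage aggregation results in a temporary collection for paged retrieval."""
    tmp = f"_tmp_agg_{int(time.time())}"  # hypothetical naming scheme
    # (1) Append $out so results land in the temporary collection.
    mdb.command({"aggregate": collection, "pipeline": pipeline + [{"$out": tmp}], "cursor": {}})
    # (2) find_resources would then serve paged results from `tmp`.
    # (3) A scheduled job would later drop stale `_tmp_agg_*` collections.
    return tmp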

aclum (Contributor) commented Jan 24, 2024

Yes, that would be useful.

PeopleMakeCulture (Collaborator, Author) commented Jan 29, 2024

New ticket to extend aggregate queries:run with paging: #460

dwinston added a commit that referenced this issue Feb 2, 2024
* create query models for aggregate command and response

* fix typo

* add usage example for aggregate query run

* 437 co-locate API usage documentation on SwaggerUI  (#455)

* starter

* style

for #437

* add api docs for find

* update docs

* add schema metadata endpoints docs

* add docs for metadata endpoints

* add formatting for API endpoints

* remove copy pasta

---------

Co-authored-by: Donny Winston <[email protected]>

* feat: retain docker-build.sh without committing

* feat: get plaintext orcid jwt via cookie (#458)

closes #457

* feat: aggregate cmd via POST /queries:run

closes #447

---------

Co-authored-by: Donny Winston <[email protected]>