add aggregation commands to queries:run endpoint #447

Closed

PeopleMakeCulture opened this issue Jan 22, 2024 · 5 comments
Labels: unplanned-task (not accounted for during sprint planning, but time sensitive)

Comments

PeopleMakeCulture (Collaborator) commented Jan 22, 2024

Split from: microbiomedata/issues#496

From @aclum:

Allowing aggregation commands, either through queries:run or a new endpoint, would be very helpful towards this milestone, and as an interim solution until we can have graph-based endpoints to do the traversing for users.

As things stand now, you have to 1) download collections, 2) figure out the next field you want to query, and then 3) run another API query.

This came up when discussing how to get some of the annotation files and the proteomics raw data that come from a matched biosample, the combination of which is required as input to the proteomics pipeline.

A typical starting identifier would be a biosample_set id, an omics_processing_set id where omics_type.has_raw_value=Proteomics, or a metaproteomics_analysis_activity_set id. Michal wrote a bit of Python code to do this, but it would be nice to be able to do it in a single API query.
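
For illustration, such a traversal could collapse into a single aggregation pipeline. The sketch below is hypothetical: the $lookup join fields (has_input referencing biosample ids) and the local mongo connection are assumptions, not confirmed schema details.

from pymongo import MongoClient

mdb = MongoClient()["nmdc"]  # assumes a local mongod holding the nmdc database

# Start from proteomics omics_processing records and pull in matched biosamples.
pipeline = [
    {"$match": {"omics_type.has_raw_value": "Proteomics"}},
    {"$lookup": {
        "from": "biosample_set",
        "localField": "has_input",    # assumed: references biosample ids
        "foreignField": "id",
        "as": "matched_biosamples",
    }},
]
rv = mdb.command({"aggregate": "omics_processing_set", "pipeline": pipeline, "cursor": {"batchSize": 10}})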

From @dwinston:

There is some silliness re: mongo find command cursors not being valid for the mongo getMore command, but for some reason aggregate command cursors work fine. See below. So, I think we can allow aggregation commands with proper paging via POST /queries:run.

# mdb: a pymongo Database handle, e.g. MongoClient()["nmdc"]
rv = mdb.command({"aggregate": "biosample_set", "pipeline": [{"$match": {}}], "cursor": {"batchSize": 10}})
rv = mdb.command({"getMore": rv["cursor"]["id"], "collection": "biosample_set", "batchSize": 10})
# etc., until cursor id is `0`.
PeopleMakeCulture self-assigned this and added the unplanned-task label Jan 22, 2024
PeopleMakeCulture (Collaborator, Author) commented
@aclum We have started developing this endpoint and can currently return the results of a single aggregate command, i.e., up to 16MB of results (the BSON document size limit). Would it be helpful for you to have access to this interim queries:run endpoint now, as we build out the paging functionality?
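
For reference, calling the interim endpoint could look like the sketch below. The base URL, auth header, and exact request schema are assumptions here; the body mirrors the raw mongo aggregate command shown earlier, and responses at this stage are limited to a single command's results.

import requests

# Sketch only: base URL, auth, and request schema are assumptions.
resp = requests.post(
    "https://api.microbiomedata.org/queries:run",  # assumed base URL
    headers={"Authorization": "Bearer <token>"},
    json={"aggregate": "biosample_set", "pipeline": [{"$match": {}}]},
)
print(resp.json())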

dwinston (Collaborator) commented Jan 24, 2024

So, it turns out that cursors for aggregate commands don't persist either -- my best guess at why it worked for me via pymongo in a Python shell session is that pymongo still starts an implicit session for commands, even though I thought that had been discontinued.

The approach I think we'll take for this now is (see the sketch after this list):

  1. Append an $out stage to the user-supplied aggregation pipeline, to send results to a temporary mongo collection.
  2. Call nmdc_runtime.api.endpoints.util.find_resources to reuse the custom cursor functionality currently serving the find endpoints, so that one can retrieve all aggregation results even if they exceed 16MB (the MongoDB BSON document size limit).
  3. Ensure the temporary collection is cleaned up (e.g., via a Dagster schedule).
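
A minimal sketch of that approach, assuming a hypothetical naming scheme for the temporary collection; the actual find_resources wiring and the Dagster cleanup schedule are elided:

import time

from pymongo import MongoClient

mdb = MongoClient()["nmdc"]  # assumes a local mongod holding the nmdc database

def run_paged_aggregation(collection: str, pipeline: list) -> str:
    """Stage aggregation results in a temporary collection for paged retrieval."""
    tmp = f"_tmp_agg_{int(time.time())}"  # hypothetical naming scheme
    # (1) Append $out so results land in the temporary collection.
    mdb.command({"aggregate": collection, "pipeline": pipeline + [{"$out": tmp}], "cursor": {}})
    # (2) find_resources would then serve paged results from `tmp`.
    # (3) A scheduled job would later drop stale `_tmp_agg_*` collections.
    return tmp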

aclum (Contributor) commented Jan 24, 2024

Yes, that would be useful.

PeopleMakeCulture (Collaborator, Author) commented Jan 29, 2024

New ticket to extend aggregate queries:run with paging: #460

dwinston added a commit that referenced this issue Feb 2, 2024
* create query models for aggregate command and response

* fix typo

* add usage example for aggregate query run

* 437 co-locate API usage documentation on SwaggerUI  (#455)

* starter

* style

for #437

* add api docs for find

* update docs

* add schema metadata endpoints docs

* add docs for metadata endpoints

* add formatting for API endpoints

* remove copy pasta

---------

Co-authored-by: Donny Winston <[email protected]>

* feat: retain docker-build.sh without committing

* feat: get plaintext orcid jwt via cookie (#458)

closes #457

* feat: aggregate cmd via POST /queries:run

closes #447

---------

Co-authored-by: Donny Winston <[email protected]>