Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PXP-5529): implement GET /objects endpoint #15

Open
wants to merge 41 commits into
base: master
Choose a base branch
from

Conversation

johnfrancismccann
Copy link
Contributor

@johnfrancismccann johnfrancismccann commented Dec 16, 2020

Jira Ticket: PXP-5529

Implement GET /objects endpoint with filtering. Not ready to be merged yet. Please see get_objects docstring for more information.

The PR was becoming a bit long, so I was hoping to get feedback on the filtering interface before adding tests.

New Features

  • Implement GET /objects endpoint with filtering

Breaking Changes

Bug Fixes

Improvements

Dependency updates

  • use parsimonious = "0.8.1"

Deployment changes

src/mds/objects.py Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
src/mds/objects.py Outdated Show resolved Hide resolved
src/mds/query.py Outdated
Comment on lines 187 to 188
- GET /objects?filter=(message,:eq,"morning") returns "2"
- GET /objects?filter=(counts.1,:eq,3) returns "3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the design doc doesn't use a filter param, and the syntax is more simple - we might want to accept {record_or_metadata}.{arbitrary_key}={value} as param like the design doc describes, and build the filter object based on that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request still maintains the {arbitrary_key}={value} param syntax for the GET /metadata endpoint. I also don't think it would be too bad to support that syntax for the GET /objects endpoint in addition to the filter=(counts.1,:eq,3) style.

src/mds/query.py Outdated Show resolved Hide resolved
src/mds/query.py Outdated
# operator
# (e.g. GET objects?filter=(_resource_paths,:any,(,:like,"/programs/%")) )
return operators[operator_name]["sql_clause"](
json_object, other["op"], other["val"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the filter stays user-provided, we might want to add a syntax validation step somewhere

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like parsimonious does that for us

src/mds/objects.py Outdated Show resolved Hide resolved
src/mds/query.py Outdated Show resolved Hide resolved
src/mds/objects.py Outdated Show resolved Hide resolved
Copy link
Contributor

@paulineribeyre paulineribeyre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops, didn't mean to approve

@themarcelor
Copy link
Contributor

erm... just double-checking.
But is that backwards compatible with the existing /mds/metadata endpoint?
Just double-check due to the the current logic in: https://github.com/uc-cdis/gen3-qa/blob/master/suites/apis/metadataIngestionTest.js

@johnfrancismccann
Copy link
Contributor Author

Hey @themarcelor this is backwards-compatible with the existing /mds/metadata endpoint in the sense that it leaves that endpoint in place and doesn't modify it's behavior at all. However, the new /mds/objects endpoint has it's own filtering interface that is completely different to that used by /mds/metadata (i.e. using the same filter parameters for both endpoints won't work).

request: Request,
data: bool = Query(
True,
description="Switch to returning a list of GUIDs (false), "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description="Switch to returning a list of GUIDs (false), "
description="Switch to return a list of GUIDs (false), "

At least, I think that's right? 😅 🔤

src/mds/objects.py Outdated Show resolved Hide resolved
@@ -226,6 +226,161 @@ async def create_object_for_id(
return JSONResponse(response, HTTP_201_CREATED)


@mod.get("/objects")
async def get_objects(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it from the design doc linked in PXP-5529, this is sort of moving us towards the idea of rebranding MDS as a generalized "object management" service, is that right? (Just want to make sure I'm getting the context)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that’s right. For the GET /objects endpoint in particular, note that Indexd records corresponding with metadata objects are returned in the response. I think the idea with the Object Management Service is to bring together various components of Gen3 (MDS, Fence, Indexd, SSJDispatcher, Indexs3client) to support a submission flow in which PFB files can be uploaded to the data lake (which consists of a single s3 bucket at the moment) by running a single gen3-client command.

During submission, both a Metadata object and an Indexd record pair are created and populated with info by an Indexs3client job, the Indexd record with calculated hashes, size, and urls info and the metadata object with _bucket, _filename, _file_extension, and _upload_status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. That makes a lot of sense to me. Thanks!

src/mds/objects.py Outdated Show resolved Hide resolved
src/mds/query.py Outdated

# json_value and below was taken from
# https://github.com/erikrose/parsimonious/pull/23/files
grammar = Grammar(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think parsing the grammar might be expensive. Can we do this once on app init and maintain a singleton grammar somewhere? Then just call grammar.parse() here?

Copy link
Contributor

@williamhaley williamhaley Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should say compiling above rather than parsing. I think the Grammar constructor would be re-compiling the definition every time we call parse_filter https://github.com/erikrose/parsimonious/blob/master/parsimonious/grammar.py#L68

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice catch!

williamhaley
williamhaley previously approved these changes Apr 2, 2021
Copy link
Contributor

@williamhaley williamhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! 👍

I think socializing/teaching the query/filtering language to others will be important to make sure people know it exists and we can use that consistently in other places where needed in the future.

filter: str = Query(
"",
description="The filter(s) that will be applied to the "
"result (more detail in the docstring).",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"result (more detail in the docstring).",
"result (more detail in the endpoint description).",

this is super nitpicky but "docstring" doesn't make sense in the swagger doc

) -> JSONResponse:
"""
XXX comments
Returns a list of objects and their corresponding Indexd records (please
see URL query documentation for more info on which objects get returned).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when i'm reading the swagger doc for this endpoint, where is the "URL query documentation"?

Returns a list of objects and their corresponding Indexd records (please
see URL query documentation for more info on which objects get returned).

The filtering functionality was primarily driven by the requirement that a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way this doc is written is aimed at a dev reading the code, not at a user reading the swagger API docs. i'm referring to things like:

  • "primarily driven by the requirement that a user be able"
  • "how do we design a filtering interface that allows the user to"
  • "that's what Postgres uses"

This information is valuable but IMO ideally docstrings would be written for API users (describe how to use the filter), since the docs are generated from docstrings automatically, and technical details would be somewhere else. Not sure where, maybe at the top of the file?

guid: {"record": records[guid] if guid in records else {}, "metadata": o}
for guid, o in metadata_objects.items()
"items": [
{"record": records[guid] if guid in records else {}, "metadata": o}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"record": records[guid] if guid in records else {}, "metadata": o}
{"record": records.get(guid, {}), "metadata": o}

src/mds/query.py Outdated
Comment on lines 348 to 350
except (IncompleteParseError, ParseError, VisitationError):
raise HTTPException(
HTTP_400_BAD_REQUEST, f"filter URL query param syntax is invalid"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can provide more details to the user? do the IncompleteParseError, ParseError, VisitationError return useful error messages?
if we don't want to return them to the user for some reason, i think we should at least log them to help us debug (maybe just traceback.print_exc())

assert len(resp_json["items"]) == 1
assert resp_json["items"][0]["metadata"]["message"] == "morning"
finally:
tear_down_metadata_objects(client)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you could use a fixture to avoid adding a try/finally to each test?
maybe something like this, without "autouse"?

@github-actions
Copy link

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants