-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(PXP-5529): implement GET /objects endpoint #15
base: master
Are you sure you want to change the base?
Changes from 19 commits
2d7666a
fe95adf
da7f330
d639b2b
6a6093f
683ed5d
05c879c
bf3a6f6
06cc2f0
a8d24e5
f86f036
4e0191d
f48a015
e774189
a87bfaa
a75914d
71506e6
79e2c7d
80a12d5
879e64a
1b00631
9a350c4
7c90096
20f00ff
81aedb8
aec9b9e
167d39a
60b9be5
4ff094d
3e574f8
5d7deba
f688cd3
df595e8
e35d9bf
d313286
e82e87c
23cb565
8f91573
7ed158e
8636930
82cb87e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -24,7 +24,7 @@ | |||||
|
||||||
from . import config, logger | ||||||
from .models import Metadata | ||||||
from .query import get_metadata, search_metadata_helper | ||||||
from .query import get_metadata, search_metadata_objects | ||||||
|
||||||
mod = APIRouter() | ||||||
|
||||||
|
@@ -231,64 +231,157 @@ async def create_object_for_id( | |||||
@mod.get("/objects") | ||||||
async def get_objects( | ||||||
request: Request, | ||||||
# XXX would be nice to be able to specify whether to return indexd records | ||||||
# (e.g GET objects?data=metadata would only return mds objects) | ||||||
data: bool = Query( | ||||||
False, | ||||||
description="Switch to returning a list of GUIDs (false), " | ||||||
"or GUIDs mapping to their metadata (true).", | ||||||
True, | ||||||
description="Switch to return a list of GUIDs (false), " | ||||||
"or metadata objects (true).", | ||||||
), | ||||||
page: int = Query( | ||||||
0, | ||||||
description="The offset for what objects are returned " | ||||||
"(zero-indexed). The exact offset will be equal to " | ||||||
"page*limit (e.g. with page=1, limit=15, 15 objects " | ||||||
"beginning at index 15 will be returned).", | ||||||
), | ||||||
limit: int = Query( | ||||||
johnfrancismccann marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
10, description="Maximum number of records returned. (max: 2000)" | ||||||
10, | ||||||
description="Maximum number of objects returned (max: 1024). " | ||||||
"Also used with page to determine page size.", | ||||||
), | ||||||
filter: str = Query( | ||||||
"", | ||||||
description="The filter(s) that will be applied to the " | ||||||
"result (more detail in the docstring).", | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
this is super nitpicky but "docstring" doesn't make sense in the swagger doc |
||||||
), | ||||||
offset: int = Query(0, description="Return results at this given offset."), | ||||||
# XXX description | ||||||
# XXX how to name this python variable something other than filter but | ||||||
# still have client use "filter" as URL query param? (bc filter is already | ||||||
# built in to Python) | ||||||
filter: str = Query("", description="Filters to apply."), | ||||||
) -> JSONResponse: | ||||||
""" | ||||||
XXX comments | ||||||
Returns a list of objects and their corresponding Indexd records (please | ||||||
see URL query documentation for more info on which objects get returned). | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. when i'm reading the swagger doc for this endpoint, where is the "URL query documentation"? |
||||||
|
||||||
The filtering functionality was primarily driven by the requirement that a | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the way this doc is written is aimed at a dev reading the code, not at a user reading the swagger API docs. i'm referring to things like:
This information is valuable but IMO ideally docstrings would be written for API users (describe how to use the filter), since the docs are generated from docstrings automatically, and technical details would be somewhere else. Not sure where, maybe at the top of the file? |
||||||
user be able to get all objects having an authz resource matching a | ||||||
user-supplied pattern at any index in the "_resource_paths" array. | ||||||
|
||||||
For example, given the following metadata objects: | ||||||
|
||||||
{ | ||||||
"0": { | ||||||
"message": "hello", | ||||||
"_uploader_id": "100", | ||||||
"_resource_paths": [ | ||||||
"/programs/a", | ||||||
"/programs/b" | ||||||
], | ||||||
"pet": "dog", | ||||||
"pet_age": 1 | ||||||
}, | ||||||
"1": { | ||||||
"message": "greetings", | ||||||
"_uploader_id": "101", | ||||||
"_resource_paths": [ | ||||||
"/open", | ||||||
"/programs/c/projects/a" | ||||||
], | ||||||
"pet": "ferret", | ||||||
"pet_age": 5, | ||||||
"sport": "soccer" | ||||||
}, | ||||||
"2": { | ||||||
"message": "morning", | ||||||
"_uploader_id": "102", | ||||||
"_resource_paths": [ | ||||||
"/programs/d", | ||||||
"/programs/e" | ||||||
], | ||||||
"counts": [42, 42, 42], | ||||||
"pet": "ferret", | ||||||
"pet_age": 10, | ||||||
"sport": "soccer" | ||||||
}, | ||||||
"3": { | ||||||
"message": "evening", | ||||||
"_uploader_id": "103", | ||||||
"_resource_paths": [ | ||||||
"/programs/f/projects/a", | ||||||
"/admin" | ||||||
], | ||||||
"counts": [1, 3, 5], | ||||||
"pet": "ferret", | ||||||
"pet_age": 15, | ||||||
"sport": "basketball" | ||||||
} | ||||||
} | ||||||
|
||||||
how do we design a filtering interface that allows the user to get all | ||||||
objects having an authz string matching the pattern | ||||||
"/programs/%/projects/%" at any index in its "_resource_paths" array? (% | ||||||
has been used as the wildcard so far because that's what Postgres uses as | ||||||
the wildcard for LIKE) In this case, the "1" and "3" objects should be | ||||||
returned. | ||||||
|
||||||
The filter syntax that was arrived at ending up following the syntax | ||||||
specified by a [Node JS implementation](https://www.npmjs.com/package/json-api#filtering) of the [JSON:API | ||||||
specification](https://jsonapi.org/). | ||||||
|
||||||
The format for this syntax is filter=(field_name,operator,value), in which | ||||||
the field_name is a json key without quotes, operator is one of :eq, :ne, | ||||||
:gt, :gte, :lt, :lte, :like, :all, :any (see operators dict), and value is | ||||||
a typed json value against which the operator is run. | ||||||
|
||||||
Examples: | ||||||
|
||||||
GET /objects?filter=(message,:eq,"morning") returns "2" | ||||||
williamhaley marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
GET /objects?filter=(counts.1,:eq,3) returns "3" | ||||||
GET /objects?filter=(pet_age,:lte,5) returns "0" and "1" | ||||||
GET /objects?filter=(pet_age,:gt,5) returns "2" and "3" | ||||||
|
||||||
Compound expressions are supported: | ||||||
|
||||||
GET /objects?filter=(_resource_paths,:any,(,:like,"/programs/%/projects/%")) returns "1" and "3" | ||||||
GET /objects?filter=(counts,:all,(,:eq,42)) returns "2" | ||||||
|
||||||
Boolean expressions are also supported: | ||||||
|
||||||
GET /objects?filter=(or,(_uploader_id,:eq,"101"),(_uploader_id,:eq,"102")) returns "1" and "2" | ||||||
GET /objects?filter=(or,(and,(pet,:eq,"ferret"),(sport,:eq,"soccer")),(message,:eq,"hello")) returns "0", "1", and "2" | ||||||
""" | ||||||
|
||||||
metadata_objects = await search_metadata_helper( | ||||||
data=data, limit=limit, offset=offset, filter=filter | ||||||
metadata_objects = await search_metadata_objects( | ||||||
data=data, page=page, limit=limit, filter=filter | ||||||
) | ||||||
|
||||||
records = {} | ||||||
if metadata_objects: | ||||||
if data and metadata_objects: | ||||||
try: | ||||||
endpoint_path = "/bulk/documents" | ||||||
full_endpoint = config.INDEXING_SERVICE_ENDPOINT.rstrip("/") + endpoint_path | ||||||
guids = ( | ||||||
list(metadata_objects.keys()) | ||||||
if hasattr(metadata_objects, "keys") | ||||||
else metadata_objects | ||||||
full_indexd_url = ( | ||||||
config.INDEXING_SERVICE_ENDPOINT.rstrip("/") + endpoint_path | ||||||
) | ||||||
# XXX /bulk/documents endpoint in indexd currently doesn't support | ||||||
# filters | ||||||
response = await request.app.async_client.post(full_endpoint, json=guids) | ||||||
guids = list(guid for guid, _ in metadata_objects) | ||||||
|
||||||
response = await request.app.async_client.post(full_indexd_url, json=guids) | ||||||
response.raise_for_status() | ||||||
records = {r["did"]: r for r in response.json()} | ||||||
except httpx.HTTPError as err: | ||||||
logger.debug(err, exc_info=True) | ||||||
if err.response: | ||||||
logger.error( | ||||||
"indexd `POST %s` endpoint returned a %s HTTP status code", | ||||||
endpoint_path, | ||||||
full_indexd_url, | ||||||
err.response.status_code, | ||||||
) | ||||||
else: | ||||||
logger.error( | ||||||
"Unable to get a response from indexd `POST %s` endpoint", | ||||||
endpoint_path, | ||||||
full_indexd_url, | ||||||
) | ||||||
|
||||||
if type(metadata_objects) is dict: | ||||||
if data: | ||||||
response = { | ||||||
guid: {"record": records[guid] if guid in records else {}, "metadata": o} | ||||||
for guid, o in metadata_objects.items() | ||||||
"items": [ | ||||||
{"record": records[guid] if guid in records else {}, "metadata": o} | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
for guid, o in metadata_objects | ||||||
] | ||||||
} | ||||||
else: | ||||||
response = metadata_objects | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it from the design doc linked in PXP-5529, this is sort of moving us towards the idea of rebranding MDS as a generalized "object management" service, is that right? (Just want to make sure I'm getting the context)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that’s right. For the
GET /objects
endpoint in particular, note that Indexd records corresponding with metadata objects are returned in the response. I think the idea with the Object Management Service is to bring together various components of Gen3 (MDS, Fence, Indexd, SSJDispatcher, Indexs3client) to support a submission flow in which PFB files can be uploaded to the data lake (which consists of a single s3 bucket at the moment) by running a single gen3-client command.During submission, both a Metadata object and an Indexd record pair are created and populated with info by an Indexs3client job, the Indexd record with calculated
hashes
,size
, andurls
info and the metadata object with_bucket
,_filename
,_file_extension
, and_upload_status
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. That makes a lot of sense to me. Thanks!