(PXP-5529): implement GET /objects endpoint #15

Open
wants to merge 41 commits into base: master
Changes from 32 commits
Commits
2d7666a
feat(endpoint): add GET /objects
johnfrancismccann Nov 19, 2020
fe95adf
feat(GET /objects): use search_metadata_helper
johnfrancismccann Dec 3, 2020
da7f330
feat(GET /objects): filter resource paths LIKE val
johnfrancismccann Dec 8, 2020
d639b2b
feat(GET /objects): map filter operators w/ dict
johnfrancismccann Dec 10, 2020
6a6093f
feat(GET /objects): filter all of array where op
johnfrancismccann Dec 12, 2020
683ed5d
feat(GET /objects): POST /bulk/documents in indexd
johnfrancismccann Dec 13, 2020
05c879c
feat(GET /objects): filter w/ () syntax (x,:eq,42)
johnfrancismccann Dec 14, 2020
bf3a6f6
feat(GET /objects): parse filter with parsimonious
johnfrancismccann Dec 15, 2020
06cc2f0
feat(GET /objects): DRY up sqlalchemy clauses
johnfrancismccann Dec 15, 2020
a8d24e5
feat(GET /objects): remove unnecessary comments
johnfrancismccann Dec 15, 2020
f86f036
feat(GET /objects): support boolean SQL clauses
johnfrancismccann Dec 16, 2020
4e0191d
feat(GET /objects): use filter_dict
johnfrancismccann Dec 16, 2020
f48a015
feat(GET /objects): account for empty filter param
johnfrancismccann Dec 16, 2020
e774189
feat(GET /objects): remove commented code
johnfrancismccann Dec 16, 2020
a87bfaa
feat(GET /objects): document examples
johnfrancismccann Dec 16, 2020
a75914d
Apply automatic documentation changes
johnfrancismccann Dec 16, 2020
71506e6
feat(GET /objects): use data param
johnfrancismccann Mar 8, 2021
79e2c7d
feat(GET /objects): return items list in response
johnfrancismccann Mar 9, 2021
80a12d5
feat(GET /objects): restore search_metadata
johnfrancismccann Mar 9, 2021
879e64a
feat(GET /objects): clean up parsing grammar
johnfrancismccann Mar 11, 2021
1b00631
test(GET /objects): add filter tests
johnfrancismccann Mar 12, 2021
9a350c4
test(GET /objects): use resp_json variable
johnfrancismccann Mar 14, 2021
7c90096
docs(GET /objects): add function docstrings
johnfrancismccann Mar 15, 2021
20f00ff
Apply automatic documentation changes
johnfrancismccann Mar 15, 2021
81aedb8
feat(GET /objects): use advanced_search_metadata
johnfrancismccann Mar 16, 2021
aec9b9e
Apply automatic documentation changes
johnfrancismccann Mar 16, 2021
167d39a
feat(GET /objects): change limit from 2000 to 1024
johnfrancismccann Mar 16, 2021
60b9be5
Apply automatic documentation changes
johnfrancismccann Mar 16, 2021
4ff094d
docs(GET /objects): add docstrings to tests
johnfrancismccann Mar 16, 2021
3e574f8
test(GET /objects): add :lte and :gt tests
johnfrancismccann Mar 16, 2021
5d7deba
feat(GET /objects): use search_metadata_objects
johnfrancismccann Mar 17, 2021
f688cd3
Apply automatic documentation changes
johnfrancismccann Mar 17, 2021
df595e8
Merge branch 'master' into feat/get-objects
johnfrancismccann Apr 1, 2021
e35d9bf
feat(GET /objects): compile filter grammar once
johnfrancismccann Apr 2, 2021
d313286
Apply automatic documentation changes
johnfrancismccann Apr 2, 2021
e82e87c
Merge branch 'master' into feat/get-objects
johnfrancismccann Oct 8, 2021
23cb565
docs(GET /objects): move dev info from swagger
johnfrancismccann Oct 8, 2021
8f91573
style(GET /objects): use get on records dict
johnfrancismccann Oct 8, 2021
7ed158e
test(GET /objects): use fixture to set up objects
johnfrancismccann Oct 10, 2021
8636930
feat(GET /objects): add detail in filter error res
johnfrancismccann Oct 11, 2021
82cb87e
Apply automatic documentation changes
johnfrancismccann Oct 11, 2021
107 changes: 107 additions & 0 deletions docs/openapi.yaml
Expand Up @@ -409,6 +409,113 @@ paths:
tags:
- Index
/objects:
get:
description: "The filtering functionality was primarily driven by the requirement\
\ that a\nuser be able to get all objects having an authz resource matching\
\ a\nuser-supplied pattern at any index in the \"_resource_paths\" array.\n\
\nFor example, given the following metadata objects:\n\n {\n \"\
0\": {\n \"message\": \"hello\",\n \"_uploader_id\"\
: \"100\",\n \"_resource_paths\": [\n \"/programs/a\"\
,\n \"/programs/b\"\n ],\n \"pet\": \"\
dog\",\n \"pet_age\": 1\n },\n \"1\": {\n \
\ \"message\": \"greetings\",\n \"_uploader_id\": \"101\",\n\
\ \"_resource_paths\": [\n \"/open\",\n \
\ \"/programs/c/projects/a\"\n ],\n \"pet\":\
\ \"ferret\",\n \"pet_age\": 5,\n \"sport\": \"soccer\"\
\n },\n \"2\": {\n \"message\": \"morning\",\n \
\ \"_uploader_id\": \"102\",\n \"_resource_paths\": [\n\
\ \"/programs/d\",\n \"/programs/e\"\n \
\ ],\n \"counts\": [42, 42, 42],\n \"pet\": \"\
ferret\",\n \"pet_age\": 10,\n \"sport\": \"soccer\"\
\n },\n \"3\": {\n \"message\": \"evening\",\n \
\ \"_uploader_id\": \"103\",\n \"_resource_paths\": [\n\
\ \"/programs/f/projects/a\",\n \"/admin\"\n\
\ ],\n \"counts\": [1, 3, 5],\n \"pet\":\
\ \"ferret\",\n \"pet_age\": 15,\n \"sport\": \"basketball\"\
\n }\n }\n\nhow do we design a filtering interface that allows the\
\ user to get all\nobjects having an authz string matching the pattern\n\"\
/programs/%/projects/%\" at any index in its \"_resource_paths\" array? (%\n\
has been used as the wildcard so far because that's what Postgres uses as\n\
the wildcard for LIKE) In this case, the \"1\" and \"3\" objects should be\n\
returned.\n\nThe filter syntax that was arrived at ended up following the\
\ syntax\nspecified by a [Node JS implementation](https://www.npmjs.com/package/json-api#filtering)\
\ of the [JSON:API\nspecification](https://jsonapi.org/).\n\nThe format for\
\ this syntax is filter=(field_name,operator,value), in which\nthe field_name\
\ is a json key without quotes, operator is one of :eq, :ne,\n:gt, :gte, :lt,\
\ :lte, :like, :all, :any (see operators dict), and value is\na typed json\
\ value against which the operator is run.\n\nExamples:\n\n GET /objects?filter=(message,:eq,\"\
morning\") returns \"2\"\n GET /objects?filter=(counts.1,:eq,3) returns\
\ \"3\"\n GET /objects?filter=(pet_age,:lte,5) returns \"0\" and \"1\"\n\
\ GET /objects?filter=(pet_age,:gt,5) returns \"2\" and \"3\"\n\nCompound\
\ expressions are supported:\n\n GET /objects?filter=(_resource_paths,:any,(,:like,\"\
/programs/%/projects/%\")) returns \"1\" and \"3\"\n GET /objects?filter=(counts,:all,(,:eq,42))\
\ returns \"2\"\n\nBoolean expressions are also supported:\n\n GET /objects?filter=(or,(_uploader_id,:eq,\"\
101\"),(_uploader_id,:eq,\"102\")) returns \"1\" and \"2\"\n GET /objects?filter=(or,(and,(pet,:eq,\"\
ferret\"),(sport,:eq,\"soccer\")),(message,:eq,\"hello\")) returns \"0\",\
\ \"1\", and \"2\""
operationId: get_objects_objects_get
parameters:
- description: Switch to returning a list of GUIDs (false), or metadata objects
(true).
in: query
name: data
required: false
schema:
default: true
description: Switch to returning a list of GUIDs (false), or metadata objects
(true).
title: Data
type: boolean
- description: The offset for what objects are returned (zero-indexed). The
exact offset will be equal to page*limit (e.g. with page=1, limit=15, 15
objects beginning at index 15 will be returned).
in: query
name: page
required: false
schema:
default: 0
description: The offset for what objects are returned (zero-indexed). The
exact offset will be equal to page*limit (e.g. with page=1, limit=15,
15 objects beginning at index 15 will be returned).
title: Page
type: integer
- description: 'Maximum number of objects returned (max: 1024). Also used with
page to determine page size.'
in: query
name: limit
required: false
schema:
default: 10
description: 'Maximum number of objects returned (max: 1024). Also used
with page to determine page size.'
title: Limit
type: integer
- description: The filter(s) that will be applied to the result (more detail
in the docstring).
in: query
name: filter
required: false
schema:
default: ''
description: The filter(s) that will be applied to the result (more detail
in the docstring).
title: Filter
type: string
responses:
'200':
content:
application/json:
schema: {}
description: Successful Response
'422':
content:
application/json:
schema:
$ref: '#/components/schemas/HTTPValidationError'
description: Validation Error
summary: Get Objects
tags:
- Object
post:
description: "Create object placeholder and attach metadata, return Upload url\
\ to the user.\n\nArgs:\n body (CreateObjInput): input body for create\
Expand Down
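The filter examples in the description above contain characters (`%`, `"`, `(`, `,`) that need percent-encoding before they go into a query string; in particular, the LIKE wildcard `%` has to be sent as `%25`. A minimal client-side sketch, assuming a hypothetical base URL and a helper name (`build_objects_url`) that is not part of this PR:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_objects_url(base_url, filter_expr="", data=True, page=0, limit=10):
    """Return a GET /objects URL whose query string is percent-encoded."""
    params = {
        "data": "true" if data else "false",
        "page": page,
        "limit": limit,
    }
    if filter_expr:
        # urlencode escapes '%', '"', '(' and ',' so the filter reaches the
        # server unchanged after the usual query-string decoding.
        params["filter"] = filter_expr
    return f"{base_url.rstrip('/')}/objects?{urlencode(params)}"


url = build_objects_url(
    "https://example.org/mds",  # hypothetical deployment URL
    filter_expr='(_resource_paths,:any,(,:like,"/programs/%/projects/%"))',
    page=1,
    limit=15,
)
print(url)
```

With `page=1` and `limit=15`, the parameter descriptions above imply an offset of `page*limit = 15`, so this request would ask for the second block of 15 matching objects.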
16 changes: 15 additions & 1 deletion poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions pyproject.toml
Expand Up @@ -18,6 +18,7 @@ indexclient = "^2.1.0"
httpx = "^0.12.1"
authutils = "^5.0.4"
cdislogging = "^1.0.0"
parsimonious = "0.8.1"
paulineribeyre marked this conversation as resolved.

[tool.poetry.dev-dependencies]
pytest = "^5.3"
Expand Down
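The `parsimonious` dependency is added for parsing the `filter` parameter (see the "parse filter with parsimonious" commit), but the grammar itself is not visible in this excerpt. To make the `(field_name,operator,value)` syntax concrete, here is a rough standard-library sketch that parses a filter and evaluates it against in-memory dicts. It is not the PR's grammar and does not generate SQL; every name in it (`tokenize`, `parse`, `matches`, `select`) is hypothetical, and `:like` is approximated with a regular expression.

```python
import json
import operator
import re

# Tokens: parentheses, commas, JSON strings, or bare words (fields, operators, numbers).
_TOKEN = re.compile(r'\s*(\(|\)|,|"(?:[^"\\]|\\.)*"|[^(),\s"]+)')
_MISSING = object()
_COMPARE = {":eq": operator.eq, ":ne": operator.ne, ":gt": operator.gt,
            ":gte": operator.ge, ":lt": operator.lt, ":lte": operator.le}


def tokenize(text):
    text, pos, tokens = text.strip(), 0, []
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Consume one parenthesised filter and return it as a nested tuple."""
    def take(expected=None):
        if not tokens or expected not in (None, tokens[0]):
            raise ValueError(f"expected {expected or 'a token'}")
        return tokens.pop(0)

    take("(")
    head = "" if tokens[:1] == [","] else take()  # empty field inside :any/:all
    if head in ("and", "or"):
        children = []
        while tokens[:1] == [","]:
            take(",")
            children.append(parse(tokens))
        take(")")
        return (head, children)
    take(",")
    op = take()
    take(",")
    value = parse(tokens) if tokens[:1] == ["("] else json.loads(take())
    take(")")
    return (head, op, value)


def lookup(obj, path):
    """Follow a dotted path such as "counts.1"; an empty path is the value itself."""
    for part in (path.split(".") if path else []):
        if isinstance(obj, dict) and part in obj:
            obj = obj[part]
        elif isinstance(obj, list) and part.isdecimal() and int(part) < len(obj):
            obj = obj[int(part)]
        else:
            return _MISSING
    return obj


def like(value, pattern):
    """Approximate SQL LIKE: % matches any run of characters, _ exactly one."""
    regex = "".join(".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern)
    return isinstance(value, str) and re.fullmatch(regex, value, re.DOTALL) is not None


def matches(node, obj):
    if len(node) == 2:  # ("and" | "or", [children])
        combine = all if node[0] == "and" else any
        return combine(matches(child, obj) for child in node[1])
    field, op, value = node
    target = lookup(obj, field)
    if target is _MISSING:
        return False
    if op in (":any", ":all"):  # value is a nested filter applied to each element
        combine = any if op == ":any" else all
        return isinstance(target, list) and combine(matches(value, item) for item in target)
    if op == ":like":
        return like(target, value)
    try:
        return _COMPARE[op](target, value)
    except TypeError:  # e.g. ordering a string against a number
        return False


def select(objects, filter_expr):
    tree = parse(tokenize(filter_expr))
    return [guid for guid, obj in objects.items() if matches(tree, obj)]


sample = {
    "1": {"_resource_paths": ["/open", "/programs/c/projects/a"]},
    "2": {"_resource_paths": ["/programs/d", "/programs/e"], "counts": [42, 42, 42]},
}
print(select(sample, '(_resource_paths,:any,(,:like,"/programs/%/projects/%"))'))  # ['1']
```

Error handling is deliberately thin here (for example, trailing tokens after the outermost filter are not rejected), and edge cases such as `:all` over an empty array may behave differently in the real SQL-backed implementation.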
159 changes: 157 additions & 2 deletions src/mds/objects.py
Expand Up @@ -3,7 +3,7 @@

from authutils.token.fastapi import access_token
from asyncpg import UniqueViolationError
- from fastapi import HTTPException, APIRouter, Security
+ from fastapi import HTTPException, APIRouter, Security, Query
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import httpx
from pydantic import BaseModel
Expand All @@ -22,7 +22,7 @@

from . import config, logger
from .models import Metadata
- from .query import get_metadata
+ from .query import get_metadata, search_metadata_objects

mod = APIRouter()

Expand Down Expand Up @@ -226,6 +226,161 @@ async def create_object_for_id(
return JSONResponse(response, HTTP_201_CREATED)


@mod.get("/objects")
async def get_objects(
Contributor

If I understand it from the design doc linked in PXP-5529, this is sort of moving us towards the idea of rebranding MDS as a generalized "object management" service, is that right? (Just want to make sure I'm getting the context)

Contributor Author

Yeah, that’s right. For the GET /objects endpoint in particular, note that the Indexd records corresponding to the metadata objects are returned in the response. I think the idea with the Object Management Service is to bring together various components of Gen3 (MDS, Fence, Indexd, SSJDispatcher, Indexs3client) to support a submission flow in which PFB files can be uploaded to the data lake (which consists of a single S3 bucket at the moment) by running a single gen3-client command.

During submission, a metadata object and an Indexd record are created as a pair and populated by an Indexs3client job: the Indexd record with the calculated hashes, size, and URLs, and the metadata object with _bucket, _filename, _file_extension, and _upload_status.

Contributor

Gotcha. That makes a lot of sense to me. Thanks!

request: Request,
data: bool = Query(
True,
description="Switch to returning a list of GUIDs (false), "
Contributor

Suggested change
- description="Switch to returning a list of GUIDs (false), "
+ description="Switch to return a list of GUIDs (false), "

At least, I think that's right? 😅 🔤

"or metadata objects (true).",
),
page: int = Query(
0,
description="The offset for what objects are returned "
"(zero-indexed). The exact offset will be equal to "
"page*limit (e.g. with page=1, limit=15, 15 objects "
"beginning at index 15 will be returned).",
),
limit: int = Query(
johnfrancismccann marked this conversation as resolved.
10,
description="Maximum number of objects returned (max: 1024). "
"Also used with page to determine page size.",
),
filter: str = Query(
"",
description="The filter(s) that will be applied to the "
"result (more detail in the docstring).",
Contributor

Suggested change
- "result (more detail in the docstring).",
+ "result (more detail in the endpoint description).",

this is super nitpicky but "docstring" doesn't make sense in the swagger doc

),
) -> JSONResponse:
"""
The filtering functionality was primarily driven by the requirement that a
Contributor

the way this doc is written is aimed at a dev reading the code, not at a user reading the swagger API docs. i'm referring to things like:

  • "primarily driven by the requirement that a user be able"
  • "how do we design a filtering interface that allows the user to"
  • "that's what Postgres uses"

This information is valuable but IMO ideally docstrings would be written for API users (describe how to use the filter), since the docs are generated from docstrings automatically, and technical details would be somewhere else. Not sure where, maybe at the top of the file?

user be able to get all objects having an authz resource matching a
user-supplied pattern at any index in the "_resource_paths" array.

For example, given the following metadata objects:

{
"0": {
"message": "hello",
"_uploader_id": "100",
"_resource_paths": [
"/programs/a",
"/programs/b"
],
"pet": "dog",
"pet_age": 1
},
"1": {
"message": "greetings",
"_uploader_id": "101",
"_resource_paths": [
"/open",
"/programs/c/projects/a"
],
"pet": "ferret",
"pet_age": 5,
"sport": "soccer"
},
"2": {
"message": "morning",
"_uploader_id": "102",
"_resource_paths": [
"/programs/d",
"/programs/e"
],
"counts": [42, 42, 42],
"pet": "ferret",
"pet_age": 10,
"sport": "soccer"
},
"3": {
"message": "evening",
"_uploader_id": "103",
"_resource_paths": [
"/programs/f/projects/a",
"/admin"
],
"counts": [1, 3, 5],
"pet": "ferret",
"pet_age": 15,
"sport": "basketball"
}
}

how do we design a filtering interface that allows the user to get all
objects having an authz string matching the pattern
"/programs/%/projects/%" at any index in its "_resource_paths" array? (%
has been used as the wildcard so far because that's what Postgres uses as
the wildcard for LIKE) In this case, the "1" and "3" objects should be
returned.

The filter syntax that was arrived at ended up following the syntax
specified by a [Node JS implementation](https://www.npmjs.com/package/json-api#filtering) of the [JSON:API
specification](https://jsonapi.org/).

The format for this syntax is filter=(field_name,operator,value), in which
the field_name is a json key without quotes, operator is one of :eq, :ne,
:gt, :gte, :lt, :lte, :like, :all, :any (see operators dict), and value is
a typed json value against which the operator is run.

Examples:

GET /objects?filter=(message,:eq,"morning") returns "2"
williamhaley marked this conversation as resolved.
GET /objects?filter=(counts.1,:eq,3) returns "3"
GET /objects?filter=(pet_age,:lte,5) returns "0" and "1"
GET /objects?filter=(pet_age,:gt,5) returns "2" and "3"

Compound expressions are supported:

GET /objects?filter=(_resource_paths,:any,(,:like,"/programs/%/projects/%")) returns "1" and "3"
GET /objects?filter=(counts,:all,(,:eq,42)) returns "2"

Boolean expressions are also supported:

GET /objects?filter=(or,(_uploader_id,:eq,"101"),(_uploader_id,:eq,"102")) returns "1" and "2"
GET /objects?filter=(or,(and,(pet,:eq,"ferret"),(sport,:eq,"soccer")),(message,:eq,"hello")) returns "0", "1", and "2"
"""

metadata_objects = await search_metadata_objects(
data=data, page=page, limit=limit, filter=filter
)

records = {}
if data and metadata_objects:
try:
endpoint_path = "/bulk/documents"
full_endpoint = config.INDEXING_SERVICE_ENDPOINT.rstrip("/") + endpoint_path
williamhaley marked this conversation as resolved.
guids = list(guid for guid, _ in metadata_objects)

response = await request.app.async_client.post(full_endpoint, json=guids)
response.raise_for_status()
records = {r["did"]: r for r in response.json()}
except httpx.HTTPError as err:
logger.debug(err, exc_info=True)
if err.response:
logger.error(
"indexd `POST %s` endpoint returned a %s HTTP status code",
endpoint_path,
err.response.status_code,
)
else:
logger.error(
"Unable to get a response from indexd `POST %s` endpoint",
endpoint_path,
)

if data:
response = {
"items": [
{"record": records[guid] if guid in records else {}, "metadata": o}
Contributor

Suggested change
- {"record": records[guid] if guid in records else {}, "metadata": o}
+ {"record": records.get(guid, {}), "metadata": o}

for guid, o in metadata_objects
]
}
else:
response = metadata_objects
return response
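The response assembly at the end of `get_objects` can be looked at in isolation. The sketch below pulls it into two pure functions, using the `records.get(guid, {})` form from the review suggestion. The function names and the sample data are made up for illustration; the shapes (a list of `(guid, metadata)` pairs, and indexd documents keyed by `"did"`) are taken from the code above.

```python
def index_by_did(indexd_documents):
    """Map each indexd document to its GUID, as the endpoint does with r["did"]."""
    return {doc["did"]: doc for doc in indexd_documents}


def build_items(metadata_objects, records):
    """Pair every (guid, metadata) tuple with its indexd record, or {} if none was found."""
    return {
        "items": [
            {"record": records.get(guid, {}), "metadata": metadata}
            for guid, metadata in metadata_objects
        ]
    }


# Hypothetical data: only the first GUID has a matching indexd document.
metadata_objects = [("guid-a", {"pet": "dog"}), ("guid-b", {"pet": "ferret"})]
records = index_by_did([{"did": "guid-a", "size": 42}])
response = build_items(metadata_objects, records)
print(response)
```

Because `records` starts out as an empty dict and the `except` branch only logs, a failed call to indexd would leave every item with an empty `record` rather than failing the whole request.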


@mod.get("/objects/{guid:path}/download")
async def get_object_signed_download_url(
guid: str,
Expand Down