Add Deltalake query support #1023
Conversation
some notes:

api/mp_api/client/core/client.py, lines 573 to 579 in aee0f8c

3. COUNT(*) ... WHERE NOT IN ... is slow (a common mitigation is sketched after this list)
4. wasn't sure how to emit messages to the user; warnings might not be the best choice:
   api/mp_api/client/core/client.py, lines 532 to 535 in aee0f8c
   api/mp_api/client/core/client.py, lines 631 to 636 in aee0f8c
5. on the fence about whether MPDataset should inherit the user's choice of use_document_model or default to False; it's extra overhead when True:
   api/mp_api/client/core/client.py, line 642 in aee0f8c
6. re: document models, wasn't sure if making an MPDataDoc model was the right route, so the emmet model is just passed through for now.
7. @esoteric-ephemera, is this how coercing user input to AlphaIDs should go? Do you want to do something different?
8. is MPAPIClientSettings the right place for these settings? Not sure the user has the ability to adjust them if needed:
   api/mp_api/client/core/settings.py, lines 90 to 102 in aee0f8c
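For note 3, a minimal sketch of the usual NOT IN mitigation, assuming a DuckDB-flavored SQL engine (the table and column names here are made up; this is not the PR's actual query code):

```python
import duckdb  # assumption: a DuckDB-style engine; the PR's query layer may differ

con = duckdb.connect()
con.execute("CREATE TABLE remote (task_id TEXT)")
con.execute("CREATE TABLE local_ds (task_id TEXT)")

# NOT IN re-evaluates subquery membership per row and is NULL-hostile:
slow = "SELECT COUNT(*) FROM remote WHERE task_id NOT IN (SELECT task_id FROM local_ds)"

# An anti-join usually plans to a single hash anti-join instead:
fast = "SELECT COUNT(*) FROM remote ANTI JOIN local_ds USING (task_id)"
print(con.execute(fast).fetchone()[0])
```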
ah, and based on the failing test for trajectories, I assumed returning the pymatgen object was correct; should the ...
@tsmathis think the API was set up to return the ... Either way, yeah, I guess it returned the ... For the AlphaID, to handle both the no-prefix/separator ("aaaaaaft") and with-prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner; I'll do that in the PR you linked: ... or ...
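A purely illustrative normalizer for those two cases (AlphaID's actual API and padding rules live in emmet and may differ; the helper name and the 'a'-as-zero-digit padding are assumptions):

```python
def to_padded_identifier(ident: str, prefix: str = "mp", width: int = 8) -> str:
    """Normalize 'aaaaaaft' or 'mp-aaaaaaft' to the padded 'mp-aaaaaaft' form."""
    alpha = ident.rsplit("-", 1)[-1]              # drop an existing 'mp-'-style prefix
    return f"{prefix}-{alpha.rjust(width, 'a')}"  # assumes 'a' pads as the zero digit

assert to_padded_identifier("aaaaaaft") == "mp-aaaaaaft"
assert to_padded_identifier("mp-aaaaaaft") == "mp-aaaaaaft"
```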
either way works for me on this; just want to make sure I stick to the intended usage (edit: or that we're at least consistent across the client)
Was going to say we could stick to whatever the frontend was expecting, but looking now, the frontend doesn't even use the ...
Codecov Report

❌ Patch coverage is ...

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1023      +/-   ##
==========================================
- Coverage   67.29%   66.10%    -1.20%
==========================================
  Files          50       50
  Lines        2770     2894     +124
==========================================
+ Hits         1864     1913      +49
- Misses        906      981      +75
```
tschaume
left a comment
Very nice! Looking forward to rolling this out 😄
tschaume
left a comment
thanks for the updates @tsmathis! Found a few more potential issues/improvements.
good catches @tschaume! You're obviously free to keep adding changes for testing, but you can also just ping me if you want me to update things as changes come in upstream.
mp_api/client/core/settings.py (Outdated)

```python
    )

    LOCAL_DATASET_CACHE: str = Field(
        os.path.expanduser("~") + "/mp_datasets",
```
may want to change to just os.path.expanduser("~/mp_datasets") so that os can resolve non-unix-like separators. Or just use pathlib.Path("~/mp_datasets").expanduser()
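For concreteness, a sketch of the suggested default in the settings model (the surrounding class shape and the description text are assumed, not copied from the PR):

```python
from pathlib import Path

from pydantic import Field
from pydantic_settings import BaseSettings  # assumption: pydantic v2-style settings


class MPAPIClientSettings(BaseSettings):
    LOCAL_DATASET_CACHE: str = Field(
        # expanduser() resolves "~" and the platform's separators in one step
        default=str(Path("~/mp_datasets").expanduser()),
        description="Directory for locally cached datasets",
    )
```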
good point
7ee5515
mp_api/client/core/settings.py (Outdated)

```python
    )

    DATASET_FLUSH_THRESHOLD: int = Field(
        100000,
```
Could we make this a byte threshold in memory with pyarrow.Table.get_total_buffer_size? Would be an overestimate but that's probably safe for this case
Not sure that would work exactly, since the in-memory accumulator is a Python list of pa.RecordBatches.
I'll look around for something more predictable for the flush threshold than just number of rows, since row sizes can vary drastically across different data products.
After some looking, RecordBatch also has get_total_buffer_size().
What do you think a good threshold would be in this case? For the first 100k rows of the tasks table I got 2770781904 bytes (~2.7 GB).
The corresponding on-disk size (compressed with zstd) for those first 100k rows is 422 MB.
A 2.5-2.75 GB spill threshold is probably good.
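A minimal sketch of that byte-based spill, assuming the accumulator stays a Python list of RecordBatches (the setting name and the Parquet writer are assumptions; the PR writes Delta tables):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed setting; the PR currently flushes on row count instead.
DATASET_FLUSH_THRESHOLD_BYTES = 2_750_000_000  # ~2.75 GB, per the numbers above


def maybe_flush(
    batches: list[pa.RecordBatch], writer: pq.ParquetWriter
) -> list[pa.RecordBatch]:
    """Write the accumulated batches once their summed in-memory footprint
    (RecordBatch.get_total_buffer_size()) crosses the threshold."""
    if sum(b.get_total_buffer_size() for b in batches) >= DATASET_FLUSH_THRESHOLD_BYTES:
        writer.write_table(pa.Table.from_batches(batches))
        return []  # reset the accumulator
    return batches
```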
Working great for me! Full task download in ~6 min, which is crazy compared to before. General discussion question: now that we're working with just bare ...
I think for the core tasks collection we're going prefix-less, correct? All the others will get prefixes at parse/build time.
re: the iterating, indexing into the local dataset, etc. I am a little conflicted on what the best route is for the python-like implementation/behavior of the local dataset. As an example:

```python
# doesn't work currently, would have to update iterating to match Aaron's review comment first
>>> tasks = mpr.materials.tasks.search()
>>> non_metallic_r2scan_structures = [
...     x.structure
...     for x in tasks
...     if x.output.bandgap > 0 and x.run_type == "r2SCAN"
... ]
```

compared to:

```python
>>> import pyarrow.compute as pc
>>> tasks_ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> non_metallic_r2scan_structures = tasks_ds.to_table(columns=["structure"], filter=expr)
```

which is sub-second execution on my machine. I am obviously biased on this front since I am comfortable with arrow's usage patterns; not sure the average client user would be willing to go down that route. Ideally, though, we should be guiding users towards a "golden path".
Yeah, it's hard to say what's best in this case. We'd probably want to prioritize user experience across endpoints, or just throw a specialized warning for full task retrieval that the return type is different. If ..., it should just work™️.
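Re: the specialized-warning idea (and note 4 in the opening comment), a sketch of what emitting it could look like; the warning class and message are assumptions, not code from this PR:

```python
import warnings


class MPDatasetReturnWarning(UserWarning):
    """Signals that a full-collection query returns a pyarrow-backed MPDataset
    rather than a list of document models."""


def warn_dataset_return(endpoint: str) -> None:
    warnings.warn(
        f"Full retrieval from '{endpoint}' returns an MPDataset backed by a local "
        "Arrow dataset; iterate it or use its pyarrow_dataset attribute.",
        MPDatasetReturnWarning,
        stacklevel=3,
    )
```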