
Conversation

@tsmathis
Collaborator

should just work™️

@tsmathis
Collaborator Author

tsmathis commented Oct 23, 2025

some notes:

1. depends on #1021
2. could supersede #974
3. this will need to be addressed for the progress bar to be accurate in the case where has_gnome_access=False:

# TODO: Update tasks (+ others?) resource to have emmet-api BatchIdQuery operator
#   -> need to modify BatchIdQuery operator to handle root level
#      batch_id, not only builder_meta.batch_id
# if not has_gnome_access:
#     num_docs_needed = self.count(
#         {"batch_id_neq_any": SETTINGS.ACCESS_CONTROLLED_BATCH_IDS}
#     )

the count can be retrieved from S3, but the equivalent COUNT(*) ... WHERE NOT IN ... query is slow
4. wasn't sure how to emit messages to the user; warnings might not be the best choice (see the logging sketch after this list):
warnings.warn(
    f"Dataset for {suffix} already exists at {target_path}, delete or move existing dataset "
    "or re-run search query with MPRester(force_renew=True)",
    MPLocalDatasetWarning,
)

warnings.warn(
    f"Dataset for {suffix} written to {target_path}. It is recommended to optimize "
    "the table according to your usage patterns prior to running intensive workloads, "
    "see: https://delta-io.github.io/delta-rs/delta-lake-best-practices/#optimizing-table-layout",
    MPLocalDatasetWarning,
)

5. On the fence about whether MPDataset should inherit the user's choice of use_document_model or default to False; it's extra overhead when True:
use_document_model=self.use_document_model,

6. re: document models, wasn't sure if making an MPDataDoc model was the right route, so the emmet model is just passed through for now.
7. @esoteric-ephemera, is this how coercing user input to AlphaIDs should go? Do you want to do something different?
as_alpha = str(AlphaID(task_id, padlen=8)).split("-")[-1]

8. Is MPAPIClientSettings the right place for these? Not sure if the user has the ability to adjust these if needed:
LOCAL_DATASET_CACHE: str = Field(
    os.path.expanduser("~") + "/mp_datasets",
    description="Target directory for downloading full datasets",
)
DATASET_FLUSH_THRESHOLD: int = Field(
    100000,
    description="Threshold number of rows to accumulate in memory before flushing dataset to disk",
)
ACCESS_CONTROLLED_BATCH_IDS: list[str] = Field(
    ["gnome_r2scan_statics"], description="Batch ids with access restrictions"
)
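
A minimal sketch of the logging alternative hinted at in note 4 (the logger name and message text here are illustrative assumptions, not the client's actual implementation); logging.info stays silent unless the caller configures a handler, whereas warnings.warn always surfaces at least once:

import logging

logger = logging.getLogger("mp_api.client.datasets")  # hypothetical logger name

def notify_dataset_written(suffix: str, target_path: str) -> None:
    # informational message instead of a warning: silent by default for
    # library code unless the user opts in to logging output
    logger.info(
        "Dataset for %s written to %s. Consider optimizing the table layout "
        "before running intensive workloads.",
        suffix,
        target_path,
    )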

@tsmathis
Collaborator Author

ah, and based on the failing test for trajectories, I assumed returning the pymatgen object was correct; should the dict be returned instead? @esoteric-ephemera

return RelaxTrajectory(**traj_data[0]).to_pmg()

@esoteric-ephemera
Collaborator

esoteric-ephemera commented Oct 23, 2025

@tsmathis I think the API was set up to return the jsanitized trajectory info:
https://github.com/materialsproject/emmet/blob/3447c5af4746d539f1f4faf26b97715cb119c85d/emmet-api/emmet/api/routes/materials/tasks/query_operators.py#L73

Either way, yeah, I guess it returned the as_dict, but we don't need to stick with that paradigm

For the AlphaID, to handle either the no prefix/separator ("aaaaaaft") and with prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner - I'll do that in the PR you linked:

"a"*(x._padlen-len(x._identifier)) + x._identifier

or

if (alpha := AlphaID(task_id, padlen=8))._separator:
  padded = str(alpha).rsplit(alpha._separator)[-1] 
else:
  padded = str(alpha)
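
For illustration only, the padding behavior both options implement, written without AlphaID internals (the helper name, the hard-coded "a" pad character, and the "-" separator are assumptions based on this thread):

def padded_identifier(raw_id: str, padlen: int = 8, separator: str = "-") -> str:
    # "mp-aaaaaaft" -> "aaaaaaft"; a bare "aaaaaaft" passes through unchanged
    identifier = raw_id.rsplit(separator, 1)[-1]
    # left-pad with "a" (the zero digit of the alphabetic encoding) up to padlen
    return "a" * (padlen - len(identifier)) + identifier

assert padded_identifier("mp-ft") == "aaaaaaft"
assert padded_identifier("aaaaaaft") == "aaaaaaft"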

@tsmathis
Collaborator Author

tsmathis commented Oct 23, 2025

> For the AlphaID, to handle either the no prefix/separator ("aaaaaaft") and with prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner - I'll do that in the PR you linked:

either way works for me on this, just want to make sure I stick to the intended usage (edit: or that we're at least consistent across the client)

> Either way, yeah, I guess it returned the as_dict, but we don't need to stick with that paradigm

Was going to say we could stick to whatever the frontend was expecting, but looking now, the frontend doesn't even use the tasks.get_trajectory(...) function, so it will need to be rewritten either way. The frontend does end up making a dataframe from the trajectory dict, so maybe just returning the dict will be best.

@codecov-commenter

codecov-commenter commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 42.10526% with 77 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.10%. Comparing base (52a3c57) to head (b2a832f).

Files with missing lines        Patch %    Lines
mp_api/client/core/client.py    26.38%     53 Missing ⚠️
mp_api/client/core/utils.py     47.82%     24 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1023      +/-   ##
==========================================
- Coverage   67.29%   66.10%   -1.20%     
==========================================
  Files          50       50              
  Lines        2770     2894     +124     
==========================================
+ Hits         1864     1913      +49     
- Misses        906      981      +75     


Member

@tschaume left a comment

Very nice! Looking forward to rolling this out 😄

Member

@tschaume left a comment

thanks for the updates @tsmathis! Found a few more potential issues/improvements

@tsmathis
Collaborator Author

tsmathis commented Nov 5, 2025

good catches @tschaume

you're obviously free to keep adding changes for testing, but you can also just ping me if you want me to update things as changes come in upstream

LOCAL_DATASET_CACHE: str = Field(
    os.path.expanduser("~") + "/mp_datasets",
Collaborator

@esoteric-ephemera Nov 12, 2025

may want to change to just os.path.expanduser("~/mp_datasets") so that os can resolve non-unix-like separators. Or just use pathlib.Path("~/mp_datasets").expanduser()

Collaborator Author

good point
7ee5515
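
For reference, one way the suggested default could look (illustrative only; the actual change in 7ee5515 may differ):

from pathlib import Path
from pydantic import Field

LOCAL_DATASET_CACHE: str = Field(
    # expanduser() resolves "~" with the platform-appropriate separator
    str(Path("~/mp_datasets").expanduser()),
    description="Target directory for downloading full datasets",
)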

DATASET_FLUSH_THRESHOLD: int = Field(
    100000,
Collaborator

Could we make this a byte threshold in memory with pyarrow.Table.get_total_buffer_size? Would be an overestimate but that's probably safe for this case

Collaborator Author

Not sure if that would work exactly, since the in-memory accumulator is a Python list of pa.RecordBatches.

I'll look around for something that's more predictable for the flush threshold than just number of rows since row sizes can vary drastically across different data products.

Collaborator Author

After some looking, RecordBatch also has get_total_buffer_size()

What do you think a good threshold would be in this case? For the first 100k rows for the tasks table I got 2770781904 bytes (2.7 GB)

Collaborator Author

The corresponding on-disk size (compressed w/ zstd) for that first 100k rows is 422 MB.

Collaborator

2.5-2.75 GB spill is probably good
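
To make that concrete, a minimal sketch (not the client's actual implementation) of a byte-based spill over a list of pa.RecordBatch; the 2.5 GB default and the flush callback are assumptions:

from typing import Callable

import pyarrow as pa

FLUSH_THRESHOLD_BYTES = int(2.5 * 1024**3)  # assumed spill point from this thread

def accumulate(
    batches: list[pa.RecordBatch],
    new_batch: pa.RecordBatch,
    flush: Callable[[pa.Table], None],
    threshold: int = FLUSH_THRESHOLD_BYTES,
) -> None:
    batches.append(new_batch)
    # get_total_buffer_size() can over-count shared buffers, so this errs on
    # the side of flushing early, which is the safe direction here
    if sum(b.get_total_buffer_size() for b in batches) >= threshold:
        flush(pa.Table.from_batches(batches))
        batches.clear()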


@esoteric-ephemera
Collaborator

Working great for me! Full task download in ~6 min, which is crazy compared to before.

General discussion question: Now that we're working with just bare AlphaIDs (e.g., aaaaaaft), we may need to manually insert prefixes by endpoint, right? Once the various materials endpoints are delta-backed, they should just get mp-. Do we want to manually insert an mpt- prefix or task- for the tasks?

@tsmathis
Collaborator Author

I think for the core tasks collection we're going prefix-less, correct? All the others will get prefixes at parse/build time.

@tsmathis
Collaborator Author

tsmathis commented Nov 12, 2025

re: the iterating, indexing into the local dataset, etc

I am a little conflicted about what the best route for the Python-like implementation/behavior of the local MPDatasets should be, mainly because as soon as we leave arrow-land we're neutering the performance that can be achieved.

As an example:
Regardless of how we do the iteration behavior, this is dog water:

# doesn't work currently, would have to update iterating to match Aaron's review comment first
>>> tasks = mpr.materials.tasks.search()
>>> non_metallic_r2scan_structures = [
    x.structure 
    for x in tasks 
    if x.output.bandgap > 0 and x.run_type == "r2SCAN"
]

compared to:

>>> import pyarrow.compute as pc
>>> tasks_ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> non_metallic_r2scan_structures = tasks_ds.to_table(columns=["structure"], filter=expr)

which is sub-second execution on my machine

I am obviously biased on this front since I am comfortable with arrow's usage patterns; not sure if the average client user would be willing to go down that route. Ideally, though, we should be guiding users towards a "golden path".

@esoteric-ephemera
Collaborator

Yeah, it's hard to say what's best in this case. We'd probably want to prioritize user experience across endpoints, or just throw a specialized warning for full task retrieval noting that the return type is different.

If pandas is a no-op from parquet (not sure if that's also true for the dataset or just an individual table/array), then that could be a viable alternative? Feel like pandas will be more familiar than arrow datasets.
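
For what it's worth, a quick sketch of that pandas route, reusing the hypothetical tasks_ds dataset from the earlier example (pandas >= 2.0 assumed for ArrowDtype):

import pandas as pd
import pyarrow.compute as pc

expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
# filter in Arrow first, then hand the result to pandas; Arrow-backed dtypes
# avoid copying into NumPy where the column types allow it
df = tasks_ds.to_table(columns=["structure"], filter=expr).to_pandas(
    types_mapper=pd.ArrowDtype,
)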
