DoMINO Performance Optimizations #1133

coreyjadams · 2025-09-30T15:38:32Z

PhysicsNeMo Pull Request

This PR is to close gaps in DoMINO performance. In includes some level of refactor, as well, with a little more to go. Opening this PR as a draft while tracking down some final little pieces, which will get listed below.

Description

This is a large PR. It covers a lot of pieces which essentially have to be brought in together. In no particular order, this PR includes:

DataPipe and DataSet separation

DrivaerML CAE DataSet

Note Renaming is in progress.

This PR includes a separation of IO from pre-processing in the DoMINO datapipe. There is now one IO interface "cae_ml_dataset" (open to a name change, here) that generically will read keys from a dictionary-like object. The dataset will infer, based on file properties, how to read the data:

.npy files are supported (pickled numpy dictionaries) though the entire file has to be read, and then the requested keys are returned as torch cpu tensors.
.npz files can read directly the tensors requested.
.zarr files will read with zarr 3.0, unless tensorstore is installed.
If tensorstore is installed, the zarr files will be read with tensorstore instead. It is significantly faster than plain zarr, with no extra CPU overhead on the python interpreter. Previous methods with threading added overhead to the python runtime causing cpu-based stalls in model code.
If none of those matches are made, it will check if every directory passed contains only .stl, .vtp, .vtu, or .csv files. In this case, it will use pyvista to read these.

The dataset will, as part of it's pipeline, pin tensors and move them to the GPU. Ideally if we're pinning, we would preallocate and write directly to those buffers, but tensorstore does not support that. If the output device is GPU, then the data transfer will happen in a separate stream. Optionally, if this is a multi-stream application, the user can pass a consumer_stream (defaults to torch.cuda.default_stream()) that will be used to ensure proper stream ordering.

The dataset object has ability to preload indexes in python threads (up to a configurable preload depth) to async load data from disk into torch tensors. Note that aggressive preloading will accumulate GPU storage usage if the dataset is outputting to GPU (configured by setting the output device in dataset construction).

Construction of these datasets looks like this:

dataset = DrivaerMLDataset(
                data_dir=self.config.data_path,
                keys_to_read=self.keys_to_read,
                output_device=self.preproc_device,
                pin_memory=pin_memory,
                consumer_stream=torch.cuda.default_stream(),
            )

Calling __getitem__ (aka, dataset[i]) will fetch the i'th entry of the dataset. There is also a preload function that will manually queue up the target data. Further, if you iterate over the dataset (for i, batch in enumerate(dataset)) the preloading will happen automatically and transparently.

Optimization of Volumetric Reads

Because the volumetric data is so large, per file, we have an optimization that can be enabled in the dataset. The dataset can read contiguous sub-slices of the data, for volumetric data, such that it reads only a small fraction from disk each iteration. This requires preprocessing: the dataset has to be shuffled, on disk, prior to training or inference. This is expected to be supported in curator in the next release.

Domino DataPipe

The datapipe for domino has been refactored entirely: the entire datapipe is implemented to operate on torch tensors. As of opening this PR, the datapipe can contain a reference to the cae_dataset object (though doesn't, by default). If it does have the dataset object, you can iterate over the datapipe. This is the generally supported method for optimal training pipelines.

The datapipe itself looks very similar to the previous iteration, minus the IO pieces, but has a few additions:

The datapipe already has built in to it transforms on the output fields. In inference mode, these are passed by if no output target fields are supplied.
Because it does the normalization one way, it makes sense to do the normalization the other way in the datapipe too. The datapipe can now be used to un-normalize or un-standardize outputs, and will pick the right path based on the datapipe configuration.

All logic based on computing scaling factors is removed from the datapipe. See below.

The CachedDominoDataset is largely unchanged.

DoMINO Example ReOrg

Inside the src/ folder of the DoMINO training example, there are the following changes:

the loss functions from training have moved to loss.py but are otherwise unchanged. They could use a clean up after the physics loss additions.
A utils.py file has been added to contain script-specific logic, like extracting the number of variables for the model from the config and a dataclass to contain scaling factor info (see below).
The test.py script is updated but not optimized.
train_sharded.py is to be deprecated - it's functionality will be supported directly in train.py
The train.py script has been updated only in that pieces have been moved into other scripts and imported, to shorten the script length and complexity.
A temporary script, benchmark_dataset.py is present for development but needs to be removed before merge.
Inference_on_stl is covered below.

Scaling Factors

Currently, in Domino, computing the scaling factors is done implicitly and somewhat slowly, since it uses the pipeline inefficiently. Instead, there is a compute_scaling_factors function that will use the dataset (not datapipe) object, loop over the data reading only the necessary fields, if possible, and return the mean/std/min/max of each field read. By default, for drivaerML, this will not only use the output fields but also the surface and volume coordinates. This can help determine bounding boxes on new datasets, etc.

To facilitate the scaling factor usage more easily, a new script is added to the src/ dir in the examples, compute_statistics.py. It is meant to be run on a single device, or even possibly the CPU if desired, and it saves the scaling factors into a pickled object that is reloaded in other scripts. I'm open to changes in this design, but the idea was to make the scale factors more stable, more portable, and computing them stand-alone from the training script.

Inference on STL

The entire inference on stl pipeline has been refactored and ported to use GPUs, reusing the dataset and datapipe from above. Performance is significantly better on GPU. Two pieces of the script are missing (called out below).

Other components

RMM by default

The RAPIDS memory manager is now used, by default, to share memory between the kNN in the preprocessing step and the main model code. This accelerates performance (avoiding cudaFree and cudaMalloc) and also reduces memory usage overall - since the two phases are out of sync with each other, sharing memory makes sense. It can only be opted out with an environment variable, now documented in the README.

The interface is quite simple, at the top of the file if you want to use it you do:

from physicsnemo.utils.memory import unified_gpu_memory

SDF is torch only now

The Signed distance function now only accepts torch inputs and returns torch outputs. The output signature is fixed to the distance itself, and the closest point on the mesh (which is not usually a vertex). Test have been updated, since they were slightly incorrect.

utils/domino/utils.py is torch only

The cupy/numpy backend of those functions is removed and it is torch only, now. The non-array functionality in utils for vtk reading has been moved to utils/domino/vtk_file_utils.py and is unchanged, otherwise.

Missing pieces will be marked below as issues to resolve before merge, however, here's the condensed list as I see it:

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.

Dependencies

Separate the dataloading from the data processing in DoMINO datapipe.

… data set for IO. This is reaching IO throughputs of about 5GB/s on ORD, so getting better.

… the kernel precision. The test had some expected numbers that, I believe, were incorrect.

Update domino_datapipe2 (temporary name).

…training script a little, simply by moving things around ...

…e) and fix a few details in the new pipeline. Use new pipeline in training script

some tweaks to enable the preprocess pipeline for inference.

examples/cfd/external_aerodynamics/domino/src/inference_on_stl.py

RishikeshRanade

LGTM

coreyjadams · 2025-10-17T19:54:07Z

This PR is huge and triggered a bunch of reviewers. ChatGPT helped me script up a way to figure out who was tagged, per file, and why:

docs/img/domino/combined-training-curve.png -> @megnvidia, @ktangsali
docs/img/domino/drag-r2.jpg -> @megnvidia, @ktangsali
docs/img/domino/lift-r2.jpg -> @megnvidia, @ktangsali
docs/img/domino/surface-training-curve.png -> @megnvidia, @ktangsali
docs/img/domino_perf.png -> @megnvidia, @ktangsali
examples/cfd/external_aerodynamics/domino/README.md -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/requirements.txt -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/benchmark_dataloader.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/compute_statistics.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/conf/config.yaml -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/README.md -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/inference_on_stl.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/openfoam_datapipe.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/retraining.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/train_sharded.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/inference_on_stl.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/loss.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/shuffle_volumetric_curator_output.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/test.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/train.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/utils.py -> @RishikeshRanade
physicsnemo/datapipes/cae/cae_dataset.py -> @RishikeshRanade, @coreyjadams, @peterdsharpe, @mnabian, @Alexey-Kamenev
physicsnemo/datapipes/cae/domino_datapipe.py -> @RishikeshRanade, @coreyjadams, @peterdsharpe, @mnabian, @Alexey-Kamenev
physicsnemo/datapipes/cae/domino_sharded_datapipe.py -> @RishikeshRanade, @coreyjadams, @peterdsharpe, @mnabian, @Alexey-Kamenev
physicsnemo/models/domino/encodings.py -> @RishikeshRanade
physicsnemo/models/domino/geometry_rep.py -> @RishikeshRanade
physicsnemo/models/domino/mlps.py -> @RishikeshRanade
physicsnemo/models/domino/model.py -> @RishikeshRanade
physicsnemo/models/domino/solutions.py -> @RishikeshRanade
physicsnemo/models/layers/__init__.py -> No owner
physicsnemo/models/layers/ball_query.py -> No owner
physicsnemo/models/layers/fourier_layers.py -> No owner
physicsnemo/models/layers/mlp_layers.py -> No owner
physicsnemo/utils/domino/utils.py -> @RishikeshRanade
physicsnemo/utils/domino/vtk_file_utils.py -> @RishikeshRanade
physicsnemo/utils/memory.py -> No owner
physicsnemo/utils/neighbors/radius_search/_torch_impl.py -> @coreyjadams, @peterdsharpe
test/datapipes/test_domino_datapipe.py -> No owner
test/distributed/shard_tensor/ops/test_radius_search.py -> No owner
test/models/data/domino_output-conv.pth -> No owner
test/models/data/domino_output-unet.pth -> No owner
test/models/data/domino_output.pth -> No owner
test/models/data/mlp_output.pth -> No owner
test/models/domino/__init__.py -> No owner
test/models/domino/conftest.py -> No owner
test/models/domino/test_domino.py -> No owner
test/models/domino/test_domino_encodings.py -> No owner
test/models/domino/test_domino_geometry_rep.py -> No owner
test/models/domino/test_domino_mlps.py -> No owner
test/models/domino/test_domino_solutions.py -> No owner
test/models/domino/utils.py -> No owner
test/models/test_mlp_layers.py -> No owner
test/utils/test_domino_utils.py -> No owner

Owner summary (files touched):
  @RishikeshRanade: 26
  @megnvidia: 5
  @ktangsali: 5
  @coreyjadams: 4
  @peterdsharpe: 4
  @mnabian: 3
  @Alexey-Kamenev: 3

I'm going to open a PR for tweaking code owners that should clear out the image updates and remove @megnvidia . I'll make the data pipes directory more fine grained to free up @mnabian and @Alexey-Kamenev .

That will leave one review, from @peterdsharpe , to look at

physicsnemo/utils/neighbors/radius_search/_torch_impl.py -> @coreyjadams, @peterdsharpe

Peter would you mind reviewing that file? Everything else has been vetted by me and @RishikeshRanade over a few weeks now.

peterdsharpe

Review completed for physicsnemo/utils/neighbors/radius_search/_torch_impl.py; LGTM

coreyjadams · 2025-10-17T20:10:35Z

/blossom-ci

coreyjadams · 2025-10-17T20:18:42Z

/blossom-ci

coreyjadams · 2025-10-17T23:11:16Z

/blossom-ci

coreyjadams · 2025-10-20T13:39:09Z

/blossom-ci

coreyjadams added 30 commits August 25, 2025 14:53

Relax cuml constraints

e559c46

Port sdf function to use only torch inputs. No changes to tests yet.

e38ecdf

Porting some domino utils function to pure torch interface

dd7b3cf

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

7888f29

Add new dataset to read DrivaerML like data in various formats.

8590afd

Separate the dataloading from the data processing in DoMINO datapipe.

Adding a torch-centric domino datapipe and a separated, data-agnostic…

caf0290

… data set for IO. This is reaching IO throughputs of about 5GB/s on ORD, so getting better.

Rename datapipe file to dataset.

7fb5f8e

Update SDF function and test. Auto convert higher precisions to match…

0fb0ed2

… the kernel precision. The test had some expected numbers that, I believe, were incorrect.

Add IO benchmark

60c3535

Minor bug fixes

0c668d9

Few bug fixes

70e6130

Merge branch 'main' into physicsnemo-domino-inference

b99dcfe

A few more fixes for domino.

4c26ae1

Fix pre-commit issues

45100ef

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

1301293

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

9489732

Port domino utils from cupy/numpy to pure torch.

675c546

Update domino_datapipe2 (temporary name).

update training script for new datapipe

4578975

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

8f22a6c

Add abillity to pin memory, optionally.

9a5d8ed

Snapshot updates of cleanups and minor fixes

c57f985

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

af5d7c9

Most datapipe tests passing. Add compute_statistics script. Clean up …

02b03a0

…training script a little, simply by moving things around ...

Update tests for the new pipeline (mostly fix indexing from batch siz…

ff185b3

…e) and fix a few details in the new pipeline. Use new pipeline in training script

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

296a58b

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

aceb1f1

Merge branch 'NVIDIA:main' into physicsnemo-domino-inference

e51bf4c

add a utility to sample on a mesh with torch.

c7c94cb

some tweaks to enable the preprocess pipeline for inference.

Add revamped inference script. Doesn't write outputs yet.

471aae9

Update inference script for Domino STL inference.

3864830

ktangsali reviewed Oct 17, 2025

View reviewed changes

examples/cfd/external_aerodynamics/domino/src/inference_on_stl.py Outdated Show resolved Hide resolved

coreyjadams and others added 6 commits October 17, 2025 08:04

Merge branch 'main' into physicsnemo-domino-inference

f4b54d4

Hopefully fix inference script

e91f263

fixes to scaling and adding configs

e9dbac9

Update README.md for domino

6706558

Remove unneeded plots.

d8a4901

uupdate r2

01a0c15

RishikeshRanade approved these changes Oct 17, 2025

View reviewed changes

RishikeshRanade marked this pull request as ready for review October 17, 2025 19:22

RishikeshRanade requested a review from megnvidia as a code owner October 17, 2025 19:22

peterdsharpe approved these changes Oct 17, 2025

View reviewed changes

Fix ruff issues

edae436

Merge branch 'main' into physicsnemo-domino-inference

29de3bf

coreyjadams removed request for Alexey-Kamenev and mnabian October 17, 2025 20:58

coreyjadams added 5 commits October 17, 2025 21:09

test fix.

6c04d88

Fix dict item with normalization off.

bd74ce3

Change codeowners order, exclusion goes last.

9023575

Undo file order so it can get fixed elsewhere.

b24b0d4

Merge branch 'main' into physicsnemo-domino-inference

e82d0ce

coreyjadams removed the request for review from megnvidia October 17, 2025 23:11

Update doctstring tests.

b8db738

coreyjadams self-assigned this Oct 20, 2025

coreyjadams merged commit aac1ce3 into NVIDIA:main Oct 20, 2025
1 check passed

coreyjadams deleted the physicsnemo-domino-inference branch October 23, 2025 14:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DoMINO Performance Optimizations #1133

DoMINO Performance Optimizations #1133

Uh oh!

coreyjadams commented Sep 30, 2025 •

edited

Loading

Uh oh!

Uh oh!

RishikeshRanade left a comment

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

peterdsharpe left a comment

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

coreyjadams commented Oct 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DoMINO Performance Optimizations #1133

DoMINO Performance Optimizations #1133

Uh oh!

Conversation

coreyjadams commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PhysicsNeMo Pull Request

Description

DataPipe and DataSet separation

DrivaerML CAE DataSet

Optimization of Volumetric Reads

Domino DataPipe

DoMINO Example ReOrg

Scaling Factors

Inference on STL

Other components

RMM by default

SDF is torch only now

utils/domino/utils.py is torch only

Checklist

Dependencies

Uh oh!

Uh oh!

RishikeshRanade left a comment

Choose a reason for hiding this comment

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

peterdsharpe left a comment

Choose a reason for hiding this comment

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

coreyjadams commented Oct 17, 2025

Uh oh!

coreyjadams commented Oct 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coreyjadams commented Sep 30, 2025 •

edited

Loading