@coreyjadams coreyjadams commented Sep 30, 2025

PhysicsNeMo Pull Request

This PR closes gaps in DoMINO performance. It includes some refactoring as well, with a little more to go. Opening this PR as a draft while tracking down some final pieces, which are listed below.

Description

This is a large PR. It covers a lot of pieces that essentially have to be brought in together. In no particular order, this PR includes:

DataPipe and DataSet separation

DrivaerML CAE DataSet

Note: renaming is in progress.

This PR separates IO from pre-processing in the DoMINO datapipe. There is now one IO interface, "cae_ml_dataset" (open to a name change here), that will generically read keys from a dictionary-like object. The dataset will infer, based on file properties, how to read the data:

  • .npy files are supported (pickled numpy dictionaries), though the entire file has to be read; the requested keys are then returned as torch CPU tensors.
  • .npz files can read the requested tensors directly.
  • .zarr files will be read with zarr 3.0, unless tensorstore is installed.
  • If tensorstore is installed, the zarr files will be read with tensorstore instead. It is significantly faster than plain zarr, with no extra CPU overhead on the Python interpreter. Previous methods with threading added overhead to the Python runtime, causing CPU-based stalls in model code.
  • If none of those matches is made, the dataset will check whether every directory passed contains only .stl, .vtp, .vtu, or .csv files. In that case, it will use pyvista to read them.
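The suffix-based dispatch described above can be sketched roughly as follows. This is a minimal illustration of the inference rules, not the actual implementation; the reader names (`_read_npz`, `_read_pyvista`, etc.) are hypothetical stand-ins.

```python
from pathlib import Path

# Hypothetical reader names, mirroring the dispatch rules described above.
READERS = {
    ".npy": "_read_pickled_numpy",  # whole file read, then keys returned
    ".npz": "_read_npz",            # per-key reads supported
    ".zarr": "_read_zarr",          # zarr 3.0, or tensorstore if installed
}
MESH_SUFFIXES = {".stl", ".vtp", ".vtu", ".csv"}

def infer_reader(path: Path) -> str:
    """Pick a reader based on file properties, per the rules above."""
    if path.suffix in READERS:
        return READERS[path.suffix]
    if path.is_dir():
        # Directories qualify only if every entry is a supported mesh file.
        suffixes = {p.suffix for p in path.iterdir()}
        if suffixes and suffixes <= MESH_SUFFIXES:
            return "_read_pyvista"
    raise ValueError(f"Unsupported data format: {path}")
```

The point of resolving the reader from file properties is that the calling code never has to declare the storage format; swapping .npz for .zarr on disk requires no code change.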

The dataset will, as part of its pipeline, pin tensors and move them to the GPU. Ideally, if we're pinning, we would preallocate and write directly to those buffers, but tensorstore does not support that. If the output device is the GPU, the data transfer will happen in a separate stream. Optionally, if this is a multi-stream application, the user can pass a consumer_stream (defaulting to torch.cuda.default_stream()) that will be used to ensure proper stream ordering.

The dataset object has the ability to preload indexes in Python threads (up to a configurable preload depth) to asynchronously load data from disk into torch tensors. Note that aggressive preloading will accumulate GPU memory usage if the dataset is outputting to the GPU (configured by setting the output device at dataset construction).

Construction of these datasets looks like this:

dataset = DrivaerMLDataset(
    data_dir=self.config.data_path,
    keys_to_read=self.keys_to_read,
    output_device=self.preproc_device,
    pin_memory=pin_memory,
    consumer_stream=torch.cuda.default_stream(),
)

Calling __getitem__ (aka dataset[i]) will fetch the i-th entry of the dataset. There is also a preload function that will manually queue up the target data. Further, if you iterate over the dataset (for i, batch in enumerate(dataset)), the preloading will happen automatically and transparently.
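The preload-on-iteration behavior can be sketched with a toy class. This illustrates the assumed semantics (thread-pool prefetch up to a depth, transparent during iteration, explicit via preload) and is not the actual CAE dataset implementation.

```python
from concurrent.futures import ThreadPoolExecutor

class PrefetchingDataset:
    """Toy sketch of the preload semantics described above (assumed
    behavior; the real dataset reads from disk into torch tensors)."""

    def __init__(self, items, preload_depth: int = 2):
        self._items = items
        self._depth = preload_depth
        self._pool = ThreadPoolExecutor(max_workers=preload_depth)
        self._pending = {}

    def _load(self, i):
        # Stand-in for the disk -> torch-tensor read.
        return self._items[i]

    def preload(self, i):
        # Manually queue up index i, as with the dataset's preload function.
        if i not in self._pending and 0 <= i < len(self._items):
            self._pending[i] = self._pool.submit(self._load, i)

    def __getitem__(self, i):
        future = self._pending.pop(i, None)
        return future.result() if future is not None else self._load(i)

    def __iter__(self):
        # Keep up to `preload_depth` reads in flight ahead of the consumer.
        for i in range(len(self._items)):
            for j in range(i, min(i + self._depth + 1, len(self._items))):
                self.preload(j)
            yield self[i]

    def __len__(self):
        return len(self._items)
```

The depth bound in `__iter__` is what caps memory growth: at most `preload_depth` extra entries are resident at once, which is why aggressive preloading to a GPU output device accumulates GPU memory.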

Optimization of Volumetric Reads

Because the volumetric data is so large per file, there is an optimization that can be enabled in the dataset: it can read contiguous sub-slices of the volumetric data, such that it reads only a small fraction from disk each iteration. This requires preprocessing: the dataset has to be shuffled on disk prior to training or inference. This is expected to be supported in curator in the next release.
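The index arithmetic behind contiguous sub-slice reads might look like the following. This is an illustrative sketch under assumed semantics (rotating which fraction of a pre-shuffled array is read each epoch); the real dataset performs the slicing at the storage layer.

```python
import numpy as np

def volume_subslice(n_points: int, n_chunks: int, epoch: int, idx: int) -> slice:
    """Hypothetical sketch: each read touches only 1/n_chunks of the
    (pre-shuffled) volumetric array, and the visible slice rotates with
    the epoch so all points are eventually seen."""
    chunk = (idx + epoch) % n_chunks
    start = (chunk * n_points) // n_chunks
    stop = ((chunk + 1) * n_points) // n_chunks
    return slice(start, stop)

data = np.arange(100)  # stands in for on-disk, pre-shuffled volume data
part = data[volume_subslice(len(data), n_chunks=4, epoch=0, idx=1)]
# Only 25 of 100 points are read for this iteration.
```

The reason the on-disk shuffle is required: a contiguous slice of unshuffled data would be spatially correlated, so the pre-shuffle is what makes a cheap contiguous read statistically equivalent to a random subsample.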

Domino DataPipe

The datapipe for DoMINO has been refactored entirely: the whole datapipe now operates on torch tensors. As of opening this PR, the datapipe can contain a reference to the cae_dataset object (though it doesn't, by default). If it does have the dataset object, you can iterate over the datapipe. This is the generally supported method for optimal training pipelines.

The datapipe itself looks very similar to the previous iteration, minus the IO pieces, but has a few additions:

  • The datapipe has transforms on the output fields built in. In inference mode, these are bypassed if no output target fields are supplied.
  • Because the datapipe does the normalization in one direction, it makes sense for it to do the normalization in the other direction too. The datapipe can now be used to un-normalize or un-standardize outputs, and it will pick the right path based on the datapipe configuration.
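The two-way transform idea amounts to exact inverse pairs. A minimal sketch, assuming the standard mean/std standardization and min/max normalization formulas (the actual function names in the datapipe may differ):

```python
import numpy as np

# Standardization and its inverse (assumed formulas).
def standardize(x, mean, std):
    return (x - mean) / std

def unstandardize(x, mean, std):
    return x * std + mean

# Min/max normalization and its inverse (assumed formulas).
def normalize(x, lo, hi):
    return (x - lo) / (hi - lo)

def unnormalize(x, lo, hi):
    return x * (hi - lo) + lo
```

Keeping both directions in the datapipe means the same configuration (which scheme, which scaling factors) drives both paths, so model outputs are guaranteed to be un-transformed consistently with how the inputs were transformed.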

All logic based on computing scaling factors is removed from the datapipe. See below.

The CachedDominoDataset is largely unchanged.

DoMINO Example ReOrg

Inside the src/ folder of the DoMINO training example, there are the following changes:

  • The loss functions from training have moved to loss.py but are otherwise unchanged. They could use a cleanup after the physics-loss additions.
  • A utils.py file has been added to contain script-specific logic, such as extracting the number of variables for the model from the config and a dataclass to hold scaling-factor info (see below).
  • The test.py script is updated but not optimized.
  • train_sharded.py is to be deprecated; its functionality will be supported directly in train.py.
  • The train.py script has been updated only in that pieces have been moved into other scripts and imported, to shorten the script and reduce its complexity.
  • A temporary script, benchmark_dataset.py, is present for development but needs to be removed before merge.
  • inference_on_stl.py is covered below.

Scaling Factors

Currently, in DoMINO, computing the scaling factors is done implicitly and somewhat slowly, since it uses the pipeline inefficiently. Instead, there is now a compute_scaling_factors function that uses the dataset (not datapipe) object, loops over the data reading only the necessary fields where possible, and returns the mean/std/min/max of each field read. By default, for DrivaerML, this will use not only the output fields but also the surface and volume coordinates. This can help determine bounding boxes on new datasets, etc.
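The accumulation a compute_scaling_factors-style loop performs can be sketched as a single streaming pass. This is an illustrative stand-in, not the actual function; the real one reads only the necessary fields through the dataset object.

```python
import numpy as np

def compute_field_statistics(batches):
    """Single-pass mean/std/min/max over an iterable of per-sample arrays.
    Sketch of the accumulation only; uses the textbook E[x^2] - E[x]^2
    identity for the variance."""
    count, total, total_sq = 0, 0.0, 0.0
    lo, hi = np.inf, -np.inf
    for batch in batches:
        b = np.asarray(batch, dtype=np.float64)
        count += b.size
        total += b.sum()
        total_sq += np.square(b).sum()
        lo = min(lo, float(b.min()))
        hi = max(hi, float(b.max()))
    mean = total / count
    std = np.sqrt(total_sq / count - mean**2)
    return {"mean": mean, "std": std, "min": lo, "max": hi}
```

Because the loop only ever holds one sample's fields in memory, this scales to datasets far larger than RAM, which is the point of routing it through the dataset rather than the full pipeline.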

To make the scaling factors easier to use, a new script, compute_statistics.py, has been added to the src/ dir in the examples. It is meant to be run on a single device, or even on the CPU if desired, and it saves the scaling factors into a pickled object that is reloaded in other scripts. I'm open to changes in this design, but the idea was to make the scale factors more stable and more portable, and to compute them stand-alone from the training script.
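The save-once, reload-everywhere workflow might look like the following. The container and its field names are hypothetical illustrations, not the actual dataclass in utils.py.

```python
import pickle
from dataclasses import dataclass
from pathlib import Path

# Hypothetical container; the actual dataclass in utils.py may differ.
@dataclass
class ScalingFactors:
    mean: float
    std: float
    min: float
    max: float

def save_factors(factors: ScalingFactors, path: Path) -> None:
    """Persist the factors once (e.g. from compute_statistics.py)."""
    path.write_bytes(pickle.dumps(factors))

def load_factors(path: Path) -> ScalingFactors:
    """Reload the same factors in train/test/inference scripts."""
    return pickle.loads(path.read_bytes())
```

Pickling a single small object makes the factors portable between scripts and stable across runs, since they are no longer recomputed as a side effect of training.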

Inference on STL

The entire inference on stl pipeline has been refactored and ported to use GPUs, reusing the dataset and datapipe from above. Performance is significantly better on GPU. Two pieces of the script are missing (called out below).

Other components

RMM by default

The RAPIDS Memory Manager (RMM) is now used by default to share memory between the kNN in the preprocessing step and the main model code. This improves performance (avoiding cudaFree and cudaMalloc) and also reduces overall memory usage: since the two phases are out of sync with each other, sharing memory makes sense. It can be opted out of only via an environment variable, now documented in the README.

The interface is quite simple; at the top of the file, if you want to use it, you do:

from physicsnemo.utils.memory import unified_gpu_memory

SDF is torch only now

The signed distance function now accepts only torch inputs and returns torch outputs. The output signature is fixed to the distance itself plus the closest point on the mesh (which is not usually a vertex). Tests have been updated, since they were slightly incorrect.

utils/domino/utils.py is torch only

The cupy/numpy backend of those functions is removed; they are torch-only now. The non-array functionality in utils for vtk reading has been moved to utils/domino/vtk_file_utils.py and is otherwise unchanged.

Missing pieces will be marked below as issues to resolve before merge, however, here's the condensed list as I see it:

  • Datapipe async preprocessing is to be removed from the datapipe and put only in the dataset. No streams in the datapipe, due to warp/torch issues. The set_indices functionality is to be moved to the dataset.
  • The dataset constructor should not be used in the datapipe; the dataset should instead be passed to the datapipe optionally. The create_domino_dataset function can do this transparently to keep it as a "drop-in" replacement.
  • The SDF function in the data preprocessing operates on unscaled data, which might later be scaled. This should be addressed but isn't quite P0.
  • ShardTensor integration is completely missing. It shouldn't be difficult to integrate but isn't done yet.
  • The SDF function shows up in other places, still. Some are reimplementations, some are using this. It needs to be checked if we can accommodate that easily. Done, except in re-implementations which are out of scope here.
  • The test.py script must be updated.
  • The retraining script must be updated. It has been deprecated.
  • The old versions of inference_on_stl.py and domino_datapipe.py must be replaced with the new versions.
  • Some sort of check of how this affects the cached versions of the datasets should happen.
  • I don't know how useful openfoam_datapipe.py is; if it's still in use, we should merge that logic into this and unify it. Deprecated.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Commits:
  • Separate the dataloading from the data processing in DoMINO datapipe.
  • … data set for IO. This is reaching IO throughputs of about 5 GB/s on ORD, so getting better.
  • … the kernel precision. The test had some expected numbers that, I believe, were incorrect.
  • Update domino_datapipe2 (temporary name).
  • … training script a little, simply by moving things around ...
  • …e) and fix a few details in the new pipeline. Use new pipeline in training script.
  • Some tweaks to enable the preprocess pipeline for inference.
@RishikeshRanade RishikeshRanade left a comment


LGTM

@RishikeshRanade RishikeshRanade marked this pull request as ready for review October 17, 2025 19:22
@coreyjadams

This PR is huge and triggered a bunch of reviewers. ChatGPT helped me script up a way to figure out who was tagged, per file, and why:

docs/img/domino/combined-training-curve.png -> @megnvidia, @ktangsali
docs/img/domino/drag-r2.jpg -> @megnvidia, @ktangsali
docs/img/domino/lift-r2.jpg -> @megnvidia, @ktangsali
docs/img/domino/surface-training-curve.png -> @megnvidia, @ktangsali
docs/img/domino_perf.png -> @megnvidia, @ktangsali
examples/cfd/external_aerodynamics/domino/README.md -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/requirements.txt -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/benchmark_dataloader.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/compute_statistics.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/conf/config.yaml -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/README.md -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/inference_on_stl.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/openfoam_datapipe.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/retraining.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/deprecated/train_sharded.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/inference_on_stl.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/loss.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/shuffle_volumetric_curator_output.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/test.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/train.py -> @RishikeshRanade
examples/cfd/external_aerodynamics/domino/src/utils.py -> @RishikeshRanade
physicsnemo/datapipes/cae/cae_dataset.py -> @RishikeshRanade, @coreyjadams, @peterdsharpe, @mnabian, @Alexey-Kamenev
physicsnemo/datapipes/cae/domino_datapipe.py -> @RishikeshRanade, @coreyjadams, @peterdsharpe, @mnabian, @Alexey-Kamenev
physicsnemo/datapipes/cae/domino_sharded_datapipe.py -> @RishikeshRanade, @coreyjadams, @peterdsharpe, @mnabian, @Alexey-Kamenev
physicsnemo/models/domino/encodings.py -> @RishikeshRanade
physicsnemo/models/domino/geometry_rep.py -> @RishikeshRanade
physicsnemo/models/domino/mlps.py -> @RishikeshRanade
physicsnemo/models/domino/model.py -> @RishikeshRanade
physicsnemo/models/domino/solutions.py -> @RishikeshRanade
physicsnemo/models/layers/__init__.py -> No owner
physicsnemo/models/layers/ball_query.py -> No owner
physicsnemo/models/layers/fourier_layers.py -> No owner
physicsnemo/models/layers/mlp_layers.py -> No owner
physicsnemo/utils/domino/utils.py -> @RishikeshRanade
physicsnemo/utils/domino/vtk_file_utils.py -> @RishikeshRanade
physicsnemo/utils/memory.py -> No owner
physicsnemo/utils/neighbors/radius_search/_torch_impl.py -> @coreyjadams, @peterdsharpe
test/datapipes/test_domino_datapipe.py -> No owner
test/distributed/shard_tensor/ops/test_radius_search.py -> No owner
test/models/data/domino_output-conv.pth -> No owner
test/models/data/domino_output-unet.pth -> No owner
test/models/data/domino_output.pth -> No owner
test/models/data/mlp_output.pth -> No owner
test/models/domino/__init__.py -> No owner
test/models/domino/conftest.py -> No owner
test/models/domino/test_domino.py -> No owner
test/models/domino/test_domino_encodings.py -> No owner
test/models/domino/test_domino_geometry_rep.py -> No owner
test/models/domino/test_domino_mlps.py -> No owner
test/models/domino/test_domino_solutions.py -> No owner
test/models/domino/utils.py -> No owner
test/models/test_mlp_layers.py -> No owner
test/utils/test_domino_utils.py -> No owner

Owner summary (files touched):
  @RishikeshRanade: 26
  @megnvidia: 5
  @ktangsali: 5
  @coreyjadams: 4
  @peterdsharpe: 4
  @mnabian: 3
  @Alexey-Kamenev: 3

I'm going to open a PR tweaking code owners that should clear out the image updates and remove @megnvidia. I'll make the datapipes directory more fine-grained to free up @mnabian and @Alexey-Kamenev.

That will leave one review, from @peterdsharpe, to look at:

physicsnemo/utils/neighbors/radius_search/_torch_impl.py -> @coreyjadams, @peterdsharpe

Peter, would you mind reviewing that file? Everything else has been vetted by me and @RishikeshRanade over the past few weeks.

@peterdsharpe peterdsharpe left a comment

Review completed for physicsnemo/utils/neighbors/radius_search/_torch_impl.py; LGTM

@coreyjadams

/blossom-ci

@coreyjadams

/blossom-ci

@coreyjadams coreyjadams removed the request for review from megnvidia October 17, 2025 23:11
@coreyjadams

/blossom-ci

@coreyjadams

/blossom-ci

@coreyjadams coreyjadams self-assigned this Oct 20, 2025
@coreyjadams coreyjadams merged commit aac1ce3 into NVIDIA:main Oct 20, 2025
1 check passed
@coreyjadams coreyjadams deleted the physicsnemo-domino-inference branch October 23, 2025 14:19