DoMINO Performance Optimizations #1133
Conversation
Separate the dataloading from the data processing in DoMINO datapipe.
… data set for IO. This is reaching IO throughputs of about 5GB/s on ORD, so getting better.
… the kernel precision. The test had some expected numbers that, I believe, were incorrect.
Update domino_datapipe2 (temporary name).
…training script a little, simply by moving things around ...
…e) and fix a few details in the new pipeline. Use new pipeline in training script
some tweaks to enable the preprocess pipeline for inference.
examples/cfd/external_aerodynamics/domino/src/inference_on_stl.py
LGTM
This PR is huge and triggered a bunch of reviewers. ChatGPT helped me script up a way to figure out who was tagged, per file, and why. I'm going to open a PR for tweaking code owners that should clear out the image updates and remove @megnvidia. I'll make the data pipes directory more fine grained to free up @mnabian and @Alexey-Kamenev. That will leave one review, from @peterdsharpe. Peter, would you mind reviewing that file? Everything else has been vetted by me and @RishikeshRanade over a few weeks now.
Review completed for physicsnemo/utils/neighbors/radius_search/_torch_impl.py; LGTM
/blossom-ci
/blossom-ci
/blossom-ci
/blossom-ci
PhysicsNeMo Pull Request
This PR is to close gaps in DoMINO performance. It includes some level of refactoring as well, with a little more to go. Opening this PR as a draft while tracking down some final pieces, which will be listed below.
Description
This is a large PR. It covers a lot of pieces which essentially have to be brought in together. In no particular order, this PR includes:
DataPipe and DataSet separation
DrivaerML CAE DataSet (Note: renaming is in progress.)
This PR includes a separation of IO from pre-processing in the DoMINO datapipe. There is now one IO interface, "cae_ml_dataset" (open to a name change here), that will generically read keys from a dictionary-like object. The dataset infers, based on file properties, how to read the data:
- `.npy` files are supported (pickled numpy dictionaries), though the entire file has to be read; the requested keys are then returned as torch CPU tensors.
- `.npz` files can read just the requested tensors directly.
- `.zarr` files are read with zarr 3.0, unless `tensorstore` is installed.

The dataset will, as part of its pipeline, pin tensors and move them to the GPU. Ideally, if we're pinning, we would preallocate and write directly into those buffers, but tensorstore does not support that. If the output device is the GPU, the data transfer happens on a separate stream. Optionally, if this is a multi-stream application, the user can pass a `consumer_stream` (defaults to `torch.cuda.default_stream()`) that will be used to ensure proper stream ordering.

The dataset object can preload indexes in python threads (up to a configurable preload depth) to asynchronously load data from disk into torch tensors. Note that aggressive preloading will accumulate GPU memory usage if the dataset is outputting to the GPU (configured by setting the output device at dataset construction).
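As a rough illustration of the pinned-memory, side-stream transfer pattern described above (this is a sketch, not the PR's implementation; the function name and arguments are mine):

```python
# Illustrative sketch of an async host-to-device copy with stream ordering.
import torch

def to_device_async(cpu_tensor: torch.Tensor,
                    device: torch.device,
                    transfer_stream: torch.cuda.Stream,
                    consumer_stream: torch.cuda.Stream) -> torch.Tensor:
    """Copy a CPU tensor to the GPU on a dedicated stream, then make the
    consumer stream wait on that copy so downstream kernels are ordered."""
    pinned = cpu_tensor.pin_memory()          # staging buffer for the async H2D copy
    with torch.cuda.stream(transfer_stream):
        gpu_tensor = pinned.to(device, non_blocking=True)
    # Ensure work queued later on the consumer stream sees the finished copy.
    consumer_stream.wait_stream(transfer_stream)
    # Tell the caching allocator the tensor is also used on the consumer stream.
    gpu_tensor.record_stream(consumer_stream)
    return gpu_tensor
```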
Construction of these datasets looks like this:
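The original snippet was not captured here; as a stand-in, a hypothetical construction sketch follows. The factory name, arguments, and field keys are illustrative only, so check the PR for the exact API:

```python
import torch

# Illustrative only -- names do not necessarily match the final API.
dataset = create_cae_dataset(
    data_dir="/data/drivaer_ml/train",         # directory of .npy/.npz/.zarr files
    keys=["surface_fields", "volume_fields"],  # which entries to read per example
    output_device=torch.device("cuda:0"),      # pin + copy to GPU inside the dataset
    preload_depth=2,                           # how many indices to prefetch in threads
)

# Indexing fetches a single example; iterating preloads upcoming indices
# automatically and transparently.
sample = dataset[0]
dataset.preload(1)                             # manually queue up the next index
for i, batch in enumerate(dataset):
    ...
```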
Calling `__getitem__` (aka `dataset[i]`) will fetch the i-th entry of the dataset. There is also a `preload` function that manually queues up the target data. Further, if you iterate over the dataset (`for i, batch in enumerate(dataset)`), the preloading happens automatically and transparently.

Optimization of Volumetric Reads
Because the per-file volumetric data is so large, there is an optimization that can be enabled in the dataset: it can read contiguous sub-slices of the volumetric data, so that each iteration reads only a small fraction from disk. This requires preprocessing: the dataset has to be shuffled on disk prior to training or inference. This is expected to be supported in curator in the next release.
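To make the idea concrete, here is a minimal sketch of a contiguous sub-slice read from a zarr store (paths and field names are placeholders, not the PR's code):

```python
# Read only a fraction of a (pre-shuffled) volumetric array per iteration,
# rather than the whole field.
import numpy as np
import zarr

store = zarr.open("example_volume.zarr", mode="r")   # illustrative path
volume = store["volume_fields"]                      # e.g. shape (n_points, n_vars)

fraction = 0.1                                       # read ~10% of the points per step
n_read = int(volume.shape[0] * fraction)
start = np.random.randint(0, volume.shape[0] - n_read)

# Because the data was shuffled on disk ahead of time, a contiguous slice is
# statistically equivalent to a random subsample, but far cheaper to read.
chunk = volume[start:start + n_read]
```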
Domino DataPipe
The datapipe for DoMINO has been refactored entirely: the whole datapipe operates on torch tensors. As of opening this PR, the datapipe can contain a reference to the `cae_dataset` object (though it doesn't, by default). If it does have the dataset object, you can iterate over the datapipe; this is the generally supported method for optimal training pipelines.

The datapipe itself looks very similar to the previous iteration, minus the IO pieces, but has a few additions:
All logic based on computing scaling factors is removed from the datapipe. See below.
The CachedDominoDataset is largely unchanged.
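As a rough illustration of the dataset-in-datapipe pattern described above (the constructor arguments here are placeholders, not the PR's exact signature):

```python
# Placeholder sketch: the datapipe holds a reference to the dataset and becomes
# iterable, chaining IO, preloading, and GPU preprocessing per batch.
datapipe = DoMINODataPipe(
    dataset=dataset,   # the CAE dataset constructed earlier
    # ...preprocessing options (sampling, normalization, bounding boxes, ...)
)

for batch in datapipe:
    preds = model(batch)
```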
DoMINO Example ReOrg
Inside the src/ folder of the DoMINO training example, there are the following changes:
- Losses have been moved to `loss.py` but are otherwise unchanged. They could use a clean up after the physics loss additions.
- A `utils.py` file has been added to contain script-specific logic, like extracting the number of variables for the model from the config, and a `dataclass` to contain scaling factor info (see below).
- `train.py`
- `benchmark_dataset.py` is present for development but needs to be removed before merge.

Scaling Factors
Currently, in DoMINO, computing the scaling factors is done implicitly and somewhat slowly, since it uses the pipeline inefficiently. Instead, there is a `compute_scaling_factors` function that will use the dataset (not datapipe) object, loop over the data reading only the necessary fields if possible, and return the mean/std/min/max of each field read. By default, for DrivaerML, this will use not only the output fields but also the surface and volume coordinates. This can help determine bounding boxes on new datasets, etc.
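A hedged sketch of how that function might be driven; the exact signature, field names, and return layout are assumptions:

```python
# Assumed signature: compute_scaling_factors(dataset, fields=...) returning
# per-field statistics; adjust to the actual function in this PR.
factors = compute_scaling_factors(
    dataset,
    fields=["surface_fields", "volume_fields",
            "surface_coordinates", "volume_coordinates"],
)

# Per-field mean/std/min/max, e.g. for normalization or bounding-box estimation.
vol = factors["volume_fields"]
print(vol.mean, vol.std, vol.min, vol.max)
```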
To facilitate scaling factor usage, a new script, `compute_statistics.py`, has been added to the `src/` dir in the examples. It is meant to be run on a single device (or even on the CPU, if desired), and it saves the scaling factors into a pickled object that is reloaded in other scripts. I'm open to changes in this design, but the idea was to make the scale factors more stable, more portable, and computed stand-alone from the training script.

Inference on STL
The entire inference-on-STL pipeline has been refactored and ported to use GPUs, reusing the dataset and datapipe from above. Performance is significantly better on GPU. Two pieces of the script are missing (called out below).
Other components
RMM by default
The RAPIDS Memory Manager (RMM) is now used by default to share memory between the kNN in the preprocessing step and the main model code. This improves performance (avoiding cudaFree and cudaMalloc calls) and also reduces overall memory usage - since the two phases are out of sync with each other, sharing memory makes sense. It can be opted out of only via an environment variable, now documented in the README.
The interface is quite simple; at the top of the file, if you want to use it, you do:
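The snippet that this refers to was not captured above. As a stand-in, here is the standard RMM-to-PyTorch pooling setup, which may differ in detail from the helper used in this PR:

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Create a pooled device allocator and route torch's CUDA allocations through it,
# so the kNN preprocessing and the model share one memory pool.
rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```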
SDF is torch only now
The signed distance function now accepts only torch inputs and returns torch outputs. The output signature is fixed to the distance itself and the closest point on the mesh (which is not usually a vertex). Tests have been updated, since they were slightly incorrect.
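A usage sketch under the assumption that the utility keeps roughly its existing name and argument order; the module path and signature here are not confirmed against the PR:

```python
import torch
from physicsnemo.utils.sdf import signed_distance_field  # assumed module path

mesh_vertices = torch.rand(300, 3, device="cuda")   # illustrative triangle soup
mesh_indices = torch.arange(300, device="cuda")     # flattened triangle vertex indices
query_points = torch.rand(1000, 3, device="cuda")

# Per the PR description, the output is the distance plus the closest point on
# the mesh for each query point.
sdf, closest_points = signed_distance_field(mesh_vertices, mesh_indices, query_points)
```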
utils/domino/utils.py is torch only
The cupy/numpy backend of those functions has been removed; it is torch-only now. The non-array functionality in utils for vtk reading has been moved to `utils/domino/vtk_file_utils.py` and is otherwise unchanged.

Missing pieces will be marked below as issues to resolve before merge; however, here's the condensed list as I see it:
- `set_indices` functionality is to be moved to the dataset. The `create_domino_dataset` function can do this transparently, to keep it as a "drop in" replacement.
- The SDF function still shows up in other places. Some are reimplementations, some use this one. It needs to be checked whether we can accommodate that easily. Done, except in re-implementations, which are out of scope here.
- The retraining script must be updated. It's been deprecated.
- `inference_on_stl.py` and `domino_datapipe.py` must be replaced with the new versions.
- I don't know how useful the deprecated `openfoam_datapipe.py` is - if it's still in use, we should merge that logic into this and unify it.

Checklist
Dependencies