Merged

125 commits
e559c46
Relax cuml constraints
coreyjadams Aug 25, 2025
e38ecdf
Port sdf function to use only torch inputs. No changes to tests yet.
coreyjadams Aug 25, 2025
dd7b3cf
Porting some domino utils function to pure torch interface
coreyjadams Aug 25, 2025
7888f29
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Aug 25, 2025
8590afd
Add new dataset to read DrivaerML like data in various formats.
coreyjadams Aug 25, 2025
caf0290
Adding a torch-centric domino datapipe and a separated, data-agnostic…
coreyjadams Aug 27, 2025
7fb5f8e
Rename datapipe file to dataset.
coreyjadams Aug 27, 2025
0fb0ed2
Update SDF function and test. Auto convert higher precisions to matc…
coreyjadams Aug 27, 2025
60c3535
Add IO benchmark
coreyjadams Aug 27, 2025
0c668d9
Minor bug fixes
coreyjadams Aug 27, 2025
70e6130
Few bug fixes
coreyjadams Aug 27, 2025
b99dcfe
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Aug 27, 2025
4c26ae1
A few more fixes for domino.
coreyjadams Aug 27, 2025
45100ef
Fix pre-commit issues
coreyjadams Aug 27, 2025
1301293
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Aug 27, 2025
9489732
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 2, 2025
675c546
Port domino utils from cupy/numpy to pure torch.
coreyjadams Sep 3, 2025
4578975
update training script for new datapipe
coreyjadams Sep 3, 2025
8f22a6c
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 4, 2025
9a5d8ed
Add ability to pin memory, optionally.
coreyjadams Sep 4, 2025
c57f985
Snapshot updates of cleanups and minor fixes
coreyjadams Sep 5, 2025
af5d7c9
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 5, 2025
02b03a0
Most datapipe tests passing. Add compute_statistics script. Clean u…
coreyjadams Sep 5, 2025
ff185b3
Update tests for the new pipeline (mostly fix indexing from batch siz…
coreyjadams Sep 8, 2025
296a58b
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 8, 2025
aceb1f1
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 8, 2025
e51bf4c
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 9, 2025
c7c94cb
add a utility to sample on a mesh with torch.
coreyjadams Sep 10, 2025
471aae9
Add revamped inference script. Doesn't write outputs yet.
coreyjadams Sep 10, 2025
3864830
Update inference script for Domino STL inference.
coreyjadams Sep 10, 2025
0635e4d
Minor tweaks to the inference script.
coreyjadams Sep 10, 2025
db8cc98
Mark the docstring for updating.
coreyjadams Sep 10, 2025
2a190eb
Spin off the stl sampling and inference loop into its own function,
coreyjadams Sep 10, 2025
6393b56
Ensure stl mesh itself gets processed too
coreyjadams Sep 10, 2025
f7e9ea2
Update docstring for inference file.
coreyjadams Sep 10, 2025
3134042
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 10, 2025
d784a80
Enable shard tensor for zarr datasets, both with or without tensorstore
coreyjadams Sep 15, 2025
7c27a8e
Updating and further documenting scripts
coreyjadams Sep 15, 2025
7f01ddc
Remove bug in sdf fake function
coreyjadams Sep 15, 2025
f0a1247
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 17, 2025
1d03ab7
Restructure datapipe to make the logical flow simpler and clearer.
coreyjadams Sep 17, 2025
b7b7a65
Ensure RMM is actually used...
coreyjadams Sep 17, 2025
c5e1db8
Add sharded implementations of both kNN and SDF, as well as tests for…
coreyjadams Sep 17, 2025
ee0c728
First domino refactor: consolidate all MLP implementations,
coreyjadams Sep 19, 2025
611dce4
Refactor the encodings stage of domino to standalone nn.Modules
coreyjadams Sep 19, 2025
4038ff3
Further refactor DoMINO to put solution calculations
coreyjadams Sep 19, 2025
5539c54
Refactor domino model and add significant test suite expansion.
coreyjadams Sep 22, 2025
260c240
Move geometry rep codes to a separate file for model simplicity too.
coreyjadams Sep 22, 2025
b3995c9
Merge branch 'main' into domino-refactor
coreyjadams Sep 22, 2025
5732199
This commit purges some code that was moved into another commit.
coreyjadams Sep 22, 2025
b2b10ad
Missed a piece of moved code.
coreyjadams Sep 22, 2025
d412e15
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Sep 22, 2025
4e87ece
Merge branch 'NVIDIA:main' into domino-refactor
coreyjadams Sep 22, 2025
8a34242
move sdf, knn, and radius_search torch interface and stream fixes to …
coreyjadams Sep 22, 2025
3c0a551
Move sdf test changes to a different PR
coreyjadams Sep 22, 2025
5f9c777
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Sep 22, 2025
737201f
Move minor model changes to the model refactor.
coreyjadams Sep 22, 2025
378218f
Fix minor errors in the datapipe
coreyjadams Sep 23, 2025
b0bd877
Move several components of the recipe to the deprecation bin.
coreyjadams Sep 23, 2025
614710e
Move and rename inference scripts
coreyjadams Sep 23, 2025
f172ce6
Update train, inference, and config files.
coreyjadams Sep 23, 2025
cdbe0ce
Update scaling factor configuration and location setting
coreyjadams Sep 24, 2025
fc5d32a
Make sure surface grid and sdf calculation always happens.
coreyjadams Sep 24, 2025
3f4f110
Update timing printouts for training.
coreyjadams Sep 24, 2025
9aec327
Merge branch 'NVIDIA:main' into domino-refactor
coreyjadams Sep 24, 2025
b0b4982
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Sep 24, 2025
1f423fd
Merge branch 'NVIDIA:main' into domino-refactor
coreyjadams Sep 26, 2025
2e3c696
Fix bug in output encoding when the number of upstream radii is diffe…
coreyjadams Sep 26, 2025
c791aab
Merge branch 'main' into domino-refactor
coreyjadams Sep 26, 2025
e062f49
Update CHANGELOG
coreyjadams Sep 22, 2025
6a26c95
Update changelog
coreyjadams Sep 29, 2025
10bdc95
resolving bug and optimizing GeoConvOut for memory
RishikeshRanade Sep 29, 2025
ffedfaa
Resolve most of the feedback from PR review.
coreyjadams Sep 29, 2025
a87f666
Align new datapipe with Rishi's
coreyjadams Sep 29, 2025
1c191b5
Use ones_like to create a tensor
coreyjadams Sep 29, 2025
e7032a2
Merge pull request #4 from coreyjadams/domino-refactor-rr
coreyjadams Sep 29, 2025
4310e52
Move old script to new location
coreyjadams Sep 29, 2025
1ac0044
Merge branch 'domino-refactor' into physicsnemo-domino-inference
coreyjadams Sep 29, 2025
cd64439
Update some tests to match the new datapipe structure
coreyjadams Sep 30, 2025
4b1a3fd
Fix dataloading error, and remove old datapipe
coreyjadams Sep 30, 2025
b996417
Remove printouts.
coreyjadams Sep 30, 2025
f7aab12
Add unified gpu memory interface that correctly places memory pools
coreyjadams Sep 30, 2025
e36bf9b
Fix indexing error in the dataset that was leading to GPU memory leak…
coreyjadams Oct 1, 2025
5240b33
fix in scaling factors calculation
RishikeshRanade Oct 1, 2025
fec26d5
small fixes in datapipe and model
RishikeshRanade Oct 2, 2025
073a3f9
Fix factor calculations
coreyjadams Oct 7, 2025
594b9ed
Enable sliced reading of volumetric data.
coreyjadams Oct 7, 2025
4c67de6
Update scaling factor calculation and loading ... much simpler now.
coreyjadams Oct 7, 2025
316dfe6
Fix volume encoding calculation. Make sure surface grid is normalized
coreyjadams Oct 8, 2025
cc9a566
fixing bugs and refactoring test
RishikeshRanade Oct 8, 2025
961d4ba
remove print command
RishikeshRanade Oct 8, 2025
14be02f
fixing issues in test
RishikeshRanade Oct 8, 2025
eb62dce
fixing errors in test.py
RishikeshRanade Oct 8, 2025
76bd6b4
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Oct 8, 2025
bac5365
Update volumetric sub sampling so that it is more robust when not rea…
coreyjadams Oct 8, 2025
2ceefc7
Make sure differentiable loss tensors are detached before transfer to…
coreyjadams Oct 8, 2025
d955d87
remove printouts.
coreyjadams Oct 8, 2025
d05e653
Increase data reading size, for sub-sampling.
coreyjadams Oct 9, 2025
06ca085
Add more tests to the datapipe for domino
coreyjadams Oct 10, 2025
8a91a18
Rename DrivaerMLDataset to CAE Dataset.
coreyjadams Oct 10, 2025
e151de0
Add metrics to printouts and tbfile.
coreyjadams Oct 10, 2025
6bdbe85
Merge branch 'NVIDIA:main' into physicsnemo-domino-inference
coreyjadams Oct 13, 2025
6b2e8d9
cleaning up test and datapipe
RishikeshRanade Oct 10, 2025
0b721fc
benchmarked code for accuracy, set configs, scaling factor calculatio…
RishikeshRanade Oct 14, 2025
992c087
fixing merge issue in datapipe
RishikeshRanade Oct 14, 2025
7675b14
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Oct 14, 2025
34997ed
Update readme to include shuffling and performance notes.
coreyjadams Oct 14, 2025
97e4354
Add domino perf plot
coreyjadams Oct 14, 2025
8fc358d
Update PR based on coderabbit review.
coreyjadams Oct 15, 2025
763d978
Remove error that breaks validation / inference.
coreyjadams Oct 16, 2025
0ea5f99
Update Domino model and tests: make sure pre-commit passes, remove un…
coreyjadams Oct 16, 2025
f4b54d4
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Oct 17, 2025
e91f263
Hopefully fix inference script
coreyjadams Oct 17, 2025
e9dbac9
fixes to scaling and adding configs
RishikeshRanade Oct 17, 2025
6706558
Update README.md for domino
coreyjadams Oct 17, 2025
d8a4901
Remove unneeded plots.
coreyjadams Oct 17, 2025
01a0c15
update r2
coreyjadams Oct 17, 2025
edae436
Fix ruff issues
coreyjadams Oct 17, 2025
29de3bf
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Oct 17, 2025
6c04d88
test fix.
coreyjadams Oct 17, 2025
bd74ce3
Fix dict item with normalization off.
coreyjadams Oct 17, 2025
9023575
Change codeowners order, exclusion goes last.
coreyjadams Oct 17, 2025
b24b0d4
Undo file order so it can get fixed elsewhere.
coreyjadams Oct 17, 2025
e82d0ce
Merge branch 'main' into physicsnemo-domino-inference
coreyjadams Oct 17, 2025
b8db738
Update docstring tests.
coreyjadams Oct 20, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -34,6 +34,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Refactored DiTBlock to be more modular
- Added NATTEN 2D neighborhood attention backend for DiTBlock
- Migrated blood flow example to PyTorch Geometric.
- Refactored DoMINO model code and examples for performance optimizations and improved readability.
- Migrated HydroGraphNet example to PyTorch Geometric.
- Support for saving and loading nested `physicsnemo.Module`s. It is now
possible to create nested modules with `m = Module(submodule, ...)`, and save
Binary file added docs/img/domino/combined-training-curve.png
Binary file added docs/img/domino/drag-r2.jpg
Binary file added docs/img/domino/lift-r2.jpg
Binary file added docs/img/domino/surface-training-curve.png
Binary file added docs/img/domino_perf.png
156 changes: 148 additions & 8 deletions examples/cfd/external_aerodynamics/domino/README.md
@@ -77,19 +77,24 @@ please refer to their [paper](https://arxiv.org/pdf/2408.11969).

#### Data Preprocessing

`PhysicsNeMo` has a related project to help with data processing, called
[PhysicsNeMo-Curator](https://github.com/NVIDIA/physicsnemo-curator).
Using `PhysicsNeMo-Curator`, the data needed to train a DoMINO model can be set up easily.
Please refer to
[these instructions on getting started](https://github.com/NVIDIA/physicsnemo-curator?tab=readme-ov-file#what-is-physicsnemo-curator)
with `PhysicsNeMo-Curator`.

Download the DrivAer ML dataset using the
[provided instructions in PhysicsNeMo-Curator](https://github.com/NVIDIA/physicsnemo-curator/blob/main/examples/external_aerodynamics/domino/README.md#download-drivaerml-dataset).
The first step in running the DoMINO pipeline is processing the raw data
(vtp, vtu, and stl) into either Zarr or NumPy format for training.
Each raw simulation is downloaded as `vtp`, `vtu`, and `stl` files.
For instructions on running data processing to produce a DoMINO training ready dataset,
please refer to
[How-to Curate data for DoMINO Model](https://github.com/NVIDIA/physicsnemo-curator/blob/main/examples/external_aerodynamics/domino/README.md).

Caching is implemented in
[`CachedDoMINODataset`](https://github.com/NVIDIA/physicsnemo/blob/main/physicsnemo/datapipes/cae/domino_datapipe.py#L1250).
Optionally, users can run `cache_data.py` to save the outputs
of the DoMINO datapipe as `.npy` files. The DoMINO datapipe is set up to calculate
Signed Distance Field and Nearest Neighbor interpolations on-the-fly during
@@ -101,6 +106,36 @@ processed files.
The final processed dataset should be divided and saved into 2 directories,
for training and validation.
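
For illustration, a prepared dataset might be organized as follows (the
directory and file names here are hypothetical):

```text
processed_data/
├── train/
│   ├── run_001.zarr
│   └── ...
└── validation/
    ├── run_401.zarr
    └── ...
```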

#### Data Scaling Factors

DoMINO has several data-specific configuration tools that rely on some
knowledge of the dataset:

- The output fields (the labels) are normalized during training to a mean
of zero and a standard deviation of one, averaged over the dataset.
The scaling is controlled by passing the `volume_factors` and
`surface_factors` values to the datapipe.
- The input locations are scaled by, and optionally cropped to, user-defined
bounding boxes for both surface and volume. Whether cropping occurs
is controlled by the `sample_in_bbox` argument of the datapipe. Normalization
to the bounding box is enabled with `normalize_coordinates`. By default,
both are set to true. The boxes are configured in the
`config.yaml` file, separately for surface and volume, as sketched below.

> Note: The datapipe module has a helper function `create_domino_dataset`
> with sensible defaults to help create a DoMINO datapipe.
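
For reference, the bounding boxes live under the `data` section of
`config.yaml`. A minimal sketch (the values here are illustrative for a
DrivAerML-like car dataset, not recommendations):

```yaml
# Illustrative values only - derive real numbers for your dataset,
# e.g. with compute_statistics.py (described below).
data:
  bounding_box:          # volume bounding box
    min: [-3.5, -2.25, -0.32]
    max: [8.5, 2.25, 3.0]
  bounding_box_surface:  # surface bounding box
    min: [-1.1, -1.2, -0.32]
    max: [4.5, 1.2, 1.2]
```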

To facilitate setting reasonable values for these, you can use the
`compute_statistics.py` script. This will load the core dataset as defined
in your `config.yaml` file, loop over a number of examples (200, by default),
and both print and store the surface/volume field statistics as well as the
coordinate statistics.

> Note that, for volumetric fields especially, the min/max found may be
> significantly outside the surface region. Many simulations extend volumetric
> sampling to the far field, and you may instead want to crop the volume to a
> much tighter bounding box around the geometry.
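
A minimal sketch of its usage; the script reads the dataset definition from
the Hydra config, so typically no arguments are needed:

```bash
# Prints and stores field/coordinate statistics for the dataset
# configured in conf/config.yaml.
python compute_statistics.py
```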

#### Training

Specify the training and validation data paths, bounding box sizes etc. in the
@@ -176,9 +211,6 @@ The `domain_size` represents the number of GPUs used for each batch - setting
but with extra overhead. `shard_grid` and `shard_points` will enable domain
parallelism over the latent space and input/output points, respectively.

As one last note regarding domain-parallel training: in the phase of DoMINO
where the output solutions are calculated, the model can use two different
techniques (numerically identical) to calculate the output. Due to the
@@ -189,6 +221,114 @@ launch overhead at the cost of more memory use. For non-sharded
training, the `two-loop` setting is the more efficient choice. The difference
between `one-loop` and `two-loop` is purely computational, not algorithmic.
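
As a sketch, a domain-parallel run might combine these options via Hydra
overrides (the option names follow this example's `config.yaml`; treat the
exact values as illustrative):

```bash
# 8 GPUs total: 4 data-parallel replicas, each sharded across 2 GPUs.
torchrun --nproc_per_node=8 train.py \
    domain_parallelism.domain_size=2 \
    domain_parallelism.shard_grid=True \
    domain_parallelism.shard_points=True \
    model.solution_calculation_mode=one-loop
```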

### Performance Optimizations

The training and inference scripts for DoMINO contain several performance
enhancements to accelerate the training and usage of the model. In this
section we'll highlight several of them, as well as how to customize them
if needed.

#### Memory Pool Optimizations

The preprocessor of DoMINO requires a k-nearest-neighbors (kNN) computation,
which is accelerated via the `cuml` NearestNeighbors tool. By default, `cuml` and
`torch` both use memory allocation pools to speed up allocating tensors, but
they do not use the same pool. This means that during preprocessing, it's
possible for the kNN operation to spend a significant amount of time in
memory allocations; further, it limits the memory available to `torch`.

To mitigate this, by default in DoMINO we use the Rapids Memory Manager
([`rmm`](https://github.com/rapidsai/rmm)). If, for some reason, you wish
to disable this, you can do so with an environment variable:

```bash
export PHYSICSNEMO_DISABLE_RMM=True
```

Or remove this line from the training script:

```python
from physicsnemo.utils.memory import unified_gpu_memory
```

> Note - why not make it configurable? We have to set up the shared memory
> pool allocation very early in the program, before the config has even
> been read. So, it is enabled by default and the opt-out path is via the
> environment.

#### Reduced Volume Reads

The dataset size for volumetric data can be quite substantial - DrivAerML, for
example, has mesh sizes of 160M points per example. Even though the models
do not process all 160M points, they must all be read from disk in order to
downsample dynamically - which can exceed disk bandwidth and CPU decoding
capacity on nodes with multiple GPUs.

As a performance enhancement, DoMINO's data pipeline offers a mitigation: instead
of reading an entire volumetric mesh, during preprocessing we _shuffle_ the
volumetric inputs and outputs (in tandem), and subsequent reads choose random
slices of the volumetric data. By default, DoMINO will read about 100x more data
than necessary for the sampling size. This allows the pipeline to still apply
bounding-box cuts and further random sampling to improve
training stability. To enable or disable this behavior, set
`data.volume_sample_from_disk=True` or `False`.

> Note - if your volumetric data is not larger than a few million mesh points,
> pre-shuffling and sampling from disk is likely not necessary for you.

`physicsnemo-curator` supports shuffling the volumetric data during preprocessing.
If, however, you've already preprocessed your data and just want to apply
shuffling, use the script at `src/shuffle_volumetric_curator_output.py`,
invoked as sketched below.

The shuffling script will also apply sharding to the output files, which
improves IO performance; as a result, `zarr>=3.0` is required to read its
outputs. `src/shuffle_volumetric_curator_output.py` is meant to be an example of how
to apply shuffling, so modify and update it as needed for your dataset.
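
For example (the flags here are hypothetical; check the script's own
interface before running):

```bash
# Hypothetical paths and flags; adapt to your dataset layout.
python src/shuffle_volumetric_curator_output.py \
    --input-dir /data/drivaerml/volume_zarr \
    --output-dir /data/drivaerml/volume_zarr_shuffled
```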

> If you have tensorstore installed (it's in `requirements.txt`), the data reader
> will work equally well with Zarr 2 or Zarr 3 files.

#### Overall Performance

DoMINO is a computationally complex and challenging workload. Over the course
of several releases, we have chipped away at performance bottlenecks to speed
up the training and inference time (with `inference_on_stl.py`). Overall
training time has decreased from about 5 days to just over 4 hours on
eight H100 GPUs. We hope these optimizations enable you to explore more
parameters and surrogate models; if you see a performance issue,
please open an issue on GitHub.

![Results from DoMINO for RTWT SC demo](../../../../docs/img/domino_perf.png)

### Example Training Results

To provide an example of what a successful training run should look like, we
include some example results here. Training curves may look similar to this:

![Combined Training Curve](../../../../docs/img/domino/combined-training-curve.png)

And, when evaluating the results on the validation dataset, this particular
run had the following L2 and R2 metrics:

| Metric | Surface Only | Combined |
|--------------------:|:------------:|:--------:|
| X Velocity | N/A | 0.086 |
| Y Velocity | N/A | 0.185 |
| Z Velocity | N/A | 0.197 |
| Volumetric Pressure | N/A | 0.106 |
| Turb. V | N/A | 0.134 |
| Surface Pressure | 0.101 | 0.105 |
| X-Tau (Shear) | 0.138 | 0.145 |
| Y-Tau (Shear) | 0.174 | 0.185 |
| Z-Tau (Shear) | 0.198 | 0.207 |
| Drag R2 | 0.983 | 0.975 |
| Lift R2 | 0.971 | 0.968 |

With the PhysicsNeMo CFD tool, you can create plots of the lift and drag
forces computed by DoMINO vs. the CFD solver. For example, here is the drag force:

![Drag Force R^2](../../../../docs/img/domino/drag-r2.jpg)

### Training with Physics Losses

DoMINO supports enforcing PDE residuals as soft constraints. This can be used
@@ -3,3 +3,4 @@ warp-lang
tensorboard
cuml
einops
tensorstore
@@ -0,0 +1,184 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This code defines a distributed pipeline for training the DoMINO model on
CFD datasets. It includes the computation of scaling factors, instantiating
the DoMINO model and datapipe, automatically loading the most recent checkpoint,
training the model in parallel using DistributedDataParallel across multiple
GPUs, calculating the loss and updating model parameters using mixed precision.
This is a common recipe that enables training of combined models for surface and
volume as well either of them separately. Validation is also conducted every epoch,
where predictions are compared against ground truth values. The code logs training
and validation metrics to TensorBoard. The train tab in config.yaml can be used to
specify batch size, number of epochs and other training parameters.
"""

import time

import hydra
from omegaconf import DictConfig, OmegaConf

# This will set up the cupy-ecosystem and pytorch to share memory pools
from physicsnemo.utils.memory import unified_gpu_memory  # noqa: F401

from torch.utils.data.distributed import DistributedSampler

from physicsnemo.distributed import DistributedManager
from physicsnemo.launch.logging import PythonLogger, RankZeroLoggingWrapper
from physicsnemo.datapipes.cae.domino_datapipe import create_domino_dataset
from physicsnemo.utils.profiling import Profiler

# This is included for GPU memory tracking:
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

from utils import (
    get_keys_to_read,
    coordinate_distributed_environment,
    load_scaling_factors,
)

def benchmark_io_epoch(
    dataloader,
    logger,
    gpu_handle,
    epoch_index,
    device,
):
    # If you tell the dataloader the indices in advance, it will preload
    # and pre-preprocess data
    # dataloader.set_indices(indices)

    gpu_start_info = nvmlDeviceGetMemoryInfo(gpu_handle)
    start_time = time.perf_counter()
    for i_batch, sample_batched in enumerate(dataloader):
        # Gather data and report
        elapsed_time = time.perf_counter() - start_time
        start_time = time.perf_counter()
        gpu_end_info = nvmlDeviceGetMemoryInfo(gpu_handle)
        gpu_memory_used = gpu_end_info.used / (1024**3)
        gpu_memory_delta = (gpu_end_info.used - gpu_start_info.used) / (1024**3)

        logging_string = f"Device {device}, batch processed: {i_batch + 1}\n"
        logging_string += f"  GPU memory used: {gpu_memory_used:.3f} Gb\n"
        logging_string += f"  GPU memory delta: {gpu_memory_delta:.3f} Gb\n"
        logging_string += f"  Time taken: {elapsed_time:.2f} seconds\n"
        logger.info(logging_string)
        gpu_start_info = nvmlDeviceGetMemoryInfo(gpu_handle)

    return


@hydra.main(version_base="1.3", config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # initialize distributed manager
    DistributedManager.initialize()
    dist = DistributedManager()

    # Initialize NVML
    nvmlInit()

    gpu_handle = nvmlDeviceGetHandleByIndex(dist.device.index)

    model_type = cfg.model.model_type

    logger = PythonLogger("Train")
    logger = RankZeroLoggingWrapper(logger, dist)

    logger.info(f"Config summary:\n{OmegaConf.to_yaml(cfg, sort_keys=True)}")

    ################################
    # Get scaling factors
    ################################
    vol_factors, surf_factors = load_scaling_factors(cfg)

    keys_to_read, keys_to_read_if_available = get_keys_to_read(
        cfg, model_type, get_ground_truth=True
    )

    domain_mesh, data_mesh, placements = coordinate_distributed_environment(cfg)

    train_dataset = create_domino_dataset(
        cfg,
        phase="train",
        keys_to_read=keys_to_read,
        keys_to_read_if_available=keys_to_read_if_available,
        vol_factors=vol_factors,
        surf_factors=surf_factors,
        device_mesh=domain_mesh,
        placements=placements,
    )
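    # Shard examples across the data-parallel mesh only; ranks within the
    # same domain-parallel group must see identical samples.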
    train_sampler = DistributedSampler(
        train_dataset, num_replicas=data_mesh.size(), rank=data_mesh.get_local_rank()
    )

    for epoch in range(0, cfg.train.epochs):
        logger.info(f"Device {dist.device}, epoch {epoch}:")

        train_sampler.set_epoch(epoch)

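        # Passing the sampler's indices to the dataset up front lets it
        # preload and pre-preprocess upcoming samples (see the note in
        # benchmark_io_epoch above).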
        train_dataset.dataset.set_indices(list(train_sampler))

        epoch_start_time = time.perf_counter()
        with Profiler():
            benchmark_io_epoch(
                dataloader=train_dataset,
                logger=logger,
                gpu_handle=gpu_handle,
                epoch_index=epoch,
                device=dist.device,
            )
        epoch_end_time = time.perf_counter()
        logger.info(
            f"Device {dist.device}, Epoch {epoch} took {epoch_end_time - epoch_start_time:.3f} seconds"
        )


if __name__ == "__main__":
    main()