Caching Mechanism For DoMINO Training #805
Modulus Pull Request
Description
Summary
Introduces a `CachedDoMINODataset` for DoMINO, which decreases the time of a training epoch by over 50x in a common training scenario on H100s. An example of a config set up to use caching is provided in `cached.yaml`.
Annotating a timeline of a `surface` training run of DoMINO on the DrivAerML dataset on an H100, we can see that the overwhelming portion of the time is spent in the dataloader (within which most of the time is neighbor calculation). In this single sample: 18.92 seconds in the `__getitem__` call versus 0.16 s with the GPU active doing training (as the H100 powers through the kernels), so over 100x overhead. For 52 training samples on one H100 on DrivAerML, this makes a training epoch take over 15 minutes (983.24 seconds on an arbitrarily chosen epoch), so a training run of 500 epochs takes over 5 days. This overhead is exacerbated by the fact that the dataloader contains some GPU-based Warp operations, which means we cannot use dataloader workers.
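For reference, a minimal sketch of how such a per-sample breakdown can be measured (the helper and its arguments are illustrative, not part of this PR; the numbers above came from a profiler timeline):

```python
import time

import torch


def profile_sample(dataset, model, optimizer, idx=0, device="cuda"):
    """Illustrative timing of dataloading vs. GPU work for one sample.

    `dataset`, `model`, and `optimizer` are stand-ins for whatever the
    training script uses; this is a back-of-envelope check, not the PR's API.
    """
    t0 = time.perf_counter()
    sample = dataset[idx]  # all preprocessing, including neighbor search
    t_load = time.perf_counter() - t0

    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    loss = model(sample).sum()  # placeholder forward + backward
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize(device)  # make GPU time visible to the wall clock
    t_step = time.perf_counter() - t0

    print(f"__getitem__: {t_load:.2f}s, GPU step: {t_step:.2f}s, "
          f"overhead: {t_load / t_step:.0f}x")
```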
In order to mitigate this, we introduce `CachedDoMINODataset` and a new `cache_data.py` stage. Essentially, we do all of the preprocessing work formerly done in `DoMINODataPipe`'s `__getitem__`, except for sampling, in `cache_data.py`; then at train time we just read in the cached data and sample it. To keep file sizes relatively small, we store only the neighbor indices and compute the neighbor properties, such as coordinates, at load time.
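The following is a minimal sketch of that two-stage split, assuming a k-nearest-neighbor search over a point cloud; all names here (`cache_sample`, `CachedDataset`, the `.npz` layout, `k`) are illustrative stand-ins rather than the actual API in `cache_data.py` and `CachedDoMINODataset`:

```python
from pathlib import Path

import numpy as np
import torch
from scipy.spatial import cKDTree
from torch.utils.data import Dataset


def cache_sample(points: np.ndarray, out_file: Path, k: int = 8) -> None:
    """Offline stage: run the expensive neighbor search once, store only indices."""
    tree = cKDTree(points)
    _, neighbor_idx = tree.query(points, k=k)  # (N, k) integer indices
    # Storing indices instead of per-neighbor coordinates keeps files small;
    # the coordinates are reconstructed from `points` at load time.
    np.savez_compressed(out_file, points=points, neighbor_idx=neighbor_idx)


class CachedDataset(Dataset):
    """Train-time stage: cheap load plus per-epoch random subsampling."""

    def __init__(self, cache_dir: Path, n_sample: int = 4096):
        self.files = sorted(cache_dir.glob("*.npz"))
        self.n_sample = n_sample

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, i: int) -> dict:
        data = np.load(self.files[i])
        points, neighbor_idx = data["points"], data["neighbor_idx"]
        # Sampling stays at train time so each epoch sees different points.
        sel = np.random.choice(len(points), self.n_sample, replace=False)
        # Reconstruct neighbor coordinates from indices: (n_sample, k, 3).
        neighbor_coords = points[neighbor_idx[sel]]
        return {
            "coords": torch.from_numpy(points[sel]),
            "neighbor_coords": torch.from_numpy(neighbor_coords),
        }
```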
Here is a timeline of the same sample but using `CachedDoMINODataset`: ~2.29 s in dataloading for a single sample, which is over an 8x improvement. This brings the training time for a 500-epoch run down to under a day. Just as importantly, since the GPU-based Warp operations are now in the caching phase, we can set `num_workers` to something other than 0 and unlock some parallelism and data pipelining. Setting it to 12, we get a further improvement: we see 12 samples being handled efficiently in sequence, then a gap as the dataloader workers finish preparing the next samples, then the next 12 samples handled efficiently. The 12 samples plus the wait for refill take ~2.69 s, which is only a bit more than a single sample with `num_workers=0`.
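Concretely, the loader setup might now look like this (again hypothetical, reusing the `CachedDataset` sketch above):

```python
from pathlib import Path

from torch.utils.data import DataLoader

# With the GPU-based Warp preprocessing moved to the caching stage,
# __getitem__ is CPU-only, so forking dataloader workers is safe.
loader = DataLoader(
    CachedDataset(cache_dir=Path("cache/"), n_sample=4096),
    batch_size=1,
    shuffle=True,
    num_workers=12,   # workers prepare the next samples while the GPU trains
    pin_memory=True,
)
```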
This brings a single training epoch (52 samples) down to ~13.7 seconds (since 12 doesn't divide 52, the 4 final remainder samples end up taking more time than the rest). This means an entire training run is brought down to ~2 hours (from over 5 days without caching).
Checklist
Dependencies