Cpu memory accumulation bug #20730

ved1beta · 2025-04-20T12:26:31Z

What does this PR do?

This PR addresses the CPU memory leak during prediction (trainer.predict) in PyTorch Lightning. It runs garbage collection in the prediction loop when return_predictions=False and includes tests to verify the fix.

Fixes #19398

Key Changes:

Added garbage collection in prediction loop when return_predictions=False
Implemented memory leak tests with large dataset simulation
Added environment variable cleanup in tests
Fixed pre-commit formatting issues
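
For readers who want to see what a memory leak test for this behaviour could look like, here is a minimal sketch. It is not the test from this PR: `run_predict_loop` and `peak_memory_bytes` are hypothetical helpers, and the loop is a plain-Python stand-in for `trainer.predict`. It only shows the idea of comparing peak traced memory with and without retained predictions.

```python
import gc
import tracemalloc


def run_predict_loop(num_batches, batch_size, keep_predictions):
    """Toy stand-in for a prediction loop (hypothetical, not Lightning code)."""
    predictions = []
    for _ in range(num_batches):
        # Stand-in for the output of predict_step on one batch.
        output = [float(i) for i in range(batch_size)]
        if keep_predictions:
            # Mirrors return_predictions=True: every output stays alive.
            predictions.append(output)
        # Otherwise the output is dropped when the name is rebound.
    return predictions if keep_predictions else None


def peak_memory_bytes(fn, *args):
    """Return the peak memory traced by tracemalloc while fn(*args) runs."""
    gc.collect()
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    keep = peak_memory_bytes(run_predict_loop, 50, 10_000, True)
    drop = peak_memory_bytes(run_predict_loop, 50, 10_000, False)
    print(f"peak with predictions kept:    {keep} bytes")
    print(f"peak with predictions dropped: {drop} bytes")
```

A real test would drive `trainer.predict(..., return_predictions=False)` instead of the toy loop and would probably need a tolerance, since process-level memory measurements are noisy.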

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs)
- Yes, discussed in CPU-Memory keeps accumulating during trainer.predict #19398
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
- No documentation changes needed
Did you write any new necessary tests? (not for typos and docs)
- Yes, added memory leak tests
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
- No breaking changes
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)
- Yes, will add entry for memory leak fix

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20730.org.readthedocs.build/en/20730/

Borda · 2025-04-22T19:12:35Z

> Implemented memory leak tests with large dataset simulation

> Added environment variable cleanup in tests

Seems these are not included in this PR yet.

deependujha

Hi @ved1beta, thanks for the contribution.

I've added some review comments. Please fix the failing tests and the mypy issue.

deependujha · 2025-05-20T11:26:51Z

src/lightning/pytorch/loops/prediction_loop.py

+            # Clear memory if not returning predictions
+            import gc
+
+            gc.collect()


Do you think it would be a good idea to add an argument, collect_gc or something similar, that users can toggle?

As Adrian said: it might be expensive in certain situations.
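
One way the suggested toggle could look is sketched below. This is an illustration only: `predict_loop`, `collect_gc` and `gc_every_n_batches` are hypothetical names, not existing Lightning API, and the loop is a simplified stand-in for the real prediction loop. Collection is opt-in and can be rate-limited, because a full `gc.collect()` after every batch may be expensive.

```python
import gc
from typing import Any, Callable, Iterable, List, Optional


def predict_loop(
    batches: Iterable[Any],
    predict_step: Callable[[Any], Any],
    return_predictions: bool = True,
    collect_gc: bool = False,      # hypothetical opt-in flag
    gc_every_n_batches: int = 1,   # hypothetical rate limit for gc.collect()
) -> Optional[List[Any]]:
    """Simplified prediction loop with an optional garbage collection step."""
    predictions: Optional[List[Any]] = [] if return_predictions else None
    for batch_idx, batch in enumerate(batches):
        output = predict_step(batch)
        if predictions is not None:
            predictions.append(output)
            continue
        # Predictions are not returned, so drop the reference right away.
        del output
        # Only pay for a full collection when the user asked for it.
        if collect_gc and (batch_idx + 1) % gc_every_n_batches == 0:
            gc.collect()
    return predictions


if __name__ == "__main__":
    # Default behaviour: predictions are returned and gc is never forced.
    print(predict_loop(range(4), lambda b: b * 2))
    # Opt in to collection every second batch while discarding predictions.
    predict_loop(range(4), lambda b: b * 2, return_predictions=False,
                 collect_gc=True, gc_every_n_batches=2)
```

Keeping the flag off by default would leave existing behaviour and performance unchanged for users who are not affected by the memory growth.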

deependujha · 2025-05-20T11:36:33Z

src/lightning/pytorch/loops/prediction_loop.py

-        if predictions is None:
-            self._warning_cache.warn("predict returned None if it was on purpose, ignore this warning...")
+        step_args = self._build_step_args_from_hook_kwargs(hook_kwargs, "predict_step")
+        step_output = call._call_lightning_module_hook(trainer, "predict_step", *step_args)


Why are you calling the LightningModule hook directly instead of going through the strategy hook?

After a couple of checks, and inside the precision_plugin context, the strategy hook does call the LightningModule's predict_step.
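
The concern can be illustrated with a toy model of the call chain. The classes below are hypothetical stand-ins, not Lightning's real `Strategy`, precision plugin or `LightningModule`; they only mimic the behaviour described in this comment, where the strategy enters a precision context before delegating to the module.

```python
from contextlib import contextmanager


class ToyPrecision:
    """Stand-in for a precision plugin that must be active during predict_step."""

    def __init__(self):
        self.active = False

    @contextmanager
    def predict_step_context(self):
        self.active = True
        try:
            yield
        finally:
            self.active = False


class ToyModule:
    """Stand-in for a LightningModule; records whether precision was active."""

    def __init__(self, precision):
        self.precision = precision

    def predict_step(self, batch):
        return {"batch": batch, "precision_active": self.precision.active}


class ToyStrategy:
    """Stand-in for a strategy: wraps the module hook in the precision context."""

    def __init__(self, module, precision):
        self.module = module
        self.precision = precision

    def predict_step(self, batch):
        with self.precision.predict_step_context():
            return self.module.predict_step(batch)


if __name__ == "__main__":
    precision = ToyPrecision()
    module = ToyModule(precision)
    strategy = ToyStrategy(module, precision)
    # Going through the strategy: the precision context is active.
    print(strategy.predict_step(0))
    # Calling the module directly: the precision context is skipped.
    print(module.predict_step(0))
```

In this sketch, calling the module hook directly silently skips whatever the strategy sets up, which is why routing the call through the strategy hook is the safer default.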

ved1beta added 3 commits April 20, 2025 17:21

fix: Add memory leak prevention in prediction loop

403b3ae

fix: Add memory leak prevention in prediction loop

ff5f9eb

test removed

65d38fe

ved1beta requested review from lantiga, Borda, tchaton, justusschock and ethanwharris as code owners April 20, 2025 12:26

github-actions bot added the pl Generic label for PyTorch Lightning package label Apr 20, 2025

ved1beta and others added 5 commits April 23, 2025 10:09

memory leak test

ce0897c

env var cleanu

f24ea83

[pre-commit.ci] auto fixes from pre-commit.com hooks

f3cf136

for more information, see https://pre-commit.ci

precommit fix

6aa644a

Merge branch 'cpu_memory_accumulation_bug' of github.com:ved1beta/pyt…

6d6d04e

…orch-lightning into cpu_memory_accumulation_bug

deependujha reviewed May 20, 2025

Borda added the waiting on author Waiting on user action, correction, or update label May 28, 2025
