
Conversation

@IvanKobzarev
Contributor

@IvanKobzarev IvanKobzarev commented Apr 22, 2025

Stack from ghstack (oldest at bottom):

Compiling the optimizer helps perf of the Llama4 Scout model:
3.8 tokens_per_second -> 9 tokens_per_second (max tokens per second over the first ~10 iterations)
Peak memory is the same.

```
tune run --nproc_per_node 8 \
  full_finetune_distributed \
  --config recipes/configs/llama4/scout_17B_16E_full.yaml
```

PS:
Compilation in the current repo fails if `skip_rope_interval=4,` is set; have to test with `skip_rope_interval=None,`.

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Apr 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2623

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit af77178 with merge base 4bc5af2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 22, 2025
Member

@joecummings joecummings left a comment


Since we're now compiling several things independently, it might make sense logically to have a section of the recipe where we compile everything after instantiation.

return loss


def compile_optimizer_step(optimizer_step_fn, verbose: bool = True):
Member

I appreciate you wanting to keep this similar to how we're currently doing things; however, we only needed to do this for the loss function because we were doing funky things with chunking.

We should just compile this directly in the recipe. Same goes for the other PR you have up.
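For illustration, a minimal standalone sketch of compiling a bound optimizer step once during setup and reusing it every iteration (toy names like `compile_backend` stand in for the recipe's attributes; this is not the exact change in this PR):

```
import torch

# Toy sketch: compile optimizer.step once, then reuse the compiled callable.
params = [torch.nn.Parameter(torch.randn(4))]
optimizer = torch.optim.AdamW(params, lr=1e-3)

compile_backend = "inductor"  # stand-in for self._compile_backend
optimizer_step_fn = torch.compile(optimizer.step, backend=compile_backend)

loss = (params[0] ** 2).sum()
loss.backward()
optimizer_step_fn()  # updates the same optimizer state as optimizer.step()
optimizer.zero_grad()
```

The training loop then calls the compiled `optimizer_step_fn()` where it previously called `self._optimizer.step()`.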

if isinstance(grad_norm, DTensor):
    grad_norm = grad_norm.full_tensor()
self._optimizer.step()
optimizer_step_fn = self._optimizer.step
Member

See comment below, we can just compile the optimizer step in the recipe directly.

Contributor Author

Yeah, agree, just copied the previous setup. Will move compile to the recipe.

Contributor

Noob q: is there a reason we need to compile self._optimizer.step every step? Why is it different than the model, which we compile one time upfront?

Contributor

When I tried this out, I found issues with setting up the LR scheduler, which fails when attempting to wrap the optimizer step fn.
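For what it's worth, a hedged toy sketch of one ordering that sidesteps that clash: build the LR scheduler against the untouched optimizer first, then compile the bound step into a separate callable instead of overwriting `optimizer.step` (names here are illustrative, not this PR's code):

```
import torch

# Build the scheduler before any compilation, against the plain optimizer.
params = [torch.nn.Parameter(torch.zeros(8))]
optimizer = torch.optim.AdamW(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)

# Compile into a separate callable; optimizer.step itself is left untouched,
# so the scheduler's bookkeeping around it keeps working.
optimizer_step_fn = torch.compile(optimizer.step)

(params[0] ** 2).sum().backward()
optimizer_step_fn()
scheduler.step()
optimizer.zero_grad()
```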

Contributor Author

Thanks, will check the compile optimizer error with LR scheduler.

@joecummings
Member

Ah sorry, did not mean to approve :)

@codecov-commenter

codecov-commenter commented Apr 22, 2025

Codecov Report

Attention: Patch coverage is 0% with 15 lines in your changes missing coverage. Please review.

Please upload report for BASE (gh/IvanKobzarev/1/base@eed2665). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| recipes/full_finetune_distributed.py | 0.00% | 15 Missing ⚠️ |
Additional details and impacted files
```
@@                    Coverage Diff                    @@
##             gh/IvanKobzarev/1/base    #2623   +/-   ##
=========================================================
  Coverage                          ?   63.97%           
=========================================================
  Files                             ?      399           
  Lines                             ?    24241           
  Branches                          ?        0           
=========================================================
  Hits                              ?    15507           
  Misses                            ?     8734           
  Partials                          ?        0           

☔ View full report in Codecov by Sentry.

Comment on lines 929 to 933
if self._compile:
    optimizer_step_fn = torch.compile(
        optimizer_step_fn,
        backend=self._compile_backend,
    )
Contributor

Some optimizers might not work with this, if I remember correctly, e.g. torchao/bnb. May need some testing. The safest option might be to add a compile flag per area, e.g.:

```
compile:
  loss: True
  model: True
  optimizer_step: False
```
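For illustration, a rough Python-side sketch of how a recipe could honor such per-area flags (the flag names mirror the suggestion above; everything else is a hypothetical stand-in, not this repo's code):

```
import torch

# Hypothetical per-area compile flags, e.g. parsed from a config like the one above.
compile_flags = {"model": True, "loss": True, "optimizer_step": False}

model = torch.nn.Linear(16, 16)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Compile only the pieces that are flagged on; optimizers that don't play
# well with torch.compile (e.g. torchao/bnb) can simply keep their flag off.
if compile_flags["model"]:
    model = torch.compile(model)
if compile_flags["loss"]:
    loss_fn = torch.compile(loss_fn)
optimizer_step_fn = (
    torch.compile(optimizer.step)
    if compile_flags["optimizer_step"]
    else optimizer.step
)
```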

@IvanKobzarev
Contributor Author

IvanKobzarev commented Apr 28, 2025

Changed to direct compilation of `self._optimizer.step` and it works :)
Updated the diff.

Just FYI for testing: compilation at the moment needs workarounds for 2 different problems:

  1. There is a problem with RNG state preservation, which can be worked around with the following patch:

```
diff --git a/torch/_dynamo/convert_frame.py b/torch/_dynamo/convert_frame.py
index 668353867ab..493883542f9 100644
--- a/torch/_dynamo/convert_frame.py
+++ b/torch/_dynamo/convert_frame.py
@@ -249,8 +249,8 @@ def preserve_global_state(fn: Callable[_P, _T]) -> Callable[_P, _T]:
             prior_dtype = torch.get_default_dtype()
             torch_rng_state = torch.random.get_rng_state()
             cuda_rng_state = None
-            if torch.cuda.is_available():
-                cuda_rng_state = torch.cuda.get_rng_state()
+            # if torch.cuda.is_available():
+            #     cuda_rng_state = torch.cuda.get_rng_state()
             allow_tf32 = torch._C._get_cublas_allow_tf32()
             prior_fwd_from_src = torch.fx.graph_module._forward_from_src
             torch.fx.graph_module._forward_from_src = fx_forward_from_src_skip_result
@@ -281,8 +281,8 @@ def preserve_global_state(fn: Callable[_P, _T]) -> Callable[_P, _T]:
                 )
                 if prior_mobile_allocator_state != curr_mobile_allocator_state:
                     torch._C._unset_default_mobile_cpu_allocator()
-                if cuda_rng_state is not None:
-                    torch.cuda.set_rng_state(cuda_rng_state)
+                # if cuda_rng_state is not None:
+                #     torch.cuda.set_rng_state(cuda_rng_state)
                 torch._C._set_cublas_allow_tf32(allow_tf32)
                 torch.fx.graph_module._forward_from_src = prior_fwd_from_src
                 assert guards.check(), (
diff --git a/torch/_dynamo/utils.py b/torch/_dynamo/utils.py
index b75b1d6c39f..7ca67523704 100644
--- a/torch/_dynamo/utils.py
+++ b/torch/_dynamo/utils.py
@@ -2110,15 +2110,15 @@ def preserve_rng_state():
     with disable_current_modes(), disable_functorch():
         rng_state = torch.clone(torch.random.get_rng_state())
         skip_frame_if_in_functorch_mode(rng_state)
-        if torch.cuda.is_available():
-            cuda_rng_state = torch.clone(torch.cuda.get_rng_state())
+        # if torch.cuda.is_available():
+        #     cuda_rng_state = torch.clone(torch.cuda.get_rng_state())
     try:
         yield
     finally:
         with torch.utils._python_dispatch._disable_current_modes():
             torch.random.set_rng_state(rng_state)
-            if torch.cuda.is_available():
-                torch.cuda.set_rng_state(cuda_rng_state)  # type: ignore[possibly-undefined]
+            # if torch.cuda.is_available():
+            #     torch.cuda.set_rng_state(cuda_rng_state)  # type: ignore[possibly-undefined]
 
 
 def is_jit_model(model0):
```

  2. There is an illegal memory access in chunked flex attention x caching.

If `/tmp/torchinductor_${USER}` is removed before every run, then it does not fire (alternatively, disable the PT2 cache).

Member

@joecummings joecummings left a comment


Just one nit on naming, but this looks good!

fsdp_cpu_offload: True
compile: False # torch.compile, set to true for perf/memory improvement

compile_components:
Member

nit: could we match the argument to just "compile"? Then valid arguments would be "True", "False", or the specific components. If "True", then we compile everything. If "False", we compile nothing. If the argument has a dictionary with each component, then we follow those instructions.
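A hedged sketch of that bool-or-mapping behavior (the component names and helper are illustrative, not the final interface):

```
from typing import Dict, Union

_COMPONENTS = ("model", "loss", "optimizer_step")

def resolve_compile(compile_cfg: Union[bool, Dict[str, bool]]) -> Dict[str, bool]:
    """Normalize the `compile` argument into one flag per component."""
    if isinstance(compile_cfg, bool):
        # True -> compile everything, False -> compile nothing.
        return dict.fromkeys(_COMPONENTS, compile_cfg)
    # A mapping selects components individually; unlisted ones stay eager.
    return {name: bool(compile_cfg.get(name, False)) for name in _COMPONENTS}

assert resolve_compile(True) == {"model": True, "loss": True, "optimizer_step": True}
assert resolve_compile({"model": True}) == {
    "model": True,
    "loss": False,
    "optimizer_step": False,
}
```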

Contributor Author

Ok. Agree with this logic, will update to it.

@IvanKobzarev IvanKobzarev merged commit 28dbc97 into gh/IvanKobzarev/1/base May 2, 2025
14 checks passed
IvanKobzarev added a commit that referenced this pull request May 2, 2025

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

8 participants