Add torch compile logic to LoRA trainer #422
Merged
🧩 Analysis chain
🏁 Script executed:
Repository: ace-step/ACE-Step-1.5
Length of output: 1198
🌐 Web query:
torch.compile failure modes exceptions runtime errors fallback pattern

💡 Result:
Common torch.compile failure modes (what you'll see)
- With the default fullgraph=False, Dynamo will graph-break, run the unsupported part in eager Python, then resume tracing (often just causing slowness, not failure). Typical causes: data-dependent control flow, many Python builtins/C functions, printing/logging, etc. [1]
- With fullgraph=True, any graph break becomes a hard failure (you'll get an error instead of falling back to eager for that region). [2]
- torch._dynamo.exc.Unsupported: ... is raised when Dynamo can't trace something (examples include .item()-driven data dependence, dynamic slicing with Tensor arguments, unsupported Python constructs, skipped/marked functions, etc.). [1][3]
- torch._dynamo.exc.BackendCompilerFailed wraps the real underlying error (e.g., toolchain/Triton compilation problems). [4]

"Fail fast" vs "fallback" knobs (practical patterns)
A) Prefer partial compilation (default) to avoid hard failures
Regions hit by a graph break simply run eagerly instead of erroring. [1][2]
B) Explicitly exclude known-problem regions (most robust)
Use @torch.compiler.disable to force eager execution for specific functions (and keep the rest compiled). [1]

C) Last-resort "don't crash, just run eager"
If you hit compiler crashes and want to keep training/inference running, set the suppress_errors config flag (see the sketch below). This suppresses compiler exceptions and falls back to eager for the failing compile attempt (not considered best practice long-term). [1][4]
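A minimal sketch combining patterns B and C; the flaky_preprocess function and its body are illustrative, not taken from the PR or the docs:

```python
import torch

# C) Last-resort global fallback: suppress compiler exceptions and run the
#    failing region eagerly instead of crashing (not best practice long-term).
torch._dynamo.config.suppress_errors = True

# B) Explicitly exclude a known-problem region while keeping the rest compiled.
@torch.compiler.disable
def flaky_preprocess(x):
    # Data-dependent Python control flow like this is a common graph-break cause.
    return x * 2 if x.sum().item() > 0 else x

@torch.compile  # default fullgraph=False: graph breaks fall back to eager
def train_step(x):
    return flaky_preprocess(x).relu().mean()

print(train_step(torch.randn(8, 8)))
```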
D) Force errors on graph breaks (good for debugging / CI)
- fullgraph=True (strongest) errors on any graph break. [2]
- torch._dynamo.error_on_graph_break(...) controls whether graph breaks error, but fullgraph=True takes precedence. [3]
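A quick sketch of the strict mode described in D (the function below is illustrative):

```python
import torch

# Strict mode for CI/debugging: a graph break raises an error immediately
# instead of silently splitting the graph and running part of it eagerly.
@torch.compile(fullgraph=True)
def strict_step(x):
    # This body traces as a single graph; a data-dependent Python branch
    # (e.g. on x.sum().item()) would instead raise a torch._dynamo error here.
    return torch.nn.functional.relu(x).sum()

print(strict_step(torch.randn(4)))
```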
E) Escape hatch for "I know this is safe to trace non-strictly"
Use torch._dynamo.nonstrict_trace when a function graph-breaks but is still traceable under non-strict rules. [5]

How to identify what's happening
- TORCH_LOGS="graph_breaks" to see where and why graph breaks occur; TORCH_LOGS="recompiles" to diagnose recompilation churn. [1]
- tlparse for a higher-level compilation report (graph breaks, recompiles, errors). [1]
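A small sketch of those switches; the env-var form is shown as a comment (the script name is a placeholder), and the in-process call assumes a recent PyTorch that ships torch._logging.set_logs:

```python
# Shell form: TORCH_LOGS="graph_breaks,recompiles" python train.py
import torch

# In-process equivalent: log graph-break and recompilation reasons as they happen.
torch._logging.set_logs(graph_breaks=True, recompiles=True)

@torch.compile
def step(x):
    return (x @ x.t()).sum()

print(step(torch.randn(4, 4)))
```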
Sources
[1] PyTorch docs — torch.compile Troubleshooting (graph breaks, suppress_errors, disable, recompiles, logging)
[2] PyTorch docs — Use fullgraph=True to Identify and Eliminate Graph Breaks
[3] PyTorch docs — Toggling error_on_graph_break
[4] PyTorch GitHub issue — example torch._dynamo.exc.BackendCompilerFailed + note about suppress_errors fallback
[5] PyTorch docs — Use torch._dynamo.nonstrict_trace

Fix misleading warning message and handle torch.compile edge cases safely.
The warning message is inaccurate: it claims torch.compile is unavailable when the check actually skips compilation due to non-CUDA devices, not missing PyTorch support. Additionally, while torch.compile with the default fullgraph=False gracefully handles most graph breaks by falling back to eager execution, backend or dynamic compilation failures can still occur at runtime without warning.

🔧 Suggested fix (clearer log + safe fallback)
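A minimal sketch of that pattern, assuming a hypothetical maybe_compile helper; the flag, attribute, and message wording are illustrative, not the committable suggestion itself:

```python
import logging
import torch

logger = logging.getLogger(__name__)

def maybe_compile(model: torch.nn.Module, device: torch.device, enabled: bool) -> torch.nn.Module:
    """Wrap torch.compile so a compilation problem never aborts LoRA training."""
    if not enabled:
        return model
    if device.type != "cuda":
        # Clearer log: compilation is skipped by policy on non-CUDA devices,
        # not because the installed PyTorch lacks torch.compile.
        logger.warning("Skipping torch.compile: only enabled on CUDA devices (got %s).", device.type)
        return model
    try:
        # torch.compile is lazy, so backend errors surface on the first forward
        # pass; suppress_errors makes those fall back to eager instead of raising
        # mid-training.
        torch._dynamo.config.suppress_errors = True
        return torch.compile(model)
    except Exception as exc:
        logger.warning("torch.compile setup failed (%s); continuing in eager mode.", exc)
        return model
```

On the trainer side this might be wired as self.model = maybe_compile(self.model, device, args.torch_compile), with whatever flag and attribute names the PR actually uses.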
📝 Committable suggestion
🤖 Prompt for AI Agents