
Add torchcompile_xentropy executor #1655

Open · wants to merge 21 commits into main

Conversation

riccardofelluga
Collaborator

What does this PR do?

In an attempt to fix #1654 and partially #1552, this PR adds the necessary ops to the torchcompile_cat list so that it can capture HF CausalLMLoss.

Before merging, this PR needs testing with other models; I will post a message with the benchmark results.
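
A minimal usage sketch of what this enables. The executor object name `torch_compile_xentropy_ex` is an assumption on my part, mirroring the existing `torch_compile_cat_ex` naming; the toy loss below just follows the shift-and-flatten pattern of HF CausalLMLoss:

```python
import torch
import thunder

# Hypothetical import: assumes the new executor object follows the
# torch_compile_cat_ex naming convention in thunder/executors/torch_compile.py.
from thunder.executors.torch_compile import torch_compile_xentropy_ex


def loss_fn(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift-and-flatten cross entropy, as in HF CausalLMLoss.
    shifted = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
    targets = labels[:, 1:].contiguous().view(-1)
    return torch.nn.functional.cross_entropy(shifted, targets)


# Listing the executor before the defaults gives it the first chance to claim
# the cross-entropy ops, ahead of nvFuser.
jitted = thunder.jit(
    loss_fn,
    executors=[torch_compile_xentropy_ex, *thunder.get_default_executors()],
)
```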

@IvanYashchuk
Collaborator

This executor is meant for fusions with a cat operation. Cross entropy is usually quite far from RoPE (which uses cat). How does adding cross_entropy to torchcompile_cat help?

@riccardofelluga
Collaborator Author

@IvanYashchuk Adding cross entropy does indeed not help with RoPE, but it allows Thunder to use a very efficient fused cross-entropy Triton kernel whose performance currently cannot be matched by the other executors (not even apex). From lines 211 and 212, it seems that while the set of ops was originally chosen for RoPE, the comment suggests other ops could be added in the future, making this a fusion executor that comes to the rescue of nvFuser when it can improve performance. Do you think it would be better to create another executor entry for inductor cross entropy only?
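
To make the kernel claim concrete outside of Thunder: compiling cross entropy with plain torch.compile already gets Inductor to emit a fused Triton kernel for the log-softmax and NLL reduction, which is the kind of kernel this executor would dispatch to. A standalone sketch (shapes and device are illustrative):

```python
import torch
import torch.nn.functional as F


def xent(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits, targets)


# Inductor (the default torch.compile backend) fuses the log-softmax and NLL
# reduction instead of materializing the intermediate softmax, which is where
# the memory and performance win over separate kernels comes from.
compiled_xent = torch.compile(xent)

logits = torch.randn(8 * 512, 32000, device="cuda", requires_grad=True)
targets = torch.randint(0, 32000, (8 * 512,), device="cuda")
compiled_xent(logits, targets).backward()
```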

@riccardofelluga riccardofelluga changed the title [WIP] Add cross_entropy to torchcompile_cat executor [WIP] Add torchcompile_xentropy executor Jan 28, 2025
@riccardofelluga riccardofelluga changed the title [WIP] Add torchcompile_xentropy executor Add torchcompile_xentropy executor Jan 28, 2025
pre-commit-ci bot and others added 11 commits January 28, 2025 15:29
Reducing the model sizes to speed up CI and make the test work in environments with constrained memory.

I think this change should be fine, since this test verifies functionality rather than memory footprint. Let me know what you think.

With this change, the peak memory of the test is ~2.6GB instead of ~14GB.
@riccardofelluga riccardofelluga marked this pull request as ready for review January 30, 2025 07:37
@t-vi
Collaborator

t-vi commented Jan 30, 2025

I have a hard time seeing another torch.compile executor as the solution here.
How are they supposed to interact?
Will we see consecutive torch compile regions as a result?

@riccardofelluga
Collaborator Author

@t-vi
How are they supposed to interact?
Will we see consecutive torch compile regions as a result?

I'd piggyback on this explanation of the differences between the torchcompile and torchcompile_cat executors:

# NOTE: [torch_compile_cat_ex vs torch_compile_ex]
# The former only relies on `torch.compile` for the operators where it shines the most and is meant to be used
# together with the nvfuser executor. Its current goal is only to fuse RoPE but the set of ops fused will change as each
# of the fusion backends evolve.
# The latter will try to `torch.compile` all the torch operators and is meant to be used without the nvfuser_executor
# since they would be competing over fusion opportunities. The advantage over simply doing `torch.compile` is that you
# still get all of Thunder's advantages, like enabling custom executors (e.g. with custom triton kernels) before it.

torchcompile_xentropy is treated the same way: it is not supposed to be used with torchcompile. On the other hand, I can see the worry, since it can be used together with torchcompile_cat; however, the combination of required and supported ops differs from torchcompile_cat's, so one won't claim the ops of the other.

An issue I see is that we don't strictly enforce this rule, so in the case where the user specifies torchcompile together with torchcompile_cat or torchcompile_xentropy, the former will claim all the ops.
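
A sketch of that failure mode, using the executor objects named in the note above (the toy function, shapes, and trace inspection are mine):

```python
import torch
import thunder
from thunder.executors.torch_compile import torch_compile_cat_ex, torch_compile_ex


def fn(x: torch.Tensor) -> torch.Tensor:
    return torch.cat([x.sin(), x.cos()], dim=-1)


# torch_compile_ex supports essentially all torch operators, so when it is
# listed first it claims every op and torch_compile_cat_ex never gets a
# fusion opportunity. The same would happen to torchcompile_xentropy.
jitted = thunder.jit(fn, executors=[torch_compile_ex, torch_compile_cat_ex])

out = jitted(torch.randn(4, 8))
print(thunder.last_traces(jitted)[-1])  # inspect which executor claimed each op
```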

Successfully merging this pull request may close these issues: nvFuser using more memory than inductor for HF CausalLMLoss