[Question] How to deal with multiple loss backward and optimize? #2729
I think it probably depends a bit on the model architecture; it might help to describe what 'fused' means. The assumption is that your model uses TorchRec components for its embedding tables, and by default we 'fuse' the optimizer step with the backward pass for the embedding modules (this avoids having to materialize the full gradient tensor in memory). However, this only applies to the embedding modules, not to your other MLP/dense layers. For those layers you would presumably do typical gradient accumulation (although I think you need to back-propagate with retain_graph=True so the graph survives the first backward).

So really the difference is that you would be calling optim.step() on the gradient propagated for each loss separately, as opposed to on the aggregate of the two losses.

FWIW, you can disable optimizer fusion by constraining your Sharder to only support the non-fused (dense) compute kernel, as sketched below.
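For reference, a minimal sketch (an illustration, not code from this thread) of constraining a sharder so the planner can only choose the non-fused dense kernel. The class name `DenseOnlySharder` is hypothetical; `EmbeddingBagCollectionSharder`, `EmbeddingComputeKernel`, and the overridden `compute_kernels` method are standard TorchRec APIs:

```python
# Hedged sketch: restrict an EmbeddingBagCollection sharder to the non-fused
# DENSE kernel so the embedding optimizer is no longer fused into backward.
# The class name is illustrative; the imports and the overridden method are
# standard TorchRec APIs.
from typing import List

from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder


class DenseOnlySharder(EmbeddingBagCollectionSharder):
    """Advertise only the dense kernel, disabling fused backward + optimizer."""

    def compute_kernels(self, sharding_type: str, compute_device_type: str) -> List[str]:
        return [EmbeddingComputeKernel.DENSE.value]
```

You would then pass an instance of this sharder to your model wrapper (e.g. via the `sharders` argument of `DistributedModelParallel`), so embedding gradients are materialized and stepped like any other parameter.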
Thanks for the suggestion.
No,
TorchRec fuses the backward-and-optimize procedure for performance, but in some cases, when parts of the computation are not differentiable, I have to compute part of the gradients manually and apply them manually.
For example, in the code snippet below there are two losses: one can be calculated directly with nn.MSELoss, while the other has a non-differentiable part and must be calculated manually. My question is: if I now perform two optimize steps, will it affect convergence? What is the best practice for this situation?
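A hedged reconstruction of the kind of setup in question (hypothetical shapes and names, not the original snippet): one loss flows through autograd normally, while the other's gradient with respect to the model output is derived by hand and injected with `Tensor.backward(gradient=...)`, and each loss gets its own optimizer step:

```python
# Illustrative sketch only (hypothetical model/gradient, not the original code).
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # stand-in for the dense part of the model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 16)
target = torch.randn(8, 4)

out = model(x)

# Loss 1: ordinary differentiable path.
loss1 = nn.MSELoss()(out, target)
loss1.backward(retain_graph=True)  # keep the graph; 'out' is back-propped again below
optimizer.step()                   # first optimize
optimizer.zero_grad()

# Loss 2: the loss itself is not differentiable, so supply d(loss2)/d(out)
# by hand (torch.sign here is a placeholder for the manually derived gradient).
# NOTE: this second backward uses activations saved before the first step,
# which is part of the convergence concern being asked about.
manual_grad = torch.sign(out.detach() - target)
out.backward(gradient=manual_grad)
optimizer.step()                   # second optimize
optimizer.zero_grad()
```

The alternative described above is to accumulate both gradients first and call optimizer.step() once on the aggregate.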