Replies: 1 comment
-
Hi @TAplutos, thanks for creating the discussion, and sorry I missed this until now. Our tensor parallelism implementation uses columnwise and rowwise parallelism, not sequence parallelism. You can see how each individual module is parallelized here. I think SequenceParallel is primarily used for modules like RMSNorm, not for linear/LoRA layers. However, there is also context parallelism, which shards along the sequence dimension and parallelizes the SDPA calculation to enable long-context training. I suspect that is what you want; can you confirm whether my understanding is correct?
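For reference, here is a minimal sketch of what such a plan looks like with the PyTorch DTensor parallel styles. The module names, mesh setup, and the assumption that the decoder exposes its blocks as `model.layers` are illustrative, not torchtune's exact plan:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)

# Illustrative per-block plan: linear projections get columnwise/rowwise
# tensor parallelism; SequenceParallel would only show up on norm-style
# modules. Module names here are placeholders, not torchtune's exact plan.
layer_plan = {
    "attn.q_proj": ColwiseParallel(),       # shard the output-feature dim
    "attn.k_proj": ColwiseParallel(),
    "attn.v_proj": ColwiseParallel(),
    "attn.output_proj": RowwiseParallel(),  # shard the input-feature dim, reduce the output
    "mlp.w1": ColwiseParallel(),
    "mlp.w3": ColwiseParallel(),
    "mlp.w2": RowwiseParallel(),
    # "sa_norm": SequenceParallel(),        # RMSNorm-like modules only
}


def apply_tp(model, tp_size: int):
    """Apply the plan to every transformer block (sketch only)."""
    tp_mesh = init_device_mesh("cuda", (tp_size,), mesh_dim_names=("tp",))
    for block in model.layers:
        parallelize_module(block, tp_mesh, layer_plan)
    return model
```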
-
Note: This is all based on the assumption that the torchtune tensor_parallelism implementation applies sequence parallelism. If this is not the case, please let me know.
The standard torchtune implementation for tensor parallelism works well; I've been able to apply it smoothly to the 70B Llamas. However, I'm looking to implement a version that works for Qwen LoRA. Any clues as to how I could correctly divide up the LoRA adapters and the weight matrices they are applied to?
Referencing the equation
W' = W + BA
I was thinking something along the lines of: "If W is row divided, then row divide B. If W is column divided, then column divide A." Or do I just want to apply the tensor parallelism to W and leave A and B alone? I'm trying to work with longer sequences here without offloading activations, so the benefits of sequence parallelism are what I am most interested in.
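To make the rule concrete, here is a rough sketch of what I have in mind, assuming a LoRA layer that wraps three plain nn.Linear submodules (the names base/lora_a/lora_b and the module paths below are mine, not necessarily torchtune's), expressed with the same ColwiseParallel/RowwiseParallel styles as the standard plan:

```python
import torch.nn as nn
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel


class ToyLoRALinear(nn.Module):
    """Toy stand-in for a LoRA projection: y = base(x) + lora_b(lora_a(x))."""

    def __init__(self, in_dim: int, out_dim: int, rank: int):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)   # W, stored as (out_dim, in_dim)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)    # A, stored as (rank, in_dim)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)   # B, stored as (out_dim, rank)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))


# Case 1: W's output (row) dimension is sharded, i.e. DTensor ColwiseParallel.
# Shard B the same way so each rank's B_i(A x) lines up with its slice of W x;
# A stays replicated and consumes the replicated input.
colwise_lora_plan = {
    "attn.q_proj.base": ColwiseParallel(),
    "attn.q_proj.lora_b": ColwiseParallel(),
    # "attn.q_proj.lora_a": left replicated
}

# Case 2: W's input (column) dimension is sharded, i.e. DTensor RowwiseParallel.
# Shard A the same way so each rank consumes its input shard; the rank-sized
# A_i x_i gets reduced across ranks, and the replicated B then produces an
# output that adds onto the reduced base output.
rowwise_lora_plan = {
    "attn.output_proj.base": RowwiseParallel(),
    "attn.output_proj.lora_a": RowwiseParallel(),
    # "attn.output_proj.lora_b": left replicated
}
```

Either way I'd hand these dicts to parallelize_module just like the full-rank plan. I suspect the un-sharded factor may still need its activations converted to/from DTensor so plain tensors and DTensors don't get mixed, but I'd like to confirm the sharding rule itself first.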