Replies: 1 comment
-
Hi @TAplutos, thanks for creating the discussion, and sorry I missed this until now. Our tensor parallelism implementation uses columnwise and rowwise parallelism, not sequence parallelism. You can see how each individual module is parallelized here. I think SequenceParallel is primarily used for modules like RMSNorm, not for linear/LoRA layers. However, there is also context parallelism, which shards along the sequence dimension and parallelizes the SDPA calculation to enable long-context training. I suspect that is what you want; can you confirm whether my understanding is correct?
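For reference, here is a minimal sketch of what such a plan looks like with the PyTorch DTensor parallel styles. The module names, mesh setup, and the assumption that the decoder exposes its blocks as `model.layers` are illustrative, not torchtune's exact plan:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)

# Illustrative per-block plan: linear projections get columnwise/rowwise
# tensor parallelism; SequenceParallel would only show up on norm-style
# modules. Module names here are placeholders, not torchtune's exact plan.
layer_plan = {
    "attn.q_proj": ColwiseParallel(),       # shard the output-feature dim
    "attn.k_proj": ColwiseParallel(),
    "attn.v_proj": ColwiseParallel(),
    "attn.output_proj": RowwiseParallel(),  # shard the input-feature dim, reduce the output
    "mlp.w1": ColwiseParallel(),
    "mlp.w3": ColwiseParallel(),
    "mlp.w2": RowwiseParallel(),
    # "sa_norm": SequenceParallel(),        # RMSNorm-like modules only
}


def apply_tp(model, tp_size: int):
    """Apply the plan to every transformer block (sketch only)."""
    tp_mesh = init_device_mesh("cuda", (tp_size,), mesh_dim_names=("tp",))
    for block in model.layers:
        parallelize_module(block, tp_mesh, layer_plan)
    return model
```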
-
Note: This is all based on the assumption that the torchtune tensor_parallelism implementation applies sequence parallelism. If this is not the case, please let me know.
The standard torchtune implementation for tensor parallelism works well; I've been able to apply it smoothly to the 70B Llamas. However, I'm looking to implement a version that works for Qwen LoRA. Any clues as to how I could correctly divide up the LoRA adapters and the weight matrices they are applied to?
Referencing the equation
W' = W + BA
I was thinking something along the lines of: "If W is row divided, then row divide B. If W is column divided, then column divide A." Or do I just want to apply the tensor parallelism to W and leave A and B alone? I'm trying to work with longer sequences here without offloading activations, so the benefits of sequence parallelism are what I am most interested in.
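To make the rule concrete, here is a rough sketch of what I have in mind, assuming a LoRA layer that wraps three plain nn.Linear submodules (the names base/lora_a/lora_b and the module paths below are mine, not necessarily torchtune's), expressed with the same ColwiseParallel/RowwiseParallel styles as the standard plan:

```python
import torch.nn as nn
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel


class ToyLoRALinear(nn.Module):
    """Toy stand-in for a LoRA projection: y = base(x) + lora_b(lora_a(x))."""

    def __init__(self, in_dim: int, out_dim: int, rank: int):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)   # W, stored as (out_dim, in_dim)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)    # A, stored as (rank, in_dim)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)   # B, stored as (out_dim, rank)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))


# Case 1: W's output (row) dimension is sharded, i.e. DTensor ColwiseParallel.
# Shard B the same way so each rank's B_i(A x) lines up with its slice of W x;
# A stays replicated and consumes the replicated input.
colwise_lora_plan = {
    "attn.q_proj.base": ColwiseParallel(),
    "attn.q_proj.lora_b": ColwiseParallel(),
    # "attn.q_proj.lora_a": left replicated
}

# Case 2: W's input (column) dimension is sharded, i.e. DTensor RowwiseParallel.
# Shard A the same way so each rank consumes its input shard; the rank-sized
# A_i x_i gets reduced across ranks, and the replicated B then produces an
# output that adds onto the reduced base output.
rowwise_lora_plan = {
    "attn.output_proj.base": RowwiseParallel(),
    "attn.output_proj.lora_a": RowwiseParallel(),
    # "attn.output_proj.lora_b": left replicated
}
```

Either way I'd hand these dicts to parallelize_module just like the full-rank plan. I suspect the un-sharded factor may still need its activations converted to/from DTensor so plain tensors and DTensors don't get mixed, but I'd like to confirm the sharding rule itself first.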