Open
Labels: bug (Something isn't working)
Description
Describe the bug
I'm testing long-context processing, so I padded a dataset of very short sequences (less than 4k tokens?) to 32k tokens. When I train on it with context parallelism, I get a NaN loss. Tensor parallelism and pipeline parallelism work fine.
My suspicion is that context parallelism does not correctly handle the case where all tokens on a given GPU are masked out.
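To make the suspicion concrete, here is a minimal pure-Python sketch of one way this could produce NaN. It is not the library's actual code. It assumes that each context-parallel rank holds a contiguous chunk of the sequence and normalizes its loss by its own count of unmasked tokens, which may not match the real implementation. All names (`ieee_div`, `split_into_chunks`, `per_rank_mean_then_average`, `global_sum_then_divide`) are hypothetical.

```python
import math


def ieee_div(num, den):
    """Mimic tensor-library float division, where 0.0 / 0.0 gives NaN
    instead of raising ZeroDivisionError as plain Python does."""
    if den == 0.0:
        return math.nan if num == 0.0 else math.copysign(math.inf, num)
    return num / den


def split_into_chunks(seq, num_ranks):
    """Assumed layout: each rank gets one contiguous, equal-sized chunk."""
    chunk = len(seq) // num_ranks
    return [seq[i * chunk:(i + 1) * chunk] for i in range(num_ranks)]


def per_rank_mean_then_average(losses, mask, num_ranks):
    """Fragile variant: every rank divides by its own unmasked-token count."""
    rank_means = []
    for rank_losses, rank_mask in zip(split_into_chunks(losses, num_ranks),
                                      split_into_chunks(mask, num_ranks)):
        num = sum(l * m for l, m in zip(rank_losses, rank_mask))
        den = float(sum(rank_mask))
        rank_means.append(ieee_div(num, den))  # 0/0 on fully padded ranks
    return sum(rank_means) / num_ranks


def global_sum_then_divide(losses, mask, num_ranks):
    """Safer variant: accumulate numerator and denominator across ranks
    (standing in for an all-reduce), then divide once."""
    num, den = 0.0, 0.0
    for rank_losses, rank_mask in zip(split_into_chunks(losses, num_ranks),
                                      split_into_chunks(mask, num_ranks)):
        num += sum(l * m for l, m in zip(rank_losses, rank_mask))
        den += float(sum(rank_mask))
    return ieee_div(num, den)


# Toy stand-in for the padded dataset: 2 real tokens followed by 6 padding tokens.
losses = [2.0, 4.0] + [0.0] * 6
mask = [1, 1] + [0] * 6

fragile = per_rank_mean_then_average(losses, mask, num_ranks=4)
robust = global_sum_then_divide(losses, mask, num_ranks=4)
print(fragile, robust)
```

With four ranks, only the first rank sees real tokens; the other three compute 0/0, and a single NaN is enough to poison the averaged loss. A similar 0/0 could also arise inside attention if a softmax is taken over a row whose keys are all masked, but that is a separate guess and is not shown here.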