Conversation

Stillerman
Contributor

What does this PR do?

For newer flash-attention versions (2.7.0 and onward), the `bert_padding.unpad_input` function returns an additional value, so the example Llama training from the README throws `ValueError: too many values to unpack`.

Fixes #251
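To illustrate the failure mode (a hedged sketch, not the nanotron source: `fake_unpad_input` is a stand-in for `bert_padding.unpad_input`, and the exact return values are illustrative):

```python
# Newer flash-attn returns one extra value from unpad_input, so a
# fixed 4-way unpack at the call site raises ValueError.
def fake_unpad_input(x):
    # Stand-in for flash-attn >= 2.7.0: returns 5 values instead of 4.
    return x, [0], [0, 1], 1, None

try:
    hidden, indices, cu_seqlens, max_seqlen = fake_unpad_input("h")
except ValueError as e:
    print(e)  # too many values to unpack (expected 4)
```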

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guidelines?
  • Did you write any new necessary tests?
  • Did you log the throughput and loss you get to ensure the PR works as expected in actual training?
  • Did you log the memory usage? You can use this tool to understand the memory usage breakdown in nanotron.
  • If you modified anything related to checkpoints, did you verify that saving and reloading checkpoints still works correctly?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Stillerman Stillerman changed the title Fix unpacking issue caused by newer Flash Attention [WIP] Fix unpacking issue caused by newer Flash Attention Mar 5, 2025
@Stillerman Stillerman changed the title [WIP] Fix unpacking issue caused by newer Flash Attention Fix unpacking issue caused by newer Flash Attention Mar 5, 2025
Member

@NouamaneTazi NouamaneTazi left a comment

Try to make it backward compatible!

@Stillerman
Contributor Author

We can unpack an unknown number of values with `*_`! Tested with

python -m torch.distributed.run --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1

Tested on flash_attn-2.6.3 and flash_attn-2.7.4.post1; both work. 2.7.4.post1 does not work on main right now.
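The pattern can be sketched like this (a minimal illustration with dummy tuples, not the nanotron source; the variable names are assumptions):

```python
# Starred unpacking absorbs any trailing values, so the same call site
# works whether the function returns 4 values (flash-attn < 2.7.0)
# or 5 (flash-attn >= 2.7.0).
old_return = ("hidden", "indices", "cu_seqlens", 7)           # 4 values
new_return = ("hidden", "indices", "cu_seqlens", 7, "extra")  # 5 values

for ret in (old_return, new_return):
    hidden, indices, cu_seqlens, max_seqlen, *_ = ret
    print(max_seqlen)  # 7 in both cases
```

Any surplus values land in the throwaway list `_`, which is empty for the old 4-value return, so no version check is needed.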

Development

Successfully merging this pull request may close these issues.

Cannot run the Model generated from the example script