
create samples with padding to avoid truncations #186

Merged
sohamparikh merged 71 commits into main from soham/padded-sampler on Apr 1, 2025

Conversation

@sohamparikh (Member) commented Mar 13, 2025

✨ Description

Provide an option to prevent document truncations while packing into sequences. Preventing truncations has been shown to significantly boost downstream model quality, especially in supervised fine-tuning. Turning this on will hurt training throughput.
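For illustration, a minimal sketch (not the Fast-LLM implementation; the function name, the drop rule for over-long documents, and the capacity handling are assumptions) of greedy packing that pads a sample instead of truncating a document across the sample boundary:

def pack_with_padding(doc_lengths: list[int], capacity: int) -> list[list[int]]:
    # Each sample is represented by the list of document lengths packed into it.
    samples: list[list[int]] = [[]]
    used = 0
    for length in doc_lengths:
        if length > capacity:
            continue  # assumed rule: documents longer than one sample are dropped
        if used + length > capacity:
            samples.append([])  # pad the remainder of the current sample, start a new one
            used = 0
        samples[-1].append(length)
        used += length
    return samples

Padding here costs only throughput (padded positions carry no training signal), which is the trade-off noted above.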

Closes #192

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

@tscholak (Collaborator)

Awesome! Could we use these tests for unit tests?

@sohamparikh (Member, Author) commented Mar 26, 2025

@jlamypoirier I believe we would need to filter long documents for correct sampling (and bring back the index map). In cases with too many long documents, or when they're highly concentrated around some indices, token_start_cumsum_index * TOKEN_CUMSUM_RATE can be low enough to re-sample some documents.

For example, with seqlen=4, doc_sizes=[8, 2, 9, 1, 6, 7, 4, 1, 2, 9, 10, 11, 4, 9, 9, ...] and token_cumsum_rate=3, we should have samples {0: [2, 1], 1: [4, 1], 2: [2], ...}.
Since token_cumsum=[0, 9, 20, ...] here, at index=2 we have token_count=9, token_start=10, and document_sampling_index=3, which will pick [4, 1] instead of [2].

Found these issues when I also included longer documents in my tests.
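For reference, running this example through a greedy padded-packing pass like the sketch in the description, assuming each sample holds seqlen + 1 = 5 tokens and longer documents are dropped, reproduces the expected samples:

doc_sizes = [8, 2, 9, 1, 6, 7, 4, 1, 2, 9, 10, 11, 4, 9, 9]
# Kept documents (length <= 5): 2, 1, 4, 1, 2, 4
print(pack_with_padding(doc_sizes, capacity=4 + 1))
# [[2, 1], [4, 1], [2], [4]] -- matches {0: [2, 1], 1: [4, 1], 2: [2], ...};
# any cumsum-based document lookup has to agree with this reference packing.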

@jlamypoirier (Collaborator)

> @jlamypoirier I believe we would need to filter long documents for correct sampling (and bring back the index map). In cases with too many long documents, or when they're highly concentrated around some indices, token_start_cumsum_index * TOKEN_CUMSUM_RATE can be low enough to re-sample some documents.

That seems to be an error in build_padded_token_cumsum. It's excluding the long documents from the sample count but should include them (with length 0) so the index matches.
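Illustrating the index-alignment point (a sketch only, not the actual build_padded_token_cumsum, which also has to account for padding at sample boundaries): long documents keep a slot in the cumulative sum with length 0, so position i in the cumsum still corresponds to document i.

import numpy as np

def padded_token_cumsum_sketch(doc_sizes, capacity):
    sizes = np.asarray(doc_sizes)
    # Long documents contribute 0 tokens instead of being dropped from the array,
    # so document indices and cumsum positions stay aligned.
    effective = np.where(sizes > capacity, 0, sizes)
    return np.concatenate(([0], np.cumsum(effective)))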

@tscholak (Collaborator)

Can we please make better tests that catch these things? @sohamparikh, your testing is great because it already uncovered some issues, but it would be even better in a test suite.

@sohamparikh (Member, Author) commented Mar 27, 2025

> it would be even better in a test suite

Do you mean adding a unit test with the example I shared earlier, or something else?

test_sampling.py::test_gpt_sample_padding already covers these kinds of cases; I just presented a simpler example.

@tscholak (Collaborator)

You were saying earlier that you're doing some end-to-end testing with simulated sequences and a separate greedy implementation. I'd like to see that as a test in Fast-LLM.

@sohamparikh (Member, Author)

test_gpt_sample_padding is exactly that.
We can simulate many more seeds (~1000) as well, but it might take a minute or two to run. If that's okay, I can do that.

@jlamypoirier (Collaborator)

I don't think that's a seed issue. The bug should have happened whenever there are discarded documents, and that clearly happens in the test, so something else must be missing...

@tscholak (Collaborator)

> test_gpt_sample_padding is exactly that. We can simulate many more seeds (~1000) as well, but it might take a minute or two to run. If that's okay, I can do that.

We can have slow tests. We just need to mark them as such so that they become optional.
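One common way to do this with pytest (assuming the suite doesn't already have an equivalent mechanism) is a custom slow marker that is skipped unless explicitly requested:

# conftest.py (sketch): register a "slow" marker and skip slow tests by default.
import pytest

def pytest_addoption(parser):
    parser.addoption("--run-slow", action="store_true", default=False, help="run slow tests")

def pytest_configure(config):
    config.addinivalue_line("markers", "slow: marks a test as slow")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-slow"):
        return
    skip_slow = pytest.mark.skip(reason="needs --run-slow to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)

A large seed sweep in test_gpt_sample_padding could then be decorated with @pytest.mark.slow and kept out of the default run.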

@sohamparikh (Member, Author)

> I don't think that's a seed issue. The bug should have happened whenever there are discarded documents, and that clearly happens in the test, so something else must be missing...

Adding more seeds can cover more edge cases that aren't present in the first few. The latest bug didn't occur in the few initial seeds I'd tried, even though there were long documents.

@jlamypoirier (Collaborator)

The bug discussed above is a common case, not an edge case. If the tests didn't fail before it means something is wrong with the test and adding more seeds won't change anything.

We can have slow tests (marked as such), but only if really needed, as they can slow down development. (We need to make sure all tests pass for each PR.)

def test_gpt_sample_padding(seed):
    vocab_size = 131072
    np.random.seed(seed)
    num_sequences = np.random.randint(1, 1000)

Review comment (Collaborator):

No need for such big numbers. O(10) seqlens and sequence counts should be enough.

Reply (Member Author):

We won't need to mark it as slow in that case...

@sohamparikh (Member, Author)

@jlamypoirier do the changes look ok now?

I don't see anything wrong with the Python verification. To be clear about the bugs caught with it: the long documents issue wasn't caught in the first 1-2 seeds I tried, but I saw it after I added a few more manually. The larger simulations caught the bug where num_tokens_unshuffled for padding was computed the same way as before.

)
]
num_tokens = out[-1]

Review comment (Collaborator):

Since we are throwing away long documents, we could end up not generating enough tokens. At the very least we need to add a check for it. Or maybe exclude long documents from tokens_per_epoch?

]
num_tokens = out[-1]
# TODO: should this instead be a low multiple of seqlen + 1, or a fraction of num_samples?
if num_tokens <= self._sequence_length + 1:

Review comment (Collaborator):

Doesn't make sense. We need the whole self._num_samples * (self._sequence_length + 1) to be generated, otherwise training will fail. But the check can't be here because what we care about is the sum of shuffled and unshuffled tokens.
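A sketch of the guarantee being described; the names here are hypothetical, the point is only that the check has to look at the combined shuffled and unshuffled totals:

def check_enough_tokens(num_tokens_shuffled: int, num_tokens_unshuffled: int,
                        num_samples: int, sequence_length: int) -> None:
    # The combined total must cover every training sample, otherwise training fails.
    assert num_tokens_shuffled + num_tokens_unshuffled >= num_samples * (sequence_length + 1)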

Reply (Member Author):

ah yes, makes sense

Reply (Member Author):

It made more sense to compute tokens_per_epoch like you said. Since it's a lower bound on the actual number of tokens (including padding), num_epochs is now an upper bound on the actual number of epochs needed, and I don't think we need to catch it anymore. This could end up being wasteful when there is a lot of padding, but it avoids failure during training.
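Roughly, the arithmetic behind that resolution (a sketch with hypothetical names, not the Fast-LLM code): tokens_per_epoch counts only documents that fit in one sample, so it understates the tokens actually produced per epoch once padding is included, and the derived num_epochs can only overshoot.

import math

def upper_bound_num_epochs(doc_lengths, sequence_length, num_samples):
    # Lower bound on tokens per epoch: ignore documents too long to fit in one sample.
    tokens_per_epoch = sum(l for l in doc_lengths if l <= sequence_length + 1)
    # Upper bound on the number of epochs needed to produce num_samples full samples.
    return math.ceil(num_samples * (sequence_length + 1) / tokens_per_epoch)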

@sohamparikh merged commit 58b6f8a into main on Apr 1, 2025
4 checks passed
@sohamparikh deleted the soham/padded-sampler branch on April 1, 2025 at 20:22
Successfully merging this pull request may close these issues: Option to avoid truncations while packing.