create samples with padding to avoid truncations #186
Conversation
Awesome! Could we use these tests for unit tests?
@jlamypoirier I believe we would need to filter long documents for correct sampling (and bring back the index map) in cases with too many long documents, or when they're highly concentrated around some indices. Found these issues when I also included longer documents in my tests.
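A minimal sketch of what that filtering could look like (hypothetical names and lengths; not the actual Fast-LLM sampling code):

```python
import numpy as np

sequence_length = 16                           # assumed sample length for illustration
document_sizes = np.array([5, 12, 3, 40, 7])   # made-up document lengths

# Drop documents that can never fit in a single padded sample, and keep an
# index map from kept positions back to the original document ids.
kept_mask = document_sizes <= sequence_length + 1
index_map = np.nonzero(kept_mask)[0]
kept_sizes = document_sizes[kept_mask]
```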
That seems to be an error in
Can we please make better tests that catch these things? @sohamparikh your testing is great because it already uncovered some issues, but it would be even better as part of a test suite.
Do you mean adding a unit test with the example I shared earlier, or something else?
You were saying earlier that you're doing some end-to-end testing with simulated sequences and a separate greedy implementation. I'd like to see that as a test in Fast-LLM. |
I don't think that's a seed issue. The bug should have happened whenever there are discarded documents, and that clearly happens in the test, so something else must be missing... |
We can have slow tests. We just need to mark them as such so that they become optional. |
Adding more seeds just covers more edge cases that aren't present in the first few. The latest bug didn't occur in the few initial seeds I'd tried, even though there were long documents.
The bug discussed above is a common case, not an edge case. If the tests didn't fail before, something is wrong with the test, and adding more seeds won't change anything. We can have slow tests (marked as such), but only if really needed, since they can slow down development. (We need to make sure all tests pass for each PR.)
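For reference, the usual pytest mechanism for opt-in slow tests (the marker name is a common convention, not something defined in this repo):

```python
import pytest

# Register the marker in pyproject.toml or pytest.ini, e.g.:
#   markers = ["slow: tests that take a long time to run"]
@pytest.mark.slow
def test_gpt_sample_padding_many_seeds():
    ...  # exhaustive seed sweep goes here

# Skip them by default in development with: pytest -m "not slow"
```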
tests/data/test_sampling.py (outdated)
```python
def test_gpt_sample_padding(seed):
    vocab_size = 131072
    np.random.seed(seed)
    num_sequences = np.random.randint(1, 1000)
```
No need for such big numbers. O(10) seqlens and sequence counts should be enough.
We won't need to mark it as slow in that case...
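A hedged sketch of an O(10)-sized parametrization along those lines (names and ranges are illustrative):

```python
import numpy as np
import pytest

@pytest.mark.parametrize("seed", range(8))
def test_gpt_sample_padding(seed):
    rng = np.random.default_rng(seed)
    sequence_length = int(rng.integers(4, 16))   # O(10) sequence lengths
    num_documents = int(rng.integers(1, 16))     # O(10) document counts
    # Include documents longer than a sample so the discard path is exercised.
    document_sizes = rng.integers(1, 2 * sequence_length, size=num_documents)
    ...  # build the dataset and compare against the greedy reference implementation
```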
@jlamypoirier do the changes look ok now? I don't see anything wrong with the Python verification. To be clear about the bugs caught with it: the long-documents issue wasn't caught in the first 1-2 seeds I tried, but I saw it after adding a few more manually. The larger simulations caught the bug where
```python
    )
]
num_tokens = out[-1]
```
Since we are throwing away long documents, we could end up not generating enough tokens. At the very least we need to add a check for it. Or maybe exclude long documents from `tokens_per_epoch`?
fast_llm/data/dataset/gpt/sampled.py (outdated)
```python
]
num_tokens = out[-1]
# TODO: should this instead be a low multiple of seqlen + 1, or a fraction of num_samples?
if num_tokens <= self._sequence_length + 1:
```
Doesn't make sense. We need the whole `self._num_samples * (self._sequence_length + 1)` to be generated, otherwise training will fail. But the check can't be here, because what we care about is the sum of shuffled and unshuffled tokens.
ah yes, makes sense
It made more sense to compute `tokens_per_epoch` like you said. Since it's a lower bound on the actual number of tokens (including padding), `num_epochs` is now an upper bound on the actual number of epochs needed, and I don't think we need to catch it anymore. This could end up being wasteful with a lot of padding, but it avoids failure during training.
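A back-of-the-envelope sketch of that bound (illustrative numbers; names mirror the discussion, not Fast-LLM internals):

```python
import math

import numpy as np

sequence_length = 16
num_samples = 100
document_sizes = np.array([5, 12, 3, 40, 7])

# Long documents are discarded, so count only the ones that fit.
tokens_per_epoch = int(document_sizes[document_sizes <= sequence_length + 1].sum())

# Producing num_samples samples consumes at most num_samples * (sequence_length + 1)
# document tokens (padding only lowers the real consumption), so this num_epochs
# is an upper bound on the epochs actually needed.
num_epochs = math.ceil(num_samples * (sequence_length + 1) / tokens_per_epoch)
```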
✨ Description
Provide an option to prevent document truncations while packing into sequences. Preventing truncations has been shown to significantly boost downstream model quality, especially in supervised fine-tuning. Turning this on will reduce training throughput.
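To make the option concrete, here is a toy sketch of padding-based packing (an illustration of the idea only, not the PR's implementation):

```python
def pack_without_truncation(document_sizes: list[int], sample_len: int) -> list[tuple[list[int], int]]:
    """Greedily pack whole documents into fixed-size samples.

    Returns (document sizes in the sample, padding tokens) pairs; a document
    that doesn't fit in the current sample starts a new one instead of being
    truncated, and documents longer than a sample are dropped entirely.
    """
    samples, current, used = [], [], 0
    for size in document_sizes:
        if size > sample_len:
            continue  # can never fit: dropped rather than truncated
        if used + size > sample_len:
            samples.append((current, sample_len - used))
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        samples.append((current, sample_len - used))
    return samples

# pack_without_truncation([5, 12, 3, 40, 7], 16)
# -> [([5], 11), ([12, 3], 1), ([7], 9)]
```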
Closes #192
🔍 Type of change
Select all that apply:
Testing
Performance Impact