
create samples with padding to avoid truncations #186

Merged
sohamparikh merged 71 commits into main from soham/padded-sampler on Apr 1, 2025

Conversation

@sohamparikh (Member) commented Mar 13, 2025

✨ Description

Provide an option to prevent document truncations while packing into sequences. Preventing truncations has been shown to significantly boost downstream model quality, especially in supervised fine-tuning. Turning this on will hurt training throughput.
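For illustration, a minimal sketch (not the Fast-LLM implementation; the function name, the drop rule for over-long documents, and the capacity handling are assumptions) of greedy packing that pads a sample instead of truncating a document across the sample boundary:

def pack_with_padding(doc_lengths: list[int], capacity: int) -> list[list[int]]:
    # Each sample is represented by the list of document lengths packed into it.
    samples: list[list[int]] = [[]]
    used = 0
    for length in doc_lengths:
        if length > capacity:
            continue  # assumed rule: documents longer than one sample are dropped
        if used + length > capacity:
            samples.append([])  # pad the remainder of the current sample, start a new one
            used = 0
        samples[-1].append(length)
        used += length
    return samples

Padding here costs only throughput (padded positions carry no training signal), which is the trade-off noted above.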

Closes #192

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

@tscholak (Collaborator)

Awesome! Could we use these tests for unit tests?

@sohamparikh (Member, Author) commented Mar 26, 2025

@jlamypoirier I believe we would need to filter long documents for correct sampling (and bring back the index map). In cases with too many long documents, or when they're highly concentrated around some indices, token_start_cumsum_index * TOKEN_CUMSUM_RATE can be low enough to re-sample some documents.

For example, with seqlen=4, doc_sizes=[8, 2, 9, 1, 6, 7, 4, 1, 2, 9, 10, 11, 4, 9, 9, ...] and token_cumsum_rate=3, we should have samples {0: [2, 1], 1: [4, 1], 2: [2], ...}.
Since token_cumsum=[0, 9, 20, ...] here, at index=2 we have token_count=9, token_start=10, and document_sampling_index=3, which will pick [4, 1] instead of [2].

Found these issues when I also included longer documents in my tests.
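For reference, running this example through a greedy padded-packing pass like the sketch in the description, assuming each sample holds seqlen + 1 = 5 tokens and longer documents are dropped, reproduces the expected samples:

doc_sizes = [8, 2, 9, 1, 6, 7, 4, 1, 2, 9, 10, 11, 4, 9, 9]
# Kept documents (length <= 5): 2, 1, 4, 1, 2, 4
print(pack_with_padding(doc_sizes, capacity=4 + 1))
# [[2, 1], [4, 1], [2], [4]] -- matches {0: [2, 1], 1: [4, 1], 2: [2], ...};
# any cumsum-based document lookup has to agree with this reference packing.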

@jlamypoirier (Collaborator)

> @jlamypoirier I believe we would need to filter long documents for correct sampling (and bring back the index map). In cases with too many long documents, or when they're highly concentrated around some indices, token_start_cumsum_index * TOKEN_CUMSUM_RATE can be low enough to re-sample some documents.

That seems to be an error in build_padded_token_cumsum. It's excluding the long documents from the sample count but should include them (with length 0) so the index matches.
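Illustrating the index-alignment point (a sketch only, not the actual build_padded_token_cumsum, which also has to account for padding at sample boundaries): long documents keep a slot in the cumulative sum with length 0, so position i in the cumsum still corresponds to document i.

import numpy as np

def padded_token_cumsum_sketch(doc_sizes, capacity):
    sizes = np.asarray(doc_sizes)
    # Long documents contribute 0 tokens instead of being dropped from the array,
    # so document indices and cumsum positions stay aligned.
    effective = np.where(sizes > capacity, 0, sizes)
    return np.concatenate(([0], np.cumsum(effective)))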

@tscholak (Collaborator)

Can we please make better tests that catch these things? @sohamparikh, your testing is great because it already uncovered some issues, but it would be even better in a test suite.

@sohamparikh (Member, Author) commented Mar 27, 2025

> it would be even better in a test suite

Do you mean adding a unit test with the example I shared earlier, or something else?

test_sampling.py::test_gpt_sample_padding already covers these kinds of cases; I just presented a simpler example.

@tscholak (Collaborator)

You were saying earlier that you're doing some end-to-end testing with simulated sequences and a separate greedy implementation. I'd like to see that as a test in Fast-LLM.

@sohamparikh (Member, Author)

test_gpt_sample_padding is exactly that.
We can simulate many more seeds (~1000) as well, but it might take a minute or two to run. If that's okay, I can do that.

@jlamypoirier (Collaborator)

I don't think that's a seed issue. The bug should have happened whenever there are discarded documents, and that clearly happens in the test, so something else must be missing...

@tscholak (Collaborator)

> test_gpt_sample_padding is exactly that. We can simulate many more seeds (~1000) as well, but it might take a minute or two to run. If that's okay, I can do that.

We can have slow tests. We just need to mark them as such so that they become optional.
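One common way to do this with pytest (assuming the suite doesn't already have an equivalent mechanism) is a custom slow marker that is skipped unless explicitly requested:

# conftest.py (sketch): register a "slow" marker and skip slow tests by default.
import pytest

def pytest_addoption(parser):
    parser.addoption("--run-slow", action="store_true", default=False, help="run slow tests")

def pytest_configure(config):
    config.addinivalue_line("markers", "slow: marks a test as slow")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-slow"):
        return
    skip_slow = pytest.mark.skip(reason="needs --run-slow to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)

A large seed sweep in test_gpt_sample_padding could then be decorated with @pytest.mark.slow and kept out of the default run.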

@sohamparikh (Member, Author)

> I don't think that's a seed issue. The bug should have happened whenever there are discarded documents, and that clearly happens in the test, so something else must be missing...

Adding more seeds can cover more edge cases that aren't present in the first few. The latest bug didn't occur in the few initial seeds I'd tried, even though there were long documents.

@jlamypoirier (Collaborator)

The bug discussed above is a common case, not an edge case. If the tests didn't fail before it means something is wrong with the test and adding more seeds won't change anything.

We can have slow tests (marked as such), but only if really needed, as they can slow down development. (We need to make sure all tests pass for each PR.)

def test_gpt_sample_padding(seed):
    vocab_size = 131072
    np.random.seed(seed)
    num_sequences = np.random.randint(1, 1000)

Review comment (Collaborator):

No need for such big numbers. O(10) seqlens and sequence counts should be enough.

Reply (Member Author):

We won't need to mark it as slow in that case...

@sohamparikh (Member, Author)

@jlamypoirier do the changes look ok now?

I don't see anything wrong with the Python verification. To be clear about the bugs caught with it: the long documents issue wasn't caught in the first 1-2 seeds I tried, but I saw it after I added a few more manually. The larger simulations caught the bug where num_tokens_unshuffled for padding was computed the same way as before.

)
]
num_tokens = out[-1]

Review comment (Collaborator):

Since we are throwing away long documents, we could end up not generating enough tokens. At the very least we need to add a check for it. Or maybe exclude long documents from tokens_per_epoch?

]
num_tokens = out[-1]
# TODO: should this instead be a low multiple of seqlen + 1, or a fraction of num_samples?
if num_tokens <= self._sequence_length + 1:

Review comment (Collaborator):

Doesn't make sense. We need the whole self._num_samples * (self._sequence_length + 1) to be generated, otherwise training will fail. But the check can't be here because what we care about is the sum of shuffled and unshuffled tokens.
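A sketch of the guarantee being described; the names here are hypothetical, the point is only that the check has to look at the combined shuffled and unshuffled totals:

def check_enough_tokens(num_tokens_shuffled: int, num_tokens_unshuffled: int,
                        num_samples: int, sequence_length: int) -> None:
    # The combined total must cover every training sample, otherwise training fails.
    assert num_tokens_shuffled + num_tokens_unshuffled >= num_samples * (sequence_length + 1)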

Reply (Member Author):

ah yes, makes sense

Reply (Member Author):

It made more sense to compute tokens_per_epoch like you said. Since it's a lower bound on the actual number of tokens (including padding), num_epochs is now an upper bound on the actual number of epochs needed, and I don't think we need to catch it anymore. This could end up being wasteful when there is a lot of padding, but it avoids failure during training.
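Roughly, the arithmetic behind that resolution (a sketch with hypothetical names, not the Fast-LLM code): tokens_per_epoch counts only documents that fit in one sample, so it understates the tokens actually produced per epoch once padding is included, and the derived num_epochs can only overshoot.

import math

def upper_bound_num_epochs(doc_lengths, sequence_length, num_samples):
    # Lower bound on tokens per epoch: ignore documents too long to fit in one sample.
    tokens_per_epoch = sum(l for l in doc_lengths if l <= sequence_length + 1)
    # Upper bound on the number of epochs needed to produce num_samples full samples.
    return math.ceil(num_samples * (sequence_length + 1) / tokens_per_epoch)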

@sohamparikh merged commit 58b6f8a into main on Apr 1, 2025
4 checks passed
@sohamparikh deleted the soham/padded-sampler branch on April 1, 2025 at 20:22
Successfully merging this pull request may close these issues: Option to avoid truncations while packing.