Concat prompt and completion cols for tokenizing #257


Draft · wants to merge 41 commits into main

Conversation

nitsanluke (Contributor)

✨ Description

Migrated from #248; this PR allows a dataset with prompt and completion columns specifically, and in general any pair of text columns (e.g. question and answer), to be combined and tokenized. (It is limited to two input columns, which covers the majority of use cases.) Additionally, if the user needs a loss-masking span based on the prompt (part of the sequence), they can include one as well.
Note: Merge after #255
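As a rough illustration of the intended behavior (the function name `concat_with_span` and its signature are hypothetical, not the PR's actual API): combining two text columns with a delimiter while recording an inclusive character span over the prompt portion might look like:

```python
def concat_with_span(prompt: str, completion: str, delimiter: str = " "):
    """Concatenate prompt and completion, returning the combined text
    plus an inclusive character span covering the prompt portion."""
    text = prompt + delimiter + completion
    # Spans are inclusive on both ends, matching the PR's convention.
    span = (0, len(prompt) - 1)
    return text, span

text, span = concat_with_span("What is 2 + 2?", "4")
```

Such a span can later be used to mask the prompt characters (and, after tokenization, the corresponding tokens) out of the training loss.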

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Create a new data source schema for prompt & completion style datasets
  2. Include a concatenation function and loss-masking span creation

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Sample config:

```yaml
loading_workers: 1
tokenize_workers: 1
saving_workers: 1
output_path: ./debug_test/gsm8k
dataset:
  path: openai/gsm8k
  config_name: main
  split: train
  trust_remote_code: true
  source_schema:
    prompt_column: question
    completion_column: answer
    delimiter: " "

tokenizer:
  path: /mnt/slam_checkpoints/upstream/Mistral-Nemo-Base-2407/
```

```python
        (0, len(str(example[source_schema.prompt_column])) - 1)
    ]  # spans are inclusive
},
batched=False,
```
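For context, the fragment above reads like the tail of a `datasets.Dataset.map` call. A self-contained sketch of what such a mapping function might look like (the `source_schema` stand-in, the function name, and the output column names are assumptions, not the PR's exact code):

```python
from types import SimpleNamespace

# Hypothetical stand-in for the PR's source-schema config.
source_schema = SimpleNamespace(
    prompt_column="question", completion_column="answer", delimiter=" "
)

def add_loss_masking_span(example):
    prompt = str(example[source_schema.prompt_column])
    completion = str(example[source_schema.completion_column])
    example["text"] = prompt + source_schema.delimiter + completion
    example["loss_masking_spans"] = [
        (0, len(prompt) - 1)  # spans are inclusive
    ]
    return example

# dataset = dataset.map(add_loss_masking_span, batched=False)
```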
Member

same comment as above about num_proc

Collaborator

@sohamparikh @nitsanluke, I was thinking about the tradeoff we're making here between these two options:

  1. (current approach) Merge the two columns into one, add spans, then split the text up again so the pieces can be tokenized individually. The benefit is that we can reuse the processing for the default schema, so it's a form of input normalization; the downside is that this is not very straightforward or explicit.
  2. (alternative) Tokenize the two columns individually, then merge.

I feel that with more input schemas being added, some refactoring of what we introduce here is unavoidable. Can we be more forward-looking?
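Option 2 can be sketched roughly as follows (the helper name is hypothetical, and the toy word-splitting encoder stands in for a real tokenizer; a real implementation would also need to handle special tokens and the delimiter carefully):

```python
def tokenize_then_merge(prompt, completion, encode):
    """Tokenize each column separately, then concatenate the token ids.
    The prompt span falls out directly in token space (inclusive)."""
    prompt_ids = encode(prompt)
    completion_ids = encode(completion)
    ids = prompt_ids + completion_ids
    span = (0, len(prompt_ids) - 1)  # inclusive token-level span
    return ids, span

# Toy encoder: one "token" per whitespace-separated word.
ids, span = tokenize_then_merge("a b", "c", lambda s: s.split())
```

The upside is that the span is computed directly in token space, avoiding the merge/split round-trip through character offsets.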

Member

I agree with tokenizing the columns individually; we can do the same when we support chat templates.
#262

Base automatically changed from restructure_dataset_config_for_multi_source to main June 16, 2025 14:47
4 participants