Concat prompt and completion cols for tokenizing #257


Draft · wants to merge 41 commits into main

Conversation

nitsanluke (Contributor)

✨ Description

Migrated from #248; this PR allows a dataset with prompt and completion columns specifically, and in general any pair of text columns (e.g. question and answer), to be combined and tokenized. (It is limited to two input columns, which covers the majority of use cases.) Additionally, if the user needs a loss-masking span based on the prompt (part of the sequence), they can include one as well.
Note: Merge after #255
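As a rough illustration of the intended behavior (the function name `concat_with_span` and its signature are hypothetical, not the PR's actual API): combining two text columns with a delimiter while recording an inclusive character span over the prompt portion might look like:

```python
def concat_with_span(prompt: str, completion: str, delimiter: str = " "):
    """Concatenate prompt and completion, returning the combined text
    plus an inclusive character span covering the prompt portion."""
    text = prompt + delimiter + completion
    # Spans are inclusive on both ends, matching the PR's convention.
    span = (0, len(prompt) - 1)
    return text, span

text, span = concat_with_span("What is 2 + 2?", "4")
```

Such a span can later be used to mask the prompt characters (and, after tokenization, the corresponding tokens) out of the training loss.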

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Create a new data source schema for prompt & completion style datasets
  2. Include a concatenation function and loss-masking span creation

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Sample config:

```yaml
loading_workers: 1
tokenize_workers: 1
saving_workers: 1
output_path: ./debug_test/gsm8k
dataset:
  path: openai/gsm8k
  config_name: main
  split: train
  trust_remote_code: true
  source_schema:
    prompt_column: question
    completion_column: answer
    delimiter: " "

tokenizer:
  path: /mnt/slam_checkpoints/upstream/Mistral-Nemo-Base-2407/
```

```python
        (0, len(str(example[source_schema.prompt_column])) - 1)
    ]  # spans are inclusive
},
batched=False,
```
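For context, the fragment above reads like the tail of a `datasets.Dataset.map` call. A self-contained sketch of what such a mapping function might look like (the `source_schema` stand-in, the function name, and the output column names are assumptions, not the PR's exact code):

```python
from types import SimpleNamespace

# Hypothetical stand-in for the PR's source-schema config.
source_schema = SimpleNamespace(
    prompt_column="question", completion_column="answer", delimiter=" "
)

def add_loss_masking_span(example):
    prompt = str(example[source_schema.prompt_column])
    completion = str(example[source_schema.completion_column])
    example["text"] = prompt + source_schema.delimiter + completion
    example["loss_masking_spans"] = [
        (0, len(prompt) - 1)  # spans are inclusive
    ]
    return example

# dataset = dataset.map(add_loss_masking_span, batched=False)
```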
Member

same comment as above about num_proc

Collaborator

@sohamparikh @nitsanluke, I was thinking about the tradeoff we're making here between these two options:

  1. (current approach) Merge the two columns into one, add spans, then split the text up again so the pieces can be tokenized individually. The benefit is that we can reuse the processing for the default schema, so it's a form of input normalization; the downside is that this is not very straightforward or explicit.
  2. (alternative) Tokenize the two columns individually, then merge.

I feel that with more input schemas being added, some refactoring of what we introduce here is unavoidable. Can we be more forward-looking?
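Option 2 can be sketched roughly as follows (the helper name is hypothetical, and the toy word-splitting encoder stands in for a real tokenizer; a real implementation would also need to handle special tokens and the delimiter carefully):

```python
def tokenize_then_merge(prompt, completion, encode):
    """Tokenize each column separately, then concatenate the token ids.
    The prompt span falls out directly in token space (inclusive)."""
    prompt_ids = encode(prompt)
    completion_ids = encode(completion)
    ids = prompt_ids + completion_ids
    span = (0, len(prompt_ids) - 1)  # inclusive token-level span
    return ids, span

# Toy encoder: one "token" per whitespace-separated word.
ids, span = tokenize_then_merge("a b", "c", lambda s: s.split())
```

The upside is that the span is computed directly in token space, avoiding the merge/split round-trip through character offsets.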

Member

I agree with tokenizing the columns individually; we can do the same when we support chat templates.
#262

Base automatically changed from restructure_dataset_config_for_multi_source to main June 16, 2025 14:47
4 participants