Concat prompt and completion cols for tokenizing #257
base: main
Conversation
…nfig_for_multi_source
Co-authored-by: Torsten Scholak <[email protected]>
…com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source
Co-authored-by: Torsten Scholak <[email protected]>
…com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source
(0, len(str(example[source_schema.prompt_column])) - 1)
]  # spans are inclusive
},
batched=False,
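For context, a minimal sketch of what the merge-then-span normalization in the diff above amounts to. The column names, the `text`/`spans` output keys, and the example values are placeholders for illustration, not the exact Fast-LLM code:

```python
def merge_columns(example, prompt_column="prompt", completion_column="completion"):
    """Concatenate two text columns and record the prompt's character span."""
    prompt = str(example[prompt_column])
    return {
        "text": prompt + str(example[completion_column]),
        # spans are inclusive: mark the prompt's character range for loss masking
        "spans": [(0, len(prompt) - 1)],
    }

row = merge_columns({"prompt": "Q: 2+2? ", "completion": "A: 4"})
# row["text"] == "Q: 2+2? A: 4", row["spans"] == [(0, 7)]
```

The merged row can then be tokenized by the same path as the default single-text schema, which is the "input normalization" benefit discussed below.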
same comment as above about num_proc
@sohamparikh @nitsanluke, I was thinking about the tradeoff we're making here between these two options:
- (right now) merge the two columns into one, add spans, then split it up again so each part can be tokenized individually. The benefit is that we can reuse the processing for the default schema, so it's a form of input normalization. The downside is that this is not very straightforward or explicit.
- (could be) tokenize the two columns individually, then merge the token ids.
I feel like, with more input schemas being added, some refactoring of what we introduce here is unavoidable. Can we be more forward-looking?
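A toy sketch of the second option: tokenize the two columns individually, then merge the token ids and derive the loss-mask span from the prompt length. The whitespace "tokenizer" and the column names are placeholders, not the Fast-LLM API:

```python
def tokenize(text):
    # stand-in for a real subword tokenizer; one id per whitespace token
    return [hash(tok) % 50257 for tok in text.split()]

def tokenize_pair(example, prompt_column="prompt", completion_column="completion"):
    prompt_ids = tokenize(example[prompt_column])
    completion_ids = tokenize(example[completion_column])
    return {
        "input_ids": prompt_ids + completion_ids,
        # span over the prompt tokens; inclusive on both ends, matching
        # the "spans are inclusive" convention in the diff above
        "loss_mask_span": (0, len(prompt_ids) - 1),
    }

row = tokenize_pair({"prompt": "What is 2 + 2 ?", "completion": "It is 4 ."})
# 6 prompt tokens + 4 completion tokens; span covers the prompt: (0, 5)
```

This keeps the span arithmetic in token space from the start, avoiding the merge/split round trip through character offsets.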
I agree with tokenizing the columns individually; we can do the same when we support chat templates.
#262
Co-authored-by: sohamparikh <[email protected]>
✨ Description
Migrated from #248. This PR allows a dataset with prompt and completion columns, and in general any pair of text columns (e.g. question and answer), to be combined and tokenized. (It is limited to two input columns, which covers the majority of use cases.) Additionally, if the user needs a loss-mask span based on the prompt (part of the sequence), they can include one as well.
Note: Merge after #255
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Sample config: