
Improve performance of synthetic dataset prompt creation to match a given number of tokens #187


Open
markurtz opened this issue Jun 13, 2025 · 0 comments


Currently, the synthetic dataset prompt creation process uses a binary search to converge on the desired prompt length, which sends a large number of requests to the relatively expensive tokenize function of the processor. #162 recently enabled a significantly cheaper way to generate a prompt of a given length: grab a prompt of an estimated length, tokenize it, truncate the token array to the desired length, and re-encode it back to text. Provided we start from a reasonable multiplier (likely 3 or 4 tokens per word, which should be double-checked against the average for current tokenizers, or a tokens-per-word ratio calculated dynamically as prompts are generated), most prompts will require only a single tokenization call. If we also add a safety fallback that multiplies the target word count by a reasonable constant whenever the tokenized result falls short of the token length constraint, we can guarantee convergence.
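
A minimal sketch of the idea, not the actual implementation: it assumes a Hugging Face-style tokenizer (`encode`/`decode`) and a hypothetical `sample_words(n)` helper that returns `n` words of source text. The default `tokens_per_word` ratio is an assumption to be validated against real tokenizers (or computed dynamically), and the function/parameter names are illustrative only:

```python
import math
from typing import Callable

from transformers import PreTrainedTokenizerBase


def create_prompt(
    tokenizer: PreTrainedTokenizerBase,
    sample_words: Callable[[int], str],   # hypothetical: num_words -> source text
    target_tokens: int,
    tokens_per_word: float = 1.3,         # assumed ratio; double-check per tokenizer
    safety_factor: float = 2.0,           # growth constant when the guess undershoots
    max_attempts: int = 5,
) -> str:
    """Build a prompt of ~target_tokens tokens with (ideally) one tokenize call."""
    # Initial guess: enough words that tokenizing should meet or exceed the target.
    num_words = max(1, math.ceil(target_tokens / tokens_per_word))
    text = sample_words(num_words)
    token_ids = tokenizer.encode(text, add_special_tokens=False)

    # Safety fallback: if we undershot the target, grow the word count by a
    # constant factor and re-tokenize until it is met or attempts run out.
    attempts = 1
    while len(token_ids) < target_tokens and attempts < max_attempts:
        num_words = math.ceil(num_words * safety_factor)
        text = sample_words(num_words)
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        attempts += 1

    # Truncate the token array to the desired length and re-encode back to text.
    return tokenizer.decode(token_ids[:target_tokens], skip_special_tokens=True)
```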

The test for this will be to ensure that the ratio of tokenize calls to prompts is approximately one, that the average number of truncated tokens relative to the desired number of tokens is reasonably small (within roughly 20%), and that the number of prompt tokens reported after running a benchmark through vLLM matches the desired number of prompt tokens.
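
One way the first two checks could be wired up, reusing the hypothetical `create_prompt` sketch above and counting `encode` calls by wrapping the tokenizer instance (the end-to-end vLLM benchmark check is not covered here; names and thresholds are illustrative):

```python
import statistics


def check_prompt_efficiency(tokenizer, sample_words, targets):
    """Rough check: ~1 encode call per prompt and small truncation waste."""
    original_encode = tokenizer.encode
    encode_lengths = []  # token counts returned by every encode call

    def counting_encode(*args, **kwargs):
        ids = original_encode(*args, **kwargs)
        encode_lengths.append(len(ids))
        return ids

    tokenizer.encode = counting_encode
    try:
        waste_ratios = []
        for target in targets:
            create_prompt(tokenizer, sample_words, target)
            # Tokens discarded by truncation on the final (accepted) encode.
            waste_ratios.append(max(0, encode_lengths[-1] - target) / target)

        calls_per_prompt = len(encode_lengths) / len(targets)
        assert calls_per_prompt < 1.5, f"too many tokenize calls: {calls_per_prompt:.2f}"
        avg_waste = statistics.mean(waste_ratios)
        assert avg_waste < 0.20, f"too many truncated tokens: {avg_waste:.2%}"
    finally:
        tokenizer.encode = original_encode
```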
