Currently, the synthetic dataset prompt creation process uses a binary search to converge on the desired prompt length, which sends a large number of requests to the relatively expensive tokenize function on the processor. Recently, #162 added a significantly cheaper way to generate a prompt of a given length: grab a candidate prompt that is at least as long as needed, tokenize it once, truncate the token array to the desired length, and decode it back to text. Provided we start with a reasonable tokens-per-word multiplier (likely 3 or 4, though this should be double-checked against the averages for current tokenizers, or calculated dynamically as prompts are generated), most prompts will require only a single tokenization call. Adding a safety step that multiplies the target word count by a reasonable constant whenever the tokenized result falls short of the token length constraint guarantees convergence.
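A minimal sketch of the generate-truncate-decode loop, assuming a Hugging Face tokenizer; the helper `create_text_of_n_words`, the constant values, and the function name are illustrative placeholders rather than the project's actual API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the processor

START_TOKENS_PER_WORD = 3  # the initial guess from this issue; should be validated
SAFETY_MULTIPLIER = 2      # widen the word budget if the first pass falls short


def create_text_of_n_words(n: int) -> str:
    # Placeholder source text; the real implementation would sample words
    # from the synthetic dataset's corpus.
    return " ".join(["lorem"] * n)


def prompt_of_token_length(target_tokens: int) -> str:
    """Generate text, tokenize once, truncate, and decode back to text.

    If the first tokenization comes up short, the word budget grows by a
    constant factor and we retry, which guarantees convergence.
    """
    words = max(1, target_tokens // START_TOKENS_PER_WORD)
    while True:
        token_ids = tokenizer.encode(create_text_of_n_words(words))
        if len(token_ids) >= target_tokens:
            # Truncate the token array to the target and decode it to text.
            return tokenizer.decode(token_ids[:target_tokens])
        words *= SAFETY_MULTIPLIER
```

Whether the first pass overshoots depends entirely on the accuracy of the tokens-per-word estimate; each miss multiplies the word budget by a constant, so the retry count is bounded, and tracking the observed ratio across generated prompts (as suggested above) would keep most generations to a single tokenization call.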
The test for this will verify that the ratio of tokenize calls to prompts is approximately one, that the average number of truncated tokens relative to the desired number of tokens is reasonably small (under 20%), and that the number of prompt tokens reported after running a benchmark through vLLM matches the number desired.
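A hedged sketch of that acceptance check, building on the sketch above; the 20% truncation bound comes from this issue, while the 1.1 call-ratio tolerance, the target lengths, and the use of `unittest.mock` to count calls are assumptions:

```python
from unittest.mock import patch


def test_tokenize_call_ratio_and_prompt_lengths():
    targets = [16, 128, 1024]

    # Spy on tokenizer.encode so every tokenization inside the generator
    # is counted while still delegating to the real implementation.
    with patch.object(tokenizer, "encode", wraps=tokenizer.encode) as spy:
        prompts = [prompt_of_token_length(t) for t in targets]

    # Roughly one tokenize call per prompt, with a small retry allowance.
    assert spy.call_count / len(prompts) <= 1.1

    # Each prompt should re-encode to the desired length. This assumes the
    # tokenizer round-trips truncated token sequences cleanly; the
    # end-to-end vLLM benchmark described above is the stronger check.
    for target, prompt in zip(targets, prompts):
        assert len(tokenizer.encode(prompt)) == target
```

Measuring the average truncation surplus (the under-20% criterion) requires a counter inside the generator itself, since the number of discarded tokens is not observable from outside the call.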