/kind enhancement
What happened
The token rate limiter currently uses `tokenizer: tokenizer.NewSimpleEstimateTokenizer()`, which estimates tokens with a `/4` character heuristic (character count divided by 4) instead of model-accurate tokenization.
This causes inaccurate input token rate limiting:
- code-heavy prompts are significantly over-counted
- some model families may be under-counted
- users can hit limits earlier or later than expected
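To illustrate why a length-only estimate drifts, here is a minimal sketch of a characters-divided-by-4 estimator. The function `estimateTokens` is hypothetical and written only for this issue; it is not the actual `NewSimpleEstimateTokenizer()` implementation, which may round or count differently.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateTokens is a hypothetical stand-in for a chars/4 estimator.
// It looks only at the character count, so two prompts of equal length
// always get the same estimate regardless of content or target model.
func estimateTokens(text string) int {
	n := utf8.RuneCountInString(text)
	if n == 0 {
		return 0
	}
	return (n + 3) / 4 // integer division, rounded up
}

func main() {
	prose := "Please summarize the following paragraph in two sentences."
	code := "for(i=0;i<n;i++){a[i]=b[i]*c[i];}"

	// The estimate depends on length alone; a real tokenizer would
	// split these two strings very differently.
	fmt.Println("prose estimate:", estimateTokens(prose))
	fmt.Println("code estimate: ", estimateTokens(code))
}
```

Because the estimate ignores content, any gap between it and the model's real tokenization flows directly into the input quota.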
The issue only affects input token accounting. Output token accounting is already accurate because it uses the actual `completion_tokens` value returned in the model response.
A TikToken-based tokenizer already exists in the codebase, but using it directly as the default also introduces problems:
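For contrast, the output side can read an exact count from the response. The sketch below assumes an OpenAI-compatible response body with a `usage` object; the type and function names (`usage`, `completionResponse`, `outputTokens`) are hypothetical and do not refer to the rate limiter's real code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage mirrors the token counters in an OpenAI-compatible response.
type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
}

// completionResponse keeps only the field needed for accounting.
type completionResponse struct {
	Usage usage `json:"usage"`
}

// outputTokens returns the exact completion token count reported by the
// model server, so no estimation is needed on the output side.
func outputTokens(body []byte) (int, error) {
	var resp completionResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	return resp.Usage.CompletionTokens, nil
}

func main() {
	body := []byte(`{"usage":{"prompt_tokens":12,"completion_tokens":34,"total_tokens":46}}`)
	n, err := outputTokens(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("completion tokens:", n)
}
```

The input side has no equivalent signal at admission time, which is why it falls back to an estimate.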
- TikToken is GPT/OpenAI specific
- token estimation becomes inaccurate for non-GPT models (Claude, LLaMA, Mistral, etc.)
- remote tokenizer calls can increase latency under concurrency
Related discussion:
[#1100 comment](#1100)
That discussion also highlights another issue:
- tokenizer requests are currently sent to randomly selected vLLM pods
- tokenization latency becomes dependent on inference pod load
- under high concurrency this can create unpredictable admission latency
So the current local estimator solves latency and model-compatibility concerns, but its estimation error can still become large enough to affect practical quota enforcement.