Bug: Token rate limiter uses inaccurate heuristic tokenizer for input token accounting #1161

@nXtCyberNet

/kind enhancement

What happened

The token rate limiter currently uses:

tokenizer: tokenizer.NewSimpleEstimateTokenizer()

which estimates the token count as roughly the character count divided by 4, instead of running model-accurate tokenization.
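For illustration, a characters/4 estimator can be sketched as below. This is a minimal sketch and not the project's code: `estimateTokens` is a hypothetical name, and counting runes and rounding up are assumptions, since the issue does not show how `NewSimpleEstimateTokenizer` is implemented.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateTokens is a hypothetical stand-in for a characters/4 heuristic.
// Assumptions: it counts Unicode code points and rounds up; the real
// SimpleEstimateTokenizer may count bytes or round differently.
func estimateTokens(text string) int {
	n := utf8.RuneCountInString(text)
	return (n + 3) / 4 // integer ceiling of n/4; yields 0 for empty input
}

func main() {
	prose := "The quick brown fox jumps over the lazy dog."
	code := "for i := 0; i < n; i++ {\n\tsum += a[i]\n}"

	// The estimate depends only on length, so two inputs of equal length
	// get the same count regardless of how a real tokenizer would split them.
	fmt.Println("prose estimate:", estimateTokens(prose))
	fmt.Println("code estimate: ", estimateTokens(code))
}
```

Because the result is a function of length alone, any content whose real tokens-per-character ratio differs from 1/4 is mis-counted in proportion to that difference.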

This causes inaccurate input token rate limiting:

  • code-heavy prompts are significantly over-counted
  • some model families may be under-counted
  • users can hit limits earlier or later than expected

The issue only affects input token accounting. Output token accounting is already accurate because it uses actual completion_tokens returned by the model response.
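The asymmetry can be seen in how the output side obtains its number. The sketch below assumes an OpenAI-compatible response body with a `usage` object; the type and function names (`usage`, `completionResponse`, `outputTokens`) are hypothetical and not taken from the codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage mirrors the one field of an OpenAI-compatible "usage" object that
// this sketch needs. The struct itself is illustrative only.
type usage struct {
	CompletionTokens int `json:"completion_tokens"`
}

// completionResponse keeps only the part of the response body needed here.
type completionResponse struct {
	Usage usage `json:"usage"`
}

// outputTokens returns the completion token count reported by the model
// server, so no local estimation is involved on the output side.
func outputTokens(body []byte) (int, error) {
	var resp completionResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("decode response: %w", err)
	}
	return resp.Usage.CompletionTokens, nil
}

func main() {
	// Example body made up for this sketch.
	body := []byte(`{"usage":{"prompt_tokens":12,"completion_tokens":34,"total_tokens":46}}`)
	n, err := outputTokens(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("completion tokens:", n)
}
```

Input tokens, by contrast, have to be counted before the request is admitted, which is why the limiter falls back to an estimate there.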

A TikToken-based tokenizer already exists in the codebase, but making it the default would introduce problems of its own:

  • TikToken is GPT/OpenAI specific
  • token estimation becomes inaccurate for non-GPT models (Claude, LLaMA, Mistral, etc.)
  • remote tokenizer calls can increase latency under concurrency

Related discussion:
[#1100 comment](#1100)

That discussion also highlights another issue:

  • tokenizer requests are currently sent to a randomly selected vLLM pod
  • tokenization latency therefore depends on the load of that inference pod
  • under high concurrency this can make admission latency unpredictable

So the current local estimator avoids the latency and model-compatibility concerns, but its estimation error can still be large enough to affect quota enforcement in practice.

