Feature Request: top_logprobs pass-through on LLM Gateway (or explicit error instead of silent drop) #11

@dialethia

TL;DR

The Bankr LLM Gateway silently strips top_logprobs from requests routed to
models whose upstreams natively support it (verified on grok-4.20). Clients
get a successful-looking response with no logprobs field and no warning,
blocking a ~2.5× Brier-score calibration improvement for probability-estimation
workloads.

Asking for one of two things, in preference order:

  1. Pass the parameter through to xAI, OpenAI, DeepSeek, Qwen (the
    OpenAI-compatible upstreams that already expose logprobs natively) and
    return the token-level probabilities on the Anthropic-shaped response.
  2. Minimum viable fix: return an explicit 400 invalid_request_error
    when the parameter is unsupported, instead of silently dropping it.
    The Anthropic-family models already do this — please make it uniform.
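
To make the two options concrete, here is a minimal sketch of the check the gateway could run before forwarding a request. Everything in it is an assumption for illustration: the function name `check_top_logprobs`, the capability table, and the upstream keys are hypothetical, and I don't know how the gateway actually maps models to upstreams. Upstreams that are not listed (including partial-support cases like Gemini) are treated as unsupported here, so they get an explicit error rather than a silent drop.

```python
# Hypothetical capability table, loosely following the support matrix in this
# issue. The gateway's real model-to-upstream routing is not known here.
UPSTREAM_SUPPORTS_LOGPROBS = {
    "xai": True,
    "openai": True,
    "deepseek": True,
    "anthropic": False,
}


def check_top_logprobs(upstream: str, body: dict):
    """Return None if the request may proceed, else a (status, error_body) pair."""
    if "top_logprobs" not in body:
        return None  # parameter not requested, nothing to check
    if UPSTREAM_SUPPORTS_LOGPROBS.get(upstream, False):
        return None  # option 1: forward the parameter unchanged
    # Option 2: reject explicitly instead of dropping the parameter.
    return 400, {
        "type": "error",
        "error": {
            "type": "invalid_request_error",
            "message": f"top_logprobs is not supported for upstream '{upstream}'",
        },
    }
```

The point of the design is that there is no third branch: a request carrying `top_logprobs` is either forwarded as-is or rejected, never accepted and stripped.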

Full spec, reproducers, upstream support matrix, backwards-compatibility
notes, and arXiv references are here:

👉 https://gist.github.com/dialethia/20261815225aa45dbb4bb0c25b397049

Reproducers

Silent drop (bug) — grok-4.20:

curl -sS https://llm.bankr.bot/v1/messages \
  -H "x-api-key: $BANKR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{"model":"grok-4.20","max_tokens":3,"messages":[{"role":"user","content":"YES or NO:"}],"top_logprobs":3}'

Returns HTTP 200 with a normal message response. No logprobs field. No warning header.

Explicit rejection (correct behaviour) — claude-sonnet-4.6:

curl -sS https://llm.bankr.bot/v1/messages \
  -H "x-api-key: $BANKR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-sonnet-4.6","max_tokens":5,"messages":[{"role":"user","content":"YES or NO:"}],"top_logprobs":5}'

Returns 400 invalid_request_error: "top_logprobs: Extra inputs are not permitted". Client can detect and fall back.
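
For reference, this is roughly how a client can tell the two behaviours apart today. It is a sketch under assumptions: the function name and outcome labels are hypothetical, the error body is assumed to follow the Anthropic-style `{"error": {"type": ...}}` shape, and logprobs are assumed to show up under a top-level `logprobs` key if they are ever returned.

```python
def classify_logprobs_outcome(status: int, body: dict, requested: bool) -> str:
    """Classify what happened to a top_logprobs request (labels are illustrative)."""
    if not requested:
        return "not_requested"
    error_type = body.get("error", {}).get("type")
    if status == 400 and error_type == "invalid_request_error":
        return "rejected"  # explicit error: the client can fall back
    if status == 200 and "logprobs" not in body:
        return "silently_dropped"  # looks like success, but the data is missing
    if status == 200:
        return "returned"
    return "other_error"
```

With the current gateway, the grok-4.20 reproducer lands in `silently_dropped` and the claude-sonnet-4.6 one in `rejected`. Only the second is detectable without inspecting every successful response for a missing field.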

Why it matters (briefly)

For probability-estimation workloads (prediction markets, classification,
hallucination detection, RLHF), reading log P("YES") and log P("NO") directly
from token logprobs achieves a Brier score of 0.186 (arxiv:2501.04880) vs
~0.49 for text-parsing (lower is better). That's a ~2.5× calibration
improvement at zero cost: the upstreams return logprobs for free; the gateway
is the bottleneck.
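
A minimal sketch of the computation this unlocks, assuming the client has the two token logprobs in hand. The function names are hypothetical and the numbers in the note below are toy values for illustration, not results from the cited paper. The probability of YES is renormalised over the two candidate tokens, and the Brier score is the mean squared error between predicted probabilities and 0/1 outcomes.

```python
import math


def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Renormalise P(YES) over the two candidate tokens (two-way softmax)."""
    m = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)


def brier_score(probabilities, outcomes) -> float:
    """Mean squared error between predicted P(YES) and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)
```

This is why the graded probability matters: a parsed text answer is a hard 0 or 1, so a wrong answer costs a full 1.0, while a prediction of 0.75 costs 0.0625 when right and 0.5625 when wrong.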

The full spec covers use cases, the upstream support matrix (xAI ✅,
OpenAI ✅, DeepSeek ✅, Anthropic ❌, Gemini ⚠️), and rollout/backwards-compat
options.

Happy to iterate on the design if there's appetite.


cc: @0xdeployer @igoryuzo
