Problem
The OpenAI-compatible request schema includes max_tokens and temperature, and docs/comments indicate these are passed through to backend inference behavior.
However, the current gateway flow constructs the internal InferenceRequest without these fields, so both values are silently dropped.
Why this matters
- Client intent is lost without warning.
- Runtime behavior can diverge from OpenAI-compatible expectations.
- Troubleshooting is harder because requests appear accepted but generation controls are ignored.
- This weakens trust in API compatibility and can cause production regressions for callers that rely on token caps or temperature-controlled sampling.
Expected behavior
When max_tokens or temperature are provided in an OpenAI-compatible request, they should be propagated into the internal inference request types so that downstream orchestrator/provider layers can consume them.
Actual behavior
Gateway paths currently build InferenceRequest from model/prompt/priority only; generation params are not preserved.
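The gap and the intended fix can be illustrated with a minimal sketch. All type, field, and function names below are hypothetical stand-ins, not taken from the actual codebase:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferenceRequest:
    # Fields the gateway already preserves today.
    model: str
    prompt: str
    priority: int = 0
    # Proposed optional generation params; None means "backend default".
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None

def to_inference_request(api_req: dict) -> InferenceRequest:
    """Map an OpenAI-compatible request body to the internal type,
    carrying generation params through instead of dropping them."""
    return InferenceRequest(
        model=api_req["model"],
        prompt=api_req.get("prompt", ""),
        max_tokens=api_req.get("max_tokens"),    # previously dropped
        temperature=api_req.get("temperature"),  # previously dropped
    )
```

Note the use of .get() rather than truthiness checks, so that a legitimate temperature of 0.0 survives the mapping instead of being treated as "unset".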
Scope
- Extend internal request contract with optional generation params.
- Propagate from API request -> internal inference request in all gateway request paths.
- Add regression tests proving no silent parameter drop.
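One way the regression coverage from the last Scope item might look, sketched with the same hypothetical names (restated here so the example is self-contained; the real tests would target the actual gateway mapping code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferenceRequest:
    model: str
    prompt: str
    priority: int = 0
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None

def to_inference_request(api_req: dict) -> InferenceRequest:
    # Hypothetical stand-in for the gateway's request-mapping path.
    return InferenceRequest(
        model=api_req["model"],
        prompt=api_req.get("prompt", ""),
        max_tokens=api_req.get("max_tokens"),
        temperature=api_req.get("temperature"),
    )

def test_generation_params_not_silently_dropped():
    body = {"model": "demo", "prompt": "hello", "max_tokens": 128, "temperature": 0.0}
    internal = to_inference_request(body)
    # The point of the fix: client intent must survive the mapping,
    # including falsy-but-valid values like temperature=0.0.
    assert internal.max_tokens == 128
    assert internal.temperature == 0.0

def test_absent_params_stay_unset():
    internal = to_inference_request({"model": "demo", "prompt": "hello"})
    assert internal.max_tokens is None
    assert internal.temperature is None
```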