
fix(gateway): propagate generation params to inference request #1342

Open

Bhanudahiyaa wants to merge 1 commit into mofa-org:main from Bhanudahiyaa:fix/openai-generation-param-propagation

Conversation


@Bhanudahiyaa Bhanudahiyaa commented Mar 17, 2026

Summary

This PR ensures OpenAI-compatible generation controls (max_tokens, temperature) are carried through gateway translation into internal InferenceRequest instead of being silently
discarded.

Motivation

The API contract already exposes these parameters, but runtime translation dropped them. That mismatch causes surprising behavior and makes the gateway less trustworthy for
OpenAI-compatible clients.

Fixes: #1341

What changed

  • Added optional generation fields to internal inference contract:
    • InferenceRequest.max_tokens: Option
    • InferenceRequest.temperature: Option
  • Added builder helpers:
    • with_max_tokens(...)
    • with_temperature(...)
  • Added conversion helper on OpenAI request type:
    • ChatCompletionRequest::to_inference_request(required_memory_mb)
    • Copies model/prompt/priority and forwards max_tokens/temperature.
  • Updated gateway request paths to use this conversion:
    • HTTP OpenAI handler
    • WebSocket streaming handler
  • Updated inference bridge path to forward these same generation params.
  • Added tests to prevent regressions (including backward-compatible deserialization behavior for older request payloads without the new fields).
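The changes above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: field types (u32 for max_tokens, f32 for temperature), the shape of both structs, and the required_memory_mb type are assumptions; only the names come from the PR description.

```rust
// Sketch of the propagation fix: optional generation fields on the internal
// request, builder helpers, and a centralized conversion from the
// OpenAI-compatible request type. Types are assumptions for illustration.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct InferenceRequest {
    pub model: String,
    pub prompt: String,
    pub required_memory_mb: u64,
    pub max_tokens: Option<u32>,  // new optional generation field
    pub temperature: Option<f32>, // new optional generation field
}

impl InferenceRequest {
    pub fn with_max_tokens(mut self, n: u32) -> Self {
        self.max_tokens = Some(n);
        self
    }
    pub fn with_temperature(mut self, t: f32) -> Self {
        self.temperature = Some(t);
        self
    }
}

pub struct ChatCompletionRequest {
    pub model: String,
    pub prompt: String,
    pub max_tokens: Option<u32>,
    pub temperature: Option<f32>,
}

impl ChatCompletionRequest {
    // Centralized conversion: every gateway entry point (HTTP handler,
    // WebSocket streaming) funnels through here, so generation params can no
    // longer be dropped in one path but not another.
    pub fn to_inference_request(&self, required_memory_mb: u64) -> InferenceRequest {
        InferenceRequest {
            model: self.model.clone(),
            prompt: self.prompt.clone(),
            required_memory_mb,
            max_tokens: self.max_tokens,   // forwarded instead of discarded
            temperature: self.temperature, // forwarded instead of discarded
        }
    }
}
```

Keeping the conversion in one method is what the "reduces future drift" design note refers to: a new entry point only needs to call `to_inference_request` rather than re-copy each field.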

Files changed

  • crates/mofa-foundation/src/inference/types.rs
  • crates/mofa-gateway/src/openai_compat/types.rs
  • crates/mofa-gateway/src/openai_compat/handler.rs
  • crates/mofa-gateway/src/streaming/ws.rs
  • crates/mofa-gateway/src/inference_bridge.rs

Design notes / tradeoffs

  • This PR focuses on propagation correctness, not full provider-side enforcement.
  • Keeping fields optional preserves backward compatibility.
  • Centralized conversion (to_inference_request) reduces future drift across multiple gateway entry points.

Tests added/updated

  • mofa-foundation inference type tests:
    • request builder includes new fields
    • serde roundtrip includes new fields
    • deserializing payloads without new fields defaults to None
  • mofa-gateway OpenAI type test:
    • to_inference_request propagates max_tokens and temperature
  • Existing openai handler tests still pass with updated request translation path.
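The backward-compatibility contract the tests pin down is: a payload from an older client that omits the new fields must parse with both set to None, not fail. The real code presumably relies on serde's handling of Option fields; this std-only stand-in (hypothetical names, no serde dependency) just illustrates that contract.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the deserialization regression test: fields
// absent from an older payload map to None rather than causing an error.
#[derive(Debug, PartialEq)]
struct GenerationParams {
    max_tokens: Option<u32>,
    temperature: Option<f32>,
}

fn parse_generation_params(raw: &HashMap<&str, &str>) -> GenerationParams {
    GenerationParams {
        // Missing or unparsable keys degrade to None, mirroring how
        // Option fields default when absent from a serialized payload.
        max_tokens: raw.get("max_tokens").and_then(|v| v.parse().ok()),
        temperature: raw.get("temperature").and_then(|v| v.parse().ok()),
    }
}
```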

Validation run

  • cargo test -p mofa-foundation --lib inference::types::tests
  • cargo test -p mofa-gateway --features openai-compat --lib openai_compat::types::tests
  • cargo test -p mofa-gateway --features openai-compat --lib openai_compat::handler::tests

Checklist

  • Focused fix for one problem
  • Propagation added across relevant gateway paths
  • Regression tests added
  • No unrelated functional changes
  • Full workspace fmt/clippy/test gates (can be run in CI / maintainers’ environment as needed)

———

@Bhanudahiyaa (Contributor, Author)

@lijingrs @BH3GEI @yangrudan

This change closes a contract gap between the OpenAI-compatible API layer and internal inference orchestration.

Previously, request-level generation controls (max_tokens, temperature) were accepted by the schema but dropped during translation, which produced silent behavior drift. This PR introduces explicit propagation via a centralized conversion helper and extends InferenceRequest with optional fields to maintain backward compatibility.

Question for maintainers

Should we follow up with explicit validation ranges at the gateway boundary (e.g., temperature bounds), or leave normalization/validation to downstream provider adapters?
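For concreteness, the first option in the question above could look something like this. Everything here is an assumption for discussion, not code from this PR: the function name is hypothetical, and the 0.0..=2.0 temperature range follows the OpenAI API convention rather than anything mofa currently enforces.

```rust
// Hypothetical gateway-boundary validation: reject out-of-range generation
// params before they reach provider adapters. Range choices are assumptions
// borrowed from the OpenAI API convention.
fn validate_generation_params(
    max_tokens: Option<u32>,
    temperature: Option<f32>,
) -> Result<(), String> {
    if let Some(t) = temperature {
        // NaN also fails this check, since contains() returns false for it.
        if !(0.0..=2.0).contains(&t) {
            return Err(format!("temperature {t} outside [0.0, 2.0]"));
        }
    }
    if let Some(n) = max_tokens {
        if n == 0 {
            return Err("max_tokens must be positive".to_string());
        }
    }
    Ok(())
}
```

The tradeoff is where errors surface: validating here gives clients a uniform 400-style error across providers, while deferring to adapters lets each provider apply its own limits at the cost of inconsistent failure modes.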



Linked issue

OpenAI gateway silently drops max_tokens and temperature before inference routing/execution
