fix: disable thinking mode for structured JSON output requests#34

Open
back2zion wants to merge 1 commit into nikmcfly:main from back2zion:fix/disable-thinking-for-json-output

Conversation

@back2zion

Summary

Models with reasoning/thinking capabilities (e.g. Qwen3 served with --reasoning-parser) emit <think> tags before generating content. When combined with response_format=json_object and vLLM's guided decoding, this causes an infinite abort-retry loop: the thinking tokens violate the JSON schema, vLLM aborts the request, the client retries, and the process hangs indefinitely.

This blocks persona generation at ~50-53 out of 75 agents every time.

Fix

Automatically set enable_thinking=false via chat_template_kwargs whenever structured JSON output is requested, in both:

  • llm_client.py (general LLM client used for ontology generation etc.)
  • oasis_profile_generator.py (agent persona generation)
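The change can be sketched as a small request-building helper (a minimal sketch, not the actual patch; the function name and shape here are assumptions, but the extra_body / chat_template_kwargs mechanism is what vLLM's OpenAI-compatible server uses to pass flags to the chat template):

```python
def build_request_kwargs(messages, response_format=None):
    """Build kwargs for an OpenAI-compatible chat completion call,
    disabling thinking mode when structured JSON output is requested."""
    kwargs = {"messages": messages}
    if response_format is not None:
        kwargs["response_format"] = response_format
        if response_format.get("type") == "json_object":
            # vLLM forwards chat_template_kwargs from extra_body to the
            # model's chat template; enable_thinking=False suppresses the
            # <think> block for Qwen3-style reasoning models, so guided
            # decoding sees valid JSON from the first token.
            kwargs["extra_body"] = {
                "chat_template_kwargs": {"enable_thinking": False}
            }
    return kwargs
```

Non-JSON requests take the early path with no extra_body, so chat and report generation keep thinking mode enabled.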

Test plan

  • Verified persona generation completes all 75/75 agents without hanging
  • Confirmed JSON responses are valid without <think> tag contamination
  • Non-JSON requests (chat, report) still use thinking mode normally

