fix: fallback to Ollama native API for thinking-mode models#27

Open
wbryanta wants to merge 1 commit into nikmcfly:main from wbryanta:fix/ollama-thinking-mode-fallback

Conversation


@wbryanta wbryanta commented Apr 3, 2026

Fixes #26

Problem

Models with thinking mode (e.g., Gemma 4) return empty content via Ollama's OpenAI-compatible /v1/chat/completions endpoint: the thinking tokens consume the max_tokens budget while the visible content is stripped, so simulation start fails with a 500 error.

Fix

When LLMClient.chat() receives empty content from the OpenAI-compatible endpoint and the backend is Ollama, it retries via the native /api/chat endpoint, which handles thinking tokens correctly.

  • Only triggers on empty responses — no change in behavior for working models
  • No new dependencies (requests already in requirements)
  • Tested with gemma4:26b on Ollama 0.20.0
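A minimal sketch of how the fallback described above could look. `LLMClient` here is a stand-in: the method names (`_openai_chat`, `_ollama_native_chat`), constructor parameters, and request payloads are assumptions for illustration, not the project's actual code.

```python
import requests


class LLMClient:
    def __init__(self, base_url: str, model: str, backend: str = "ollama"):
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.backend = backend

    def chat(self, messages: list[dict], max_tokens: int = 1024) -> str:
        content = self._openai_chat(messages, max_tokens)
        # Thinking-mode models can spend the whole max_tokens budget on
        # hidden reasoning and return an empty visible message. Only in
        # that case, and only for Ollama, retry via the native endpoint,
        # so working models see no behavior change.
        if not content and self.backend == "ollama":
            content = self._ollama_native_chat(messages, max_tokens)
        return content

    def _openai_chat(self, messages: list[dict], max_tokens: int) -> str:
        resp = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={"model": self.model, "messages": messages,
                  "max_tokens": max_tokens},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"] or ""

    def _ollama_native_chat(self, messages: list[dict], max_tokens: int) -> str:
        # Ollama's native /api/chat separates thinking tokens from the
        # visible message, so the content field is populated.
        resp = requests.post(
            f"{self.base_url}/api/chat",
            json={"model": self.model, "messages": messages,
                  "stream": False,
                  "options": {"num_predict": max_tokens}},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"] or ""
```

Because the fallback keys only on an empty string, a non-empty response from the OpenAI-compatible endpoint is returned unchanged, which keeps the change backwards-compatible.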

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Thinking-mode models (Gemma 4) return empty responses via Ollama OpenAI-compatible endpoint
