Conversation

ServeurpersoCom (Collaborator) commented Nov 2, 2025

Description

MiniMax-M2 models require the complete <think>...</think> block including tags to be present in the context for proper reasoning. This PR adds a minimal reasoning format override that injects a synthetic opening <think> tag while keeping all reasoning content inline, ensuring compatibility with existing clients without modifying the current parsing behavior.
This approach is equivalent to reasoning_format=none but with synthetic prefix injection. When set via --reasoning-format minimax-m2 at server startup, it overrides client API requests that specify reasoning_format=auto, allowing the model to receive the full reasoning block it needs while remaining compatible with all OpenAI-compatible clients.
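
For illustration only, a minimal sketch of that override rule, using a stand-in for the real common_reasoning_format enum; the helper and parameter names here are hypothetical and the actual request-parsing code in server.cpp is structured differently:

```cpp
// Minimal stand-in for the enum in common.h; only MINIMAX_M2 is added by this PR.
enum common_reasoning_format {
    COMMON_REASONING_FORMAT_NONE,
    COMMON_REASONING_FORMAT_AUTO,
    COMMON_REASONING_FORMAT_DEEPSEEK,
    COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY,
    COMMON_REASONING_FORMAT_MINIMAX_M2,
};

// Hypothetical helper: pick the effective format for one request.
static common_reasoning_format resolve_reasoning_format(
        common_reasoning_format server_cli_format,      // from --reasoning-format at startup
        common_reasoning_format client_request_format)  // from the API request
{
    // A client sending reasoning_format=auto must not override the
    // minimax-m2 setting chosen at server startup.
    if (server_cli_format == COMMON_REASONING_FORMAT_MINIMAX_M2 &&
        client_request_format == COMMON_REASONING_FORMAT_AUTO) {
        return server_cli_format;
    }
    return client_request_format;
}
```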

Changes

  • Add COMMON_REASONING_FORMAT_MINIMAX_M2 enum value to common_reasoning_format
  • Implement minimax-m2 format parsing that bypasses reasoning extraction
  • Inject synthetic <think>\n chunk before first generated token when minimax-m2 is active (see the sketch after this list)
  • Track injection state with minimax_reasoning_prefix_injected and minimax_reasoning_prefix_streamed slot flags
  • Prepend <think>\n to generated_text for final response and chat parsing
  • Prevent client reasoning_format=auto from overriding server CLI setting
  • Add minimax-m2 to CLI help, README.md, and code documentation
  • Handle LLAMA_TOKEN_NULL in send_partial_response to skip token recording
  • Update process_token to preserve delta_to_send for streaming correctness
  • Defer synthetic prefix injection until first generated token for better UX
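
As referenced in the list above, here is a rough sketch of the injection flow, with heavily simplified stand-ins for the server slot and streaming callback; only the two flag names, the LLAMA_TOKEN_NULL semantics, and the <think>\n prefix come from this PR, the rest is illustrative:

```cpp
#include <functional>
#include <string>

// Stand-in for the sentinel used to mark a chunk that carries no recordable token.
constexpr int LLAMA_TOKEN_NULL = -1;

struct slot_state {
    bool        minimax_reasoning_prefix_injected = false; // prefix accounted for in generated_text
    bool        minimax_reasoning_prefix_streamed = false; // prefix already sent to the client
    std::string generated_text;
};

using send_chunk_fn = std::function<void(const std::string & text, int token)>;

// Hypothetical hook run when the first real token of the reply arrives:
// emit a synthetic "<think>\n" chunk first (with LLAMA_TOKEN_NULL so no token
// is recorded), then pass the real delta through unchanged.
static void on_first_generated_token(slot_state & slot,
                                     const std::string & delta_to_send,
                                     int token,
                                     const send_chunk_fn & send_partial_response) {
    if (!slot.minimax_reasoning_prefix_injected) {
        slot.minimax_reasoning_prefix_injected = true;
        slot.generated_text.insert(0, "<think>\n"); // final response / chat parsing sees the full block
        if (!slot.minimax_reasoning_prefix_streamed) {
            slot.minimax_reasoning_prefix_streamed = true;
            send_partial_response("<think>\n", LLAMA_TOKEN_NULL);
        }
    }
    send_partial_response(delta_to_send, token);
}
```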

Testing

Tested with the MiniMax-M2-230B model using the --reasoning-format minimax-m2 flag on the stock Svelte UI.


ngxson (Collaborator) left a comment


I still don't understand why this is needed. Can you give a concrete example?

Also, I feel like this could be a patch to chat.cpp instead of extending server.cpp. The server.cpp code is already very complex; we should not add too much code for non-inference functionality, including chat template and formatting logic. These functionalities should be confined to a dedicated module.

"- none: leaves thoughts unparsed in `message.content`\n"
"- deepseek: puts thoughts in `message.reasoning_content`\n"
"- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n"
"- minimax-m2: streams a synthetic opening `<think>` and keeps `</think>` tags in `message.content`\n"
A collaborator asked:

Should we name this something more generic, like `synthetic`?

ServeurpersoCom (Collaborator Author) replied:

@ngxson I've moved as much as possible to chat.cpp. For parameter naming, I kept consistency with existing formats, treating the first model (DeepSeek) as the "parent" behavior reference.

However, we could prepare a more modular refactor by renaming the parameters to better reflect their actual behavior:

  • none -> disables the backend parser (name already good)
  • deepseek -> remove or document it's an "auto" alias (most used, backend reasoning parser, writes reasoning inside reasoning_content chunks: the OpenAI-compatible target)
  • deepseek-legacy -> rename to clone or something clearer? (inline <think> tags + duplicate inside reasoning_content = Legacy+OAI-Compat mirroring, I don't have a use case for this)
  • minimax-m2 (this PR) -> inline reasoning tags + adds a missing <think> opening tag

To make this truly generic, we'd need an additional parameter to define the prepended string instead of hardcoding <think>. Use case: anyone dealing with Jinja templates that pre-open reasoning tags, causing the model to not regenerate them, making subsequent parsing difficult?

Would you prefer I open a follow-up issue to discuss a more generic synthetic-prefix approach with configurable strings?
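
For discussion purposes, a minimal sketch of what such a generic option might look like; the reasoning_prefix settings and helper below are purely hypothetical and do not exist in the codebase:

```cpp
#include <string>

// Hypothetical generic settings: the synthesized prefix becomes configurable
// instead of being hardcoded to "<think>" for one model family.
struct reasoning_prefix_options {
    bool        inject = false;        // e.g. a future --reasoning-prefix CLI flag
    std::string prefix = "<think>\n";  // whatever the Jinja template pre-opened for the model
};

// Prepend the configured prefix to the raw model output before any parsing,
// so downstream reasoning extraction always sees a complete block.
static std::string apply_reasoning_prefix(const reasoning_prefix_options & opt,
                                          const std::string & raw_output) {
    return opt.inject ? opt.prefix + raw_output : raw_output;
}
```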


ServeurpersoCom (Collaborator Author) commented Nov 2, 2025

Thanks for the feedback.
That's exactly why I marked this PR as draft.
This part of the server is already complex, and I wanted other eyes on it before deciding how far to modularize or relocate the logic.

The behavior I'm implementing is indeed a special case, which probably deserves a more generic approach, as you mentioned. Ideally, the refactor would rename the existing formats to make them clearer and more consistent, for example:

  • none -> disables the backend parser (name already good)
  • deepseek -> remove, or just document that it's already an "auto" alias (the most used one; the backend reasoning parser writes reasoning into reasoning_content chunks: the OpenAI-compatible target)
  • deepseek-legacy -> "clone" as a better name? (inline <think> tags + a duplicate inside reasoning_content; I don't have a use case for this)
  • This one from the PR -> inline reasoning tags + adds the missing <think> opening tag

https://huggingface.co/MiniMaxAI/MiniMax-M2
They state:
"IMPORTANT: MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking content from the assistant's turns within the historical messages. In the model's output content, we use the <think>...</think> format to wrap the assistant's thinking content. When using the model, you must ensure that the historical content is passed back in its original format. Do not remove the <think>...</think> part, otherwise, the model's performance will be negatively affected."

So by exposing this prefix injection and parsing behavior as modular options, we could easily handle other models with similar reasoning requirements without changing the core server logic.
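
To make that requirement concrete, here is a small sketch of a chat history where the assistant turn keeps its reasoning inline in content, which is what gets replayed to the model under this mode (nlohmann::json is used only for brevity; the payload shape follows the OpenAI-style chat API):

```cpp
#include <nlohmann/json.hpp>

int main() {
    using nlohmann::json;
    // The assistant turn keeps the full <think>...</think> block in content,
    // so the next request replays it to the model unchanged.
    json messages = json::array({
        { {"role", "user"},      {"content", "What is 17 * 24?"} },
        { {"role", "assistant"}, {"content",
            "<think>\n17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408\n</think>\nThe answer is 408."} },
        { {"role", "user"},      {"content", "And divided by 2?"} },
    });
    (void) messages; // in a real client, this array is the next /v1/chat/completions payload
    return 0;
}
```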

In the best of all possible worlds, the refactor would also introduce an output template, analogous to the Jinja template that currently serves as the model's input template, to remove or at least reduce the kilometers of hardcoded parsing logic.


pwilkin (Collaborator) commented Nov 2, 2025

I've got a PR for Minimax standard chat format in the works (without interleave), but ran into some weird corruption problems (but it might be related to #16935). Gonna upload today probably.

Commit: Move minimax-m2 prefix injection logic from server.cpp to chat.cpp via common_chat_stream_state
ServeurpersoCom (Collaborator Author) replied:

> I've got a PR for Minimax standard chat format in the works (without interleave), but ran into some weird corruption problems (but it might be related to #16935). Gonna upload today probably.

The "interleave" part is simply keeping the reasoning in context by sending it in delta.content along with the rest of the conversation, which is what reasoning_format=none does, and what I've implemented here to avoid touching the rest of the codebase.

This model consumes context heavily during reasoning. I'm hoping improvements to the calculations will help it behave better, because right now it needs the full 128K token context to be useful.

Looking forward to your PR for the standard chat format!


ServeurpersoCom (Collaborator Author) commented Nov 2, 2025

(Screenshot: the web UI showing <think>...</think> inline in delta.content.)

Here you can see <think>...</think> present in the regular delta.content, which gets sent back to the model in subsequent messages, forming the interleaving that respects this model's specific training. Displaying a spoiler (collapsible block) on the frontend remains possible if desired, but it would be a different kind of spoiler, since this content gets sent back to the context, which we don't do with OAI-compatible models.

Alternatively, we could skip this PR entirely and use the standard OAI reasoning_content with a proper backend parser, then add a dedicated frontend checkbox to optionally send that reasoning content back to the context. Feasibility needs study, but a real parser would be better.

More broadly, a new templating engine specifically for streaming, not hardcoded in C++, would be revolutionary in the LLM world: something like output templates to complement the existing input Jinja templates, reducing the kilometers of hardcoded parsing logic we currently maintain.


hksdpc255 commented Nov 3, 2025

> MiniMax-M2 models require the complete <think>...</think> block including tags to be present in the context for proper reasoning.

It seems that handling of the <think>...</think> block, including the tags required for proper reasoning, is already implemented in my PR #16932?

I admit that the current implementation of try_parse_reasoning is buggy for this situation, so I’m handling the reasoning content without relying on it for now.


aldehir (Collaborator) commented Nov 3, 2025

I don't think this is necessary. If you look at the MiniMax-M2 template, the reasoning is only kept for assistant messages that follow the last user message. This happens during a tool call loop, where the client message has role tool and not user. Preserving reasoning content is not required for basic conversations.

Since you need a backend chat parser to handle tool calling, the extraction of reasoning content is straightforward. From there, clients can pass along the reasoning_content from assistant messages and the chat template will render it as needed.
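
For contrast with the inline approach, here is a sketch of the same idea using a separate reasoning_content field on the assistant message, which the backend parser extracts and the chat template re-renders where the model expects it; this is only an illustration of the flow described above, not code from any PR:

```cpp
#include <nlohmann/json.hpp>

int main() {
    using nlohmann::json;
    // Reasoning lives in reasoning_content, user-visible text in content; the
    // chat template decides where (and for which turns) to render it back to
    // the model, e.g. only for assistant turns after the last user message.
    json messages = json::array({
        { {"role", "user"},      {"content", "What is 17 * 24?"} },
        { {"role", "assistant"},
          {"reasoning_content", "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408"},
          {"content", "The answer is 408."} },
        { {"role", "user"},      {"content", "And divided by 2?"} },
    });
    (void) messages;
    return 0;
}
```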

ServeurpersoCom (Collaborator Author) replied:

Yes, it’s better to start with PR #16932 and, if necessary, figure out how to feed the content back into the context to preserve the model’s training behavior during longer conversations.
