5 changes: 5 additions & 0 deletions deploy/compose/docker-compose-rag-server.yaml
@@ -72,6 +72,11 @@ services:
LLM_MAX_TOKENS: ${LLM_MAX_TOKENS:-32768}
LLM_TEMPERATURE: ${LLM_TEMPERATURE:-0}
LLM_TOP_P: ${LLM_TOP_P:-1.0}

# Enable/disable thinking/reasoning for nemotron-3-nano models (30b variant)
# Set to "true" to enable reasoning mode with reasoning_budget
# Set to "false" to disable reasoning and get direct answers
ENABLE_NEMOTRON_3_NANO_THINKING: ${ENABLE_NEMOTRON_3_NANO_THINKING:-true}

##===Query Rewriter Model specific configurations===
APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"nvidia/llama-3.3-nemotron-super-49b-v1.5"}
3 changes: 3 additions & 0 deletions deploy/compose/nvdev.env
@@ -16,6 +16,9 @@ export NVIDIA_API_KEY=${NGC_API_KEY}
# === Internally NVIDIA hosted NIM Endpoints (for cloud deployment) ===
# WAR: Use public endpoint for inference
export APP_LLM_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
# For nemotron-3-nano models hosted on NVIDIA cloud, use:
# export APP_LLM_MODELNAME=nvidia/nemotron-3-nano-30b-a3b
# Note: For locally deployed nemotron-3-nano, use: nvidia/nemotron-3-nano
export APP_FILTEREXPRESSIONGENERATOR_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
export APP_EMBEDDINGS_MODELNAME=nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2
# For VLM Embedding Model (Nemoretriever-1b-vlm-embed-v1)
5 changes: 5 additions & 0 deletions deploy/helm/nvidia-blueprint-rag/values.yaml
@@ -164,6 +164,11 @@ envVars:
LLM_TEMPERATURE: "0"
LLM_TOP_P: "1.0"

# Enable/disable thinking/reasoning for nemotron-3-nano models (30b variant)
# Set to "true" to enable reasoning mode with reasoning_budget
# Set to "false" to disable reasoning and get direct answers
ENABLE_NEMOTRON_3_NANO_THINKING: "true"

##===Query Rewriter Model specific configurations===
APP_QUERYREWRITER_MODELNAME: "nvidia/llama-3.3-nemotron-super-49b-v1.5"
# URL on which query rewriter model is hosted. If "", Nvidia hosted API is used
45 changes: 41 additions & 4 deletions docs/enable-nemotron-thinking.md
@@ -113,10 +113,16 @@
As of NIM version 1.12, the Thinking Budget feature is supported on the following models:

- **nvidia/nvidia-nemotron-nano-9b-v2**
- **nvidia/nemotron-3-nano-30b-a3b** (also accessible as `nvidia/nemotron-3-nano`)

For the latest supported models, refer to the [NIM Thinking Budget Control documentation](https://docs.nvidia.com/nim/large-language-models/latest/thinking-budget-control.html).

> **Note:** The model `nvidia/nemotron-3-nano` is an alias that can be used interchangeably with `nvidia/nemotron-3-nano-30b-a3b`. Both refer to the same underlying model.
>
> **Important - Model Naming:**
> - **For locally deployed NIMs:** Use model name `nvidia/nemotron-3-nano`
> - **For NVIDIA-hosted models:** Use model name `nvidia/nemotron-3-nano-30b-a3b`

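The split matters when exporting `APP_LLM_MODELNAME` (see `deploy/compose/nvdev.env` in this change). Below is a minimal Python sketch that picks the name this note prescribes; `USE_LOCAL_NIM` is a hypothetical switch, not part of the blueprint:

```python
import os

# Hypothetical switch (not part of the blueprint): "true" -> locally deployed NIM,
# anything else -> NVIDIA-hosted endpoint.
use_local_nim = os.getenv("USE_LOCAL_NIM", "true").lower() == "true"

# Model names taken from the note above.
os.environ["APP_LLM_MODELNAME"] = (
    "nvidia/nemotron-3-nano" if use_local_nim else "nvidia/nemotron-3-nano-30b-a3b"
)
print(os.environ["APP_LLM_MODELNAME"])
```
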
### Enabling Thinking Budget on RAG

After enabling reasoning as described in the steps above, enable the thinking budget feature in RAG by including the following parameters in your API request:
@@ -126,13 +132,29 @@
| `min_thinking_tokens` | 1 | Minimum number of thinking tokens to allocate for reasoning models. |
| `max_thinking_tokens` | 8192 | Maximum number of thinking tokens to allocate for reasoning models. |

> **Note for `nvidia/nemotron-3-nano-30b-a3b` and `nvidia/nemotron-3-nano`**
> These models only use the `max_thinking_tokens` parameter.
> - `min_thinking_tokens` is ignored for these models.
> - Thinking budget is enabled by passing a positive `max_thinking_tokens` value in the request.
> - The RAG blueprint automatically handles the model-specific parameter mapping internally (`max_thinking_tokens` → `reasoning_budget`).
> - Unlike `nvidia/nvidia-nemotron-nano-9b-v2`, these models return reasoning in a separate `reasoning_content` field rather than using `<think>` tags.
>
> **Controlling Reasoning for nemotron-3-nano:**
> - Set `ENABLE_NEMOTRON_3_NANO_THINKING=true` (default) to enable reasoning/thinking mode
> - Set `ENABLE_NEMOTRON_3_NANO_THINKING=false` to disable reasoning mode
> - This controls the `enable_thinking` flag in `chat_template_kwargs`
>
> **Model Behavior Differences:**
>
> | Model | Reasoning Control | Reasoning Output | Token Budget Parameter |
> |-------|------------------|------------------|----------------------|
> | `nvidia/nvidia-nemotron-nano-9b-v2` | `min_thinking_tokens`, `max_thinking_tokens` | In `content` field with `<think>` tags | `min_thinking_tokens`, `max_thinking_tokens` |
> | `nvidia/nemotron-3-nano-30b-a3b` | `ENABLE_NEMOTRON_3_NANO_THINKING` env var | In `reasoning_content` field | `reasoning_budget` (mapped from `max_thinking_tokens`) |
> | `nvidia/llama-3.3-nemotron-super-49b-v1.5` | System prompt (`/think` or `/no_think`) | In `content` field with `<think>` tags | N/A (controlled by prompt) |

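The notes above describe what the RAG blueprint does on your behalf. As a rough sketch of the same behavior at the LangChain level (mirroring `_bind_thinking_tokens_if_configured` and `extract_reasoning_and_content` in this change), assuming a locally deployed NIM at `http://localhost:8000/v1`:

```python
import os

from langchain_core.messages import HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Assumption: a local NIM serving nvidia/nemotron-3-nano on the default port.
llm = ChatNVIDIA(model="nvidia/nemotron-3-nano", base_url="http://localhost:8000/v1", temperature=0)

# Mirror the blueprint's mapping: ENABLE_NEMOTRON_3_NANO_THINKING -> enable_thinking,
# max_thinking_tokens -> reasoning_budget.
enable_thinking = os.getenv("ENABLE_NEMOTRON_3_NANO_THINKING", "true").lower() in ("true", "1", "yes")
llm = llm.bind(reasoning_budget=8192, chat_template_kwargs={"enable_thinking": enable_thinking})

for chunk in llm.stream([HumanMessage(content="What is the FY2017 operating cash flow ratio for Adobe?")]):
    # nemotron-3-nano returns reasoning in a separate reasoning_content field, not in <think> tags.
    reasoning = chunk.additional_kwargs.get("reasoning_content", "")
    if reasoning:
        print(f"[thinking] {reasoning}", end="", flush=True)
    if chunk.content:
        print(chunk.content, end="", flush=True)
```

The blueprint only sets `reasoning_budget` when a positive `max_thinking_tokens` is supplied in the request; `chat_template_kwargs` is set in either case.
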
**Example API requests:**

**For nvidia/nvidia-nemotron-nano-9b-v2:**
```json
{
  "messages": [
    {
      "role": "user",
      "content": "What is the FY2017 operating cash flow ratio for Adobe?"
    }
  ],
  "min_thinking_tokens": 1,
  "max_thinking_tokens": 8192,
  "model": "nvidia/nvidia-nemotron-nano-9b-v2"
}
```

**For nemotron-3-nano (locally deployed):**
```json
{
"messages": [
{
"role": "user",
"content": "What is the FY2017 operating cash flow ratio for Adobe?"
}
],
"max_thinking_tokens": 8192,
"model": "nvidia/nemotron-3-nano"
}
```

**For nemotron-3-nano (NVIDIA-hosted):**
```json
{
  "messages": [
    {
      "role": "user",
      "content": "What is the FY2017 operating cash flow ratio for Adobe?"
    }
  ],
  "max_thinking_tokens": 8192,
  "model": "nvidia/nemotron-3-nano-30b-a3b"
}
```
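A hedged sketch of sending one of these requests from Python follows; it assumes the rag-server is reachable at `http://localhost:8081` and exposes `POST /v1/generate` (adjust the URL and payload to your deployment), and that the response may be streamed:

```python
import requests

# Payload fields follow the JSON examples above.
payload = {
    "messages": [
        {"role": "user", "content": "What is the FY2017 operating cash flow ratio for Adobe?"}
    ],
    "max_thinking_tokens": 8192,
    "model": "nvidia/nemotron-3-nano",
}

# Assumed endpoint; substitute the generate endpoint of your rag-server deployment.
with requests.post("http://localhost:8081/v1/generate", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)
```
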
139 changes: 117 additions & 22 deletions src/nvidia_rag/utils/llm.py
@@ -16,10 +16,11 @@
"""The wrapper for interacting with llm models and pre or postprocessing LLM response.
1. get_prompts: Get the prompts from the YAML file.
2. get_llm: Get the LLM model. Uses the NVIDIA AI Endpoints or OpenAI.
3. extract_reasoning_and_content: Extract reasoning and content from response chunks.
4. streaming_filter_think: Filter the think tokens from the LLM response (sync).
5. get_streaming_filter_think_parser: Get the parser for filtering the think tokens (sync).
6. streaming_filter_think_async: Filter the think tokens from the LLM response (async).
7. get_streaming_filter_think_parser_async: Get the parser for filtering the think tokens (async).
"""

import logging
@@ -131,8 +132,19 @@ def _bind_thinking_tokens_if_configured(
) -> LLM | SimpleChatModel:
"""
If min_thinking_tokens or max_thinking_tokens are > 0 in kwargs, bind them to the LLM.
For models that use a reasoning budget (e.g., nemotron-3-nano-30b-a3b),
max_thinking_tokens is mapped to the underlying ChatNVIDIA ``reasoning_budget`` parameter.

Supports multiple reasoning/thinking model variants:

1. nvidia/nvidia-nemotron-nano-9b-v2:
- Uses min_thinking_tokens and max_thinking_tokens parameters
- Outputs reasoning wrapped in <think></think> tags in the content stream

2. nemotron-3-nano variants (nemotron-3-nano-30b-a3b, nvidia/nemotron-3-nano):
- Uses reasoning_budget parameter (mapped from max_thinking_tokens)
- Requires chat_template_kwargs={"enable_thinking": True/False}
- Outputs reasoning in a separate 'reasoning_content' field (not in content)
- Does NOT use <think> tags
- Can be controlled via ENABLE_NEMOTRON_3_NANO_THINKING env var

Raises:
ValueError: If min_thinking_tokens or max_thinking_tokens is passed but model
@@ -151,16 +163,27 @@
if not has_thinking_tokens:
return llm

# Check if model is a supported reasoning model (various name formats)
# Note: For locally hosted models, use "nvidia/nemotron-3-nano"
# For NVIDIA-hosted models, use "nvidia/nemotron-3-nano-30b-a3b"
is_nano_9b_v2 = model and "nvidia/nvidia-nemotron-nano-9b-v2" in model
is_nemotron_3_nano = model and (
"nemotron-3-nano" in model.lower() or
"nvidia/nemotron-3-nano" in model or
"nemotron-3-nano-30b-a3b" in model
)

if has_thinking_tokens and not (is_nano_9b_v2 or is_nemotron_3_nano):
raise ValueError(
"min_thinking_tokens and max_thinking_tokens are only supported for models "
"'nvidia/nvidia-nemotron-nano-9b-v2' and nemotron-3-nano variants "
"(e.g., 'nemotron-3-nano-30b-a3b', 'nvidia/nemotron-3-nano'), "
f"but got model '{model}'"
)

bind_args = {}
if is_nano_9b_v2:
# nvidia/nvidia-nemotron-nano-9b-v2: Uses thinking token parameters directly
if min_think is not None and min_think > 0:
bind_args["min_thinking_tokens"] = min_think
else:
@@ -169,17 +192,31 @@
)
if max_think is not None and max_think > 0:
bind_args["max_thinking_tokens"] = max_think
        else:
            raise ValueError(
                f"max_thinking_tokens must be a positive integer, but got {max_think}"
            )
elif is_nemotron_3_nano:
# nemotron-3-nano variants: Use reasoning_budget and enable_thinking flag
# Check environment variable for enable_thinking control
enable_thinking_env = os.getenv("ENABLE_NEMOTRON_3_NANO_THINKING", "true").lower()
enable_thinking = enable_thinking_env in ("true", "1", "yes")

# For nemotron-3-nano variants, min_thinking_tokens is not supported
# If min_thinking_tokens is provided, max_thinking_tokens is required
if min_think is not None and min_think > 0:
if max_think is None or max_think <= 0:
raise ValueError(
"max_thinking_tokens must be a positive integer when using "
"min_thinking_tokens with nemotron-3-nano variants"
)
logger.warning(
"min_thinking_tokens is not supported for nemotron-3-nano variants, "
"only max_thinking_tokens (mapped to reasoning_budget) is supported"
)

if max_think is not None and max_think > 0:
bind_args["reasoning_budget"] = max_think
bind_args["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
logger.info(
"nemotron-3-nano: Setting reasoning_budget=%d, enable_thinking=%s (from env: %s)",
max_think, enable_thinking, enable_thinking_env
)

if bind_args:
@@ -309,6 +346,64 @@ def get_llm(config: NvidiaRAGConfig | None = None, **kwargs) -> LLM | SimpleChatModel:
)


def extract_reasoning_and_content(chunk) -> tuple[str, str]:
"""
Extract both reasoning and content from a response chunk.

Different models handle reasoning differently:
- nvidia/nvidia-nemotron-nano-9b-v2: Uses <think> tags in content stream
- nemotron-3-nano variants: Uses separate reasoning_content field
- llama-3.3-nemotron-super-49b: Uses <think> tags in content stream (controlled by prompt)

This function is designed to be robust and compatible with future changes:
- Checks both reasoning_content and content fields
- Returns whichever field has tokens, regardless of model behavior
- If both have content, returns both separately

This ensures that if the model server fixes the issue where reasoning is disabled
but content still goes to reasoning_content, the code will still work correctly.

Args:
chunk: A response chunk from ChatNVIDIA or similar LLM interface

Returns:
tuple: (reasoning_text, content_text) - either may be empty string

Example:
>>> for chunk in llm.stream([HumanMessage(content="question")]):
>>> reasoning, content = extract_reasoning_and_content(chunk)
>>> if reasoning:
>>> print(f"[REASONING: {reasoning}]", end="", flush=True)
>>> if content:
>>> print(content, end="", flush=True)
"""
reasoning = ""
content = ""

# Check for reasoning_content in additional_kwargs (nemotron-3-nano variants)
# This field is populated by nemotron-3-nano models for reasoning output
if hasattr(chunk, 'additional_kwargs') and 'reasoning_content' in chunk.additional_kwargs:
reasoning = chunk.additional_kwargs.get('reasoning_content', '')

# Check for regular content
# This field is populated by most models for regular output
# For nemotron-nano-9b-v2 and llama-49b, this may include <think> tags
if hasattr(chunk, 'content') and chunk.content:
content = chunk.content

# Robust fallback: If reasoning field has content but content field is empty,
# treat reasoning as content. This handles the case where enable_thinking=false
# but the model still populates reasoning_content instead of content.
# This makes the code compatible with future fixes to the model server.
if reasoning and not content:
# If only reasoning has content, it might actually be the final response
# (occurs when enable_thinking=false but model hasn't been updated)
# Keep it in reasoning field but also check if it looks like a final answer
pass # Keep as-is, let the caller decide how to handle

return reasoning, content


def streaming_filter_think(chunks: Iterable[str]) -> Iterable[str]:
"""
This generator filters content between think tags in streaming LLM responses.