
Conversation


@cjluo-nv cjluo-nv commented Oct 14, 2025

What does this PR do?

Type of change: New model support

Overview: ?

Usage

Please see examples/deepseek/README.md

Summary by CodeRabbit

  • New Features

    • Support for DeepSeek V3.2 quantization and automatic detection of available DeepSeek versions.
    • Triton-backed weight dequantization utility and MoE-aware calibration mode to improve calibration fidelity.
  • Documentation

    • DeepSeek examples README expanded with setup, conversion, calibration, and FP8→FP4 quantization workflows for R1, V3, and V3.2.
  • Bug Fixes

    • More robust, failure-tolerant copying of auxiliary files/assets during quantization.
  • Chores

    • Updated changelog and lint/ignore rules for example artifacts.

Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv cjluo-nv requested review from a team as code owners October 14, 2025 18:52

coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds DeepSeek V3.2 quantization support and documentation; introduces a Triton-based weight dequantization kernel and Python wrapper; updates PTQ to prefer V3.2 with fallback to V3 and expands calibration datasets; adjusts quantization scripts/copy behavior; and updates lint ignores for the new kernel.

Changes

Cohort / File(s) Summary
Release notes
CHANGELOG.rst
Adds a New Features entry announcing DeepSeek V3.2 model quantization support.
DeepSeek examples docs
examples/deepseek/README.md
Expands README with V3/R1/V3.1 and new V3.2 setup, conversion, calibration, and NVFP4 quantization workflows; restructures headings and adds conversion/one-step quantization examples.
DeepSeek example ignores
examples/deepseek/.gitignore
Adds DeepSeek-V3.2-Exp/ to the ignore list.
Triton dequant kernel
examples/deepseek/ds_kernel.py
New Triton kernel weight_dequant_kernel and Python wrapper weight_dequant(...) for per-block weight dequantization with tiling, masking, and input validation.
PTQ calibration flow
examples/deepseek/ptq.py
Adds dynamic inference-path resolution preferring DeepSeek-V3.2-Exp/inference with fallback to DeepSeek-V3/inference; imports weight_dequant from ds_kernel; introduces CalibMoe to patch MoE behavior for calibration; expands calibration datasets to two entries and aligns calib sizes; raises explicit error if neither path exists.
Quantization shell script
examples/deepseek/quantize_fp8_to_nvfp4.sh
Replaces the multi-source cp with separate copy commands and adds `cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true` so optional Python files and assets are copied fault-tolerantly.
Quantization Python script
examples/deepseek/quantize_to_nvfp4.py
Imports weight_dequant from ds_kernel (removes ad-hoc sys.path handling); uses ds_kernel.weight_dequant; adjusts a scaling factor expression grouping.
Lint configuration
pyproject.toml
Adds examples/deepseek/ds_kernel.py to extend-ignore and per-file-ignores for Ruff.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant P as PTQ Script (ptq.py)
  participant FS as Filesystem
  participant D32 as DeepSeek-V3.2-Exp/inference
  participant D3 as DeepSeek-V3/inference

  U->>P: Run PTQ
  P->>FS: Check path: DeepSeek-V3.2-Exp/inference
  alt V3.2 found
    FS-->>P: Exists
    P->>D32: Import & patch (uses ds_kernel.weight_dequant)
  else V3 fallback
    P->>FS: Check path: DeepSeek-V3/inference
    FS-->>P: Exists
    P->>D3: Import & patch (uses ds_kernel.weight_dequant)
  else Neither
    P-->>U: Raise ValueError (neither path exists)
  end
  Note over P: Build calibration dataloaders for<br/>["cnn_dailymail","nemotron-post-training-dataset-v2"]
  P-->>U: Run calibration / quantization
sequenceDiagram
  autonumber
  participant Q as Quantize Script
  participant DK as ds_kernel.weight_dequant
  participant TK as Triton Kernel
  participant T as PyTorch

  Q->>DK: weight_dequant(x, s, block_size)
  DK->>T: Validate tensors, compute M,N, allocate output y
  DK->>TK: Launch weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE)
  TK-->>DK: Dequantized tile results (y)
  DK-->>Q: Return full dequantized weights y

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump a rhythm, kernels hum,
V3.2 arrives—new paths to run.
Triton wakes each scaled row bright,
FP8 to FP4 in moonlit byte.
I stash the assets, hop—quantized delight. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 25.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The pull request title "[OMNIML-2857] Support the DeepSeek V3.2 model" is directly related to the main changes in the changeset. The raw summary confirms that the PR adds DeepSeek V3.2 support, including new configurations in ptq.py, expanded documentation in README.md covering V3.2 with specific calibration and quantization steps, new weight dequantization kernels in ds_kernel.py, and updated quantization scripts. The CHANGELOG entry itself states "Support DeepSeek V3.2 model quantization," which aligns with the title. The title is concise, clear, and specific enough that a teammate scanning the history would immediately understand that this PR adds support for a new model version.


Signed-off-by: Chenjie Luo <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/deepseek/quantize_fp8_to_nvfp4.sh (1)

81-83: Consider adding quotes for robustness.

The fault-tolerant copy operations are well-designed: JSON files (required) fail the script if missing, while Python files and assets (optional) use || true to prevent exit under set -e.

For robustness against paths with spaces or special characters, consider quoting the variables and paths:

-cp $FP8_HF_PATH/*.json $FP4_PATH/
-cp $FP8_HF_PATH/*.py $FP4_PATH/ || true
-cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true
+cp "$FP8_HF_PATH"/*.json "$FP4_PATH"/
+cp "$FP8_HF_PATH"/*.py "$FP4_PATH"/ || true
+cp -r "$FP8_HF_PATH"/assets "$FP4_PATH"/ || true
examples/deepseek/ds_kernel.py (1)

37-58: Consider validating scale tensor shape.

The wrapper function correctly validates input contiguity and dimensionality, launches the kernel with proper grid dimensions, and returns the dequantized result.

For robustness, consider adding a check to validate the scale tensor shape matches the expected dimensions:

 def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
     """..."""
     assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
     assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
     M, N = x.size()
+    expected_s_shape = ((M + block_size - 1) // block_size, (N + block_size - 1) // block_size)
+    assert s.shape == expected_s_shape, f"Scale tensor shape {s.shape} doesn't match expected {expected_s_shape}"
     y = torch.empty_like(x, dtype=torch.get_default_dtype())
     grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
     weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
     return y

Note on dtype handling: Line 55 uses torch.get_default_dtype() for the output tensor, which respects the global dtype context (e.g., torch.set_default_dtype(torch.bfloat16) in ptq.py). This is appropriate for the DeepSeek workflow.
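
As a usage illustration (a hypothetical sketch: the shapes, the float8 cast, and the CUDA device are assumptions, not taken from the PR), the wrapper composes with the bfloat16 default dtype like this:

```python
import torch

# Illustrative only: a 512x1024 FP8 weight with 128x128 block scales.
torch.set_default_dtype(torch.bfloat16)   # mirrors the ptq.py context noted above
x = torch.randn(512, 1024, device="cuda").to(torch.float8_e4m3fn)
s = torch.rand(512 // 128, 1024 // 128, device="cuda", dtype=torch.float32)

w = weight_dequant(x, s)                   # block_size defaults to 128
assert w.shape == x.shape and w.dtype == torch.bfloat16
```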

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32d168c and 19722c9.

📒 Files selected for processing (8)
  • CHANGELOG.rst (1 hunks)
  • examples/deepseek/.gitignore (1 hunks)
  • examples/deepseek/README.md (1 hunks)
  • examples/deepseek/ds_kernel.py (1 hunks)
  • examples/deepseek/ptq.py (2 hunks)
  • examples/deepseek/quantize_fp8_to_nvfp4.sh (1 hunks)
  • examples/deepseek/quantize_to_nvfp4.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/ptq.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (37-58)
examples/deepseek/quantize_to_nvfp4.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (37-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (7)
examples/deepseek/.gitignore (1)

2-2: LGTM!

The addition of DeepSeek-V3.2-Exp/ follows the existing pattern for ignoring model directories and is consistent with the V3.2 support being added.

CHANGELOG.rst (1)

16-16: LGTM!

The changelog entry clearly documents the new DeepSeek V3.2 model quantization support and properly references the examples directory.

pyproject.toml (1)

68-68: LGTM!

The lint ignore rules for ds_kernel.py are appropriate for Triton-style code and consistent with the existing rules for modelopt/torch/quantization/triton/* at line 67. The rules allow uppercase variable names (N803, N806) and lambda assignments (E731) which are idiomatic in Triton kernels.

examples/deepseek/quantize_to_nvfp4.py (1)

47-47: LGTM!

Replacing the local kernel import with the external ds_kernel module simplifies the import path and eliminates the need for sys.path manipulation. The new ds_kernel.py module (reviewed separately) provides a compatible weight_dequant API.

examples/deepseek/ptq.py (2)

67-81: LGTM! Clear path resolution and mixed import strategy.

The path resolution logic cleanly handles both DeepSeek V3 and V3.2, with V3.2 taking precedence when available. The error message is clear and helpful.

The import strategy is well-designed:

  • model and kernel modules from the DeepSeek inference directory (via sys.path)
  • ds_kernel from the local examples/deepseek/ directory (maintained in ModelOpt repo)

This separation allows ModelOpt to provide its own optimized kernels while still using the original DeepSeek model implementation.
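
A minimal sketch of the resolution order described above (the path names and precedence come from this review; the exact structure of ptq.py is an assumption):

```python
import os
import sys

V32_PATH = "DeepSeek-V3.2-Exp/inference"
V3_PATH = "DeepSeek-V3/inference"

if os.path.isdir(V32_PATH):
    inference_path = V32_PATH   # prefer V3.2 when present
elif os.path.isdir(V3_PATH):
    inference_path = V3_PATH    # fall back to V3
else:
    raise ValueError(
        f"Neither {V32_PATH} nor {V3_PATH} exists; clone one of the DeepSeek inference repos first."
    )

sys.path.append(inference_path)        # model/kernel modules come from the DeepSeek repo
from ds_kernel import weight_dequant   # noqa: E402  # ModelOpt-maintained local kernel
```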


258-261: LGTM! Improved calibration diversity.

Using two calibration datasets (cnn_dailymail and nemotron-post-training-dataset-v2) with equal sample counts improves calibration quality through better data diversity.

examples/deepseek/ds_kernel.py (1)

8-34: LGTM! Well-structured Triton kernel.

The weight_dequant_kernel follows standard Triton patterns:

  • Proper 2D grid decomposition with pid_m and pid_n
  • Correct boundary masking to handle non-multiple dimensions
  • Efficient block-wise dequantization (y = x * s)

The scale tensor indexing at line 32 (pid_m * n + pid_n) correctly accesses the row-major flattened scale tensor.
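
For readers who want to see the pattern concretely, here is a minimal Triton kernel sketch reconstructed from the bullet points above (it is not copied from ds_kernel.py and may differ in details):

```python
import triton
import triton.language as tl


@triton.jit
def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
    # 2D grid decomposition: one program per (BLOCK_SIZE x BLOCK_SIZE) tile.
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    n = tl.cdiv(N, BLOCK_SIZE)  # number of scale blocks per row

    offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs = offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)  # boundary masking

    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    s = tl.load(s_ptr + pid_m * n + pid_n)   # one scale per tile, row-major flattened
    tl.store(y_ptr + offs, x * s, mask=mask)  # y = x * s
```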

@cjluo-nv cjluo-nv changed the title Support DeepSeek V3.2 models [OMNIML-2857] Support DeepSeek V3.2 models Oct 14, 2025
@cjluo-nv cjluo-nv changed the title [OMNIML-2857] Support DeepSeek V3.2 models [OMNIML-2857] Support the DeepSeek V3.2 model Oct 14, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/deepseek/quantize_fp8_to_nvfp4.sh (1)

81-83: Consider adding explicit file existence checks.

The separated copy commands with || true make the script more tolerant of missing files, which is good for optional files like Python modules or assets. However, JSON configuration files are typically required for quantization to succeed.

Consider adding explicit existence checks for required files to provide clearer error messages:

 mkdir -p $FP4_PATH
-cp $FP8_HF_PATH/*.json $FP4_PATH/
+if ! cp $FP8_HF_PATH/*.json $FP4_PATH/ 2>/dev/null; then
+    echo "Error: No JSON files found in $FP8_HF_PATH"
+    exit 1
+fi
 cp $FP8_HF_PATH/*.py $FP4_PATH/ || true
 cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true
examples/deepseek/ds_kernel.py (1)

59-80: Consider adding shape validation for the scale tensor.

The function validates that inputs are contiguous and 2D, but doesn't verify that the scale tensor s has the expected shape. Based on the kernel implementation (line 54: s_ptr + pid_m * n + pid_n), the scale tensor should have shape (M // block_size, N // block_size).

Add validation after line 76:

     M, N = x.size()
+    expected_s_shape = (triton.cdiv(M, block_size), triton.cdiv(N, block_size))
+    assert s.shape == expected_s_shape, (
+        f"Scale tensor shape {s.shape} doesn't match expected {expected_s_shape} "
+        f"for input shape ({M}, {N}) with block_size={block_size}"
+    )
     y = torch.empty_like(x, dtype=torch.get_default_dtype())

This will catch mismatched tensor shapes early and provide a clear error message rather than allowing out-of-bounds memory access or silent incorrect results.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32d168c and 4e4bf16.

📒 Files selected for processing (8)
  • CHANGELOG.rst (1 hunks)
  • examples/deepseek/.gitignore (1 hunks)
  • examples/deepseek/README.md (1 hunks)
  • examples/deepseek/ds_kernel.py (1 hunks)
  • examples/deepseek/ptq.py (2 hunks)
  • examples/deepseek/quantize_fp8_to_nvfp4.sh (1 hunks)
  • examples/deepseek/quantize_to_nvfp4.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/quantize_to_nvfp4.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (59-80)
examples/deepseek/ptq.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (59-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (7)
examples/deepseek/.gitignore (1)

2-2: LGTM!

The addition of DeepSeek-V3.2-Exp/ to the ignore list appropriately excludes the new V3.2 model directory from version control, consistent with the existing ignore pattern for DeepSeek-V3/.

CHANGELOG.rst (1)

16-16: LGTM!

The changelog entry clearly documents the new DeepSeek V3.2 quantization feature and directs users to the appropriate documentation.

pyproject.toml (1)

68-68: LGTM!

The lint suppression for ds_kernel.py is appropriate. Triton kernels follow different naming conventions (lowercase variable names for tensor pointers, lambda functions for grid configuration), and exempting them from standard Python linting rules (N803, N806, E731) is consistent with the existing approach for other Triton code in the repository.

examples/deepseek/quantize_to_nvfp4.py (1)

47-47: LGTM!

The refactoring to import weight_dequant from the centralized ds_kernel module is a good improvement. It eliminates the need for local path manipulation and ensures consistent dequantization behavior across the codebase.

examples/deepseek/ptq.py (2)

67-81: LGTM!

The path resolution strategy is well-designed:

  • Clearly defines both V3 and V3.2 paths
  • Prioritizes V3.2 when present, providing backward compatibility with V3
  • Raises a descriptive error if neither path exists
  • Uses noqa: E402 appropriately for imports that must follow sys.path modification

This approach cleanly supports both model versions without code duplication.


258-261: Verify the expanded calibration dataset configuration.

The calibration now uses a combination of two datasets (cnn_dailymail and nemotron-post-training-dataset-v2) with equal sample counts from each. This change aligns with the broader improvements mentioned in the changelog (line 14 of CHANGELOG.rst).

Please confirm that using [calib_size, calib_size] (e.g., [512, 512] for a total of 1024 samples) rather than splitting a single calib_size value is intentional. If the goal is to maintain the same total number of calibration samples, you might want to use [calib_size // 2, calib_size // 2] instead.
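
To make the trade-off concrete, a small illustration (variable names are assumptions based on this review, not the actual ptq.py code):

```python
calib_size = 512

# As written in the PR: equal counts from each dataset, doubling the total.
calib_sizes_current = [calib_size, calib_size]          # 512 + 512 = 1024 samples

# Alternative if the intent is to keep the original total unchanged.
calib_sizes_split = [calib_size // 2, calib_size // 2]  # 256 + 256 = 512 samples
```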

examples/deepseek/README.md (1)

19-64: LGTM!

The documentation updates are comprehensive and well-structured:

  • Clearly separates DeepSeek V3/R1/V3.1 from V3.2 setup instructions
  • Provides version-specific calibration commands with appropriate config files
  • Documents additional dependencies required for V3.2 (fast-hadamard-transform)
  • Maintains clear workflow from setup through quantization

The structure makes it easy for users to follow the correct path for their target model version.


codecov bot commented Oct 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.39%. Comparing base (32d168c) to head (2223d4b).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #435      +/-   ##
==========================================
+ Coverage   73.38%   73.39%   +0.01%     
==========================================
  Files         180      180              
  Lines       17934    17976      +42     
==========================================
+ Hits        13160    13194      +34     
- Misses       4774     4782       +8     

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 left a comment

Approving as codeowner for pyproject.toml

Updated section headings and fixed paths in README.

Signed-off-by: Chenjie Luo <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e4bf16 and 90865c3.

📒 Files selected for processing (1)
  • examples/deepseek/README.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality

Comment on lines +31 to +64
```bash
# download the FP8 checkpoint from Hugginface.
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Exp --local-dir $HF_FP8_CKPT

# clone DeepSeek-V3.2 Github repository for FP8 inference,
git clone https://github.com/deepseek-ai/DeepSeek-V3.2-Exp.git && cd DeepSeek-V3.2-Exp && git checkout 3b99a53

# Install requirements
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
pip install -r inference/requirements.txt
```

### Convert the Checkpoint

```bash
# convert the HF checkpoint to a specific format for Deepseek
python inference/convert.py --hf-ckpt-path $HF_FP8_CKPT --save-path $DS_CKPT --n-experts 256 --model-parallel 8
```

### Post-training quantization
## Post-training quantization

### Run the calibration scripts

#### Run the calibration scripts
DeepSeek V3, R1, V3.1

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```

#### Quantize the FP8 hf checkpoint to FP4
DeepSeek V3.2

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```

⚠️ Potential issue | 🟠 Major

Add a cd .. before running the calibration command.

The V3.2 setup block leaves us inside DeepSeek-V3.2-Exp. If we run the calibration command as written from there, the path resolves to DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/... and fails. Please add a step (e.g., cd ..) after installing requirements so readers return to the project root before launching calibration.

🤖 Prompt for AI Agents
In examples/deepseek/README.md around lines 31 to 64, the instructions leave the
user inside the DeepSeek-V3.2-Exp repo after cloning and installing requirements
which causes subsequent calibration commands to resolve paths like
DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/... and fail; add a step immediately after
the pip install lines to change directory back to the project root (e.g., run cd
..) so the calibration torchrun commands run from the correct location.

Added SPDX license information to the file.

Signed-off-by: Chenjie Luo <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90865c3 and 9b64663.

📒 Files selected for processing (1)
  • examples/deepseek/ds_kernel.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality

Comment on lines +79 to +95
x (torch.Tensor): The quantized weight tensor of shape (M, N).
s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
block_size (int, optional): The block size to use for dequantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor of the same shape as `x`.
Raises:
AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
"""
assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
M, N = x.size()
y = torch.empty_like(x, dtype=torch.get_default_dtype())
grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
return y

⚠️ Potential issue | 🔴 Critical

Block-scale tensor shape can go out-of-bounds

The docstring and lack of shape checks let callers size s using integer division (M//block_size, N//block_size). When M or N isn’t an exact multiple of block_size, weight_dequant_kernel still launches trailing tiles (tl.cdiv) and loads s_ptr + pid_m * n + pid_n, which now indexes past the allocated scale buffer. That OOB read can yield silent corruption or a hard fault.

Please validate the shape up front (using ceil-div) and update the docstring accordingly, e.g.:

@@
-    s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
+    s (torch.Tensor): The scale tensor of shape (ceil_div(M, block_size), ceil_div(N, block_size)).
@@
-    M, N = x.size()
+    M, N = x.size()
+    m_blocks = (M + block_size - 1) // block_size
+    n_blocks = (N + block_size - 1) // block_size
+    assert s.size() == (m_blocks, n_blocks), \
+        f"Expected s.shape == ({m_blocks}, {n_blocks}), got {tuple(s.size())}"

This keeps the kernel within bounds and matches its launch configuration.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
x (torch.Tensor): The quantized weight tensor of shape (M, N).
s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
block_size (int, optional): The block size to use for dequantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor of the same shape as `x`.
Raises:
AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
"""
assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
M, N = x.size()
y = torch.empty_like(x, dtype=torch.get_default_dtype())
grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
return y
x (torch.Tensor): The quantized weight tensor of shape (M, N).
s (torch.Tensor): The scale tensor of shape (ceil_div(M, block_size), ceil_div(N, block_size)).
block_size (int, optional): The block size to use for dequantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor of the same shape as `x`.
Raises:
AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
"""
assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
M, N = x.size()
m_blocks = (M + block_size - 1) // block_size
n_blocks = (N + block_size - 1) // block_size
assert s.size() == (m_blocks, n_blocks), \
f"Expected s.shape == ({m_blocks}, {n_blocks}), got {tuple(s.size())}"
y = torch.empty_like(x, dtype=torch.get_default_dtype())
grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
return y
🤖 Prompt for AI Agents
In examples/deepseek/ds_kernel.py around lines 79 to 95, the scale tensor `s`
can be too small when M or N are not multiples of block_size because the kernel
launch uses ceil-div (triton.cdiv) but the docstring and current checks use
floor-div; update the docstring to state s.shape == (ceil(M/block_size),
ceil(N/block_size)) and add an assertion that s.dim()==2 and s.is_contiguous()
and s.size(0)==triton.cdiv(M, block_size) and s.size(1)==triton.cdiv(N,
block_size) (with a clear error message) before launching the kernel so the
kernel never reads out-of-bounds.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/deepseek/ptq.py (1)

67-81: Dynamic path resolution looks good, but consider validating imports.

The two-path resolution with V3.2 prioritized over V3 aligns well with the PR objectives. The fallback logic and error message are clear.

Minor observations:

  • Line 79: The import alias deekseep_model has a typo (should be deepseek_model), though it's used consistently throughout the file.
  • Consider adding basic validation after imports to ensure required symbols exist, e.g., verifying that deekseep_model.MoE and weight_dequant are available before proceeding.
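
A sketch of what that post-import validation might look like (hypothetical: the symbol names follow this review, and the surrounding ptq.py context is assumed):

```python
# Fail fast with a clear message if the resolved inference path or local kernel
# did not provide the expected symbols.
required = {
    "deekseep_model.MoE": getattr(deekseep_model, "MoE", None),
    "ds_kernel.weight_dequant": weight_dequant,
}
missing = [name for name, obj in required.items() if obj is None]
if missing:
    raise ImportError(f"Missing required symbols after DeepSeek path resolution: {missing}")
```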
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b64663 and 2223d4b.

📒 Files selected for processing (2)
  • examples/deepseek/ptq.py (5 hunks)
  • examples/deepseek/quantize_to_nvfp4.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/quantize_to_nvfp4.py (2)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (74-95)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (2)
  • amax (236-241)
  • amax (244-255)
examples/deepseek/ptq.py (4)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (74-95)
modelopt/torch/quantization/plugins/huggingface.py (9)
  • _setup (55-58)
  • _setup (161-164)
  • _setup (239-244)
  • _setup (349-350)
  • _setup (365-369)
  • _setup (388-390)
  • _setup (427-473)
  • _setup (601-612)
  • forward (71-119)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (1)
  • forward (847-946)
modelopt/torch/quantization/conversion.py (1)
  • register (325-366)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (4)
examples/deepseek/quantize_to_nvfp4.py (2)

153-154: Improved clarity with explicit parentheses.

The refactoring from / 6.0 / 448.0 to / (6.0 * 448.0) is mathematically equivalent and makes the division by the product more explicit and readable.


47-47: Good refactoring to use the standardized kernel.

The import change from a local implementation to the shared ds_kernel.weight_dequant improves maintainability. The function call at line 196 correctly uses the default block_size=128, which is the established standard for DeepSeek quantization as confirmed in ptq.py. The math refactoring at line 154 is also mathematically equivalent and improves clarity.
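
For context, a hedged sketch of the two-level NVFP4 scale this kind of expression computes (6.0 is the FP4 E2M1 maximum and 448.0 the FP8 E4M3 maximum; whether this mirrors the exact surrounding code in quantize_to_nvfp4.py beyond the regrouping is an assumption):

```python
import torch

amax = torch.tensor(12.5)              # hypothetical per-tensor weight amax
weight_scale_2 = amax / (6.0 * 448.0)  # same value as amax / 6.0 / 448.0, up to rounding
```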

examples/deepseek/ptq.py (2)

279-283: LGTM: Expanded calibration dataset coverage.

Adding the nemotron dataset alongside cnn_dailymail provides broader calibration coverage, which should improve quantization quality for DeepSeek V3.2.


343-348: Commented-out instrumentation is fine.

This appears to be debugging instrumentation for analyzing expert activation patterns during calibration. Since it's commented out, there's no runtime impact.

Comment on lines +201 to +220
class CalibMoe(deekseep_model.MoE):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._setup()

def _setup(self):
self._original_topk = self.gate.topk
self._original_topk_groups = self.gate.topk_groups

def forward(self, x: torch.Tensor) -> torch.Tensor:
# Forward all tokens to all experts for calibration
self.gate.topk = self.n_routed_experts
self.gate.topk_groups = self.gate.n_groups
super().forward(x)
# Restore the original topk and topk_groups
self.gate.topk = self._original_topk
self.gate.topk_groups = self._original_topk_groups

return super().forward(x)


⚠️ Potential issue | 🟠 Major

Fix the double forward pass in CalibMoe.

The forward() method calls super().forward(x) twice on lines 214 and 219:

  1. Line 214: Routes all tokens to all experts (for calibration) but discards the result.
  2. Line 219: Routes with original topk settings and returns this result.

This double invocation means every forward pass during calibration runs inference twice through the MoE layer, which is extremely expensive for large models.

Expected behavior: The forward pass should only route to all experts during calibration phase (when quantizers collect statistics), then switch to normal routing once calibration is complete. The current implementation runs both routing strategies on every call.

Consider this fix:

 class CalibMoe(deekseep_model.MoE):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self._setup()

     def _setup(self):
-        self._original_topk = self.gate.topk
-        self._original_topk_groups = self.gate.topk_groups
+        # Route to all experts during calibration
+        self.gate.topk = self.n_routed_experts
+        self.gate.topk_groups = self.gate.n_groups

     def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # Forward all tokens to all experts for calibration
-        self.gate.topk = self.n_routed_experts
-        self.gate.topk_groups = self.gate.n_groups
-        super().forward(x)
-        # Restore the original topk and topk_groups
-        self.gate.topk = self._original_topk
-        self.gate.topk_groups = self._original_topk_groups
-
         return super().forward(x)

This sets the all-expert routing once during setup and keeps it throughout calibration, eliminating the double forward pass.
