
Conversation


@cjluo-nv cjluo-nv commented Oct 14, 2025

What does this PR do?

Type of change: New model support

Overview: ?

Usage

Please see examples/deepseek/README.md

Summary by CodeRabbit

  • New Features

    • Support for DeepSeek V3.2 quantization and automatic detection of available DeepSeek versions.
    • Triton-backed weight dequantization utility and MoE-aware calibration mode to improve calibration fidelity.
  • Documentation

    • DeepSeek examples README expanded with setup, conversion, calibration, and FP8→FP4 quantization workflows for R1, V3, and V3.2.
  • Bug Fixes

    • More robust, failure-tolerant copying of auxiliary files/assets during quantization.
  • Chores

    • Updated changelog and lint/ignore rules for example artifacts.

Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv cjluo-nv requested review from a team as code owners October 14, 2025 18:52

coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds DeepSeek V3.2 quantization support and documentation; introduces a Triton-based weight dequantization kernel and Python wrapper; updates PTQ to prefer V3.2 with fallback to V3 and expands calibration datasets; adjusts quantization scripts/copy behavior; and updates lint ignores for the new kernel.

Changes

Cohort / File(s) Summary
Release notes
CHANGELOG.rst
Adds a New Features entry announcing DeepSeek V3.2 model quantization support.
DeepSeek examples docs
examples/deepseek/README.md
Expands README with V3/R1/V3.1 and new V3.2 setup, conversion, calibration, and NVFP4 quantization workflows; restructures headings and adds conversion/one-step quantization examples.
DeepSeek example ignores
examples/deepseek/.gitignore
Adds DeepSeek-V3.2-Exp/ to the ignore list.
Triton dequant kernel
examples/deepseek/ds_kernel.py
New Triton kernel weight_dequant_kernel and Python wrapper weight_dequant(...) for per-block weight dequantization with tiling, masking, and input validation.
PTQ calibration flow
examples/deepseek/ptq.py
Adds dynamic inference-path resolution preferring DeepSeek-V3.2-Exp/inference with fallback to DeepSeek-V3/inference; imports weight_dequant from ds_kernel; introduces CalibMoe to patch MoE behavior for calibration; expands calibration datasets to two entries and aligns calib sizes; raises explicit error if neither path exists.
Quantization shell script
examples/deepseek/quantize_fp8_to_nvfp4.sh
Replaces the multi-source cp with separate copy commands and adds `cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true` so optional Python files and assets are copied fault-tolerantly.
Quantization Python script
examples/deepseek/quantize_to_nvfp4.py
Imports weight_dequant from ds_kernel (removes ad-hoc sys.path handling); uses ds_kernel.weight_dequant; adjusts a scaling factor expression grouping.
Lint configuration
pyproject.toml
Adds examples/deepseek/ds_kernel.py to extend-ignore and per-file-ignores for Ruff.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant P as PTQ Script (ptq.py)
  participant FS as Filesystem
  participant D32 as DeepSeek-V3.2-Exp/inference
  participant D3 as DeepSeek-V3/inference

  U->>P: Run PTQ
  P->>FS: Check path: DeepSeek-V3.2-Exp/inference
  alt V3.2 found
    FS-->>P: Exists
    P->>D32: Import & patch (uses ds_kernel.weight_dequant)
  else V3 fallback
    P->>FS: Check path: DeepSeek-V3/inference
    FS-->>P: Exists
    P->>D3: Import & patch (uses ds_kernel.weight_dequant)
  else Neither
    P-->>U: Raise ValueError (neither path exists)
  end
  Note over P: Build calibration dataloaders for<br/>["cnn_dailymail","nemotron-post-training-dataset-v2"]
  P-->>U: Run calibration / quantization
sequenceDiagram
  autonumber
  participant Q as Quantize Script
  participant DK as ds_kernel.weight_dequant
  participant TK as Triton Kernel
  participant T as PyTorch

  Q->>DK: weight_dequant(x, s, block_size)
  DK->>T: Validate tensors, compute M,N, allocate output y
  DK->>TK: Launch weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE)
  TK-->>DK: Dequantized tile results (y)
  DK-->>Q: Return full dequantized weights y

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump a rhythm, kernels hum,
V3.2 arrives—new paths to run.
Triton wakes each scaled row bright,
FP8 to FP4 in moonlit byte.
I stash the assets, hop—quantized delight. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 25.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The pull request title "[OMNIML-2857] Support the DeepSeek V3.2 model" is directly related to the main changes in the changeset. The raw summary confirms that the PR adds DeepSeek V3.2 support, including new configurations in ptq.py, expanded documentation in README.md covering V3.2 with specific calibration and quantization steps, new weight dequantization kernels in ds_kernel.py, and updated quantization scripts. The CHANGELOG entry itself states "Support DeepSeek V3.2 model quantization," which aligns with the title. The title is concise, clear, and specific enough that a teammate scanning the history would immediately understand that this PR adds support for a new model version.


Signed-off-by: Chenjie Luo <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/deepseek/quantize_fp8_to_nvfp4.sh (1)

81-83: Consider adding quotes for robustness.

The fault-tolerant copy operations are well-designed: JSON files (required) fail the script if missing, while Python files and assets (optional) use || true to prevent exit under set -e.

For robustness against paths with spaces or special characters, consider quoting the variables and paths:

-cp $FP8_HF_PATH/*.json $FP4_PATH/
-cp $FP8_HF_PATH/*.py $FP4_PATH/ || true
-cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true
+cp "$FP8_HF_PATH"/*.json "$FP4_PATH"/
+cp "$FP8_HF_PATH"/*.py "$FP4_PATH"/ || true
+cp -r "$FP8_HF_PATH"/assets "$FP4_PATH"/ || true
examples/deepseek/ds_kernel.py (1)

37-58: Consider validating scale tensor shape.

The wrapper function correctly validates input contiguity and dimensionality, launches the kernel with proper grid dimensions, and returns the dequantized result.

For robustness, consider adding a check to validate the scale tensor shape matches the expected dimensions:

 def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
     """..."""
     assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
     assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
     M, N = x.size()
+    expected_s_shape = ((M + block_size - 1) // block_size, (N + block_size - 1) // block_size)
+    assert s.shape == expected_s_shape, f"Scale tensor shape {s.shape} doesn't match expected {expected_s_shape}"
     y = torch.empty_like(x, dtype=torch.get_default_dtype())
     grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
     weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
     return y

Note on dtype handling: Line 55 uses torch.get_default_dtype() for the output tensor, which respects the global dtype context (e.g., torch.set_default_dtype(torch.bfloat16) in ptq.py). This is appropriate for the DeepSeek workflow.
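
As a usage illustration (a hypothetical sketch: the shapes, the float8 cast, and the CUDA device are assumptions, not taken from the PR), the wrapper composes with the bfloat16 default dtype like this:

```python
import torch

# Illustrative only: a 512x1024 FP8 weight with 128x128 block scales.
torch.set_default_dtype(torch.bfloat16)   # mirrors the ptq.py context noted above
x = torch.randn(512, 1024, device="cuda").to(torch.float8_e4m3fn)
s = torch.rand(512 // 128, 1024 // 128, device="cuda", dtype=torch.float32)

w = weight_dequant(x, s)                   # block_size defaults to 128
assert w.shape == x.shape and w.dtype == torch.bfloat16
```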

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32d168c and 19722c9.

📒 Files selected for processing (8)
  • CHANGELOG.rst (1 hunks)
  • examples/deepseek/.gitignore (1 hunks)
  • examples/deepseek/README.md (1 hunks)
  • examples/deepseek/ds_kernel.py (1 hunks)
  • examples/deepseek/ptq.py (2 hunks)
  • examples/deepseek/quantize_fp8_to_nvfp4.sh (1 hunks)
  • examples/deepseek/quantize_to_nvfp4.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/ptq.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (37-58)
examples/deepseek/quantize_to_nvfp4.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (37-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (7)
examples/deepseek/.gitignore (1)

2-2: LGTM!

The addition of DeepSeek-V3.2-Exp/ follows the existing pattern for ignoring model directories and is consistent with the V3.2 support being added.

CHANGELOG.rst (1)

16-16: LGTM!

The changelog entry clearly documents the new DeepSeek V3.2 model quantization support and properly references the examples directory.

pyproject.toml (1)

68-68: LGTM!

The lint ignore rules for ds_kernel.py are appropriate for Triton-style code and consistent with the existing rules for modelopt/torch/quantization/triton/* at line 67. The rules allow uppercase variable names (N803, N806) and lambda assignments (E731) which are idiomatic in Triton kernels.

examples/deepseek/quantize_to_nvfp4.py (1)

47-47: LGTM!

Replacing the local kernel import with the external ds_kernel module simplifies the import path and eliminates the need for sys.path manipulation. The new ds_kernel.py module (reviewed separately) provides a compatible weight_dequant API.

examples/deepseek/ptq.py (2)

67-81: LGTM! Clear path resolution and mixed import strategy.

The path resolution logic cleanly handles both DeepSeek V3 and V3.2, with V3.2 taking precedence when available. The error message is clear and helpful.

The import strategy is well-designed:

  • model and kernel modules from the DeepSeek inference directory (via sys.path)
  • ds_kernel from the local examples/deepseek/ directory (maintained in ModelOpt repo)

This separation allows ModelOpt to provide its own optimized kernels while still using the original DeepSeek model implementation.
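
A minimal sketch of the resolution order described above (the path names and precedence come from this review; the exact structure of ptq.py is an assumption):

```python
import os
import sys

V32_PATH = "DeepSeek-V3.2-Exp/inference"
V3_PATH = "DeepSeek-V3/inference"

if os.path.isdir(V32_PATH):
    inference_path = V32_PATH   # prefer V3.2 when present
elif os.path.isdir(V3_PATH):
    inference_path = V3_PATH    # fall back to V3
else:
    raise ValueError(
        f"Neither {V32_PATH} nor {V3_PATH} exists; clone one of the DeepSeek inference repos first."
    )

sys.path.append(inference_path)        # model/kernel modules come from the DeepSeek repo
from ds_kernel import weight_dequant   # noqa: E402  # ModelOpt-maintained local kernel
```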


258-261: LGTM! Improved calibration diversity.

Using two calibration datasets (cnn_dailymail and nemotron-post-training-dataset-v2) with equal sample counts improves calibration quality through better data diversity.

examples/deepseek/ds_kernel.py (1)

8-34: LGTM! Well-structured Triton kernel.

The weight_dequant_kernel follows standard Triton patterns:

  • Proper 2D grid decomposition with pid_m and pid_n
  • Correct boundary masking to handle non-multiple dimensions
  • Efficient block-wise dequantization (y = x * s)

The scale tensor indexing at line 32 (pid_m * n + pid_n) correctly accesses the row-major flattened scale tensor.
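
For readers who want to see the pattern concretely, here is a minimal Triton kernel sketch reconstructed from the bullet points above (it is not copied from ds_kernel.py and may differ in details):

```python
import triton
import triton.language as tl


@triton.jit
def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
    # 2D grid decomposition: one program per (BLOCK_SIZE x BLOCK_SIZE) tile.
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    n = tl.cdiv(N, BLOCK_SIZE)  # number of scale blocks per row

    offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs = offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)  # boundary masking

    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    s = tl.load(s_ptr + pid_m * n + pid_n)   # one scale per tile, row-major flattened
    tl.store(y_ptr + offs, x * s, mask=mask)  # y = x * s
```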

@cjluo-nv cjluo-nv changed the title Support DeepSeek V3.2 models [OMNIML-2857] Support DeepSeek V3.2 models Oct 14, 2025
@cjluo-nv cjluo-nv changed the title [OMNIML-2857] Support DeepSeek V3.2 models [OMNIML-2857] Support the DeepSeek V3.2 model Oct 14, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/deepseek/quantize_fp8_to_nvfp4.sh (1)

81-83: Consider adding explicit file existence checks.

The separated copy commands with || true make the script more tolerant of missing files, which is good for optional files like Python modules or assets. However, JSON configuration files are typically required for quantization to succeed.

Consider adding explicit existence checks for required files to provide clearer error messages:

 mkdir -p $FP4_PATH
-cp $FP8_HF_PATH/*.json $FP4_PATH/
+if ! cp $FP8_HF_PATH/*.json $FP4_PATH/ 2>/dev/null; then
+    echo "Error: No JSON files found in $FP8_HF_PATH"
+    exit 1
+fi
 cp $FP8_HF_PATH/*.py $FP4_PATH/ || true
 cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true
examples/deepseek/ds_kernel.py (1)

59-80: Consider adding shape validation for the scale tensor.

The function validates that inputs are contiguous and 2D, but doesn't verify that the scale tensor s has the expected shape. Based on the kernel implementation (line 54: s_ptr + pid_m * n + pid_n), the scale tensor should have shape (M // block_size, N // block_size).

Add validation after line 76:

     M, N = x.size()
+    expected_s_shape = (triton.cdiv(M, block_size), triton.cdiv(N, block_size))
+    assert s.shape == expected_s_shape, (
+        f"Scale tensor shape {s.shape} doesn't match expected {expected_s_shape} "
+        f"for input shape ({M}, {N}) with block_size={block_size}"
+    )
     y = torch.empty_like(x, dtype=torch.get_default_dtype())

This will catch mismatched tensor shapes early and provide a clear error message rather than allowing out-of-bounds memory access or silent incorrect results.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32d168c and 4e4bf16.

📒 Files selected for processing (8)
  • CHANGELOG.rst (1 hunks)
  • examples/deepseek/.gitignore (1 hunks)
  • examples/deepseek/README.md (1 hunks)
  • examples/deepseek/ds_kernel.py (1 hunks)
  • examples/deepseek/ptq.py (2 hunks)
  • examples/deepseek/quantize_fp8_to_nvfp4.sh (1 hunks)
  • examples/deepseek/quantize_to_nvfp4.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/quantize_to_nvfp4.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (59-80)
examples/deepseek/ptq.py (1)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (59-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (7)
examples/deepseek/.gitignore (1)

2-2: LGTM!

The addition of DeepSeek-V3.2-Exp/ to the ignore list appropriately excludes the new V3.2 model directory from version control, consistent with the existing ignore pattern for DeepSeek-V3/.

CHANGELOG.rst (1)

16-16: LGTM!

The changelog entry clearly documents the new DeepSeek V3.2 quantization feature and directs users to the appropriate documentation.

pyproject.toml (1)

68-68: LGTM!

The lint suppression for ds_kernel.py is appropriate. Triton kernels follow different naming conventions (lowercase variable names for tensor pointers, lambda functions for grid configuration), and exempting them from standard Python linting rules (N803, N806, E731) is consistent with the existing approach for other Triton code in the repository.

examples/deepseek/quantize_to_nvfp4.py (1)

47-47: LGTM!

The refactoring to import weight_dequant from the centralized ds_kernel module is a good improvement. It eliminates the need for local path manipulation and ensures consistent dequantization behavior across the codebase.

examples/deepseek/ptq.py (2)

67-81: LGTM!

The path resolution strategy is well-designed:

  • Clearly defines both V3 and V3.2 paths
  • Prioritizes V3.2 when present, providing backward compatibility with V3
  • Raises a descriptive error if neither path exists
  • Uses noqa: E402 appropriately for imports that must follow sys.path modification

This approach cleanly supports both model versions without code duplication.


258-261: Verify the expanded calibration dataset configuration.

The calibration now uses a combination of two datasets (cnn_dailymail and nemotron-post-training-dataset-v2) with equal sample counts from each. This change aligns with the broader improvements mentioned in the changelog (line 14 of CHANGELOG.rst).

Please confirm that using [calib_size, calib_size] (e.g., [512, 512] for a total of 1024 samples) rather than splitting a single calib_size value is intentional. If the goal is to maintain the same total number of calibration samples, you might want to use [calib_size // 2, calib_size // 2] instead.
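
To make the trade-off concrete, a small illustration (variable names are assumptions based on this review, not the actual ptq.py code):

```python
calib_size = 512

# As written in the PR: equal counts from each dataset, doubling the total.
calib_sizes_current = [calib_size, calib_size]          # 512 + 512 = 1024 samples

# Alternative if the intent is to keep the original total unchanged.
calib_sizes_split = [calib_size // 2, calib_size // 2]  # 256 + 256 = 512 samples
```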

examples/deepseek/README.md (1)

19-64: LGTM!

The documentation updates are comprehensive and well-structured:

  • Clearly separates DeepSeek V3/R1/V3.1 from V3.2 setup instructions
  • Provides version-specific calibration commands with appropriate config files
  • Documents additional dependencies required for V3.2 (fast-hadamard-transform)
  • Maintains clear workflow from setup through quantization

The structure makes it easy for users to follow the correct path for their target model version.


codecov bot commented Oct 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.39%. Comparing base (32d168c) to head (2223d4b).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #435      +/-   ##
==========================================
+ Coverage   73.38%   73.39%   +0.01%     
==========================================
  Files         180      180              
  Lines       17934    17976      +42     
==========================================
+ Hits        13160    13194      +34     
- Misses       4774     4782       +8     

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 left a comment

Approving as codeowner for pyproject.toml

Updated section headings and fixed paths in README.

Signed-off-by: Chenjie Luo <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e4bf16 and 90865c3.

📒 Files selected for processing (1)
  • examples/deepseek/README.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality

Comment on lines +31 to +64
```bash
# download the FP8 checkpoint from Hugginface.
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Exp --local-dir $HF_FP8_CKPT

# clone DeepSeek-V3.2 Github repository for FP8 inference,
git clone https://github.com/deepseek-ai/DeepSeek-V3.2-Exp.git && cd DeepSeek-V3.2-Exp && git checkout 3b99a53

# Install requirements
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
pip install -r inference/requirements.txt
```

### Convert the Checkpoint

```bash
# convert the HF checkpoint to a specific format for Deepseek
python inference/convert.py --hf-ckpt-path $HF_FP8_CKPT --save-path $DS_CKPT --n-experts 256 --model-parallel 8
```

### Post-training quantization
## Post-training quantization

### Run the calibration scripts

#### Run the calibration scripts
DeepSeek V3, R1, V3.1

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```

#### Quantize the FP8 hf checkpoint to FP4
DeepSeek V3.2

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```

⚠️ Potential issue | 🟠 Major

Add a cd .. before running the calibration command.

The V3.2 setup block leaves us inside DeepSeek-V3.2-Exp. If we run the calibration command as written from there, the path resolves to DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/... and fails. Please add a step (e.g., cd ..) after installing requirements so readers return to the project root before launching calibration.

🤖 Prompt for AI Agents
In examples/deepseek/README.md around lines 31 to 64, the instructions leave the
user inside the DeepSeek-V3.2-Exp repo after cloning and installing requirements
which causes subsequent calibration commands to resolve paths like
DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/... and fail; add a step immediately after
the pip install lines to change directory back to the project root (e.g., run cd
..) so the calibration torchrun commands run from the correct location.

Added SPDX license information to the file.

Signed-off-by: Chenjie Luo <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90865c3 and 9b64663.

📒 Files selected for processing (1)
  • examples/deepseek/ds_kernel.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality

Comment on lines +79 to +95
x (torch.Tensor): The quantized weight tensor of shape (M, N).
s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
block_size (int, optional): The block size to use for dequantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor of the same shape as `x`.
Raises:
AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
"""
assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
M, N = x.size()
y = torch.empty_like(x, dtype=torch.get_default_dtype())
grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
return y

⚠️ Potential issue | 🔴 Critical

Block-scale tensor shape can go out-of-bounds

The docstring and lack of shape checks let callers size s using integer division (M//block_size, N//block_size). When M or N isn’t an exact multiple of block_size, weight_dequant_kernel still launches trailing tiles (tl.cdiv) and loads s_ptr + pid_m * n + pid_n, which now indexes past the allocated scale buffer. That OOB read can yield silent corruption or a hard fault.

Please validate the shape up front (using ceil-div) and update the docstring accordingly, e.g.:

@@
-    s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
+    s (torch.Tensor): The scale tensor of shape (ceil_div(M, block_size), ceil_div(N, block_size)).
@@
-    M, N = x.size()
+    M, N = x.size()
+    m_blocks = (M + block_size - 1) // block_size
+    n_blocks = (N + block_size - 1) // block_size
+    assert s.size() == (m_blocks, n_blocks), \
+        f"Expected s.shape == ({m_blocks}, {n_blocks}), got {tuple(s.size())}"

This keeps the kernel within bounds and matches its launch configuration.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
x (torch.Tensor): The quantized weight tensor of shape (M, N).
s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
block_size (int, optional): The block size to use for dequantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor of the same shape as `x`.
Raises:
AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
"""
assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
M, N = x.size()
y = torch.empty_like(x, dtype=torch.get_default_dtype())
grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
return y
x (torch.Tensor): The quantized weight tensor of shape (M, N).
s (torch.Tensor): The scale tensor of shape (ceil_div(M, block_size), ceil_div(N, block_size)).
block_size (int, optional): The block size to use for dequantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor of the same shape as `x`.
Raises:
AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
"""
assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
M, N = x.size()
m_blocks = (M + block_size - 1) // block_size
n_blocks = (N + block_size - 1) // block_size
assert s.size() == (m_blocks, n_blocks), \
f"Expected s.shape == ({m_blocks}, {n_blocks}), got {tuple(s.size())}"
y = torch.empty_like(x, dtype=torch.get_default_dtype())
grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
return y
🤖 Prompt for AI Agents
In examples/deepseek/ds_kernel.py around lines 79 to 95, the scale tensor `s`
can be too small when M or N are not multiples of block_size because the kernel
launch uses ceil-div (triton.cdiv) but the docstring and current checks use
floor-div; update the docstring to state s.shape == (ceil(M/block_size),
ceil(N/block_size)) and add an assertion that s.dim()==2 and s.is_contiguous()
and s.size(0)==triton.cdiv(M, block_size) and s.size(1)==triton.cdiv(N,
block_size) (with a clear error message) before launching the kernel so the
kernel never reads out-of-bounds.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/deepseek/ptq.py (1)

67-81: Dynamic path resolution looks good, but consider validating imports.

The two-path resolution with V3.2 prioritized over V3 aligns well with the PR objectives. The fallback logic and error message are clear.

Minor observations:

  • Line 79: The import alias deekseep_model has a typo (should be deepseek_model), though it's used consistently throughout the file.
  • Consider adding basic validation after imports to ensure required symbols exist, e.g., verifying that deekseep_model.MoE and weight_dequant are available before proceeding.
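
A sketch of what that post-import validation might look like (hypothetical: the symbol names follow this review, and the surrounding ptq.py context is assumed):

```python
# Fail fast with a clear message if the resolved inference path or local kernel
# did not provide the expected symbols.
required = {
    "deekseep_model.MoE": getattr(deekseep_model, "MoE", None),
    "ds_kernel.weight_dequant": weight_dequant,
}
missing = [name for name, obj in required.items() if obj is None]
if missing:
    raise ImportError(f"Missing required symbols after DeepSeek path resolution: {missing}")
```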
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b64663 and 2223d4b.

📒 Files selected for processing (2)
  • examples/deepseek/ptq.py (5 hunks)
  • examples/deepseek/quantize_to_nvfp4.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/quantize_to_nvfp4.py (2)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (74-95)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (2)
  • amax (236-241)
  • amax (244-255)
examples/deepseek/ptq.py (4)
examples/deepseek/ds_kernel.py (1)
  • weight_dequant (74-95)
modelopt/torch/quantization/plugins/huggingface.py (9)
  • _setup (55-58)
  • _setup (161-164)
  • _setup (239-244)
  • _setup (349-350)
  • _setup (365-369)
  • _setup (388-390)
  • _setup (427-473)
  • _setup (601-612)
  • forward (71-119)
modelopt/torch/quantization/nn/modules/tensor_quantizer.py (1)
  • forward (847-946)
modelopt/torch/quantization/conversion.py (1)
  • register (325-366)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (4)
examples/deepseek/quantize_to_nvfp4.py (2)

153-154: Improved clarity with explicit parentheses.

The refactoring from / 6.0 / 448.0 to / (6.0 * 448.0) is mathematically equivalent and makes the division by the product more explicit and readable.


47-47: Good refactoring to use the standardized kernel.

The import change from a local implementation to the shared ds_kernel.weight_dequant improves maintainability. The function call at line 196 correctly uses the default block_size=128, which is the established standard for DeepSeek quantization as confirmed in ptq.py. The math refactoring at line 154 is also mathematically equivalent and improves clarity.
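
For context, a hedged sketch of the two-level NVFP4 scale this kind of expression computes (6.0 is the FP4 E2M1 maximum and 448.0 the FP8 E4M3 maximum; whether this mirrors the exact surrounding code in quantize_to_nvfp4.py beyond the regrouping is an assumption):

```python
import torch

amax = torch.tensor(12.5)              # hypothetical per-tensor weight amax
weight_scale_2 = amax / (6.0 * 448.0)  # same value as amax / 6.0 / 448.0, up to rounding
```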

examples/deepseek/ptq.py (2)

279-283: LGTM: Expanded calibration dataset coverage.

Adding the nemotron dataset alongside cnn_dailymail provides broader calibration coverage, which should improve quantization quality for DeepSeek V3.2.


343-348: Commented-out instrumentation is fine.

This appears to be debugging instrumentation for analyzing expert activation patterns during calibration. Since it's commented out, there's no runtime impact.

Comment on lines +201 to +220
class CalibMoe(deekseep_model.MoE):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._setup()

def _setup(self):
self._original_topk = self.gate.topk
self._original_topk_groups = self.gate.topk_groups

def forward(self, x: torch.Tensor) -> torch.Tensor:
# Forward all tokens to all experts for calibration
self.gate.topk = self.n_routed_experts
self.gate.topk_groups = self.gate.n_groups
super().forward(x)
# Restore the original topk and topk_groups
self.gate.topk = self._original_topk
self.gate.topk_groups = self._original_topk_groups

return super().forward(x)


⚠️ Potential issue | 🟠 Major

Fix the double forward pass in CalibMoe.

The forward() method calls super().forward(x) twice on lines 214 and 219:

  1. Line 214: Routes all tokens to all experts (for calibration) but discards the result.
  2. Line 219: Routes with original topk settings and returns this result.

This double invocation means every forward pass during calibration runs inference twice through the MoE layer, which is extremely expensive for large models.

Expected behavior: The forward pass should only route to all experts during calibration phase (when quantizers collect statistics), then switch to normal routing once calibration is complete. The current implementation runs both routing strategies on every call.

Consider this fix:

 class CalibMoe(deekseep_model.MoE):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self._setup()

     def _setup(self):
-        self._original_topk = self.gate.topk
-        self._original_topk_groups = self.gate.topk_groups
+        # Route to all experts during calibration
+        self.gate.topk = self.n_routed_experts
+        self.gate.topk_groups = self.gate.n_groups

     def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # Forward all tokens to all experts for calibration
-        self.gate.topk = self.n_routed_experts
-        self.gate.topk_groups = self.gate.n_groups
-        super().forward(x)
-        # Restore the original topk and topk_groups
-        self.gate.topk = self._original_topk
-        self.gate.topk_groups = self._original_topk_groups
-
         return super().forward(x)

This sets the all-expert routing once during setup and keeps it throughout calibration, eliminating the double forward pass.
