[OMNIML-2857] Support the DeepSeek V3.2 model #435
base: main
Conversation
Signed-off-by: Chenjie Luo <[email protected]>
Walkthrough

Adds DeepSeek V3.2 quantization support and documentation; introduces a Triton-based weight dequantization kernel and Python wrapper; updates PTQ to prefer V3.2 with fallback to V3 and expands calibration datasets; adjusts quantization scripts/copy behavior; and updates lint ignores for the new kernel.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant P as PTQ Script (ptq.py)
    participant FS as Filesystem
    participant D32 as DeepSeek-V3.2-Exp/inference
    participant D3 as DeepSeek-V3/inference
    U->>P: Run PTQ
    P->>FS: Check path: DeepSeek-V3.2-Exp/inference
    alt V3.2 found
        FS-->>P: Exists
        P->>D32: Import & patch (uses ds_kernel.weight_dequant)
    else V3 fallback
        P->>FS: Check path: DeepSeek-V3/inference
        FS-->>P: Exists
        P->>D3: Import & patch (uses ds_kernel.weight_dequant)
    else Neither
        P-->>U: Raise ValueError (neither path exists)
    end
    Note over P: Build calibration dataloaders for<br/>["cnn_dailymail","nemotron-post-training-dataset-v2"]
    P-->>U: Run calibration / quantization
```
```mermaid
sequenceDiagram
    autonumber
    participant Q as Quantize Script
    participant DK as ds_kernel.weight_dequant
    participant TK as Triton Kernel
    participant T as PyTorch
    Q->>DK: weight_dequant(x, s, block_size)
    DK->>T: Validate tensors, compute M,N, allocate output y
    DK->>TK: Launch weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE)
    TK-->>DK: Dequantized tile results (y)
    DK-->>Q: Return full dequantized weights y
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Signed-off-by: Chenjie Luo <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/deepseek/quantize_fp8_to_nvfp4.sh (1)
81-83: Consider adding quotes for robustness.

The fault-tolerant copy operations are well-designed: JSON files (required) fail the script if missing, while Python files and assets (optional) use `|| true` to prevent exit under `set -e`.

For robustness against paths with spaces or special characters, consider quoting the variables and paths:

```diff
-cp $FP8_HF_PATH/*.json $FP4_PATH/
-cp $FP8_HF_PATH/*.py $FP4_PATH/ || true
-cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true
+cp "$FP8_HF_PATH"/*.json "$FP4_PATH"/
+cp "$FP8_HF_PATH"/*.py "$FP4_PATH"/ || true
+cp -r "$FP8_HF_PATH"/assets "$FP4_PATH"/ || true
```

examples/deepseek/ds_kernel.py (1)
37-58: Consider validating scale tensor shape.

The wrapper function correctly validates input contiguity and dimensionality, launches the kernel with proper grid dimensions, and returns the dequantized result.

For robustness, consider adding a check to validate that the scale tensor shape matches the expected dimensions:

```diff
 def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
     """..."""
     assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
     assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
     M, N = x.size()
+    expected_s_shape = ((M + block_size - 1) // block_size, (N + block_size - 1) // block_size)
+    assert s.shape == expected_s_shape, f"Scale tensor shape {s.shape} doesn't match expected {expected_s_shape}"
     y = torch.empty_like(x, dtype=torch.get_default_dtype())
     grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
     weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
     return y
```

Note on dtype handling: Line 55 uses `torch.get_default_dtype()` for the output tensor, which respects the global dtype context (e.g., `torch.set_default_dtype(torch.bfloat16)` in ptq.py). This is appropriate for the DeepSeek workflow.
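As a quick sketch of that dtype behavior (assumes a CUDA device with Triton available and `ds_kernel` importable; the shapes and values are illustrative, not from the PR):

```python
import torch
from ds_kernel import weight_dequant

torch.set_default_dtype(torch.bfloat16)  # as ptq.py does, per the note above

M, N, block_size = 256, 256, 128
x = torch.randn(M, N, dtype=torch.float32, device="cuda")                 # placeholder quantized weights
s = torch.rand(M // block_size, N // block_size, dtype=torch.float32, device="cuda")  # per-block scales
y = weight_dequant(x, s, block_size=block_size)
print(y.dtype)  # torch.bfloat16 — the output follows the global default dtype
```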
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- CHANGELOG.rst (1 hunks)
- examples/deepseek/.gitignore (1 hunks)
- examples/deepseek/README.md (1 hunks)
- examples/deepseek/ds_kernel.py (1 hunks)
- examples/deepseek/ptq.py (2 hunks)
- examples/deepseek/quantize_fp8_to_nvfp4.sh (1 hunks)
- examples/deepseek/quantize_to_nvfp4.py (1 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/ptq.py (1)
- examples/deepseek/ds_kernel.py (1): weight_dequant (37-58)

examples/deepseek/quantize_to_nvfp4.py (1)
- examples/deepseek/ds_kernel.py (1): weight_dequant (37-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (7)
examples/deepseek/.gitignore (1)
2-2: LGTM!

The addition of `DeepSeek-V3.2-Exp/` follows the existing pattern for ignoring model directories and is consistent with the V3.2 support being added.

CHANGELOG.rst (1)
16-16: LGTM!

The changelog entry clearly documents the new DeepSeek V3.2 model quantization support and properly references the examples directory.
pyproject.toml (1)
68-68: LGTM!

The lint ignore rules for `ds_kernel.py` are appropriate for Triton-style code and consistent with the existing rules for `modelopt/torch/quantization/triton/*` at line 67. The rules allow uppercase variable names (N803, N806) and lambda assignments (E731), which are idiomatic in Triton kernels.

examples/deepseek/quantize_to_nvfp4.py (1)
47-47: LGTM!

Replacing the local kernel import with the external `ds_kernel` module simplifies the import path and eliminates the need for `sys.path` manipulation. The new `ds_kernel.py` module (reviewed separately) provides a compatible `weight_dequant` API.

examples/deepseek/ptq.py (2)
67-81: LGTM! Clear path resolution and mixed import strategy.

The path resolution logic cleanly handles both DeepSeek V3 and V3.2, with V3.2 taking precedence when available. The error message is clear and helpful.

The import strategy is well-designed (see the sketch below):

- `model` and `kernel` modules come from the DeepSeek inference directory (via sys.path)
- `ds_kernel` comes from the local examples/deepseek/ directory (maintained in the ModelOpt repo)

This separation allows ModelOpt to provide its own optimized kernels while still using the original DeepSeek model implementation.
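A minimal sketch of that resolution-and-import flow, under the assumption that the paths, fallback order, and module names match the description above (the exact ptq.py code may differ):

```python
import sys
from pathlib import Path

# Prefer the V3.2 inference directory, fall back to V3 (per the review above).
v32 = Path("DeepSeek-V3.2-Exp/inference")
v3 = Path("DeepSeek-V3/inference")
if v32.is_dir():
    sys.path.insert(0, str(v32))
elif v3.is_dir():
    sys.path.insert(0, str(v3))
else:
    raise ValueError("Neither DeepSeek-V3.2-Exp/inference nor DeepSeek-V3/inference exists")

import model as deekseep_model  # noqa: E402 — DeepSeek inference module, via sys.path
from ds_kernel import weight_dequant  # noqa: E402 — local ModelOpt-maintained kernel
```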
258-261: LGTM! Improved calibration diversity.

Using two calibration datasets (`cnn_dailymail` and `nemotron-post-training-dataset-v2`) with equal sample counts improves calibration quality through better data diversity.

examples/deepseek/ds_kernel.py (1)
8-34: LGTM! Well-structured Triton kernel.

The `weight_dequant_kernel` follows standard Triton patterns:

- Proper 2D grid decomposition with `pid_m` and `pid_n`
- Correct boundary masking to handle non-multiple dimensions
- Efficient block-wise dequantization (`y = x * s`)

The scale tensor indexing at line 32 (`pid_m * n + pid_n`) correctly accesses the row-major flattened scale tensor.
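The kernel source itself is not quoted in this review. As a rough sketch, a Triton kernel matching the patterns described above might look like the following — an assumption-based reconstruction, not the actual `ds_kernel.py` source:

```python
import triton
import triton.language as tl

@triton.jit
def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
    # 2D grid decomposition: one program per (BLOCK_SIZE x BLOCK_SIZE) tile.
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    n = tl.cdiv(N, BLOCK_SIZE)  # number of block columns in the scale tensor
    offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Boundary mask for dimensions that are not multiples of BLOCK_SIZE.
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    offs = offs_m[:, None] * N + offs_n[None, :]
    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    # One scale per tile, indexed into the row-major flattened scale tensor.
    s = tl.load(s_ptr + pid_m * n + pid_n)
    tl.store(y_ptr + offs, x * s, mask=mask)
```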
Actionable comments posted: 0
🧹 Nitpick comments (2)
examples/deepseek/quantize_fp8_to_nvfp4.sh (1)
81-83: Consider adding explicit file existence checks.

The separated copy commands with `|| true` make the script more tolerant of missing files, which is good for optional files like Python modules or assets. However, JSON configuration files are typically required for quantization to succeed.

Consider adding explicit existence checks for required files to provide clearer error messages:

```diff
 mkdir -p $FP4_PATH
-cp $FP8_HF_PATH/*.json $FP4_PATH/
+if ! cp $FP8_HF_PATH/*.json $FP4_PATH/ 2>/dev/null; then
+    echo "Error: No JSON files found in $FP8_HF_PATH"
+    exit 1
+fi
 cp $FP8_HF_PATH/*.py $FP4_PATH/ || true
 cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true
```

examples/deepseek/ds_kernel.py (1)
59-80: Consider adding shape validation for the scale tensor.

The function validates that inputs are contiguous and 2D, but doesn't verify that the scale tensor `s` has the expected shape. Based on the kernel implementation (line 54: `s_ptr + pid_m * n + pid_n`), the scale tensor should have shape `(M // block_size, N // block_size)`.

Add validation after line 76:

```diff
 M, N = x.size()
+expected_s_shape = (triton.cdiv(M, block_size), triton.cdiv(N, block_size))
+assert s.shape == expected_s_shape, (
+    f"Scale tensor shape {s.shape} doesn't match expected {expected_s_shape} "
+    f"for input shape ({M}, {N}) with block_size={block_size}"
+)
 y = torch.empty_like(x, dtype=torch.get_default_dtype())
```

This will catch mismatched tensor shapes early and provide a clear error message rather than allowing out-of-bounds memory access or silent incorrect results.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- CHANGELOG.rst (1 hunks)
- examples/deepseek/.gitignore (1 hunks)
- examples/deepseek/README.md (1 hunks)
- examples/deepseek/ds_kernel.py (1 hunks)
- examples/deepseek/ptq.py (2 hunks)
- examples/deepseek/quantize_fp8_to_nvfp4.sh (1 hunks)
- examples/deepseek/quantize_to_nvfp4.py (1 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/quantize_to_nvfp4.py (1)
- examples/deepseek/ds_kernel.py (1): weight_dequant (59-80)

examples/deepseek/ptq.py (1)
- examples/deepseek/ds_kernel.py (1): weight_dequant (59-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (7)
examples/deepseek/.gitignore (1)
2-2: LGTM!

The addition of `DeepSeek-V3.2-Exp/` to the ignore list appropriately excludes the new V3.2 model directory from version control, consistent with the existing ignore pattern for `DeepSeek-V3/`.

CHANGELOG.rst (1)
16-16: LGTM!

The changelog entry clearly documents the new DeepSeek V3.2 quantization feature and directs users to the appropriate documentation.
pyproject.toml (1)
68-68: LGTM!

The lint suppression for `ds_kernel.py` is appropriate. Triton kernels follow different naming conventions (lowercase variable names for tensor pointers, lambda functions for grid configuration), and exempting them from standard Python linting rules (N803, N806, E731) is consistent with the existing approach for other Triton code in the repository.

examples/deepseek/quantize_to_nvfp4.py (1)
47-47: LGTM!

The refactoring to import `weight_dequant` from the centralized `ds_kernel` module is a good improvement. It eliminates the need for local path manipulation and ensures consistent dequantization behavior across the codebase.

examples/deepseek/ptq.py (2)
67-81: LGTM!

The path resolution strategy is well-designed:
- Clearly defines both V3 and V3.2 paths
- Prioritizes V3.2 when present, providing backward compatibility with V3
- Raises a descriptive error if neither path exists
- Uses `noqa: E402` appropriately for imports that must follow `sys.path` modification

This approach cleanly supports both model versions without code duplication.
258-261: Verify the expanded calibration dataset configuration.

The calibration now uses a combination of two datasets (`cnn_dailymail` and `nemotron-post-training-dataset-v2`) with equal sample counts from each. This change aligns with the broader improvements mentioned in the changelog (line 14 of CHANGELOG.rst).

Please confirm that using `[calib_size, calib_size]` (e.g., `[512, 512]` for a total of 1024 samples) rather than splitting a single `calib_size` value is intentional. If the goal is to maintain the same total number of calibration samples, you might want to use `[calib_size // 2, calib_size // 2]` instead, as illustrated below.
19-64: LGTM!

The documentation updates are comprehensive and well-structured:
- Clearly separates DeepSeek V3/R1/V3.1 from V3.2 setup instructions
- Provides version-specific calibration commands with appropriate config files
- Documents additional dependencies required for V3.2 (fast-hadamard-transform)
- Maintains clear workflow from setup through quantization
The structure makes it easy for users to follow the correct path for their target model version.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #435      +/-   ##
==========================================
+ Coverage   73.38%   73.39%   +0.01%
==========================================
  Files         180      180
  Lines       17934    17976      +42
==========================================
+ Hits        13160    13194      +34
- Misses       4774     4782       +8
```

☔ View full report in Codecov by Sentry.
Approving as codeowner for pyproject.toml
Updated section headings and fixed paths in README. Signed-off-by: Chenjie Luo <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/deepseek/README.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
```bash
# download the FP8 checkpoint from Hugginface.
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Exp --local-dir $HF_FP8_CKPT

# clone DeepSeek-V3.2 Github repository for FP8 inference,
git clone https://github.com/deepseek-ai/DeepSeek-V3.2-Exp.git && cd DeepSeek-V3.2-Exp && git checkout 3b99a53

# Install requirements
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
pip install -r inference/requirements.txt
```

### Convert the Checkpoint

```bash
# convert the HF checkpoint to a specific format for Deepseek
python inference/convert.py --hf-ckpt-path $HF_FP8_CKPT --save-path $DS_CKPT --n-experts 256 --model-parallel 8
```

## Post-training quantization

### Run the calibration scripts

DeepSeek V3, R1, V3.1

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```

DeepSeek V3.2

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```
Add a `cd ..` before running the calibration command.

The V3.2 setup block leaves us inside `DeepSeek-V3.2-Exp`. If we run the calibration command as written from there, the path resolves to `DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/...` and fails. Please add a step (e.g., `cd ..`) after installing requirements so readers return to the project root before launching calibration; a sketch of the fixed sequence follows.
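A minimal sketch of the fixed setup tail — the commands are quoted from the README block above; only the `cd ..` step is new:

```bash
# Install requirements
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
pip install -r inference/requirements.txt

# Return to the project root so paths like DeepSeek-V3.2-Exp/... resolve correctly
cd ..
```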
🤖 Prompt for AI Agents
In examples/deepseek/README.md around lines 31 to 64, the instructions leave the
user inside the DeepSeek-V3.2-Exp repo after cloning and installing requirements
which causes subsequent calibration commands to resolve paths like
DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/... and fail; add a step immediately after
the pip install lines to change directory back to the project root (e.g., run cd
..) so the calibration torchrun commands run from the correct location.
Added SPDX license information to the file. Signed-off-by: Chenjie Luo <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/deepseek/ds_kernel.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
```python
        x (torch.Tensor): The quantized weight tensor of shape (M, N).
        s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
        block_size (int, optional): The block size to use for dequantization. Defaults to 128.

    Returns:
        torch.Tensor: The dequantized weight tensor of the same shape as `x`.

    Raises:
        AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
    """
    assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
    assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
    M, N = x.size()
    y = torch.empty_like(x, dtype=torch.get_default_dtype())
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
    weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
    return y
```
Block-scale tensor shape can go out-of-bounds
The docstring and lack of shape checks let callers size `s` using integer division `(M//block_size, N//block_size)`. When M or N isn't an exact multiple of `block_size`, `weight_dequant_kernel` still launches trailing tiles (`tl.cdiv`) and loads `s_ptr + pid_m * n + pid_n`, which now indexes past the allocated scale buffer. For example, with M = 300 and block_size = 128, floor division allocates 2 row blocks while the kernel launches ceil(300/128) = 3, so the trailing row of tiles reads out of bounds. That OOB read can yield silent corruption or a hard fault.
Please validate the shape up front (using ceil-div) and update the docstring accordingly, e.g.:
```diff
@@
-        s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
+        s (torch.Tensor): The scale tensor of shape (ceil_div(M, block_size), ceil_div(N, block_size)).
@@
     M, N = x.size()
+    m_blocks = (M + block_size - 1) // block_size
+    n_blocks = (N + block_size - 1) // block_size
+    assert s.size() == (m_blocks, n_blocks), \
+        f"Expected s.shape == ({m_blocks}, {n_blocks}), got {tuple(s.size())}"
```

This keeps the kernel within bounds and matches its launch configuration.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        x (torch.Tensor): The quantized weight tensor of shape (M, N).
        s (torch.Tensor): The scale tensor of shape (ceil_div(M, block_size), ceil_div(N, block_size)).
        block_size (int, optional): The block size to use for dequantization. Defaults to 128.

    Returns:
        torch.Tensor: The dequantized weight tensor of the same shape as `x`.

    Raises:
        AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
    """
    assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
    assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
    M, N = x.size()
    m_blocks = (M + block_size - 1) // block_size
    n_blocks = (N + block_size - 1) // block_size
    assert s.size() == (m_blocks, n_blocks), \
        f"Expected s.shape == ({m_blocks}, {n_blocks}), got {tuple(s.size())}"
    y = torch.empty_like(x, dtype=torch.get_default_dtype())
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
    weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
    return y
```
🤖 Prompt for AI Agents
In examples/deepseek/ds_kernel.py around lines 79 to 95, the scale tensor `s`
can be too small when M or N are not multiples of block_size because the kernel
launch uses ceil-div (triton.cdiv) but the docstring and current checks use
floor-div; update the docstring to state s.shape == (ceil(M/block_size),
ceil(N/block_size)) and add an assertion that s.dim()==2 and s.is_contiguous()
and s.size(0)==triton.cdiv(M, block_size) and s.size(1)==triton.cdiv(N,
block_size) (with a clear error message) before launching the kernel so the
kernel never reads out-of-bounds.
Signed-off-by: Chenjie Luo <[email protected]>
…imizer into chenjiel/dsv32
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/deepseek/ptq.py (1)
67-81: Dynamic path resolution looks good, but consider validating imports.

The two-path resolution with V3.2 prioritized over V3 aligns well with the PR objectives. The fallback logic and error message are clear.

Minor observations:

- Line 79: The import alias `deekseep_model` has a typo (should be `deepseek_model`), though it's used consistently throughout the file.
- Consider adding basic validation after imports to ensure required symbols exist, e.g., verifying that `deekseep_model.MoE` and `weight_dequant` are available before proceeding; a sketch follows.
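A hedged sketch of such a post-import check — the placement in ptq.py and the error messages are assumptions; the names `deekseep_model.MoE` and `weight_dequant` come from the file as described above:

```python
# Hypothetical sanity check after the dynamic imports described above.
assert hasattr(deekseep_model, "MoE"), (
    "The imported DeepSeek inference 'model' module does not expose MoE; "
    "check that the right inference directory is on sys.path"
)
assert callable(weight_dequant), "ds_kernel.weight_dequant is missing or not callable"
```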
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/deepseek/ptq.py (5 hunks)
- examples/deepseek/quantize_to_nvfp4.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/deepseek/quantize_to_nvfp4.py (2)
- examples/deepseek/ds_kernel.py (1): weight_dequant (74-95)
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py (2): amax (236-241), amax (244-255)

examples/deepseek/ptq.py (4)
- examples/deepseek/ds_kernel.py (1): weight_dequant (74-95)
- modelopt/torch/quantization/plugins/huggingface.py (9): _setup (55-58), _setup (161-164), _setup (239-244), _setup (349-350), _setup (365-369), _setup (388-390), _setup (427-473), _setup (601-612), forward (71-119)
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py (1): forward (847-946)
- modelopt/torch/quantization/conversion.py (1): register (325-366)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (4)
examples/deepseek/quantize_to_nvfp4.py (2)
153-154: Improved clarity with explicit parentheses.

The refactoring from `/ 6.0 / 448.0` to `/ (6.0 * 448.0)` is mathematically equivalent and makes the division by the product more explicit and readable.
47-47: Good refactoring to use the standardized kernel.

The import change from a local implementation to the shared `ds_kernel.weight_dequant` improves maintainability. The function call at line 196 correctly uses the default `block_size=128`, which is the established standard for DeepSeek quantization as confirmed in ptq.py. The math refactoring at line 154 is also mathematically equivalent and improves clarity.

examples/deepseek/ptq.py (2)
279-283: LGTM: Expanded calibration dataset coverage.

Adding the nemotron dataset alongside cnn_dailymail provides broader calibration coverage, which should improve quantization quality for DeepSeek V3.2.
343-348: Commented-out instrumentation is fine.

This appears to be debugging instrumentation for analyzing expert activation patterns during calibration. Since it's commented out, there's no runtime impact.
```python
class CalibMoe(deekseep_model.MoE):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._setup()

    def _setup(self):
        self._original_topk = self.gate.topk
        self._original_topk_groups = self.gate.topk_groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward all tokens to all experts for calibration
        self.gate.topk = self.n_routed_experts
        self.gate.topk_groups = self.gate.n_groups
        super().forward(x)
        # Restore the original topk and topk_groups
        self.gate.topk = self._original_topk
        self.gate.topk_groups = self._original_topk_groups

        return super().forward(x)
```
Fix the double forward pass in CalibMoe.
The `forward()` method calls `super().forward(x)` twice, on lines 214 and 219:
- Line 214: Routes all tokens to all experts (for calibration) but discards the result.
- Line 219: Routes with original topk settings and returns this result.
This double invocation means every forward pass during calibration runs inference twice through the MoE layer, which is extremely expensive for large models.
Expected behavior: The forward pass should only route to all experts during calibration phase (when quantizers collect statistics), then switch to normal routing once calibration is complete. The current implementation runs both routing strategies on every call.
Consider this fix:
```diff
 class CalibMoe(deekseep_model.MoE):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self._setup()

     def _setup(self):
-        self._original_topk = self.gate.topk
-        self._original_topk_groups = self.gate.topk_groups
+        # Route to all experts during calibration
+        self.gate.topk = self.n_routed_experts
+        self.gate.topk_groups = self.gate.n_groups

     def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # Forward all tokens to all experts for calibration
-        self.gate.topk = self.n_routed_experts
-        self.gate.topk_groups = self.gate.n_groups
-        super().forward(x)
-        # Restore the original topk and topk_groups
-        self.gate.topk = self._original_topk
-        self.gate.topk_groups = self._original_topk_groups
-
         return super().forward(x)
```

This sets the all-expert routing once during setup and keeps it throughout calibration, eliminating the double forward pass.
What does this PR do?
Type of change: New model support
Overview: ?
Usage
Please see examples/deepseek/README.md
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Chores