1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -13,6 +13,7 @@ Model Optimizer Changelog (Linux)
- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to a mix of ``cnn_dailymail`` and ``nemotron-post-training-dataset-v2`` if no dataset is specified.
- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
- Support ``DeepSeek V3.2`` model quantization. See ``examples/deepseek`` for more details.

**Documentation**

1 change: 1 addition & 0 deletions examples/deepseek/.gitignore
@@ -1 +1,2 @@
DeepSeek-V3/
DeepSeek-V3.2-Exp/
40 changes: 35 additions & 5 deletions examples/deepseek/README.md
@@ -6,34 +6,64 @@ This example will demonstrate the steps to quantize DeepSeek R1 model to FP4 and

Due to the model size, quantizing the FP8 model currently requires 8xH200 or 16xH100; we use 8xH200 in this example.

### Convert the HF checkpoint for deepseek FP8 inference
## Convert the HF checkpoint for deepseek FP8 inference

```bash
# set up variables to run the example
export HF_FP8_CKPT={path_to_downloaded_hf_checkpoint}
export DS_CKPT={path_to_save_converted_checkpoint}
export FP4_QUANT_PATH={path_to_save_quantization_results}
export HF_FP4_PATH={path_to_save_the_final_FP4_checkpoint}
```

### DeepSeek V3, R1, V3.1

# download the FP8 checkpoint from Hugginface
```bash
# download the FP8 checkpoint from Hugging Face. This example uses DeepSeek-R1
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir $HF_FP8_CKPT

# clone the DeepSeek-V3 (base model of R1) GitHub repository for FP8 inference
git clone https://github.com/deepseek-ai/DeepSeek-V3.git && cd DeepSeek-V3 && git checkout 1398800
```

### DeepSeek V3.2

```bash
# download the FP8 checkpoint from Hugging Face.
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Exp --local-dir $HF_FP8_CKPT

# clone the DeepSeek-V3.2 GitHub repository for FP8 inference
git clone https://github.com/deepseek-ai/DeepSeek-V3.2-Exp.git && cd DeepSeek-V3.2-Exp && git checkout 3b99a53

# Install requirements
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
pip install -r inference/requirements.txt
```

### Convert the Checkpoint

```bash
# convert the HF checkpoint to a specific format for DeepSeek
python inference/convert.py --hf-ckpt-path $HF_FP8_CKPT --save-path $DS_CKPT --n-experts 256 --model-parallel 8
```

### Post-training quantization
## Post-training quantization

### Run the calibration scripts

#### Run the calibration scripts
DeepSeek V3, R1, V3.1

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```

#### Quantize the FP8 hf checkpoint to FP4
DeepSeek V3.2

```bash
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
```
Comment on lines +31 to +64
⚠️ Potential issue | 🟠 Major

Add a cd .. before running the calibration command.

The V3.2 setup block leaves us inside DeepSeek-V3.2-Exp. If we run the calibration command as written from there, the path resolves to DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/... and fails. Please add a step (e.g., cd ..) after installing requirements so readers return to the project root before launching calibration.
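
A minimal sketch of the adjusted setup block, reusing the commands already in this README (the only new step is the final `cd ..`):

```bash
# clone the DeepSeek-V3.2 GitHub repository for FP8 inference
git clone https://github.com/deepseek-ai/DeepSeek-V3.2-Exp.git && cd DeepSeek-V3.2-Exp && git checkout 3b99a53

# Install requirements
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
pip install -r inference/requirements.txt

# Return to the project root so the later torchrun command finds DeepSeek-V3.2-Exp/inference/...
cd ..
```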

🤖 Prompt for AI Agents
In examples/deepseek/README.md around lines 31 to 64, the instructions leave the
user inside the DeepSeek-V3.2-Exp repo after cloning and installing requirements
which causes subsequent calibration commands to resolve paths like
DeepSeek-V3.2-Exp/DeepSeek-V3.2-Exp/... and fail; add a step immediately after
the pip install lines to change directory back to the project root (e.g., run cd
..) so the calibration torchrun commands run from the correct location.


### Quantize the FP8 HF checkpoint to FP4

We provide a one-step script that will:

95 changes: 95 additions & 0 deletions examples/deepseek/ds_kernel.py
@@ -0,0 +1,95 @@
# MIT License

# Copyright (c) 2023 DeepSeek

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
import triton
import triton.language as tl

"""Reference: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/kernel.py"""


@triton.jit
def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
"""
Dequantizes weights using the provided scaling factors and stores the result.
Args:
x_ptr (tl.pointer): Pointer to the quantized weights.
s_ptr (tl.pointer): Pointer to the scaling factors.
y_ptr (tl.pointer): Pointer to the output buffer for dequantized weights.
M (int): Number of rows in the weight matrix.
N (int): Number of columns in the weight matrix.
BLOCK_SIZE (tl.constexpr): Size of the block for tiling.
Returns:
None
"""
pid_m = tl.program_id(axis=0)
pid_n = tl.program_id(axis=1)
n = tl.cdiv(N, BLOCK_SIZE)
offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
offs = offs_m[:, None] * N + offs_n[None, :]
mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
s = tl.load(s_ptr + pid_m * n + pid_n)
y = x * s
tl.store(y_ptr + offs, y, mask=mask)


def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
"""
Dequantizes the given weight tensor using the provided scale tensor.
Args:
x (torch.Tensor): The quantized weight tensor of shape (M, N).
s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
block_size (int, optional): The block size to use for dequantization. Defaults to 128.
Returns:
torch.Tensor: The dequantized weight tensor of the same shape as `x`.
Raises:
AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
"""
assert x.is_contiguous() and s.is_contiguous(), "Input tensors must be contiguous"
assert x.dim() == 2 and s.dim() == 2, "Input tensors must have 2 dimensions"
M, N = x.size()
y = torch.empty_like(x, dtype=torch.get_default_dtype())
grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
return y
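
For readers trying the new helper in isolation, a minimal usage sketch (editor-added, not part of this diff; assumes a CUDA device with Triton FP8 support and the default 128x128 block layout):

```python
import torch

from ds_kernel import weight_dequant

block_size = 128
M, N = 1024, 512  # toy dimensions, both multiples of block_size here

# Fake block-quantized FP8 weight plus one scale per 128x128 tile
w_fp8 = torch.randn(M, N, device="cuda").to(torch.float8_e4m3fn)
scales = torch.rand(
    (M + block_size - 1) // block_size,  # ceil-division, matching the kernel's launch grid
    (N + block_size - 1) // block_size,
    device="cuda",
    dtype=torch.float32,
)

# Dequantize back to the default dtype (float32 unless torch.set_default_dtype was changed)
w = weight_dequant(w_fp8, scales, block_size)
print(w.shape, w.dtype)  # torch.Size([1024, 512]) torch.float32
```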
Comment on lines +79 to +95
⚠️ Potential issue | 🔴 Critical

Block-scale tensor shape can go out-of-bounds

The docstring and lack of shape checks let callers size s using integer division (M//block_size, N//block_size). When M or N isn’t an exact multiple of block_size, weight_dequant_kernel still launches trailing tiles (tl.cdiv) and loads s_ptr + pid_m * n + pid_n, which now indexes past the allocated scale buffer. That OOB read can yield silent corruption or a hard fault.

Please validate the shape up front (using ceil-div) and update the docstring accordingly, e.g.:

@@
-    s (torch.Tensor): The scale tensor of shape (M//block_size, N//block_size).
+    s (torch.Tensor): The scale tensor of shape (ceil_div(M, block_size), ceil_div(N, block_size)).
@@
-    M, N = x.size()
+    M, N = x.size()
+    m_blocks = (M + block_size - 1) // block_size
+    n_blocks = (N + block_size - 1) // block_size
+    assert s.size() == (m_blocks, n_blocks), \
+        f"Expected s.shape == ({m_blocks}, {n_blocks}), got {tuple(s.size())}"

This keeps the kernel within bounds and matches its launch configuration.
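
A quick numeric illustration of the mismatch described above, using hypothetical non-aligned dimensions:

```python
import math

M, N, block_size = 1000, 700, 128  # hypothetical sizes that are not multiples of 128

floor_shape = (M // block_size, N // block_size)                      # (7, 5): what the docstring implies
launch_grid = (math.ceil(M / block_size), math.ceil(N / block_size))  # (8, 6): tiles the kernel actually launches

# The kernel reads s_ptr + pid_m * cdiv(N, BLOCK_SIZE) + pid_n with pid_m < 8 and pid_n < 6,
# so a (7, 5) scale tensor is indexed out of bounds on the trailing tiles.
print(floor_shape, launch_grid)  # (7, 5) (8, 6)
```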

🤖 Prompt for AI Agents
In examples/deepseek/ds_kernel.py around lines 79 to 95, the scale tensor `s`
can be too small when M or N are not multiples of block_size because the kernel
launch uses ceil-div (triton.cdiv) but the docstring and current checks use
floor-div; update the docstring to state s.shape == (ceil(M/block_size),
ceil(N/block_size)) and add an assertion that s.dim()==2 and s.is_contiguous()
and s.size(0)==triton.cdiv(M, block_size) and s.size(1)==triton.cdiv(N,
block_size) (with a clear error message) before launching the kernel so the
kernel never reads out-of-bounds.

50 changes: 45 additions & 5 deletions examples/deepseek/ptq.py
@@ -64,9 +64,21 @@
from modelopt.torch.utils.dataset_utils import get_dataset_dataloader
from modelopt.torch.utils.distributed import ParallelState

sys.path.append(str(Path(__file__).resolve().parent / "DeepSeek-V3/inference"))
import model as deekseep_model
from kernel import act_quant, fp8_gemm, weight_dequant
DS_V3_PATH = Path(__file__).resolve().parent / "DeepSeek-V3/inference"
DS_V3_2_PATH = Path(__file__).resolve().parent / "DeepSeek-V3.2-Exp/inference"

if DS_V3_2_PATH.exists():
sys.path.append(str(DS_V3_2_PATH))
elif DS_V3_PATH.exists():
sys.path.append(str(DS_V3_PATH))
else:
raise ValueError(
f"DeepSeek-V3 or DeepSeek-V3.2-Exp not found in {Path(__file__).resolve().parent}"
)

import model as deekseep_model # noqa: E402
from ds_kernel import weight_dequant # noqa: E402
from kernel import act_quant, fp8_gemm # noqa: E402


def monkey_patch_deepseek_model():
@@ -186,6 +198,26 @@ def _setup(self):
self.kv_bmm_quantizer = TensorQuantizer()
self.pe_bmm_quantizer = TensorQuantizer()

class CalibMoe(deekseep_model.MoE):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._setup()

def _setup(self):
self._original_topk = self.gate.topk
self._original_topk_groups = self.gate.topk_groups

def forward(self, x: torch.Tensor) -> torch.Tensor:
# Forward all tokens to all experts for calibration
self.gate.topk = self.n_routed_experts
self.gate.topk_groups = self.gate.n_groups
super().forward(x)
# Restore the original topk and topk_groups
self.gate.topk = self._original_topk
self.gate.topk_groups = self._original_topk_groups

return super().forward(x)

Comment on lines +201 to +220
⚠️ Potential issue | 🟠 Major

Fix the double forward pass in CalibMoe.

The forward() method calls super().forward(x) twice on lines 214 and 219:

  1. Line 214: Routes all tokens to all experts (for calibration) but discards the result.
  2. Line 219: Routes with original topk settings and returns this result.

This double invocation means every forward pass during calibration runs inference twice through the MoE layer, which is extremely expensive for large models.

Expected behavior: The forward pass should only route to all experts during calibration phase (when quantizers collect statistics), then switch to normal routing once calibration is complete. The current implementation runs both routing strategies on every call.

Consider this fix:

 class CalibMoe(deekseep_model.MoE):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self._setup()

     def _setup(self):
-        self._original_topk = self.gate.topk
-        self._original_topk_groups = self.gate.topk_groups
+        # Route to all experts during calibration
+        self.gate.topk = self.n_routed_experts
+        self.gate.topk_groups = self.gate.n_groups

     def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # Forward all tokens to all experts for calibration
-        self.gate.topk = self.n_routed_experts
-        self.gate.topk_groups = self.gate.n_groups
-        super().forward(x)
-        # Restore the original topk and topk_groups
-        self.gate.topk = self._original_topk
-        self.gate.topk_groups = self._original_topk_groups
-
         return super().forward(x)

This sets the all-expert routing once during setup and keeps it throughout calibration, eliminating the double forward pass.
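
If the original routing should also be restored once calibration finishes, another option (a sketch only, using the attribute names already present in this file: `deekseep_model.MoE`, `gate.topk`, `gate.topk_groups`, `gate.n_groups`, `n_routed_experts`) is to scope the all-expert routing around the calibration loop instead of toggling it inside forward():

```python
from contextlib import contextmanager

@contextmanager
def route_to_all_experts(model):
    """Temporarily route every token to every expert in all MoE layers."""
    saved = []
    for module in model.modules():
        if isinstance(module, deekseep_model.MoE):
            saved.append((module.gate, module.gate.topk, module.gate.topk_groups))
            module.gate.topk = module.n_routed_experts
            module.gate.topk_groups = module.gate.n_groups
    try:
        yield
    finally:
        # Restore the original top-k routing after calibration
        for gate, topk, topk_groups in saved:
            gate.topk = topk
            gate.topk_groups = topk_groups

# Hypothetical usage around the calibration step:
# with route_to_all_experts(model):
#     model = mtq.quantize(model, quant_cfg, forward_loop)
```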

mtq.register(
original_cls=deekseep_model.RowParallelLinear,
quantized_cls=QuantRowParallelLinear,
@@ -196,6 +228,7 @@
)
mtq.register(original_cls=deekseep_model.Linear, quantized_cls=QuantLinear)
mtq.register(original_cls=deekseep_model.MLA, quantized_cls=QuantMLA)
mtq.register(original_cls=deekseep_model.MoE, quantized_cls=CalibMoe)


def load_deepseek_model(model_config: str, model_path: str, batch_size: int):
@@ -243,10 +276,10 @@ def ptq(
## create dataset
device = next(model.parameters()).device
calib_dataset = get_dataset_dataloader(
dataset_name="cnn_dailymail",
dataset_name=["cnn_dailymail", "nemotron-post-training-dataset-v2"],
tokenizer=tokenizer,
batch_size=batch_size,
num_samples=calib_size,
num_samples=[calib_size, calib_size],
device=device,
)

@@ -307,6 +340,13 @@ def state_dict_filter(state_dict):
os.path.join(output_path, f"amax_dict_rank{rank}-mp{world_size}.pt"),
)

# if rank == 0:
# with open("expert_activation_counts.txt", "w") as f:
# for name, module in model.named_modules():
# if isinstance(module, deekseep_model.MoE):
# counts = module.activated_expert_counts()
# f.writelines(f"{name}: {count}\n" for count in counts)

quant_config = get_quant_config(model.named_modules())

if enable_fp8_kvcache:
4 changes: 3 additions & 1 deletion examples/deepseek/quantize_fp8_to_nvfp4.sh
@@ -78,7 +78,9 @@ fi

# Copy miscellaneous files to the quantized checkpoint
mkdir -p $FP4_PATH
cp $FP8_HF_PATH/*.json $FP8_HF_PATH/*.py $FP4_PATH/
cp $FP8_HF_PATH/*.json $FP4_PATH/
cp $FP8_HF_PATH/*.py $FP4_PATH/ || true
cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true

# Run the quantization command
echo "Running quantization..."
8 changes: 2 additions & 6 deletions examples/deepseek/quantize_to_nvfp4.py
@@ -41,19 +41,15 @@
import glob
import json
import os
import sys
from pathlib import Path
from typing import Any

import torch
from ds_kernel import weight_dequant
from safetensors.torch import load_file, save_file
from tqdm import tqdm

from modelopt.torch.quantization.qtensor import NVFP4QTensor

sys.path.append(str(Path(__file__).resolve().parent / "DeepSeek-V3/inference"))
from kernel import weight_dequant


def _remap_key(key_dict: dict[str, Any]):
# renaming the module to match HF modeling
@@ -155,7 +151,7 @@ def convert_fp8_ckpt_to_nvfp4(
per_layer_quant_config,
):
def amax_to_nvfp4_scaling_factor_2(amax):
return amax.float() / 6.0 / 448.0
return amax.float() / (6.0 * 448.0)

def amax_to_fp8_scaling_factor(amax):
return amax.float() / 448.0
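
As context for these constants (an editor's reading, not stated in the diff): 448.0 is the maximum representable value of FP8 E4M3 and 6.0 the maximum of the NVFP4 (E2M1) element format, so the second-level NVFP4 scale divides amax by both limits. A small numeric sketch:

```python
# Hypothetical calibrated amax for one weight tensor
amax = 3.5

fp8_scale = amax / 448.0              # maps amax to the FP8 E4M3 maximum (448)
nvfp4_scale_2 = amax / (6.0 * 448.0)  # second-level NVFP4 scale: element max 6 (E2M1)
                                      # times the FP8-encoded per-block scale max 448

print(fp8_scale, nvfp4_scale_2)  # 0.0078125 ~0.0013021
```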
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -65,7 +65,7 @@ extend-ignore = [
"*/_[a-zA-Z]*" = ["D"] # Private packages (_abc/*.py) or modules (_xyz.py)
"*.ipynb" = ["D", "E501"] # Ignore missing docstrings or line length for Jupyter notebooks
"modelopt/torch/quantization/triton/*" = ["N803", "N806", "E731"] # triton style

"examples/deepseek/ds_kernel.py" = ["N803", "N806", "E731"] # triton style

[tool.ruff.lint.pycodestyle]
max-line-length = 120 # Line length limit for comments and docstrings