Commit 725e6ca

Merge branch 'main' into dev-gagam-force-input-nodes

Signed-off-by: Gal Hubara-Agam <[email protected]>
2 parents: d71399d + 4df4091
119 files changed (+5088, -1313 lines)

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@ modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
 modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
 modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
+modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
@@ -50,4 +51,5 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/vlm_ptq @NVIDIA/modelopt-examples-vlm-codeowners
+/examples/vllm_serve @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/windows @NVIDIA/modelopt-windows-codeowners

.github/workflows/example_tests.yml

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ jobs:
   image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2
   env:
     PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
+    HF_TOKEN: ${{ secrets.HF_TOKEN }}
   steps: &example_steps
     - uses: actions/checkout@v4
     - uses: nv-gha-runners/setup-proxy-cache@main

.github/workflows/gpu_tests.yml

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ jobs:
   env:
     GIT_DEPTH: 1000 # For correct version for tests/gpu/torch/quantization/plugins/test_megatron.py
     PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
+    HF_TOKEN: ${{ secrets.HF_TOKEN }}
   steps: &gpu_steps
     - uses: actions/checkout@v4
     - uses: nv-gha-runners/setup-proxy-cache@main

.gitlab/tests.yml

Lines changed: 1 addition & 9 deletions
@@ -54,20 +54,12 @@ example-torch:
   timeout: 30m
   parallel:
     matrix:
-      - EXAMPLE: [llm_distill, llm_sparsity, speculative_decoding]
+      - EXAMPLE: [llm_distill, llm_qat, llm_sparsity, speculative_decoding]
   script:
     - pip install ".[hf,dev-test]"
     - find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
     - pytest -s tests/examples/$EXAMPLE

-# TODO: Fix llm_qat test hang in GitLab CI
-example-failing:
-  extends: example-torch
-  allow_failure: true
-  parallel:
-    matrix:
-      - EXAMPLE: [llm_qat]
-
 example-trtllm:
   extends: example-torch
   timeout: 60m

.vscode/settings.json

Lines changed: 3 additions & 1 deletion
@@ -42,5 +42,7 @@
   "evenBetterToml.schema.enabled": false, // disable toml/json schema since we have custom fields
   "python.analysis.extraPaths": [
     "./tests/" // add tests to python path just like pytest does in pyproject.toml
-  ]
+  ],
+  "git.alwaysSignOff": true,
+  "git.enableCommitSigning": true,
 }

CHANGELOG.rst

Lines changed: 10 additions & 2 deletions
@@ -1,17 +1,25 @@
 Model Optimizer Changelog (Linux)
 =================================

-0.39 (2025-10-xx)
+0.39 (2025-11-xx)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**

 **New Features**

 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
+- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
+- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to a mix of ``cnn_dailymail`` and ``nemotron-post-training-dataset-v2`` if no dataset is specified.
+- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
 - Add flag ``nodes_to_include`` and ``op_types_to_include`` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.

-0.37 (2025-09-xx)
+**Documentation**
+
+- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.
+
+0.37 (2025-10-08)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**

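The LoRA changelog entry above names a single new call, ``modelopt.torch.peft.update_model(model, LORA_CFG)``. Below is a minimal, hypothetical sketch of how that call might be wired up; only the dotted call itself comes from this commit, while the config keys and the model construction are illustrative assumptions.

# Hedged sketch (not part of this commit): enabling the new LoRA mode on an MCore model.
import modelopt.torch.peft as mtpeft

# Hypothetical adapter config; consult the peft submodule for the real LORA_CFG schema.
LORA_CFG = {
    "adapter_type": "lora",
    "adapter_cfg": {"rank": 32},
}

model = build_mcore_gpt_model()  # placeholder for an existing Megatron-Core model
mtpeft.update_model(model, LORA_CFG)  # apply LoRA mode, as named in the changelog entry
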
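The AutoCast entry above adds ``nodes_to_include`` and ``op_types_to_include`` flags. A hedged sketch of how they might be passed through AutoCast's Python entry point follows; the function name ``convert_to_mixed_precision`` and its other keyword arguments are assumptions based on the existing AutoCast API, and the node name is made up. Only the two flag names come from this changelog entry.

# Hedged sketch (not part of this commit): force-including nodes/op types in low precision.
from modelopt.onnx.autocast import convert_to_mixed_precision  # assumed entry point

converted_model = convert_to_mixed_precision(
    onnx_path="model.onnx",                       # assumed keyword name
    low_precision_type="fp16",                    # assumed keyword name
    nodes_to_include=["/decoder/attn/MatMul_1"],  # hypothetical node name
    op_types_to_include=["LayerNormalization"],   # force these op types to low precision
)
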
README.md

Lines changed: 2 additions & 1 deletion
@@ -15,7 +15,7 @@

 ______________________________________________________________________

-The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.
+**NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.

 **[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.

@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)

docs/source/deployment/1_tensorrt_llm.rst

Lines changed: 5 additions & 2 deletions
@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================

+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::

-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.


+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.

 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================

-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
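The deprecation notice added above points users to the unified HF export API. A short, hedged sketch of that path is shown below; ``export_hf_checkpoint`` is assumed here to be the unified entry point (see the unified HF export page for the authoritative signature), and the quantized model is a placeholder.

# Hedged sketch (not part of this commit): exporting via the unified HF checkpoint API.
from modelopt.torch.export import export_hf_checkpoint  # assumed entry point

quantized_model = ...  # placeholder: a ModelOpt-quantized Hugging Face model
export_hf_checkpoint(quantized_model, export_dir="exported_ckpt")
# The exported directory can then be deployed with TensorRT-LLM, vLLM, or SGLang.
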

docs/source/guides/3_pruning.rst

Lines changed: 2 additions & 2 deletions
@@ -190,7 +190,7 @@ Following info will be printed before the pruning process is started:
 ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
 ┃ Constraint   ┃ min          ┃ centroid     ┃ max          ┃ max/min ratio ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
-│ flops        │ 274.34M      │ 1.28G        │ 4.59G        │ 16.73         │
+│ flops        │ 548.68M      │ 2.56G        │ 9.18G        │ 16.73         │
 │ params       │ 2.70M        │ 9.75M        │ 25.50M       │ 9.43          │
 └──────────────┴──────────────┴──────────────┴──────────────┴───────────────┘
@@ -199,7 +199,7 @@ Following info will be printed before the pruning process is started:
 ┃              ┃              ┃ Satisfiable  ┃
 ┃ Constraint   ┃ Upper Bound  ┃ Upper Bound  ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-│ flops        │ 2.75G        │ True         │
+│ flops        │ 5.50G        │ True         │
 └──────────────┴──────────────┴──────────────┘

docs/source/guides/7_nas.rst

Lines changed: 15 additions & 6 deletions
@@ -109,8 +109,8 @@ the search space together with your deployment constraints using

     import torch

-    # Looking for a subnet with at most 2 GFLOPs
-    constraints = {"flops": 2.0e9}
+    # Looking for a subnet with at most 4 GFLOPs
+    constraints = {"flops": 4.0e9}

     # Measure FLOPs against dummy_input
     # Can be provided as a single tensor or tuple of input args to the model.
@@ -129,7 +129,7 @@ Following info will be printed:
 ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
 ┃ Constraint   ┃ min          ┃ centroid     ┃ max          ┃ max/min ratio ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
-│ flops        │ 487.92M      │ 1.84G        │ 4.59G        │ 9.40          │
+│ flops        │ 975.84M      │ 3.68G        │ 9.18G        │ 9.40          │
 │ params       │ 4.84M        │ 12.33M       │ 25.50M       │ 5.27          │
 └──────────────┴──────────────┴──────────────┴──────────────┴───────────────┘
@@ -138,7 +138,7 @@ Following info will be printed:
 ┃              ┃              ┃ Satisfiable  ┃
 ┃ Constraint   ┃ Upper Bound  ┃ Upper Bound  ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-│ flops        │ 2.00G        │ True         │
+│ flops        │ 4.00G        │ True         │
 └──────────────┴──────────────┴──────────────┘

 Search Space Summary:
@@ -242,8 +242,8 @@ Below is an example of running search on an AutoNAS converted and trained model.
     # Specify the sample input including target data shape for FLOPs calculation.
     dummy_input = torch.randn(1, 3, 224, 224)

-    # Looking for a subnet with at most 2 GFLOPs
-    search_constraints = {"flops": 2.0e9}
+    # Looking for a subnet with at most 4 GFLOPs
+    search_constraints = {"flops": 4.0e9}

     # search_res (dict) contains state_dict / stats of the searcher
     searched_model, search_res = mtn.search(
@@ -635,3 +635,12 @@ The difference between NAS and pruning is summarized below.
   increased training time.
 - May provide similar performance to NAS in particular applications, however, usually exhibits
   worse performance due to the limited search space and training time.
+
+
+[Advanced] Adding a new NAS/Prune Algorithm
+===========================================
+
+* Please refer to this `template <https://github.com/NVIDIA/TensorRT-Model-Optimizer/compare/template/new-nas-mode>`_
+  for adding a new NAS algorithm.
+* Please refer to `mcore_minitron.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/prune/plugins/mcore_minitron.py>`_
+  for an actual example of adding Minitron Pruning algorithm.
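The 7_nas.rst edits above double the example FLOPs budget from 2 GFLOPs to 4 GFLOPs. The hedged sketch below assembles the pieces visible in this diff into one snippet; the model, the score function, and the exact keyword layout of ``mtn.search`` are placeholders/assumptions, while the constraint dict, the dummy input, and the (truncated) ``mtn.search(`` call are taken from the diff.

# Hedged sketch (not part of this commit): searching under the updated 4-GFLOP budget.
import torch
import modelopt.torch.nas as mtn

search_constraints = {"flops": 4.0e9}        # the new 4-GFLOP budget from the guide
dummy_input = torch.randn(1, 3, 224, 224)    # sample input used for FLOPs measurement

model = ...       # placeholder: an AutoNAS-converted and trained model
score_func = ...  # placeholder: validation-score callback used to rank candidate subnets

searched_model, search_res = mtn.search(     # keyword layout below is an assumption
    model,
    constraints=search_constraints,
    dummy_input=dummy_input,
    config={"score_func": score_func},
)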
