Commit 725e6ca

Merge branch 'main' into dev-gagam-force-input-nodes

Signed-off-by: Gal Hubara-Agam <[email protected]>
2 parents: d71399d + 4df4091
119 files changed (+5088, -1313 lines)

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@ modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
 modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
 modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
+modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
@@ -50,4 +51,5 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/vlm_ptq @NVIDIA/modelopt-examples-vlm-codeowners
+/examples/vllm_serve @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/windows @NVIDIA/modelopt-windows-codeowners

.github/workflows/example_tests.yml

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ jobs:
   image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2
   env:
     PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
+    HF_TOKEN: ${{ secrets.HF_TOKEN }}
   steps: &example_steps
     - uses: actions/checkout@v4
     - uses: nv-gha-runners/setup-proxy-cache@main

.github/workflows/gpu_tests.yml

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ jobs:
   env:
     GIT_DEPTH: 1000 # For correct version for tests/gpu/torch/quantization/plugins/test_megatron.py
     PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
+    HF_TOKEN: ${{ secrets.HF_TOKEN }}
   steps: &gpu_steps
     - uses: actions/checkout@v4
     - uses: nv-gha-runners/setup-proxy-cache@main

.gitlab/tests.yml

Lines changed: 1 addition & 9 deletions
@@ -54,20 +54,12 @@ example-torch:
   timeout: 30m
   parallel:
     matrix:
-      - EXAMPLE: [llm_distill, llm_sparsity, speculative_decoding]
+      - EXAMPLE: [llm_distill, llm_qat, llm_sparsity, speculative_decoding]
   script:
     - pip install ".[hf,dev-test]"
     - find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
     - pytest -s tests/examples/$EXAMPLE

-# TODO: Fix llm_qat test hang in GitLab CI
-example-failing:
-  extends: example-torch
-  allow_failure: true
-  parallel:
-    matrix:
-      - EXAMPLE: [llm_qat]
-
 example-trtllm:
   extends: example-torch
   timeout: 60m

.vscode/settings.json

Lines changed: 3 additions & 1 deletion
@@ -42,5 +42,7 @@
   "evenBetterToml.schema.enabled": false, // disable toml/json schema since we have custom fields
   "python.analysis.extraPaths": [
     "./tests/" // add tests to python path just like pytest does in pyproject.toml
-  ]
+  ],
+  "git.alwaysSignOff": true,
+  "git.enableCommitSigning": true,
 }

CHANGELOG.rst

Lines changed: 10 additions & 2 deletions
@@ -1,17 +1,25 @@
 Model Optimizer Changelog (Linux)
 =================================

-0.39 (2025-10-xx)
+0.39 (2025-11-xx)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**

 **New Features**

 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
+- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
+- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to a mix of ``cnn_dailymail`` and ``nemotron-post-training-dataset-v2`` if no dataset is specified.
+- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
 - Add flag ``nodes_to_include`` and ``op_types_to_include`` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.

-0.37 (2025-09-xx)
+**Documentation**
+
+- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.
+
+0.37 (2025-10-08)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**

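The LoRA changelog entry above names a single new call, ``modelopt.torch.peft.update_model(model, LORA_CFG)``. Below is a minimal, hypothetical sketch of how that call might be wired up; only the dotted call itself comes from this commit, while the config keys and the model construction are illustrative assumptions.

# Hedged sketch (not part of this commit): enabling the new LoRA mode on an MCore model.
import modelopt.torch.peft as mtpeft

# Hypothetical adapter config; consult the peft submodule for the real LORA_CFG schema.
LORA_CFG = {
    "adapter_type": "lora",
    "adapter_cfg": {"rank": 32},
}

model = build_mcore_gpt_model()  # placeholder for an existing Megatron-Core model
mtpeft.update_model(model, LORA_CFG)  # apply LoRA mode, as named in the changelog entry
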
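The AutoCast entry above adds ``nodes_to_include`` and ``op_types_to_include`` flags. A hedged sketch of how they might be passed through AutoCast's Python entry point follows; the function name ``convert_to_mixed_precision`` and its other keyword arguments are assumptions based on the existing AutoCast API, and the node name is made up. Only the two flag names come from this changelog entry.

# Hedged sketch (not part of this commit): force-including nodes/op types in low precision.
from modelopt.onnx.autocast import convert_to_mixed_precision  # assumed entry point

converted_model = convert_to_mixed_precision(
    onnx_path="model.onnx",                       # assumed keyword name
    low_precision_type="fp16",                    # assumed keyword name
    nodes_to_include=["/decoder/attn/MatMul_1"],  # hypothetical node name
    op_types_to_include=["LayerNormalization"],   # force these op types to low precision
)
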
README.md

Lines changed: 2 additions & 1 deletion
@@ -15,7 +15,7 @@

 ______________________________________________________________________

-The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.
+**NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.

 **[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.

@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)

docs/source/deployment/1_tensorrt_llm.rst

Lines changed: 5 additions & 2 deletions
@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================

+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::

-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.


+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.

 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================

-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
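The deprecation notice added above points users to the unified HF export API. A short, hedged sketch of that path is shown below; ``export_hf_checkpoint`` is assumed here to be the unified entry point (see the unified HF export page for the authoritative signature), and the quantized model is a placeholder.

# Hedged sketch (not part of this commit): exporting via the unified HF checkpoint API.
from modelopt.torch.export import export_hf_checkpoint  # assumed entry point

quantized_model = ...  # placeholder: a ModelOpt-quantized Hugging Face model
export_hf_checkpoint(quantized_model, export_dir="exported_ckpt")
# The exported directory can then be deployed with TensorRT-LLM, vLLM, or SGLang.
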

docs/source/guides/3_pruning.rst

Lines changed: 2 additions & 2 deletions
@@ -190,7 +190,7 @@ Following info will be printed before the pruning process is started:
 ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
 ┃ Constraint   ┃ min          ┃ centroid     ┃ max          ┃ max/min ratio ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
-│ flops        │ 274.34M      │ 1.28G        │ 4.59G        │ 16.73         │
+│ flops        │ 548.68M      │ 2.56G        │ 9.18G        │ 16.73         │
 │ params       │ 2.70M        │ 9.75M        │ 25.50M       │ 9.43          │
 └──────────────┴──────────────┴──────────────┴──────────────┴───────────────┘
@@ -199,7 +199,7 @@ Following info will be printed before the pruning process is started:
 ┃              ┃              ┃ Satisfiable  ┃
 ┃ Constraint   ┃ Upper Bound  ┃ Upper Bound  ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-│ flops        │ 2.75G        │ True         │
+│ flops        │ 5.50G        │ True         │
 └──────────────┴──────────────┴──────────────┘

docs/source/guides/7_nas.rst

Lines changed: 15 additions & 6 deletions
@@ -109,8 +109,8 @@ the search space together with your deployment constraints using

     import torch

-    # Looking for a subnet with at most 2 GFLOPs
-    constraints = {"flops": 2.0e9}
+    # Looking for a subnet with at most 4 GFLOPs
+    constraints = {"flops": 4.0e9}

     # Measure FLOPs against dummy_input
     # Can be provided as a single tensor or tuple of input args to the model.
@@ -129,7 +129,7 @@ Following info will be printed:
 ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
 ┃ Constraint   ┃ min          ┃ centroid     ┃ max          ┃ max/min ratio ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
-│ flops        │ 487.92M      │ 1.84G        │ 4.59G        │ 9.40          │
+│ flops        │ 975.84M      │ 3.68G        │ 9.18G        │ 9.40          │
 │ params       │ 4.84M        │ 12.33M       │ 25.50M       │ 5.27          │
 └──────────────┴──────────────┴──────────────┴──────────────┴───────────────┘
@@ -138,7 +138,7 @@ Following info will be printed:
 ┃              ┃              ┃ Satisfiable  ┃
 ┃ Constraint   ┃ Upper Bound  ┃ Upper Bound  ┃
 ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-│ flops        │ 2.00G        │ True         │
+│ flops        │ 4.00G        │ True         │
 └──────────────┴──────────────┴──────────────┘

 Search Space Summary:
@@ -242,8 +242,8 @@ Below is an example of running search on an AutoNAS converted and trained model.
     # Specify the sample input including target data shape for FLOPs calculation.
     dummy_input = torch.randn(1, 3, 224, 224)

-    # Looking for a subnet with at most 2 GFLOPs
-    search_constraints = {"flops": 2.0e9}
+    # Looking for a subnet with at most 4 GFLOPs
+    search_constraints = {"flops": 4.0e9}

     # search_res (dict) contains state_dict / stats of the searcher
     searched_model, search_res = mtn.search(
@@ -635,3 +635,12 @@ The difference between NAS and pruning is summarized below.
   increased training time.
 - May provide similar performance to NAS in particular applications, however, usually exhibits
   worse performance due to the limited search space and training time.
+
+
+[Advanced] Adding a new NAS/Prune Algorithm
+===========================================
+
+* Please refer to this `template <https://github.com/NVIDIA/TensorRT-Model-Optimizer/compare/template/new-nas-mode>`_
+  for adding a new NAS algorithm.
+* Please refer to `mcore_minitron.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/prune/plugins/mcore_minitron.py>`_
+  for an actual example of adding Minitron Pruning algorithm.
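The 7_nas.rst edits above double the example FLOPs budget from 2 GFLOPs to 4 GFLOPs. The hedged sketch below assembles the pieces visible in this diff into one snippet; the model, the score function, and the exact keyword layout of ``mtn.search`` are placeholders/assumptions, while the constraint dict, the dummy input, and the (truncated) ``mtn.search(`` call are taken from the diff.

# Hedged sketch (not part of this commit): searching under the updated 4-GFLOP budget.
import torch
import modelopt.torch.nas as mtn

search_constraints = {"flops": 4.0e9}        # the new 4-GFLOP budget from the guide
dummy_input = torch.randn(1, 3, 224, 224)    # sample input used for FLOPs measurement

model = ...       # placeholder: an AutoNAS-converted and trained model
score_func = ...  # placeholder: validation-score callback used to rank candidate subnets

searched_model, search_res = mtn.search(     # keyword layout below is an assumption
    model,
    constraints=search_constraints,
    dummy_input=dummy_input,
    config={"score_func": score_func},
)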
