A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

joe0731/TensorRT-Model-Optimizer

NVIDIA TensorRT Model Optimizer

Documentation | Roadmap


NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.

[Input] Model Optimizer currently accepts Hugging Face, PyTorch, or ONNX models as input.

[Optimize] Model Optimizer provides Python APIs that let users compose the above model optimization techniques and export an optimized quantized checkpoint. It is also integrated with NVIDIA NeMo, Megatron-LM, and Hugging Face Accelerate for inference optimization techniques that require training.

[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM.

Install

To install stable release packages for Model Optimizer with pip from PyPI:

pip install -U "nvidia-modelopt[all]"

To install from source in editable mode with all development dependencies or to use the latest features, run:

# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer

pip install -e ".[dev]"

You can also use the TensorRT-LLM docker images directly (e.g., nvcr.io/nvidia/tensorrt-llm/release:<version>), which have Model Optimizer pre-installed. Make sure to upgrade Model Optimizer to the latest version using pip as described above. Visit our installation guide for finer-grained control over installed dependencies, or for alternative docker images and environment variables to set up.
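Before upgrading, it can help to confirm which version, if any, is already present in the environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of Model Optimizer, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "nvidia-modelopt") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not a Model Optimizer API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("nvidia-modelopt is not installed in this environment.")
    else:
        print(f"nvidia-modelopt {found} is installed.")
```

Running this inside a TensorRT-LLM container should report the pre-installed version, which you can compare against the latest release on PyPI before deciding whether to upgrade.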

Techniques

| Technique | Description | Examples | Docs |
| --- | --- | --- | --- |
| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | [LLMs] [diffusers] [VLMs] [onnx] [windows] | [docs] |
| Quantization Aware Training | Refine accuracy even further with a few training steps! | [NeMo] [Hugging Face] | [docs] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | [PyTorch] | [docs] |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | [NeMo] [Hugging Face] | [docs] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | [Megatron] [Hugging Face] | [docs] |
| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | [PyTorch] | [docs] |
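To make the 2x-4x figure for post-training quantization concrete, here is a small conceptual sketch in pure Python. It is not Model Optimizer's implementation or API: the function names are hypothetical, the scheme shown (symmetric per-tensor INT8) is just one common choice, and the size arithmetic assumes 16-bit source weights with 8-bit or 4-bit targets, ignoring the small overhead of storing scales.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Illustrative only; real libraries typically calibrate scales on data.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


def compression_ratio(source_bits: int, target_bits: int) -> float:
    """Ratio of storage per weight, ignoring scale/metadata overhead."""
    return source_bits / target_bits


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("quantized:", quantized)
    print("worst-case round-trip error:", worst)
    print("16-bit -> 8-bit ratio:", compression_ratio(16, 8))
    print("16-bit -> 4-bit ratio:", compression_ratio(16, 4))
```

The round-trip error per weight is bounded by half the scale, which is why quality can be largely preserved when the weight distribution has no extreme outliers.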

Model Support Matrix

| Model Type | Support Matrix |
| --- | --- |
| LLM Quantization | View Support Matrix |
| Diffusers Quantization | View Support Matrix |
| VLM Quantization | View Support Matrix |
| ONNX Quantization | View Support Matrix |
| Windows Quantization | View Support Matrix |
| Quantization Aware Training | View Support Matrix |
| Pruning | View Support Matrix |
| Distillation | View Support Matrix |
| Speculative Decoding | View Support Matrix |

Contributing

Model Optimizer is now open source! We welcome any feedback, feature requests and PRs. Please read our Contributing guidelines for details on how to contribute to this project.

Happy optimizing!
