 <!--
-# Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2020-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -81,8 +81,8 @@ Currently, Triton requires that a specially patched version of |
 PyTorch be used with the PyTorch backend. Full sources for
 these PyTorch versions are available as Docker images from
 [NGC](https://ngc.nvidia.com). For example, the PyTorch version
-compatible with the 22.12 release of Triton is available as
-nvcr.io/nvidia/pytorch:22.12-py3.
+compatible with the 25.09 release of Triton is available as
+nvcr.io/nvidia/pytorch:25.09-py3.
 
 Copy over the LibTorch and Torchvision headers and libraries from the
 [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
@@ -246,6 +246,79 @@ complex execution modes and dynamic shapes. If not specified, all are enabled by |
 
 `ENABLE_JIT_PROFILING`
 
+### PyTorch 2.0 Models
+
+The model repository should look like:
+
+```bash
+model_repository/
+`-- model_directory
+    |-- 1
+    |   |-- model.py
+    |   `-- [model.pt]
+    `-- config.pbtxt
+```
+
+The `model.py` file contains the class definition of the PyTorch model. The
+class should extend
+[`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).
+An optional `model.pt` containing the saved
+[`state_dict`](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference)
+of the model may also be provided.
+
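+As a minimal sketch (the class name, layer, and tensor shapes below are
+illustrative), `model.py` might contain:
+
+```python
+import torch
+
+
+class MyModel(torch.nn.Module):
+    """Illustrative PyTorch 2.0 model definition served from model.py."""
+
+    def __init__(self):
+        super().__init__()
+        self.linear = torch.nn.Linear(4, 2)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.linear(x)
+```
+
+The optional `model.pt` can then be produced with standard PyTorch
+serialization, e.g. `torch.save(MyModel().state_dict(), "model.pt")`.
+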
+### TorchScript Models
+
+The model repository should look like:
+
+```bash
+model_repository/
+`-- model_directory
+    |-- 1
+    |   `-- model.pt
+    `-- config.pbtxt
+```
+
+The `model.pt` is the TorchScript model file.
+
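+As a sketch (the module and example input below are placeholders), such a
+file can be produced by tracing or scripting an existing `torch.nn.Module`:
+
+```python
+import torch
+
+# Placeholder module; substitute the model to be served.
+model = torch.nn.Linear(4, 2).eval()
+example_input = torch.randn(1, 4)
+
+# Trace the model (torch.jit.script(model) is an alternative for models
+# with data-dependent control flow), then save the TorchScript file.
+traced = torch.jit.trace(model, example_input)
+traced.save("model.pt")
+```
+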
+### Customization
+
+The following PyTorch settings may be customized by setting parameters in the
+model's `config.pbtxt`.
+
+[`torch.set_num_threads(int)`](https://pytorch.org/docs/stable/generated/torch.set_num_threads.html#torch.set_num_threads)
+
+* Key: `NUM_THREADS`
+* Value: The number of threads used for intra-op parallelism on CPU.
+
+[`torch.set_num_interop_threads(int)`](https://pytorch.org/docs/stable/generated/torch.set_num_interop_threads.html#torch.set_num_interop_threads)
+
+* Key: `NUM_INTEROP_THREADS`
+* Value: The number of threads used for inter-op parallelism (e.g., in the JIT interpreter) on CPU.
+
+[`torch.compile()` parameters](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile)
+
+* Key: `TORCH_COMPILE_OPTIONAL_PARAMETERS`
+* Value: Any of the following parameters, encoded as a JSON object.
+  * `fullgraph` (`bool`): If `True`, require the entire model to compile into a single graph; a graph break raises an error.
+  * `dynamic` (`bool`): Use dynamic shape tracing.
+  * `backend` (`str`): The backend to be used.
+  * `mode` (`str`): Can be either `"default"`, `"reduce-overhead"`, or `"max-autotune"`.
+  * `options` (`dict`): A dictionary of options to pass to the backend.
+  * `disable` (`bool`): Turn `torch.compile()` into a no-op for testing.
+
+For example:
+
+```proto
+parameters: {
+  key: "NUM_THREADS"
+  value: { string_value: "4" }
+}
+parameters: {
+  key: "TORCH_COMPILE_OPTIONAL_PARAMETERS"
+  value: { string_value: "{\"disable\": true}" }
+}
+```
+
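+For reference, the example above corresponds roughly to the following calls
+(a sketch of the behavior, not the backend's actual code; `model` stands in
+for the loaded `torch.nn.Module`):
+
+```python
+import json
+
+import torch
+
+model = torch.nn.Linear(4, 2)  # placeholder for the loaded model
+
+torch.set_num_threads(4)  # NUM_THREADS
+
+# The TORCH_COMPILE_OPTIONAL_PARAMETERS JSON is decoded and its entries are
+# passed as keyword arguments to torch.compile().
+compile_params = json.loads('{"disable": true}')
+compiled_model = torch.compile(model, **compile_params)
+```
+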
 ### Support
 
 #### Model Instance Group Kind
@@ -306,126 +379,9 @@ instance in the |
 to ensure that the model instance and the tensors used for inference are
 assigned to the same GPU device on which the model was traced.
 
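+As a sketch of device-aware tracing (the device index and module are
+illustrative; this requires a CUDA-capable GPU):
+
+```python
+import torch
+
+device = torch.device("cuda:0")
+model = torch.nn.Linear(4, 2).to(device).eval()
+example_input = torch.randn(1, 4, device=device)
+
+# Parameters traced on cuda:0 are saved with that device placement.
+traced = torch.jit.trace(model, example_input)
+traced.save("model.pt")
+```
+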
-# PyTorch 2.0 Backend \[Experimental\]
-
-> [!WARNING]
-> *This feature is subject to change and removal.*
-
-Starting from 24.01, PyTorch models can be served directly via
-[Python runtime](src/model.py). By default, Triton will use the
-[LibTorch runtime](#pytorch-libtorch-backend) for PyTorch models. To use Python
-runtime, provide the following
-[runtime setting](https://github.com/triton-inference-server/backend/blob/main/README.md#backend-shared-library)
-in the model configuration:
-
-```
-runtime: "model.py"
-```
-
-## Dependencies
+* Python functions optimizable by `torch.compile` may not be served directly
+  in the `model.py` file; they need to be enclosed in a class extending
+  [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module)
+  (see the sketch after this list).
 
-### Python backend dependency
+* Model weights cannot be shared across multiple instances on the same GPU
+  device.
 
-This feature depends on
-[Python backend](https://github.com/triton-inference-server/python_backend),
-see
-[Python-based Backends](https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md)
-for more details.
-
-### PyTorch dependency
-
-This feature will take advantage of the
-[`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile)
-optimization, make sure the
-[PyTorch 2.0+ pip package](https://pypi.org/project/torch) is available in the
-same Python environment.
-
-Alternatively, a [Python Execution Environment](#using-custom-python-execution-environments)
-with the PyTorch dependency may be used. It can be created with the
-[provided script](tools/gen_pb_exec_env.sh). The resulting
-`pb_exec_env_model.py.tar.gz` file should be placed at the same
-[backend shared library](https://github.com/triton-inference-server/backend/blob/main/README.md#backend-shared-library)
-directory as the [Python runtime](src/model.py).
-
-## Model Layout
-
-### PyTorch 2.0 models
-
-The model repository should look like:
-
-```
-model_repository/
-`-- model_directory
-    |-- 1
-    |   |-- model.py
-    |   `-- [model.pt]
-    `-- config.pbtxt
-```
-
-The `model.py` contains the class definition of the PyTorch model. The class
-should extend the
-[`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).
-The `model.pt` may be optionally provided which contains the saved
-[`state_dict`](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference)
-of the model.
-
-### TorchScript models
-
-The model repository should look like:
-
-```
-model_repository/
-`-- model_directory
-    |-- 1
-    |   `-- model.pt
-    `-- config.pbtxt
-```
-
-The `model.pt` is the TorchScript model file.
-
-## Customization
-
-The following PyTorch settings may be customized by setting parameters on the
-`config.pbtxt`.
-
-[`torch.set_num_threads(int)`](https://pytorch.org/docs/stable/generated/torch.set_num_threads.html#torch.set_num_threads)
-- Key: NUM_THREADS
-- Value: The number of threads used for intraop parallelism on CPU.
-
-[`torch.set_num_interop_threads(int)`](https://pytorch.org/docs/stable/generated/torch.set_num_interop_threads.html#torch.set_num_interop_threads)
-- Key: NUM_INTEROP_THREADS
-- Value: The number of threads used for interop parallelism (e.g. in JIT
-interpreter) on CPU.
-
-[`torch.compile()` parameters](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile)
-- Key: TORCH_COMPILE_OPTIONAL_PARAMETERS
-- Value: Any of following parameter(s) encoded as a JSON object.
-  - fullgraph (*bool*): Whether it is ok to break model into several subgraphs.
-  - dynamic (*bool*): Use dynamic shape tracing.
-  - backend (*str*): The backend to be used.
-  - mode (*str*): Can be either "default", "reduce-overhead" or "max-autotune".
-  - options (*dict*): A dictionary of options to pass to the backend.
-  - disable (*bool*): Turn `torch.compile()` into a no-op for testing.
-
-For example:
-```
-parameters: {
-  key: "NUM_THREADS"
-  value: { string_value: "4" }
-}
-parameters: {
-  key: "TORCH_COMPILE_OPTIONAL_PARAMETERS"
-  value: { string_value: "{\"disable\": true}" }
-}
-```
-
-## Limitations
-
-Following are few known limitations of this feature:
-- Python functions optimizable by `torch.compile` may not be served directly in
-the `model.py` file, they need to be enclosed by a class extending the
-[`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).
-- Model weights cannot be shared across multiple instances on the same GPU
-device.
-- When using `KIND_MODEL` as model instance kind, the default device of the
-first parameter on the model is used.
+* When using `KIND_MODEL` as the model instance kind, the default device of the model's first parameter is used.
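+
+As a sketch of the first bullet above (the function and class names are
+illustrative), a free function is wrapped in a `torch.nn.Module` subclass
+before being served:
+
+```python
+import torch
+
+
+def scale_and_shift(x: torch.Tensor) -> torch.Tensor:
+    """A plain function; it cannot be served from model.py directly."""
+    return 2.0 * x + 1.0
+
+
+class ScaleAndShift(torch.nn.Module):
+    """Wrapping the function in a torch.nn.Module subclass makes it servable."""
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return scale_and_shift(x)
+```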