Labels: Customized kernels<NV>, bug
Description
System Info
Testing gpt-oss-120b on RTX 6000 PRO
Version Information:
- Branch: main
- Commit: 2b8722b67
cd TensorRT-LLM/docker
make release_build CUDA_ARCHS="120-real"
docker run -it --rm -v /mnt:/mnt/ --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --network host docker.io/tensorrt_llm/release:latest bash
Inside the container:
CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /mnt/gpt-oss-120b --host 0.0.0.0 --port 4997
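As a side note, since the image is built only for CUDA_ARCHS="120-real", it may be worth confirming that the visible GPUs actually report compute capability 12.0; a quick sanity check (assuming the driver is new enough to expose the compute_cap query field) is:

# Should list compute_cap 12.0 for the RTX 6000 PRO cards; any other value would mean
# the container has no kernels compiled for these GPUs.
nvidia-smi --query-gpu=index,name,compute_cap --format=csv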
Last lines of the log:
[10/07/2025-18:33:41] [TRT-LLM] [I] Finished prefetching /mnt/gpt-oss-120b/model-00002-of-00014.safetensors.
Loading safetensors weights in parallel: 100%|██████████| 15/15 [00:00<00:00, 765.01it/s]
Loading weights: 100%|██████████| 801/801 [00:15<00:00, 51.07it/s]
Model init total -- 20.72s
[10/07/2025-18:33:58] [TRT-LLM] [I] max_seq_len is not specified, using inferred value 131072
[10/07/2025-18:33:58] [TRT-LLM] [I] Using Sampler: TorchSampler
[10/07/2025-18:33:58] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.8999999761581421 and 8224 with free memory 7.2676849365234375 of total memory 23.742691040039062, respectively). The smaller value will be used.
[10/07/2025-18:33:58] [TRT-LLM] [W] Attention window size 131073 exceeds upper bound 8224 for available blocks. Reducing to 8224.
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted max_attention_window_vec to [8224]
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted window size 131073 to 8224 in blocks_per_window
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted max_seq_len to 8224
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=8224], tokens per block=32, primary blocks=257, secondary blocks=0
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.56 GiB for max tokens in paged KV cache (8224).
[10/07/2025-18:33:58] [TRT-LLM] [I] max_seq_len=8224, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[10/07/2025-18:33:58] [TRT-LLM] [I] cache_transceiver is disabled
[10/07/2025-18:33:58] [TRT-LLM] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 101016576 bytes
[10/07/2025-18:34:08] [TRT-LLM] [I] [Autotuner] Cache size after warmup is 28
[10/07/2025-18:34:08] [TRT-LLM] [I] [Autotuner] Autotuning process ends
[10/07/2025-18:34:08] [TRT-LLM] [I] Creating CUDA graph instances for 34 batch sizes.
[10/07/2025-18:34:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=128, draft_len=0
GPUassert: invalid argument /src/tensorrt_llm/cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_cuda.cu 64
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
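For reference, the GPUassert line follows the common CUDA error-check pattern: it prints the cudaGetErrorString() text ("invalid argument", i.e. cudaErrorInvalidArgument) followed by the file and line of the failing call, so the failure is raised from tinygemm2_cuda.cu line 64 during CUDA graph warmup. A rerun with synchronous launches and more verbose logging may narrow it down further (assuming this build still honors the TLLM_LOG_LEVEL environment variable):

# Diagnostics only: force synchronous kernel launches so the failing launch is reported
# immediately, and raise TensorRT-LLM log verbosity.
export CUDA_LAUNCH_BLOCKING=1
export TLLM_LOG_LEVEL=DEBUG
CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /mnt/gpt-oss-120b --host 0.0.0.0 --port 4997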
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Same build commands, serve command, and log output as shown in the System Info section above.
Expected behavior
The server should start and serve requests.
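That is, trtllm-serve should finish warmup and answer requests on its OpenAI-compatible HTTP endpoint. A minimal smoke test once the server is up might look like the following (assuming the default /v1/chat/completions route; the exact model name to pass can be read back from GET /v1/models):

# Hypothetical smoke test against the port used above.
curl -s http://localhost:4997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'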
Actual behavior
Startup aborts during CUDA graph warmup with the GPUassert shown above.
Additional notes
Testing #7937, which was merged into the main branch.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.