
[Bug]: gpt-oss-120 tinygemm2_cuda.cu 64 - invalid argument #8179

@voipmonitor

Description

System Info

Testing gpt-oss-120b on RTX 6000 PRO

Version Information:

  • Branch: main
  • Commit: 2b8722b67

Build the release image:

cd TensorRT-LLM/docker
make release_build CUDA_ARCHS="120-real"

Run the container:

docker run -it --rm -v /mnt:/mnt/ --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --network host docker.io/tensorrt_llm/release:latest bash

Inside the container, start the server:

CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /mnt/gpt-oss-120b --host 0.0.0.0 --port 4997

Last lines of the server log:

[10/07/2025-18:33:41] [TRT-LLM] [I] Finished prefetching /mnt/gpt-oss-120b/model-00002-of-00014.safetensors.
Loading safetensors weights in parallel: 100%|██████████| 15/15 [00:00<00:00, 765.01it/s]
Loading weights: 100%|██████████| 801/801 [00:15<00:00, 51.07it/s]
Model init total -- 20.72s
[10/07/2025-18:33:58] [TRT-LLM] [I] max_seq_len is not specified, using inferred value 131072
[10/07/2025-18:33:58] [TRT-LLM] [I] Using Sampler: TorchSampler
[10/07/2025-18:33:58] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.8999999761581421 and 8224 with free memory 7.2676849365234375 of total memory 23.742691040039062, respectively). The smaller value will be used.
[10/07/2025-18:33:58] [TRT-LLM] [W] Attention window size 131073 exceeds upper bound 8224 for available blocks. Reducing to 8224.
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted max_attention_window_vec to [8224]
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted window size 131073 to 8224 in blocks_per_window
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted max_seq_len to 8224
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=8224], tokens per block=32, primary blocks=257, secondary blocks=0
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.56 GiB for max tokens in paged KV cache (8224).
[10/07/2025-18:33:58] [TRT-LLM] [I] max_seq_len=8224, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[10/07/2025-18:33:58] [TRT-LLM] [I] cache_transceiver is disabled
[10/07/2025-18:33:58] [TRT-LLM] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 101016576 bytes
[10/07/2025-18:34:08] [TRT-LLM] [I] [Autotuner] Cache size after warmup is 28
[10/07/2025-18:34:08] [TRT-LLM] [I] [Autotuner] Autotuning process ends
[10/07/2025-18:34:08] [TRT-LLM] [I] Creating CUDA graph instances for 34 batch sizes.
[10/07/2025-18:34:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=128, draft_len=0
GPUassert: invalid argument /src/tensorrt_llm/cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_cuda.cu 64
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
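
For context, the "GPUassert: <message> <file> <line>" line follows the conventional CUDA error-check pattern, in which the cudaError_t returned by a runtime call or kernel launch is printed via cudaGetErrorString and the process exits. Below is a minimal sketch of that pattern; it is illustrative only and not the actual tinygemm2_cuda.cu helper, which may differ (the gpuErrchk/gpuAssert names are mine).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Conventional CUDA error-check helper: prints "GPUassert: <error> <file> <line>"
// (the same shape as the last log line above) and aborts on failure.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(static_cast<int>(code));
    }
}

int main()
{
    // Deliberately trigger cudaErrorInvalidValue, which prints as "invalid argument":
    // cudaMemset rejects a null destination pointer with a non-zero byte count.
    gpuErrchk(cudaMemset(nullptr, 0, 1));
    return 0;
}

"invalid argument" is cudaGetErrorString(cudaErrorInvalidValue), i.e. whatever runtime call or kernel launch is wrapped at tinygemm2_cuda.cu line 64 received a parameter the driver rejected.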

Who can help?

@farazkh80

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The setup and commands are the same as in the System Info section above: build the release image with CUDA_ARCHS="120-real", start the container, and launch trtllm-serve for /mnt/gpt-oss-120b. The full server log, ending with the GPUassert at tinygemm2_cuda.cu line 64, is reproduced there.

Expected behavior

The server finishes warmup and serves requests.

actual behavior

The server crashes during CUDA graph warmup with "GPUassert: invalid argument" at tinygemm2_cuda.cu line 64, and the MPI job is aborted.
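
The failure point is the CUDA graph warmup step ("Run generation only CUDA graph warmup for batch size=128"). As a rough, illustrative sketch only (not the TensorRT-LLM implementation), graph warmup amounts to capturing the generation kernels into a graph, instantiating it, and replaying it once; an "invalid argument" returned by any call in that sequence aborts through an error check like the one shown above. This sketch assumes a CUDA 12+ toolkit for the cudaGraphInstantiate signature.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noopKernel() {}

// Local error check with the same reporting shape as the log above.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "GPUassert: %s %s %d\n",                       \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main()
{
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Capture one warmup iteration into a graph, roughly what the runtime does
    // once per CUDA-graph batch size.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal));
    noopKernel<<<1, 128, 0, stream>>>();
    CHECK(cudaGetLastError());                 // launch-configuration errors surface here
    CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiate and replay the captured graph once.
    cudaGraphExec_t graphExec;
    CHECK(cudaGraphInstantiate(&graphExec, graph, 0));
    CHECK(cudaGraphLaunch(graphExec, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaGraphExecDestroy(graphExec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    printf("graph warmup OK\n");
    return 0;
}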

additional notes

Testing PR #7937, which was merged into the main branch.
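
Since the image was built only for sm_120 (CUDA_ARCHS="120-real"), one cheap triage step, offered purely as a suggestion, is to dump the launch-related limits of device 0 and compare them against the grid/block/shared-memory configuration tinygemm2 uses on this architecture; a configuration that exceeds these limits is one common source of cudaErrorInvalidValue. The small diagnostic below uses only standard CUDA runtime calls and is not part of the repository.

#include <cstdio>
#include <cuda_runtime.h>

// Print the launch-related limits of device 0. A grid/block shape or shared-memory
// request that exceeds these limits is one common way a kernel launch fails with
// cudaErrorInvalidValue ("invalid argument").
int main()
{
    cudaDeviceProp prop{};
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device             : %s (sm_%d%d)\n", prop.name, prop.major, prop.minor);
    printf("maxThreadsPerBlock : %d\n", prop.maxThreadsPerBlock);
    printf("maxBlockDim        : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridDim         : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("sharedMemPerBlock  : %zu bytes (opt-in max: %zu)\n",
           prop.sharedMemPerBlock, prop.sharedMemPerBlockOptin);
    printf("multiProcessorCount: %d\n", prop.multiProcessorCount);
    return 0;
}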

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

  • Customized kernels<NV>
  • bug
