Labels: Customized kernels<NV>, bug
Description
System Info
Testing gpt-oss-120b on RTX 6000 PRO
Version Information:
- Branch: main
- Commit: 2b8722b67
cd TensorRT-LLM/docker
make release_build CUDA_ARCHS="120-real"
docker run -it --rm -v /mnt:/mnt/ --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --network host docker.io/tensorrt_llm/release:latest bash
Inside the container:
CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /mnt/gpt-oss-120b --host 0.0.0.0 --port 4997
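As a side note, since the image is built only for CUDA_ARCHS="120-real", it may be worth confirming that the visible GPUs actually report compute capability 12.0; a quick sanity check (assuming the driver is new enough to expose the compute_cap query field) is:

# Should list compute_cap 12.0 for the RTX 6000 PRO cards; any other value would mean
# the container has no kernels compiled for these GPUs.
nvidia-smi --query-gpu=index,name,compute_cap --format=csv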
Last lines of the log:
[10/07/2025-18:33:41] [TRT-LLM] [I] Finished prefetching /mnt/gpt-oss-120b/model-00002-of-00014.safetensors.
Loading safetensors weights in parallel: 100%|██████████| 15/15 [00:00<00:00, 765.01it/s]
Loading weights: 100%|██████████| 801/801 [00:15<00:00, 51.07it/s]
Model init total -- 20.72s
[10/07/2025-18:33:58] [TRT-LLM] [I] max_seq_len is not specified, using inferred value 131072
[10/07/2025-18:33:58] [TRT-LLM] [I] Using Sampler: TorchSampler
[10/07/2025-18:33:58] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.8999999761581421 and 8224 with free memory 7.2676849365234375 of total memory 23.742691040039062, respectively). The smaller value will be used.
[10/07/2025-18:33:58] [TRT-LLM] [W] Attention window size 131073 exceeds upper bound 8224 for available blocks. Reducing to 8224.
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted max_attention_window_vec to [8224]
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted window size 131073 to 8224 in blocks_per_window
[10/07/2025-18:33:58] [TRT-LLM] [W] Adjusted max_seq_len to 8224
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=8224], tokens per block=32, primary blocks=257, secondary blocks=0
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.56 GiB for max tokens in paged KV cache (8224).
[10/07/2025-18:33:58] [TRT-LLM] [I] max_seq_len=8224, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[10/07/2025-18:33:58] [TRT-LLM] [I] cache_transceiver is disabled
[10/07/2025-18:33:58] [TRT-LLM] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 101016576 bytes
[10/07/2025-18:34:08] [TRT-LLM] [I] [Autotuner] Cache size after warmup is 28
[10/07/2025-18:34:08] [TRT-LLM] [I] [Autotuner] Autotuning process ends
[10/07/2025-18:34:08] [TRT-LLM] [I] Creating CUDA graph instances for 34 batch sizes.
[10/07/2025-18:34:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=128, draft_len=0
GPUassert: invalid argument /src/tensorrt_llm/cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_cuda.cu 64
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
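For reference, the GPUassert line follows the common CUDA error-check pattern: it prints the cudaGetErrorString() text ("invalid argument", i.e. cudaErrorInvalidArgument) followed by the file and line of the failing call, so the failure is raised from tinygemm2_cuda.cu line 64 during CUDA graph warmup. A rerun with synchronous launches and more verbose logging may narrow it down further (assuming this build still honors the TLLM_LOG_LEVEL environment variable):

# Diagnostics only: force synchronous kernel launches so the failing launch is reported
# immediately, and raise TensorRT-LLM log verbosity.
export CUDA_LAUNCH_BLOCKING=1
export TLLM_LOG_LEVEL=DEBUG
CUDA_VISIBLE_DEVICES=0,1 trtllm-serve /mnt/gpt-oss-120b --host 0.0.0.0 --port 4997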
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Same build commands, serve command, and log output as shown in the System Info section above.
Expected behavior
The server should start and serve requests.
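That is, trtllm-serve should finish warmup and answer requests on its OpenAI-compatible HTTP endpoint. A minimal smoke test once the server is up might look like the following (assuming the default /v1/chat/completions route; the exact model name to pass can be read back from GET /v1/models):

# Hypothetical smoke test against the port used above.
curl -s http://localhost:4997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'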
Actual behavior
Startup aborts during CUDA graph warmup with the GPUassert shown above.
Additional notes
Testing #7937, which was merged into the main branch.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.