Conversation

tlrmchlsmth
Contributor

Build a CUDA 12.6 image instead of 12.1, to match what PyTorch ships (see, for instance, https://pytorch.org/get-started/locally/).

I'm running into this problem while trying to get the Docker builds green in this PR: vllm-project/vllm#20136
https://buildkite.com/vllm/ci/builds/22761#0197ae37-4295-48c9-b5c9-fc843c84f497/124-10693

Signed-off-by: Tyler Michael Smith <[email protected]>
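
For reference, the pairing being matched here can be checked directly: torch.version.cuda reports the CUDA toolkit a given PyTorch wheel was built against. A minimal sketch (not part of this PR):

# Sketch: print the installed PyTorch version and the CUDA toolkit it was
# built against, to confirm which CUDA version PyTorch ships with.
python -c "import torch; print(torch.__version__, torch.version.cuda)"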
zou3519 commented Jun 30, 2025

Btw, we've run into issues building vLLM locally with CUDA 12.6. There's an issue somewhere with FlashAttention not building well with CUDA 12.6. If you want to pick another version, PyTorch supports 12.8 too (and 12.8 supports Blackwell, but I'm not too sure how that interaction goes).

tlrmchlsmth
Contributor Author

@zou3519 thanks for that info.

The build image already defaults to 12.8.1; I'm just trying to get the rest of the builds green. They're chronically red, since they aren't enabled by default.

If 12.6 is problematic as well, we could simply delete the 12.1 build; I think that's a fair option to consider.
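
To spell that out (a sketch, not verbatim from the repo): docker/Dockerfile presumably declares something like ARG CUDA_VERSION=12.8.1 near the top, and each per-version CI build overrides it, e.g.:

# Sketch: the image defaults to CUDA 12.8.1 via its build arg; the 12.1 CI
# build would override it roughly like this (exact patch level assumed).
docker build --file docker/Dockerfile --build-arg CUDA_VERSION=12.1.1 --target test .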

tlrmchlsmth
Contributor Author

Running the following (on vllm-project/vllm#20324):

docker build --file docker/Dockerfile --build-arg max_jobs=16 --build-arg CUDA_VERSION=12.6.0 --target test --progress plain .

I am seeing:

#30 8641.4 [170/336] Building CUDA object vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o
#30 8641.4 FAILED: vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o
#30 8641.4 ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_PYBIND -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_vllm_fa3_C -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -D_vllm_fa3_C_EXPORTS -I/workspace/.deps/vllm-flash-attn-src/csrc -I/workspace/.deps/vllm-flash-attn-src/hopper -I/workspace/.deps/vllm-flash-attn-src/csrc/common -I/workspace/.deps/vllm-flash-attn-src/csrc/cutlass/include -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -g -DNDEBUG -std=c++17 -Xcompiler=-fPIC --expt-relaxed-constexpr -DENABLE_FP8 --threads=8 --expt-extended-lambda --use_fast_math -DCUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1 -D_GLIBCXX_USE_CXX11_ABI=1 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_80,code=sm_80 -MD -MT vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o -MF vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o.d -x cu -c /workspace/.deps/vllm-flash-attn-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu -o vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o
#30 8641.4 ptxas info    : (C7517) warpgroup.wait is injected in around line 452106 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi128EEENS7_ILi80EEENS7_ILi256EEEEEELi256ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb0ELb1ELb1ELb1ELb0ELb0ELb1ELb1ELb1ELb1ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_SC_SB_EEES9_SE_SG_Li256ELb1ELb1ELb1ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi128ELi256ELi128ELb1ELb1ELb1EEEEEEEEEvNT_6ParamsE'
...
#30 8641.4 ptxas info    : (C7517) warpgroup.wait is injected in around line 128681 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi192EEENS7_ILi128EEENS7_ILi96EEEEEELi96ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb0ELb1ELb1ELb1ELb0ELb0ELb0ELb1ELb1ELb1ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_SC_SB_EEES9_SE_SG_Li384ELb1ELb1ELb1ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi192ELi384ELi128ELb1ELb1ELb1EEEEEEEEEvNT_6ParamsE'
#30 8641.4 ptxas info    : (C7518) Potential Performance Loss: wgmma.mma_async instructions are serialized due to program dependence on compiler-inserted WG.DP in divergent path in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi192EEENS7_ILi128EEENS7_ILi96EEEEEELi96ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb0ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb1ELb1ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_SC_SB_EEES9_SE_SG_Li384ELb1ELb1ELb1ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi192ELi384ELi128ELb1ELb1ELb1EEEEEEEEEvNT_6ParamsE'
#30 8641.4 Segmentation fault (core dumped)

Let's land #117 instead of this one.

tlrmchlsmth closed this Jul 1, 2025