Conversation

tlrmchlsmth
Contributor

Build a CUDA 12.6 image instead of 12.1, to match what PyTorch ships (see, for instance, https://pytorch.org/get-started/locally/).

I'm running into this problem while trying to get the Docker builds green in this PR: vllm-project/vllm#20136
https://buildkite.com/vllm/ci/builds/22761#0197ae37-4295-48c9-b5c9-fc843c84f497/124-10693

Signed-off-by: Tyler Michael Smith <[email protected]>
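
For reference, the pairing being matched here can be checked directly: torch.version.cuda reports the CUDA toolkit a given PyTorch wheel was built against. A minimal sketch (not part of this PR):

# Sketch: print the installed PyTorch version and the CUDA toolkit it was
# built against, to confirm which CUDA version PyTorch ships with.
python -c "import torch; print(torch.__version__, torch.version.cuda)"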
zou3519 commented Jun 30, 2025

Btw, we've run into issues building vLLM locally with CUDA 12.6. There's an issue somewhere with FlashAttention not building well with CUDA 12.6. If you want to pick another version, PyTorch supports 12.8 too (and 12.8 supports Blackwell, but I'm not too sure how that interaction goes).

tlrmchlsmth
Contributor Author

@zou3519 thanks for that info.

The build image already defaults to 12.8.1; I'm just trying to get the rest of the builds green. They're chronically red, since they aren't enabled by default.

If 12.6 is problematic as well, we could simply delete the 12.1 build; I think that's a fair option to consider.
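
To spell that out (a sketch, not verbatim from the repo): docker/Dockerfile presumably declares something like ARG CUDA_VERSION=12.8.1 near the top, and each per-version CI build overrides it, e.g.:

# Sketch: the image defaults to CUDA 12.8.1 via its build arg; the 12.1 CI
# build would override it roughly like this (exact patch level assumed).
docker build --file docker/Dockerfile --build-arg CUDA_VERSION=12.1.1 --target test .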

tlrmchlsmth
Contributor Author

Running the following (on vllm-project/vllm#20324):

docker build --file docker/Dockerfile --build-arg max_jobs=16 --build-arg CUDA_VERSION=12.6.0 --target test --progress plain .

I am seeing:

#30 8641.4 [170/336] Building CUDA object vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o
#30 8641.4 FAILED: vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o
#30 8641.4 ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_PYBIND -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_vllm_fa3_C -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -D_vllm_fa3_C_EXPORTS -I/workspace/.deps/vllm-flash-attn-src/csrc -I/workspace/.deps/vllm-flash-attn-src/hopper -I/workspace/.deps/vllm-flash-attn-src/csrc/common -I/workspace/.deps/vllm-flash-attn-src/csrc/cutlass/include -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -g -DNDEBUG -std=c++17 -Xcompiler=-fPIC --expt-relaxed-constexpr -DENABLE_FP8 --threads=8 --expt-extended-lambda --use_fast_math -DCUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1 -D_GLIBCXX_USE_CXX11_ABI=1 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_80,code=sm_80 -MD -MT vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o -MF vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o.d -x cu -c /workspace/.deps/vllm-flash-attn-src/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu -o vllm-flash-attn/CMakeFiles/_vllm_fa3_C.dir/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_softcap_sm90.cu.o
#30 8641.4 ptxas info    : (C7517) warpgroup.wait is injected in around line 452106 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi128EEENS7_ILi80EEENS7_ILi256EEEEEELi256ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb0ELb1ELb1ELb1ELb0ELb0ELb1ELb1ELb1ELb1ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_SC_SB_EEES9_SE_SG_Li256ELb1ELb1ELb1ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi128ELi256ELi128ELb1ELb1ELb1EEEEEEEEEvNT_6ParamsE'
...
#30 8641.4 ptxas info    : (C7517) warpgroup.wait is injected in around line 128681 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi192EEENS7_ILi128EEENS7_ILi96EEEEEELi96ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb0ELb1ELb1ELb1ELb0ELb0ELb0ELb1ELb1ELb1ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_SC_SB_EEES9_SE_SG_Li384ELb1ELb1ELb1ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi192ELi384ELi128ELb1ELb1ELb1EEEEEEEEEvNT_6ParamsE'
#30 8641.4 ptxas info    : (C7518) Potential Performance Loss: wgmma.mma_async instructions are serialized due to program dependence on compiler-inserted WG.DP in divergent path in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi192EEENS7_ILi128EEENS7_ILi96EEEEEELi96ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb0ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb1ELb1ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_SC_SB_EEES9_SE_SG_Li384ELb1ELb1ELb1ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi192ELi384ELi128ELb1ELb1ELb1EEEEEEEEEvNT_6ParamsE'
#30 8641.4 Segmentation fault (core dumped)

Let's land #117 instead of this one.

tlrmchlsmth closed this Jul 1, 2025