H100 support issue #56

Closed
BugFreeee opened this issue Aug 10, 2023 · 4 comments

Comments

@BugFreeee

When can we expect H100 support? I have tried building an environment based on CUDA 11.8 and 12.0, but there seem to be some issues related to package discrepancies. Any suggestions for now?

@coryMosaicML
Collaborator

I just merged this PR that updates the dependencies for running on H100s. You may want to use the mosaicml/pytorch_vision:2.0.1_cu118-python3.10-ubuntu20.04 docker image, and install xformers via

pip install -U ninja
pip install -U git+https://github.com/facebookresearch/xformers

depending on your training config. Let us know if you run into issues!

@BugFreeee
Author

Hi, thanks for the quick reply. I used the new Docker image and installed ninja and xformers via pip as you suggested, but I am getting:
FATAL: kernel fmha_cutlassF_f32_aligned_64x64_rf_sm80 is for sm80-sm100, but was built for sm50

Have you tested it on H100? When I followed the old README, I managed to get it working on a DGX A100, so I think the code and dataset are fine on my side. The above issue only appeared after switching to a DGX H100.
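
For anyone hitting the same message: it names the SM architectures the kernel targets and the one it was built for, so it helps to know which SM the local GPU reports. A100 is compute capability 8.0 (sm80) and H100 is 9.0 (sm90). Below is a minimal sketch for printing that, assuming `nvidia-smi` is on the PATH and the driver is recent enough to support the `compute_cap` query field. The helper name `cap_to_sm` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: turn a compute capability such as "9.0" into "sm90".
cap_to_sm() {
    printf 'sm%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Print the SM name of each visible GPU, if nvidia-smi is available.
# Assumption: the driver supports the compute_cap query field.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=compute_cap --format=csv,noheader |
        while read -r cap; do cap_to_sm "$cap"; done
fi
```

If the printed SM is not covered by the architectures the extension was compiled for, a rebuild with a matching arch list is the likely next step.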

@BugFreeee
Author

Update: this seems to be an issue on the xformers side. There were some problems with their H100 support. Working on it.

@BugFreeee
Author

BugFreeee commented Aug 11, 2023

Solution:
export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;7.5;8.0;8.6;9.0"
before pip install -U git+https://github.com/facebookresearch/xformers
It seems xformers doesn't include the 9.0 arch by default.
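
One detail worth spelling out: the arch list has to be quoted, because an unquoted `;` ends the command in a POSIX shell and only `6.0` would be assigned. The steps above can be sketched as follows, assuming a bash or sh session. The helper `arch_list_contains` is hypothetical and only illustrates a sanity check before the source build.

```shell
# Hypothetical helper: succeed if arch $2 appears in the
# semicolon-separated list $1.
arch_list_contains() {
    case ";$1;" in
        *";$2;"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Quote the value: unquoted, the shell would split the command at each ';'.
export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;7.5;8.0;8.6;9.0"

if arch_list_contains "$TORCH_CUDA_ARCH_LIST" "9.0"; then
    echo "9.0 (H100) is in the build list"
    # Then rebuild xformers from source, as in the thread:
    # pip install -U git+https://github.com/facebookresearch/xformers
fi
```

The install line is left commented out so the check can be run on its own first.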


2 participants