Update workflows to cuda 12.4 #7000

loadams · 2025-02-04T20:56:29Z

Update existing workflows that use cu121 to cu124. Note that wherever a workflow downloads the latest torch, it will now get torch 2.6 rather than torch 2.5, the latest release provided with CUDA 12.1.
Note that nv-nightly is currently failing on master due to unrelated errors, so its failure can be ignored in this PR (nv-nightly was tested locally and passes with both 12.1 and 12.4).
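
For illustration only, here is a minimal sketch of how the CUDA tag could determine which torch build a workflow picks up. It assumes the workflows install torch from the PyTorch wheel index with `--index-url`; the helper names `torch_index_url` and `pip_install_command` are hypothetical and do not come from this PR.

```python
def torch_index_url(cuda_tag: str) -> str:
    """Return the PyTorch wheel index URL for a CUDA tag such as 'cu124'."""
    # Reject anything that is not 'cu' followed by digits (e.g. 'cu121', 'cu124').
    if not (cuda_tag.startswith("cu") and cuda_tag[2:].isdigit()):
        raise ValueError(f"unexpected CUDA tag: {cuda_tag!r}")
    return f"https://download.pytorch.org/whl/{cuda_tag}"


def pip_install_command(cuda_tag: str) -> list[str]:
    """Build the pip command (as an argument list) for the given CUDA tag."""
    # No torch version is pinned here, so pip resolves to the newest wheel
    # published on that index (assumed to be how "torch latest" is obtained).
    return ["pip", "install", "torch", "--index-url", torch_index_url(cuda_tag)]


if __name__ == "__main__":
    print(" ".join(pip_install_command("cu121")))
    print(" ".join(pip_install_command("cu124")))
```

Because the torch version is left unpinned in this sketch, changing only the tag from `cu121` to `cu124` changes which torch release is resolved, which is the effect the description above calls out.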

The NVIDIA Blackwell GPU generation has compute capability major version 10. The SM code and architecture should be `100`, but the current code generates `1.` because it expects a two-character string. This change treats the value as a string containing a `.`, splits it on that character, and uses the resulting list of strings.
Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
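
The failure mode described above can be reproduced with a small sketch. Both function names are hypothetical, and the fixed-position indexing in `sm_code_fixed_width` is an assumption about what the older logic looked like, not a copy of the project's code.

```python
def sm_code_fixed_width(cc: str) -> str:
    """Assumed old-style logic: take the 1st and 3rd characters of 'X.Y'."""
    # Works only while the major version is a single digit.
    return cc[0] + cc[2]


def sm_code_split(cc: str) -> str:
    """Split on '.' so that multi-digit major versions are handled."""
    major, minor = cc.split(".")
    return major + minor


if __name__ == "__main__":
    for cc in ("8.0", "9.0", "10.0"):
        print(cc, sm_code_fixed_width(cc), sm_code_split(cc))
```

For `"10.0"`, the fixed-position version returns `1.` (the first character plus the third, which is the dot), while the split version returns `100`.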

Signed-off-by: Logan Adams <[email protected]>

Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Fabien Dupont <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Signed-off-by: Logan Adams <[email protected]>

Signed-off-by: Logan Adams <[email protected]>

1. Update Intel oneAPI Base Toolkit to 2025.0
2. Update torch/ipex/oneccl to 2.5
Signed-off-by: Logan Adams <[email protected]>

Same as [this PR](#6922). [affeb88](affeb88) I noticed the CI updated the DCO check recently. Using the suggested rebase method for sign-off would reintroduce many conflicts, so I opted for a squash merge with sign-off instead. Thanks :)
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: Logan Adams <[email protected]>

Those files contain code that runs at import time, so on systems that don't support triton but have it installed, importing them causes issues. In general, I think it is better to import triton only when it is both installed and supported.
Signed-off-by: Omar Elayan <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
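
A minimal sketch of the "installed and supported" guard follows. The names `triton_usable` and `load_triton`, and the `platform_supports_triton` flag, are hypothetical; how the project actually decides that a platform supports triton is not shown in this PR.

```python
import importlib.util


def triton_usable(platform_supports_triton: bool) -> bool:
    """Return True only if triton is installed AND the platform supports it."""
    # find_spec checks for the package without importing it, so no triton
    # module-level code runs here.
    installed = importlib.util.find_spec("triton") is not None
    return installed and platform_supports_triton


def load_triton(platform_supports_triton: bool):
    """Import triton lazily; return None when it must not be imported."""
    if not triton_usable(platform_supports_triton):
        return None
    import triton  # deferred import: only reached when the guard passes
    return triton


if __name__ == "__main__":
    print(load_triton(platform_supports_triton=False))
```

Deferring the import into a function keeps the check cheap and avoids running triton's import-time code on unsupported systems.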

Signed-off-by: Logan Adams <[email protected]>

.github/workflows/nv-ds-chat.yml

- Update existing workflows that use cu121 to cu124. Note, this means that where we download torch latest, we will now be getting torch 2.6 rather than the torch latest 2.5 provided with cuda 12.1. - Note, nv-nightly is failing in master currently due to unrelated errors, so this could be ignored in this PR (nv-nightly tested locally, where it passes with 12.1 and it also passes with 12.4). --------- Signed-off-by: Fabien Dupont <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: inkcherry <[email protected]> Signed-off-by: Omar Elayan <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Liangliang Ma <[email protected]> Co-authored-by: inkcherry <[email protected]> Co-authored-by: Omar Elayan <[email protected]> Signed-off-by: gyou2021 <[email protected]>

- Update existing workflows that use cu121 to cu124. Note, this means that where we download torch latest, we will now be getting torch 2.6 rather than the torch latest 2.5 provided with cuda 12.1. - Note, nv-nightly is failing in master currently due to unrelated errors, so this could be ignored in this PR (nv-nightly tested locally, where it passes with 12.1 and it also passes with 12.4). --------- Signed-off-by: Fabien Dupont <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: inkcherry <[email protected]> Signed-off-by: Omar Elayan <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Liangliang Ma <[email protected]> Co-authored-by: inkcherry <[email protected]> Co-authored-by: Omar Elayan <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

- Update existing workflows that use cu121 to cu124. Note, this means that where we download torch latest, we will now be getting torch 2.6 rather than the torch latest 2.5 provided with cuda 12.1. - Note, nv-nightly is failing in master currently due to unrelated errors, so this could be ignored in this PR (nv-nightly tested locally, where it passes with 12.1 and it also passes with 12.4). --------- Signed-off-by: Fabien Dupont <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: inkcherry <[email protected]> Signed-off-by: Omar Elayan <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Liangliang Ma <[email protected]> Co-authored-by: inkcherry <[email protected]> Co-authored-by: Omar Elayan <[email protected]> Signed-off-by: yisheng <[email protected]>

fabiendupont and others added 8 commits February 7, 2025 14:56

Update workflows that use cuda 12.1 to use runners with 12.4

4bafe96

Signed-off-by: Logan Adams <[email protected]>

Update GH org references (#6998)

4d30eb9

Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Fabien Dupont <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Signed-off-by: Logan Adams <[email protected]>

Update CNAME

fefe45f

Signed-off-by: Logan Adams <[email protected]>

Update CNAME

81032c2

Signed-off-by: Logan Adams <[email protected]>

[XPU] max1100 workflow update for docker and softwares (#7003)

4557ab8

1. Update Intel oneAPI Base Toolkit to 2025.0
2. Update torch/ipex/oneccl to 2.5
Signed-off-by: Logan Adams <[email protected]>

loadams force-pushed the loadams/update-runners-124 branch from 1c88a19 to 0ae553c February 7, 2025 22:57

loadams requested review from tjruwase, tohtana, GuanhuaWang, hwchen2017 and jomayeri as code owners February 7, 2025 22:57

loadams added 2 commits February 7, 2025 14:57

Merge branch 'master' into loadams/update-runners-124

5392d97

Update cuda and torch versions

9087bb3

Signed-off-by: Logan Adams <[email protected]>

loadams changed the title ~~Update workflows that use cuda 12.1 to use runners with 12.4~~ [Test] Update workflows that use cuda 12.1 to use runners with 12.4 Feb 7, 2025

loadams and others added 2 commits February 10, 2025 08:21

Merge branch 'master' into loadams/update-runners-124

567f1a1

Experiment with updated branch in DSE

8f9c466

Signed-off-by: Logan Adams <[email protected]>

loadams mentioned this pull request Feb 10, 2025

Update weights_only due to change in default in torch 2.6+ deepspeedai/DeepSpeedExamples#957

Merged

Merge branch 'master' into loadams/update-runners-124

389241f

tjruwase reviewed Feb 12, 2025

.github/workflows/nv-ds-chat.yml (outdated, resolved)

tjruwase approved these changes Feb 12, 2025

Update back to master branch now that DSE is updated

9baf074

loadams changed the title ~~[Test] Update workflows that use cuda 12.1 to use runners with 12.4~~ Update workflows that use cuda 12.1 to use runners with 12.4 Feb 12, 2025

Merge branch 'master' into loadams/update-runners-124

bb70ef0

loadams changed the title ~~Update workflows that use cuda 12.1 to use runners with 12.4~~ Update workflows to cuda 12.4 Feb 12, 2025

Merge branch 'master' into loadams/update-runners-124

5fb91aa

loadams merged commit 079de6b into master Feb 12, 2025
12 of 13 checks passed

loadams deleted the loadams/update-runners-124 branch February 12, 2025 23:25
