
[BE] [float8] Run test_everything.sh in float8 test CI using linux.aws.h100.4 #2541


Merged: 2 commits into main, Jul 16, 2025

Conversation

danielvegamyhre (Contributor)

Fixes #2477

pytorch-bot bot commented Jul 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2541

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit 2530c2d with merge base c57226b:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jul 14, 2025
@danielvegamyhre added the topic: not user facing label on Jul 14, 2025
@danielvegamyhre (Contributor, Author)

fyi @vkuzo

drisspg (Contributor) left a comment


This is the main way we are testing our integrations; add it back.

Also, it seems there still exist tests that can run on fewer than 4 GPUs that are worth running.

@danielvegamyhre (Contributor, Author)

danielvegamyhre commented Jul 15, 2025

Also, it seems there still exist tests that can run on fewer than 4 GPUs that are worth running.

Can you clarify what you mean here? We could use linux.aws.h100.8 for tests that require more GPUs (e.g., 3D parallel). I have an in-progress MoE test that requires 8 GPUs (#2481), but the float8 distributed tests require at most 4, I believe.

drisspg (Contributor) left a comment


My comment was just that we are now requesting 4 GPUs for all of these tests where previously we only needed 1; I am not sure whether this ends up being harder to schedule or adds more delays.

@danielvegamyhre (Contributor, Author)

My comment was just that we are now requesting 4 GPUs for all of these tests where previously we only needed 1; I am not sure whether this ends up being harder to schedule or adds more delays.

Hmm, I see. Perhaps we can see how it goes over the next week or so, and if it's slowing us down then we can revisit.

@danielvegamyhre danielvegamyhre merged commit dd6a4f5 into main Jul 16, 2025
17 of 18 checks passed
vkuzo (Contributor) left a comment


how about:

  1. split test_everything.sh into one piece for single GPU and another one for multi GPU
  2. keep the single GPU one in the current target
  3. make a new target for multi GPU
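The suggested split could be sketched roughly as follows. Note this is a hypothetical illustration, not the actual layout of pytorch/ao: the script names and the GPU-count gate are assumptions for clarity.

```shell
#!/bin/bash
# Hypothetical sketch of splitting test_everything.sh into single-GPU and
# multi-GPU pieces; file names and the NGPU variable are illustrative.
set -e

# Pick a test bucket from the visible GPU count. In real CI this count
# would come from the runner (e.g. via nvidia-smi); here it is passed in.
select_float8_bucket() {
  local ngpu="$1"
  if [ "$ngpu" -ge 2 ]; then
    echo "multi_gpu"   # distributed float8 tests (FSDP/TP), h100.4 runner
  else
    echo "single_gpu"  # numerics/compile tests, kept on the 1-GPU target
  fi
}

bucket=$(select_float8_bucket "${NGPU:-1}")
echo "running float8 ${bucket} tests"
```

Gating on device count keeps the single-GPU suite cheap to schedule while confining the 4-GPU requirement to the distributed tests, which addresses the scheduling concern raised above.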

@vkuzo (Contributor)

vkuzo commented Jul 16, 2025

@danielvegamyhre IMO worth reverting or forward-fixing ASAP, because this PR seems to run the single-GPU float8 tests twice: once from the original code and once again from test_everything.sh

@danielvegamyhre (Contributor, Author)

danielvegamyhre commented Jul 16, 2025

Fix forward: #2561

Labels: ci, CLA Signed, float8, topic: not user facing
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Support running multi-device tests in CI
4 participants