
[BE] [float8] Run test_everything.sh in float8 test CI using linux.aws.h100.4 #2541


Merged: 2 commits into main, Jul 16, 2025

Conversation

danielvegamyhre (Contributor)

Fixes #2477

pytorch-bot bot commented Jul 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2541

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit 2530c2d with merge base c57226b:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jul 14, 2025
@danielvegamyhre added the topic: not user facing label on Jul 14, 2025
@danielvegamyhre (Contributor, Author)

fyi @vkuzo

drisspg (Contributor) left a comment


This is the main way we are testing our integrations; add it back.

Also, it seems there still exist tests that can run on fewer than 4 GPUs that are worth running.

@danielvegamyhre (Contributor, Author)

danielvegamyhre commented Jul 15, 2025

Also, it seems there still exist tests that can run on fewer than 4 GPUs that are worth running.

Can you clarify what you mean here? We could use linux.aws.h100.8 for tests that require more GPUs (e.g., 3D parallel). I have an in-progress MoE test that requires 8 GPUs (#2481), but the float8 distributed tests require at most 4, I believe.

drisspg (Contributor) left a comment


My comment was just that we are now requesting 4 GPUs for all of these tests where previously we only needed 1; I am not sure whether this ends up being harder to schedule or adds more delays.

@danielvegamyhre (Contributor, Author)

My comment was just that we are now requesting 4 GPUs for all of these tests where previously we only needed 1; I am not sure whether this ends up being harder to schedule or adds more delays.

Hmm, I see. Perhaps we can see how it goes over the next week or so, and if it's slowing us down then we can revisit.

@danielvegamyhre danielvegamyhre merged commit dd6a4f5 into main Jul 16, 2025
17 of 18 checks passed
vkuzo (Contributor) left a comment


how about:

  1. split test_everything.sh into one piece for single GPU and another one for multi GPU
  2. keep the single GPU one in the current target
  3. make a new target for multi GPU
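The suggested split could be sketched roughly as follows. Note this is a hypothetical illustration, not the actual layout of pytorch/ao: the script names and the GPU-count gate are assumptions for clarity.

```shell
#!/bin/bash
# Hypothetical sketch of splitting test_everything.sh into single-GPU and
# multi-GPU pieces; file names and the NGPU variable are illustrative.
set -e

# Pick a test bucket from the visible GPU count. In real CI this count
# would come from the runner (e.g. via nvidia-smi); here it is passed in.
select_float8_bucket() {
  local ngpu="$1"
  if [ "$ngpu" -ge 2 ]; then
    echo "multi_gpu"   # distributed float8 tests (FSDP/TP), h100.4 runner
  else
    echo "single_gpu"  # numerics/compile tests, kept on the 1-GPU target
  fi
}

bucket=$(select_float8_bucket "${NGPU:-1}")
echo "running float8 ${bucket} tests"
```

Gating on device count keeps the single-GPU suite cheap to schedule while confining the 4-GPU requirement to the distributed tests, which addresses the scheduling concern raised above.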

@vkuzo (Contributor)

vkuzo commented Jul 16, 2025

@danielvegamyhre IMO worth reverting or forward-fixing ASAP, because this PR seems to run the single-GPU float8 tests twice: once from the original code and once again from test_everything.sh

@danielvegamyhre (Contributor, Author)

danielvegamyhre commented Jul 16, 2025

Fix forward: #2561

Labels: ci, CLA Signed, float8, topic: not user facing
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Support running multi-device tests in CI
4 participants