[Test] Add integration tests to validate support for GB200. #6934
tests/integration-tests/tests/gb200/test_gb200/test_gb200/pcluster.config.yaml
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##           develop    #6934     +/-   ##
===========================================
- Coverage    90.21%   90.13%   -0.08%
===========================================
  Files          181      181
  Lines        16213    16396    +183
===========================================
+ Hits         14627    14779    +152
- Misses        1586     1617     +31
timeout ${IMEX_STOP_TIMEOUT} systemctl stop ${IMEX_SERVICE}
pkill -9 ${IMEX_SERVICE}

#TODO Improvement: rotate server port to prevent race condition
[Non-Blocking] When do we plan to do this? In the next phase, or in another iteration in the coming weeks?
I suggest addressing this improvement in a follow-up PR once we fully understand the implications.
So far I have not observed any race condition, even when re-running the same integration test on an existing cluster multiple times, but that may be because we are using only 2 nodes and not a real CUDA application.
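For reference, one possible shape of the port rotation mentioned in the TODO, as a minimal sketch only. The PR does not implement it. The function name `imex_port_for_job`, the base port 50000, and the range size 100 are hypothetical values chosen for illustration, and how the selected port would be handed to IMEX is deliberately left out.

```shell
# Hypothetical sketch: derive a per-job port so that consecutive jobs on the
# same nodes do not reuse the port of an IMEX instance that is still stopping.
# IMEX_PORT_BASE and IMEX_PORT_RANGE are assumptions, not values from the PR.
IMEX_PORT_BASE="${IMEX_PORT_BASE:-50000}"
IMEX_PORT_RANGE="${IMEX_PORT_RANGE:-100}"

imex_port_for_job() {
  local job_id="$1"
  # Accept only non-negative integers without leading zeros, so the shell
  # never interprets the value as octal in the arithmetic below.
  case "${job_id}" in
    ''|*[!0-9]*|0?*) echo "invalid job id: ${job_id}" >&2; return 1 ;;
  esac
  # Map the job id onto the port window [BASE, BASE + RANGE).
  echo $(( IMEX_PORT_BASE + job_id % IMEX_PORT_RANGE ))
}
```

With these defaults, calling `imex_port_for_job "${SLURM_JOB_ID}"` from a prolog would map job 1234 to port 50034. Jobs whose IDs differ by a multiple of the range size would still share a port, so the window would need to be sized against how quickly the previous IMEX instance releases its port.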
…favor of the cookbook attribute to force IMEX configuration.
Description of changes
Add integration tests to validate support for GB200. In particular, the test verifies the automated configuration of NVIDIA IMEX.
This test creates a cluster with the necessary custom actions to configure NVIDIA IMEX and verifies the following:
IMEX service is healthy and no errors are reported in IMEX's or prolog's logs.
Also, IMEX gets reconfigured when nodes belonging to the same compute resource get replaced
keeping the default values and IMEX is not started.
The test prints the full IMEX status to the test log to facilitate troubleshooting.
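The "no errors in the logs" check described above could look roughly like the following sketch. It is not the test's actual implementation: the function name `log_has_errors` and the case-insensitive `error` pattern are assumptions, and the real log paths are not stated in this description, so the caller passes them in.

```shell
# Hypothetical helper: exit 0 if the given log file contains a line matching
# an error pattern, exit non-zero otherwise.
# The default pattern is an assumption; the PR does not say what it matches.
log_has_errors() {
  local log_file="$1"
  local pattern="${2:-error}"
  # A missing log is treated as "no errors found" here; a stricter check
  # might prefer to fail the test instead.
  [ -f "${log_file}" ] || return 1
  # -q: no output, -i: case-insensitive, -E: extended regular expression.
  grep -qiE -- "${pattern}" "${log_file}"
}
```

A caller would invert the result to assert a clean log, for example `if log_has_errors "$imex_log"; then echo "errors found"; fi`, where `imex_log` is a placeholder for whichever log file is being checked.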
Important Notes
The test uses g4dn to simulate a p6e-gb200 instance. This is a reasonable approximation because the focus of the test is on IMEX configuration, which can be executed on g4dn as well.
Limitations
Tests
test_gb200
References
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.