[CI] Add more ported distributed cases #2082
base: main
Conversation
Force-pushed: cb15ece → 1cbc6b9
Force-pushed: c5009f3 → 0d9b54f
Force-pushed: 0d9b54f → 85fa6f1
Please split the test scope into a CI scope and a nightly full scope.
inputs:
  ut_name:
    required: true
-    required: true
+    required: false
    ze = xpu_list[i+1];
  } else {
    ze = i;
if [ "${{ inputs.ut_name }}" == "xpu_distributed" ];then
Are there any assumptions here? Can we detect the topology directly and dynamically on the test node?
Please consider the scenarios below:
- No Xelink group: fail the job
- 1 Xelink group, launch 1 worker
- 2 Xelink group, launch 2 workers
- ...
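The scenario list above could be handled by deriving the worker count from the detected topology instead of hard-coding it. A minimal bash sketch, assuming the detected Xelink groups arrive as a space-separated list of device sets (how they are detected is deliberately left open); the function name `launch_workers` and the input format are hypothetical, not from this PR:

```shell
# Hypothetical sketch: one worker per detected Xelink group.
# Input: a space-separated list of comma-joined device ids,
# e.g. "0,1 2,3" means two groups of two devices each.
launch_workers() {
    local -a groups
    read -r -a groups <<< "${1:-}"
    local num_groups="${#groups[@]}"

    if [ "${num_groups}" -eq 0 ]; then
        # Scenario 1 above: no Xelink group detected -> fail.
        echo "No Xelink group detected, failing the job" >&2
        return 1
    fi

    # Scenarios 2..N: launch one worker per group, each pinned to its
    # group's devices via ZE_AFFINITY_MASK (Level Zero affinity env var).
    local i
    for i in "${!groups[@]}"; do
        echo "worker ${i}: ZE_AFFINITY_MASK=${groups[$i]}"
    done
}
```

With `"0,1 2,3"` this prints two worker lines; with an empty input it returns non-zero, matching the failure scenario.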
runner:
  runs-on: ${{ inputs.runner }}
  name: get-runner
Why do we have this change?
This PR intends to add more ported distributed cases to the torch-xpu-ops CI, and adds pytest-xdist for the distributed UT.
The distributed UT time will increase to 2h20min with 2 work groups
(reference: 3h3m for 1 work group, https://github.com/intel/torch-xpu-ops/actions/runs/17902859755/job/50907350984).
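For context, pytest-xdist shards tests across workers with `-n`, and `--dist loadscope` keeps all tests from one module or class on the same worker, which matters for distributed cases that share process-group setup. A sketch of how such an invocation could be assembled; the helper name and test path are illustrative, not the PR's actual script:

```shell
# Build the UT command from a worker count. build_ut_cmd and the
# test path are placeholders, not names from this PR.
build_ut_cmd() {
    local workers="$1" ut_path="$2"
    # -n <N>: run N xdist workers in parallel.
    # --dist loadscope: schedule whole modules/classes onto one worker
    # so shared fixtures and process groups stay together.
    echo "pytest -n ${workers} --dist loadscope ${ut_path}"
}

# Example: two work groups, as in the timing comparison.
build_ut_cmd 2 test/distributed
```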
disable_e2e
disable_ut