Contributor

@zxd1997066 zxd1997066 commented Sep 19, 2025

This PR adds more ported distributed test cases to the torch-xpu-ops CI and enables pytest-xdist for the distributed UT.

The distributed UT time will increase to 2h20min with 2 work groups
(for reference, it takes 3h3m with 1 work group: https://github.com/intel/torch-xpu-ops/actions/runs/17902859755/job/50907350984)

disable_e2e
disable_ut

@zxd1997066 zxd1997066 force-pushed the xiangdong/ported_cases branch 2 times, most recently from cb15ece to 1cbc6b9 Compare September 22, 2025 02:34
@zxd1997066 zxd1997066 force-pushed the xiangdong/ported_cases branch 10 times, most recently from c5009f3 to 0d9b54f Compare September 25, 2025 09:35
Contributor

@chuanqi129 chuanqi129 left a comment

Please split the test scope into a CI scope and a nightly full scope.


inputs:
  ut_name:
    required: true

Suggested change
required: true
required: false

ze = xpu_list[i+1];
} else {
ze = i;
if [ "${{ inputs.ut_name }}" == "xpu_distributed" ];then

Are there any assumptions here? Can we detect the topology directly and dynamically on the test node?
Please consider the scenarios below:

  • No Xelink group: return failure
  • 1 Xelink group: launch 1 worker
  • 2 Xelink groups: launch 2 workers
  • ...
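
A minimal sketch of the scenarios above, assuming the Xelink groups have already been detected on the test node and are passed in as lists of device indices (the detection step itself is not shown). The helper names `plan_workers` and `xdist_args` are hypothetical, and pinning each worker through `ZE_AFFINITY_MASK` is an assumption about how this CI isolates devices, not something the PR confirms.

```python
from typing import Dict, List


def plan_workers(xelink_groups: List[List[int]]) -> List[Dict[str, str]]:
    """Return one environment overlay per worker, one worker per Xelink group.

    xelink_groups is assumed to come from a topology probe on the node,
    e.g. [[0, 1, 2, 3], [4, 5, 6, 7]] for two groups of four devices.
    """
    # Ignore empty groups so a malformed probe result cannot create idle workers.
    groups = [group for group in xelink_groups if group]
    if not groups:
        # Scenario "no Xelink group": fail instead of silently running on 1 device.
        raise RuntimeError("no Xelink group detected; cannot run distributed UT")
    # Assumption: each worker is restricted to its group via ZE_AFFINITY_MASK.
    return [
        {"ZE_AFFINITY_MASK": ",".join(str(device) for device in group)}
        for group in groups
    ]


def xdist_args(xelink_groups: List[List[int]]) -> List[str]:
    """Build the pytest-xdist worker-count arguments (-n <workers>)."""
    return ["-n", str(len(plan_workers(xelink_groups)))]
```

With two groups this yields `-n 2`, with one group `-n 1`, and with none it raises, so the worker count follows the node instead of being hard-coded in the workflow.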

runner:
runs-on: ${{ inputs.runner }}
name: get-runner
name: get-runner

Why do we have this change?
