tests: [TRTQA-2905] improve timeout report for qa test cases #4753

crazydemo · 2025-05-29T04:38:26Z

Description

This PR provides the following improvement:

Set a timeout for the MPI run so that it exits cleanly when a case exceeds its time limit.
Set additional time limits for long-running cases.
Check the CPU load when building flash-attn and choose max_jobs dynamically instead of hard-coding max_jobs = 4.
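
For reviewers, here is a minimal sketch of the two ideas above: a bounded subprocess run and a load-aware job count. It is not the code in this PR. The helper names `run_with_timeout` and `pick_max_jobs`, the `floor`/`cap` parameters, and the "idle cores = core count minus 1-minute load average" heuristic are all assumptions made for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and wait at most timeout_s seconds.

    Returns (returncode, timed_out). On timeout, subprocess.run kills the
    direct child before raising TimeoutExpired, so the caller is not left
    waiting on a hung launcher. Hypothetical helper, not from the PR.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        return None, True


def pick_max_jobs(cpu_count, load_1min, floor=1, cap=None):
    """Pick a parallel job count from the estimated number of idle cores.

    Assumed heuristic: idle cores = cpu_count - 1-minute load average,
    clamped to at least `floor` and, if given, at most `cap`.
    """
    idle = int(cpu_count - load_1min)
    jobs = max(floor, idle)
    if cap is not None:
        jobs = min(jobs, cap)
    return jobs
```

On a Unix host the inputs could come from `os.cpu_count()` and `os.getloadavg()[0]`, with the result exported before the build (for example as a `MAX_JOBS` environment variable, assuming the build reads one). Note that `subprocess.run` only kills the direct child on timeout; ranks spawned by an MPI launcher may need process-group handling, which this sketch does not cover.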

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without careful validation can break top of tree.

crazydemo · 2025-05-29T08:43:21Z

/bot run

tensorrt-cicd · 2025-05-29T08:48:54Z

PR_Github #6921 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-29T09:52:28Z

PR_Github #6921 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5023 completed with status: 'FAILURE'

Signed-off-by: Ivy Zhang <[email protected]>

crazydemo force-pushed the timeout branch 7 times, most recently from 12aba65 to 4670620, May 29, 2025 07:13

crazydemo changed the title ~~tests: [TRTQA-2905] get pytest timeout param in call func~~ tests: [TRTQA-2905] improve timeout report for qa test cases May 29, 2025

crazydemo marked this pull request as ready for review May 29, 2025 08:42

crazydemo requested review from EmmaQiaoCh, kaiyux, LarryXFly, StanleySun639 and xinhe-nv May 29, 2025 08:42

crazydemo added 4 commits May 30, 2025 15:53

add timeout threshold for long-running cases

91690fe

Signed-off-by: Ivy Zhang <[email protected]>

get pytest timeout on call func

e0739cd

Signed-off-by: Ivy Zhang <[email protected]>

adjust max jobs for compile flash-attn

d9717cb

Signed-off-by: Ivy Zhang <[email protected]>

adjust max jobs for compile flash-attn

9650517

Signed-off-by: Ivy Zhang <[email protected]>

crazydemo force-pushed the timeout branch from 4670620 to 9650517, May 30, 2025 07:53
