feat: add heuristics for checkpoint files prefetching. #4765

Open
wants to merge 3 commits into
base: main

Conversation

yuxianq
Collaborator

@yuxianq yuxianq commented May 29, 2025

Prefetching of checkpoint files is now enabled only when the memory it would consume is less than 90% of the available CPU memory.
This PR also uses the local rank instead of the global rank to handle multi-node cases.
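
For readers who want a concrete picture, the heuristic described above could be sketched roughly as follows. This is not the code from this PR: the function names (`should_prefetch`, `total_size`, `files_for_local_rank`), the round-robin split of files across local ranks, and the idea that the caller supplies the available memory (for example from `psutil.virtual_memory().available`) are all assumptions made for illustration. Only the 90% threshold and the use of local rank come from the description.

```python
import os
from typing import List, Sequence

# Threshold taken from the PR description: prefetch only if the files
# would use less than 90% of the available CPU memory.
PREFETCH_MEMORY_FRACTION = 0.9


def total_size(paths: Sequence[str]) -> int:
    """Sum of on-disk sizes in bytes (hypothetical helper)."""
    return sum(os.path.getsize(p) for p in paths)


def should_prefetch(checkpoint_bytes: int, available_bytes: int,
                    fraction: float = PREFETCH_MEMORY_FRACTION) -> bool:
    """Return True when prefetching fits under the memory budget."""
    return checkpoint_bytes < available_bytes * fraction


def files_for_local_rank(paths: Sequence[str], local_rank: int,
                         local_world_size: int) -> List[str]:
    """Assumed sharding: each process on a node takes every
    local_world_size-th file, starting at its local rank, so the
    processes on one node together cover every file."""
    return list(paths[local_rank::local_world_size])
```

One plausible reason for keying on the local rank: each node has its own CPU memory, so a split by global rank would leave every node holding only part of the files, whereas a split by local rank lets the processes on each node cover the full set.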

@yuxianq yuxianq requested review from hlu1, nvpohanh and dongxuy04 May 29, 2025 11:55
@yuxianq yuxianq requested a review from a team as a code owner May 29, 2025 11:55
@yuxianq
Collaborator Author

yuxianq commented May 29, 2025

/bot run

@yuxianq
Collaborator Author

yuxianq commented May 30, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6992 [ run ] triggered by Bot

Signed-off-by: Yuxian Qiu <[email protected]>
@yuxianq
Collaborator Author

yuxianq commented May 30, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #7003 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6992 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #7003 [ run ] completed with state ABORTED

@yuxianq
Collaborator Author

yuxianq commented May 31, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #7123 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7123 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5148 completed with status: 'SUCCESS'


2 participants