Add indexCache training support #2541
Open
faresobeid wants to merge 4 commits into
Co-authored-by: faresobeid <faresobeid@users.noreply.github.com>
Signed-off-by: faresobeid <111092724+faresobeid@users.noreply.github.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 563b77f.

Can use with
Note
Medium Risk
Touches core attention forward paths and introduces cross-layer state (cached indices), which could affect correctness/performance if misconfigured or if assumptions about layer scheduling break.
Overview
Enables DSA IndexCache in training by adding use_index_cache, index_topk_freq, and optional index_topk_pattern to the trainer ModelConfig and propagating them into the loaded HF model_config.
Updates the custom glm_moe_dsa implementation so decoder layers can reuse sparse attention top-k indices across layers: attention/decoder forwards now thread a cached_indices tensor through the stack, and a new per-layer skip policy (_index_cache_skip_topk) controls when indices are recomputed vs reused.
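
To make the recompute-vs-reuse idea concrete, here is a minimal, framework-free sketch of how a per-layer skip policy and a threaded cached_indices value could interact. It is not the code in this PR: the function names (skip_topk, topk_indices, run_stack), the interpretation of index_topk_freq as "recompute every N layers", and the "F"/"S" string encoding for index_topk_pattern are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def skip_topk(layer_idx: int, freq: int = 1, pattern: Optional[str] = None) -> bool:
    """Hypothetical policy: return True if this layer reuses cached indices.

    Assumed semantics (not confirmed by the PR):
      - pattern, if given, is a string of "F" (compute) / "S" (share) per layer
      - otherwise indices are recomputed on every `freq`-th layer
    """
    if layer_idx == 0:
        return False  # first layer has nothing to reuse
    if pattern is not None:
        return pattern[layer_idx % len(pattern)] == "S"
    return layer_idx % freq != 0


def topk_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores, highest first (stand-in for the indexer)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def run_stack(
    per_layer_scores: Sequence[Sequence[float]],
    k: int,
    freq: int = 1,
    pattern: Optional[str] = None,
) -> Tuple[List[List[int]], List[int]]:
    """Thread cached indices through a toy layer stack.

    Returns the indices each layer used and the list of layers that recomputed.
    """
    cached: Optional[List[int]] = None
    used: List[List[int]] = []
    recomputed: List[int] = []
    for layer_idx, scores in enumerate(per_layer_scores):
        if cached is None or not skip_topk(layer_idx, freq, pattern):
            cached = topk_indices(scores, k)  # recompute and refresh the cache
            recomputed.append(layer_idx)
        used.append(list(cached))  # sparse attention would consume these indices
    return used, recomputed


if __name__ == "__main__":
    scores = [
        [0.1, 0.9, 0.5],
        [0.9, 0.1, 0.5],
        [0.2, 0.3, 0.9],
        [0.5, 0.4, 0.1],
    ]
    used, recomputed = run_stack(scores, k=2, freq=2)
    print("recomputed at layers:", recomputed)
    print("indices used per layer:", used)
```

The sketch also illustrates the risk called out in the note: a layer that skips the top-k step attends with indices selected from an earlier layer's scores, so correctness depends on the schedule guaranteeing that a compute layer always precedes any reuse layer.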