
Ideal 2-Taken & 2-Fetch #736

Closed
Yakkhini wants to merge 9 commits into xs-dev from 2-taken-ideal

Conversation

@Yakkhini
Collaborator

@Yakkhini Yakkhini commented Jan 26, 2026

Change-Id: I39d54a0621d139cc00a156b02a6d7d888d9b15f0

Summary by CodeRabbit

  • New Features

    • Optional two-fetch mode: perform up to two predictions per cycle when enabled.
    • New configuration: toggle two-fetch and set max fetch bytes per cycle.
  • Refactor

    • Fetch and prediction flow reworked to support in-cycle two-fetch extension and updated fetch-loop/termination semantics.
  • Chores

    • Example presets updated to enable two-fetch and reduce queue sizes.

@coderabbitai

coderabbitai bot commented Jan 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

DecoupledBPUWithBTB gains a two-fetch mode and a max-fetch-bytes config; tick() can now produce up to two predictions per cycle. The fetch path supports an in-cycle 2‑fetch extension, keeps the next FSQ entry buffered, and changes the fetch-stop semantics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Prediction core<br>`src/cpu/pred/btb/decoupled_bpred.cc` | Adds batching in `tick()` to run up to 2 prediction iterations per cycle (controlled by new flags); moves per-iteration request/finalize/clear/dry-run/FSQ-enqueue logic into the loop; introduces `tempNumOverrideBubbles`. |
| Predictor interface / flags<br>`src/cpu/pred/btb/decoupled_bpred.hh`, `src/cpu/pred/BranchPredictor.py` | Adds configuration members `enable2Fetch` and `maxFetchBytesPerCycle`, an `enableTwoTaken` flag, and FTQ navigation/accessor helpers (`getTarget`, `ftqHasNext`, `ftqNext`, `is2FetchEnabled`, `getMaxFetchBytesPerCycle`). |
| Fetch logic<br>`src/cpu/o3/fetch.cc`, `src/cpu/o3/fetch.hh` | Adds the 2‑fetch extension path: conditionally perform `do_2fetch`, keep the next FSQ entry buffered, change the `lookupAndUpdateNextPC` return value to reflect stop-this-cycle semantics, force an I-cache reissue on buffer-edge PCs, and propagate `stopFetchThisCycle`. |
| Configs<br>`configs/example/kmhv3.py`, `configs/example/idealkmhv3.py` | Enable `cpu.branchPred.enable2Fetch = True` for DecoupledBPUWithBTB and reduce FTQ/FSQ sizes from 256 to 64 in the example configs. |
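The preset change amounts to the following sketch (SimpleNamespace stands in for the real gem5 param objects; `fetchTargetQueueSize` is a guessed FTQ parameter name, while `enable2Fetch` and `fetchStreamQueueSize` appear in the PR summary and review):

```python
from types import SimpleNamespace

# Stand-in for the gem5 config objects in configs/example/idealkmhv3.py,
# where cpu.branchPred is a DecoupledBPUWithBTB parameter object.
cpu = SimpleNamespace(branchPred=SimpleNamespace())

cpu.branchPred.enable2Fetch = True        # allow up to two predictions per cycle
cpu.branchPred.fetchStreamQueueSize = 64  # FSQ size, reduced from 256
cpu.branchPred.fetchTargetQueueSize = 64  # FTQ size, reduced from 256 (name is a guess)
```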

Sequence Diagram(s)

sequenceDiagram
    participant Core as DecoupledBPUWithBTB
    participant Predictor as BTB/TAGE
    participant FTQ as FTQ
    participant FSQ as FSQ
    participant Fetch as FetchUnit

    Note over Core: Up to N = (enable2Fetch ? 2 : 1) iterations per tick
    loop per-prediction
        Core->>Predictor: requestPrediction()
        Predictor-->>Core: provisionalPrediction
        Core->>Predictor: generateFinalPrediction()
        Predictor-->>Core: finalPrediction (+overrideBubbles)
        Core->>FTQ: ftqNext / getTarget()
        Core->>FSQ: enqueue(finalPrediction)
        alt do 2-fetch
            Core->>Fetch: signal do_2fetch (keep next FSQ entry buffered)
            Fetch-->>Core: continue fetch without consuming FTQ entry
        else single fetch
            Core->>Fetch: consume FTQ target / mark used
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • jensen-yan
  • tastynoob
  • CJ362ff

Poem

🐰 I hop in pairs where branches bend,

I stash the next and gently send,
Two little peeks in one swift beat,
Buffer tucked and pipeline neat,
— coderabbit 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: implementation of 2-taken and 2-fetch capabilities in the branch predictor and fetch logic. |





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `src/cpu/pred/btb/decoupled_bpred.cc`:
- Line 140: The local variable `tempNumOverrideBubbles` (declared as `unsigned tempNumOverrideBubbles = 0;`) is never used. Remove the dead declaration to avoid compiler warnings, making sure no other references to it remain in the function.
🧹 Nitpick comments (2)
src/cpu/pred/btb/decoupled_bpred.hh (1)

162-162: Consider making enableTwoTaken configurable via params.

This feature flag is hardcoded to true with no way to disable it through simulation parameters. Other similar options like fetchStreamQueueSize, predictWidth, and resolveBlockThreshold are initialized from the Params object in the constructor. For flexibility during experimentation and for consistency with the existing pattern, consider adding this to the DecoupledBPUWithBTBParams.

src/cpu/pred/btb/decoupled_bpred.cc (1)

142-184: Add documentation clarifying the multi-prediction loop behavior.

The loop logic for producing up to 2 predictions per tick is non-trivial. Consider adding a comment explaining:

  1. The intended behavior when both predictions succeed (no bubbles)
  2. What happens when the first prediction generates override bubbles (second iteration essentially becomes a no-op)
  3. The interaction between numOverrideBubbles being set inside the loop but decremented only once outside

This will help future maintainers understand the "ideal 2-taken" semantics.

📝 Suggested documentation

```diff
     int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
-    unsigned tempNumOverrideBubbles = 0;

+    // "Ideal 2-taken" mode: attempt up to 2 predictions per tick.
+    // - If the first prediction generates override bubbles, the second iteration
+    //   will be blocked by validateFSQEnqueue() until bubbles are consumed.
+    // - If no bubbles, both predictions can be enqueued in a single tick.
+    // - Bubble decrement happens once per tick after the loop.
     while (predsRemainsToBeMade > 0) {
```
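The commented semantics can be checked against a small Python model (a hypothetical simplification: prediction, the FSQ, and bubble accounting are stubbed out; only the control flow mirrors the C++ `tick()` loop):

```python
def tick(enable_two_taken, override_bubbles, fsq, make_prediction):
    """Model of one BPU tick: up to two predictions per cycle.

    make_prediction() returns (entry, new_bubbles). Enqueue is blocked while
    override bubbles are outstanding, standing in for validateFSQEnqueue()
    in the real code. Returns the bubble count at the end of the tick.
    """
    preds_remaining = 2 if enable_two_taken else 1
    while preds_remaining > 0:
        if override_bubbles > 0:
            break  # remaining iterations become no-ops until bubbles drain
        entry, new_bubbles = make_prediction()
        fsq.append(entry)
        override_bubbles += new_bubbles
        preds_remaining -= 1
    # bubble decrement happens once per tick, outside the loop
    if override_bubbles > 0:
        override_bubbles -= 1
    return override_bubbles
```

With no bubbles, two entries enqueue in one tick; if the first prediction generates override bubbles, the second iteration is skipped and one bubble is consumed.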

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1727 | - |
| This PR | 2.1706 | 📉 -0.0021 (-0.10%) |

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added the perf label Jan 27, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 0d664b4
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 20.64 | 20.27 | +1.86 🟢 |

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1727 | - |
| This PR | 2.2052 | 📈 +0.0325 (+1.50%) |

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added perf and removed perf labels Jan 29, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: ebf7386
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 20.64 | 20.27 | +1.86 🟢 |

@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1727 | - |
| This PR | 2.2052 | 📈 +0.0325 (+1.50%) |

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 9406572
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 20.65 | 20.27 | +1.92 🟢 |

@Yakkhini Yakkhini added perf and removed perf labels Jan 30, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Performance test triggered: spec06-0.8c

@Yakkhini
Collaborator Author

Yakkhini commented Mar 5, 2026

2026-03-05 Co-Analysis (New vs Ref)

1) Goal

This report re-analyzes the latest 2026-03-05 results against the reference run to answer:

  1. Has (near-perfect) 2Taken/2Fetch already removed frontend bandwidth bound?
  2. Has bottleneck fully shifted to branch misprediction and backend?
  3. Is there still practical performance headroom?

Also included as supplemental findings:

  • TopDown dominant-category migration.
  • Structural queue-pressure signals.
  • Workload-priority suggestions.

2) Data Sources

  • New: out/gem5/parallel-2026-03-05-enable-2-fetch-on-idealkmhv3/spec_all/perf-weighted.csv
  • Ref: out/gem5/parallel-2026-03-05-enable-2-fetch-on-idealkmhv3/ref/spec_all/perf-weighted-ref.csv

Only rows with valid numeric cycles are used (12 SPECint workloads).

3) Executive Summary

  • IPC geomean ratio (new/ref) is 1.0279x; the arithmetic-mean IPC delta is about +2.83%.
  • Frontend bottleneck is strongly reduced:
    • frontendBound: -48.53% avg
    • frontendBandwidthBound: -42.82% avg
    • frontendLatencyBound: -63.20% avg
  • But frontend bandwidth is not zero in absolute terms:
    • New absolute mean frontendBandwidthBound = 0.0490
    • New median frontendBandwidthBound = 0.0413
  • Bottleneck does shift toward speculation/backend/retiring mix:
    • New dominant categories: baseRetiring 7, badSpecBound 3, backendBound 2, frontendBound 0.
  • There is still headroom:
    • Rough upper-bound from removing only FE bandwidth term in new data: average potential about +5.3% IPC.

4) Core Question Answers

Q1. Is frontend bandwidth bound already gone?

No. It is much smaller, but not eliminated.

Evidence:

  • Relative to ref, frontendBandwidthBound improves for all 12 workloads.
  • Absolute new mean remains 0.0490.
  • Workloads with noticeable residual FE bandwidth in new data:
    • libquantum: frontendBandwidthBound=0.1120
    • astar: 0.0896
    • perlbench: 0.0869
    • sjeng: 0.0793

Q2. Has bottleneck fully shifted to branch misprediction + backend?

Partially, but not fully.

TopDown dominant category counts:

  • Ref: baseRetiring 8, badSpecBound 2, backendBound 2, frontendBound 0
  • New: baseRetiring 7, badSpecBound 3, backendBound 2, frontendBound 0

Interpretation:

  • FE is clearly no longer dominant.
  • More pressure appears in badSpecBound/backendBound on some workloads.
  • But many workloads are still retire-dominant; not all moved to bad-spec/backend.

Q3. Is there still improvement ceiling?

Yes.

Using a rough bound for new data: IPC_if_no_FE_bw ~= IPC / (1 - frontendBandwidthBound):

  • Average estimated additional room: ~5.3% IPC.
  • Higher residual FE-bandwidth opportunities:
    • libquantum: ~+12.6%
    • astar: ~+9.8%
    • perlbench: ~+9.5%
    • sjeng: ~+8.6%

This is optimistic and non-orthogonal, but confirms FE bandwidth is not yet fully exhausted.
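The rough bound can be reproduced directly from the residual frontendBandwidthBound values listed under Q1 (same optimistic, non-orthogonal estimate):

```python
# Optimistic headroom if the frontend-bandwidth term were fully removed:
# IPC_if_no_FE_bw ~= IPC / (1 - frontendBandwidthBound), so the relative
# gain is 1 / (1 - fe_bw) - 1, independent of the absolute IPC.
residual_fe_bw = {
    "libquantum": 0.1120,
    "astar": 0.0896,
    "perlbench": 0.0869,
    "sjeng": 0.0793,
}

def headroom_pct(fe_bw):
    """Estimated extra IPC (%) from removing the FE-bandwidth term."""
    return (1.0 / (1.0 - fe_bw) - 1.0) * 100.0

for wl, fe_bw in residual_fe_bw.items():
    print(f"{wl}: ~+{headroom_pct(fe_bw):.1f}%")
# libquantum ~+12.6%, astar ~+9.8%, perlbench ~+9.5%, sjeng ~+8.6%
```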

5) What Changed (New vs Ref)

5.1 Throughput and frontend supply

  • ipc: +2.83% avg (+1.64% median), 11 improved / 1 regressed (omnetpp slight).
  • fetch_nisn_mean: +10.15% avg.
  • ftqNotValid: -65.79% avg (all 12 improved).

These strongly indicate 2Taken/2Fetch improves frontend supply and feed continuity.

5.2 Trade-offs and side-effects

  • badSpecBound: +8.18% avg.
  • branchMissPrediction: +8.17% avg.
  • backendBound: +18.09% avg.
  • controlSquashFromDecode: +5.69% avg.
  • controlSquashFromCommit: +3.50% avg.
  • overrideCount: +27.23% avg; overrideBubbleNum: +27.11% avg.

Interpretation:

  • Frontend gain is real, but more aggressive/denser fetch stream also increases correction/override activity.
  • Once FE pressure drops, bad-spec and backend limits become more visible.

5.3 Queue-pressure signals (important)

  • fsqFullCannotEnq: +305.63% avg.
  • resolveQueueFull: +340.47% avg (with outlier-heavy distribution).

Normalized examples (new, per 1k committed insts):

  • fsqFullCannotEnq/kInst:
    • mcf ~939.5
    • omnetpp ~752.9
    • gcc ~341.7
    • libquantum ~254.5
  • resolveQueueFull/kInst:
    • perlbench ~11.1
    • xalancbmk ~7.1
    • gcc ~6.6

This suggests structural queue pressure remains a likely next limiter.
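The per-kInst normalization used for these examples is straightforward (the raw counter and instruction counts below are hypothetical; only the formula matches the report):

```python
def per_k_inst(event_count, committed_insts):
    """Normalize an event counter to events per 1000 committed instructions."""
    return event_count * 1000.0 / committed_insts

# Hypothetical raw counters, for illustration only:
# 500k fsqFullCannotEnq events over 1M committed insts -> 500 events/kInst
rate = per_k_inst(500_000, 1_000_000)  # 500.0
```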

6) Workload-Level Patterns

High IPC gain and strong FE reduction:

  • sjeng (+8.53%), perlbench (+6.52%), gcc (+6.38%), xalancbmk (+5.06%), gobmk (+3.55%).

Low/near-flat gains:

  • hmmer (+0.05%), libquantum (+0.11%), mcf (+0.30%), h264ref (+0.43%).

Slight regression:

  • omnetpp (-0.21%), where backend/memory pressure is already very strong.

Correlation hint:

  • IPC gain vs FE-bound reduction shows moderate positive trend (corr ~= 0.48), i.e., FE relief explains a meaningful part of gains.

7) Practical Conclusions

  1. 2Taken/2Fetch works: frontend bound is significantly reduced and IPC increases overall.
  2. Frontend bandwidth bound is not gone: absolute residual FE bandwidth is still visible, with benchmark-specific tails.
  3. Bottleneck migration is real but mixed: more bad-spec/backend pressure appears, but many workloads remain retire-dominant.
  4. Still has upside: FE residual + speculation quality + queue-structure tuning can still deliver gains.

8) Recommended Next Tuning Priorities

  1. Speculation-quality path (astar, gobmk, sjeng): reduce wrong-path/override penalties.
  2. Backend/memory path (omnetpp, mcf, gcc): memory/core backend constraints dominate.
  3. Queue-structure path: target fsqFullCannotEnq and resolveQueueFull hot workloads.
  4. Residual FE-bandwidth path (libquantum, astar, perlbench, sjeng): still has measurable headroom.

9) Notes

  • This report uses weighted aggregate CSVs; empty SPECfp rows are excluded.
  • Headroom estimates are intentionally rough and should be interpreted as directional bounds.
  • Some TopDown fields can be noisy (including occasional negative components); trend-level interpretation is recommended.

Yakkhini and others added 4 commits March 17, 2026 18:05
Change-Id: I39d54a0621d139cc00a156b02a6d7d888d9b15f0
Co-authored-by: Xu Boran <xuboran@bosc.ac.cn>
Change-Id: Ic203a9694c093034744986309e796b9d66d6f826
Change-Id: I3f0f686000b610c3bf842e62c9b9e91e7188a028
Change-Id: Id8896d4c28af5c11e71e3ef453068cd84231e468
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.2665 | - |
| This PR | 2.2970 | 📈 +0.0304 (+1.34%) |

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added perf and removed perf labels Mar 19, 2026
Change-Id: Ib34d2a7b889b5e06fed64bb700f0a9779ec54e69
@github-actions

🚀 Performance test triggered: gcc12-spec06-0.8c

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.2665 | - |
| This PR | 2.2970 | 📈 +0.0304 (+1.34%) |

✅ Difftest smoke test passed!

@XiangShanRobot

XiangShanRobot commented Mar 19, 2026

[Generated by GEM5 Performance Robot]
commit: eac43be
workflow: On-Demand SPEC Test (Tier 1.5)

https://github.com/OpenXiangShan/GEM5/actions/runs/23296346286?pr=736

Ideal BTB Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 21.44 | 20.70 | +3.61 🟢 |

Change-Id: I9523d6623d9e1afc95e15d0baeadbc406b18dfeb
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.2665 | - |
| This PR | 2.2970 | 📈 +0.0304 (+1.34%) |

✅ Difftest smoke test passed!

Change-Id: Icdab4d15307e9a1b3be947d26cddc2eafb5c904e
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.2665 | - |
| This PR | 2.3186 | 📈 +0.0521 (+2.30%) |

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added perf and removed perf labels Mar 24, 2026
@github-actions

🚀 Performance test triggered: gcc12-spec06-0.8c

Change-Id: Ic3b477418bae5150fec9c7491a0c949e63c01f5f
@Yakkhini
Collaborator Author

@Yakkhini Yakkhini added perf and removed perf labels Mar 26, 2026
@github-actions

🚀 Performance test triggered: gcc12-spec06-0.8c

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.2665 | - |
| This PR | 2.3383 | 📈 +0.0718 (+3.17%) |

✅ Difftest smoke test passed!

Change-Id: I6a3ea48aac437181b610155028b02af8fa3c700d
@Yakkhini
Collaborator Author

Yakkhini commented Apr 10, 2026

Closing this since further work has moved to a detailed predictor implementation. This PR aimed to give an overview-level, coarse-grained performance estimate for 2-Taken.

There are reports on the latest commit; access them from docs/plans.

@Yakkhini Yakkhini closed this Apr 10, 2026
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.2713 | - |
| This PR | 2.3433 | 📈 +0.0720 (+3.17%) |

✅ Difftest smoke test passed!
