
cpu,arch-riscv,cpu-o3,bpu: align control-PC semantics, fetch coverage, and owner migration#805

Open
Lingrui98 wants to merge 12 commits into OpenXiangShan:xs-dev from Lingrui98:impl-control-pc-views

Conversation

Contributor

@Lingrui98 Lingrui98 commented Mar 22, 2026

Background

In the decoupled BTB frontend, split 4-byte RVI control instructions originally used the first halfword as the predictor-visible control-PC view, and fetch over-fetched a fixed, oversized window instead of following the predicted coverage span.

After switching the predictor-visible control identity to the tail halfword, the frontend also needs an explicit ownership model for split control instructions so that fetch, trace, RAS/uRAS, and stats/trace attribution can consistently use the right PC view.

What This PR Changes

This PR keeps the whole control-PC / owner-migration stack together and includes the following pieces:

  1. Align decoupled BTB control-PC coverage

    • Introduce derived control-PC / coverage helpers for split 4-byte RVI control instructions.
    • Make predictor-side coverage and taken matching use the same control-PC semantics.
  2. Align fetch range with predicted coverage

    • Make fetch requests follow the predicted coverage window instead of a fixed oversized span.
    • Allow split decode so a cross-boundary 4-byte instruction can complete naturally when its trailing half arrives in the next block.
  3. Switch predictor-visible control identity to tail halfword

    • Keep startPC as the architectural start-PC view.
    • Use tail-halfword controlPC for predictor-visible identity when the split-control case requires it.
  4. Migrate split-control ownership to the following fetch target

    • Carry explicit owner metadata in FetchTarget.
    • Let fetch hand off ownership only for the sequential split-control case.
    • Keep trace-mode fetch on the simpler sequential-consumption path.
  5. Fix follow-up regressions and simplify the final contract

    • Fix FS/trace follow-up behavior, start-PC callsites in RAS/uRAS, and the MicroTAGE inherited-parameter pitfall in BranchPredictor.py.
    • Rename decodeStartPC to ownerStartPC and centralize owner predicates in FetchTarget helpers so fetch/tests stop duplicating the protocol.
  6. Update frontend documentation

    • Document the startPC / controlPC / ownerStartPC split, the fetch-side owner handoff contract, and the MicroTAGE parameter pitfall in docs/Gem5_Docs/frontend/.
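
The pieces above can be summarized in a small self-contained sketch. The helper names `fetchCoverageSpan` and `controlPCFromStartPC` appear in this PR, but the exact signatures and bodies below are illustrative assumptions based on the description, not the gem5 source:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using Addr = uint64_t;  // stand-in for gem5's Addr

// Clamp the predicted coverage window to the fetch-buffer capacity.
// The min is taken in full Addr width before narrowing, so a huge
// predEndPC - startPC difference cannot wrap to a small value.
unsigned
fetchCoverageSpan(Addr startPC, Addr predEndPC, unsigned capacity)
{
    assert(predEndPC >= startPC);
    Addr span = predEndPC - startPC;
    return static_cast<unsigned>(std::min<Addr>(span, capacity));
}

// Predictor-visible control PC: for a 4-byte RVI control instruction
// the identity is the tail halfword (startPC + 2); a 2-byte compressed
// instruction keeps its architectural start PC.
Addr
controlPCFromStartPC(Addr startPC, unsigned instBytes)
{
    return instBytes == 4 ? startPC + 2 : startPC;
}
```

For example, a 4-byte branch whose first halfword sits at 0x103e (the last halfword of a 64-byte block) gets control PC 0x1040, inside the following block, which is exactly the split-control case the ownership handoff covers.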

Scope Notes

  • OpenSpec proposal/design/validation assets are intentionally excluded from this PR.
  • The PR keeps only code, tests, and frontend documentation in the review diff.
  • This PR does not attempt a larger FTQ/fetch refactor beyond stabilizing and simplifying the owner-migration contract.

Validation

  • scons -j8 build/RISCV/gem5.opt
  • scons -j8 --unit-test build/RISCV/cpu/pred/btb/test/btb.test.opt build/RISCV/cpu/pred/btb/test/fetch_coverage.test.opt build/RISCV/cpu/pred/btb/test/uras.test.opt
  • build/RISCV/cpu/pred/btb/test/btb.test.opt
  • build/RISCV/cpu/pred/btb/test/fetch_coverage.test.opt
  • build/RISCV/cpu/pred/btb/test/uras.test.opt
  • XS_MAX_INSTS=10000 TRACE_FORMAT=champsim bash util/xs_scripts/trace/run_trace_champsim.sh /nfs/home/share/glr/champsim_traces/cvp1_public/compute_fp_1.gz
  • Source-stack regression record kept from the validated owner-migration follow-up:
    • FS checkpoint gate: 12/12 PASS
    • Trace gate: 66/66 PASS
    • Final aggregation PASS on the original owner-migration validation run

Summary by CodeRabbit

  • New Features

    • Incremental partial-instruction decoding with decoder introspection; fetch issues coverage-window–aware requests and better multi-cacheline handling.
  • Refactor

    • Branch predictor and fetch logic adopt distinct PC views and refined ownership semantics for split-control instructions; fetch buffer/valid-size semantics tightened.
  • Tests

    • Added unit tests for partial decoding, fetch coverage, and BTB/branch-info PC semantics.
  • Documentation

    • Expanded frontend docs on PC views, owner-migration, and fetch refactor.

Copilot AI review requested due to automatic review settings March 22, 2026 13:45

coderabbitai bot commented Mar 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a46992ea-68df-4c48-ade0-3e0bc885ec68

📥 Commits

Reviewing files that changed from the base of the PR and between 546514e and 94bd711.

📒 Files selected for processing (1)
  • src/cpu/o3/fetch.cc
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cpu/o3/fetch.cc

📝 Walkthrough

Walkthrough

Adds partial-instruction assembly and delivery: generic decoder API gains partial-introspection and a new moreBytes overload; RISC‑V decoder implements chunked assembly and tests; fetch/BTB/front-end logic updated to compute and supply coverage‑window request spans and handle partial-buffer ownership.
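
The chunked-assembly idea can be modeled in miniature. `PartialInstBuffer`, `pushChunk`, and the `NeedMoreBytes` / `ReadyCompressed` / `ReadyFullWidth` outcomes are named in this walkthrough, but the fields and logic below are simplified assumptions, not gem5's actual layout:

```cpp
#include <cstdint>

// Possible results of delivering a chunk of bytes to the assembler.
enum class PartialInstResult { NeedMoreBytes, ReadyCompressed, ReadyFullWidth };

struct PartialInstBuffer {
    uint32_t bits = 0;       // assembled instruction bytes, little-endian
    unsigned assembled = 0;  // bytes accumulated so far

    // RVI encodings have the two low bits of the first halfword == 0b11;
    // anything else is a 2-byte compressed (RVC) instruction.
    static bool isFullWidth(uint16_t low) { return (low & 0x3) == 0x3; }

    PartialInstResult
    pushChunk(const uint8_t *data, unsigned len)
    {
        for (unsigned i = 0; i < len && assembled < 4; i++, assembled++)
            bits |= uint32_t(data[i]) << (8 * assembled);
        if (assembled >= 2 && !isFullWidth(bits & 0xffff))
            return PartialInstResult::ReadyCompressed;
        if (assembled >= 4)
            return PartialInstResult::ReadyFullWidth;
        return PartialInstResult::NeedMoreBytes;
    }

    void reset() { bits = 0; assembled = 0; }
};
```

A split 4-byte instruction then takes two `pushChunk` calls, one per fetch target, before reporting `ReadyFullWidth`.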

Changes

Cohort / File(s) Summary
Generic decoder interface
src/arch/generic/decoder.hh
Add hasPartialInst() and moreBytes(..., size_t validBytes) overload (default asserts validBytes == moreBytesSize() and forwards) to enable partial-byte delivery.
RISC‑V decoder impl & tests
src/arch/riscv/decoder.hh, src/arch/riscv/decoder.cc, src/arch/riscv/decoder.test.cc, src/arch/riscv/SConscript
Introduce PartialInstBuffer/PartialInstResult, incremental assembly via moreBytes(..., validBytes), update legacy moreBytes/reset/decode, add unit tests and test target.
Fetch unit (o3) refactor
src/cpu/o3/fetch.hh, src/cpu/o3/fetch.cc
Thread per-target requestSpan through cache fetch path; compute per-thread valid end and supply exact valid bytes to decoder (moreBytes(pc, fetchPC, validBytes)); change fetch-buffer semantics (validSize), multi-cacheline handling, packet merging, and split-control owner migration points.
BTB/BranchInfo coverage model
src/cpu/pred/btb/common.hh, src/cpu/pred/btb/*
Introduce startPC/controlPC/ownerStartPC views and helpers (fetchCoverageSpan, control/start conversions); update BranchInfo/FetchTarget fields and many callsites to use the new views.
Decoupled BTB predictor & stats
src/cpu/pred/btb/decoupled_bpred.{hh,cc}, src/cpu/pred/btb/decoupled_bpred_stats.cc
BTB unit tests & build
src/cpu/pred/btb/test/..., src/cpu/pred/btb/test/SConscript
Add fetch_coverage.test.cc, update tests to new PC semantics, register new test target.
Trace, RAS, and consumers
src/cpu/o3/trace/TraceFetch.cc, src/cpu/pred/btb/ras.{cc,test*.cc}, src/cpu/pred/btb/*_*.cc
Adjust trace rewind/seek and wrong-path PC selection to use startPC/controlPC; switch RAS push/return to fallThroughPC(); update many commit/lookup keys to use controlPCFromStartPC(...).
Docs
docs/Gem5_Docs/frontend/*
Add/expand documentation describing PC views, owner‑migration semantics, and fetch-refactor rationale and reading order.
Python predictor params
src/cpu/pred/BranchPredictor.py
Change MicroTAGE blockSize and numDelay from Param.Unsigned to plain class attributes to avoid param-shadowing.

Sequence Diagram

sequenceDiagram
    participant BTB as BTB/FTQ
    participant Fetch as Fetch Unit
    participant Cache as Memory Cache
    participant Decoder as RISC-V Decoder

    BTB->>Fetch: provide FetchTarget(startPC, predEndPC/coverage)
    Fetch->>Fetch: compute requestSpan from coverage
    Fetch->>Cache: request bytes [startPC, startPC+requestSpan)
    Cache->>Fetch: return packet(s) (may be partial)
    Fetch->>Fetch: assemble buffer, set validSize
    loop assemble instruction
        Fetch->>Decoder: hasPartialInst()?
        Decoder-->>Fetch: (true/false)
        Fetch->>Decoder: moreBytes(pc, fetchPC, validBytes)
        Decoder->>Decoder: PartialInstBuffer.pushChunk(...)
        alt ReadyCompressed / ReadyFullWidth
            Decoder-->>Fetch: instDone (ready)
        else NeedMoreBytes
            Decoder-->>Fetch: outOfBytes (need more)
            Fetch->>BTB: consider following target / request more bytes
        end
    end
    Fetch->>Decoder: decode(PCState) -> produce instruction
    Decoder-->>Fetch: decoded instruction
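
The fetch-side half of this loop can be sketched against a minimal stand-in decoder. `StubDecoder` and `assembleAcross` are invented names for illustration only; in gem5 the corresponding calls are `decoder->hasPartialInst()`, `decoder->moreBytes(pc, fetchPC, validBytes)`, and the `instDone` flag:

```cpp
#include <cstddef>
#include <vector>

// Minimal model: a decoder that needs 4 bytes to finish an instruction
// and can accept them in partial deliveries.
struct StubDecoder {
    unsigned have = 0, need = 4;
    bool instDone = false;
    bool hasPartialInst() const { return have > 0 && !instDone; }
    void moreBytes(size_t validBytes)
    {
        have += validBytes;
        instDone = have >= need;
    }
};

// Feed per-target byte spans until the decoder reports a complete
// instruction. Running out of spans while a partial instruction is
// buffered models fetch waiting for the following target's bytes.
bool
assembleAcross(StubDecoder &d, const std::vector<size_t> &spans)
{
    for (size_t span : spans) {
        d.moreBytes(span);
        if (d.instDone)
            return true;  // ready: hand the PC to decode()
        // outOfBytes: request the following coverage block and keep
        // the partial-instruction state alive across the boundary.
    }
    return d.instDone;
}
```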

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • jensen-yan
  • Yakkhini
  • tastynoob

Poem

🐇 I nibble bytes in little hops and bind them tight,

Two halves unite beneath the moonlit byte,
Coverage windows stretch to guide my stride,
Fetch passes pieces, decoder stitches with pride,
A happy rabbit hops — instructions now take flight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation / Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.52%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly summarizes the main changes: aligning control-PC semantics, fetch coverage, and owner migration in the CPU/architecture/BTB subsystems. It directly reflects the primary objectives of the changeset.





Copilot AI left a comment


Pull request overview

This PR aligns the decoupled-BTB “control-PC view” and the O3 fetch request range with the predicted coverage/span, specifically improving correctness for RISC-V 4B (RVI) control instructions that straddle fetch/block boundaries. It also introduces a split-decode path so the RISC-V decoder can accumulate partial bytes across adjacent fetch targets.

Changes:

  • Add derived PC/range semantics to BTB BranchInfo (startPC/triggerPC/endPCExclusive) and extend stream coverage for cross-boundary 4B control instructions.
  • Make O3 fetch request size follow the predicted coverage span (instead of a fixed window) and enable partial-byte delivery into the RISC-V decoder for split instructions.
  • Add focused unit tests for fetch coverage span and RISC-V partial instruction assembly.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/cpu/pred/btb/test/SConscript Registers the new fetch_coverage.test unit test.
src/cpu/pred/btb/test/fetch_coverage.test.cc Adds tests for fetchCoverageSpan() behavior.
src/cpu/pred/btb/test/btb.test.cc Adds tests validating derived PC views and trigger coverage behavior.
src/cpu/pred/btb/decoupled_bpred.hh Exposes “following FTQ target” accessors for split-fetch scenarios.
src/cpu/pred/btb/decoupled_bpred.cc Extends predEndPC using control-instruction coverage end when taken.
src/cpu/pred/btb/common.hh Introduces coverage-span helper and derived views on BranchInfo.
src/cpu/o3/fetch.hh Adds request-span plumbing and tracks valid byte count in fetch buffer.
src/cpu/o3/fetch.cc Implements coverage-sized I-cache requests and split-byte delivery to the decoder.
src/arch/riscv/SConscript Adds and registers riscv_decoder.test.
src/arch/riscv/decoder.test.cc Adds unit tests for partial instruction buffering/assembly.
src/arch/riscv/decoder.hh Adds PartialInstBuffer and overrides partial-byte moreBytes().
src/arch/riscv/decoder.cc Implements partial-byte delivery and resets partial state appropriately.
src/arch/generic/decoder.hh Adds a default moreBytes(pc, fetchPC, validBytes) overload and hasPartialInst().
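
The default-overload pattern described for src/arch/generic/decoder.hh can be sketched as below. The member names loosely mirror the summary (`moreBytesSize`, the two `moreBytes` overloads, a default that asserts and forwards), but the PC arguments are elided and the types simplified, so treat this as an assumption-laden illustration rather than the gem5 API:

```cpp
#include <cassert>
#include <cstddef>

struct InstDecoder {
    virtual ~InstDecoder() = default;
    virtual size_t moreBytesSize() const = 0;

    // Legacy full-width delivery hook (pc/fetchPC args elided here).
    virtual void moreBytes() = 0;

    // New overload: by default only full-width deliveries are allowed;
    // an architecture overrides this to accept partial bytes.
    virtual void moreBytes(size_t validBytes)
    {
        assert(validBytes == moreBytesSize());
        moreBytes();
    }
};

// A decoder that keeps the legacy behavior. The using-declaration is
// needed so the base's size_t overload is not hidden by the override.
struct FullOnly : InstDecoder {
    unsigned calls = 0;
    using InstDecoder::moreBytes;
    size_t moreBytesSize() const override { return 4; }
    void moreBytes() override { ++calls; }
};

// A decoder (RISC-V-like) that overrides the overload to accept
// partial-byte deliveries.
struct Partial : InstDecoder {
    size_t got = 0;
    size_t moreBytesSize() const override { return 4; }
    void moreBytes() override { got += 4; }
    void moreBytes(size_t v) override { got += v; }
};
```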


fetchCoverageSpan(Addr startPC, Addr predEndPC, unsigned capacity)
{
assert(predEndPC >= startPC);
return std::min<unsigned>(capacity, predEndPC - startPC);

Copilot AI Mar 22, 2026


fetchCoverageSpan() uses std::min<unsigned>(capacity, predEndPC - startPC), which narrows a 64-bit Addr difference to unsigned before taking the min. If predEndPC-startPC exceeds UINT_MAX (or wraps on narrowing), this can yield an incorrect (even zero) span and trip requestSpan > 0 asserts downstream. Consider performing the min in Addr/size_t and only cast to unsigned after clamping to capacity.

Suggested change
return std::min<unsigned>(capacity, predEndPC - startPC);
Addr span = predEndPC - startPC;
Addr clamped_span = std::min<Addr>(span, static_cast<Addr>(capacity));
return static_cast<unsigned>(clamped_span);
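
The hazard flagged here can be reproduced in isolation. The two helpers below are hypothetical illustrations of the buggy and fixed orderings (with the implicit narrowing in `std::min<unsigned>` written as an explicit cast so the sketch compiles cleanly):

```cpp
#include <algorithm>
#include <cstdint>

using Addr = uint64_t;

// Buggy ordering: the 64-bit difference is narrowed to unsigned before
// the min, so a span above UINT_MAX wraps modulo 2^32.
unsigned
spanNarrowedFirst(Addr startPC, Addr predEndPC, unsigned capacity)
{
    return std::min<unsigned>(capacity,
                              static_cast<unsigned>(predEndPC - startPC));
}

// Fixed ordering: clamp in Addr width, narrow only after the min, so
// the result is always <= capacity and never wraps.
unsigned
spanClampedFirst(Addr startPC, Addr predEndPC, unsigned capacity)
{
    return static_cast<unsigned>(
        std::min<Addr>(predEndPC - startPC, capacity));
}
```

With a span of 2^32 + 4 bytes, the buggy variant returns 4 while the fixed variant correctly clamps to the capacity.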

Addr getEnd() { return endPCExclusive(); }
Addr getEnd() const { return endPCExclusive(); }
BranchInfo()
: pc(0), target(0), resolved(false), isCond(false), isIndirect(false), isCall(false), isReturn(false), size(0)

Copilot AI Mar 22, 2026


BranchInfo() default constructor doesn’t initialize the isDirect field. This leaves it with an indeterminate value and can cause undefined behavior when isDirect is later read. Please add isDirect(false) to the initializer list (or otherwise ensure it is always initialized).

Suggested change
: pc(0), target(0), resolved(false), isCond(false), isIndirect(false), isCall(false), isReturn(false), size(0)
: pc(0), target(0), resolved(false),
isCond(false), isIndirect(false), isDirect(false),
isCall(false), isReturn(false), size(0)

delete secondPkt;
for (auto *packet : threads[tid].cacheReq.packets) {
delete packet;
}

Copilot AI Mar 22, 2026


After merging, the code deletes the packets but leaves threads[tid].cacheReq.packets populated with dangling pointers (and keeps request/status metadata). This can trip drainSanityCheck() (which asserts cacheReq.packets.empty()) and risks accidental use-after-free if any later debug/logic touches the vector. Clear/reset the cacheReq state after cleanup (e.g., null/clear the packet vector and reset counters/status).

Suggested change
}
}
// Remove dangling pointers and reset the packet container
threads[tid].cacheReq.packets.clear();


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
src/arch/riscv/decoder.cc (1)

56-72: Legacy path behavior change is backward-compatible but worth documenting.

The 2-arg moreBytes now explicitly resets partialInst and sets instDone = true. This is correct for CPUs (Minor, Simple) that always deliver complete MachInst-sized payloads in one call.

However, if any code path accidentally mixes the 2-arg and 3-arg versions on the same decoder instance, partial assembly state would be silently lost. Consider adding an assertion or warning when partialInst has bytes:

🛡️ Optional defensive check
 void
 Decoder::moreBytes(const PCStateBase &pc, Addr fetchPC)
 {
     // Legacy full-width delivery path, retained for existing callers that
     // still provide a complete MachInst-sized payload in one shot.
+    if (partialInst.hasBytes()) {
+        warn_once("moreBytes(2-arg) called with partial state; "
+                  "discarding %u assembled bytes\n", partialInst.assembledBytes);
+    }
     auto inst = letoh(machInst);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/arch/riscv/decoder.cc` around lines 56 - 72, The legacy 2-arg moreBytes
path currently unconditionally resets partialInst and sets instDone = true,
which silently discards any in-progress assembly if the 3-arg path was used
earlier; update the 2-arg moreBytes implementation in decoder.cc to detect
existing partial bytes (check partialInst's size/state) and emit a defensive
check — either assert/fail or DPRINTF/warning — before resetting partialInst,
referencing the symbols partialInst and instDone (and the moreBytes entry point)
so callers mixing 2-arg and 3-arg usage are alerted rather than silently losing
state.
src/arch/riscv/decoder.hh (1)

67-76: Manual popcount could be replaced with builtin.

The countValidBytes function is a manual bit-counting implementation. For a 4-bit mask this is fine, but consider using the compiler builtin for clarity:

♻️ Optional simplification
     static unsigned
     countValidBytes(uint8_t mask)
     {
-        unsigned count = 0;
-        while (mask != 0) {
-            count += mask & 0x1;
-            mask >>= 1;
-        }
-        return count;
+        return __builtin_popcount(mask);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/arch/riscv/decoder.hh` around lines 67 - 76, Replace the manual loop in
countValidBytes(uint8_t mask) with the compiler builtin popcount for clarity and
potential efficiency: use __builtin_popcount (or __builtin_popcountl) on the
mask (casting mask to unsigned/int as needed) and return that result; keep the
function name countValidBytes and signature unchanged so callers remain
unaffected.
src/cpu/o3/fetch.cc (1)

2114-2133: Split instruction fetch coordination works as designed but lacks defensive safeguards.

The logic correctly coordinates split 32-bit instruction handling by waiting for the following FSQ entry before proceeding. However, verification shows there is no timeout or fallback mechanism if the following entry fails to arrive—the code will stall indefinitely at line 2114-2120.

While the predictor is responsible for guaranteeing that following FSQ entries are generated when a decoder has a partial instruction (indicated by extending predEndPC), there are no defensive safeguards if this guarantee is violated. The fallback target selection logic at lines 2124-2125 is unreachable when hasPartialInst() is true, leaving no escape path if the following entry is delayed or never arrives.

Consider adding:

  • A timeout or maximum stall counter to detect stuck predictor states
  • A panic/assertion if following entry doesn't arrive within reasonable time
  • A comment explaining the predictor's responsibility and why this blocking behavior is safe
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/o3/fetch.cc` around lines 2114 - 2133, The wait for a following FSQ
entry when decoder[tid]->hasPartialInst() and dbpbtb->ftqHasFollowing(tid) is
false can stall forever; add a defensive per-thread stall counter (e.g., in
threads[tid] or a member) that increments each time this branch returns and is
reset when the following entry is seen, and if it exceeds a sensible threshold
log an error/panic (via DPRINTF/processLogger) and fall back to a safe target
selection (call shouldFetchFollowingTarget(tid, pc_state) again or use
dbpbtb->ftqFetchingTarget(tid) / pc_state.instAddr() as a conservative
start_pc), then continue with the existing logic (set threads[tid].startPC,
compute request_span and call fetchCacheLine). Also add a short comment near
decoder[tid]->hasPartialInst() explaining predictor responsibility and why the
timeout/fallback exists.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/arch/generic/decoder.hh`:
- Around line 148-160: The generic three-arg virtual moreBytes(const PCStateBase
&pc, Addr fetchPC, size_t validBytes) must not assert that validBytes ==
moreBytesSize() because O3 fetch sometimes supplies fewer bytes; instead remove
the assert and treat mismatched validBytes as a partial-delivery case by
delegating to the two-arg moreBytes(pc, fetchPC) (i.e., drop the assertion and
call moreBytes(pc, fetchPC) when validBytes != moreBytesSize()), so non-RISC-V
decoders that don't override the 3-arg method won't crash when O3 calls
moreBytes with fewer bytes.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ad59d4af-6b6e-4cd8-88e1-fbe9780ecba9

📥 Commits

Reviewing files that changed from the base of the PR and between 103c503 and 8412ba7.

📒 Files selected for processing (13)
  • src/arch/generic/decoder.hh
  • src/arch/riscv/SConscript
  • src/arch/riscv/decoder.cc
  • src/arch/riscv/decoder.hh
  • src/arch/riscv/decoder.test.cc
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/pred/btb/common.hh
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
  • src/cpu/pred/btb/test/SConscript
  • src/cpu/pred/btb/test/btb.test.cc
  • src/cpu/pred/btb/test/fetch_coverage.test.cc


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cpu/o3/fetch.cc (1)

500-574: ⚠️ Potential issue | 🟠 Major

This still assumes every coverage window fits in two cache lines.

requestSpan is now derived from prediction coverage, but this implementation still emits at most two requests. After the first line, the second request is sized as the entire remainder, so a smaller cacheBlkSize or a larger fetchBufferSize can produce a request that crosses yet another line. Either split in a loop or guard the two-line assumption explicitly.

Minimal guard if you want to keep the current two-line implementation
     Addr fetchPC = vaddr;
-    unsigned fetchSize = cacheBlkSize - fetchPC % cacheBlkSize;  // Size for first cache line
+    const unsigned firstLineBytes =
+        cacheBlkSize - fetchPC % cacheBlkSize;
+    panic_if(requestSpan > firstLineBytes + cacheBlkSize,
+             "coverage span %u exceeds two-line fetch support "
+             "(pc=%#x, line=%u)",
+             requestSpan, vaddr, cacheBlkSize);
+    unsigned fetchSize = firstLineBytes;  // Size for first cache line
     fetchSize = std::min(fetchSize, requestSpan);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/o3/fetch.cc` around lines 500 - 574, The current
Fetch::handleMultiCacheLineFetch assumes at most two cache-line requests by
making the second request cover the entire remainder (using fetchSize =
requestSpan - fetchSize), which can still cross additional lines; change this to
either (preferred) loop-split the remainder into per-cache-line RequestPtr
objects and call updateCacheRequestStatusByRequest/translateTiming for each new
request until the full requestSpan is covered (adding each to
threads[tid].cacheReq and incrementing req numbers), or (alternate) add an
explicit guard that fails/limits when requestSpan would require >2 lines; locate
logic around first_mem_req, second_mem_req, fetchSize, requestSpan and
threads[tid].cacheReq to implement the loop or the guard.
🧹 Nitpick comments (1)
src/cpu/pred/btb/common.hh (1)

23-29: Tighten the non-empty coverage contract here.

fetchCoverageSpan() can return 0 when predEndPC == startPC, but Fetch::fetchCacheLine() later asserts requestSpan > 0. Rejecting zero-width coverage at this helper boundary keeps that invariant in one place.

Possible tweak
-    assert(predEndPC >= startPC);
+    assert(predEndPC > startPC);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/btb/common.hh` around lines 23 - 29, The helper
fetchCoverageSpan currently allows a zero-length span (predEndPC == startPC)
which violates Fetch::fetchCacheLine's requestSpan > 0 contract; change the
precondition in fetchCoverageSpan by replacing the non-strict assert with a
strict one (assert(predEndPC > startPC)) so the function rejects zero-width
coverage at the helper boundary and keeps the invariant centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpu/o3/fetch.hh`:
- Around line 757-761: The current CacheRequest teardown drops stored PacketPtr
values via packets.clear() in Fetch::CacheRequest::reset(), which can leak when
a split fetch half is present; move packet destruction into the common teardown
used by both completion and squash/reset. Update Fetch::CacheRequest::reset() to
call releaseStoredPackets() (or replicate its proper destruction logic) instead
of a bare packets.clear(), and ensure processMultiCacheLineCompletion() no
longer double-destroys packets; reference functions/members:
Fetch::CacheRequest::reset, Fetch::releaseStoredPackets,
Fetch::processMultiCacheLineCompletion, and the packets/PacketPtr storage so the
stored packets are always destroyed on reset or completion.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 590dbd8d-532d-42e5-a0ea-96a376ad5595

📥 Commits

Reviewing files that changed from the base of the PR and between 8412ba7 and b92a977.

📒 Files selected for processing (7)
  • src/arch/riscv/decoder.cc
  • src/arch/riscv/decoder.hh
  • src/arch/riscv/decoder.test.cc
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/pred/btb/common.hh
  • src/cpu/pred/btb/test/fetch_coverage.test.cc
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/arch/riscv/decoder.cc

@github-actions

🚀 Performance test triggered: gcc12-spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: b92a977
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

Score: PR 20.69 | Master 20.70 | Diff -0.01% 🔴

Lingrui98 added 11 commits April 3, 2026 22:17
Keep the canonical control PC as startPC for debug and attribution.

Add derived trigger/end/coverage helpers, extend predEndPC for split 4B taken branches, and switch fetch redirect matching to coverage-based logic with blocked/taken debug prints.

Also add helper-level BTB cases for 2B/4B derived PC views and cross-boundary trigger coverage.

Change-Id: I442b44312d6a6e9f09bb7c6621f2ca3c26107e6f
Limit fetch requests to the predicted coverage window instead of always
consuming the full 66-byte buffer span.

Keep 66 bytes as fetch-buffer capacity, but let the RISC-V decoder assemble
split 32-bit instructions from partial-byte deliveries and request the
following coverage block only when the trailing half is still missing.

Also add decoder helper tests and a fetch-coverage helper test so the new
coverage and split-decode behavior stays checked at helper level.

Change-Id: Ifa6cfcfb99159084593d02827ffe8b051b1c0d36
Restore the legacy RISC-V decoder needMoreBytes semantics for 32-bit
instructions, clamp fetchCoverageSpan before narrowing to unsigned,
initialize BranchInfo::isDirect in the default constructor, and drop
stored packet pointers after misaligned fetch completion.

Also add helper-level regression coverage for the legacy decoder path
and the widened-span clamp behavior.

Change-Id: Ideb1c46ccc59251bf92f1a008eec533052bdea3a
Make predictor-visible BTB branch identity use the final-halfword control PC while preserving explicit architectural start-PC metadata for stats, trace, and fetch byte coverage.

This commit also adds the pending OpenSpec baseline and dependent change docs for the new startPC/controlPC split.

Change-Id: Id1e9694a4789b8ddf3738ef2526b3da5748b955f
Add fetch-side ownership migration for split taken control instructions by carrying decodeStartPC in FetchTarget, consuming the previous target before buildInst when the following target claims the instruction, and matching taken redirects on the owner target's branch start PC.

This removes fetch's trigger_covered dependency while preserving controlPC-only predictor keying and startPC-based stats/debug semantics.

Change-Id: Iebf699048cad58c65cffdc165e349a74015a5809
Keep predictor-visible control-PC history recovery aligned with speculative updates, skip FTQ owner migration in trace mode, and fix trace rollback to resume at the correct 1-based trace index.

Also update the OpenSpec proposal/tasks to record the manifest-driven FS target12 plus trace66 regression gate and final aggregation outputs.

Change-Id: If408a9237db1e886140bacc065aa6d78b12a2848
Use the architectural branch start PC when deriving call fall-through addresses in RAS/uRAS, while keeping predictor-visible controlPC for predictor lookup state.

Also switch trace wrong-path NOP sizing back to predBranchInfo.startPC() and add directed BranchInfo/RAS coverage for split 4B instructions.

Change-Id: Id84ac2624e1bbccae9d73e05d55a8bb2d9279ba2
Tighten fetch-side FTQ owner migration so it only fires for the
sequential split-control handoff case, not for ordinary taken-target
entries that happen to sit behind the current fetch head.

Also keep control squashes keyed by predictor-visible controlPC and add a
unit test that blocks the backward-target regression.

Change-Id: Ia8cb92e0ac44f84cbe64dd5b84b4d1af11d8e7da
The control-PC owner-migration validation uncovered two blockers.

MicroTAGE redeclared inherited TimedBaseBTBPredictor params in Python,
which generated shadow C++ param fields and left the base predictor with a
zero blockSize at runtime. That in turn produced bogus branch-position
warnings during trace runs. Fix this by overriding the inherited defaults
without redeclaring the Param fields.

Also track the OpenSpec validation assets for this change and make the
trace_top66 task derive its status from the actual trace run directory
instead of inline text matching. This keeps the manifest-driven FS+trace
regression reproducible and prevents false task failures during final
aggregation.

Validation:
- openspec validate update-bpu-control-pc-owner-migration --strict
- scons -j128 build/RISCV/gem5.opt
- FS checkpoint gate: 12/12 PASS
- Trace gate: 66/66 PASS
- final_verify overall PASS on control_pc_tail_halfword_20260401_recheck1

Change-Id: Ia2b3fc2f24a9873b41da88f39d0f94a77af30b46
Factor split-control ownership semantics into FetchTarget helpers so fetch-side handoff, owner-range checks, and taken matching share one contract.

Rename decodeStartPC to ownerStartPC to match the actual domain concept, keep trace mode out of fetch-time owner migration, and update BTB tests to validate the shared helper behavior.

Change-Id: I7fd65621b57fd1fada5b9528431a6bbedb6410d9
Validation:
- scons -j8 build/RISCV/gem5.opt
- scons -j8 --unit-test build/RISCV/cpu/pred/btb/test/btb.test.opt build/RISCV/cpu/pred/btb/test/fetch_coverage.test.opt build/RISCV/cpu/pred/btb/test/uras.test.opt
- build/RISCV/cpu/pred/btb/test/btb.test.opt
- build/RISCV/cpu/pred/btb/test/fetch_coverage.test.opt
- build/RISCV/cpu/pred/btb/test/uras.test.opt
- XS_MAX_INSTS=10000 TRACE_FORMAT=champsim bash util/xs_scripts/trace/run_trace_champsim.sh /nfs/home/share/glr/champsim_traces/cvp1_public/compute_fp_1.gz
Update frontend docs to describe the control-PC tail-halfword view, FetchTarget ownerStartPC contract, fetch-side owner handoff boundaries, and the MicroTAGE inherited-parameter pitfall uncovered by the trace regression.

The docs now point readers to the current helper contracts in common.hh, the fetch-side consumers in fetch.cc and TraceFetch.cc, and the refactor constraint that owner predicates must remain centralized in FetchTarget helpers.

Change-Id: Ibe734b81a878fbb6dccba2370c7615a031eb16bc
@Lingrui98 Lingrui98 force-pushed the impl-control-pc-views branch from b92a977 to 546514e Compare April 3, 2026 14:45
@Lingrui98 Lingrui98 changed the title cpu-o3,arch-riscv,bpu: align decoupled BTB control coverage and fetch range cpu,arch-riscv,cpu-o3,bpu: align control-PC semantics, fetch coverage, and owner migration Apr 3, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/cpu/o3/trace/TraceFetch.cc (1)

950-989: ⚠️ Potential issue | 🟠 Major

Return early when a normal squash has no trace-index mapping.

If findTraceIndexForSeqNum(seqNum) misses and squash_itself is false, index stays 0 and this code computes desired_trace_index = 1, which rewinds the reader to the start of the trace instead of failing the rollback. That can desynchronize the trace stream after any unmapped squash.

Suggested fix
     if (!found) {
         if (squash_itself) {
             // If squashing the instruction itself, try one earlier
             DPRINTF(Fetch,
                     "rollbackTraceReader[sn:%lli]: No mapped trace index, "
                     "trying earlier instruction\n",
                     seqNum);
             index = findTraceIndexForSeqNum(seqNum - 1);
             if (index != 0) {
                 found = true;
                 reused_prev_mapping = true;
             } else {
                 DPRINTF(Fetch, "rollbackTraceReader[sn:%lli]: No mapped trace index (skip)\n", seqNum);
                 return false;
             }
+        } else {
+            DPRINTF(Fetch,
+                    "rollbackTraceReader[sn:%lli]: No mapped trace index (skip)\n",
+                    seqNum);
+            return false;
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/o3/trace/TraceFetch.cc` around lines 950 - 989, The code allows a
normal (non-self) squash to proceed when findTraceIndexForSeqNum(seqNum) fails
because index stays 0, causing desired_trace_index to become 1 and incorrectly
rewind to trace start; fix this in the rollback logic (the if (!found) block in
rollbackTraceReader) by adding an explicit else path for the case when
found==false and squash_itself==false that logs the miss (use DPRINTF(Fetch,...,
seqNum)) and returns false immediately, so variables like index,
reused_prev_mapping and desired_trace_index are only used when a valid mapping
was found or when you successfully reused the previous mapping.
docs/Gem5_Docs/frontend/BPU.md (1)

171-212: ⚠️ Potential issue | 🟡 Minor

Wrap the BPU::tick() pseudo-code in a code fence.

There is no opening fence before the pseudo-code, so the fence at the end opens a block instead of closing one and the following sections render as code.

Suggested fix
+```
 BPU::tick()
 |
 |--if(!receivedPred && numOverrideBubbles == 0)
 |  // 生成最终预测结果,并创建override bubbles
@@
       |--for (int i = 0; i < numComponents; i++)
          components[i]->putPCHistory(s0PC, s0History, predsOfEachStage);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/Gem5_Docs/frontend/BPU.md` around lines 171 - 212, The pseudo-code for
BPU::tick() is missing an opening code fence; add a matching opening
triple-backtick fence immediately before the "BPU::tick()" line so the existing
closing fence after the components[i]->putPCHistory(...) line properly closes
the block, ensuring the whole BPU::tick() pseudo-code (including
generateFinalPredAndCreateBubbles, processEnqueueAndBubbles,
requestNewPrediction and the for-loop) is rendered as a single fenced code
block.
🧹 Nitpick comments (2)
src/arch/riscv/decoder.hh (1)

171-174: Keep legacyNeedMoreBytes() out of the public API if it's only an internal helper.

This looks like decoder-state-machine plumbing, and exposing it on Decoder makes that behavior harder to change later. If it's only needed by decoder.cc and unit tests, consider keeping it private or file-local and testing through moreBytes() instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/arch/riscv/decoder.hh` around lines 171 - 174, The legacyNeedMoreBytes
helper is exposed in the public Decoder API but is only decoder-state-machine
plumbing; make it non-public by moving legacyNeedMoreBytes(ExtMachInst) out of
the Decoder header and into decoder.cc as a static/file-local function (or mark
it private inside the Decoder class if truly needed by tests), update any
callers in decoder.cc to use the new file-local symbol, and ensure tests
exercise the behavior via the public moreBytes() wrapper instead of calling
legacyNeedMoreBytes directly.
src/arch/riscv/decoder.test.cc (1)

48-62: Exercise Decoder::reset() here, not just PartialInstBuffer::reset().

This only proves the helper clears itself. The flush-sensitive behavior lives on Decoder::reset() / hasPartialInst() in src/arch/riscv/decoder.cc and src/arch/riscv/decoder.hh; if that wiring breaks, this test still stays green.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/arch/riscv/decoder.test.cc` around lines 48 - 62, The test currently
exercises only PartialInstBuffer::reset(), so change it to exercise the
Decoder-level API: create a Decoder (or use the existing decoder helper) and
drive it to a partial state (by calling pushChunk or equivalent on its
PartialInstBuffer via the Decoder), then call Decoder::reset() and assert using
Decoder::hasPartialInst() (and the same expectations for instPC,
fullBits()/assembledBytes via the Decoder accessors or by inspecting its
internal PartialInstBuffer through the Decoder) to ensure the flush-sensitive
behavior is tested rather than only PartialInstBuffer::reset().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpu/pred/BranchPredictor.py`:
- Line 1101: Production default for MicroTAGE latency was changed to zero by
setting numDelay = 0 in BranchPredictor.py, but the test-only constructor in
microtage.cc still calls setNumDelay(1), causing unit tests to exercise a
different stage contract; update the test-only constructor in microtage.cc to
call setNumDelay(0) (or remove the hard-coded setNumDelay call so it inherits
the default) so that MicroTAGE::getDelay(), MicroTAGE::putPCHistory(), and the
stagePreds behavior are identical between tests and production.

---

Outside diff comments:
In `@docs/Gem5_Docs/frontend/BPU.md`:
- Around line 171-212: The pseudo-code for BPU::tick() is missing an opening
code fence; add a matching opening triple-backtick fence immediately before the
"BPU::tick()" line so the existing closing fence after the
components[i]->putPCHistory(...) line properly closes the block, ensuring the
whole BPU::tick() pseudo-code (including generateFinalPredAndCreateBubbles,
processEnqueueAndBubbles, requestNewPrediction and the for-loop) is rendered as
a single fenced code block.

In `@src/cpu/o3/trace/TraceFetch.cc`:
- Around line 950-989: The code allows a normal (non-self) squash to proceed
when findTraceIndexForSeqNum(seqNum) fails because index stays 0, causing
desired_trace_index to become 1 and incorrectly rewind to trace start; fix this
in the rollback logic (the if (!found) block in rollbackTraceReader) by adding
an explicit else path for the case when found==false and squash_itself==false
that logs the miss (use DPRINTF(Fetch,..., seqNum)) and returns false
immediately, so variables like index, reused_prev_mapping and
desired_trace_index are only used when a valid mapping was found or when you
successfully reused the previous mapping.

---

Nitpick comments:
In `@src/arch/riscv/decoder.hh`:
- Around line 171-174: The legacyNeedMoreBytes helper is exposed in the public
Decoder API but is only decoder-state-machine plumbing; make it non-public by
moving legacyNeedMoreBytes(ExtMachInst) out of the Decoder header and into
decoder.cc as a static/file-local function (or mark it private inside the
Decoder class if truly needed by tests), update any callers in decoder.cc to use
the new file-local symbol, and ensure tests exercise the behavior via the public
moreBytes() wrapper instead of calling legacyNeedMoreBytes directly.

In `@src/arch/riscv/decoder.test.cc`:
- Around line 48-62: The test currently exercises only
PartialInstBuffer::reset(), so change it to exercise the Decoder-level API:
create a Decoder (or use the existing decoder helper) and drive it to a partial
state (by calling pushChunk or equivalent on its PartialInstBuffer via the
Decoder), then call Decoder::reset() and assert using Decoder::hasPartialInst()
(and the same expectations for instPC, fullBits()/assembledBytes via the Decoder
accessors or by inspecting its internal PartialInstBuffer through the Decoder)
to ensure the flush-sensitive behavior is tested rather than only
PartialInstBuffer::reset().
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 506b5a2f-8ae9-44e3-a0cb-04a608556570

📥 Commits

Reviewing files that changed from the base of the PR and between b92a977 and 546514e.

📒 Files selected for processing (31)
  • docs/Gem5_Docs/frontend/BPU.md
  • docs/Gem5_Docs/frontend/fetch-refactor.md
  • docs/Gem5_Docs/frontend/pc_data_flow.md
  • docs/Gem5_Docs/frontend/phr_design.md
  • src/arch/generic/decoder.hh
  • src/arch/riscv/SConscript
  • src/arch/riscv/decoder.cc
  • src/arch/riscv/decoder.hh
  • src/arch/riscv/decoder.test.cc
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/o3/trace/TraceFetch.cc
  • src/cpu/pred/BranchPredictor.py
  • src/cpu/pred/btb/abtb.cc
  • src/cpu/pred/btb/btb_ittage.cc
  • src/cpu/pred/btb/btb_mgsc.cc
  • src/cpu/pred/btb/btb_tage.cc
  • src/cpu/pred/btb/btb_ubtb.cc
  • src/cpu/pred/btb/common.hh
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
  • src/cpu/pred/btb/decoupled_bpred_stats.cc
  • src/cpu/pred/btb/mbtb.cc
  • src/cpu/pred/btb/microtage.cc
  • src/cpu/pred/btb/ras.cc
  • src/cpu/pred/btb/test/SConscript
  • src/cpu/pred/btb/test/btb.test.cc
  • src/cpu/pred/btb/test/fetch_coverage.test.cc
  • src/cpu/pred/btb/test/ras_test.cc
  • src/cpu/pred/btb/test/uras_test.cc
  • src/cpu/pred/btb/uras.cc
✅ Files skipped from review due to trivial changes (3)
  • docs/Gem5_Docs/frontend/fetch-refactor.md
  • docs/Gem5_Docs/frontend/phr_design.md
  • src/cpu/pred/btb/test/SConscript
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/arch/riscv/SConscript
  • src/arch/generic/decoder.hh
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/test/btb.test.cc
  • src/arch/riscv/decoder.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
  • src/cpu/o3/fetch.cc

numBanks = Param.Unsigned(4,"Number of banks for bank conflict simulation")
enableBankConflict = Param.Bool(False,"Enable bank conflict simulation")
numDelay = Param.Unsigned(0,"Prediction latency in cycles")
numDelay = 0

⚠️ Potential issue | 🟡 Minor

Keep the UNIT_TEST latency in sync with the new default.

Line 1101 makes production MicroTAGE zero-cycle, but src/cpu/pred/btb/microtage.cc Line 51 still hard-codes setNumDelay(1) in the test-only constructor. Since MicroTAGE::putPCHistory() starts filling stagePreds from getDelay(), direct unit tests now exercise a different stage contract than production.

Suggested follow-up
diff --git a/src/cpu/pred/btb/microtage.cc b/src/cpu/pred/btb/microtage.cc
@@
-    setNumDelay(1);
+    setNumDelay(0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpu/pred/BranchPredictor.py` at line 1101, Production default for
MicroTAGE latency was changed to zero by setting numDelay = 0 in
BranchPredictor.py, but the test-only constructor in microtage.cc still calls
setNumDelay(1), causing unit tests to exercise a different stage contract;
update the test-only constructor in microtage.cc to call setNumDelay(0) (or
remove the hard-coded setNumDelay call so it inherits the default) so that
MicroTAGE::getDelay(), MicroTAGE::putPCHistory(), and the stagePreds behavior
are identical between tests and production.

Change-Id: Ibd82e8e60766d3969ab1c5999cf6862c22b42b61
@jensen-yan jensen-yan requested review from rich-cake April 10, 2026 09:41
5 participants