cpu,arch-riscv,cpu-o3,bpu: align control-PC semantics, fetch coverage, and owner migration#805
Lingrui98 wants to merge 12 commits into OpenXiangShan:xs-dev
Conversation
No actionable comments were generated in the recent review. 🎉

🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Adds partial-instruction assembly and delivery: the generic decoder API gains partial-instruction introspection and a new `moreBytes` overload; the RISC-V decoder implements chunked assembly plus tests; fetch/BTB/front-end logic is updated to compute and supply coverage-window request spans and to handle partial-buffer ownership.
Sequence Diagram

```mermaid
sequenceDiagram
    participant BTB as BTB/FTQ
    participant Fetch as Fetch Unit
    participant Cache as Memory Cache
    participant Decoder as RISC-V Decoder
    BTB->>Fetch: provide FetchTarget(startPC, predEndPC/coverage)
    Fetch->>Fetch: compute requestSpan from coverage
    Fetch->>Cache: request bytes [startPC, startPC+requestSpan)
    Cache->>Fetch: return packet(s) (may be partial)
    Fetch->>Fetch: assemble buffer, set validSize
    loop assemble instruction
        Fetch->>Decoder: hasPartialInst()?
        Decoder-->>Fetch: (true/false)
        Fetch->>Decoder: moreBytes(pc, fetchPC, validBytes)
        Decoder->>Decoder: PartialInstBuffer.pushChunk(...)
        alt ReadyCompressed / ReadyFullWidth
            Decoder-->>Fetch: instDone (ready)
        else NeedMoreBytes
            Decoder-->>Fetch: outOfBytes (need more)
            Fetch->>BTB: consider following target / request more bytes
        end
    end
    Fetch->>Decoder: decode(PCState) -> produce instruction
    Decoder-->>Fetch: decoded instruction
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (warning)
Pull request overview
This PR aligns the decoupled-BTB “control-PC view” and the O3 fetch request range with the predicted coverage/span, specifically improving correctness for RISC-V 4B (RVI) control instructions that straddle fetch/block boundaries. It also introduces a split-decode path so the RISC-V decoder can accumulate partial bytes across adjacent fetch targets.
Changes:
- Add derived PC/range semantics to BTB `BranchInfo` (startPC/triggerPC/endPCExclusive) and extend stream coverage for cross-boundary 4B control instructions.
- Make O3 fetch request size follow the predicted coverage span (instead of a fixed window) and enable partial-byte delivery into the RISC-V decoder for split instructions.
- Add focused unit tests for fetch coverage span and RISC-V partial instruction assembly.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/cpu/pred/btb/test/SConscript | Registers the new fetch_coverage.test unit test. |
| src/cpu/pred/btb/test/fetch_coverage.test.cc | Adds tests for fetchCoverageSpan() behavior. |
| src/cpu/pred/btb/test/btb.test.cc | Adds tests validating derived PC views and trigger coverage behavior. |
| src/cpu/pred/btb/decoupled_bpred.hh | Exposes “following FTQ target” accessors for split-fetch scenarios. |
| src/cpu/pred/btb/decoupled_bpred.cc | Extends predEndPC using control-instruction coverage end when taken. |
| src/cpu/pred/btb/common.hh | Introduces coverage-span helper and derived views on BranchInfo. |
| src/cpu/o3/fetch.hh | Adds request-span plumbing and tracks valid byte count in fetch buffer. |
| src/cpu/o3/fetch.cc | Implements coverage-sized I-cache requests and split-byte delivery to the decoder. |
| src/arch/riscv/SConscript | Adds and registers riscv_decoder.test. |
| src/arch/riscv/decoder.test.cc | Adds unit tests for partial instruction buffering/assembly. |
| src/arch/riscv/decoder.hh | Adds PartialInstBuffer and overrides partial-byte moreBytes(). |
| src/arch/riscv/decoder.cc | Implements partial-byte delivery and resets partial state appropriately. |
| src/arch/generic/decoder.hh | Adds a default moreBytes(pc, fetchPC, validBytes) overload and hasPartialInst(). |
src/cpu/pred/btb/common.hh
```cpp
fetchCoverageSpan(Addr startPC, Addr predEndPC, unsigned capacity)
{
    assert(predEndPC >= startPC);
    return std::min<unsigned>(capacity, predEndPC - startPC);
```
fetchCoverageSpan() uses std::min<unsigned>(capacity, predEndPC - startPC), which narrows a 64-bit Addr difference to unsigned before taking the min. If predEndPC-startPC exceeds UINT_MAX (or wraps on narrowing), this can yield an incorrect (even zero) span and trip requestSpan > 0 asserts downstream. Consider performing the min in Addr/size_t and only cast to unsigned after clamping to capacity.
Suggested change:

```diff
-    return std::min<unsigned>(capacity, predEndPC - startPC);
+    Addr span = predEndPC - startPC;
+    Addr clamped_span = std::min<Addr>(span, static_cast<Addr>(capacity));
+    return static_cast<unsigned>(clamped_span);
```
src/cpu/pred/btb/common.hh
```diff
-Addr getEnd() { return endPCExclusive(); }
+Addr getEnd() const { return endPCExclusive(); }
 BranchInfo()
     : pc(0), target(0), resolved(false), isCond(false), isIndirect(false), isCall(false), isReturn(false), size(0)
```
BranchInfo() default constructor doesn’t initialize the isDirect field. This leaves it with an indeterminate value and can cause undefined behavior when isDirect is later read. Please add isDirect(false) to the initializer list (or otherwise ensure it is always initialized).
Suggested change:

```diff
-    : pc(0), target(0), resolved(false), isCond(false), isIndirect(false), isCall(false), isReturn(false), size(0)
+    : pc(0), target(0), resolved(false),
+      isCond(false), isIndirect(false), isDirect(false),
+      isCall(false), isReturn(false), size(0)
```
```cpp
delete secondPkt;
for (auto *packet : threads[tid].cacheReq.packets) {
    delete packet;
}
```
After merging, the code deletes the packets but leaves threads[tid].cacheReq.packets populated with dangling pointers (and keeps request/status metadata). This can trip drainSanityCheck() (which asserts cacheReq.packets.empty()) and risks accidental use-after-free if any later debug/logic touches the vector. Clear/reset the cacheReq state after cleanup (e.g., null/clear the packet vector and reset counters/status).
Suggested change:

```diff
     }
 }
+// Remove dangling pointers and reset the packet container
+threads[tid].cacheReq.packets.clear();
```
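The delete-then-clear pattern the comment asks for, as a self-contained sketch (the `Packet` type and helper name are stand-ins, not gem5's):

```cpp
#include <cassert>
#include <vector>

struct Packet {};  // stand-in for gem5's Packet

// Destroy every owned packet and empty the container in one step, so no
// dangling pointers survive for later sanity checks or debug code.
void releasePackets(std::vector<Packet *> &packets)
{
    for (auto *packet : packets) {
        delete packet;
    }
    packets.clear();
}
```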
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/arch/riscv/decoder.cc (1)
**56-72: Legacy path behavior change is backward-compatible but worth documenting.**

The 2-arg `moreBytes` now explicitly resets `partialInst` and sets `instDone = true`. This is correct for CPUs (Minor, Simple) that always deliver complete MachInst-sized payloads in one call. However, if any code path accidentally mixes the 2-arg and 3-arg versions on the same decoder instance, partial assembly state would be silently lost. Consider adding an assertion or warning when `partialInst` has bytes.

🛡️ Optional defensive check:

```diff
 void
 Decoder::moreBytes(const PCStateBase &pc, Addr fetchPC)
 {
     // Legacy full-width delivery path, retained for existing callers that
     // still provide a complete MachInst-sized payload in one shot.
+    if (partialInst.hasBytes()) {
+        warn_once("moreBytes(2-arg) called with partial state; "
+                  "discarding %u assembled bytes\n",
+                  partialInst.assembledBytes);
+    }
     auto inst = letoh(machInst);
```
src/arch/riscv/decoder.hh (1)
**67-76: Manual popcount could be replaced with a builtin.**

The `countValidBytes` function is a manual bit-counting implementation. For a 4-bit mask this is fine, but consider using the compiler builtin for clarity.

♻️ Optional simplification:

```diff
 static unsigned
 countValidBytes(uint8_t mask)
 {
-    unsigned count = 0;
-    while (mask != 0) {
-        count += mask & 0x1;
-        mask >>= 1;
-    }
-    return count;
+    return __builtin_popcount(mask);
 }
```
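A quick cross-check that the builtin agrees with the manual loop over the full 4-bit mask range (standalone sketch; `__builtin_popcount` is a GCC/Clang intrinsic):

```cpp
#include <cassert>
#include <cstdint>

// Manual bit count, as in the reviewed code.
unsigned countValidBytesManual(std::uint8_t mask)
{
    unsigned count = 0;
    while (mask != 0) {
        count += mask & 0x1;
        mask >>= 1;
    }
    return count;
}

// Builtin replacement suggested above.
unsigned countValidBytesBuiltin(std::uint8_t mask)
{
    return __builtin_popcount(mask);
}
```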
src/cpu/o3/fetch.cc (1)
**2114-2133: Split instruction fetch coordination works as designed but lacks defensive safeguards.**

The logic correctly coordinates split 32-bit instruction handling by waiting for the following FSQ entry before proceeding. However, there is no timeout or fallback mechanism if the following entry fails to arrive: the code will stall indefinitely at lines 2114-2120.

While the predictor is responsible for guaranteeing that following FSQ entries are generated when a decoder has a partial instruction (indicated by extending `predEndPC`), there are no defensive safeguards if this guarantee is violated. The fallback target selection logic at lines 2124-2125 is unreachable when `hasPartialInst()` is true, leaving no escape path if the following entry is delayed or never arrives.

Consider adding:

- A timeout or maximum stall counter to detect stuck predictor states
- A panic/assertion if the following entry doesn't arrive within a reasonable time
- A comment explaining the predictor's responsibility and why this blocking behavior is safe
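The suggested stall counter could look roughly like this (a hypothetical helper; the type and method names are illustrative, not existing gem5 APIs):

```cpp
#include <cassert>

// Per-thread watchdog: bounds how many cycles fetch may wait for the
// following FSQ entry before escalating instead of stalling forever.
struct SplitFetchWatchdog
{
    unsigned stallCycles = 0;
    unsigned maxStallCycles;

    explicit SplitFetchWatchdog(unsigned limit) : maxStallCycles(limit) {}

    // Call once per cycle spent waiting; returns true when the caller
    // should panic or fall back to a safe fetch target.
    bool waitedTooLong()
    {
        return ++stallCycles > maxStallCycles;
    }

    // Call when the following entry finally arrives.
    void followingEntryArrived() { stallCycles = 0; }
};
```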
The actionable comment, in `src/arch/generic/decoder.hh` (around lines 148-160): the generic three-arg virtual `moreBytes(const PCStateBase &pc, Addr fetchPC, size_t validBytes)` must not assert that `validBytes == moreBytesSize()`, because O3 fetch sometimes supplies fewer bytes. Instead, treat a mismatched `validBytes` as a partial-delivery case by delegating to the two-arg `moreBytes(pc, fetchPC)`, so non-RISC-V decoders that don't override the 3-arg method won't crash when O3 calls `moreBytes` with fewer bytes.
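A minimal sketch of the suggested non-asserting default overload (gem5 types are replaced with simplified stand-ins; only the delegation pattern is the point):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using Addr = std::uint64_t;
struct PCStateBase { Addr pc = 0; };  // simplified stand-in

struct InstDecoder
{
    virtual ~InstDecoder() = default;
    virtual std::size_t moreBytesSize() const { return 4; }
    virtual void moreBytes(const PCStateBase &pc, Addr fetchPC) = 0;

    // Default 3-arg overload: no assert on validBytes. Decoders that never
    // override this simply fall back to the legacy full-width path.
    virtual void
    moreBytes(const PCStateBase &pc, Addr fetchPC, std::size_t validBytes)
    {
        (void)validBytes;  // ignored without partial-inst support
        moreBytes(pc, fetchPC);
    }

    virtual bool hasPartialInst() const { return false; }
};

// Example decoder that only implements the legacy 2-arg path.
struct CountingDecoder : InstDecoder
{
    int legacyCalls = 0;
    using InstDecoder::moreBytes;  // keep the 3-arg overload visible
    void moreBytes(const PCStateBase &, Addr) override { ++legacyCalls; }
};
```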
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ad59d4af-6b6e-4cd8-88e1-fbe9780ecba9
📒 Files selected for processing (13)
- src/arch/generic/decoder.hh
- src/arch/riscv/SConscript
- src/arch/riscv/decoder.cc
- src/arch/riscv/decoder.hh
- src/arch/riscv/decoder.test.cc
- src/cpu/o3/fetch.cc
- src/cpu/o3/fetch.hh
- src/cpu/pred/btb/common.hh
- src/cpu/pred/btb/decoupled_bpred.cc
- src/cpu/pred/btb/decoupled_bpred.hh
- src/cpu/pred/btb/test/SConscript
- src/cpu/pred/btb/test/btb.test.cc
- src/cpu/pred/btb/test/fetch_coverage.test.cc
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cpu/o3/fetch.cc (1)
**500-574: ⚠️ Potential issue | 🟠 Major: This still assumes every coverage window fits in two cache lines.**

`requestSpan` is now derived from prediction coverage, but this implementation still emits at most two requests. After the first line, the second request is sized as the entire remainder, so a smaller `cacheBlkSize` or a larger `fetchBufferSize` can produce a request that crosses yet another line. Either split in a loop or guard the two-line assumption explicitly.

Minimal guard if you want to keep the current two-line implementation:
```diff
 Addr fetchPC = vaddr;
-unsigned fetchSize = cacheBlkSize - fetchPC % cacheBlkSize; // Size for first cache line
+const unsigned firstLineBytes =
+    cacheBlkSize - fetchPC % cacheBlkSize;
+panic_if(requestSpan > firstLineBytes + cacheBlkSize,
+         "coverage span %u exceeds two-line fetch support "
+         "(pc=%#x, line=%u)",
+         requestSpan, vaddr, cacheBlkSize);
+unsigned fetchSize = firstLineBytes; // Size for first cache line
 fetchSize = std::min(fetchSize, requestSpan);
```
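The preferred loop-split can be illustrated with just the span arithmetic (a standalone helper, not the actual fetch code, which would also have to build a request per chunk):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using Addr = std::uint64_t;

// Split [vaddr, vaddr + requestSpan) into per-cache-line chunk sizes, so a
// coverage window of any length is handled, not just two lines.
std::vector<unsigned>
splitFetchSpans(Addr vaddr, unsigned requestSpan, unsigned cacheBlkSize)
{
    std::vector<unsigned> spans;
    Addr pc = vaddr;
    unsigned remaining = requestSpan;
    while (remaining > 0) {
        unsigned lineBytes =
            static_cast<unsigned>(cacheBlkSize - pc % cacheBlkSize);
        unsigned chunk = std::min(lineBytes, remaining);
        spans.push_back(chunk);
        pc += chunk;
        remaining -= chunk;
    }
    return spans;
}
```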
🧹 Nitpick comments (1)
src/cpu/pred/btb/common.hh (1)
23-29: Tighten the non-empty coverage contract here.
`fetchCoverageSpan()` can return `0` when `predEndPC == startPC`, but `Fetch::fetchCacheLine()` later asserts `requestSpan > 0`. Rejecting zero-width coverage at this helper boundary keeps that invariant in one place.

Possible tweak:

```diff
-    assert(predEndPC >= startPC);
+    assert(predEndPC > startPC);
```
The actionable comment, in `src/cpu/o3/fetch.hh` (around lines 757-761): the current `CacheRequest` teardown drops stored `PacketPtr` values via a bare `packets.clear()` in `Fetch::CacheRequest::reset()`, which can leak when a split fetch half is present. Move packet destruction into the common teardown used by both completion and squash/reset: have `Fetch::CacheRequest::reset()` call `releaseStoredPackets()` (or replicate its destruction logic) instead of `packets.clear()`, and ensure `processMultiCacheLineCompletion()` no longer double-destroys packets, so the stored packets are always destroyed on reset or completion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 590dbd8d-532d-42e5-a0ea-96a376ad5595
📒 Files selected for processing (7)
- src/arch/riscv/decoder.cc
- src/arch/riscv/decoder.hh
- src/arch/riscv/decoder.test.cc
- src/cpu/o3/fetch.cc
- src/cpu/o3/fetch.hh
- src/cpu/pred/btb/common.hh
- src/cpu/pred/btb/test/fetch_coverage.test.cc
🚧 Files skipped from review as they are similar to previous changes (1)
- src/arch/riscv/decoder.cc
🚀 Performance test triggered: gcc12-spec06-0.8c
[Generated by GEM5 Performance Robot] Ideal BTB Performance: Overall Score
Keep the canonical control PC as startPC for debug and attribution. Add derived trigger/end/coverage helpers, extend predEndPC for split 4B taken branches, and switch fetch redirect matching to coverage-based logic with blocked/taken debug prints. Also add helper-level BTB cases for 2B/4B derived PC views and cross-boundary trigger coverage. Change-Id: I442b44312d6a6e9f09bb7c6621f2ca3c26107e6f
Limit fetch requests to the predicted coverage window instead of always consuming the full 66-byte buffer span. Keep 66 bytes as fetch-buffer capacity, but let the RISC-V decoder assemble split 32-bit instructions from partial-byte deliveries and request the following coverage block only when the trailing half is still missing. Also add decoder helper tests and a fetch-coverage helper test so the new coverage and split-decode behavior stays checked at helper level. Change-Id: Ifa6cfcfb99159084593d02827ffe8b051b1c0d36
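The chunked-assembly idea described above can be sketched as follows (a hypothetical simplified buffer; the real `PartialInstBuffer` in `src/arch/riscv/decoder.hh` differs in detail):

```cpp
#include <cassert>
#include <cstdint>

// Accumulate bytes of one RISC-V instruction across multiple deliveries.
// RVI (non-compressed) encodings have the two low opcode bits set to 0b11.
struct PartialBuf
{
    std::uint32_t bits = 0;
    unsigned assembledBytes = 0;

    void pushChunk(const std::uint8_t *data, unsigned nbytes)
    {
        for (unsigned i = 0; i < nbytes && assembledBytes < 4; ++i) {
            bits |= std::uint32_t{data[i]} << (8 * assembledBytes++);
        }
    }

    bool isFullWidth() const { return (bits & 0x3) == 0x3; }

    // Ready once 2 bytes of a compressed inst or 4 bytes of an RVI inst
    // have been assembled.
    bool ready() const
    {
        if (assembledBytes < 2)
            return false;
        return isFullWidth() ? assembledBytes >= 4 : true;
    }
};
```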
Restore the legacy RISC-V decoder needMoreBytes semantics for 32-bit instructions, clamp fetchCoverageSpan before narrowing to unsigned, initialize BranchInfo::isDirect in the default constructor, and drop stored packet pointers after misaligned fetch completion. Also add helper-level regression coverage for the legacy decoder path and the widened-span clamp behavior. Change-Id: Ideb1c46ccc59251bf92f1a008eec533052bdea3a
Make predictor-visible BTB branch identity use the final-halfword control PC while preserving explicit architectural start-PC metadata for stats, trace, and fetch byte coverage. This commit also adds the pending OpenSpec baseline and dependent change docs for the new startPC/controlPC split. Change-Id: Id1e9694a4789b8ddf3738ef2526b3da5748b955f
Add fetch-side ownership migration for split taken control instructions by carrying decodeStartPC in FetchTarget, consuming the previous target before buildInst when the following target claims the instruction, and matching taken redirects on the owner target's branch start PC. This removes fetch's trigger_covered dependency while preserving controlPC-only predictor keying and startPC-based stats/debug semantics. Change-Id: Iebf699048cad58c65cffdc165e349a74015a5809
Keep predictor-visible control-PC history recovery aligned with speculative updates, skip FTQ owner migration in trace mode, and fix trace rollback to resume at the correct 1-based trace index. Also update the OpenSpec proposal/tasks to record the manifest-driven FS target12 plus trace66 regression gate and final aggregation outputs. Change-Id: If408a9237db1e886140bacc065aa6d78b12a2848
Use the architectural branch start PC when deriving call fall-through addresses in RAS/uRAS, while keeping predictor-visible controlPC for predictor lookup state. Also switch trace wrong-path NOP sizing back to predBranchInfo.startPC() and add directed BranchInfo/RAS coverage for split 4B instructions. Change-Id: Id84ac2624e1bbccae9d73e05d55a8bb2d9279ba2
Tighten fetch-side FTQ owner migration so it only fires for the sequential split-control handoff case, not for ordinary taken-target entries that happen to sit behind the current fetch head. Also keep control squashes keyed by predictor-visible controlPC and add a unit test that blocks the backward-target regression. Change-Id: Ia8cb92e0ac44f84cbe64dd5b84b4d1af11d8e7da
The control-PC owner-migration validation uncovered two blockers. MicroTAGE redeclared inherited TimedBaseBTBPredictor params in Python, which generated shadow C++ param fields and left the base predictor with a zero blockSize at runtime. That in turn produced bogus branch-position warnings during trace runs. Fix this by overriding the inherited defaults without redeclaring the Param fields. Also track the OpenSpec validation assets for this change and make the trace_top66 task derive its status from the actual trace run directory instead of inline text matching. This keeps the manifest-driven FS+trace regression reproducible and prevents false task failures during final aggregation. Validation: - openspec validate update-bpu-control-pc-owner-migration --strict - scons -j128 build/RISCV/gem5.opt - FS checkpoint gate: 12/12 PASS - Trace gate: 66/66 PASS - final_verify overall PASS on control_pc_tail_halfword_20260401_recheck1 Change-Id: Ia2b3fc2f24a9873b41da88f39d0f94a77af30b46
Factor split-control ownership semantics into FetchTarget helpers so fetch-side handoff, owner-range checks, and taken matching share one contract. Rename decodeStartPC to ownerStartPC to match the actual domain concept, keep trace mode out of fetch-time owner migration, and update BTB tests to validate the shared helper behavior. Change-Id: I7fd65621b57fd1fada5b9528431a6bbedb6410d9 Validation: scons -j8 build/RISCV/gem5.opt; scons -j8 --unit-test build/RISCV/cpu/pred/btb/test/btb.test.opt build/RISCV/cpu/pred/btb/test/fetch_coverage.test.opt build/RISCV/cpu/pred/btb/test/uras.test.opt; build/RISCV/cpu/pred/btb/test/btb.test.opt; build/RISCV/cpu/pred/btb/test/fetch_coverage.test.opt; build/RISCV/cpu/pred/btb/test/uras.test.opt; XS_MAX_INSTS=10000 TRACE_FORMAT=champsim bash util/xs_scripts/trace/run_trace_champsim.sh /nfs/home/share/glr/champsim_traces/cvp1_public/compute_fp_1.gz
Update frontend docs to describe the control-PC tail-halfword view, FetchTarget ownerStartPC contract, fetch-side owner handoff boundaries, and the MicroTAGE inherited-parameter pitfall uncovered by the trace regression. The docs now point readers to the current helper contracts in common.hh, the fetch-side consumers in fetch.cc and TraceFetch.cc, and the refactor constraint that owner predicates must remain centralized in FetchTarget helpers. Change-Id: Ibe734b81a878fbb6dccba2370c7615a031eb16bc
Force-pushed b92a977 to 546514e
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cpu/o3/trace/TraceFetch.cc (1)
**950-989: ⚠️ Potential issue | 🟠 Major: Return early when a normal squash has no trace-index mapping.**

If `findTraceIndexForSeqNum(seqNum)` misses and `squash_itself` is false, `index` stays 0 and this code computes `desired_trace_index = 1`, which rewinds the reader to the start of the trace instead of failing the rollback. That can desynchronize the trace stream after any unmapped squash.

Suggested fix:

```diff
 if (!found) {
     if (squash_itself) {
         // If squashing the instruction itself, try one earlier
         DPRINTF(Fetch,
                 "rollbackTraceReader[sn:%lli]: No mapped trace index, "
                 "trying earlier instruction\n", seqNum);
         index = findTraceIndexForSeqNum(seqNum - 1);
         if (index != 0) {
             found = true;
             reused_prev_mapping = true;
         } else {
             DPRINTF(Fetch,
                     "rollbackTraceReader[sn:%lli]: No mapped trace index (skip)\n",
                     seqNum);
             return false;
         }
+    } else {
+        DPRINTF(Fetch,
+                "rollbackTraceReader[sn:%lli]: No mapped trace index (skip)\n",
+                seqNum);
+        return false;
     }
 }
```
docs/Gem5_Docs/frontend/BPU.md (1)
**171-212: ⚠️ Potential issue | 🟡 Minor: Wrap the `BPU::tick()` pseudo-code in a code fence.**

There is no opening fence before the pseudo-code, so the fence at the end opens a block instead of closing one, and the following sections render as code. Add a matching opening triple-backtick fence immediately before the `BPU::tick()` line so the existing closing fence after the `components[i]->putPCHistory(...)` line properly closes the block.

Suggested fix:

````diff
+```
 BPU::tick()
 |
 |--if(!receivedPred && numOverrideBubbles == 0)
 |      // generate the final prediction and create override bubbles
 @@
 |--for (int i = 0; i < numComponents; i++)
        components[i]->putPCHistory(s0PC, s0History, predsOfEachStage);
````

🧹 Nitpick comments (2)
src/arch/riscv/decoder.hh (1)
**171-174: Keep `legacyNeedMoreBytes()` out of the public API if it's only an internal helper.**

This looks like decoder-state-machine plumbing, and exposing it on `Decoder` makes that behavior harder to change later. If it's only needed by `decoder.cc` and unit tests, consider keeping it private or file-local and testing through `moreBytes()` instead.

src/arch/riscv/decoder.test.cc (1)
**48-62: Exercise `Decoder::reset()` here, not just `PartialInstBuffer::reset()`.**

This only proves the helper clears itself. The flush-sensitive behavior lives on `Decoder::reset()` / `hasPartialInst()` in `src/arch/riscv/decoder.cc` and `src/arch/riscv/decoder.hh`; if that wiring breaks, this test still stays green.
Verify each finding against the current code and only fix it if needed. Inline comments: In `@src/cpu/pred/BranchPredictor.py`: - Line 1101: Production default for MicroTAGE latency was changed to zero by setting numDelay = 0 in BranchPredictor.py, but the test-only constructor in microtage.cc still calls setNumDelay(1), causing unit tests to exercise a different stage contract; update the test-only constructor in microtage.cc to call setNumDelay(0) (or remove the hard-coded setNumDelay call so it inherits the default) so that MicroTAGE::getDelay(), MicroTAGE::putPCHistory(), and the stagePreds behavior are identical between tests and production. --- Outside diff comments: In `@docs/Gem5_Docs/frontend/BPU.md`: - Around line 171-212: The pseudo-code for BPU::tick() is missing an opening code fence; add a matching opening triple-backtick fence immediately before the "BPU::tick()" line so the existing closing fence after the components[i]->putPCHistory(...) line properly closes the block, ensuring the whole BPU::tick() pseudo-code (including generateFinalPredAndCreateBubbles, processEnqueueAndBubbles, requestNewPrediction and the for-loop) is rendered as a single fenced code block. In `@src/cpu/o3/trace/TraceFetch.cc`: - Around line 950-989: The code allows a normal (non-self) squash to proceed when findTraceIndexForSeqNum(seqNum) fails because index stays 0, causing desired_trace_index to become 1 and incorrectly rewind to trace start; fix this in the rollback logic (the if (!found) block in rollbackTraceReader) by adding an explicit else path for the case when found==false and squash_itself==false that logs the miss (use DPRINTF(Fetch,..., seqNum)) and returns false immediately, so variables like index, reused_prev_mapping and desired_trace_index are only used when a valid mapping was found or when you successfully reused the previous mapping. 
---

Nitpick comments:

In `@src/arch/riscv/decoder.hh`:
- Around lines 171-174: The legacyNeedMoreBytes helper is exposed in the public Decoder API but is only decoder-state-machine plumbing; make it non-public by moving legacyNeedMoreBytes(ExtMachInst) out of the Decoder header and into decoder.cc as a static/file-local function (or mark it private inside the Decoder class if truly needed by tests), update any callers in decoder.cc to use the new file-local symbol, and ensure tests exercise the behavior via the public moreBytes() wrapper instead of calling legacyNeedMoreBytes directly.

In `@src/arch/riscv/decoder.test.cc`:
- Around lines 48-62: The test currently exercises only PartialInstBuffer::reset(); change it to exercise the Decoder-level API: create a Decoder (or use the existing decoder helper) and drive it to a partial state (by calling pushChunk or equivalent on its PartialInstBuffer via the Decoder), then call Decoder::reset() and assert using Decoder::hasPartialInst() (and the same expectations for instPC and fullBits()/assembledBytes via the Decoder accessors, or by inspecting its internal PartialInstBuffer through the Decoder) to ensure the flush-sensitive behavior is tested rather than only PartialInstBuffer::reset().
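The flush-sensitive state the test should cover can be modeled in a few lines. This is an illustrative stand-in, not the gem5 `PartialInstBuffer`: RISC-V marks 4-byte instructions with `0b11` in the low two bits of the first halfword, so a single pushed halfword with those bits set represents a partial instruction that `reset()` must discard:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the decoder's partial-instruction state (not the real
// gem5 class). RISC-V encodes instruction length in the low two bits
// of the first halfword: 0b11 means a 4-byte instruction, anything
// else is a 2-byte compressed one.
struct PartialInstBuffer
{
    uint32_t bits = 0;           // halfwords assembled so far
    unsigned assembledBytes = 0;
    uint64_t instPC = 0;         // PC of the first pushed halfword

    void
    pushChunk(uint64_t pc, uint16_t halfword)
    {
        if (assembledBytes == 0)
            instPC = pc;
        bits |= uint32_t(halfword) << (8 * assembledBytes);
        assembledBytes += 2;
    }

    // True when we hold only the head halfword of a 4-byte inst.
    bool
    hasPartialInst() const
    {
        return assembledBytes == 2 && (bits & 0x3) == 0x3;
    }

    void
    reset()
    {
        bits = 0;
        assembledBytes = 0;
        instPC = 0;
    }
};
```

A decoder-level test would drive this state through the Decoder wrapper and assert that a flush clears it, rather than calling `reset()` on the buffer directly.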
📒 Files selected for processing (31)

- docs/Gem5_Docs/frontend/BPU.md
- docs/Gem5_Docs/frontend/fetch-refactor.md
- docs/Gem5_Docs/frontend/pc_data_flow.md
- docs/Gem5_Docs/frontend/phr_design.md
- src/arch/generic/decoder.hh
- src/arch/riscv/SConscript
- src/arch/riscv/decoder.cc
- src/arch/riscv/decoder.hh
- src/arch/riscv/decoder.test.cc
- src/cpu/o3/fetch.cc
- src/cpu/o3/fetch.hh
- src/cpu/o3/trace/TraceFetch.cc
- src/cpu/pred/BranchPredictor.py
- src/cpu/pred/btb/abtb.cc
- src/cpu/pred/btb/btb_ittage.cc
- src/cpu/pred/btb/btb_mgsc.cc
- src/cpu/pred/btb/btb_tage.cc
- src/cpu/pred/btb/btb_ubtb.cc
- src/cpu/pred/btb/common.hh
- src/cpu/pred/btb/decoupled_bpred.cc
- src/cpu/pred/btb/decoupled_bpred.hh
- src/cpu/pred/btb/decoupled_bpred_stats.cc
- src/cpu/pred/btb/mbtb.cc
- src/cpu/pred/btb/microtage.cc
- src/cpu/pred/btb/ras.cc
- src/cpu/pred/btb/test/SConscript
- src/cpu/pred/btb/test/btb.test.cc
- src/cpu/pred/btb/test/fetch_coverage.test.cc
- src/cpu/pred/btb/test/ras_test.cc
- src/cpu/pred/btb/test/uras_test.cc
- src/cpu/pred/btb/uras.cc

✅ Files skipped from review due to trivial changes (3)
- docs/Gem5_Docs/frontend/fetch-refactor.md
- docs/Gem5_Docs/frontend/phr_design.md
- src/cpu/pred/btb/test/SConscript
🚧 Files skipped from review as they are similar to previous changes (7)
- src/arch/riscv/SConscript
- src/arch/generic/decoder.hh
- src/cpu/pred/btb/decoupled_bpred.cc
- src/cpu/pred/btb/test/btb.test.cc
- src/arch/riscv/decoder.cc
- src/cpu/pred/btb/decoupled_bpred.hh
- src/cpu/o3/fetch.cc
    numBanks = Param.Unsigned(4,"Number of banks for bank conflict simulation")
    enableBankConflict = Param.Bool(False,"Enable bank conflict simulation")
    numDelay = Param.Unsigned(0,"Prediction latency in cycles")
    numDelay = 0
Keep the UNIT_TEST latency in sync with the new default.
Line 1101 makes production MicroTAGE zero-cycle, but src/cpu/pred/btb/microtage.cc Line 51 still hard-codes setNumDelay(1) in the test-only constructor. Since MicroTAGE::putPCHistory() starts filling stagePreds from getDelay(), direct unit tests now exercise a different stage contract than production.
Suggested follow-up:

    diff --git a/src/cpu/pred/btb/microtage.cc b/src/cpu/pred/btb/microtage.cc
    @@
    -    setNumDelay(1);
    +    setNumDelay(0);
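Why the latency mismatch changes observable behavior can be seen in a minimal model of the stage contract. Assuming, as the comment states, that a component fills `stagePreds` starting at its delay (the names below are illustrative, not the gem5 signatures), a delay of 1 leaves stage 0 untouched while a delay of 0 populates it:

```cpp
#include <cassert>
#include <vector>

// Illustrative stage contract (not the gem5 signatures): a predictor
// component with latency `delay` publishes its result into
// stagePreds[delay..], so tests built with delay 1 and production
// built with delay 0 disagree about what stage 0 contains.
struct StagePred { bool taken = false; };

void
putPCHistory(unsigned delay, std::vector<StagePred> &stagePreds, bool predTaken)
{
    // Stages earlier than the component's delay never see its result.
    for (unsigned s = delay; s < stagePreds.size(); ++s)
        stagePreds[s].taken = predTaken;
}
```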
Change-Id: Ibd82e8e60766d3969ab1c5999cf6862c22b42b61
Background
In the decoupled BTB frontend, split 4-byte RVI control instructions originally used the first halfword as the predictor-visible control-PC view, while fetch still tended to over-fetch a fixed larger window instead of following the predicted coverage span.
After switching the predictor-visible control identity to the tail halfword, the frontend also needs an explicit ownership model for split control instructions so that fetch, trace, RAS/uRAS, and stats/trace attribution can consistently use the right PC view.
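The tail-halfword convention can be written as a small pure function (a sketch of the stated rule, not the gem5 code). For a split RVI instruction starting at the last halfword of a fetch block, the predictor-visible control PC lands in the following block:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the stated convention: the predictor-visible control PC of
// a 4-byte RVI control instruction is the address of its tail halfword;
// a 2-byte compressed instruction is its own tail.
uint64_t
controlPC(uint64_t instPC, unsigned instBytes)
{
    return instBytes == 4 ? instPC + 2 : instPC;
}
```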
What This PR Changes
This PR keeps the whole control-PC / owner-migration stack together and includes the following pieces:
- Align decoupled BTB control-PC coverage
- Align fetch range with predicted coverage
- Switch predictor-visible control identity to tail halfword
  - `startPC` stays the architectural start-PC view; `controlPC` carries the predictor-visible identity when the split-control case requires it.
- Migrate split-control ownership to the following fetch target
  - The ownership state lives in `FetchTarget`.
- Fix follow-up regressions and simplify the final contract
  - Adjust defaults in `BranchPredictor.py`, rename `decodeStartPC` to `ownerStartPC`, and centralize owner predicates in `FetchTarget` helpers so fetch/tests stop duplicating the protocol.
- Update frontend documentation
  - Document the `startPC` / `controlPC` / `ownerStartPC` split, the fetch-side owner handoff contract, and the MicroTAGE parameter pitfall in `docs/Gem5_Docs/frontend/`.
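The fetch-range alignment can be sketched as follows. `predEndPC` and `maxFetchBytes` are assumed names for the predicted coverage end and the fixed window fetch previously always requested; when no usable coverage is available, the sketch falls back to the full window:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch with assumed names: request only the predicted coverage span
// [startPC, predEndPC) instead of always fetching maxFetchBytes, and
// fall back to the full window when the prediction gives no coverage.
uint64_t
requestSpan(uint64_t startPC, uint64_t predEndPC, uint64_t maxFetchBytes)
{
    if (predEndPC <= startPC)
        return maxFetchBytes;
    return std::min(predEndPC - startPC, maxFetchBytes);
}
```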
Validation
scons -j8 build/RISCV/gem5.optscons -j8 --unit-test build/RISCV/cpu/pred/btb/test/btb.test.opt build/RISCV/cpu/pred/btb/test/fetch_coverage.test.opt build/RISCV/cpu/pred/btb/test/uras.test.optbuild/RISCV/cpu/pred/btb/test/btb.test.optbuild/RISCV/cpu/pred/btb/test/fetch_coverage.test.optbuild/RISCV/cpu/pred/btb/test/uras.test.optXS_MAX_INSTS=10000 TRACE_FORMAT=champsim bash util/xs_scripts/trace/run_trace_champsim.sh /nfs/home/share/glr/champsim_traces/cvp1_public/compute_fp_1.gzSummary by CodeRabbit
New Features
Refactor
Tests
Documentation