Microarch ideas: fetch prefetch & memory subsystem
Status: design notes only; not a commitment or specification.
Scope: started as instruction fetch / prefetch (stage01_fetch, icache interface); expanded here with D-side notes (store buffer, loads, non-blocking D-cache direction) as a single living scratchpad.
Living document
This file is meant to accumulate ideas over time: new bullets, references, and measured outcomes can be appended as the team experiments. Keep entries short and dated if you add lab notes (e.g. “2026-03: next-line +1% on CoreMark in RTL sim — config X”).
Current baseline
The core today combines:
- prefetcher_wrapper / next_line_prefetcher: After a demand miss completes on a cacheable line, requests one following cache line (line-aligned), serialized with the single low-level instruction port via icache’s prefetch handshake.
- icache: When ENABLE_ICACHE_PREFETCH is set, can accept a prefetch request when there is no demand miss, sharing fill logic with normal misses.
- align_buffer + fetch: Compressed ISA support, PC progression, and branch prediction interact with when sequential addresses are actually consumed.
This is cache-line-oriented prefetch, not (by default) a shallow 32-bit fetch FIFO in front of the bus.
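As a rough illustration of the next-line policy described above, the address arithmetic can be sketched in a few lines of Python. This is a behavioral sketch only, not the RTL: the function name, the 32-byte line size, and the 32-bit address width are assumptions chosen for the example.

```python
def next_line_addr(miss_addr: int, line_bytes: int = 32, addr_bits: int = 32) -> int:
    """Return the line-aligned address of the cache line after miss_addr.

    line_bytes and addr_bits are illustrative defaults, not Level-v parameters.
    """
    assert line_bytes > 0 and line_bytes & (line_bytes - 1) == 0, "power of two"
    line_base = miss_addr & ~(line_bytes - 1)   # drop the offset within the line
    wrap_mask = (1 << addr_bits) - 1            # model address wrap-around
    return (line_base + line_bytes) & wrap_mask


# A miss anywhere inside line 0x1000..0x101F arms a prefetch of line 0x1020.
print(hex(next_line_addr(0x1004)))
```

A real implementation would also check cacheability of the computed address before issuing the request.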
Reference core: lowRISC Ibex ibex_prefetch_buffer
Ibex documents this block as a prefetch buffer for a 32-bit memory interface: it caches instructions in a small FIFO and, critically, supports NUM_REQS = 2 outstanding instruction bus transactions with explicit branch discard of speculative data (fifo_clear = branch_i; comments note interaction with FENCE.I).
Ideas transferable to Level-v (conceptual)
| Ibex mechanism | Possible Level-v analogue |
|---|---|
| 2-deep outstanding IF requests | Allow more than one I-side transaction in flight if the cache / Wishbone path can honor gnt/rvalid ordering without breaking single-port assumptions—or add a thin IF queue that legally overlaps handshakes. |
| Fetch FIFO + fifo_ready back-pressure | If the critical path is “IF waits one beat per valid”, a small queue between align logic and I$ could hide fixed multi-cycle RAM latency (orthogonal to line prefetch). |
| branch_discard_* on outstanding data | Any extra outstanding fetch must track kill on flush / mispredict / fence semantics so wrong-path instructions never reach decode; mirror Ibex’s discard flags in a minimal form. |
| Word-aligned address bump (+4) | Level-v is RVC-aware; a shallow queue must respect 16/32-bit step width or attach after align resolution. |
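The discard idea in the table can be modeled behaviorally as follows. This is a minimal sketch, assuming in-order responses and a limit of two outstanding requests; the class and method names are hypothetical and do not correspond to Ibex or Level-v RTL signals.

```python
from collections import deque


class FetchQueue:
    """Toy model: limited outstanding fetches plus discard-on-flush."""

    def __init__(self, max_outstanding: int = 2):
        self.max_outstanding = max_outstanding
        self.outstanding = deque()  # in-flight requests: [addr, discard_flag]
        self.fifo = deque()         # accepted (addr, data) pairs for decode

    def request(self, addr: int) -> bool:
        """Issue a fetch if a slot is free; return whether it was accepted."""
        if len(self.outstanding) >= self.max_outstanding:
            return False
        self.outstanding.append([addr, False])
        return True

    def flush(self) -> None:
        """Branch / mispredict / fence: drop buffered data, poison in-flight."""
        self.fifo.clear()
        for entry in self.outstanding:
            entry[1] = True

    def response(self, data) -> None:
        """Responses are assumed to return in request order."""
        addr, discard = self.outstanding.popleft()
        if not discard:
            self.fifo.append((addr, data))


q = FetchQueue()
q.request(0x100)
q.request(0x104)
q.flush()             # both in-flight fetches are now wrong-path
q.response("old0")
q.response("old1")
q.request(0x200)
q.response("new")
print(list(q.fifo))   # only the post-flush fetch reaches the FIFO
```

Note that a poisoned request still occupies its slot until the response returns, which is why a flush does not immediately free request capacity in this model.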
What Ibex does not replace
Ibex’s buffer is not a replacement for next-cache-line warmup: it mainly pipelines narrow fetches and cuts combinational depth to the I-side. Level-v’s next-line policy still matters for miss-dominated sequential code when fills are slow.
Why measured IPC / benchmark gains can stay modest (expectations)
It is common to expect large gains from instruction prefetch and from load/store or store buffers, then see only a few percent (or noise) in simulation or on FPGA. Useful reasons to keep in mind:
1. The workload is often not memory-latency limited
CoreMark, Dhrystone, and many short tests already mostly hit in L1 (or in fast SRAM models). Hiding DRAM latency only helps when the pipeline would otherwise stall on miss. If the limiter is ALU chains, branches, or cache hits, extra buffering adds little.
2. Amdahl / instruction supply vs execution
Even a perfect I-side only speeds up the fetch slice of execution. If decode/execute or mispredict recovery dominates, I-side tweaks have a low ceiling.
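The ceiling can be made concrete with Amdahl's law. The sketch below is illustrative only: the function name and the example numbers are assumptions, not measurements from this core.

```python
def overall_speedup(fetch_fraction: float, fetch_speedup: float) -> float:
    """Amdahl's law: speedup of the whole run when only one slice gets faster.

    fetch_fraction: share of baseline cycles spent waiting on I-side supply.
    fetch_speedup:  factor by which those cycles shrink (float('inf') = perfect).
    """
    assert 0.0 <= fetch_fraction <= 1.0 and fetch_speedup > 0.0
    return 1.0 / ((1.0 - fetch_fraction) + fetch_fraction / fetch_speedup)


# Hypothetical case: 10% of cycles are I-side stalls.
print(round(overall_speedup(0.10, 2.0), 3))           # stalls halved
print(round(overall_speedup(0.10, float("inf")), 3))  # stalls eliminated
```

With a 10% fetch slice, even a perfect I-side yields only about an 11% overall speedup, which is why the stall share should be measured before investing in front-end changes.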
3. Single port and serialization
If the I-cache (or unified backing port) allows only one outstanding fill or strictly serializes prefetch + demand, next-line still helps only on the next miss, not on streams of misses. MSHR depth and arbiter rules cap the benefit.
4. Branchy code and short traces
Next-line prefetch assumes sequential consumption soon. Frequent branches, OS / trap boundaries, or ICMPI-heavy control flow reduce useful prefetch distance.
5. Store / load buffers (D-side)
A store buffer mainly decouples retirement from D-cache write timing and can merge traffic; it cannot create memory-side upside in a kernel that is compute-bound. Gains show up when stores (or WAW ordering) would stall the pipeline without the buffer, or when non-blocking behavior overlaps late-fill loads. If loads already hit, or the buffer is shallow relative to latency, the improvement is small; the same miss-rate / port story applies as on the I-side.
6. RTL sim vs silicon
Fast RAM models (or zero external contention) understate real memory delay, so prefetch and buffers look less heroic in sim; conversely, very fast L1 models already idealize the case where buffering helps least.
Practical takeaway: quantify miss cycles, stall events, and MPC / IPC before and after; attribute gains to I-miss, D-miss, store stall, or branch buckets. That avoids tuning the wrong knob.
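One way to do the attribution is to diff per-bucket stall counters between two runs. The sketch below assumes such counters exist; the counter names (i_miss, d_miss, store_stall, branch) are hypothetical and should be mapped to whatever the RTL or testbench actually exports.

```python
def attribute_cycle_savings(before: dict, after: dict) -> dict:
    """Split the cycle delta between two runs into per-bucket stall savings.

    Both dicts hold a "cycles" total plus one stall-cycle counter per bucket.
    """
    buckets = [name for name in before if name != "cycles"]
    saved = {name: before[name] - after.get(name, 0) for name in buckets}
    total_saved = before["cycles"] - after["cycles"]
    # Cycles the buckets do not explain (second-order effects, noise).
    saved["unattributed"] = total_saved - sum(saved[name] for name in buckets)
    saved["total"] = total_saved
    return saved


# Made-up numbers purely to show the shape of the report.
baseline = {"cycles": 1000, "i_miss": 200, "d_miss": 100, "store_stall": 50, "branch": 150}
with_pf = {"cycles": 950, "i_miss": 160, "d_miss": 100, "store_stall": 50, "branch": 155}
print(attribute_cycle_savings(baseline, with_pf))
```

A large unattributed share suggests the buckets overlap or that a stall source is not being counted.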
Direction 1: Implement reserved PREFETCH_TYPE strategies
level_param and prefetcher_wrapper already reserve types 2–4 for richer policies. Concrete options:
- Stride / PC-delta prefetcher
    - Maintain a small fully associative table keyed by PC region or loop PC, learning (PC, Δ) where loads/stores or backward branches repeat.
    - On a hit, prefetch PC + k·Δ for k = 1 .. PREFETCH_DEGREE (parameter already hinted in level_param).
    - Best for numeric kernels and dense loops; must be throttled to avoid bus storms.
- Stream / sequential depth
    - Extend next-line to degree N: after confirming sequential consumption (PC advances by 2/4 bytes predictably), arm multiple line addresses subject to MSHR / port limits.
    - Requires clear rules when taken branches or exceptions flush training state.
- Hybrid
    - Next-line as default; escalate to stride when the same PC band triggers repeated next-line hits (optional confidence counter).
Interface needs: today the prefetch FSM sees miss_addr_i and miss status. Stride/stream may need observed PC and retire or fetch-valid hints from fetch (or a narrow side channel) without bloating the critical path—likely registered observation samples.
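The stride option above can be prototyped behaviorally before any RTL is written. The following is a minimal sketch under stated assumptions: a small table keyed by an arbitrary integer (for example a PC region), FIFO replacement, and a saturating confidence counter. The class name, defaults, and replacement policy are illustrative and not taken from level_param or prefetcher_wrapper.

```python
class StridePrefetcher:
    """Toy (key, delta) stride learner with a confidence threshold."""

    def __init__(self, degree: int = 2, conf_threshold: int = 2, max_entries: int = 8):
        self.degree = degree                  # plays the role of PREFETCH_DEGREE
        self.conf_threshold = conf_threshold
        self.max_entries = max_entries
        self.table = {}                       # key -> [last_addr, stride, confidence]

    def observe(self, key: int, addr: int) -> list:
        """Record one access; return the addresses to prefetch (possibly none)."""
        entry = self.table.get(key)
        if entry is None:
            if len(self.table) >= self.max_entries:
                # Evict the oldest-inserted entry (dicts keep insertion order).
                self.table.pop(next(iter(self.table)))
            self.table[key] = [addr, 0, 0]
            return []
        last_addr, stride, conf = entry
        delta = addr - last_addr
        if delta != 0 and delta == stride:
            conf = min(conf + 1, self.conf_threshold)  # same stride seen again
        else:
            stride, conf = delta, 0                    # retrain on a new stride
        self.table[key] = [addr, stride, conf]
        if conf >= self.conf_threshold:
            return [addr + k * stride for k in range(1, self.degree + 1)]
        return []


pf = StridePrefetcher()
for a in (0x1000, 0x1040, 0x1080, 0x10C0):
    issued = pf.observe(key=0x80, addr=a)
print([hex(x) for x in issued])   # prefetches armed after the stride is confirmed
```

The confidence threshold is the throttle: raising it delays the first prefetch but reduces wasted bus traffic on irregular access patterns.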
Direction 2: I-cache / memory subsystem coupling
Prefetch usefulness is bounded by how many fills can be in flight and how replacement interacts with pollution.
- Multiple outstanding misses (MSHR depth): If the memory side evolves to support more than one I-side fill, prefetch degree can be raised without serializing every line.
- Prefetch-only hint to L2: Mark prefetch fills as lower priority than demand if an arbiter or L2 queue exists, so latency-sensitive demand wins under load.
- Victim / way prediction: Aggressive prefetch can increase conflict misses; tie-in with PLRU / eviction policy (or a small prefetch filter “do not prefetch if set is thrashing”) is worth measuring in simulation.
Direction 3: Alignment with control flow
- Branch predictor integration: On high-confidence predicted taken branches, optional target-line prefetch (one line at target) may help I$ cold paths; must respect PMA (uncached regions) and not fight FENCE.I semantics.
- RVC (compressed) streams: Sequential PC steps are 2 or 4 bytes; a future prefetcher could use effective fetch width from the align buffer to avoid over-fetch assumptions.
Direction 4: Shallow fetch queue (Ibex-style, same as “thin front-end”)
If profiling shows bubbles not explained by misses (e.g. strict single-beat gating on the path to I$), a small FIFO + limited outstanding requests—along the lines of Ibex’s prefetch buffer—may help only if the Wishbone / L2 contract allows overlapping or pipelined transactions safely.
This is a larger microarch change than extending next_line_prefetcher; validate with cycle-accurate traces and flush / discard rules consistent with compressed fetch and fences.
Direction 5: Store / load path and non-blocking D-cache (future)
Today’s baseline (see Memory, Store buffer) is in-order at the MEM stage, with a FIFO store buffer (drain oldest to D$, store→load forwarding and conflict handling on the same word) and no separate load queue: pending loads are tracked by a small FSM, not an OoO-style LDQ.
Store buffer: possible evolutions
- Deeper / parameterized depth (SB_DEPTH) tuned after profiling real workloads (not only CoreMark).
- Coalescing / merging of adjacent or same-line stores before drain (reduces D$ traffic if policy stays correct for RV weak ordering and fences).
- Clearer FENCE / FENCE.I interaction documentation + tests if SB + D$ ordering is extended.
- If D$ becomes non-blocking, SB drain may need retry / back-pressure semantics and arbitration with in-flight load misses (see below).
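For reasoning about depth, drain order, and forwarding, a small behavioral model of a FIFO store buffer with youngest-first byte merge can be handy. The sketch below is an assumption-laden toy, not a description of store_buffer.sv: it assumes 32-bit words, store data already positioned in its byte lanes, and pure forwarding with no conflict stall.

```python
from collections import deque


def merge_bytes(old: int, new: int, byte_mask: int) -> int:
    """Overwrite the byte lanes of old selected by the 4-bit byte_mask."""
    out = old
    for lane in range(4):
        if (byte_mask >> lane) & 1:
            lane_bits = 0xFF << (8 * lane)
            out = (out & ~lane_bits) | (new & lane_bits)
    return out


class StoreBuffer:
    """Toy FIFO store buffer; depth plays the role of SB_DEPTH."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self.entries = deque()  # oldest on the left: (word_addr, byte_mask, data)

    def push(self, addr: int, data: int, byte_mask: int) -> bool:
        """Accept a store, or return False to signal back-pressure when full."""
        if len(self.entries) >= self.depth:
            return False
        self.entries.append((addr & ~3, byte_mask & 0xF, data & 0xFFFFFFFF))
        return True

    def drain(self, mem: dict) -> None:
        """Write the oldest entry to memory (a dict standing in for D$)."""
        if self.entries:
            word_addr, byte_mask, data = self.entries.popleft()
            mem[word_addr] = merge_bytes(mem.get(word_addr, 0), data, byte_mask)

    def load(self, addr: int, mem: dict) -> int:
        """Read a word, overlaying pending stores oldest to youngest."""
        word_addr = addr & ~3
        word = mem.get(word_addr, 0)
        for entry_addr, byte_mask, data in self.entries:
            if entry_addr == word_addr:
                word = merge_bytes(word, data, byte_mask)  # youngest wins last
        return word


mem = {0x100: 0x11223344}
sb = StoreBuffer()
sb.push(0x100, 0x000000AA, 0b0001)   # byte store to lane 0
sb.push(0x100, 0xBB00CC00, 0b1010)   # stores to lanes 1 and 3
print(hex(sb.load(0x100, mem)))      # load sees memory overlaid by both stores
```

Coalescing could be explored in the same model by merging a new store into an existing entry for the same word at push time, then checking that loads and drained memory contents are unchanged.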
Load path: possible evolutions
- A minimal load miss buffer (MSHR-lite): track one or few outstanding D$ read misses so the pipeline or later independent ops (where hazards allow) are not serialized as aggressively as today—only if the rest of the core can exploit it without breaking in-order semantics.
- Explicit load queue is usually coupled with wider issue or non-blocking D$; treat as a stepping stone, not a standalone knob.
Non-blocking D-cache (longer-term)
The L2 path already explores non-blocking, multi-bank behavior (nbmbmp_l2_cache, see L2 cache). A natural stretch goal is a non-blocking L1 D-cache:
- Multiple outstanding misses (small MSHR table): overlap fill latency with independent instructions (limited by in-order issue, but still useful for late-hit or critical-word-first style fills).
- Load hits while one miss is outstanding (classic “hit under miss” subset): requires banking / tag–data separation and careful store-buffer ordering vs. returning fill data.
- Prefetch integration: D-side hardware prefetch (stride/stream) pays off most when misses can overlap; non-blocking D$ raises the ceiling for those policies.
Risk / cost: verification cost jumps (memory ordering, SB forwarding vs. fill race, FENCE). Prefer parameterized NONBLOCKING_DDEPTH or similar and grow tests incrementally.
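To size such a parameter before committing to RTL, a tiny MSHR model can be replayed against a miss trace. This is a hedged sketch: the class, its return strings, and the 32-byte line size are assumptions for illustration, and depth stands in for something like NONBLOCKING_DDEPTH.

```python
class MshrTable:
    """Toy miss-status table: allocate, merge secondary misses, stall when full."""

    def __init__(self, depth: int = 2, line_bytes: int = 32):
        self.depth = depth
        self.line_bytes = line_bytes
        self.pending = {}  # line index -> number of misses waiting on that fill

    def miss(self, addr: int) -> str:
        line = addr // self.line_bytes
        if line in self.pending:
            self.pending[line] += 1      # secondary miss: share the fill
            return "merged"
        if len(self.pending) >= self.depth:
            return "stall"               # no free entry: pipeline must wait
        self.pending[line] = 1
        return "allocated"

    def fill(self, addr: int) -> int:
        """Fill returned: free the entry and report how many misses it wakes."""
        return self.pending.pop(addr // self.line_bytes, 0)


mshr = MshrTable(depth=2)
print(mshr.miss(0x1000), mshr.miss(0x1008), mshr.miss(0x2000), mshr.miss(0x3000))
```

Counting "stall" results over a trace for several depths gives a first estimate of where extra entries stop paying off.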
Relation to high-end cores (sanity check)
OoO cores (e.g. XiangShan class) combine deep store queues + load queues + non-blocking caches with out-of-order scheduling; Level-v gains remain bounded by sequential issue until the front-end / issue model changes. Non-blocking D$ still helps miss latency hiding even in-order.
Appendix: reference repos for store vs load buffering & forwarding (notes)
Purpose: when Level-v adds a distinct load buffer / queue (beyond today’s load_pending) and evolves store–load forwarding, these projects are starting points for RTL or architecture mining.
Caveat: rows are README / high-level unless you open the RTL; module names and queue depths change with branches—re-validate before citing in a paper or porting code.
| Project | Store-side structure | Load-side structure | Forwarding / disambiguation (high level) | Relevance to Level-v |
|---|---|---|---|---|
| Level-v (this tree) | store_buffer.sv: FIFO, drain to D$, byte-wise merge | No LDQ; memory.sv load_pending + D$ arb | SB→load combinational merge; fwd_conflict → stall / drain | Baseline to extend |
| ApogeoRV | README: store buffer | OoO core: likely load queue / LSQ inside LSU (not spelled “LB” on README) | README: load forwarding from store path; OoO ordering | MIT SV; study Hardware/ for STQ/LDQ split & replay |
| XiangShan | Deep store queue (Chisel LSU) | Load queue + miss handling | Full LSQ disambiguation, non-blocking memory | Best architecture reference for split SB/LB (scale is huge) |
| riscv-boom | STQ | LAQ / load miss track | Documented LSQ; replays, youngest-store rules | Academic OoO; clearer block diagram → RTL trail than some industrial repos |
| Ibex | LSU-integrated (no big split SB/LB like OoO) | Same LSU path | Stall / single outstanding style; see ibex_load_store_unit | In-order reference; not where to copy LDQ from |
| AngeloJacobo/RISC-V | No real store buffer | No load buffer | rv32i_forwarding.v = register forwarding only | Teaching core; not LS-buffer reference |
| quasiSoC CPU | Simple multi-cycle store path | Simple load path | Software / timing model differs from pipelined LSQ | SoC/Linux, not LS microarch |
| phoeniX | No store buffer; store in MW comb. to bus | No LDQ; load in same MW stage | Register fwd only (EX/MW→DE); load-use = stall; subword via frame_mask on word bus | Contrast to Level-v SB; see detail below |
phoeniX: load/store / stall / forward (RTL summary, main)
Source: top phoeniX.v, Load_Store_Unit.v, Hazard_Forward_Unit.v. GPL-3.0.
- Stores: not “merged” in a buffer
    - Store data is registered as rs2_MW_reg in the combined Memory/Writeback stage; Load_Store_Unit drives a 32-bit word-aligned address and a 4-bit frame_mask (byte lane enables).
    - Subword stores place store_data in the selected lane(s) and drive z on unused lanes on the bidirectional memory_interface_data (classic masked write pattern). There is no FIFO of pending stores and no store-to-load forwarding path in hardware.
- Loads: not merged across transactions
    - Load uses the same unit: one memory read presents a full word; extract / sign-extend of byte/halfword is purely combinational from memory_interface_data and funct3 / frame_mask.
    - Unaligned access is not supported in LSU (README); Level-v + specs are stricter/softer per implementation.
- Register forwarding (not LSQ)
    - Hazard_Forward_Unit: two writeback sources compared to rs1 / rs2 in decode: write_index_EX_reg + execution_result_EX_wire (with special cases for LUI/AUIPC/SYSTEM) and write_index_MW_reg + write_data_MW_wire (includes ALU/MUL/DIV/JAL/LUI/LOAD/CSR result after MW).
    - So ALU results can bypass from EX or MW; load result only exists in write_data_MW_wire after the load completes in MW.
- Stalls
    - stall_condition[1]: MUL/DIV busy (mul_busy_EX_wire || div_busy_EX_wire).
    - stall_condition[2]: load–use hazard, i.e. opcode_EX_reg == LOAD and EX writes a register that decode still needs as rs1 or rs2 (read_enable gating). Pipeline inserts bubbles (PC/decode freeze; EX can be cleared on this stall per the DE→EX always sensitivity). No bypass from a load still in EX.
- Register_Loading_Table
    - Side table logging load target address when a load completes (write_enable_MW_reg && opcode_MW_reg == LOAD); not used in the forwarding/store datapath in the snippet above (approximate / profiling hook per project docs).
Takeaway for Level-v: phoeniX is a minimal in-order reference: simpler than your store buffer + youngest-byte merge. Use it to contrast “bare LSU + load stall” vs your “SB drain + SB→load forward + conflict stall” when documenting design trade-offs.
Level-v direction (reminder): splitting load buffer from “pending bit” usually implies: (1) tracking multiple in-flight loads if D$ allows, (2) explicit RAW resolution vs SB youngest-first byte merge, (3) FENCE / flush interaction with both structures—align tests with memory model docs (and related notes) / ordering goals.
Direction 6: RTOS & peripheral bring-up (software / SoC goals)
Hardware microarch is only part of the story; product-style goals include FreeRTOS-class RTOS, a thin BSP over UART/GPIO/CLINT/PLIC, and scripted build+sim like reference teaching cores.
- Reference: AngeloJacobo/RISC-V (RV32I+Zicsr Verilog, FreeRTOS, test/regression, peripheral C library). Use for workflows and SDK layout, not as an RTL substitute (different ISA subset, Harvard vs Wishbone).
- Where tracked in-tree: SoC roadmap, section Software, peripherals, and RTOS (community reference).
Validation and metrics
Any new policy should be gated behind parameters and evaluated with:
- CoreMark / Embench (realistic I-footprint),
- RISC-V DV or long random ISA runs (stress predictor + flush),
- Hit/miss/prefetch-fill counters exported (if not already) to quantify pollution vs. coverage,
- Optional: synthetic I-miss heavy microbenchmarks (large cold loops) to stress prefetch unlike tiny compliance tests.
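Once counters are exported, coverage and pollution can be summarized with the usual prefetch accuracy and coverage ratios. The sketch below assumes three counters are available; their names are hypothetical, and "useful" is taken to mean a prefetched line that was hit by a demand access before eviction.

```python
def prefetch_metrics(prefetch_fills: int, useful_prefetches: int, demand_misses: int) -> dict:
    """Summarize prefetch quality from three (assumed) event counters.

    prefetch_fills:    lines brought in by the prefetcher.
    useful_prefetches: prefetched lines later hit by a demand access.
    demand_misses:     demand misses that still occurred with prefetch enabled.
    """
    would_be_misses = useful_prefetches + demand_misses
    return {
        # Share of prefetch fills that were actually used.
        "accuracy": useful_prefetches / prefetch_fills if prefetch_fills else 0.0,
        # Share of would-be misses that the prefetcher removed.
        "coverage": useful_prefetches / would_be_misses if would_be_misses else 0.0,
        # Unused fills: an upper bound on pollution and wasted bandwidth.
        "wasted_fills": prefetch_fills - useful_prefetches,
    }


print(prefetch_metrics(prefetch_fills=100, useful_prefetches=60, demand_misses=140))
```

Low accuracy with rising demand misses relative to a prefetch-off run is the signature of pollution; low coverage with high accuracy suggests the policy is too conservative.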
References (in-tree)
- External LSQ-oriented open cores (to mine with caution): ApogeoRV, XiangShan, riscv-boom, in-order counter-example Ibex LSU area; minimal LSU + load-stall reference phoeniX
- Fetch module
- Align buffer
- Next-line prefetcher
- Prefetcher wrapper
- I-cache
- Memory stage / store path (D-side buffering context): Memory module, Store buffer
- D-cache, L2 cache (non-blocking direction)