This document explains the RTL pipeline accounting enabled by **+define+LOG_PERF_STALL** (see rtl/tracer/perf_stall_counters.sv, instantiated from rtl/core/cpu.sv). It is meant to help interpret **results/logs/verilator/<test>/perf_pipeline.log** (and the same block at the end of **verilator_run.log**).
Enabling and outputs
- Build: `make verilate TEST_CONFIG=coremark` (the CoreMark profile sets LOG_PERF_STALL=1 in script/config/tests/coremark.conf), or pass LOG_PERF_STALL=1 when verilating.
- Run: `make run_coremark SIM_FAST=1` (or your usual Verilator flow).
- Artifacts:
  - results/logs/verilator/<test>/perf_pipeline.log — extracted summary (written when the Python runner sees the LOG_PERF_STALL banner in the simulator output).
  - verilator_run.log — full simulator stdout/stderr.
Counters reset when CPU rst_ni is low, except **cycles_clk_total**, which counts every clock edge for the whole simulation (including reset).
Denominator for percentages: **cycles_active** = cycles with rst_ni == 1 (the “benchmark window” in typical bring-up runs). Your workload may include reset/release and shutdown; interpret percentages accordingly.
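The arithmetic behind the summary percentages can be sketched in a few lines. The counter values below are invented for illustration; only the counter names (cycles_clk_total, cycles_active, and the stall buckets) come from this document.

```python
# Sketch: how the summary percentages relate to the raw counters.
# Numeric values are made up; cycles_active is the denominator because
# cycles_clk_total also counts cycles spent in reset.
counters = {
    "cycles_clk_total": 1_050_000,  # every clock edge, including reset
    "cycles_active":    1_000_000,  # cycles with rst_ni == 1
    "IMISS_STALL":         80_000,
    "DMISS_STALL":        150_000,
    "LOAD_RAW_STALL":      30_000,
}

active = counters["cycles_active"]
stall_total = sum(v for k, v in counters.items() if k.endswith("_STALL"))

print(f"stall cycles: {stall_total} ({100.0 * stall_total / active:.1f}% of active)")
print(f"cycles w/o stall: {active - stall_total}")
```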
Stall causes (cycles)
A stall cycle is any cycle where stall_cause != NO_STALL in cpu.sv. Only one stall reason is recorded per cycle (priority below).
Priority order (hardware)
Highest to lowest:
1. FENCEI_STALL — FENCE.I / related D-cache writeback stall
2. IMISS_STALL — I-cache miss
3. DMISS_STALL — D-cache / memory miss
4. LOAD_RAW_STALL — load-use hazard (fe_stall || de_stall from hazard_unit)
5. ALU_STALL — multi-cycle multiply/divide
If a lower-priority condition is also true in the same cycle, the cycle is attributed only to the winning (highest-priority) cause.
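The single-cause attribution can be modeled as a priority pick. The dictionary keys reuse the enum names above, but the function is an illustrative Python stand-in, not the RTL encoder.

```python
# Sketch of the one-cause-per-cycle attribution described above.
# Priority order mirrors the document, highest first.
PRIORITY = ["FENCEI_STALL", "IMISS_STALL", "DMISS_STALL",
            "LOAD_RAW_STALL", "ALU_STALL"]

def stall_cause(conditions):
    """Return the highest-priority active stall, or None (NO_STALL)."""
    for cause in PRIORITY:
        if conditions.get(cause):
            return cause
    return None

# A cycle where both an I-miss and a load-use hazard are pending:
# only the higher-priority IMISS_STALL counter is incremented.
print(stall_cause({"IMISS_STALL": True, "LOAD_RAW_STALL": True}))  # IMISS_STALL
```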
| LOAD_RAW_STALL | |
| --- | --- |
| Meaning | lw is in EX while decode still needs its rd (lw_stall in hazard_unit.sv). One FE/DE stall cycle is normal; the consumer then enters EX while the load is in MEM. |
| Typical sources | lw immediately followed by an instruction that uses the loaded register. |
| Hardware | The MEM→EX bypass must use pipe3.read_data for loads (and pc_incr for JAL), not pipe3.alu_result (which holds the effective address for loads). Implemented as ex_mem_bypass_data feeding execution.alu_result_i in cpu.sv. |
| Improvement ideas | Schedule an independent instruction in the slot after lw; deeper cores (dual-issue, OoO) hide the latency. |
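A minimal model of the load-use check, assuming the usual rd-vs-rs comparison (the real condition is lw_stall in hazard_unit.sv and may differ in detail; the field names here are hypothetical):

```python
# Model of lw_stall: a load in EX whose destination matches a source
# register that decode needs this cycle. x0 never creates a hazard.
def lw_stall(ex_is_load, ex_rd, de_rs1, de_rs2):
    return ex_is_load and ex_rd != 0 and ex_rd in (de_rs1, de_rs2)

# lw x5, 0(x10) in EX, add x6, x5, x7 in decode -> one stall cycle
assert lw_stall(True, 5, 5, 7)
# Independent instruction scheduled after the load -> no stall
assert not lw_stall(True, 5, 6, 7)
```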
| IMISS_STALL | |
| --- | --- |
| Meaning | Instruction fetch blocked on an I-cache miss or refill (the front-end is not delivering an instruction on time). |
| Typical sources | Working set larger than the I-cache, conflict misses, cold start, interaction with L2/backing-memory latency. |
| Improvement ideas | Larger or more associative I-cache; line-size/prefetch tuning (next-line or stream buffer); smaller code footprint (-Os, hot/cold splitting); align critical loops to reduce tag pressure; tune branch fetch so fewer wrong-path fills pollute the I-cache. |
| DMISS_STALL | |
| --- | --- |
| Meaning | Pipeline stalled on a D-cache miss or downstream memory latency (unified path through the memory subsystem as wired in cpu.sv). |
| Typical sources | Data working set vs. D-cache capacity; capacity/conflict misses; long L2/AXI latency in simulation or silicon. |
| Improvement ideas | Larger D-cache or more sets/ways; write-policy/write-buffer tuning; L2 size and latency; critical-word-first / early restart if supported; software data layout (structure packing, blocking) to improve locality. Often the largest stall bucket on embedded kernels like CoreMark. |
| ALU_STALL | |
| --- | --- |
| Meaning | Execute held on multi-cycle M-extension operations (multiply/divide, as configured in RTL). |
| Typical sources | Hot loops with mul/div/rem; soft divisions from runtime/helper code. |
| Improvement ideas | Faster divider (higher radix, pipelining) if area/timing allow; fused patterns where the ISA permits; compiler strength reduction (shifts for powers of two); avoid divides in inner loops where possible. |
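Strength reduction, written out for one case: an unsigned divide by a power of two becomes a shift, which avoids the multi-cycle divider entirely. (The Python below just demonstrates the equivalence a compiler relies on.)

```python
# Divide by 8 two ways: the quotient is identical for non-negative
# integers, but on rv32im the first form may cost a multi-cycle div
# while the second is a single-cycle shift.
def div8_slow(x):
    return x // 8

def div8_fast(x):
    return x >> 3

assert all(div8_slow(x) == div8_fast(x) for x in range(1024))
```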
| FENCEI_STALL | |
| --- | --- |
| Meaning | Stall while FENCE.I semantics complete (including D-cache dirty writeback / coherence with the I-stream as implemented). |
| Typical sources | Self-modifying code, JIT barriers, explicit fence.i in firmware. |
| Improvement ideas | Rare in benchmarks; if non-zero, minimize fence.i in the runtime; a faster writeback path lets the fence complete sooner. |
Flush events (counted once per cycle)
A flush event is recorded in at most one category per cycle, using this priority:
- EX trap — priority_flush == 3 (exception taken in execute).
- DE trap — priority_flush == 2 (decode/front-side exception path).
- FENCE.I / MISA front flush — fencei_flush (I-cache / decode flush for fence.i or a misa write affecting decode, as in cpu.sv).
- BP miss redirect — de_flush_en effective: branch predictor wrong (hazard_unit pc_sel_ex_i is wired to **!ex_spec_hit** in cpu.sv, not raw "branch taken").
- Load-use EX only — ex_flush_en && !de_flush_en: load-use inserts an EX-stage bubble without a decode-stage redirect from misprediction.
flush_de_o / flush_ex_o are masked when stall_cause is one of IMISS_STALL, DMISS_STALL, ALU_STALL, or FENCEI_STALL; counting uses the masked de_flush_en / ex_flush_en signals.
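The classification and the masking rule together can be sketched as a single function. Names are Python stand-ins for the cpu.sv signals, and the priority follows the list above.

```python
# One flush event per cycle, with de/ex flush signals masked while a
# memory-side or ALU stall is in progress (as described above).
MASKING_STALLS = {"IMISS_STALL", "DMISS_STALL", "ALU_STALL", "FENCEI_STALL"}

def flush_event(priority_flush, fencei_flush, de_flush_en, ex_flush_en,
                stall_cause):
    if stall_cause in MASKING_STALLS:
        de_flush_en = ex_flush_en = False   # masked flush signals
    if priority_flush == 3:
        return "EX trap"
    if priority_flush == 2:
        return "DE trap"
    if fencei_flush:
        return "FENCE.I / MISA front flush"
    if de_flush_en:
        return "BP miss redirect"
    if ex_flush_en:
        return "Load-use EX only"
    return None

# A mispredict arriving during a D-miss stall is not counted this cycle.
assert flush_event(0, False, True, True, "DMISS_STALL") is None
assert flush_event(0, False, True, True, None) == "BP miss redirect"
```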
| EX trap / DE trap | |
| --- | --- |
| Meaning | Trap to M-mode: an EX-stage fault wins over DE/FE in the priority encoding. |
| Improvement ideas | Should be near zero on validated benchmarks; if nonzero, debug PMA violations, misalignment, illegal opcodes, or an interrupt storm. |
| FENCE.I / MISA front flush | |
| --- | --- |
| Meaning | Front-end invalidated for instruction-stream coherence or a compressed-ISA change (misa.C). |
| Improvement ideas | Minimize **fence.i** in steady-state code; avoid toggling **misa** in performance paths. |
| BP miss redirect | |
| --- | --- |
| Meaning | Speculation wrong: the resolved PC disagrees with the predicted path (!ex_spec_hit). |
| Typical sources | Hard-to-predict branches, BTB/RAS misses, PHT aliasing, short history. |
| Improvement ideas | Deeper GHR, larger BTB/RAS, tagged or partial predictors; a loop predictor for inner loops; better return-address-stack pairing with call/ret; compiler branch layout to reduce pressure on the same index bits. |
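For intuition only, a 2-bit saturating counter, the building block of the PHT mentioned above; mispredictions from such a counter are what the BP miss redirect bucket counts (via !ex_spec_hit). This is a textbook model, not this core's predictor.

```python
# Classic 2-bit saturating counter: two wrong outcomes are needed to
# flip the prediction, so an occasional loop exit costs one redirect
# instead of two.
class TwoBitCounter:
    def __init__(self):
        self.state = 1  # 0..1 predict not-taken, 2..3 predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

ctr = TwoBitCounter()
mispredicts = 0
for taken in [True, True, True, True, False, True, True]:  # loop branch pattern
    if ctr.predict() != taken:
        mispredicts += 1
    ctr.update(taken)
print(mispredicts)
```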
| Load-use EX only | |
| --- | --- |
| Meaning | **flush_ex_o** asserted by the load-use hazard (`lw_stall`). |
| Overlap note | Often correlates with **LOAD_RAW_STALL**; **cycles_stall_with_flush** counts cycles where both a stall and a counted flush occurred. Do not add the stall % and the heuristic flush-bubble % and treat them as independent losses. |
Reading the summary lines
- Stall cycles (% of active): fraction of post-reset cycles spent not advancing the front-end for the dominant stall reason above.
- Cycles w/o stall: complement of stall cycles (same denominator).
- Flush events (% of cycles with ≥1 flush): at most one event per cycle, so this is “what fraction of cycles squashed the pipe for a counted flush reason.”
- Flush ×2–×3 heuristic: rough bubble cost if each flush costs 2–3 fetch/decode slots — overlaps with cycles that are also stalled; use as intuition, not as an additive IPC formula.
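The ×2–×3 heuristic made explicit; the event count below is invented. The point is that the resulting bubble cycles overlap stalled cycles, so they must not be added to the stall % as an independent loss.

```python
# Rough bubble-cost estimate: each counted flush costs 2-3 fetch/decode
# slots. This overlaps with stall cycles, so it is intuition, not an
# additive IPC formula.
flush_events = 40_000
cycles_active = 1_000_000

for cost in (2, 3):
    bubbles = flush_events * cost
    print(f"x{cost}: ~{bubbles} bubble cycles "
          f"(~{100.0 * bubbles / cycles_active:.1f}% of active)")
```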
| Item | Location |
| --- | --- |
| Stall enum & pipeline doc | rtl/pkg/level_param.sv, docs/core/cpu_module.md |
| Hazard / flush sources | rtl/core/hazard_unit.sv, rtl/core/cpu.sv |
| Feature define | rtl/include/level_defines.svh (LOG_PERF_STALL) |
| CoreMark profile | script/config/tests/coremark.conf |
Limitations
- Counts reflect this core’s stall/flush encoding; they are not a substitute for cycle-accurate power or bus tracing.
- EEMBC CoreMark “valid run” rules (e.g. minimum time) are orthogonal; you can still use this log to profile the pipeline on short or validation runs.
- RTL simulation includes memory model latency; DMISS dominance may differ on FPGA vs ASIC vs fast RAM models.