Fence.i Instruction Implementation for Dcache
Overview
This document explains how the RISC-V fence.i instruction is implemented on the dcache side. The fence.i instruction is used to maintain consistency between the instruction cache and the data cache.
Problem statement
The RISC-V architecture supports self-modifying code. A program can write new instructions to memory and then execute them. For that to work:
- New instructions written to the data cache must be flushed to memory
- The instruction cache must be invalidated
- The pipeline must be flushed so that stale, already-fetched instructions are discarded
Implementation details
1. Dirty array register structure
Previous state: Dirty bits lived inside SRAM, making it impossible to see all dirty states in a single cycle.
Solution: Hold the dirty array in registers:
// Dirty bits as register array for instant access
logic [NUM_WAY-1:0] dirty_reg_q [NUM_SET];
logic [NUM_WAY-1:0] dirty_reg_d [NUM_SET];
This allows:
- Reading any set’s dirty state in one cycle
- Fast dirty scanning for the fence.i state machine
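A minimal sketch of how such a register array could be updated and read is shown here. Only `dirty_reg_q`, `dirty_reg_d`, `NUM_WAY`, and `NUM_SET` come from the snippet above. The module name, the port names, the default parameter values, and the set/clear priority are assumptions made for illustration. The sketch also assumes `NUM_SET` is a power of two.

```systemverilog
// Hypothetical sketch: register-based dirty array with set/clear ports.
// Module and port names are illustrative and not taken from cache.sv.
module dirty_array_sketch #(
  parameter  int NUM_WAY = 4,               // assumed default
  parameter  int NUM_SET = 64,              // assumed default, power of two
  localparam int IDX_W   = $clog2(NUM_SET)
) (
  input  logic               clk_i,
  input  logic               rst_ni,
  input  logic               set_en_i,    // e.g. store hit: mark way dirty
  input  logic               clr_en_i,    // e.g. writeback done: mark way clean
  input  logic [IDX_W-1:0]   wr_idx_i,    // set index for set/clear
  input  logic [NUM_WAY-1:0] wr_way_i,    // one-hot way select
  input  logic [IDX_W-1:0]   rd_idx_i,    // set index for the scan read
  output logic [NUM_WAY-1:0] rd_dirty_o   // dirty bits of the selected set
);
  logic [NUM_WAY-1:0] dirty_reg_q [NUM_SET];
  logic [NUM_WAY-1:0] dirty_reg_d [NUM_SET];

  // Next-state logic: set wins over clear (an arbitrary choice here).
  always_comb begin
    dirty_reg_d = dirty_reg_q;
    if (set_en_i)
      dirty_reg_d[wr_idx_i] = dirty_reg_q[wr_idx_i] | wr_way_i;
    else if (clr_en_i)
      dirty_reg_d[wr_idx_i] = dirty_reg_q[wr_idx_i] & ~wr_way_i;
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) dirty_reg_q <= '{default: '0};
    else         dirty_reg_q <= dirty_reg_d;
  end

  // Combinational read: no SRAM latency between index and dirty bits.
  assign rd_dirty_o = dirty_reg_q[rd_idx_i];
endmodule
```

The trade-off is area: the array costs `NUM_SET * NUM_WAY` flip-flops instead of SRAM bits.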
2. Fence.i state machine
We implemented an 8-state FSM for the dcache:
FI_IDLE → FI_SCAN → FI_CHECK_WAY → FI_WRITEBACK_REQ → FI_WRITEBACK_WAIT → FI_MARK_CLEAN → FI_NEXT_WAY → FI_DONE
↑ ↓ ↓
└─────────────────────┴──────────────────────────────────────────────────────────────────────────────┘
| State | Description |
|---|---|
| FI_IDLE | Idle; wait for fence.i |
| FI_SCAN | Set index; send address to SRAM |
| FI_CHECK_WAY | Check if current set has a dirty way |
| FI_WRITEBACK_REQ | Issue write to memory via LowX interface |
| FI_WRITEBACK_WAIT | Wait for memory write to complete |
| FI_MARK_CLEAN | Clear dirty bit |
| FI_NEXT_WAY | Advance to next way or next set |
| FI_DONE | Done; deassert stall |
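The state sequence could be sketched as next-state logic along these lines. The state names come from the table. The module name, every signal name, and the exact transition conditions are assumptions, in particular the clean-way skip from FI_CHECK_WAY to FI_NEXT_WAY and the loop back to FI_SCAN for the next set. The actual FSM in cache.sv may differ.

```systemverilog
// Hypothetical sketch of the fence.i FSM next-state logic.
// State names follow the table; signals and conditions are assumed.
module fencei_fsm_sketch (
  input  logic clk_i,
  input  logic rst_ni,
  input  logic start_i,      // fence.i detected
  input  logic way_dirty_i,  // dirty bit of the current set/way
  input  logic wb_done_i,    // memory acknowledged the writeback
  input  logic last_way_i,   // current way is the last way of the set
  input  logic last_set_i,   // current set is the last set
  output logic fi_active_o   // high while the sequence is running
);
  typedef enum logic [2:0] {
    FI_IDLE, FI_SCAN, FI_CHECK_WAY, FI_WRITEBACK_REQ,
    FI_WRITEBACK_WAIT, FI_MARK_CLEAN, FI_NEXT_WAY, FI_DONE
  } fi_state_e;

  fi_state_e state_q, state_d;

  always_comb begin
    state_d = state_q;  // hold by default
    unique case (state_q)
      FI_IDLE:           if (start_i) state_d = FI_SCAN;
      FI_SCAN:           state_d = FI_CHECK_WAY;
      FI_CHECK_WAY:      state_d = way_dirty_i ? FI_WRITEBACK_REQ : FI_NEXT_WAY;
      FI_WRITEBACK_REQ:  state_d = FI_WRITEBACK_WAIT;
      FI_WRITEBACK_WAIT: if (wb_done_i) state_d = FI_MARK_CLEAN;
      FI_MARK_CLEAN:     state_d = FI_NEXT_WAY;
      FI_NEXT_WAY: begin
        if (!last_way_i)      state_d = FI_CHECK_WAY;  // next way, same set
        else if (!last_set_i) state_d = FI_SCAN;       // first way, next set
        else                  state_d = FI_DONE;       // all sets scanned
      end
      FI_DONE:           state_d = FI_IDLE;
      default:           state_d = FI_IDLE;
    endcase
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) state_q <= FI_IDLE;
    else         state_q <= state_d;
  end

  assign fi_active_o = (state_q != FI_IDLE);
endmodule
```

Eight states fit exactly in a 3-bit encoding. The set and way counters that would drive `last_way_i` and `last_set_i` are left out to keep the sketch short.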
3. Pipeline stall mechanism
We added a new stall type to stop the pipeline during fence.i:
// level_param.sv
typedef enum logic [2:0] {
NO_STALL = 0,
DMEM_STALL = 1,
IMEM_STALL = 2,
MUL_STALL = 3,
DIV_STALL = 4,
FENCEI_STALL = 5 // New
} stall_e;
Stall priority: FENCEI_STALL has highest priority (checked before other stalls).
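One way to express "checked before other stalls" is an if/else chain with the fence.i stall first. The enum values mirror the snippet above. The package name, module name, and port names are hypothetical, and the relative order of the remaining stalls is an assumption, since only the fence.i priority is stated here.

```systemverilog
// Hypothetical sketch: stall selection with FENCEI_STALL checked first.
// Enum values mirror level_param.sv; all other names are assumed.
package stall_sketch_pkg;
  typedef enum logic [2:0] {
    NO_STALL     = 0,
    DMEM_STALL   = 1,
    IMEM_STALL   = 2,
    MUL_STALL    = 3,
    DIV_STALL    = 4,
    FENCEI_STALL = 5
  } stall_e;
endpackage

module stall_select_sketch
  import stall_sketch_pkg::*;
(
  input  logic   fencei_stall_i,
  input  logic   dmem_stall_i,
  input  logic   imem_stall_i,
  input  logic   mul_stall_i,
  input  logic   div_stall_i,
  output stall_e stall_o
);
  always_comb begin
    // fence.i wins over every other stall source.
    if      (fencei_stall_i) stall_o = FENCEI_STALL;
    // Order below is an assumption, not taken from the RTL.
    else if (dmem_stall_i)   stall_o = DMEM_STALL;
    else if (imem_stall_i)   stall_o = IMEM_STALL;
    else if (mul_stall_i)    stall_o = MUL_STALL;
    else if (div_stall_i)    stall_o = DIV_STALL;
    else                     stall_o = NO_STALL;
  end
endmodule
```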
4. Flush signal management
Problem: When fence.i arrives, both icache flush and dcache dirty writeback are needed. But a shared flush signal affected both caches and cleared dcache tags immediately — before dirty writeback finished.
Solution: Block flush writes to the dcache tag array while fence.i is active:
// Tag array write — block flush writes during fence.i
for (int i = 0; i < NUM_WAY; i++)
tsram.way[i] = (flush && !fi_active) ? '1 : (cache_wr_way[i] && tag_array_wr_en);
5. Fence.i start condition
We detect fence.i reliably with rising-edge detection:
logic flush_i_prev;
always_ff @(posedge clk_i or negedge rst_ni) begin
if (!rst_ni) flush_i_prev <= 1'b0;
else flush_i_prev <= flush_i;
end
wire fencei_rising_edge = flush_i && !flush_i_prev;
6. Pipe2 flush condition
Problem: While the fence.i instruction was in pipe2, fencei_flush flushed pipe2, which dropped fence.i itself.
Solution: Removed fencei_flush from the pipe2 flush condition:
// Before (buggy):
if (!rst_ni || ex_flush_en || priority_flush == 3 || priority_flush == 2 || fencei_flush)
// After (fixed):
if (!rst_ni || ex_flush_en || priority_flush == 3 || priority_flush == 2)
Test used
rv32ui-p-fence_i
This test checks that fence.i behaves correctly:
- Fetches an instruction from memory (e.g. from 0x80002000)
- Stores a new instruction to the same address
- Executes fence.i
- Jumps to the newly written address
- Checks that the new instruction executes correctly
Test flow:
PC=0x80000144: sh x13, 0x80002004 # Write lower halfword of new instruction (0x8693)
PC=0x8000014c: sh x11, 0x80002006 # Write upper halfword of new instruction (0x14d6)
PC=0x80000150: fence.i # Dcache flush + Icache invalidate
PC=0x8000015c: jalr x6, x15 # Jump to 0x80002004
PC=0x80002004: addi x13, x13, 333 # Execute newly written instruction (0x14d68693)
Unused / alternative approaches
1. SRAM-based dirty array (not used)
Reason: SRAM read has 1-cycle latency. Scanning all sets during fence.i would take too many cycles.
Preferred: Register-based dirty array. The dirty bits of any set can be read in the same cycle, with no SRAM read latency.
2. Blocking cache during fence.i (partially used)
We block the cache with a pipeline stall, while ensuring that the normal cache state machine is not corrupted.
3. Write-through cache (not used)
Reason: Every store goes straight to memory, simplifying fence.i but hurting performance.
Preferred: Write-back cache — better performance; fence.i requires dirty writeback.
4. Invalidate-only approach (not used)
Reason: Invalidating alone loses data. Dirty lines would be dropped without writeback to memory.
Preferred: Dirty writeback + invalidate.
File changes
| File | Change |
|---|---|
| rtl/core/mmu/cache.sv | Fence.i FSM, dirty register array, fencei_stall_o port |
| rtl/core/stage04_memory/memory.sv | fencei_stall_o passthrough |
| rtl/core/cpu.sv | FENCEI_STALL integration, pipe2 flush fix |
| rtl/pkg/level_param.sv | FENCEI_STALL enum value |
| rtl/include/writeback_log.svh | FENCEI_STALL logging condition |
Test result
✅ MATCH | PC=0x80002004 INST=0x14d68693 x13 0x000001bc | PC=0x80002004 INST=0x14d68693 x13 0x000001bc
The test passed. The dcache wrote dirty data to memory and the icache fetched the new instruction correctly.
Future improvements
- Parallel dirty scan: Scan dirty bits across sets in parallel for faster writeback
- Priority encoder: Select the next dirty way directly when a set has multiple dirty ways, instead of stepping through the ways one by one
- Writeback buffer: Buffer back-to-back writebacks to optimize memory bandwidth
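As an illustration of the priority-encoder idea, a set's dirty bits could be reduced to the lowest-numbered dirty way in a single combinational step. This is a sketch of a possible future change, not existing RTL. The module name, port names, and default `NUM_WAY` are hypothetical.

```systemverilog
// Hypothetical sketch: pick the lowest-numbered dirty way of one set.
module dirty_way_select_sketch #(
  parameter  int NUM_WAY = 4,                                   // assumed default
  localparam int WAY_W   = (NUM_WAY > 1) ? $clog2(NUM_WAY) : 1
) (
  input  logic [NUM_WAY-1:0] dirty_i,       // dirty bits of the current set
  output logic               any_dirty_o,   // at least one way is dirty
  output logic [WAY_W-1:0]   way_idx_o,     // index of the selected way
  output logic [NUM_WAY-1:0] way_onehot_o   // one-hot form of the same way
);
  always_comb begin
    way_idx_o = '0;
    // Iterate from high to low so the lowest dirty way is assigned last.
    for (int i = NUM_WAY - 1; i >= 0; i--)
      if (dirty_i[i]) way_idx_o = WAY_W'(i);
  end

  assign any_dirty_o  = |dirty_i;
  // Two's-complement trick: x & (-x) isolates the lowest set bit.
  assign way_onehot_o = dirty_i & (~dirty_i + 1'b1);
endmodule
```

With a block like this, clean ways would never cost an FSM cycle, and a set with `any_dirty_o` low could be skipped entirely.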