🧠 Level RISC-V processor — AI/ML improvement plans

This document describes improvements to the Level RISC-V processor based on artificial intelligence and machine learning techniques.


📋 Table of contents

  1. Neural Branch Predictor (GShare + Perceptron)
  2. Neural Cache Prefetcher
  3. Learned Cache Replacement Policy
  4. Load/Store Stride Predictor
  5. Hazard Prediction Unit
  6. Workload-Aware Power Management

1. Neural Branch Predictor

1.1 Current state

  • File: rtl/core/stage01_fetch/gshare_bp.sv
  • Current algorithm: Tournament predictor (GShare + bimodal + loop predictor)
  • Components:
      • GHR (Global History Register)
      • PHT (Pattern History Table) - 2-bit saturating counters
      • BTB (Branch Target Buffer)
      • RAS (Return Address Stack)
      • Loop Predictor
      • IBTC (Indirect Branch Target Cache)

1.2 Proposed improvement: perceptron branch predictor

┌─────────────────────────────────────────────────────────────────┐
│                    Perceptron Branch Predictor                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Global History (GHR)         Perceptron Weights Table         │
│   ┌─┬─┬─┬─┬─┬─┬─┬─┐           ┌────────────────────────┐        │
│   │1│0│1│1│0│1│0│1│    ───►   │ W0  W1  W2 ... Wn  Bias│        │
│   └─┴─┴─┴─┴─┴─┴─┴─┘           └────────────────────────┘        │
│         │                              │                         │
│         │         ┌────────────────────┘                         │
│         │         ▼                                              │
│         │    ┌─────────────────────┐                             │
│         └───►│  Dot Product + Bias │                             │
│              │  Σ(xi * wi) + w0    │                             │
│              └─────────────────────┘                             │
│                        │                                         │
│                        ▼                                         │
│              ┌─────────────────────┐                             │
│              │   Sign(result) ≥ 0  │───► Taken                   │
│              │   Sign(result) < 0  │───► Not Taken               │
│              └─────────────────────┘                             │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

1.3 Design details

// Perceptron branch predictor module
module perceptron_bp #(
    parameter XLEN        = 32,        // Address width (match core config)
    parameter HISTORY_LEN = 32,        // Global history length
    parameter TABLE_SIZE  = 256,       // Perceptron table size
    parameter WEIGHT_BITS = 8          // Weight bit width
)(
    input  logic                     clk_i,
    input  logic                     rst_ni,
    input  logic [XLEN-1:0]          pc_i,
    input  logic [HISTORY_LEN-1:0]   ghr_i,
    input  logic                     update_i,
    input  logic                     actual_taken_i,
    output logic                     predict_taken_o,
    output logic                     high_confidence_o
);

    // Perceptron weight table
    logic signed [WEIGHT_BITS-1:0] weights [TABLE_SIZE][HISTORY_LEN+1];

    // Prediction threshold (theta = 1.93 * history_len + 14)
    localparam int THETA = (193 * HISTORY_LEN) / 100 + 14;

    // Index calculation
    logic [$clog2(TABLE_SIZE)-1:0] idx;
    assign idx = pc_i[$clog2(TABLE_SIZE)+1:2];

    // Dot product calculation
    logic signed [WEIGHT_BITS+$clog2(HISTORY_LEN)+1:0] sum;

    always_comb begin
        sum = weights[idx][0]; // Bias
        for (int i = 0; i < HISTORY_LEN; i++) begin
            if (ghr_i[i])
                sum = sum + weights[idx][i+1];
            else
                sum = sum - weights[idx][i+1];
        end

        predict_taken_o = (sum >= 0);
        high_confidence_o = (sum > THETA) || (sum < -THETA);
    end

    // Saturating increment/decrement helpers (used by the training logic)
    function automatic logic signed [WEIGHT_BITS-1:0] sat_inc
        (input logic signed [WEIGHT_BITS-1:0] w);
        return (w == {1'b0, {(WEIGHT_BITS-1){1'b1}}}) ? w : w + 1;
    endfunction

    function automatic logic signed [WEIGHT_BITS-1:0] sat_dec
        (input logic signed [WEIGHT_BITS-1:0] w);
        return (w == {1'b1, {(WEIGHT_BITS-1){1'b0}}}) ? w : w - 1;
    endfunction

    // Training logic
    always_ff @(posedge clk_i) begin
        if (!rst_ni) begin
            // Initialize weights to 0
            for (int i = 0; i < TABLE_SIZE; i++)
                for (int j = 0; j <= HISTORY_LEN; j++)
                    weights[i][j] <= '0;
        end else if (update_i) begin
            // Update only on misprediction or low confidence
            if ((predict_taken_o != actual_taken_i) || !high_confidence_o) begin
                // Update bias
                if (actual_taken_i)
                    weights[idx][0] <= sat_inc(weights[idx][0]);
                else
                    weights[idx][0] <= sat_dec(weights[idx][0]);

                // Update history weights
                for (int i = 0; i < HISTORY_LEN; i++) begin
                    if (ghr_i[i] == actual_taken_i)
                        weights[idx][i+1] <= sat_inc(weights[idx][i+1]);
                    else
                        weights[idx][i+1] <= sat_dec(weights[idx][i+1]);
                end
            end
        end
    end

endmodule
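
Before committing to RTL, the same algorithm can be checked with a small software golden model. The Python sketch below is illustrative (class and parameter names are not part of the RTL); it mirrors the dot-product prediction and the theta-gated training rule above.

```python
class PerceptronBP:
    """Assumed Python mirror of perceptron_bp; not part of the RTL."""

    def __init__(self, history_len=8, table_size=16, weight_max=127):
        self.h = history_len
        self.theta = int(1.93 * history_len + 14)   # Jimenez & Lin threshold
        self.wmax = weight_max
        self.table_size = table_size
        # Per-entry weight vector: [bias, w1 .. wh], all zero on reset.
        self.w = [[0] * (history_len + 1) for _ in range(table_size)]

    def _sum(self, pc, ghr):
        w = self.w[pc % self.table_size]
        # History bit 1 adds its weight, bit 0 subtracts it (inputs are +/-1).
        return w[0] + sum(w[i + 1] if ghr[i] else -w[i + 1] for i in range(self.h))

    def predict(self, pc, ghr):
        return self._sum(pc, ghr) >= 0

    def train(self, pc, ghr, taken):
        s = self._sum(pc, ghr)
        # Update only on a misprediction or a weak (|sum| <= theta) output.
        if (s >= 0) != taken or abs(s) <= self.theta:
            w = self.w[pc % self.table_size]
            clamp = lambda v: max(-self.wmax - 1, min(self.wmax, v))
            w[0] = clamp(w[0] + (1 if taken else -1))
            for i in range(self.h):
                w[i + 1] = clamp(w[i + 1] + (1 if bool(ghr[i]) == taken else -1))
```

Trained on outcomes that depend on a single history bit, it converges to perfect prediction, which makes a convenient regression reference for the RTL.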

1.4 Integration plan

  1. Add the perceptron module inside gshare_bp.sv
  2. Integrate it as a third option in the tournament predictor
  3. Use a meta-predictor to choose among GShare, bimodal, and perceptron
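
The meta-predictor in step 3 can be prototyped in software first with the classic 2-bit chooser counters. The Python sketch below models a pairwise chooser (all names are illustrative, not the gshare_bp.sv interface); it drifts toward whichever component predictor has been right more often when the two disagree.

```python
class MetaChooser:
    """Assumed model of a 2-bit pairwise chooser between predictors A and B."""

    def __init__(self, size=256):
        self.ctr = [1] * size   # 0-1: prefer A, 2-3: prefer B

    def choose(self, idx, pred_a, pred_b):
        return pred_b if self.ctr[idx % len(self.ctr)] >= 2 else pred_a

    def update(self, idx, pred_a, pred_b, taken):
        i = idx % len(self.ctr)
        # Train only when the component predictors disagree.
        if pred_a != pred_b:
            if pred_b == taken:
                self.ctr[i] = min(3, self.ctr[i] + 1)
            elif pred_a == taken:
                self.ctr[i] = max(0, self.ctr[i] - 1)
```

Two such choosers in series give the three-way selection among GShare, bimodal, and the perceptron.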

1.5 Expected benefit

  • Misprediction rate: 5–15% reduction
  • Area cost: ~8 KB SRAM (256 entries × 33 weights × 8 bits ≈ 8448 bytes)
  • Latency: 1 cycle (with parallel dot product)

2. Neural Cache Prefetcher

2.1 Current state

  • File: rtl/core/mmu/cache.sv
  • Current prefetch: None
  • Cache organization: N-way set associative, PLRU replacement

2.2 Proposed improvement: perceptron-based prefetcher

┌─────────────────────────────────────────────────────────────────┐
│                    Neural Cache Prefetcher                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────┐                                              │
│  │ Access History │                                              │
│  │ ┌────────────┐ │    ┌─────────────────────────────────────┐   │
│  │ │ PC₀, Δ₀   │ │    │         Perceptron Network          │   │
│  │ │ PC₁, Δ₁   │ │───►│  ┌─────┐  ┌─────┐  ┌─────┐         │   │
│  │ │ PC₂, Δ₂   │ │    │  │ W₀  │  │ W₁  │  │ W₂  │  ...    │   │
│  │ │   ...     │ │    │  └──┬──┘  └──┬──┘  └──┬──┘         │   │
│  │ │ PCₙ, Δₙ   │ │    │     │        │        │             │   │
│  │ └────────────┘ │    │     └────────┼────────┘             │   │
│  └────────────────┘    │              ▼                      │   │
│                        │     ┌────────────────┐              │   │
│                        │     │  Σ + Threshold │              │   │
│                        │     └───────┬────────┘              │   │
│                        └─────────────┼───────────────────────┘   │
│                                      │                           │
│                                      ▼                           │
│                        ┌─────────────────────────┐               │
│                        │  Prefetch Decision      │               │
│                        │  • Delta to prefetch    │               │
│                        │  • Confidence level     │               │
│                        └─────────────────────────┘               │
│                                      │                           │
│                                      ▼                           │
│                        ┌─────────────────────────┐               │
│                        │  Prefetch Queue         │               │
│                        │  addr = current + Δ     │               │
│                        └─────────────────────────┘               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

2.3 Design details

module neural_prefetcher #(
    parameter XLEN          = 32,      // Address width (match core config)
    parameter HISTORY_DEPTH = 16,      // Access history depth
    parameter TABLE_SIZE    = 64,      // PC-indexed table size
    parameter DELTA_BITS    = 12,      // Delta (address difference) bit width
    parameter WEIGHT_BITS   = 6,       // Weight bit width
    parameter NUM_DELTAS    = 4        // Number of deltas to predict
)(
    input  logic                 clk_i,
    input  logic                 rst_ni,

    // Cache access information
    input  logic                 access_valid_i,
    input  logic [XLEN-1:0]      access_pc_i,
    input  logic [XLEN-1:0]      access_addr_i,
    input  logic                 cache_hit_i,

    // Prefetch outputs
    output logic                 prefetch_valid_o,
    output logic [XLEN-1:0]      prefetch_addr_o,
    output logic [1:0]           prefetch_confidence_o
);

    // Delta history per PC
    typedef struct packed {
        logic [DELTA_BITS-1:0] delta;
        logic                  valid;
    } delta_entry_t;

    delta_entry_t delta_history [TABLE_SIZE][HISTORY_DEPTH];
    logic [XLEN-1:0] last_addr [TABLE_SIZE];

    // Perceptron weights: predict next delta based on delta history
    logic signed [WEIGHT_BITS-1:0] weights [TABLE_SIZE][HISTORY_DEPTH];
    logic signed [WEIGHT_BITS-1:0] bias [TABLE_SIZE];

    // Delta pattern detection
    logic [$clog2(TABLE_SIZE)-1:0] pc_idx;
    logic signed [DELTA_BITS-1:0] current_delta;
    logic signed [WEIGHT_BITS+$clog2(HISTORY_DEPTH):0] prediction_sum;

    // Decision thresholds (placeholder values; tune empirically)
    localparam int signed THRESHOLD      = 4;
    localparam int signed LOW_THRESHOLD  = 4;
    localparam int signed MED_THRESHOLD  = 8;
    localparam int signed HIGH_THRESHOLD = 16;

    assign pc_idx = access_pc_i[$clog2(TABLE_SIZE)+1:2];
    assign current_delta = access_addr_i - last_addr[pc_idx];

    // Stride detection + neural prediction hybrid
    always_comb begin
        prediction_sum = bias[pc_idx];

        for (int i = 0; i < HISTORY_DEPTH; i++) begin
            if (delta_history[pc_idx][i].valid) begin
                // Feature: delta match contributes positively
                if (delta_history[pc_idx][i].delta == current_delta)
                    prediction_sum = prediction_sum + weights[pc_idx][i];
                else
                    prediction_sum = prediction_sum - (weights[pc_idx][i] >>> 1);
            end
        end

        // Prefetch decision
        prefetch_valid_o = (prediction_sum > THRESHOLD) && access_valid_i;
        prefetch_addr_o = access_addr_i + {{(XLEN-DELTA_BITS){current_delta[DELTA_BITS-1]}}, current_delta};

        // Confidence based on prediction strength
        if (prediction_sum > HIGH_THRESHOLD)
            prefetch_confidence_o = 2'b11;
        else if (prediction_sum > MED_THRESHOLD)
            prefetch_confidence_o = 2'b10;
        else if (prediction_sum > LOW_THRESHOLD)
            prefetch_confidence_o = 2'b01;
        else
            prefetch_confidence_o = 2'b00;
    end

    // Training on cache access
    always_ff @(posedge clk_i) begin
        if (!rst_ni) begin
            // Reset: invalidate the history and clear the learned state
            for (int t = 0; t < TABLE_SIZE; t++) begin
                last_addr[t] <= '0;
                bias[t]      <= '0;
                for (int i = 0; i < HISTORY_DEPTH; i++) begin
                    delta_history[t][i] <= '{delta: '0, valid: 1'b0};
                    weights[t][i]       <= '0;
                end
            end
        end else if (access_valid_i) begin
            // Update history
            for (int i = HISTORY_DEPTH-1; i > 0; i--)
                delta_history[pc_idx][i] <= delta_history[pc_idx][i-1];
            delta_history[pc_idx][0] <= '{delta: current_delta, valid: 1'b1};

            last_addr[pc_idx] <= access_addr_i;

            // Update weights based on hit/miss
            if (cache_hit_i) begin
                // Reward pattern that led to hit
                // (prefetch was useful)
            end else begin
                // Penalize - should have prefetched
            end
        end
    end

endmodule
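
The weight-update rule is left as a stub in the RTL above, so a small software model helps pin down the intended behavior first. The Python sketch below is an assumption, not the RTL's specification: it scores a candidate delta the way the always_comb block does (matching history slots add their weight, others subtract half), and as an assumed training rule it strengthens matching slots on a miss. Class and parameter names are illustrative.

```python
class NeuralPrefetcher:
    """Assumed software model of the delta-matching prefetcher (single PC)."""

    def __init__(self, history_depth=4, threshold=2, wmax=31):
        self.hist = []                 # most-recent-first delta history
        self.w = [1] * history_depth   # start mildly positive
        self.bias = 0
        self.depth = history_depth
        self.threshold = threshold
        self.wmax = wmax
        self.last_addr = None

    def access(self, addr, hit):
        """Returns a prefetch address, or None."""
        if self.last_addr is None:
            self.last_addr = addr
            return None
        delta = addr - self.last_addr
        # Score: matching history slots add their weight, others subtract half.
        score = self.bias
        for i, d in enumerate(self.hist):
            score += self.w[i] if d == delta else -(self.w[i] >> 1)
        # Assumed training rule: on a miss, strengthen slots that matched.
        if not hit:
            for i, d in enumerate(self.hist):
                if d == delta and self.w[i] < self.wmax:
                    self.w[i] += 1
        self.hist = ([delta] + self.hist)[: self.depth]
        self.last_addr = addr
        return addr + delta if score > self.threshold else None
```

On a constant-stride miss stream (0, 64, 128, ...) the score crosses the threshold after two confirmed deltas, and the model starts issuing addr + 64 prefetches.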

2.4 Integration points

  1. Add a prefetch interface to the cache.sv module
  2. Coordinate prefetch queue with memory.sv
  3. Prefetch buffer (separate small buffer or cache way)

2.5 Expected benefit

  • Cache hit rate: 10–25% increase
  • Memory latency: ~15–30% average reduction
  • Area cost: ~1KB (64 entries × 16 history × 6-bit weight)

3. Learned Cache Replacement Policy

3.1 Current state

  • Algorithm: PLRU (pseudo least recently used)
  • File: rtl/core/mmu/cache.sv
  • Functions: update_node(), compute_evict_way()

3.2 Proposed improvement: Hawkeye-inspired learned replacement

┌─────────────────────────────────────────────────────────────────┐
│              Learned Cache Replacement Policy                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────┐     ┌─────────────────────────────┐    │
│  │   OPT Simulator     │     │   Predictor (Perceptron)    │    │
│  │   (Offline/Shadow)  │     │                             │    │
│  │                     │     │   PC ──► ┌─────────┐        │    │
│  │   Belady's OPT      │────►│          │ Weights │ ──► Prediction
│  │   approximation     │     │   Type ─►└─────────┘        │    │
│  │                     │     │                             │    │
│  └─────────────────────┘     └─────────────────────────────┘    │
│           │                              │                       │
│           │ Training                     │ Eviction              │
│           │ Labels                       │ Decision              │
│           ▼                              ▼                       │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                  Cache Controller                        │    │
│  │   • cache-friendly → high priority (keep)               │    │
│  │   • cache-averse → low priority (evict first)           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

3.3 Design details

module learned_replacement #(
    parameter NUM_WAY     = 4,
    parameter NUM_SET     = 64,
    parameter PC_BITS     = 12,
    parameter WEIGHT_BITS = 4
)(
    input  logic                     clk_i,
    input  logic                     rst_ni,

    // Access info
    input  logic                     access_valid_i,
    input  logic [PC_BITS-1:0]       access_pc_i,
    input  logic [$clog2(NUM_SET)-1:0] set_idx_i,
    input  logic [NUM_WAY-1:0]       hit_way_i,
    input  logic                     is_load_i,

    // Eviction decision
    output logic [NUM_WAY-1:0]         evict_way_o,
    output logic [NUM_WAY-1:0][2:0]    priority_o   // Per-way RRPV priority
);

    // RRIP-style priority counters per cache line
    logic [2:0] rrpv [NUM_SET][NUM_WAY];  // Re-Reference Prediction Value

    // PC-indexed predictor: is this PC cache-friendly?
    logic signed [WEIGHT_BITS-1:0] pc_weights [2**PC_BITS];

    // Saturation limits for the per-PC counters
    localparam int signed MAX_WEIGHT = 2**(WEIGHT_BITS-1) - 1;
    localparam int signed MIN_WEIGHT = -(2**(WEIGHT_BITS-1));

    // Prediction
    logic cache_friendly;
    assign cache_friendly = (pc_weights[access_pc_i] >= 0);

    // Update RRPV on access
    always_ff @(posedge clk_i) begin
        if (!rst_ni) begin
            // Initialize all RRPV to distant (7)
            for (int s = 0; s < NUM_SET; s++)
                for (int w = 0; w < NUM_WAY; w++)
                    rrpv[s][w] <= 3'd7;
        end else if (access_valid_i) begin
            if (|hit_way_i) begin
                // Hit: set RRPV based on prediction
                for (int w = 0; w < NUM_WAY; w++) begin
                    if (hit_way_i[w]) begin
                        if (cache_friendly)
                            rrpv[set_idx_i][w] <= 3'd0;  // Near re-reference
                        else
                            rrpv[set_idx_i][w] <= 3'd2;  // Intermediate
                    end
                end
            end
        end
    end

    // Eviction: choose way with highest RRPV
    always_comb begin
        logic [2:0] max_rrpv;
        max_rrpv = 3'd0;
        evict_way_o = '0;
        priority_o = '0;

        // Find way with max RRPV (least likely to be reused)
        for (int w = 0; w < NUM_WAY; w++) begin
            priority_o[w] = rrpv[set_idx_i][w];
            if (rrpv[set_idx_i][w] >= max_rrpv) begin
                max_rrpv = rrpv[set_idx_i][w];
                evict_way_o = (1 << w);
            end
        end
    end

    // Training: update PC weights based on hit/miss
    always_ff @(posedge clk_i) begin
        if (!rst_ni) begin
            for (int p = 0; p < 2**PC_BITS; p++)
                pc_weights[p] <= '0;
        end else if (access_valid_i) begin
            if (|hit_way_i) begin
                // Hit: this PC is cache-friendly
                if (pc_weights[access_pc_i] < MAX_WEIGHT)
                    pc_weights[access_pc_i] <= pc_weights[access_pc_i] + 1;
            end else begin
                // Miss: this PC might be cache-averse
                if (pc_weights[access_pc_i] > MIN_WEIGHT)
                    pc_weights[access_pc_i] <= pc_weights[access_pc_i] - 1;
            end
        end
    end

endmodule
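
For trace-driven evaluation before touching cache.sv, the RRPV-plus-classifier scheme can be modeled in Python for a single set. Two details here are assumptions: the insertion RRPV of 2 for fills (the RTL above does not show the fill path), and tie-breaking toward the higher way index, matching the >= comparison in the eviction loop.

```python
class LearnedReplacement:
    """Assumed Python model of learned_replacement for one cache set."""

    def __init__(self, num_way=4, wmin=-8, wmax=7):
        self.rrpv = [7] * num_way   # distant re-reference on reset
        self.pc_w = {}              # PC -> cache-friendliness counter
        self.wmin, self.wmax = wmin, wmax

    def friendly(self, pc):
        return self.pc_w.get(pc, 0) >= 0

    def access(self, pc, hit_way):
        """hit_way is the hitting way index, or None on a miss.
        Returns the victim way on a miss, else None."""
        if hit_way is not None:
            # Hit: near re-reference for friendly PCs, intermediate otherwise.
            self.rrpv[hit_way] = 0 if self.friendly(pc) else 2
            self.pc_w[pc] = min(self.wmax, self.pc_w.get(pc, 0) + 1)
            return None
        self.pc_w[pc] = max(self.wmin, self.pc_w.get(pc, 0) - 1)
        # Evict the highest RRPV; ties go to the higher way index (RTL's >=).
        victim = max(range(len(self.rrpv)), key=lambda w: (self.rrpv[w], w))
        self.rrpv[victim] = 2   # assumed insertion RRPV for the new fill
        return victim
```

Recently hit ways are protected (RRPV 0), so misses evict the untouched distant ways first, which is the scan-resistant behavior the section is after.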

3.4 Integration plan

  1. Change compute_evict_way() in cache.sv
  2. Use learned replacement instead of PLRU
  3. Hybrid mode: fall back to PLRU when confidence is low

3.5 Expected benefit

  • Hit rate: 5–12% increase (workload-dependent)
  • Streaming access: Large improvement (more scan-resistant)

4. Load/Store Stride Predictor

4.1 Current state

  • File: rtl/core/stage04_memory/memory.sv
  • Stride detection: None

4.2 Proposed improvement: neural stride predictor

┌─────────────────────────────────────────────────────────────────┐
│                   Neural Stride Predictor                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Load/Store PC ──────┬──────────────────────────────────────┐   │
│                       │                                       │   │
│                       ▼                                       │   │
│            ┌─────────────────────┐                           │   │
│            │  Stride History     │                           │   │
│            │  Table (PC-indexed) │                           │   │
│            │  ┌───────────────┐  │                           │   │
│            │  │Last Addr      │  │                           │   │
│            │  │Stride         │  │                           │   │
│            │  │Confidence     │  │                           │   │
│            │  │State (train/  │  │                           │   │
│            │  │ steady/no)    │  │                           │   │
│            │  └───────────────┘  │                           │   │
│            └─────────────────────┘                           │   │
│                       │                                       │   │
│                       ▼                                       │   │
│            ┌─────────────────────┐                           │   │
│            │  Perceptron Layer   │                           │   │
│            │  (stride pattern    │                           │   │
│            │   prediction)       │                           │   │
│            └─────────────────────┘                           │   │
│                       │                                       │   │
│                       ▼                                       │   │
│            ┌─────────────────────┐     ┌──────────────────┐  │   │
│            │ Predicted Next Addr │────►│ Prefetch Request │  │   │
│            └─────────────────────┘     └──────────────────┘  │   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

4.3 Design details

module stride_predictor #(
    parameter XLEN        = 32,        // Address width (match core config)
    parameter TABLE_SIZE  = 64,
    parameter STRIDE_BITS = 16,
    parameter CONF_BITS   = 3
)(
    input  logic                 clk_i,
    input  logic                 rst_ni,

    // Memory access
    input  logic                 mem_valid_i,
    input  logic [XLEN-1:0]      mem_pc_i,
    input  logic [XLEN-1:0]      mem_addr_i,
    input  logic                 is_load_i,

    // Prediction output
    output logic                 predict_valid_o,
    output logic [XLEN-1:0]      predict_addr_o,
    output logic [CONF_BITS-1:0] confidence_o
);

    typedef enum logic [1:0] {
        INIT,       // First access
        TRAINING,   // Learning stride
        STEADY,     // Stable stride, predicting
        NO_STRIDE   // Irregular pattern
    } state_e;

    typedef struct packed {
        logic [XLEN-1:0]        last_addr;
        logic signed [STRIDE_BITS-1:0] stride;
        logic [CONF_BITS-1:0]   confidence;
        state_e                 state;
    } stride_entry_t;

    localparam int MAX_CONF = 2**CONF_BITS - 1;

    // Stride history table ("table" is a reserved word in SystemVerilog)
    stride_entry_t sht [TABLE_SIZE];

    logic [$clog2(TABLE_SIZE)-1:0] idx;
    logic signed [STRIDE_BITS-1:0] current_stride;

    assign idx = mem_pc_i[$clog2(TABLE_SIZE)+1:2];
    assign current_stride = mem_addr_i - sht[idx].last_addr;

    // Prediction logic
    always_comb begin
        predict_valid_o = 1'b0;
        predict_addr_o = '0;
        confidence_o = '0;

        if (sht[idx].state == STEADY && sht[idx].confidence > 4) begin
            predict_valid_o = 1'b1;
            predict_addr_o = mem_addr_i + {{(XLEN-STRIDE_BITS){sht[idx].stride[STRIDE_BITS-1]}},
                                           sht[idx].stride};
            confidence_o = sht[idx].confidence;
        end
    end

    // Training FSM
    always_ff @(posedge clk_i) begin
        if (!rst_ni) begin
            for (int i = 0; i < TABLE_SIZE; i++) begin
                sht[i] <= '{default: '0, state: INIT};
            end
        end else if (mem_valid_i && is_load_i) begin
            case (sht[idx].state)
                INIT: begin
                    sht[idx].last_addr <= mem_addr_i;
                    sht[idx].state <= TRAINING;
                end

                TRAINING: begin
                    sht[idx].stride <= current_stride;
                    sht[idx].last_addr <= mem_addr_i;
                    sht[idx].confidence <= 1;
                    sht[idx].state <= STEADY;
                end

                STEADY: begin
                    sht[idx].last_addr <= mem_addr_i;
                    if (current_stride == sht[idx].stride) begin
                        // Stride matches, increase confidence
                        if (sht[idx].confidence < MAX_CONF)
                            sht[idx].confidence <= sht[idx].confidence + 1;
                    end else begin
                        // Stride mismatch
                        if (sht[idx].confidence > 0)
                            sht[idx].confidence <= sht[idx].confidence - 1;
                        else begin
                            sht[idx].stride <= current_stride;
                            sht[idx].state <= TRAINING;
                        end
                    end
                end

                NO_STRIDE: begin
                    // Periodically retry
                    sht[idx].state <= INIT;
                end
            endcase
        end
    end

endmodule
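
The training FSM is easy to check against a Python mirror driven by an address trace. The sketch below is an assumed model (the NO_STRIDE state is omitted because no transition in the RTL above reaches it); names are illustrative.

```python
class StridePredictor:
    """Assumed Python mirror of the stride_predictor training FSM."""

    INIT, TRAINING, STEADY = 0, 1, 2

    def __init__(self, max_conf=7, conf_threshold=4):
        self.entries = {}          # pc -> [last_addr, stride, confidence, state]
        self.max_conf = max_conf
        self.thr = conf_threshold

    def access(self, pc, addr):
        e = self.entries.setdefault(pc, [0, 0, 0, self.INIT])
        last, stride, conf, state = e
        # Predict before updating, as the RTL does combinationally.
        pred = addr + stride if (state == self.STEADY and conf > self.thr) else None
        if state == self.INIT:
            e[:] = [addr, 0, 0, self.TRAINING]
        elif state == self.TRAINING:
            e[:] = [addr, addr - last, 1, self.STEADY]
        else:  # STEADY
            cur = addr - last
            if cur == stride:
                e[:] = [addr, stride, min(self.max_conf, conf + 1), self.STEADY]
            elif conf > 0:
                e[:] = [addr, stride, conf - 1, self.STEADY]
            else:
                e[:] = [addr, cur, 0, self.TRAINING]
        return pred
```

Driven with an 8-byte-stride load stream, the model starts predicting on the seventh access, once confidence exceeds the threshold of 4.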

4.4 Expected benefit

  • Array traversal: Very high hit rate
  • Linked list: Low (pointer chasing)
  • Matrix ops: 80%+ prefetch accuracy with stride patterns

5. Hazard Prediction Unit

5.1 Current state

  • File: rtl/core/hazard_unit.sv
  • Current behavior: Reactive forwarding and stalls

5.2 Proposed improvement: predictive hazard detection

┌─────────────────────────────────────────────────────────────────┐
│                  Predictive Hazard Unit                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐                                            │
│  │ Instruction     │                                            │
│  │ Sequence        │                                            │
│  │ History         │                                            │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │            Pattern Matching / Neural Network             │    │
│  │                                                          │    │
│  │  Sequence ──► [Hazard Pattern DB] ──► Prediction        │    │
│  │                                                          │    │
│  │  Features:                                               │    │
│  │  • Opcode sequence                                       │    │
│  │  • Register dependency graph                             │    │
│  │  • Memory access pattern                                 │    │
│  │                                                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Proactive Actions:                                      │    │
│  │  • Early stall insertion                                 │    │
│  │  • Speculative forwarding path activation                │    │
│  │  • Instruction reordering hints                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

5.3 Use cases

  • Load-use hazard prediction
  • Long-latency operation detection (div/mul)
  • Memory stall prediction
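
The first use case reduces to a simple check over consecutive decoded instructions. The Python sketch below illustrates the condition only; the field names are illustrative and not the hazard_unit.sv interface.

```python
def load_use_hazard(prev, curr):
    """True if curr reads the register that load prev is still fetching."""
    return bool(prev["is_load"]
                and prev["rd"] != 0                        # x0 is hardwired zero
                and prev["rd"] in (curr["rs1"], curr["rs2"]))
```

A predictive unit would learn which static instruction pairs trip this check and insert the stall (or wake the forwarding path) before decode confirms it.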

6. Workload-Aware Power Management

6.1 Current state

  • No dynamic power management

6.2 Proposed improvement: neural DVFS controller

┌─────────────────────────────────────────────────────────────────┐
│              Neural Power Management Unit                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                 Activity Monitors                        │    │
│  │  • IPC (Instructions Per Cycle)                         │    │
│  │  • Cache miss rate                                       │    │
│  │  • Branch misprediction rate                            │    │
│  │  • Memory bandwidth utilization                          │    │
│  │  • Stall cycles                                          │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                      │
│                           ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Neural Network Predictor                    │    │
│  │                                                          │    │
│  │  Input: Activity metrics (sliding window)               │    │
│  │  Output: Predicted workload phase                        │    │
│  │                                                          │    │
│  │  Phases:                                                 │    │
│  │  • Compute-intensive (high freq needed)                 │    │
│  │  • Memory-bound (can reduce freq)                        │    │
│  │  • Idle (aggressive power saving)                        │    │
│  │  • Mixed                                                 │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                           │                                      │
│                           ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              DVFS Controller                             │    │
│  │                                                          │    │
│  │  • Frequency scaling                                     │    │
│  │  • Voltage adjustment                                    │    │
│  │  • Clock gating decisions                                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

6.3 Design details

module neural_power_manager #(
    parameter WINDOW_SIZE = 1024,  // Sampling window
    parameter NUM_FEATURES = 8,
    parameter HIDDEN_SIZE = 16
)(
    input  logic        clk_i,
    input  logic        rst_ni,

    // Performance counters
    input  logic [31:0] cycle_count_i,
    input  logic [31:0] inst_count_i,
    input  logic [31:0] cache_miss_i,
    input  logic [31:0] branch_miss_i,
    input  logic [31:0] stall_cycles_i,

    // Power control outputs
    output logic [2:0]  freq_level_o,      // 0=lowest, 7=highest
    output logic        clock_gate_o,       // Gate unused units
    output logic [3:0]  active_units_o      // Which units to keep active
);

    // Feature extraction (to be driven from the performance counters
    // over each WINDOW_SIZE-cycle sampling window)
    logic [15:0] ipc;           // Instructions per cycle (fixed point)
    logic [15:0] miss_rate;     // Cache miss rate
    logic [15:0] stall_rate;    // Stall percentage

    // Phase thresholds, in the same fixed-point scale as the metrics
    // (placeholder values; tune per target workload)
    localparam logic [15:0] HIGH_IPC_THRESHOLD  = 16'd200;
    localparam logic [15:0] LOW_MISS_THRESHOLD  = 16'd5;
    localparam logic [15:0] HIGH_MISS_THRESHOLD = 16'd50;
    localparam logic [15:0] IDLE_THRESHOLD      = 16'd180;

    // Simple perceptron for phase detection
    typedef enum logic [1:0] {
        PHASE_COMPUTE,
        PHASE_MEMORY,
        PHASE_IDLE,
        PHASE_MIXED
    } phase_e;

    phase_e current_phase;

    // Phase detection logic
    always_ff @(posedge clk_i) begin
        if (!rst_ni) begin
            current_phase <= PHASE_MIXED;
            freq_level_o <= 3'd4;   // Medium frequency
            clock_gate_o <= 1'b0;
            active_units_o <= 4'hF; // All units active
        end else begin
            // Simple heuristic-based phase detection
            // (can be replaced with a trained neural network)
            clock_gate_o <= 1'b0;   // Default: no gating
            if (ipc > HIGH_IPC_THRESHOLD && miss_rate < LOW_MISS_THRESHOLD) begin
                current_phase <= PHASE_COMPUTE;
                freq_level_o <= 3'd7;  // Max frequency
            end else if (miss_rate > HIGH_MISS_THRESHOLD) begin
                current_phase <= PHASE_MEMORY;
                freq_level_o <= 3'd3;  // Lower frequency (memory bound)
            end else if (stall_rate > IDLE_THRESHOLD) begin
                current_phase <= PHASE_IDLE;
                freq_level_o <= 3'd1;  // Minimum frequency
                clock_gate_o <= 1'b1;
            end else begin
                current_phase <= PHASE_MIXED;
                freq_level_o <= 3'd4;
            end
        end
    end

endmodule
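
The heuristic above can be mirrored in Python for offline tuning of the thresholds against performance-counter traces. The threshold values below are placeholders (here as plain ratios rather than the RTL's fixed-point encoding).

```python
def classify_phase(ipc, miss_rate, stall_rate,
                   high_ipc=0.8, low_miss=0.02, high_miss=0.2, idle_stall=0.7):
    """Assumed mirror of the neural_power_manager heuristic.

    Returns (phase_name, freq_level); thresholds are placeholder values.
    """
    if ipc > high_ipc and miss_rate < low_miss:
        return ("COMPUTE", 7)   # compute-bound: max frequency
    if miss_rate > high_miss:
        return ("MEMORY", 3)    # memory-bound: lower frequency
    if stall_rate > idle_stall:
        return ("IDLE", 1)      # mostly stalled: minimum frequency, gate clocks
    return ("MIXED", 4)
```

Sweeping the four thresholds over recorded counter traces gives the operating points before any of this is frozen into RTL (or replaced by a trained network).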

📊 Comparison table

Feature                  Area cost  Design complexity  Expected benefit          Priority
Neural branch predictor  ~8 KB      Medium             5–15% misprediction ↓     ⭐⭐⭐⭐
Neural cache prefetcher  ~1 KB      Medium             10–25% hit rate ↑         ⭐⭐⭐⭐⭐
Learned replacement      ~0.5 KB    Low                5–12% hit rate ↑          ⭐⭐⭐
Stride predictor         ~0.5 KB    Low                High for array workloads  ⭐⭐⭐
Hazard prediction        ~0.25 KB   Medium             5–10% stall ↓             ⭐⭐
Power management         ~0.5 KB    High               20–40% power ↓            ⭐⭐⭐

🚀 Implementation roadmap

Phase 1: Core infrastructure (1–2 weeks)

  1. Add performance counters (IPC, cache miss, branch miss)
  2. Mechanism to collect training data

Phase 2: First AI module (2–3 weeks)

  1. Neural cache prefetcher implementation
  2. Hybrid approach with stride predictor
  3. Benchmark tests

Phase 3: Branch predictor upgrade (2–3 weeks)

  1. Perceptron predictor module
  2. Tournament predictor integration
  3. A/B test infrastructure

Phase 4: Advanced features (3–4 weeks)

  1. Learned cache replacement
  2. Power management
  3. Full system integration

📚 References

  1. Perceptron Branch Predictor: Jiménez & Lin, "Dynamic Branch Prediction with Perceptrons", HPCA 2001
  2. Hawkeye Cache: Jain & Lin, "Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement", ISCA 2016
  3. Neural Prefetching: Hashemi et al., "Learning Memory Access Patterns", ICML 2018
  4. RRIP Replacement: Jaleel et al., "High Performance Cache Replacement Using Re-Reference Interval Prediction", ISCA 2010

Last updated: December 2025