@icedmoca
Created September 6, 2025 14:12
verilog sha256_90r_fpga.c optimization

SHA256-90R FPGA Pipeline Design Research Documentation

This document describes the Verilog implementation of a 90-stage SHA256-90R FPGA pipeline, converted from a C simulation of the hardware pipeline. The design implements a pipelined SHA-256 hash function extended to 90 rounds and targets FPGA hardware, with constant-time operation to mitigate timing attacks.

Overview

The SHA256-90R pipeline is a hardware implementation of the SHA-256 cryptographic hash function, extended to 90 rounds for enhanced security or specific application requirements. Because standard SHA-256 uses 64 rounds, the 90-round output is not interchangeable with standard SHA-256 digests. The design processes a 512-bit input block and produces a 256-bit hash output, and is designed for a throughput of one hash per clock cycle after an initial pipeline fill-up period. The implementation is constant-time: it uses bitwise masking rather than data-dependent branching, so execution is the same regardless of input data.

Key Features

  • 90-Stage Pipeline: Each stage performs one round of the SHA-256 compression function.
  • Constant-Time Operation: Uses masking to prevent timing-based side-channel attacks.
  • Input/Output: Accepts a 512-bit input block and produces a 256-bit hash.
  • Message Schedule: Pre-computes 90 message words from the input block.
  • Synchronous Design: Operates with a single clock and an asynchronous active-low reset (rst_n).

Module Structure

The Verilog code consists of two main modules:

  1. sha256_round: A combinational module that implements a single SHA-256 round with masking.
  2. sha256_90r_pipeline: The top-level module that manages the 90-stage pipeline, message schedule computation, and output generation.

1. sha256_round Module

This module performs a single SHA-256 round computation, equivalent to the fpga_round_masked function in the original C code.

Inputs

  • a, b, c, d, e, f, g, h (32-bit each): Current state variables.
  • w (32-bit): Message word for the round.
  • k (32-bit): Round constant.
  • valid (1-bit): Valid flag to enable/disable computation.

Outputs

  • a_out, b_out, c_out, d_out, e_out, f_out, g_out, h_out (32-bit each): Updated state variables.

Functionality

  • Computes the SHA-256 round function (here >>> denotes a 32-bit right rotation, not Verilog's arithmetic shift):
    • t1 = h + Σ1(e) + Ch(e,f,g) + k + w, where:
      • Σ1(e) = (e >>> 6) ^ (e >>> 11) ^ (e >>> 25)
      • Ch(e,f,g) = (e & f) ^ (~e & g)
    • t2 = Σ0(a) + Maj(a,b,c), where:
      • Σ0(a) = (a >>> 2) ^ (a >>> 13) ^ (a >>> 22)
      • Maj(a,b,c) = (a & b) ^ (a & c) ^ (b & c)
  • Updates state variables:
    • h_out = g, g_out = f, f_out = e, e_out = d + t1
    • d_out = c, c_out = b, b_out = a, a_out = t1 + t2
  • Applies a valid_mask (0xFFFFFFFF if valid=1, else 0) to ensure constant-time updates by conditionally preserving the current state if valid=0.
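
The round function and masking described above can be sketched as a small C reference model. This is an illustrative sketch only: the names round_masked_ref and rotr32 are hypothetical and are not taken from the original C source, and the state is assumed to be stored as an array s[0..7] holding a..h.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
uint32_t rotr32(uint32_t x, unsigned n) {
    assert(n > 0u && n < 32u);
    return (x >> n) | (x << (32u - n));
}

/* Hypothetical reference model of one masked round.
   s[0..7] holds a..h and is updated in place; valid must be 0 or 1. */
void round_masked_ref(uint32_t s[8], uint32_t w, uint32_t k, unsigned valid) {
    uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
    uint32_t e = s[4], f = s[5], g = s[6], h = s[7];

    /* All-ones when valid=1, all-zeros when valid=0, without a branch. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(valid & 1u);

    uint32_t big_sigma1 = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
    uint32_t ch = (e & f) ^ (~e & g);
    uint32_t t1 = h + big_sigma1 + ch + k + w;

    uint32_t big_sigma0 = rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22);
    uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
    uint32_t t2 = big_sigma0 + maj;

    /* Select the new value when valid=1, keep the old value when valid=0. */
    s[0] = ((t1 + t2) & mask) | (a & ~mask);
    s[1] = (a & mask) | (b & ~mask);
    s[2] = (b & mask) | (c & ~mask);
    s[3] = (c & mask) | (d & ~mask);
    s[4] = ((d + t1) & mask) | (e & ~mask);
    s[5] = (e & mask) | (f & ~mask);
    s[6] = (f & mask) | (g & ~mask);
    s[7] = (g & mask) | (h & ~mask);
}
```

Feeding the SHA-256 initial state with the first message word of the padded message "abc" (0x61626380) and k = 0x428a2f98 gives a quick check against the published round-0 intermediate values of the standard.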

2. sha256_90r_pipeline Module

This is the top-level module that implements the 90-stage pipeline and message schedule computation.

Inputs

  • clk: Clock signal for synchronous operation.
  • rst_n: Active-low reset signal.
  • input_valid: Indicates a valid 512-bit input block.
  • data_in (512-bit): Input block to be hashed.

Outputs

  • output_valid: Indicates a valid hash output.
  • hash_out (256-bit): The computed SHA-256 hash.

Internal Components

  • Constants (k): A 96-entry array of 32-bit round constants (k[0:95]). Entries 0–63 are the standard SHA-256 constants, entries 64–89 extend the table to 90 rounds (k[64] repeats the value of k[63]), and entries 90–95 are zero padding.
  • Pipeline Registers:
    • stage_a, stage_b, ..., stage_h (90 x 32-bit): State variables for each pipeline stage.
    • stage_w (90 x 32-bit): Message words.
    • stage_k (90 x 32-bit): Round constants.
    • stage_valid (90 x 1-bit): Valid flags.
    • current_stage (7-bit): Tracks the current round (0–89).
    • pipeline_filled: Indicates when the pipeline is fully loaded.
  • Message Schedule Registers:
    • m (90 x 32-bit): Stores the pre-computed message schedule.
    • msg_cycle (7-bit): Tracks the message schedule computation progress.

Functionality

  • Message Schedule Computation:
    • While input_valid=1, computes the 90 message words (m[0:89]), one word per clock cycle:
      • First 16 words are extracted from data_in (big-endian).
      • Words 16–89 are computed using the SHA-256 expansion formula:
        m[i] = m[i-16] + σ0(m[i-15]) + m[i-7] + σ1(m[i-2])
        where >>> is a 32-bit right rotation and >> is a logical right shift:
        • σ0(x) = (x >>> 7) ^ (x >>> 18) ^ (x >> 3)
        • σ1(x) = (x >>> 17) ^ (x >>> 19) ^ (x >> 10)
    • Takes 90 clock cycles to complete, with input_valid held high throughout.
  • Pipeline Operation:
    • On each clock cycle:
      • Shifts pipeline stages (stages 1–89 copy from stage 0–88).
      • Stage 0 is initialized with the SHA-256 initial state (H0–H7) if input_valid=1, masked to preserve existing values if input_valid=0.
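      • Each stage (1–89) registers the output of a sha256_round instance fed by the previous stage, so the listed code instantiates 89 round modules and stage 0 only loads the initial state.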
    • Updates current_stage and pipeline_filled when input_valid=1.
  • Output:
    • output_valid is set to stage_valid[89].
    • hash_out is set to {stage_a[89], ..., stage_h[89]} when valid.
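
The message schedule expansion described above can be modeled in C as follows. This is a minimal sketch, assuming the 512-bit block is supplied as 64 bytes in big-endian word order; expand_schedule_ref, rotr32 and SCHEDULE_WORDS are hypothetical names chosen for illustration. Unlike the hardware, which writes one word per clock cycle, the model computes all 90 words in one call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCHEDULE_WORDS 90u  /* m[0:89], as in the document */

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
uint32_t rotr32(uint32_t x, unsigned n) {
    assert(n > 0u && n < 32u);
    return (x >> n) | (x << (32u - n));
}

/* Expand one 64-byte block into 90 schedule words. */
void expand_schedule_ref(const uint8_t block[64], uint32_t m[SCHEDULE_WORDS]) {
    size_t i;

    /* Words 0..15: big-endian load from the input block. */
    for (i = 0; i < 16u; i++) {
        m[i] = ((uint32_t)block[4u * i] << 24) |
               ((uint32_t)block[4u * i + 1u] << 16) |
               ((uint32_t)block[4u * i + 2u] << 8) |
               (uint32_t)block[4u * i + 3u];
    }

    /* Words 16..89: the SHA-256 recurrence, continued past word 63. */
    for (i = 16u; i < SCHEDULE_WORDS; i++) {
        uint32_t s0 = rotr32(m[i - 15u], 7) ^ rotr32(m[i - 15u], 18) ^ (m[i - 15u] >> 3);
        uint32_t s1 = rotr32(m[i - 2u], 17) ^ rotr32(m[i - 2u], 19) ^ (m[i - 2u] >> 10);
        m[i] = m[i - 16u] + s0 + m[i - 7u] + s1;
    }
}
```

For the padded one-block message "abc", words 16 and 17 are easy to check by hand because most of the input words are zero: m[16] equals m[0], and m[17] reduces to σ1(0x18).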

Operation

  1. Reset: On rst_n=0, all registers are cleared.
  2. Input Phase: When input_valid=1, the module loads the 512-bit data_in and computes the message schedule (m[0:89]) over 90 clock cycles.
  3. Pipeline Processing: Each cycle, the pipeline shifts data, with stage 0 loading initial state and message words, and stages 1–89 performing round computations.
  4. Output Phase: After 179 cycles (90 input + 89 drain), output_valid goes high, and hash_out contains the final hash.
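
The 179-cycle figure is the sum of the 90 schedule cycles and the 89 shifts a value needs to travel from stage 0 to stage 89. The toy C model below illustrates only that arithmetic, using a chain of valid flags; drain_cycles and total_latency_cycles are hypothetical names, and the model does not claim to reproduce the RTL cycle for cycle.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PIPELINE_DEPTH 90u   /* stages 0..89, as in the document */
#define SCHEDULE_CYCLES 90u  /* cycles the document allots to the schedule */

/* Count the shifts needed for a flag loaded into stage 0
   to reach the last stage of a chain with `depth` stages. */
unsigned drain_cycles(unsigned depth) {
    uint8_t valid[PIPELINE_DEPTH];
    unsigned cycles = 0;
    unsigned i;

    assert(depth >= 1u && depth <= PIPELINE_DEPTH);
    memset(valid, 0, sizeof valid);
    valid[0] = 1;

    while (!valid[depth - 1u]) {
        /* Shift from the top down so each stage sees its neighbour's old value. */
        for (i = depth - 1u; i > 0u; i--) {
            valid[i] = valid[i - 1u];
        }
        valid[0] = 0;
        cycles++;
    }
    return cycles;
}

/* Schedule cycles plus drain cycles, following the document's accounting. */
unsigned total_latency_cycles(void) {
    return SCHEDULE_CYCLES + drain_cycles(PIPELINE_DEPTH);
}
```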

Constant-Time Design

  • The pipeline uses a valid_mask in the sha256_round module to ensure constant-time operation, preventing timing attacks by performing computations regardless of input validity.
  • All pipeline stages are updated every clock cycle, even if input_valid=0, maintaining consistent timing.
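
The mask-based update can be reduced to a two-input select. The C sketch below shows the idiom in isolation; flag_to_mask and masked_select are hypothetical names. In software the point is to avoid a branch that depends on the flag; in synthesized logic the same expression will typically become a plain 2:1 multiplexer.

```c
#include <assert.h>
#include <stdint.h>

/* Expand a 1-bit flag into an all-ones or all-zeros word without branching. */
uint32_t flag_to_mask(unsigned flag) {
    assert(flag <= 1u);
    return (uint32_t)0 - (uint32_t)flag;
}

/* Return `next` when flag is 1 and `current` when flag is 0,
   using only AND, OR and NOT on the data words. */
uint32_t masked_select(unsigned flag, uint32_t next, uint32_t current) {
    uint32_t mask = flag_to_mask(flag);
    return (next & mask) | (current & ~mask);
}
```

Both operands are always evaluated, so the amount of work does not depend on the flag.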

Resource Usage

Based on the original C code’s estimation:

  • LUTs: ~500 per stage x 90 stages = ~45,000 LUTs.
  • Flip-Flops: ~256 per stage x 90 stages = ~23,040 FFs.
  • BRAM: ~4 blocks for constants and message storage.
  • DSP Slices: 0 (pure logic implementation).
  • Max Frequency: Estimated at 300 MHz, though actual performance depends on the FPGA and synthesis tool.
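
The totals above are simple products of the per-stage estimates, and the same inputs give rough latency and input-bandwidth figures. The C sketch below reproduces that arithmetic; it assumes the document's own numbers (90 stages, 300 MHz, 179 cycles, one 512-bit block per clock) and says nothing about what a real synthesis run would report. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STAGES 90u
#define LUTS_PER_STAGE 500u
#define FFS_PER_STAGE 256u
#define CLOCK_HZ 300000000u   /* estimated maximum frequency */
#define LATENCY_CYCLES 179u   /* 90 schedule + 89 drain */
#define BLOCK_BITS 512u

/* Multiply a per-stage estimate by the stage count, guarding against overflow. */
uint32_t scaled_total(uint32_t per_stage, uint32_t stages) {
    assert(stages != 0u && per_stage <= UINT32_MAX / stages);
    return per_stage * stages;
}

/* Latency of one hash in nanoseconds at the estimated clock. */
double latency_ns(void) {
    return (double)LATENCY_CYCLES * 1e9 / (double)CLOCK_HZ;
}

/* Input bandwidth in Gbit/s if one 512-bit block is accepted per clock. */
double input_gbit_per_s(void) {
    return (double)BLOCK_BITS * (double)CLOCK_HZ / 1e9;
}
```

With these inputs the model gives 45,000 LUTs and 23,040 flip-flops, a latency just under 600 ns, and an input bandwidth of 153.6 Gbit/s.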

Limitations

  • Batch Processing: The original C code included batch processing for multiple pipelines, which is omitted here for simplicity but can be added as multiple instances of sha256_90r_pipeline.
  • Simulation: The design requires 179 cycles to produce a hash, which may be slow on online simulators like JDoodle due to large array sizes.

Testbench

A testbench (sha256_90r_pipeline_tb.v) is provided to simulate the design:

  • Generates a 100 MHz clock.
  • Applies a reset, then inputs a padded zero block (0x8000...0040).
  • Displays the hash after 179 cycles.
  • Can be combined with the main module for simulation on JDoodle or other tools.
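
The module takes a block that is already padded, so a testbench has to prepare data_in itself. As a hedged illustration, standard SHA-256 padding of a short message into a single block can be sketched in C as below. pad_single_block is a hypothetical helper, it handles only messages of at most 55 bytes, and it is not the source of the test block quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pad a message of at most 55 bytes into one 64-byte SHA-256 block:
   message bytes, a single 0x80 byte, zero fill, then the message
   length in bits as a 64-bit big-endian integer. */
void pad_single_block(const uint8_t *msg, size_t len, uint8_t block[64]) {
    uint64_t bit_len = (uint64_t)len * 8u;
    int i;

    assert(len <= 55u);
    memset(block, 0, 64);
    if (len > 0u) {
        memcpy(block, msg, len);
    }
    block[len] = 0x80;

    /* Store the bit length in the last 8 bytes, most significant byte first. */
    for (i = 0; i < 8; i++) {
        block[63 - i] = (uint8_t)(bit_len >> (8 * i));
    }
}
```

The resulting 64 bytes map onto data_in with byte 0 in bits 511:504, matching the big-endian word extraction used by the message schedule.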

Usage

To use the module:

  1. Instantiate sha256_90r_pipeline in your design.
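  2. Provide a 512-bit input block and hold input_valid high (with data_in stable) while the message schedule is computed; in the listed code the schedule counter advances only on cycles where input_valid=1.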
  3. Wait 179 cycles for output_valid to go high and read hash_out.
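  4. For continuous operation, the design target is a new block every cycle once the pipeline is filled (steady-state throughput: 1 hash/clock). In the listed code msg_cycle is cleared only by reset, so the message schedule is computed once per reset.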

Future Improvements

  • Optimization: Further pipeline the round function to reduce critical path and increase clock frequency.
  • Batch Processing: Implement multiple pipelines for parallel processing, as in the original C code’s fpga_batch_pipeline_t.
  • Testbench Enhancements: Add more test cases with known SHA-256 outputs for verification.
  • Synthesis Testing: Validate on a real FPGA (e.g., Xilinx Vivado) to confirm resource usage and timing.
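
One way to obtain known-answer vectors for such test cases is a small software golden model with a configurable round count. The C sketch below is an assumption-laden illustration, not the original C code: compress_ref and K_STD are hypothetical names, the table holds only the 64 standard SHA-256 constants, and the function returns the raw working state a..h without the final feed-forward addition, since the listed module drives hash_out directly from the stage 89 registers. For a 90-round run the caller would pass the design's own constant table.

```c
#include <assert.h>
#include <stdint.h>

/* Standard SHA-256 round constants; enough for a 64-round check. */
const uint32_t K_STD[64] = {
    0x428a2f98u, 0x71374491u, 0xb5c0fbcfu, 0xe9b5dba5u, 0x3956c25bu, 0x59f111f1u, 0x923f82a4u, 0xab1c5ed5u,
    0xd807aa98u, 0x12835b01u, 0x243185beu, 0x550c7dc3u, 0x72be5d74u, 0x80deb1feu, 0x9bdc06a7u, 0xc19bf174u,
    0xe49b69c1u, 0xefbe4786u, 0x0fc19dc6u, 0x240ca1ccu, 0x2de92c6fu, 0x4a7484aau, 0x5cb0a9dcu, 0x76f988dau,
    0x983e5152u, 0xa831c66du, 0xb00327c8u, 0xbf597fc7u, 0xc6e00bf3u, 0xd5a79147u, 0x06ca6351u, 0x14292967u,
    0x27b70a85u, 0x2e1b2138u, 0x4d2c6dfcu, 0x53380d13u, 0x650a7354u, 0x766a0abbu, 0x81c2c92eu, 0x92722c85u,
    0xa2bfe8a1u, 0xa81a664bu, 0xc24b8b70u, 0xc76c51a3u, 0xd192e819u, 0xd6990624u, 0xf40e3585u, 0x106aa070u,
    0x19a4c116u, 0x1e376c08u, 0x2748774cu, 0x34b0bcb5u, 0x391c0cb3u, 0x4ed8aa4au, 0x5b9cca4fu, 0x682e6ff3u,
    0x748f82eeu, 0x78a5636fu, 0x84c87814u, 0x8cc70208u, 0x90befffau, 0xa4506cebu, 0xbef9a3f7u, 0xc67178f2u
};

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

/* Hypothetical golden model: expands the 16 block words to `rounds` schedule
   words, runs `rounds` rounds starting from `init`, and writes the raw
   working state a..h to `raw`. `k` must hold at least `rounds` constants. */
void compress_ref(const uint32_t blk[16], const uint32_t *k, unsigned rounds,
                  const uint32_t init[8], uint32_t raw[8]) {
    uint32_t w[96];
    uint32_t s[8];
    unsigned i;

    assert(rounds <= 96u);
    for (i = 0; i < 16u; i++) w[i] = blk[i];
    for (i = 16u; i < rounds; i++) {
        uint32_t s0 = rotr32(w[i - 15u], 7) ^ rotr32(w[i - 15u], 18) ^ (w[i - 15u] >> 3);
        uint32_t s1 = rotr32(w[i - 2u], 17) ^ rotr32(w[i - 2u], 19) ^ (w[i - 2u] >> 10);
        w[i] = w[i - 16u] + s0 + w[i - 7u] + s1;
    }

    for (i = 0; i < 8u; i++) s[i] = init[i];
    for (i = 0; i < rounds; i++) {
        uint32_t t1 = s[7] + (rotr32(s[4], 6) ^ rotr32(s[4], 11) ^ rotr32(s[4], 25))
                    + ((s[4] & s[5]) ^ (~s[4] & s[6])) + k[i] + w[i];
        uint32_t t2 = (rotr32(s[0], 2) ^ rotr32(s[0], 13) ^ rotr32(s[0], 22))
                    + ((s[0] & s[1]) ^ (s[0] & s[2]) ^ (s[1] & s[2]));
        /* Shift the working variables: h=g, g=f, f=e, e=d+t1, d=c, c=b, b=a, a=t1+t2. */
        s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
        s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + t2;
    }
    for (i = 0; i < 8u; i++) raw[i] = s[i];
}
```

Running the model with 64 rounds on the padded block for "abc" and adding the initial state word by word should reproduce the standard SHA-256 digest, which is a useful sanity check on the model before it is used to generate 90-round vectors.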

This implementation provides a robust, constant-time SHA256-90R pipeline suitable for FPGA deployment, with clear mappings from the original C code to hardware constructs.

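// Single SHA-256 round as combinational logic (no clock, no internal state).
// When valid=0, valid_mask passes the input state through unchanged.
module sha256_round (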
    input wire [31:0] a, b, c, d, e, f, g, h,
    input wire [31:0] w, k,
    input wire valid,
    output wire [31:0] a_out, b_out, c_out, d_out, e_out, f_out, g_out, h_out
);
    wire [31:0] valid_mask = valid ? 32'hFFFFFFFF : 32'h0;
    wire [31:0] t1 = h + (((e >> 6) | (e << 26)) ^ ((e >> 11) | (e << 21)) ^ ((e >> 25) | (e << 7))) +
                     ((e & f) ^ (~e & g)) + k + w;
    wire [31:0] t2 = (((a >> 2) | (a << 30)) ^ ((a >> 13) | (a << 19)) ^ ((a >> 22) | (a << 10))) +
                     ((a & b) ^ (a & c) ^ (b & c));

    assign a_out = ((t1 + t2) & valid_mask) | (a & ~valid_mask);
    assign b_out = (a & valid_mask) | (b & ~valid_mask);
    assign c_out = (b & valid_mask) | (c & ~valid_mask);
    assign d_out = (c & valid_mask) | (d & ~valid_mask);
    assign e_out = ((d + t1) & valid_mask) | (e & ~valid_mask);
    assign f_out = (e & valid_mask) | (f & ~valid_mask);
    assign g_out = (f & valid_mask) | (g & ~valid_mask);
    assign h_out = (g & valid_mask) | (h & ~valid_mask);
endmodule

module sha256_90r_pipeline (
    input wire clk,                    // Clock input
    input wire rst_n,                  // Active-low reset
    input wire input_valid,            // Input valid signal
    input wire [511:0] data_in,        // 512-bit input block
    output reg output_valid,           // Output valid signal
    output reg [255:0] hash_out        // 256-bit hash output
);

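    // Round constants (k_90r_fpga): k[0:63] are the standard SHA-256 constants,
    // k[64:89] extend the table to 90 rounds, k[90:95] are zero padding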
    reg [31:0] k [0:95];
    initial begin
        k[0] = 32'h428a2f98; k[1] = 32'h71374491; k[2] = 32'hb5c0fbcf; k[3] = 32'he9b5dba5;
        k[4] = 32'h3956c25b; k[5] = 32'h59f111f1; k[6] = 32'h923f82a4; k[7] = 32'hab1c5ed5;
        k[8] = 32'hd807aa98; k[9] = 32'h12835b01; k[10] = 32'h243185be; k[11] = 32'h550c7dc3;
        k[12] = 32'h72be5d74; k[13] = 32'h80deb1fe; k[14] = 32'h9bdc06a7; k[15] = 32'hc19bf174;
        k[16] = 32'he49b69c1; k[17] = 32'hefbe4786; k[18] = 32'h0fc19dc6; k[19] = 32'h240ca1cc;
        k[20] = 32'h2de92c6f; k[21] = 32'h4a7484aa; k[22] = 32'h5cb0a9dc; k[23] = 32'h76f988da;
        k[24] = 32'h983e5152; k[25] = 32'ha831c66d; k[26] = 32'hb00327c8; k[27] = 32'hbf597fc7;
        k[28] = 32'hc6e00bf3; k[29] = 32'hd5a79147; k[30] = 32'h06ca6351; k[31] = 32'h14292967;
        k[32] = 32'h27b70a85; k[33] = 32'h2e1b2138; k[34] = 32'h4d2c6dfc; k[35] = 32'h53380d13;
        k[36] = 32'h650a7354; k[37] = 32'h766a0abb; k[38] = 32'h81c2c92e; k[39] = 32'h92722c85;
        k[40] = 32'ha2bfe8a1; k[41] = 32'ha81a664b; k[42] = 32'hc24b8b70; k[43] = 32'hc76c51a3;
        k[44] = 32'hd192e819; k[45] = 32'hd6990624; k[46] = 32'hf40e3585; k[47] = 32'h106aa070;
        k[48] = 32'h19a4c116; k[49] = 32'h1e376c08; k[50] = 32'h2748774c; k[51] = 32'h34b0bcb5;
        k[52] = 32'h391c0cb3; k[53] = 32'h4ed8aa4a; k[54] = 32'h5b9cca4f; k[55] = 32'h682e6ff3;
        k[56] = 32'h748f82ee; k[57] = 32'h78a5636f; k[58] = 32'h84c87814; k[59] = 32'h8cc70208;
        k[60] = 32'h90befffa; k[61] = 32'ha4506ceb; k[62] = 32'hbef9a3f7; k[63] = 32'hc67178f2;
        k[64] = 32'hc67178f2; k[65] = 32'hca273ece; k[66] = 32'hd186b8c7; k[67] = 32'heada7dd6;
        k[68] = 32'hf57d4f7f; k[69] = 32'h06f067aa; k[70] = 32'h0a637dc5; k[71] = 32'h113f9804;
        k[72] = 32'h1b710b35; k[73] = 32'h28db77f5; k[74] = 32'h32caab7b; k[75] = 32'h3c9ebe0a;
        k[76] = 32'h431d67c4; k[77] = 32'h4cc5d4be; k[78] = 32'h597f299c; k[79] = 32'h5fcb6fab;
        k[80] = 32'h6c44198c; k[81] = 32'h7ba0ea2d; k[82] = 32'h7eabf2d0; k[83] = 32'h8dbe8d03;
        k[84] = 32'h90bb1721; k[85] = 32'h99a2ad45; k[86] = 32'h9f86e289; k[87] = 32'ha84c4472;
        k[88] = 32'hb3df34fc; k[89] = 32'hb99bb8d7; k[90] = 32'h0; k[91] = 32'h0;
        k[92] = 32'h0; k[93] = 32'h0; k[94] = 32'h0; k[95] = 32'h0;
    end

    // Pipeline stage registers
    reg [31:0] stage_a [0:89];
    reg [31:0] stage_b [0:89];
    reg [31:0] stage_c [0:89];
    reg [31:0] stage_d [0:89];
    reg [31:0] stage_e [0:89];
    reg [31:0] stage_f [0:89];
    reg [31:0] stage_g [0:89];
    reg [31:0] stage_h [0:89];
    reg [31:0] stage_w [0:89];
    reg [31:0] stage_k [0:89];
    reg stage_valid [0:89];
    reg [6:0] current_stage; // Tracks pipeline progress (0-89)
    reg pipeline_filled;

    // Message schedule registers
    reg [31:0] m [0:89];
    reg [6:0] msg_cycle; // Tracks message schedule generation (0-89)

    // SHA-256 initial state
    localparam [31:0] H0 = 32'h6a09e667;
    localparam [31:0] H1 = 32'hbb67ae85;
    localparam [31:0] H2 = 32'h3c6ef372;
    localparam [31:0] H3 = 32'ha54ff53a;
    localparam [31:0] H4 = 32'h510e527f;
    localparam [31:0] H5 = 32'h9b05688c;
    localparam [31:0] H6 = 32'h1f83d9ab;
    localparam [31:0] H7 = 32'h5be0cd19;

    // Message schedule computation
    integer i;
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            msg_cycle <= 0;
            for (i = 0; i < 90; i = i + 1) begin
                m[i] <= 0;
            end
        end else if (input_valid && msg_cycle < 90) begin
            if (msg_cycle < 16) begin
                m[msg_cycle] <= data_in[511 - msg_cycle*32 -: 32];
            end else begin
                m[msg_cycle] <= m[msg_cycle-16] +
                    (((m[msg_cycle-15] >> 7) | (m[msg_cycle-15] << 25)) ^
                     ((m[msg_cycle-15] >> 18) | (m[msg_cycle-15] << 14)) ^
                     (m[msg_cycle-15] >> 3)) +
                    m[msg_cycle-7] +
                    (((m[msg_cycle-2] >> 17) | (m[msg_cycle-2] << 15)) ^
                     ((m[msg_cycle-2] >> 19) | (m[msg_cycle-2] << 13)) ^
                     (m[msg_cycle-2] >> 10));
            end
            msg_cycle <= msg_cycle + 1;
        end
    end

    // Pipeline logic
    wire [31:0] a_next [0:89];
    wire [31:0] b_next [0:89];
    wire [31:0] c_next [0:89];
    wire [31:0] d_next [0:89];
    wire [31:0] e_next [0:89];
    wire [31:0] f_next [0:89];
    wire [31:0] g_next [0:89];
    wire [31:0] h_next [0:89];

    // Instantiate round modules for each stage (except first stage)
    genvar j;
    generate
        for (j = 1; j < 90; j = j + 1) begin : round_gen
            sha256_round round (
                .a(stage_a[j-1]), .b(stage_b[j-1]), .c(stage_c[j-1]), .d(stage_d[j-1]),
                .e(stage_e[j-1]), .f(stage_f[j-1]), .g(stage_g[j-1]), .h(stage_h[j-1]),
                .w(stage_w[j-1]), .k(stage_k[j-1]), .valid(stage_valid[j-1]),
                .a_out(a_next[j]), .b_out(b_next[j]), .c_out(c_next[j]), .d_out(d_next[j]),
                .e_out(e_next[j]), .f_out(f_next[j]), .g_out(g_next[j]), .h_out(h_next[j])
            );
        end
    endgenerate

    // Stage 0: load the SHA-256 initial state when input_valid=1, otherwise hold the current value
    wire [31:0] input_mask = input_valid ? 32'hFFFFFFFF : 32'h0;
    assign a_next[0] = (H0 & input_mask) | (stage_a[0] & ~input_mask);
    assign b_next[0] = (H1 & input_mask) | (stage_b[0] & ~input_mask);
    assign c_next[0] = (H2 & input_mask) | (stage_c[0] & ~input_mask);
    assign d_next[0] = (H3 & input_mask) | (stage_d[0] & ~input_mask);
    assign e_next[0] = (H4 & input_mask) | (stage_e[0] & ~input_mask);
    assign f_next[0] = (H5 & input_mask) | (stage_f[0] & ~input_mask);
    assign g_next[0] = (H6 & input_mask) | (stage_g[0] & ~input_mask);
    assign h_next[0] = (H7 & input_mask) | (stage_h[0] & ~input_mask);

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            for (i = 0; i < 90; i = i + 1) begin
                stage_a[i] <= 0;
                stage_b[i] <= 0;
                stage_c[i] <= 0;
                stage_d[i] <= 0;
                stage_e[i] <= 0;
                stage_f[i] <= 0;
                stage_g[i] <= 0;
                stage_h[i] <= 0;
                stage_w[i] <= 0;
                stage_k[i] <= 0;
                stage_valid[i] <= 0;
            end
            current_stage <= 0;
            pipeline_filled <= 0;
            output_valid <= 0;
            hash_out <= 0;
        end else begin
            // Shift pipeline stages and update
            for (i = 0; i < 90; i = i + 1) begin
                if (i == 0) begin
                    stage_a[0] <= a_next[0];
                    stage_b[0] <= b_next[0];
                    stage_c[0] <= c_next[0];
                    stage_d[0] <= d_next[0];
                    stage_e[0] <= e_next[0];
                    stage_f[0] <= f_next[0];
                    stage_g[0] <= g_next[0];
                    stage_h[0] <= h_next[0];
                    stage_w[0] <= (m[current_stage] & input_mask) | (stage_w[0] & ~input_mask);
                    stage_k[0] <= (k[current_stage] & input_mask) | (stage_k[0] & ~input_mask);
                    stage_valid[0] <= input_valid ? 1 : stage_valid[0];
                end else begin
                    stage_a[i] <= a_next[i];
                    stage_b[i] <= b_next[i];
                    stage_c[i] <= c_next[i];
                    stage_d[i] <= d_next[i];
                    stage_e[i] <= e_next[i];
                    stage_f[i] <= f_next[i];
                    stage_g[i] <= g_next[i];
                    stage_h[i] <= h_next[i];
                    stage_w[i] <= stage_w[i-1];
                    stage_k[i] <= stage_k[i-1];
                    stage_valid[i] <= stage_valid[i-1];
                end
            end

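            // Update pipeline state; current_stage stops at 89 so that the
            // m[current_stage] read above never indexes past m[89]
            if (input_valid && !pipeline_filled) begin
                if (current_stage >= 89) begin
                    pipeline_filled <= 1;
                end else begin
                    current_stage <= current_stage + 1;
                end
            end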

            // Output logic
            output_valid <= stage_valid[89];
            if (stage_valid[89]) begin
                hash_out <= {stage_a[89], stage_b[89], stage_c[89], stage_d[89],
                             stage_e[89], stage_f[89], stage_g[89], stage_h[89]};
            end
        end
    end

endmodule