Skip to content
Tianchu Ji edited this page May 13, 2026 · 5 revisions

This document provides the details of the three-level-hierarchical SpMM core implementation. The three levels are:

+------------------------------------------------+
|                                                |
|  +------------------------------------------+  |
|  |                                          |  |
|  |   +-----+ +-----+ +-----+      +-----+   |  |
|  |   |     | |     | |     |      |     |   |  |
|  |   |  TB | |  TB | |  TB | ...  |  TB |   |  |
|  |   |chain| |chain| |chain|      |chain|   |  |
|  |   +-----+ +-----+ +-----+      +-----+   |  |
|  |       .                  .        .      |  |
|  |       .                    .      .      |  |
|  |       .                      .    .      |  |
|  |   +-----+ +-----+ +-----+      +-----+   |  |
|  |   |     | |     | |     |      |     |   |  |
|  |   |  TB | |  TB | |  TB | ...  |  TB |   |  |
|  |   |chain| |chain| |chain|      |chain|   |  |
|  |   +-----+ +-----+ +-----+      +-----+   |  |
|  |                                          |  |
|  |                                          |  |
|  |            TensorCoreChainArray          |  |
|  +------------------------------------------+  |
|            tensor_core_array_wrapper           |
+------------------------------------------------+

In a nutshell:

  • tensor_core_array_wrapper is the top-level wrapper providing interface between Gidel-generated shell and the SpMM core
  • TensorCoreChainArray is the core of the SpMM core which finishes the block aggregation and computaiton
  • TensorCoreChainBf12 is the BFP12 Tensor Block chain. Each Tensor Block chain is a PE with several cascaded Tensor Blocks in it. Many Tensor Block chains form the Tensor Core Chain Array

Note: This document assumes the reader has read and understood the block aggregation paper.

The tensor_core_array_wrapper module is the top level wrapper that handles the communication between the Gidel-generated shell and the SpMM core. The wrapper accepts the job submitted by the host through the Gidel API (which controls the Gidel shell and eventually controls the tensor_core_array_wrapper), stores them in the job queue. Each job (representing a "large block", or LB) contains multiple tiles in the SpMM. Multiple jobs (LBs) are processed to finish a SpMM. The host then starts the index and data DMA engine, acknowledges the SpMM core to start the computation. Then the SpMM core consumes all the jobs in the queue, and finishes the computation. During this process, matrix A streaming, index streaming and computation are running in parallel asynchronously.

Structure and port definition

tensor_core_array_wrapper

The architecture of the tensor_core_array_wrapper is shown above. The important ports that are exposed to the Gidel shell are:

Port name Direction Type Description
tcarray_in[0] in data input stream data port for matrix A col index and matrix V data
tcarray_in[1]-[5] in data input stream data port for matrix A
tcarray_in[6]-[7] in data input stream data port for matrix V (for future use)
tc_ctrl in control control signal for the SpMM core, implemented as "Gidel registers"
buf_ld_sel in control 1 -> tcarray_in[0] for loading matrix V, 0 -> tcarray_in[0] for loading matrix A col index during computation
mbvec_size in control how many BFP12 blocks are there in each column of matrix V. This number varies for different context lengths
mbidx_rd_bound in control, queued as job-specific parameter when buf_ld_sel == 0 row buffer writing boundary for matrix V loading and matrix A col index loading
ma_rd_bound in control, queued as job-specific parameter col buffer writing boundary for matrix A loading, representing how many transactions need to be transferred for each job (large block)
tcarray_out[0]-[1] out data output stream data port
err_info out control asserted if error occurs
lat_counter out control latency counter for performance evaluation

Port tcarray_in and tcarray_out are the main data ports supporting the Gidel multiport streaming protocol. Each channel of tcarray_in and tcarray_out has 256-bit data bus width and has its own DMA. Refer to Gidel multiport stream protocol page for the details of the protocol.

tcarray_in is reorganized to feed matrix A blocks with row index into many column buffers (colBuffer). The organization is configured by two parameters: HBM_MATA_CHAN_GRP and MATA_CHAN_PER_GRP. HBM_MATA_CHAN_GRP is a list of arrays, for each array it describes the tcarray_in ports that are grouped together to be split into multiple column buffers. MATA_CHAN_PER_GRP is the number of column buffers each group of tcarray_in ports is split into. The following figure shows an example of splitting tcarray_in ports into 2 groups, with each group feeding 7 and 5 colBuffers respectively.

col_buf_layout

In this example, HBM_MATA_CHAN_GRP=[[1,2,3], [4,5]] and MATA_CHAN_PER_GRP=[7,5]. Thus tcarray_in[1]-[3] are grouped to feed the first 7 colBuffers in parallel, and tcarray_in[4]-[5] are grouped to feed the last 5 colBuffers in parallel. Within each group, the hardware treats all the tcarray_in port as a same "wide" port with the same start and select signals. This is not a good design choice, but it simplifies the control logic.

Port tc_ctrl has several fields, each associated with a specific function:

Bits Name Description
tc_ctrl[0] start_load 1 -> start loading matrix V or index
tc_ctrl[1] soft_reset 1 -> soft reset the SpMM core
tc_ctrl[2] cal_en_start 1 -> start computation
tc_ctrl[3] boundaries_push 1 -> push the current mbidx_rd_bound and ma_rd_bound into the job queue
tc_ctrl[5:4] perf_counters_rd_sel 0 -> read latency including the II of first matrix A loading; 1 -> read latency without II of the first matrix A loading and the job switching latency, only counting for the computational time; 2 -> read matrix V loading latency; 3 -> reserved

Job execution and its pipelining

The reason to have an extra level of blocking strategy (called "jobs" or "large blocks") is to forcely sychronize the column buffer load. During the computation, the different column buffers are loaded with data from different rows of the matrix A. Due to the sparsity variation, this eventually caused the load unbalanced among the column buffers, and a column buffer with much more data will drop and lose the incoming data while other buffers are still able to accept their inputs, because the current buffer design does not support back pressure. To walk around this problem, I manually set a large enough depth for the column buffer, and force the compute units to exhaust the data in all the column buffers for each "job" to "synchronize" the column buffer loads.

The jobs are executed in a pipelined fashion to reduce the latency: when the last tile of the previous job is being computed, the next job starts to load its first tile. Technically from the perspective of the Tensor Block chain array, there is no difference between processing two consecutive tiles of the same job, and processing the last tile of one job and the first tile of the next job. To achieve this, the column buffers needs to be big enough to hold at least two tiles (to double buffer the matrix A), and the execution needs to follow a ping-pong manner. The following figure shows the timing diagram for the job switching. (In this example, the computation is assumed to dominate the latency and is 2x slower than matrix A loading.)

job-switching

The ping-pong operation is managed by two status pointers, dBuffLdPtr and dBuffCompPtr, and a 2-bit status register matADbuffRdy. Each bit in matADbuffRdy associates with a specific half of the column buffer. When the value of matADbuffRdy is 0, it means the associated half has nothing. When the value is 1 it means the associated half has data and is ready to be computed. The dBuffLdPtr always points to the loading half (where the next matrix A tile is loaded into), and dBuffCompPtr points to the computing half (where the current matrix A tile is computed). During the process of a multi-job execution, the dBuffLdPtr and dBuffCompPtr toggle between two halves, creating a ping-pong effect to ensure continuous computation (the switching logic is described in the pseudo code in the figure above):

  • Initially, both halves are empty, so matADbuffRdy is 00, and both pointers are pointing to matADbuffRdy[0].
  • The first tile is loaded into the first half, to which the dBuffLdPtr is pointing
  • Once the first half finishes loading (indicated by matAFullyLoaded), matADbuffRdy[0] is set to 1. dBuffLdPtr toggles to point to matADbuffRdy[1].
  • dBuffComputePtr is still pointing to matADbuffRdy[0]. The hardware finds out the matADbuffRdy[dBuffComputePtr] becomes 1, so it starts computing and consuming the first half column buffers. During this time, the second half, which the dBuffLdPtr is pointing to, starts being loaded.
  • Once the second half is fully loaded, the matADbuffRdy[1] becomes 1, and the dBuffLdPtr toggles to point to matADbuffRdy[0]. The matADbuffRdy[0] is still 1, meaning that the computation has not been finished. So the matrix A loading pauses to wait for the computation to finish.
  • Once the first half finishes computation (indicated by tcArray.io.calFin), the matADbuffRdy[0] becomes 0, and the dBuffComputePtr toggles to point to matADbuffRdy[1]. The hardware finds out the matADbuffRdy[1] is 1, so it starts computing.
  • At the same time, the matADbuffRdy[dBuffLdPtr] becomes 0, indicating the matrix A loading is safe to restart. So the hardware starts to load the first half column buffers again.
  • This process repeats, until all the jobs are finished. The two pointers meet to point to the same half.

Index generation and buffering

The column index generation, which will be used to perform the block aggregation, runs asynchronously with the matrix A loading and computation. The index generation starts before the computation. The computation will not start unless it confirms the index FIFOs are almost full. This ensures the computation will not be blocked by the index generation. The index is a simple input stream, it does not distinguish between different jobs or tiles.

The column indices of array_col rows in matrix A from the HBM will be merged by the IndexGenerator, and the merged indices will be stored in two FIFOs, idxGenFifoSlow and idxGenFifoFast, each holding a copy of the merged matrix A column index stream. The idxGenFifoSlow feeds the CasLoadBubbleInsert to decide whether to insert a bubble or pop out a block from column buffer; The idxGenFifoFast feeds the rowMem to fetch the associated matrix V rows.

The reason to keep two FIFOs is to decouple the two index consumption processes, bubble insertion and matrix V access, because these two processes exhibits different data rates.

  • Bubble insertion is slow, it only consumes one index per three cycle, because each 3 columns of matrix A BFP12 blocks are popped out from the column buffer and are loaded sequentially into the Tensor Block chain.
  • Matrix V access consumes one index per cycle because each BFP12 block of matrix V will be fed into the Tensor Block chain per cycle. It means at each cycle a new matrix V block needs to be fetched.

At the end of a job, both FIFOs should pop out the same last column index of the tail tile of the job. So the two FIFOs are load balanced by the job (or "large block") mechanism.

The Tensor Block chain array is the core of the SpMM computation. Its architecture is explained in the block aggregation paper. There are three engineering topics that are excluded from the paper: the matrix V buffer's data layout, the matrix V transposition, and the output buffering.

Note: In the paper, the row buffer and the column buffer are swapped for reading consistency. The paper's "row buffer" is the "column buffer" (colBuffer) in the implementation, and the paper's "column buffer" is the "row memory" (rowMem) in the implementation.

Tensor Block chain array

The figure above shows the structure of the Tensor Block chain array. The sparse matrix A is stored in colbuffer and loaded using the cascade loading chain of the Tensor Blocks; The dense matrix V is stored in rowMem and sent parallelly into all the Tensor Blocks of each chain. All the outputs of each column of the Tensor Block chain array are concatenated and converted from FP32 to BFP12 to save output bandwidth, and are cached in the outBuffer before being written to the HBM.

Matrix V data layout in rowMem

The matrix V is buffered in the rowMem in row-major order. The data layout is shown as below:

Matrix V data layout in rowMem

Each rowMem has multiple banks, with each bank storing one column of the matrix V blocks. A rowMem has NUM_MATB_VEC_PER_ROW banks. For each cycle, NUM_MATB_VEC_PER_ROW blocks of V will be fetched in parallel, but sent to the Tensor Block chain array sequentially (first the block from bank 0, then the block from bank 1, etc.). This is because for each cycle, a Tensor Block chain requires parallel blocks of a column of V rather than a row of V. Thus a transposition is required (details in next section).

The matrix V columns are interleaved across multiple rows. Thus on different rows, a row block of result matrix with consecutive elements will be output at the same cycle, making the BFP conversion easier, and letting the subsequent matrix multiplication start as early as possible.

Matrix V Transposition

Each row of matrix A needs to be dot-producted with all the columns of the V matrix. Thus, one column index of matrix A can be used to fetch an entire row of V to improve the bandwidth of the rowMem. However, each Tensor Block chain requires a small block of column vector of V, which is orthogonal to the parallelly fetched V rows. This requires a transposition of the matrix V after it is fetched from rowMem. The module transposeBuffer, which utilizes multiple asymetrical FIFOs, is responsible for this operation.

Each bank of the rowMem stores a column vector of V. A rowMem contains NUM_MATB_VEC_PER_ROW banks. At each cycle, NUM_MATB_VEC_PER_ROW blocks of V will be fetched in parallel. These blocks are loaded into one asymetrical FIFO in the transposeBuffer. Next cycle, the new group of V blocks are loaded into the next asymetrical FIFO. Eventually the asymetrical FIFOs are all loaded with V blocks in round-robin manner, and an entire column block of V can be read out from all the asymetrical FIFOs in parallel. The transposeBuffer has CHAIN_LEN asymetrical FIFOs, so it provides CHAIN_LEN blocks of V to the Tensor Block chain at each cycle.

Such a transpose buffer design generates a latency of the matrix V fetching: the transposeBuffer needs NUM_MATB_VEC_PER_ROW cycles to gather enough V blocks to produce the first CHAIN_LEN output blocks. To hide this latency, the transposeBuffer is designed to double buffer the V blocks, so when the first half of the asymetrical FIFOs is outputting, the second half is being loaded without stalling.

To prevent from implementing a too big transposeBuffer with too wide input port, it is designed to be reusable. Parameter TRANSRAM_FOLDING_FACTOR is introduced to control how many times an asymetrical FIFO is reused. For example, if TRANSRAM_FOLDING_FACTOR is 2, it means each asymetrical FIFO is reused twice, effectively it is used as two FIFOs logically. Thus the total number of physical asymetrical FIFOs is halved, and each FIFO's depth is doubled. A larger TRANSRAM_FOLDING_FACTOR means fewer physical FIFOs, but larger pop-out latency. The TRANSRAM_FOLDING_FACTOR needs to be decided considering the matrix V fetching latency and the routing complexity.

Output conversion and buffering

Each Tensor Block chain generates 3 FP32 (truncated to 24-bit words) results per cycle. This creates enormous output bandwidth across the entire core. To reduce the output bandwidth demand, the outputs are grouped and converted to BFP12 format before writing to the HBM. This is done by the outBfpConv and buffered in outBuffer.

All the outputs of each column of the Tensor Block chain array are concatenated to be converted to multiple blocks of BFP12 data. Since each Tensor Block (TB) chain in a column of the array produces one output of the same row of result matrix $H_l=AV$, the outputs from the different TB chain in the same column of the array need to be concatenated to perform the BFP12 conversion. For each TB chain, there are 3 columns of the dot-product results per cycle. These 3 columns are labeled as different colors in the figure above. Along each column, there is a outConvedDat which converts the output stream from array_row-word width to 18-word width, and each group of 18 words will be converted to a BFP12 block. (18 words because it is the max multiple of array_row=6 within 20, and 20 is the default group size of BFP12. Keeping the word width to be the multiple of array_row reduces the complexity of the outConvedDat module.) After word-width conversion, each group of 18 words will be sent to outBfpConv to be converted to BFP12 blocks. The output of the outBfpConv is then buffered in outBuffer before being written to the HBM.

Note: the current implementation does not support any back pressure from the HBM writing logic. If the output is too fast and the HBM cannot keep up, the data will simply be lost.