Compute-in-Memory Primitives for Energy-Efficient Machine Learning
thesisposted on 26.07.2021, 11:01 by Amogh AgrawalAmogh Agrawal
Machine Learning (ML) workloads, being memory and compute-intensive, consume large amounts of power running on conventional computing systems, restricting their implementations to large-scale data centers. Thus, there is a need for building domain-specific hardware primitives for energy-efficient ML processing at the edge. One such approach is in-memory computing, which eliminates frequent and unnecessary data-transfers between the memory and the compute units, by directly computing the data where it is stored. Most of the chip area is consumed by on-chip SRAMs in both conventional von-Neumann systems (e.g. CPU/GPU) as well as application-specific ICs (e.g. TPU). Thus, we propose various circuit techniques to enable a range of computations such as bitwise Boolean and arithmetic computations, binary convolution operations, non-Boolean dot-product operations, lookup-table based computations, and spiking neural network implementation - all within standard SRAM memory arrays.
First, we propose X-SRAM, where, by using skewed sense amplifiers, bitwise Boolean operations such as NAND/NOR/XOR/IMP etc. can be enabled within 6T and 8T SRAM arrays. Moreover, exploiting the decoupled read/write ports in 8T SRAMs, we propose read-compute-store scheme where the computed data can directly be written back in the array simultaneously.
Second, we propose Xcel-RAM, where we show how binary convolutions can be enabled in 10T SRAM arrays for accelerating binary neural networks. We present charge sharing approach for performing XNOR operations followed by a population count (popcount) using both analog and digital techniques, highlighting the accuracy-energy tradeoff.
Third, we take this concept further and propose CASH-RAM, to accelerate non-Boolean operations, such as dot-products within standard 8T-SRAM arrays by utilizing the parasitic capacitances of bitlines and sourcelines. We analyze the non-idealities that arise due to analog computations and propose a self-compensation technique which reduces the effects of non-idealities, thereby reducing the errors.
Fourth, we propose ROM-embedded caches, RECache, using standard 8T SRAMs, useful for lookup-table (LUT) based computations. We show that just by adding an extra word-line (WL) or a source-line (SL), the same bit-cell can store a ROM bit, as well as the usual RAM bit, while maintaining the performance and area-efficiency, thereby doubling the memory density. Further we propose SPARE, an in-memory, distributed processing architecture built on RECache, for accelerating spiking neural networks (SNNs), which often require high-order polynomials and transcendental functions for solving complex neuro-synaptic models.
Finally, we propose IMPULSE, a 10T-SRAM compute-in-memory (CIM) macro, specifically designed for state-of-the-art SNN inference. The inherent dynamics of the neuron membrane potential in SNNs allows processing of sequential learning tasks, avoiding the complexity of recurrent neural networks. The highly-sparse spike-based computations in such spatio-temporal data can be leveraged for energy-efficiency. However, the membrane potential incurs additional memory access bottlenecks in current SNN hardware. IMPULSE triew to tackle the above challenges. It consists of a fused weight (WMEM) and membrane potential (VMEM) memory and inherently exploits sparsity in input spikes. We propose staggered data mapping and re-configurable peripherals for handling different bit-precision requirements of WMEM and VMEM, while supporting multiple neuron functionalities. The proposed macro was fabricated in 65nm CMOS technology. We demonstrate a sentiment classification task from the IMDB dataset of movie reviews and show that the SNN achieves competitive accuracy with only a fraction of trainable parameters and effective operations compared to an LSTM network.
These circuit explorations to embed computations in standard memory structures shows that on-chip SRAMs can do much more than just store data and can be re-purposed as on-demand accelerators for a variety of applications.