Hardware-Software Codesign For Efficient Machine Learning Using In-Memory Computing
Thesis posted on 07.12.2020, 23:03 by Aayush Ankit
General-purpose computing systems have benefited from technology scaling for several decades but are now hitting a performance/energy wall. This trend has led to a growing interest in domain-specific accelerators. Machine Learning (ML) workloads in particular have received tremendous attention because of their pervasiveness across applications. ML workloads tend to be data-intensive and perform many matrix operations, so their execution on digital CMOS hardware is typically characterized by high data movement costs. To overcome this limitation, in-memory computing primitives (CMOS, NVM) have been demonstrated to perform matrix operations with high efficiency by addressing the low memory bandwidth and high memory access energy of conventional designs. While such primitives have shown tremendous potential at the device and circuit levels, their system-level implications remain unclear, as they are not a drop-in replacement for traditional memory structures (register files, caches, etc.).
First, this thesis explores the potential of in-memory computing for domain-specific accelerators targeting inference, sparse inference, and training. To improve inference efficiency, PUMA is proposed: a spatial architecture built with Non-Volatile Memory (NVM) crossbars. PUMA leverages the high on-chip storage density and analog computation capabilities of NVM to accelerate a wide range of ML applications. It supplements NVM crossbars with CMOS digital units, an instruction execution pipeline, and a specialized ISA to enable efficient coverage across different ML workloads and enhance programmability. It also includes hardware-software codesign to optimize data movement. The evaluations show that PUMA achieves significant energy and latency improvements for ML inference compared to state-of-the-art GPUs, CPUs, and ASICs. Subsequently, the proposed accelerator is extended for sparse inference. The over-parametrized nature of typical ML models has motivated model-compression techniques such as network pruning to improve inference efficiency. However, sparse models obtained by typical network pruning lead to inefficient execution, particularly on in-memory hardware. To address this, TraNNsformer is proposed, which prunes the connectivity matrix while forming clusters with the remaining connections during the training process, transforming the original model into an optimally pruned and maximally clustered mapping. Next, the applicability of the proposed in-memory accelerator (PUMA) is explored for ML training. Despite their storage density and analog computing benefits, NVMs suffer from high write cost, which is detrimental for ML training because weight updates are the common case. To address this, a bit-slicing technique is proposed that enables performing high-precision analog outer products on NVM crossbars to realize the weight update without using typical reads and writes. This technique is incorporated into an ISA-programmable architecture, namely PANTHER, which supports different training algorithms and different layer types.
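The arithmetic idea behind bit-slicing can be illustrated with a small integer-only sketch. It is not PANTHER's actual scheme: the function names, the 2-bit slice width, and the choice to slice only the input vector are assumptions made for illustration, and no analog behavior or device non-ideality is modeled. The sketch shows only that a high-precision outer-product update can be rebuilt exactly from several low-precision partial outer products, each weighted by a power of two.

```python
def bit_slices(value, slice_bits, num_slices):
    """Split a non-negative integer into little-endian slices of slice_bits bits."""
    mask = (1 << slice_bits) - 1
    return [(value >> (k * slice_bits)) & mask for k in range(num_slices)]


def outer(u, v):
    """Plain outer product of two integer vectors, returned as a list of rows."""
    return [[a * b for b in v] for a in u]


def sliced_outer_update(weights, x, delta, slice_bits=2, num_slices=4):
    """Accumulate outer(x, delta) into weights one input slice at a time.

    Hypothetical helper: each partial product only sees slice_bits-bit
    values of x (a stand-in for one low-precision crossbar operation),
    and the power-of-two scale restores the slice's significance.
    """
    x_slices = [bit_slices(v, slice_bits, num_slices) for v in x]
    for k in range(num_slices):
        xk = [s[k] for s in x_slices]          # k-th slice of every input
        scale = 1 << (k * slice_bits)          # significance of that slice
        for i, row in enumerate(outer(xk, delta)):
            for j, p in enumerate(row):
                weights[i][j] += p * scale     # update in place, no read-back
    return weights


# 8-bit inputs covered by four 2-bit slices; delta may be signed.
x = [200, 17, 255]
delta = [3, -2]
w = sliced_outer_update([[0, 0] for _ in x], x, delta)
print(w == outer(x, delta))  # True
```

The update is expressed purely as accumulation into the stored weights, which mirrors the motivation stated above: the weight change is applied without first reading the old value and writing back a new one.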
Despite the effectiveness of domain-specific accelerators for ML acceleration, general-purpose systems such as General-Purpose Graphics Processing Units (GPGPUs) have continued to play an important role in ML inference and training, owing to their wider availability and lower engineering costs for hardware-software development. Subsequently, modern GPGPUs have introduced domain-specific units for matrix multiplication, namely tensor cores, to meet the growing performance demands of ML. However, even with the improved throughput, tensor cores are inherently limited by the bandwidth of the large register files in GPGPUs. To this end, the implications of in-memory-computing-based tensor cores for GPGPUs are explored. We show that a GPGPU augmented with in-memory tensor cores overcomes the bandwidth limitation of the register file and provides the flexibility required for emerging application trends such as quantization and sparsity.