Toward Energy-Efficient Machine Learning: Algorithms and Analog Compute-In-Memory Hardware
thesisposted on 28.07.2021, 14:51 by Indranil ChakrabortyIndranil Chakraborty
The ‘Internet of Things’ has increased the demand for artificial intelligence (AI)-based edge computing in applications ranging from healthcare monitoring systems to autonomous vehicles. However, the growing complexity of machine learning workloads requires rethinking to make AI amenable to resource constrained environments such as edge devices. To that effect, the entire stack of machine learning, from algorithms to hardware primitives, have been explored to enable energy-efficient intelligence at the edge.
From the algorithmic aspect, model compression techniques such as quantization are powerful tools to address the growing computational cost of ML workloads. However, quantization, particularly, can result in substantial loss of performance for complex image classification tasks. To address this, a principal component analysis (PCA)-driven methodology to identify the important layers of a binary network, and design mixed-precision networks. The proposed Hybrid-Net achieves a significant improvement in classification accuracy over binary networks such as XNOR-Net for ResNet and VGG architectures on CIFAR-100 and ImageNet datasets, while still achieving up remarkable energy-efficiency.
Having explored compressed neural networks, there is a need to investigate suitable computing systems to further the energy efficiency. Memristive crossbars have been extensively explored as an alternative to traditional CMOS based systems for deep learning accelerators due to their high on-chip storage density and efficient Matrix Vector Multiplication (MVM) compared to digital CMOS. However, the analog nature of computing poses significant issues due to various non-idealities such as: parasitic resistances, non-linear I-V characteristics of the memristor device etc. To address this, a simplified equation-based modelling of the non-ideal behavior of crossbars is performed and correspondingly, a modified technology aware training algorithm is proposed. Building on the drawbacks of equation-based modeling, a Generalized Approach to Emulating Non-Ideality in Memristive Crossbars using Neural Networks (GENIEx) is proposed where a neural network is trained on HSPICE simulation data to learn the transfer characteristics of the non-ideal crossbar. Next, a functional simulator was developed which includes key architectural facets such as tiling, and bit-slicing to analyze the impact of non-idealities on the classification accuracy of large-scale neural networks.
To truly realize the benefits of hardware primitives and the algorithms on top of the stack, it is necessary to build efficient devices that mimic the behavior of the fundamental units of a neural network, namely, neurons and synapses. However, efforts have largely been invested in implementations in the electrical domain with potential limitations of switching speed, functional errors due to analog computing, etc. As an alternative, a purely photonic operation of an Integrate-and-Fire Spiking neuron is proposed, based on the phase change dynamics of Ge2Sb2Te5 (GST) embedded on top of a microring resonator, which alleviates the energy constraints of PCMs in electrical domain. Further, the inherent parallelism of wavelength-division multiplexing (WDM) was leveraged to propose a photonic dot-product engine. The proposed computing platform was used to emulate a SNN inferencing engine for image-classification tasks. These explorations at different levels of the stack can enable energy-efficient machine learning for edge intelligence.
Having explored various domains to design efficient DNN models and studying various hardware primitives based on emerging technologies, we focus on Silicon implementation of compute-in-memory (CIM) primitives for machine learning acceleration based on the more available CMOS technology. CIM primitives enable efficient matrix-vector multiplications (MVM) through parallelized multiply-and-accumulate operations inside the memory array itself. As CIM primitives deploy bit-serial computing, the computations are exposed bit-level sparsity of inputs and weights in a ML model. To that effect, we present an energy-efficient sparsity-aware reconfigurable-precision compute-in-memory (CIM) 8T-SRAM macro for machine learning (ML) applications. Standard 8T-SRAM arrays are re-purposed to enable MAC operations using selective current flow through the read-port transistors. The proposed macro dynamically leverages workload sparsity by reconfiguring the output precision in the peripheral circuitry without degrading application accuracy. Specifically, we propose a new energy-efficient reconfigurable-precision SAR ADC design with the ability to form (n+m)-bit precision using n-bit and m-bit ADCs. Additionally, the transimpedance amplifier (TIA) –required to convert the summed current into voltage before conversion—is reconfigured based on sparsity to improve sense margin at lower output precision. The proposed macro, fabricated in 65 nm technology, provides 35.5-127.2 TOPS/W as the ADC precision varies from 6-bit to 2-bit, respectively. Building on top of the fabricated macro, we next design a hierarchical CIM core micro-architecture that addresses the existing CIM scaling challenges. The proposed CIM core micro-architecture consists of 32 proposed sparsity-aware CIM macros. The 32 macros are divided into 4 matrix-vector multiplication units (MVMUs) consisting of 8 macros each. The core has three unique features: i) it can adaptively reconfigure ADC precision to achieve energy-efficiency and lower latency based on input and weight sparsity, determined by a sparsity controller, ii) it deploys row-gating feature to maintain SNR requirements for accurate DNN computations, and iii) hardware support for load balancing to balance latency mismatches occurring due to different ADC precisions in different compute units. Besides the CIM macros, the core micro-architecture consists of input, weight, and output memories, along with instruction memory and control circuits. The instruction set architecture allows for flexible dataflows and mapping in the proposed core micro-architecture. The sparsity-aware processing core is scheduled to be taped out next month. The proposed CIM demonstrations complemented by our previous analysis on analog CIM systems progressed our understanding of this emerging paradigm in pertinence to ML acceleration.