Approximate Computing: From Circuits to Software
thesisposted on 2021-03-01, 23:28 authored by Younghoon KimYounghoon Kim
Many modern workloads such as multimedia, recognition, mining, search, vision, etc. possess the characteristic of intrinsic application resilience: The ability to produce acceptable-quality outputs despite their underlying computations being performed in an approximate manner. Approximate computing has emerged as a paradigm that exploits intrinsic application resilience to design systems that produce outputs of acceptable quality with significant performance/energy improvement. The research community has proposed a range of approximate computing techniques spanning across circuits, architecture, and software over the last decade. Nevertheless, approximate computing is yet to be incorporated into mainstream HW/SW design processes largely due to the deviation from the conventional design flow and the lack of runtime approximation controllability by the user.
The primary objective of this thesis is to provide approximate computing techniques across different layers of abstraction that possess the two following characteristics: (i) They can be applied with minimal change to the conventional design flow, and (ii) the approximation is controllable at runtime by the user with minimal overhead. To this end, this thesis proposes three novel approximate computing techniques: Clock overgating which targets HW design at the Register Transfer Level (RTL), value similarity extensions which enhance general-purpose processors with a set of microarchitectural and ISA extensions, and data subsetting which targets SW executing for commodity platforms.
The thesis first explores clock overgating, which extends the concept of clock gating: A conventional low-power technique that turns off the clock to a Flip-Flop (FF) when the value remains unchanged. In contrast to traditional clock gating, in clock overgating the clock signals to selected FFs in the circuit are gated even when the circuit functionality is sensitive to their state. This saves additional power in the clock tree, the gated FFs and in their downstream logic, while a quality loss occurs if the erroneous FF states propagate to the circuit outputs. This thesis develops a systematic methodology to identify an energy-efficient clock overgating configuration for any given circuit and quality constraint. Towards this end, three key strategies for efficiently pruning the large space of possible overgating configurations are proposed: Significance-based overgating, grouping FFs into overgating islands, and utilizing internal signals of the circuit as triggers for overgating. Across a suite of 6 machine learning accelerators, energy benefits of 1.36X on average are achieved at the cost of a very small (<0.5%) loss in classification accuracy.
The thesis also explores value similarity extensions, a set of lightweight micro-architectural and ISA extensions for general-purpose processors that provide performance improvements for computations on data structures with value similarity. The key idea is that programs often contain repeated instructions that are performed on very similar inputs (e.g., neighboring pixels within a homogeneous region of an image). In such cases, it may be possible to skip an instruction that operates on data similar to a previously executed instruction, and approximate the skipped instruction's result with the saved result of the previous one. The thesis provides three key strategies for realizing this approach: Identifying potentially skippable instructions from user annotations in SW, obtaining similarity information for future load values from the data cache line currently being accessed, and a mechanism for saving & reusing results of potentially skippable instructions. As a further optimization, the thesis proposes to replace multiple loop iterations that produce similar results with a specialized instruction sequence. The proposed extensions are modeled on the gem5 architectural simulator, achieving speedup of 1.81X on average across 6 machine-learning benchmarks running on a microcontroller-class in-order processor.
Finally, the thesis explores a data-centric approach to approximate computing called data subsetting that shifts the focus of approximation from computations to data. The key idea is to restrict the application's data accesses to a subset of its elements so that the overall memory footprint becomes smaller. Constraining the accesses to lie within a smaller memory footprint renders the memory accesses more cache-friendly, thereby improving performance. This thesis presents a C++ data structure template called SubsettableTensor, which embodies mechanisms to define an accessible subset of data and redirect accesses away from non-subset elements, for realizing data subsetting in SW. The proposed concept is evaluated on parallel SW implementations of 7 machine learning applications on a 48-core AMD Opteron server. Experimental results indicate that 1.33X-4.44X performance improvement can be achieved within a <0.5% loss in classification accuracy.
In summary, the proposed approximation techniques have shown significant efficiency improvements for various machine learning applications in circuits, architecture and SW, underscoring their promise as designer-friendly approaches to approximate computing.
CSR: Small: Quality Programmable Processing Platforms for Approximate Computing
Directorate for Computer & Information Science & EngineeringFind out more...