VOICE COMMAND RECOGNITION WITH DEEP NEURAL NETWORK ON EDGE DEVICES
thesisposted on 26.07.2021, 20:33 by Md Naim MiahMd Naim Miah
Interconnected devices are becoming attractive solutions to integrate physical parameters and making them more accessible for further analysis. Edge devices, located at the end of the physical world, measure and transfer data to the remote server using either wired or wireless communication. The exploding number of sensors, being used in the Internet of Things (IoT), medical fields, or industry, are demanding huge bandwidth and computational capabilities in the cloud, to be processed by Artificial Neural Networks (ANNs) – especially, processing audio, video and images from hundreds of edge devices. Additionally, continuous transmission of information to the remote server not only hampers privacy but also increases latency and takes more power. Deep Neural Network (DNN) is proving to be very effective for cognitive tasks, such as speech recognition, object detection, etc., and attracting researchers to apply it in edge devices. Microcontrollers and single-board computers are the most commonly used types of edge devices. These have gone through significant advancements over the years and capable of performing more sophisticated computations, making it a reasonable choice to implement DNN. In this thesis, a DNN model is trained and implemented for Keyword Spotting (KWS) on two types of edge devices: a bare-metal embedded device (microcontroller) and a robot car. The unnecessary components and noise of audio samples are removed, and speech features are extracted using Mel-Frequency Cepstral Co-efficient (MFCC). In the bare-metal microcontroller platform, these features are efficiently extracted using Digital Signal Processing (DSP) library, which makes the calculation much faster. A Depth wise Separable Convolutional Neural Network (DSCNN) based model is proposed and trained with an accuracy of about 91% with only 721 thousand trainable parameters. After implementing the DNN on the microcontroller, the converted model takes only 11.52 Kbyte (2.16%) RAM and 169.63 Kbyte (8.48%) Flash of the test device. It needs to perform 287,673 Multiply-and-Accumulate (MACC) operations and takes about 7ms to execute the model. This trained model is also implemented on the robot car, Jetbot, and designed a voice-controlled robotic vehicle. This robot accepts few selected voice commands-such as “go”, “stop”, etc. and executes accordingly with reasonable accuracy. The Jetbot takes about 15ms to execute the KWS. Thus, this study demonstrates the implementation of Neural Network based KWS on two different types of edge devices: a bare-metal embedded device without any Operating System (OS) and a robot car running on embedded Linux OS. It also shows the feasibility of bare-metal offline KWS implementation for autonomous systems, particularly autonomous vehicles.