Research Highlights

Hardware-Reused Architecture for Deep Neural Network Accelerator

Reduced hardware resource usage and lower power consumption are critical for increasingly important edge-computing solutions. To this end, we have designed an efficient DNN accelerator architecture with better resource utilization and improved performance parameters. The proposed hardware implementation technique uses a multiplexed and serialized data path that allows the activation function (AF) to be reused. The outputs of the multiply-accumulate (MAC) array within each layer are serialized using a shift register, and a single AF is efficiently reused for excitation. The embedded design approach is divided into two parts. First, a DNN core is designed for the proposed hardware-reused architecture with a control unit for weight/bias access that includes an optimized AF; an efficient log2 quantization scheme is used for serial data access through a FIFO. Second, an embedded block design is implemented on the Zybo FPGA board to verify the design and compare physical performance parameters. Additionally, to address ASIC DNN implementation, our AF is synthesized at 45 nm technology and its physical performance parameters are compared with the state of the art.
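
To make the data path concrete, the following is a minimal behavioral sketch in Python (not the RTL) of the two ideas described above: log2-quantized weights turn each multiplication into a shift, and all MAC results of a layer are streamed one at a time through a single shared activation-function unit. Function and variable names are illustrative assumptions, not the names used in the design.

import numpy as np

def log2_quantize(w):
    # Quantize weights to signed powers of two (log2 quantization scheme).
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + 1e-12)).astype(int)
    return sign, exp

def mac_with_shifts(x, sign, exp):
    # MAC with shift-style weights: sum(sign * x * 2**exp).
    return float(np.sum(sign * x * np.exp2(exp)))

def shared_af(z):
    # The single activation-function unit reused for every neuron (ReLU here).
    return max(z, 0.0)

x = np.random.randn(8)              # layer inputs
W = np.random.randn(4, 8) * 0.5     # 4 neurons, 8 inputs each
mac_results = []
for row in W:                       # MAC array (conceptually parallel in hardware)
    s, e = log2_quantize(row)
    mac_results.append(mac_with_shifts(x, s, e))

# Serialized reuse: results are excited one by one through the same AF instance.
outputs = [shared_af(z) for z in mac_results]
print(outputs)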

Activation function reuse architecture for DNN

Reconfigurable and Efficient Computing-in-Memory for Edge AI

Compute-in-memory (CIM) is a new computing paradigm that addresses the von Neumann bottleneck in hardware accelerator design for deep learning. The input-vector and weight-matrix multiplication, i.e., the multiply-and-accumulate (MAC) operation, can be performed in the analog domain within the memory sub-array, leading to significant improvements in throughput and energy efficiency. Static random access memory (SRAM) and emerging non-volatile memories such as resistive random access memory (RRAM) are promising candidates for storing the weights of deep neural network (DNN) models. We discuss recent progress in SRAM- and RRAM-based CIM macros that have been demonstrated in silicon and FPGA implementations, and then address the general design challenges of CIM chips. In-memory computing (IMC) architectures exhibit an intrinsic trade-off between computational accuracy and energy efficiency. Ultra-low-precision networks such as binary neural networks (BNNs) have gained momentum in recent times, since the reduced precision alleviates the costs associated with storage, computation, and communication, enabling inference at the edge. RRAM crossbar-based BNN accelerators have shown tremendous potential in boosting the speed and energy efficiency of compute-intensive deep learning applications at the edge.
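
As a purely behavioral illustration (not a circuit or device model), the sketch below shows the two computation styles mentioned above: an ideal crossbar MAC, where weights act as conductances and each bit-line current is the analog dot product, and the XNOR/popcount formulation commonly used for binary MACs in BNN accelerators. All names and sizes are illustrative assumptions.

import numpy as np

def crossbar_mac(voltages, conductances):
    # Ideal, noise-free crossbar: column currents I_j = sum_i V_i * G_ij.
    return voltages @ conductances

def bnn_mac(x_bits, w_bits):
    # Binary MAC: XNOR then popcount, mapped back to the +/-1 domain.
    xnor = ~(x_bits ^ w_bits) & 1            # 1 where the bits agree
    popcount = int(xnor.sum())
    return 2 * popcount - len(x_bits)        # equals the sum of +/-1 products

V = np.array([0.0, 0.2, 0.4, 0.2])           # input activations applied as voltages
G = np.random.rand(4, 3)                     # 4x3 weight matrix stored as conductances
print(crossbar_mac(V, G))                    # three bit-line (column) currents

x = np.array([1, 0, 1, 1], dtype=np.uint8)   # binarized activations
w = np.array([1, 1, 0, 1], dtype=np.uint8)   # binarized weights
print(bnn_mac(x, w))                         # 0, i.e. (+1) + (-1) + (-1) + (+1)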

Block Diagram of Edge AI System

RECON: Resource-Efficient CORDIC-Based Neuron Architecture 

Contemporary hardware implementations of artificial neural networks face the burden of excess area requirements due to resource-intensive elements such as multipliers and non-linear activation functions. The present work addresses this challenge by proposing a resource-efficient Coordinate Rotation Digital Computer (CORDIC)-based neuron architecture (RECON), which can be configured to compute both multiply-accumulate (MAC) and non-linear activation function (AF) operations. The CORDIC-based architecture uses linear and trigonometric relationships to realize the MAC and AF operations, respectively. The proposed design is synthesized and verified at 45 nm technology using Cadence Virtuoso for all physical parameters. A signed fixed-point 8-bit MAC implemented with our design shows a 60% lower area-latency-power (ALP) product, with improvements of 38% in area, 27% in power dissipation, and 15% in latency with respect to the state-of-the-art MAC design. Further, Monte Carlo simulations for process variation and device mismatch are performed for both the proposed model and the state of the art to evaluate the expected variation in dynamic power. The worst-case mean of the dynamic power variation for our design is 189.73 μW, which is 63% of that of the state of the art.
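
The sketch below is a compact, software-level illustration of how CORDIC can cover both operations: in linear mode the y register converges to acc + x·z (a MAC using only shifts and adds), while hyperbolic rotation mode yields sinh and cosh, whose ratio gives a tanh activation. This is an algorithmic illustration under assumed iteration counts, not the RECON data path itself.

import math

def cordic_linear_mac(x, z, acc=0.0, n=16):
    # Linear (rotation) mode: y converges to acc + x * z using only shifts and adds.
    y = acc
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0
        y += d * x * 2.0 ** -i
        z -= d * 2.0 ** -i
    return y

def cordic_tanh(z, n=16):
    # Hyperbolic rotation mode: x -> K*cosh(z), y -> K*sinh(z); the gain K cancels in y/x.
    x, y = 1.0, 0.0
    i, repeated = 1, False
    while i <= n:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)
        if i in (4, 13) and not repeated:
            repeated = True                  # iterations 4, 13, 40, ... must run twice
        else:
            i, repeated = i + 1, False
    return y / x                             # tanh(z), valid roughly for |z| < 1.1

print(cordic_linear_mac(0.75, 0.5, acc=0.1)) # ~0.475 = 0.1 + 0.75 * 0.5
print(cordic_tanh(0.5))                      # ~0.4621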

RECON Architecture

BitMAC: Bit-Serial Computation-Based Efficient Multiply-Accumulate Unit for DNN Accelerator

Contemporary hardware implementations of deep neural networks face the burden of excess area requirements due to resource-intensive elements such as multipliers. A semi-custom ASIC-based VLSI design of the multiply-accumulate unit in a deep neural network faces chip-area limitations. Therefore, an area- and power-efficient architecture for the multiply-accumulate unit is imperative to reduce the excess area requirement in digital design exploration. The present work addresses this challenge by proposing an efficient, bit-serial computation-based multiply-accumulate unit implementation. The proposed architecture is verified through simulation and synthesized using Synopsys Design Vision at 180 nm and 45 nm technology nodes, and all physical parameters are extracted using Cadence Virtuoso. At 45 nm, the design shows a 34.35% lower area-delay product (ADP), with improvements of 25.94% in area, 35.65% in power dissipation, and 14.30% in latency with respect to the state-of-the-art multiply-accumulate unit design. Furthermore, lower technology nodes suffer from higher leakage power dissipation; to save leakage power, we exploit a power-gated design for the proposed architecture. The coarse-grain power-gating technique employed saves 52.79% of the leakage/static power with minimal area overhead.
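
A behavioral Python sketch of the shift-and-add principle behind bit-serial computation follows: one bit of the multiplier is consumed per cycle, so the multiplier hardware reduces essentially to an adder and a shifter, and the accumulator sums the per-element products. The bit width and the unsigned-operand simplification are illustrative assumptions, not details of the BitMAC design.

def bit_serial_multiply(a, b, width=8):
    # Multiply two unsigned integers by consuming one bit of b per "cycle";
    # the hardware equivalent needs only an adder and a shifter.
    product = 0
    for cycle in range(width):
        bit = (b >> cycle) & 1               # serial bit of the multiplier
        if bit:
            product += a << cycle            # add the shifted partial product
    return product

def bit_serial_mac(xs, ws, width=8):
    # Accumulate bit-serial products across a vector: the MAC unit's job.
    acc = 0
    for x, w in zip(xs, ws):
        acc += bit_serial_multiply(x, w, width)
    return acc

print(bit_serial_multiply(23, 11))           # 253
print(bit_serial_mac([3, 5, 7], [2, 4, 6]))  # 3*2 + 5*4 + 7*6 = 68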

Neuron Architecture

Implantable Pacemaker Chip (iPACE-CHIP)

A cardiac pacemaker is a medical device that generates electrical impulses delivered by electrodes to cause the heart muscle chambers to contract and thereby pump blood. This device can replace and/or regulate the function of the heart's natural electrical conduction system. The primary purpose of a pacemaker is to maintain an adequate heart rate, either because the heart's natural pacemaker is not fast enough or because there is a block in the heart's electrical conduction system. The goal of this project is to design and fabricate the next-generation chip (iPACE-CHIP) for the implantable pacemaker, with enhanced features, to be used in the commercial product of Shree Pacetronix Ltd. The proposed chip would have pacing, sensing, EGM, battery-measurement, telemetry, and controller blocks, designed for reduced power, smaller area, and improved reliability.

Pacemaker Chip

Early Breast Cancer Diagnosis using Cogent Activation Function-based Deep Learning Implementation on Screened Mammograms

Breast cancer is detected in one out of eight females worldwide. Principally, biomedical image processing techniques work with images captured by a microscope, which are then analyzed with the help of different algorithms and methods. Instead of microscopic image diagnosis, machine learning algorithms are now incorporated to detect and diagnose disease from medical imagery. Computer-aided mechanisms offer better efficiency and reliability compared with manual pathological detection systems. Machine learning algorithms detect tumors by extracting features through a convolutional neural network (CNN) and then classifying them using a fully connected network. As machine learning does not require prior expertise, it is widely used in biomedical imaging. We modified a CNN through mathematical modeling of a proposed activation function and obtained an appreciable prediction accuracy of up to 99%, along with a precision of 0.97.
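
A hedged PyTorch-style sketch of the overall approach is given below: a VGG-16 backbone whose activations are swapped for a custom one ahead of a two-class (benign vs. malignant) head. The abstract does not give the proposed activation's formula, so CogentAct below is a stand-in placeholder, not the published function.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class CogentAct(nn.Module):
    # Placeholder smooth activation; the published function would go here.
    def forward(self, x):
        return x * torch.sigmoid(x)          # swish-like stand-in, not the proposed AF

def replace_relu(module):
    # Recursively swap every ReLU in the backbone for the custom activation.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, CogentAct())
        else:
            replace_relu(child)

model = vgg16(weights=None)                  # VGG-16 backbone
replace_relu(model)
model.classifier[-1] = nn.Linear(4096, 2)    # two-class head: benign vs. malignant

x = torch.randn(1, 3, 224, 224)              # one mammogram-sized RGB input
print(model(x).shape)                        # torch.Size([1, 2])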

Modified VGG-16 Architecture

Error-Tolerant Reconfigurable VDD 10T SRAM Architecture for IoT Applications

We propose an error-tolerant reconfigurable VDD (R-VDD) scaled SRAM architecture that significantly reduces read and hold power using a supply-voltage-scaling technique. A data-dependent low-power 10T (D2LP10T) SRAM cell is used in the R-VDD scaled architecture for improved stability and lower power consumption; the R-VDD scaling avoids unnecessary read and hold power. In this work, the cells are implemented and analyzed in a technologically relevant 65 nm CMOS node. We analyze the failure probability in read, write, and hold modes, showing that the proposed D2LP10T cell exhibits the lowest failure rate among the existing cells considered. Furthermore, the D2LP10T cell offers 1.66×, 4.0×, and 1.15× higher write, read, and hold stability, respectively, compared to the 6T cell. Moreover, leakage power, write power-delay product (PDP), and read PDP are reduced by 89.96%, 80.52%, and 59.80%, respectively, compared to the 6T SRAM cell at a 0.4 V supply voltage. The functional improvement becomes even more apparent when the quality factor (QF) is evaluated, which is 458× higher for the proposed design than for the 6T SRAM cell at 0.4 V. A significant reduction in power dissipation, i.e., 46.07% and 74.55% for the read and hold operations respectively, is also observed for the R-VDD scaled architecture compared to the conventional array at a 0.4 V supply voltage.

Reconfigurable VDD scaled memory architecture 

A Reliable, Multi-bit Error Tolerant 11T SRAM Memory Design for Wireless Sensor Nodes

This work proposes an 11T SRAM cell whose reliability is confirmed for Internet of Things (IoT)-based health monitoring systems. The cell achieves improved write and read ability using data-dependent feedback-cutting and read-decoupled access-path mechanisms, respectively. The write and read stabilities of the proposed cell are 2.67× and 1.98× higher than those of the conventional 6T cell, with a 1.53× area overhead. Moreover, the improved soft-error tolerance and better reliability against negative bias temperature instability (NBTI) of the proposed 11T SRAM cell, compared to the other considered cells, make it suitable for biomedical implants. A low-power double adjacent bit error detection and correction (DAEDC) scheme is proposed to further improve the robustness of the designed 1 Kb bit-interleaved memory against soft errors. The leakage power of the proposed cell is controlled by the stacking devices used in its cross-coupled inverter pair, and a column-based read ground signal (RGND) further reduces unnecessary bit-line switching power in the array.
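
The sketch below illustrates, conceptually, why bit-interleaving matters for multi-bit error tolerance: physically adjacent cells belong to different logical words, so a double-adjacent soft error lands as at most one flipped bit per word, which a per-word correction scheme (such as the proposed DAEDC, whose exact encoding is not given here) can then handle. Word sizes and names are illustrative.

def interleave(words):
    # Physical row layout: bit k of every word sits next to bit k of its neighbours.
    width = len(words[0])
    return [words[w][k] for k in range(width) for w in range(len(words))]

def deinterleave(row, n_words, width):
    # Recover the logical words from the interleaved physical row.
    return [[row[k * n_words + w] for k in range(width)] for w in range(n_words)]

words = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
row = interleave(words)

row[5] ^= 1                                  # a double-adjacent soft error flips
row[6] ^= 1                                  # two neighbouring physical cells ...

damaged = deinterleave(row, n_words=4, width=4)
for original, got in zip(words, damaged):
    errors = sum(o != g for o, g in zip(original, got))
    print(errors)                            # 0, 1, 1, 0 -> at most one error per word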

Layout design of 1 Kb SRAM macro