I am a PhD researcher in the Computer Architecture Research Lab at the University of Cambridge, England. My current research lies at the intersection of cutting-edge machine learning and emerging computer architecture. Before embarking on a full-time research career, I spent almost a decade in industry (ARM UK and Broadcom UK) as a CPU subsystem architect and an ASIC design engineer, respectively. I was fortunate enough to receive several excellence awards and scholarships, including a Mentor Graphics prize for outstanding achievement in my master's degree. My earlier research work on Network-on-Chip architecture for the Gannet multi-core System-on-a-Chip won a best poster and a best presentation award from Epson Europe and the IET UK, respectively. I am also a former Chevening scholar at the University of Edinburgh and a fellow of the Cambridge Philosophical Society. My current research is funded by an EPSRC doctoral award. I'm very passionate about machine learning systems design, and in my spare time I love tinkering with deep neural networks. I'm also a community mentor at www.deeplearning.ai, where I support other learners in the field of deep learning.
Connect with me on LinkedIn.
Computing Architecture for Machine Learning and Computer Vision
Over the past few years, deep learning has helped forge advances in areas as diverse as image classification, machine translation, and voice recognition - all research topics that have long been difficult for AI researchers to crack. A subcategory of machine learning, deep learning uses neural networks to learn features automatically rather than relying on handcrafted programs. This has become possible because computers are much faster than they were a decade ago. Most deep learning models are massive and take a long time to train. Currently, researchers use multiple high-end GPUs to train such networks and to run inference on them. Considering the potential of deep learning, it would be very beneficial if we could deploy deep nets (CNN, RNN, LSTM etc.) on embedded platforms. However, large and complex network models work against the small silicon footprint and very limited power budget required for portable computers, tablets, and smartphones. My research interests lie in understanding the behaviour of these complex models, optimising them, and developing novel low-power compute accelerator architectures that can run them efficiently.
Ongoing Research Projects
Model Compression Using Low Rank Approximation - ADaPT
Model Agnostic Compute Intensity Reduction Using Fast Arithmetic and Software-hardware Co-design - 1DFALCON
@ARM Ltd.: Energy Efficient Inference on Application Class Processor (e.g. ARM Cortex-A)
@ARM Ltd.: Low Power Hardware Accelerator Design for Deep Learning
As cloud-based solutions do not fit every market segment (due to issues like privacy, security and real-time performance), there is a huge demand for low-power, high-performance neural network accelerators for battery-powered embedded products. In this project, I investigated ways to reduce the compute complexity of the state-of-the-art neural networks currently in use. I looked for alternative methods of implementing convolution that can run on embedded products. Like FFT-based approaches, Winograd’s fast convolution uses modular arithmetic to compute convolutions faster. I investigated whether a Winograd-based solution can be effective in embedded domains. This research resulted in a product proposal which will use the Winograd scheme for ARM’s future neural network accelerator. The final proposed solution has the potential to reduce the overall compute complexity of state-of-the-art convolutional neural networks by 2-4x. This would not only help neural network based applications run faster, but would also reduce the total power budget.
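To make the idea of fast convolution concrete, here is a minimal sketch of the smallest well-known Winograd case, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is only an illustration of the general technique and not the scheme from the product proposal; the function names and the sample values are assumptions chosen for this example.

```python
def direct_fir(d, g):
    """Two outputs of a 3-tap filter over a 4-sample tile: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def transform_filter(g):
    """Filter transform. It depends only on the weights, so it can be
    computed once offline and reused for every input tile."""
    return [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]


def winograd_f23(d, u):
    """Same two outputs, using 4 multiplications on a pre-transformed filter u."""
    m1 = (d[0] - d[2]) * u[0]
    m2 = (d[1] + d[2]) * u[1]
    m3 = (d[2] - d[1]) * u[2]
    m4 = (d[1] - d[3]) * u[3]
    return [m1 + m2 + m3, m2 - m3 - m4]


# Hypothetical sample data, chosen so the arithmetic is exact in floating point.
tile = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, -1.0, 2.0]

reference = direct_fir(tile, weights)
fast = winograd_f23(tile, transform_filter(weights))
print(reference, fast)
```

In one dimension this saves a third of the multiplications (4 instead of 6). Nesting the same transform over rows and columns for a 3x3 filter and a 2x2 output tile needs 16 multiplications instead of 36, a 2.25x reduction, at the cost of extra additions and larger intermediate tiles.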
Publications & Patents
Deep convolutional neural networks (CNNs), which are at the heart of many emerging applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity, limiting their deployability. In modern CNNs, convolutional layers typically consume around 90% of the processing time during forward inference, and accelerating these layers is of great research and commercial interest. In this paper, we examine the effects of co-optimizing the internal structures of convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speed-up of a CNN, achieving a tenfold increase over baseline. We also introduce a new class of fast 1-D convolutions for CNNs using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.
P. Maji, R. Mullins. The 26th International Conference on Artificial Neural Networks (ICANN), 2017 (Full, ORAL - Best Paper Candidate)
Breakthroughs from the field of deep learning are radically changing how sensor data are interpreted to extract important information to help advance healthcare, make our cities smarter, and innovate in smart home technology. Deep convolutional neural networks, which are at the heart of many emerging Internet-of-Things (IoT) applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity in convolutional layers, limiting their deployability. In this paper, we present an easy-to-implement acceleration scheme, named ADaPT, which can be applied to already available pre-trained networks. Our proposed technique exploits redundancy present in the convolutional layers to reduce computation and storage requirements. We also decompose each convolutional layer into two consecutive one-dimensional stages to make full use of the approximate model. This technique can easily be applied to existing low-power processors, GPUs or new accelerators. We evaluated this technique using four diverse and widely used benchmarks, on hardware ranging from embedded CPUs to server GPUs. Our experiments show an average 3-5x speed-up in all deep models and a maximum 8-9x speed-up on many individual convolutional layers. We demonstrate that, unlike iterative pruning-based methods, our approximation technique is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.
P. Maji, D. Bates, A. Chadwick, and R. Mullins. In the ACM International Conference on Internet of Things and Machine Learning (IML), 2017 (Full, ORAL)
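The idea of splitting a convolutional layer into two consecutive one-dimensional stages can be illustrated with a small sketch. It assumes the simplest possible case, a 3x3 kernel that is exactly rank-1 (separable), and checks that a horizontal pass followed by a vertical pass gives the same result as the direct 2-D operation. The kernel factors, the test image and all function names are hypothetical, and the low-rank approximation step that ADaPT applies to pre-trained kernels is not reproduced here.

```python
def conv2d_valid(image, kernel):
    """Direct 2-D 'valid' correlation: k*k multiplications per output."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv_rows(image, h):
    """Horizontal 1-D stage: slide h along every row."""
    k = len(h)
    return [
        [sum(row[c + j] * h[j] for j in range(k))
         for c in range(len(row) - k + 1)]
        for row in image
    ]


def conv_cols(image, v):
    """Vertical 1-D stage: slide v down every column."""
    k = len(v)
    return [
        [sum(image[r + i][c] * v[i] for i in range(k))
         for c in range(len(image[0]))]
        for r in range(len(image) - k + 1)
    ]


v = [1, 2, 1]    # hypothetical vertical factor
h = [1, 0, -1]   # hypothetical horizontal factor
kernel = [[vi * hj for hj in h] for vi in v]  # rank-1 (separable) 3x3 kernel

# Small deterministic integer image, so the comparison below is exact.
image = [[(3 * r + 5 * c) % 7 for c in range(5)] for r in range(5)]

direct = conv2d_valid(image, kernel)
two_stage = conv_cols(conv_rows(image, h), v)
print(direct == two_stage)
```

For a k x k kernel, each 1-D stage costs k multiplications per value it produces, so the two stages together cost roughly 2k per output instead of k*k. Kernels that are only approximately low rank need more than one such pair of 1-D filters, which is where the trade-off between accuracy and speed comes in.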
Deep Convolutional Networks (ConvNets) have demonstrated state-of-the-art performance in many machine learning problems involving image classification and speech recognition. Over the last few years, several advances in the design of ConvNets have not only led to a further boost in achieved accuracy on image recognition tasks but also played a crucial role as a feature generator for other machine learning tasks such as object detection, localization, semantic segmentation and image retrieval. However, the complexity and size of ConvNets have limited their use in mobile applications and embedded systems. The aim of my research is to find ways to optimize these deep neural networks using model-architecture co-design and enable mass deployment of deep-learning based applications in consumer products.
P. Maji, R. Mullins. In the ARM-Cambridge Research Showcase,
Poster Session, Cambridge Big Data,
Maxwell Centre, University of Cambridge, Dec 2016.
A bidirectional communications link between a master device and a slave device includes first endpoint circuitry coupled to the master device generating forward data packets, second endpoint circuitry coupled to the slave device for receiving reverse data packets, and bidirectional communication circuitry for transferring forward data packets from the first endpoint circuitry to the second endpoint circuitry and reverse data packets from the second endpoint circuitry to the first endpoint circuitry. In response to a power down condition requiring a power down of at least one of the first endpoint circuitry and the second endpoint circuitry, performance of said power down is deferred until both said outstanding forward credit signal and said outstanding reverse credit signal have been deasserted.
P. Maji, S. R. Mellor, ARM Ltd. US Patent #20130268705 A1, Issued Oct 2013.
A system-on-chip integrated circuit includes a packet transmitter for generating data packets to be sent via a communication circuit to a packet receiver containing a buffer circuit. A transmitter counter stores a transmitter count value counting data packets sent. A receiver counter stores a receiver count value tracking data packets emptied from the buffer circuit. Comparison circuitry is used to compare the transmitter count value and the receiver count value to determine whether or not there is storage space available within the buffer circuit to receive transmission of further data packets. The packet transmitter operates in a transmitter clock domain that is asynchronous from a receiver clock domain in which the packet receiver operates. One of the count values is passed across this asynchronous clock boundary in order that the comparison may be performed and flow control exercised.
P. Maji, S. R. Mellor, ARM Ltd. US Patent #20130251006 A1, Issued Sep 2013.
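The counter comparison described in the abstract above can be sketched as a toy software model. This is not the patented circuit: the class name, the counter width, the wrap-around handling and the explicit `sync` step (standing in for passing the receiver count across the clock boundary) are all assumptions made for illustration.

```python
class CounterFlowControl:
    """Toy model of flow control by comparing two wrapping counters."""

    def __init__(self, buffer_depth, counter_bits=4):
        # Assumption: counters wrap modulo 2**counter_bits, and the modulus
        # exceeds the buffer depth so the difference is never ambiguous.
        self.modulus = 1 << counter_bits
        if buffer_depth >= self.modulus:
            raise ValueError("counter too narrow for this buffer depth")
        self.buffer_depth = buffer_depth
        self.tx_count = 0       # packets sent (transmitter clock domain)
        self.rx_count = 0       # packets emptied from the buffer (receiver domain)
        self.rx_count_seen = 0  # transmitter's possibly stale copy of rx_count

    def can_send(self):
        """Compare the counters to decide whether buffer space is available."""
        outstanding = (self.tx_count - self.rx_count_seen) % self.modulus
        return outstanding < self.buffer_depth

    def send(self):
        """Send one packet if allowed; return whether it was sent."""
        if not self.can_send():
            return False
        self.tx_count = (self.tx_count + 1) % self.modulus
        return True

    def drain(self):
        """Receiver empties one packet from its buffer, if any is held."""
        if (self.tx_count - self.rx_count) % self.modulus == 0:
            return False
        self.rx_count = (self.rx_count + 1) % self.modulus
        return True

    def sync(self):
        """Stand-in for passing the receiver count across the clock boundary."""
        self.rx_count_seen = self.rx_count


link = CounterFlowControl(buffer_depth=2)
sent = [link.send() for _ in range(3)]   # third attempt finds the buffer full
link.drain()
blocked_before_sync = not link.can_send()
link.sync()
allowed_after_sync = link.can_send()
print(sent, blocked_before_sync, allowed_after_sync)
```

In this model a stale copy of the receiver count only makes the transmitter overestimate how full the buffer is, so a delayed crossing costs throughput but never overflows the buffer.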
Networks-on-Chip (NoC) have emerged as an alternative to buses to provide a packet-switched communication medium for modular development of large Systems-on-Chip. However, to successfully replace its predecessor, the NoC has to be able to efficiently exchange all types of traffic including collective communications. The latter is especially important for e.g. cache updates in multicore systems. The Quarc NoC architecture has been introduced as a Network-on-Chip which is highly efficient in exchanging all types of traffic including broadcast and multicast. In this paper we present the hardware implementation of the switch architecture and the network adapter (transceiver) of the Quarc NoC. Moreover, the paper presents an analysis and comparison of the cost and performance between the Quarc and the Spidergon NoCs implemented in Verilog targeting the Xilinx Virtex FPGA family. We demonstrate a dramatic improvement in performance over the Spidergon, especially for broadcast traffic, at no additional hardware cost.
P. Maji, M. Moadeli, and W. Vanderbauwhede. In the IEEE 23rd International Parallel & Distributed Processing Symposium (IPDPS), Rome, Italy, May 2009.
P. Maji, R. Mullins. In the Microsoft PhD Summer School Poster Session, Microsoft Research (MSR), Cambridge, July 2016.
M. Moadeli, P. Maji, and W. Vanderbauwhede. In the IEEE 23rd International Conference on Advanced Information Networking & Applications (AINA), Bradford, UK, May 2009.
M. Moadeli, A. Shahrabi, W. Vanderbauwhede and P. Maji. In proceedings of the Journal of Systems Architecture – Embedded Systems Design (JSA), 2010.
P. Maji, W. Vanderbauwhede, and F. Rodriguez. In iSLI Annual day student poster presentation, Alba Centre, Livingston, Scotland (Awarded Best Poster).
P. Maji. In the IET Scotland Present Around the World competition, 2009, at the Old School, the University of Edinburgh (Awarded Best Presentation).
P. Maji, MSc Thesis, September 2008.