I am a PhD researcher in the Computer Architecture Research Lab at the University of Cambridge, England. My current research lies at the intersection of cutting-edge machine learning and emerging computer architecture. Before embarking on a full-time research career, I spent almost a decade in industry (ARM UK and Broadcom UK) as a CPU subsystem architect and an ASIC design engineer, respectively. I was fortunate enough to receive several excellence awards and scholarships, including a Mentor Graphics prize for outstanding achievement in my master's degree. My earlier research on Network-on-Chip architecture for the Gannet multi-core System-on-a-Chip won a best poster award and a best presentation award from Epson Europe and the IET UK, respectively. I am also a former Chevening scholar at the University of Edinburgh and a fellow of the Cambridge Philosophical Society. My current research is funded by an EPSRC doctoral award. I am passionate about machine learning systems design, and in my spare time I love tinkering with deep neural networks. I am also a community mentor at www.deeplearning.ai, where I support other learners in the field of deep learning.
Connect with me on LinkedIn.
Optimisation of Computing Platform for Emerging Machine Learning Workloads
Over the past few years, deep learning has helped forge advances in areas as diverse as image classification, machine translation, and voice recognition, all research topics that have long been difficult for AI researchers to crack. A subcategory of machine learning, deep learning uses neural networks to learn features automatically rather than relying on handcrafted programs. This has become possible because computers are now much faster than they were a decade ago. Most deep learning models are massive and take a long time to train. Currently, researchers use multiple high-end GPUs to train such networks and to run inference on them. Given the potential of deep learning, it would be very beneficial to deploy deep nets (CNN, RNN, LSTM etc.) on embedded platforms. However, large and complex network models work against the small silicon footprint and very limited power budget required for portable computers, tablets, and smartphones. My research interests lie in understanding the behaviour of these complex models, optimising them, and developing novel low-power compute accelerator architectures that can run them efficiently. My approach relies heavily on cross-layer optimisation across the stack, from the model and its implementation down to the underlying hardware architecture. I also aim for solutions that are mathematically guided and rest on solid reasoning.
Ongoing Research Projects
- ADaPT: Compressing Deep Convolutional Neural Network (CNN) Using Low Rank Approximation (Sparse to Dense Model)
- 1D-FALCON: Model Agnostic Compute Intensity Reduction Using Fast Arithmetic and Software-hardware Co-design
- Energy Efficient Inference on ARM Cortex-A Mobile Processors Using Winograd Convolution (ongoing collaboration with the ARM research group)
- Low Power Custom Hardware Accelerator Design for Deep Learning
- On Calibration of Confidence in Deep Networks - Estimate, Improve and Deploy
Note: If you are interested in collaborating with me, I would love to hear from you.
Publications & Patents
Deep convolutional neural networks (ConvNets), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, achieving accuracy often implies significant computational costs, limiting deployability. In modern ConvNets it is typical for the convolution layers to consume the vast majority of computational resources during inference. This has made the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and underlying implementation of fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speedup of a ConvNet, achieving a ten-fold increase over baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets using the Toom–Cook algorithm. We show that our proposed scheme is mathematically well-grounded, robust, and does not require any time-consuming retraining, while still achieving speedups solely from convolutional layers with no loss in baseline accuracy.
P. Maji, R. Mullins. Entropy 2018, 20, 305. - An MDPI Journal
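To give a flavour of what a fast one-dimensional convolution looks like, here is a minimal Python sketch of the textbook Toom-Cook/Winograd F(2,3) construction, which produces two outputs of a 3-tap filter with four multiplications instead of six. It is an illustration of the general idea only and is not taken from the paper; the function names are hypothetical.

```python
def direct_fir(d, g):
    """Two outputs of a 3-tap filter computed directly: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def transform_filter(g):
    """Filter transform; done once per filter, so its cost is amortised."""
    return [
        g[0],
        (g[0] + g[1] + g[2]) / 2,
        (g[0] - g[1] + g[2]) / 2,
        g[2],
    ]


def fast_fir(d, u):
    """Same two outputs from a 4-sample tile using only 4 multiplications."""
    m1 = (d[0] - d[2]) * u[0]
    m2 = (d[1] + d[2]) * u[1]
    m3 = (d[2] - d[1]) * u[2]
    m4 = (d[1] - d[3]) * u[3]
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1.0, 2.0, 3.0, 4.0]   # input tile (illustrative values)
g = [0.5, -1.0, 2.0]       # filter taps (illustrative values)
fast = fast_fir(d, transform_filter(g))
direct = direct_fir(d, g)
```

Because the filter transform is computed once and reused for every input tile, the per-output multiplication count drops from 3 to 2 in this small case.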
Deep convolutional neural networks (CNNs), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity, limiting their deployability. In modern CNNs, convolutional layers typically consume around 90% of the processing time during a forward inference, and acceleration of these layers is of great research and commercial interest. In this paper, we examine the effects of co-optimizing the internal structures of convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speed-up of a CNN, achieving a tenfold increase over baseline. We also introduce a new class of fast 1-D convolutions for CNNs using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.
P. Maji, R. Mullins. The 26th International Conference on Artificial Neural Networks (ICANN), 2017 (Full, ORAL - Best Paper Candidate)
Breakthroughs from the field of deep learning are radically changing how sensor data are interpreted to extract important information to help advance healthcare, make our cities smarter, and innovate in smart home technology. Deep convolutional neural networks, which are at the heart of many emerging Internet-of-Things (IoT) applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity in convolutional layers, limiting their deployability. In this paper, we present an easy-to-implement acceleration scheme, named ADaPT, which can be applied to already available pre-trained networks. Our proposed technique exploits redundancy present in the convolutional layers to reduce computation and storage requirements. We also decompose each convolution layer into two consecutive one-dimensional stages to make full use of the approximate model. This technique can easily be applied to existing low-power processors, GPUs or new accelerators. We evaluated this technique using four diverse and widely used benchmarks, on hardware ranging from embedded CPUs to server GPUs. Our experiments show an average 3-5x speed-up in all deep models and a maximum 8-9x speed-up on many individual convolutional layers. We demonstrate that, unlike iterative pruning-based methodologies, our approximation technique is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.
P. Maji, D. Bates, A. Chadwick, and R. Mullins. In the ACM International Conference on Internet of Things and Machine Learning (ACM - IML), 2017 (Full, ORAL)
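The idea of replacing a two-dimensional convolution with two consecutive one-dimensional stages can be sketched in a few lines of NumPy. The example below is a simplified single-channel illustration, assuming a rank-1 kernel factored with an SVD; it is not the ADaPT algorithm itself, and all function names are hypothetical.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Plain sliding-window correlation with no padding (reference version)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def rank1_factors(kernel):
    """Best rank-1 approximation of the kernel as (column filter, row filter)."""
    u, s, vt = np.linalg.svd(kernel)
    scale = np.sqrt(s[0])
    return u[:, 0] * scale, vt[0, :] * scale


def separable_correlate(image, col, row):
    """Vertical 1D stage followed by a horizontal 1D stage."""
    vertical = correlate2d_valid(image, col[:, None])
    return correlate2d_valid(vertical, row[None, :])


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
# A Sobel-like kernel is exactly rank 1, so the factorisation is lossless here.
kernel = np.outer([1.0, 2.0, 1.0], [-1.0, 0.0, 1.0])

col, row = rank1_factors(kernel)
full = correlate2d_valid(image, kernel)
two_stage = separable_correlate(image, col, row)
max_error = float(np.max(np.abs(full - two_stage)))
```

For a k x k kernel, the two-stage form needs 2k multiplications per output instead of k squared. For kernels that are not exactly rank 1, the same factorisation gives an approximation whose error depends on the discarded singular values.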
Energy Efficient Inference on ARM Cortex-A Mobile Processors (Work-in-Progress)
Deep learning has led to advances in many embedded applications that require real-time processing such as image classification, machine translation, and voice recognition. However, the size and compute complexity of many deep neural networks frequently necessitate performing inferences using a cloud-based infrastructure or making modifications to the network by pruning or reducing bit-precision. In this paper, we explore how state-of-the-art deep convolutional neural networks (CNNs) can be implemented directly on a modern Arm Cortex-A CPU, widely used in smartphones and tablets today. Specifically, we demonstrate a reduction in compute complexity and time through the use of Winograd convolution algorithms, and by effectively leveraging the Armv8-A NEON SIMD instruction set. We evaluated these techniques on an Arm Cortex-A73 with several representative CNNs and show an average 1.5-2.5x per-layer speedup over aggressively optimized im2row/im2col techniques, and a peak 3.5x speedup on many individual layers. Our techniques can be readily deployed to other ARM Cortex-A processors and provide an alternative to the use of cloud-based solutions while preserving the model architecture and pre-trained accuracy.
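For readers unfamiliar with the im2row baseline mentioned above, the following NumPy sketch shows the basic lowering: every input patch becomes one row of a matrix, so the convolution turns into a single matrix-vector (or matrix-matrix) product. It is a single-channel, unoptimised illustration with hypothetical function names, and not the NEON implementation evaluated in the paper.

```python
import numpy as np


def im2row(image, kh, kw):
    """Copy every kh x kw patch of the image into one row of a matrix."""
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    rows = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            rows[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    return rows, (oh, ow)


def conv_via_im2row(image, kernel):
    """Convolution (correlation form, no padding) as one matrix product."""
    rows, out_shape = im2row(image, *kernel.shape)
    return (rows @ kernel.ravel()).reshape(out_shape)


def conv_direct(image, kernel):
    """Reference sliding-window version used to check the lowering."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(1)
image = rng.standard_normal((6, 7))
kernel = rng.standard_normal((3, 3))
lowered = conv_via_im2row(image, kernel)
reference = conv_direct(image, kernel)
```

The lowering trades extra memory (patches overlap, so data is duplicated) for the ability to reuse a highly tuned matrix-multiply routine.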
A bidirectional communications link between a master device and a slave device includes first endpoint circuitry coupled to the master device generating forward data packets, second endpoint circuitry coupled to the slave device for receiving reverse data packets, and bidirectional communication circuitry for transferring forward data packets from the first endpoint circuitry to the second endpoint circuitry and reverse data packets from the second endpoint circuitry to the first endpoint circuitry. In response to a power down condition requiring a power down of at least one of the first endpoint circuitry and the second endpoint circuitry, performance of said power down is deferred until both said outstanding forward credit signal and said outstanding reverse credit signal have been deasserted.
P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8924612B2, Granted on 2014-12-30.
Deep Convolutional Networks (ConvNets) have demonstrated state-of-the-art performance in many machine learning problems involving image classification and speech recognition. Over the last few years, several advances in the design of ConvNets have not only led to a further boost in achieved accuracy on image recognition tasks but have also played a crucial role as a feature generator for other machine learning tasks such as object detection, localization, semantic segmentation and image retrieval. However, the complexity and size of ConvNets have limited their use in mobile applications and embedded systems. The aim of my research is to find ways to optimize these deep neural networks using model-architecture co-design and enable mass deployment of deep-learning based applications in consumer products.
P. Maji, R. Mullins. In the ARM-Cambridge Research Showcase, Poster Session, Cambridge Big Data, Maxwell Centre, University of Cambridge, Dec 2016.
A system-on-chip integrated circuit includes a packet transmitter for generating data packets to be sent via a communication circuit to a packet receiver containing a buffer circuit. A transmitter counter stores a transmitter count value counting data packets sent. A receiver counter stores a receiver count value tracking data packets emptied from the buffer circuit. Comparison circuitry is used to compare the transmitter count value and the receiver count value to determine whether or not there is storage space available within the buffer circuit to receive transmission of further data packets. The packet transmitter operates in a transmitter clock domain that is asynchronous from a receiver clock domain in which the packet receiver operates. One of the count values is passed across this asynchronous clock boundary in order that the comparison may be performed and flow control exercised.
P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8630358B2, Granted on 2014-01-14.
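The counter comparison described in the abstract above can be modelled in software as follows. This is a behavioural sketch only: it assumes wrap-around counters of a fixed width and ignores the asynchronous clock crossing entirely (in hardware the transmitter would see a delayed copy of the receiver count). The class name, method names, and the `counter_bits` parameter are hypothetical and are not taken from the patent.

```python
class CounterFlowControl:
    """Behavioural model of counter-based flow control for a packet buffer."""

    def __init__(self, buffer_depth, counter_bits=4):
        self.modulus = 1 << counter_bits
        # The counters must be wide enough to tell "full" apart from "empty".
        if buffer_depth >= self.modulus:
            raise ValueError("buffer_depth must be smaller than 2**counter_bits")
        self.buffer_depth = buffer_depth
        self.tx_count = 0  # packets sent by the transmitter
        self.rx_count = 0  # packets emptied from the receiver buffer

    def in_flight(self):
        """Packets sent but not yet emptied, computed modulo the counter width."""
        return (self.tx_count - self.rx_count) % self.modulus

    def can_send(self):
        """Comparison step: is there free space in the receiver buffer?"""
        return self.in_flight() < self.buffer_depth

    def send(self):
        """Send one packet if space is available; return whether it was sent."""
        if not self.can_send():
            return False
        self.tx_count = (self.tx_count + 1) % self.modulus
        return True

    def drain(self):
        """Empty one packet from the buffer; return whether one was present."""
        if self.in_flight() == 0:
            return False
        self.rx_count = (self.rx_count + 1) % self.modulus
        return True


link = CounterFlowControl(buffer_depth=2, counter_bits=2)
results = [link.send(), link.send(), link.send()]  # third send is refused
link.drain()
results.append(link.send())                        # space is available again
```

A stale receiver count can only make the transmitter underestimate the free space, so in this model a delayed comparison is conservative rather than unsafe.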
Networks-on-Chip (NoC) have emerged as an alternative to buses to provide a packet-switched communication medium for modular development of large Systems-on-Chip. However, to successfully replace its predecessor, the NoC has to be able to efficiently exchange all types of traffic, including collective communications. The latter is especially important, for example, for cache updates in multicore systems. The Quarc NoC architecture has been introduced as a Network-on-Chip which is highly efficient in exchanging all types of traffic, including broadcast and multicast. In this paper we present the hardware implementation of the switch architecture and the network adapter (transceiver) of the Quarc NoC. Moreover, the paper presents an analysis and comparison of the cost and performance between the Quarc and the Spidergon NoCs implemented in Verilog targeting the Xilinx Virtex FPGA family. We demonstrate a dramatic improvement in performance over the Spidergon, especially for broadcast traffic, at no additional hardware cost.
P. Maji, M. Moadeli, and W. Vanderbauwhede. In the IEEE 23rd International Parallel & Distributed Processing Symposium (IPDPS), Rome, Italy, May 2009.
P. Maji, R. Mullins. In the Microsoft PhD Summer School Poster Session, Microsoft Research (MSR), Cambridge, July 2016.
M. Moadeli, P. Maji, and W. Vanderbauwhede. In the IEEE 23rd International Conference on Advanced Information Networking & Applications (AINA), Bradford, UK, May 2009.
M. Moadeli, A. Shahrabi, W. Vanderbauwhede and P. Maji. In proceedings of the Journal of Systems Architecture – Embedded Systems Design (JSA), 2010.
P. Maji, W. Vanderbauwhede, and F. Rodriguez. In iSLI Annual day student poster presentation, Alba Centre, Livingston, Scotland (Awarded Best Poster).
P. Maji. In the IET Scotland Present Around the World competition, 2009, at the Old School, the University of Edinburgh (Awarded Best Presentation).
P. Maji, MSc Thesis, September 2008.