I am a research scientist at Arm Cambridge Research. I am a PhD researcher in the Computer Architecture Research Lab of the University of Cambridge, England. My current research lies at the intersection of cutting-edge machine learning and emerging computer architecture. Before embarking on a full-time research career, I spent almost a decade in industry (ARM UK and Broadcom UK) as a CPU subsystem architect and an ASIC design engineer, respectively. I was fortunate enough to receive several excellence awards and scholarships, including a Mentor Graphics prize for outstanding achievement in my master's degree. My earlier research work on Network-on-Chip architecture for the Gannet multi-core System-on-a-Chip won the best poster and best presentation awards from Epson Europe and the IET UK. I am also a former Chevening scholar at the University of Edinburgh and a fellow of the Cambridge Philosophical Society. My current research is funded by an EPSRC doctoral award. I'm very passionate about machine learning systems design, and in my spare time I love tinkering with deep neural networks. I'm also a community mentor at www.deeplearning.ai, where I support other learners in the field of deep learning.
You can find more up-to-date information about me on LinkedIn.
Co-design of Machine Learning Architecture and Hardware for Embedded Systems
Over the past few years, deep learning has helped forge advances in areas as diverse as image classification, machine translation, and voice recognition, all research topics that have long been difficult for AI researchers to crack. A subcategory of machine learning, deep learning uses neural networks to learn features automatically rather than relying on handcrafted programs. This has become possible because computers have become much faster than they were a decade ago. Most deep learning models are massive and take a long time to train. Currently, researchers use multiple high-end GPUs to train such networks and run inference with them. Considering its future potential, it would be very beneficial if we could deploy deep nets (CNN, RNN, LSTM etc.) on embedded platforms. However, large and complex network models work against the small silicon footprint and very limited power budget required for portable computers, tablets, and smartphones. My research interests lie in understanding the behavior of these complex models, optimizing them, and developing novel low-power compute accelerator architectures that can run them efficiently. My approach relies heavily on cross-layer optimization across the stack, from the model and its implementation to the underlying hardware architecture. I also aim for solutions that are mathematically guided and rest on solid reasoning.
Ongoing Research Projects
- ADaPT: Compressing Deep Convolutional Neural Network (CNN) Using Low Rank Approximation (Sparse to Dense Model)
- 1D-FALCON: Model Agnostic Compute Intensity Reduction Using Fast Arithmetic and Software-hardware Co-design
- Energy Efficient Inference on ARM Cortex-A Mobile Processors Using Winograd or Cook-Toom Class of Convolution (Collaboration with ARM ML research group)
- Low Power Custom Hardware (ASIC) Accelerator Design for Deep Learning (specifically convolutional neural network)
- Energy Efficient Inference by Conditional Computation - Skipping, Dropping, or Exiting Early
(Note: If you are interested in collaborating with me, I would love to hear from you.)
Publications & Patents
P. Maji (Cambridge), A. Mundy (ARM Research), G. Dasika (ARM Research), J. Beu (ARM Research), M. Mattina (ARM Research), and R. Mullins (Cambridge).
The Winograd or Cook-Toom class of algorithms helps to reduce the overall compute complexity of many modern deep convolutional neural networks (CNNs). Although there has been a lot of research on model and algorithmic optimization of CNNs, little attention has been paid to the efficient implementation of these algorithms on embedded CPUs, which usually have limited memory and a low power budget. This research work aims to fill this gap and focuses on the efficient implementation of Winograd or Cook-Toom based convolution on modern Arm Cortex-A CPUs, widely used in mobile devices today. Specifically, we demonstrate a reduction in inference latency by using a set of optimization strategies that improve the utilization of computational resources, and by effectively leveraging the ARMv8-A NEON SIMD instruction set. We evaluated our proposed region-wise multi-channel implementations on an Arm Cortex-A73 platform using several representative CNNs. The results show significant full-network performance improvements, up to 60%, over existing im2row/im2col based optimization techniques.
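To make the idea behind this class of algorithms concrete, here is a minimal sketch of the textbook F(2,3) case: two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is only an illustration in plain Python, not the region-wise multi-channel NEON implementation described in the abstract; the function names `direct_fir` and `winograd_f23` are hypothetical.

```python
def direct_fir(d, g):
    """Direct form: 2 outputs from 4 inputs and a 3-tap filter (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd/Cook-Toom F(2,3): the same 2 outputs with 4 multiplications."""
    # Filter transform: depends only on the weights, so it can be
    # precomputed once per filter and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]

    # Element-wise products: the only 4 multiplications per tile.
    m0, m1, m2, m3 = v0 * u0, v1 * u1, v2 * u2, v3 * u3

    # Output transform: additions and subtractions only.
    return [m0 + m1 + m2, m1 - m2 - m3]


tile = [1.0, 2.0, 3.0, 4.0]
taps = [1.0, 0.0, -1.0]
print(direct_fir(tile, taps), winograd_f23(tile, taps))
```

The saving comes from moving work out of the multiplications and into cheap additions, plus a filter transform that is paid once per filter rather than once per tile.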
P. Maji, R. Mullins. (Journal Entropy 2018 - Special Issue with Selected Papers)
Deep convolutional neural networks (ConvNets), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, achieving high accuracy often implies significant computational costs, limiting deployability. In modern ConvNets it is typical for the convolution layers to consume the vast majority of computational resources during inference. This has made the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speedup of a ConvNet, achieving a ten-fold increase over baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets using the Toom–Cook algorithm. We show that our proposed scheme is mathematically well-grounded, robust, and does not require any time-consuming retraining, while still achieving speedups solely from convolutional layers with no loss in baseline accuracy.
P. Maji, R. Mullins. The 26th International Conference on Artificial Neural Networks (ICANN), 2017 (Full, ORAL - Best Paper Candidate)
Deep convolutional neural networks (CNNs), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity, limiting their deployability. In modern CNNs, convolutional layers typically consume about 90% of the processing time during a forward inference, and acceleration of these layers is of great research and commercial interest. In this paper, we examine the effects of co-optimizing the internal structures of convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speed-up of a CNN, achieving a tenfold increase over baseline. We also introduce a new class of fast 1-D convolutions for CNNs using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speedups solely from convolutional layers with no loss in baseline accuracy.
P. Maji, D. Bates, A. Chadwick, and R. Mullins. In the ACM International Conference on Internet of Things and Machine Learning (ACM - IML), 2017 (Full, ORAL)
Breakthroughs from the field of deep learning are radically changing how sensor data are interpreted to extract important information to help advance healthcare, make our cities smarter, and innovate in smart home technology. Deep convolutional neural networks, which are at the heart of many emerging Internet-of-Things (IoT) applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity in convolutional layers, limiting their deployability. In this paper, we present an easy-to-implement acceleration scheme, named ADaPT, which can be applied to already available pre-trained networks. Our proposed technique exploits redundancy present in the convolutional layers to reduce computation and storage requirements. We also decompose each convolution layer into two consecutive one-dimensional stages to make full use of the approximate model. This technique can easily be applied to existing low-power processors, GPUs or new accelerators. We evaluated this technique using four diverse and widely used benchmarks, on hardware ranging from embedded CPUs to server GPUs. Our experiments show an average 3-5x speed-up in all deep models and a maximum 8-9x speed-up on many individual convolutional layers. We demonstrate that, unlike iterative pruning-based methodologies, our approximation technique is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.
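The "two consecutive one-dimensional stages" idea can be sketched as follows. This is not the ADaPT algorithm itself: it assumes a single-channel kernel that is exactly rank 1 (the outer product of a column vector `v` and a row vector `h`), whereas a low-rank scheme for a general kernel would first have to approximate it by such factors. All names are hypothetical.

```python
def conv2d_valid(img, k):
    """Direct 2D 'valid' correlation: kh * kw multiplications per output."""
    kh, kw = len(k), len(k[0])
    rows, cols = len(img), len(img[0])
    return [
        [
            sum(img[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(cols - kw + 1)
        ]
        for i in range(rows - kh + 1)
    ]


def separable_conv2d_valid(img, v, h):
    """Same result for the kernel outer(v, h), using two 1D passes."""
    rows, cols = len(img), len(img[0])
    # Stage 1: vertical 1D pass with the column factor v.
    tmp = [
        [sum(img[i + a][j] * v[a] for a in range(len(v))) for j in range(cols)]
        for i in range(rows - len(v) + 1)
    ]
    # Stage 2: horizontal 1D pass with the row factor h.
    return [
        [sum(row[j + b] * h[b] for b in range(len(h))) for j in range(cols - len(h) + 1)]
        for row in tmp
    ]


# Illustrative rank-1 kernel (a Sobel filter) and a small synthetic image.
v = [1, 2, 1]
h = [1, 0, -1]
kernel = [[a * b for b in h] for a in v]
image = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]

full = conv2d_valid(image, kernel)
fast = separable_conv2d_valid(image, v, h)
print(full == fast)
```

For a 3x3 kernel the direct form needs 9 multiplications per output, while the two 1D passes need roughly 6, and the gap widens as the kernel grows.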
P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8924612B2, Granted on 2014-12-30.
A bidirectional communications link between a master device and a slave device includes first endpoint circuitry coupled to the master device generating forward data packets, second endpoint circuitry coupled to the slave device for receiving reverse data packets, and bidirectional communication circuitry for transferring forward data packets from the first endpoint circuitry to the second endpoint circuitry and reverse data packets from the second endpoint circuitry to the first endpoint circuitry. In response to a power down condition requiring a power down of at least one of the first endpoint circuitry and the second endpoint circuitry, performance of said power down is deferred until both said outstanding forward credit signal and said outstanding reverse credit signal have been deasserted.
Deep Convolutional Networks (ConvNets) have demonstrated state-of-the-art performance in many machine learning problems involving image classification and speech recognition. Over the last few years, several advances in the design of ConvNets have not only led to a further boost in achieved accuracy on image recognition tasks but also played a crucial role as a feature generator for other machine learning tasks such as object detection, localization, semantic segmentation and image retrieval. However, the complexity and size of ConvNets have limited their use in mobile applications and embedded systems. The aim of my research is to find ways to optimize these deep neural networks using model-architecture co-design and enable mass deployment of deep-learning based applications in consumer products.
P. Maji, R. Mullins. In the ARM-Cambridge Research Showcase,
Poster Session, Cambridge Big Data,
Maxwell Centre, University of Cambridge, Dec 2016.
P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8630358B2, Granted on 2014-01-14.
A system-on-chip integrated circuit includes a packet transmitter for generating data packets to be sent via a communication circuit to a packet receiver containing a buffer circuit. A transmitter counter stores a transmitter count value counting data packets sent. A receiver counter stores a receiver count value tracking data packets emptied from the buffer circuit. Comparison circuitry compares the transmitter count value and the receiver count value to determine whether or not there is storage space available within the buffer circuit to receive transmission of further data packets. The packet transmitter operates in a transmitter clock domain that is asynchronous from a receiver clock domain in which the packet receiver operates. One of the count values is passed across this asynchronous clock boundary in order that the comparison may be performed and flow control exercised.
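The counter comparison described above can be modelled in software as a rough sketch. It is a behavioural illustration only, not the patented circuit: it ignores the asynchronous clock boundary entirely, and the class name `CountedLink`, the wrap-around counter width, and the buffer depth are all assumptions made for the example.

```python
from collections import deque


class CountedLink:
    """Toy model of flow control by comparing sent and emptied packet counts."""

    def __init__(self, depth, counter_bits=4):
        # Assumption: counters wrap, so the depth must fit below the wrap point.
        assert 0 < depth < (1 << counter_bits)
        self.depth = depth
        self.mask = (1 << counter_bits) - 1
        self.tx_count = 0  # packets sent by the transmitter
        self.rx_count = 0  # packets emptied from the receiver buffer
        self.buffer = deque()

    def space_available(self):
        # Modular difference = packets still occupying the buffer.
        occupied = (self.tx_count - self.rx_count) & self.mask
        return occupied < self.depth

    def send(self, packet):
        """Transmit only if the comparison shows free buffer space."""
        if not self.space_available():
            return False
        self.buffer.append(packet)
        self.tx_count = (self.tx_count + 1) & self.mask
        return True

    def drain(self):
        """Receiver empties one packet and advances its counter."""
        packet = self.buffer.popleft()
        self.rx_count = (self.rx_count + 1) & self.mask
        return packet


link = CountedLink(depth=2, counter_bits=2)
print(link.send("a"), link.send("b"), link.send("c"))  # third send is refused
print(link.drain(), link.send("c"))
```

Because only the difference between the two counts matters, the comparison keeps working when the counters wrap, as long as the buffer depth is smaller than the counter range.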
My Hobby Set Top Box Project
Inspired by industrial-scale streaming Set Top Boxes, I built a very elementary networked Set Top Box in my spare time using just discrete chips. It uses 8 symmetric 32-bit processors (cores) with 32KB of RAM running at 80MHz. The complete STB solution works with any NTSC or PAL TV that has a composite (RCA) input, and with any network router that supports DHCP. It does not support Wi-Fi. I also built a few basic widgets to go with the hardware. The final PCB is somewhere between a Costa club card and a Raspberry Pi in size.
P. Maji, M. Moadeli, and W. Vanderbauwhede. In the IEEE 23rd International Parallel & Distributed Processing Symposium (IPDPS), Rome, Italy, May 2009.
Networks-on-Chip (NoC) have emerged as an alternative to buses to provide a packet-switched communication medium for modular development of large Systems-on-Chip. However, to successfully replace its predecessor, the NoC has to be able to efficiently exchange all types of traffic including collective communications. The latter is especially important for, e.g., cache updates in multicore systems. The Quarc NoC architecture has been introduced as a Network-on-Chip which is highly efficient in exchanging all types of traffic including broadcast and multicast. In this paper we present the hardware implementation of the switch architecture and the network adapter (transceiver) of the Quarc NoC. Moreover, the paper presents an analysis and comparison of the cost and performance between the Quarc and the Spidergon NoCs implemented in Verilog targeting the Xilinx Virtex FPGA family. We demonstrate a dramatic improvement in performance over the Spidergon, especially for broadcast traffic, at no additional hardware cost.
P. Maji, R. Mullins. In the Microsoft PhD Summer School Poster Session, Microsoft Research (MSR), Cambridge, July 2016.
M. Moadeli, P. Maji, and W. Vanderbauwhede. In the IEEE 23rd International Conference on Advanced Information Networking & Applications (AINA), Bradford, UK, May 2009.
M. Moadeli, A. Shahrabi, W. Vanderbauwhede and P. Maji. In proceedings of the Journal of Systems Architecture – Embedded Systems Design (JSA), 2010.
P. Maji, W. Vanderbauwhede, and F. Rodriguez. In iSLI Annual day student poster presentation, Alba Centre, Livingston, Scotland (Awarded Best Poster).
P. Maji. In the IET Scotland Present Around the World competition, 2009, at the Old School, the University of Edinburgh (Awarded Best Presentation).
P. Maji, MSc Thesis, September 2008.
An Idea vs Real Product - A Pragmatic Point of View
"It’s the disease of thinking that a really great idea is 90% of the work. And if you just tell all these other people ‘here’s this great idea,’ then of course they can go off and make it happen. And the problem with that is that there’s just a tremendous amount of craftsmanship in between a great idea and a great product. And as you evolve that great idea, it changes and grows. It never comes out like it starts because you learn a lot more as you get into the subtleties of it. And you also find there are tremendous trade-offs that you have to make. There are just certain things you can’t make electrons do. There are certain things you can’t make plastic do. Or glass do. Or factories do. Or robots do. Designing a product is keeping five thousand things in your brain and fitting them all together in new and different ways to get what you want." (Steve Jobs, 1995)