Partha Maji is currently a Principal Research Scientist at Arm's Machine Learning Research Lab in Cambridge, where he drives core research into the efficient implementation of probabilistic machine learning on resource-constrained devices. Prior to this, he worked as a Staff SoC Design Engineer at Broadcom Ltd., where he led various front-end and back-end chip design and implementation activities. Partha started his career in the semiconductor industry as a micro-architecture design engineer in the Processor Division at Arm, and also spent some time in the software industry before moving into chip design. He has extensive experience with the end-to-end chip design process, from microarchitecture-level design to physical design, through multiple tape-outs of low-power chips in 65/40/28/22 nm deep-submicron CMOS process technologies.

His current research interests span multiple disciplines, bridging machine learning, mobile/embedded systems, computer architecture, and hardware implementation. Partha has received several industry excellence awards, including a Mentor Graphics prize for outstanding achievement in his master's degree. He has also received multiple accolades for his research on on-chip interconnects, including awards from Epson Europe and the IET, UK, and was recognized by the European Neural Network Society for his high-quality contributions to machine learning research. Partha was a recipient of the prestigious UK Chevening scholarship. He received a PhD in Computer Science from the University of Cambridge and an MSc in System-on-Chip Design from the University of Edinburgh.

You can find more up-to-date information about Partha on LinkedIn.

Research Area

Co-design of Machine Learning Architecture and Hardware for Embedded Systems

Over the past few years, deep learning has helped forge advances in areas as diverse as image classification, machine translation, and voice recognition - all research topics that had long been difficult for AI researchers to crack. A subcategory of machine learning, deep learning uses neural networks to learn features automatically rather than relying on handcrafted programs. This has become possible because computers are far faster than they were a decade ago. Most deep learning models are massive and take a long time to train, and researchers currently rely on multiple high-end GPUs to train and run inference on such networks. Given its future potential, it would be very beneficial if we could deploy deep networks (CNNs, RNNs, LSTMs, etc.) on embedded platforms. However, large and complex network models sit poorly with the small silicon footprint and very limited power budget of portable computers, tablets, and smartphones. Partha's research interests lie in understanding the behavior of these complex models, optimizing them, incorporating uncertainty measures, and developing novel low-power compute accelerator architectures that can run them efficiently. His approach relies heavily on cross-layer optimization across the stack - from the model and its implementation down to the underlying hardware architecture. He also aims for solutions that are mathematically guided and rest on solid reasoning.

Ongoing Research Projects

  • Optimisation of Graph Neural Networks for non-Euclidean computer vision applications (e.g. visual SLAM, point-cloud segmentation, 3D object detection)

  • Hardware-friendly uncertainty modelling in deep neural networks using probabilistic ML; robust and explainable ML

  • Low Power Custom Hardware (ASIC) Accelerator Design for Deep Learning

  • Optimisation of deep neural network workload for low-power CPUs


(Note: If you are interested in collaborating with me, I would love to hear from you.)

Publications & Patents

Towards Efficient Point Cloud Graph Neural Networks Through Architectural Simplification

Shyam A. Tailor, René de Jong, Tiago Azevedo, Matthew Mattina, Partha Maji

Accepted at the Deep Learning for Geometric Computing workshop at ICCV 2021

In recent years, graph neural network (GNN)-based approaches have become a popular strategy for processing point cloud data, regularly achieving state-of-the-art performance on a variety of tasks. To date, the research community has primarily focused on improving model expressiveness, with secondary thought given to designing models that can run efficiently on resource-constrained mobile devices such as smartphones or mixed reality headsets. In this work we take a step towards improving the efficiency of these models by observing that they are heavily limited by the representational power of their first, feature-extracting layer. We find that it is possible to radically simplify these models, so long as the feature extraction layer is retained, with minimal degradation to model performance; further, we discover that it is possible to improve performance overall on ModelNet40 and S3DIS by improving the design of the feature extractor. Our approach reduces memory consumption by 20× and latency by up to 9.9× for graph layers in models such as DGCNN; overall, we achieve speed-ups of up to 4.5× and peak memory reductions of 72.5%.
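
The core observation can be pictured with a short sketch. The PyTorch snippet below (written for this page, not the paper's code; the class name, sizes, and the precomputed k-NN indexing are illustrative assumptions) keeps an expressive per-point feature extractor but replaces the per-edge MLPs of a DGCNN-style graph layer with a plain max over neighbours, which is where the memory and latency savings would come from:

    import torch
    import torch.nn as nn

    class SimplifiedPointGNN(nn.Module):
        def __init__(self, in_dim=3, feat_dim=64, num_classes=40):
            super().__init__()
            # Expressive first layer: a per-point MLP feature extractor.
            self.extract = nn.Sequential(
                nn.Linear(in_dim, feat_dim), nn.ReLU(),
                nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            )
            self.classify = nn.Linear(feat_dim, num_classes)

        def forward(self, points, neighbor_idx):
            # points: (N, 3); neighbor_idx: (N, k) precomputed k-NN indices.
            h = self.extract(points)                  # (N, feat_dim)
            # Simplified graph layer: aggregate neighbour features with a
            # plain max - no per-edge MLP.
            h = h[neighbor_idx].max(dim=1).values     # (N, feat_dim)
            return self.classify(h)

    model = SimplifiedPointGNN()
    points = torch.randn(1024, 3)
    neighbor_idx = torch.randint(0, 1024, (1024, 16))
    logits = model(points, neighbor_idx)              # (1024, 40)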

On the Effects of Quantisation on Model Uncertainty in Bayesian Neural Networks

Martin Ferianc, Partha Maji, Matthew Mattina, Miguel Rodrigues

Accepted in the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021)

Bayesian neural networks (BNNs) are making significant progress in many research areas where decision-making needs to be accompanied by uncertainty estimation. Being able to quantify uncertainty while making decisions is essential for understanding when a model is over- or under-confident, and hence BNNs are attracting interest in safety-critical applications such as autonomous driving, healthcare, and robotics. Nevertheless, BNNs have not been widely used in industrial practice, mainly because of their increased memory and compute costs. In this work, we investigate quantisation of BNNs by compressing 32-bit floating-point weights and activations to their integer counterparts, an approach that has already been successful in reducing the compute demand of standard pointwise neural networks. We study three types of quantised BNNs, evaluate them under a wide range of different settings, and empirically demonstrate that a uniform quantisation scheme applied to BNNs does not substantially decrease their quality of uncertainty estimation.
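
For readers unfamiliar with the compression step, the sketch below shows a plain uniform (affine) quantise/dequantise round trip on a weight tensor, of the general kind the paper studies. The bit-width, per-tensor scaling, and tensor shape are illustrative assumptions, not the paper's exact scheme:

    import torch

    def uniform_quantise(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
        # Map x to unsigned integers, then dequantise (a simulation of
        # integer inference; assumes x is not constant).
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = qmin - torch.round(x.min() / scale)
        q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
        return (q - zero_point) * scale

    w = torch.randn(256, 256)
    w_q = uniform_quantise(w, num_bits=8)
    print((w - w_q).abs().max())   # worst-case rounding error, about scale/2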

Stochastic-Shield: A Probabilistic Approach Towards Training-Free Adversarial Defense in Quantized CNNs

Lorena Qendro, Sangwon Ha, René de Jong, Partha Maji

Accepted at a MobiSys 2021 workshop

Quantized neural networks (NNs) are the common standard for efficiently deploying deep learning models on tiny hardware platforms. However, we observe that quantized NNs are as vulnerable to adversarial attacks as full-precision models. With the proliferation of neural networks on the small devices we carry or that surround us, there is a need for efficient models that do not sacrifice trust in their predictions in the presence of malign perturbations. Current mitigation approaches often require adversarial training or are bypassed when the strength of adversarial examples is increased.

In this work, we investigate how a probabilistic framework can help overcome the aforementioned limitations for quantized deep learning models. We explore Stochastic-Shield: a flexible defense mechanism that leverages input filtering and a probabilistic deep learning approach materialized via Monte Carlo Dropout. We show that it is possible to jointly achieve efficiency and robustness by accurately enabling each module, without the burden of re-training or ad hoc fine-tuning.
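
As a rough illustration of the probabilistic half of this recipe, the snippet below shows generic Monte Carlo Dropout inference: dropout is kept active at test time and predictions are averaged over several stochastic forward passes. The network, sample count, and entropy-based uncertainty score are assumptions for illustration, and Stochastic-Shield's input-filtering module is omitted:

    import torch
    import torch.nn as nn

    # A small illustrative classifier with dropout; sizes are arbitrary.
    net = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.3),
        nn.Linear(256, 10),
    )

    def mc_dropout_predict(model, x, num_samples=20):
        model.train()  # keep dropout active at inference time
        with torch.no_grad():
            probs = torch.stack(
                [model(x).softmax(dim=-1) for _ in range(num_samples)]
            )
        mean = probs.mean(dim=0)                    # predictive distribution
        entropy = -(mean * mean.log()).sum(dim=-1)  # uncertainty score
        return mean, entropy

    x = torch.randn(1, 784)
    mean, uncertainty = mc_dropout_predict(net, x)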

Stochastic-YOLO: Efficient Probabilistic Object Detection under Dataset Shifts, Blog

T. Azevedo, R. de Jong, M. Mattina, P. Maji

(NeurIPS 2020 Workshop on ML for Autonomous Driving)

In image classification tasks, evaluating a model's robustness to increased dataset shift within a probabilistic framework is well studied. However, object detection (OD) tasks pose other challenges for uncertainty estimation and evaluation. For example, one needs to evaluate both the quality of the label uncertainty (i.e., what?) and the spatial uncertainty (i.e., where?) for a given bounding box, and this evaluation cannot be performed with more traditional average precision metrics (e.g., mAP). In this paper, we adapt the well-established YOLOv3 architecture to generate uncertainty estimations by introducing stochasticity in the form of Monte Carlo Dropout (MC-Drop), and evaluate it across different levels of dataset shift. We call this novel architecture Stochastic-YOLO, and provide an efficient implementation that effectively reduces the burden of the MC-Drop sampling mechanism at inference time. Finally, we provide some sensitivity analyses and argue that Stochastic-YOLO is a sound approach that improves different components of uncertainty estimation, in particular spatial uncertainties.
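
The spatial ("where?") side of this evaluation can be pictured with a tiny numerical example: given MC-Drop samples of one predicted box, a consensus box and a per-coordinate spread fall out directly. The sample count and box encoding below are assumptions, not the paper's implementation:

    import torch

    # 20 hypothetical MC-Drop samples of one box in (x1, y1, x2, y2) form.
    boxes = torch.randn(20, 4) * 0.05 + torch.tensor([0.4, 0.4, 0.6, 0.6])
    mean_box = boxes.mean(dim=0)      # consensus localisation
    spatial_std = boxes.std(dim=0)    # per-coordinate spatial uncertainty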

DEff-ARTS: Differentiable Efficient Architecture Search

S. Sadiq, P. Maji, J. Hare, and G. Merrett

(NeurIPS 2020 Workshop on ML for Systems)

Manual design of efficient Deep Neural Networks (DNNs) for mobile and edge devices is an involved process requiring expert human knowledge to improve efficiency along different dimensions. In this paper, we present DEff-ARTS, a differentiable efficient architecture search method for automatically deriving CNN architectures for resource-constrained devices. We frame the search as a multi-objective optimisation problem in which we minimise both the classification loss and the computational complexity of performing inference on the target hardware. Our formulation allows for easy trade-offs between the sub-objectives depending on user requirements. Experimental results on CIFAR-10 classification show that our approach achieved a highly competitive test error rate of 3.24% with 30% fewer parameters and multiply-and-accumulate (MAC) operations compared to Differentiable ARchiTecture Search (DARTS).
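
A minimal sketch of such a multi-objective search loss appears below: the expected MAC count under the (softmaxed) architecture weights is added to the classification loss with a user-set trade-off weight. The candidate operations, their MAC costs, and the normalisation are hypothetical, not DEff-ARTS's exact formulation:

    import torch
    import torch.nn.functional as F

    # Hypothetical MAC costs for three candidate ops on each edge of the
    # search cell: identity/skip, 3x3 separable conv, 3x3 full conv.
    OP_MACS = torch.tensor([0.0, 1.0e6, 4.0e6])

    def search_loss(logits, targets, alpha, lam=0.1):
        # alpha: architecture logits over candidate ops, shape (edges, ops).
        ce = F.cross_entropy(logits, targets)
        op_probs = alpha.softmax(dim=-1)
        expected_macs = (op_probs * OP_MACS).sum()   # differentiable cost
        max_macs = OP_MACS.max() * alpha.shape[0]    # normaliser
        return ce + lam * expected_macs / max_macs   # lam sets the trade-off

    logits = torch.randn(8, 10)
    targets = torch.randint(0, 10, (8,))
    alpha = torch.randn(4, 3, requires_grad=True)
    loss = search_loss(logits, targets, alpha)
    loss.backward()   # gradients flow into the architecture parameters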

Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs

P. Maji (Cambridge), A. Mundy (ARM Research), G. Dasika (ARM Research), J. Beu (ARM Research), M. Mattina (ARM Research), and R. Mullins (Cambridge).

(EMC2 Workshop at IEEE HPCA 2019) Paper: (pdf) | Presentation: (pdf)

The Winograd or Cook-Toom class of algorithms helps to reduce the overall compute complexity of many modern deep convolutional neural networks (CNNs). Although there has been a lot of research on model and algorithmic optimization of CNNs, little attention has been paid to the efficient implementation of these algorithms on embedded CPUs, which usually have frugal memory and a low power budget. This work aims to fill that gap, focusing on the efficient implementation of Winograd or Cook-Toom based convolution on modern Arm Cortex-A CPUs, widely used in mobile devices today. Specifically, we demonstrate a reduction in inference latency by using a set of optimization strategies that improve the utilization of computational resources and by effectively leveraging the ARMv8-A NEON SIMD instruction set. We evaluated our proposed region-wise multi-channel implementations on the Arm Cortex-A73 platform using several representative CNNs. The results show significant full-network performance improvements, up to 60%, over existing im2row/im2col based optimization techniques.
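
To make the algorithm class concrete, here is the textbook 1-D Winograd F(2,3) transform these methods build on, computing two outputs of a 3-tap filter with four multiplies instead of six. This is the standard algorithm written in NumPy, not the paper's NEON implementation:

    import numpy as np

    def winograd_f23(d, g):
        # d: 4 input samples, g: 3 filter taps -> 2 correlation outputs.
        BT = np.array([[1, 0, -1, 0],
                       [0, 1, 1, 0],
                       [0, -1, 1, 0],
                       [0, 1, 0, -1]], dtype=float)
        G = np.array([[1.0, 0.0, 0.0],
                      [0.5, 0.5, 0.5],
                      [0.5, -0.5, 0.5],
                      [0.0, 0.0, 1.0]])
        AT = np.array([[1, 1, 1, 0],
                       [0, 1, -1, -1]], dtype=float)
        m = (BT @ d) * (G @ g)   # the 4 elementwise multiplies
        return AT @ m

    d = np.array([1.0, 2.0, 3.0, 4.0])
    g = np.array([1.0, 0.5, -1.0])
    print(winograd_f23(d, g))                 # [-1.  -0.5]
    print(np.correlate(d, g, mode='valid'))   # matches the direct result

The savings compound in 2-D, where the same idea applied along both axes of a 3x3 convolution (F(2x2, 3x3)) replaces 36 multiplies with 16 per output tile.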

On the Reduction of Computational Complexity of Deep Convolutional Neural Networks (Published)

P. Maji, R. Mullins. (Journal Entropy 2018 - Special Issue with Selected Papers)

Deep convolutional neural networks (ConvNets), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, achieving accuracy often implies significant computational costs, limiting deployability. In modern ConvNets it is typical for the convolution layers to consume the vast majority of computational resources during inference, which has made the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speedup of a ConvNet, achieving a ten-fold increase over baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well-grounded, robust, and does not require any time-consuming retraining, while still achieving speedups solely from convolutional layers with no loss in baseline accuracy.

1D-FALCON: Accelerating Deep Convolutional Neural Network Inference by Co-optimization of Models and Underlying Arithmetic Implementation (Published)

P. Maji, R. Mullins. The 26th International Conference on Artificial Neural Networks (ICANN), 2017 (Full, ORAL - Best Paper Candidate)

Deep convolutional neural networks (CNNs), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks at the expense of high computational complexity, limiting their deployability. In modern CNNs, convolutional layers consume roughly 90% of the processing time during a forward inference, and acceleration of these layers is of great research and commercial interest. In this paper, we examine the effects of co-optimizing the internal structures of convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speed-up of a CNN, achieving a tenfold increase over baseline. We also introduce a new class of fast 1-D convolutions for CNNs using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speedups solely from convolutional layers with no loss in baseline accuracy.

ADaPT: Optimizing CNN inference on IoT and mobile devices using approximately separable 1-D kernels (Published)

P. Maji, D. Bates, A. Chadwick, and R. Mullins. In the ACM International Conference on Internet of Things and Machine Learning (ACM - IML), 2017 (Full, ORAL)

Breakthroughs from the field of deep learning are radically changing how sensor data are interpreted to extract important information that helps advance healthcare, make our cities smarter, and drive innovation in smart home technology. Deep convolutional neural networks, which are at the heart of many emerging Internet-of-Things (IoT) applications, achieve remarkable performance in audio and visual recognition tasks at the expense of high computational complexity in convolutional layers, limiting their deployability. In this paper, we present an easy-to-implement acceleration scheme, named ADaPT, which can be applied to already available pre-trained networks. Our proposed technique exploits the redundancy present in convolutional layers to reduce computation and storage requirements. Additionally, we decompose each convolution layer into two consecutive one-dimensional stages to make full use of the approximate model. This technique can easily be applied to existing low-power processors, GPUs, or new accelerators. We evaluated the technique using four diverse and widely used benchmarks, on hardware ranging from embedded CPUs to server GPUs. Our experiments show an average 3-5x speed-up across all deep models and a maximum 8-9x speed-up on many individual convolutional layers. We demonstrate that, unlike iterative pruning based methodologies, our approximation technique is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.
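
The separable-decomposition idea can be sketched with a rank-1 SVD approximation of a 2-D kernel, so one KxK convolution becomes a Kx1 pass followed by a 1xK pass. This is a simplified stand-in for ADaPT's scheme, with kernel and image sizes chosen arbitrarily:

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(0)
    kernel = rng.standard_normal((5, 5))    # a dense 5x5 kernel

    # Rank-1 factorisation: kernel ~ col @ row (a 5x1 then a 1x5 pass).
    U, s, Vt = np.linalg.svd(kernel)
    col = U[:, :1] * np.sqrt(s[0])
    row = Vt[:1, :] * np.sqrt(s[0])

    image = rng.standard_normal((32, 32))
    full = convolve2d(image, kernel, mode='valid')
    separable = convolve2d(convolve2d(image, col, mode='valid'),
                           row, mode='valid')
    print(np.abs(full - separable).max())   # residual of the rank-1 approx

For a KxK kernel this cuts the per-output multiply count from K*K to 2K (10 vs. 25 here); trained convolutional kernels are often close enough to low rank for the residual to be small.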

Apparatus and method for providing a bidirectional communications link between a master device and a slave device

P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8924612B2, Granted on 2014-12-30.

A bidirectional communications link between a master device and a slave device includes first endpoint circuitry coupled to the master device generating forward data packets, second endpoint circuitry coupled to the slave device for receiving reverse data packets, and bidirectional communication circuitry for transferring forward data packets from the first endpoint circuitry to the second endpoint circuitry and reverse data packets from the second endpoint circuitry to the first endpoint circuitry. In response to a power down condition requiring a power down of at least one of the first endpoint circuitry and the second endpoint circuitry, performance of the power down is deferred until both the outstanding forward credit signal and the outstanding reverse credit signal have been deasserted.
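
A behavioural sketch of the deferral logic (a software model written for this page, not the patented circuit; all names are illustrative) might look like this: a pending power-down request is held until both outstanding-credit flags have been cleared:

    # Software model only; the patent describes hardware circuitry.
    class LinkPowerControl:
        def __init__(self):
            self.forward_credit_outstanding = True
            self.reverse_credit_outstanding = True
            self.power_down_pending = False
            self.powered = True

        def request_power_down(self):
            self.power_down_pending = True
            self._maybe_power_down()

        def credit_deasserted(self, forward=False, reverse=False):
            if forward:
                self.forward_credit_outstanding = False
            if reverse:
                self.reverse_credit_outstanding = False
            self._maybe_power_down()

        def _maybe_power_down(self):
            # Defer until both outstanding-credit signals are deasserted.
            if self.power_down_pending and not (
                self.forward_credit_outstanding
                or self.reverse_credit_outstanding
            ):
                self.powered = False

    link = LinkPowerControl()
    link.request_power_down()            # deferred: credits outstanding
    link.credit_deasserted(forward=True)
    link.credit_deasserted(reverse=True)
    assert not link.powered              # power-down now performed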


Enabling Deep Learning on Embedded Systems

Deep Convolutional Networks (ConvNets) have demonstrated state-of-the-art performance in many machine learning problems involving image classification and speech recognition. Over the last few years, several advances in the design of ConvNets have not only led to a further boost in achieved accuracy on image recognition tasks but have also played a crucial role in feature generation for other machine learning tasks such as object detection, localization, semantic segmentation, and image retrieval. However, the complexity and size of ConvNets have limited their use in mobile applications and embedded systems. The aim of my research is to find ways to optimize these deep neural networks using model-architecture co-design and to enable the mass deployment of deep-learning based applications in consumer products.

P. Maji, R. Mullins. In the ARM-Cambridge Research Showcase, Poster Session, Cambridge Big Data, Maxwell Centre, University of Cambridge, Dec 2016.

Data packet flow control across an asynchronous clock domain boundary

P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8630358B2, Granted on 2014-01-14.

A system-on-chip integrated circuit includes a packet transmitter for generating data packets to be sent via a communication circuit to a packet receiver containing a buffer circuit. A transmitter counter stores a transmitter count value counting data packets sent. A receiver counter stores a receiver count value tracking data packets emptied from the buffer circuit. Comparison circuitry is used to compare the transmitter count value with the receiver count value to determine whether or not there is storage space available within the buffer circuit to receive further data packets. The packet transmitter operates in a transmitter clock domain that is asynchronous from the receiver clock domain in which the packet receiver operates. One of the count values is passed across this asynchronous clock boundary so that the comparison may be performed and flow control exercised.
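
In software terms the flow-control decision reduces to a counter comparison, as in the hedged model below; a real implementation would also have to synchronise one count across the clock boundary (e.g., with Gray coding), which this sketch omits. The buffer capacity and names are assumptions:

    # Software model only; names and capacity are illustrative.
    BUFFER_SLOTS = 8

    def space_available(tx_count: int, rx_count: int,
                        capacity: int = BUFFER_SLOTS) -> bool:
        # Packets sent but not yet drained = tx_count - rx_count; transmit
        # only while the receiver buffer cannot overflow.
        return (tx_count - rx_count) < capacity

    assert space_available(10, 3)        # 7 in flight, one slot free
    assert not space_available(11, 3)    # buffer full, hold the packet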

My Hobby Set Top Box Project

Inspired by industrial-scale streaming set top boxes, I built a very elementary networked set top box in my spare time using just discrete chips. It uses 8 symmetric 32-bit processors (cores) with 32KB of RAM running at 80MHz. The complete STB solution works with any NTSC or PAL TV that has a composite (RCA) input and with any network router that supports DHCP; it has no Wi-Fi support. I also built a few basic widgets to go with the hardware. The final PCB is, in size, somewhere between a Costa club card and a Raspberry Pi.

Design and Implementation of the Quarc Network-on-Chip

P. Maji, M. Moadeli, and W. Vanderbauwhede. In the IEEE 23rd International Parallel & Distributed Processing Symposium (IPDPS), Rome, Italy, May 2009.

Networks-on-Chip (NoCs) have emerged as an alternative to buses, providing a packet-switched communication medium for the modular development of large Systems-on-Chip. However, to successfully replace its predecessor, the NoC has to be able to efficiently exchange all types of traffic, including collective communications; the latter are especially important, e.g., for cache updates in multicore systems. The Quarc NoC architecture [9] has been introduced as a NoC that is highly efficient at exchanging all types of traffic, including broadcast and multicast. In this paper we present the hardware implementation of the switch architecture and the network adapter (transceiver) of the Quarc NoC. Moreover, the paper presents an analysis and comparison of cost and performance between the Quarc and Spidergon NoCs, implemented in Verilog targeting the Xilinx Virtex FPGA family. We demonstrate a dramatic improvement in performance over the Spidergon, especially for broadcast traffic, at no additional hardware cost.

Energy efficient Convolutional Neural Network for Embedded Systems

P. Maji, R. Mullins. In the Microsoft PhD Summer School Poster Session, Microsoft Research (MSR), Cambridge, July 2016.

Quarc: a High-Efficiency Network on-Chip Architecture

M. Moadeli, P. Maji, and W. Vanderbauwhede. In the IEEE 23rd International Conference on Advanced Information Networking & Applications (AINA), Bradford, UK, May 2009.

An analytical performance model for the Spidergon NoC with virtual channels

M. Moadeli, A. Shahrabi, W. Vanderbauwhede, and P. Maji. In the Journal of Systems Architecture – Embedded Systems Design (JSA), 2010.

Design and FPGA Implementation of a packet switch for the Quarc Network-on-Chip

P. Maji, W. Vanderbauwhede, and F. Rodriguez. In the iSLI Annual Day student poster presentation, Alba Centre, Livingston, Scotland (Awarded Best Poster).

A novel NoC in Multi-processor paradigm

P. Maji. In the IET Scotland Present Around the World competition, 2009, at the Old School, the University of Edinburgh (Awarded Best Presentation).

Design and FPGA Implementation of a packet switch for the Quarc Network-on-Chip (Available on request)

P. Maji, MSc Thesis, September 2008.

An Idea vs Real Product - A Pragmatic Point of View

“It’s the disease of thinking that a really great idea is 90% of the work. And if you just tell all these other people “here’s this great idea,” then of course they can go off and make it happen. And the problem with that is that there’s just a tremendous amount of craftsmanship in between a great idea and a great product. And as you evolve that great idea, it changes and grows. It never comes out like it starts because you learn a lot more as you get into the subtleties of it. And you also find there are tremendous trade-offs that you have to make. There are just certain things you can’t make electrons do. There are certain things you can’t make plastic do. Or glass do. Or factories do. Or robots do. Designing a product is keeping five thousand things in your brain and fitting them all together in new and different ways to get what you want.” (Steve Jobs, 1995)