Anirbit
I am a Lecturer (Assistant Professor) in Computer Science at The University of Manchester. I am also a member of the ELLIS Society and The Centre for A.I. Fundamentals. Here is my official profile.
anirbit.mukherjee@manchester.ac.uk
I am deeply intrigued by how deep-learning and neural nets seem to land us at exciting new questions about differential equations and functional analysis. I aspire to unravel these emerging questions in mathematics.
YouTube Playlist of Talks on Our Works
My Ph.D. Thesis : A Study of the Mathematics of Deep Learning
If you have a strong background in mathematics, statistics, theoretical physics or E.C.E., then feel free to email me about Ph.D. positions at our department. Potential candidates starting in September 2025 could go up for the "Dean's Doctoral Scholarship", the "President's Doctoral Scholarship",
or our "CDT Fellowship".
Look at the "Apply" tab above for more details!
The Journey So Far
Till summer 2021 I was a post-doc in Statistics at Wharton with Weijie Su. I did my Ph.D. in applied mathematics with Amitabh Basu at the Department of Applied Mathematics and Statistics, Johns Hopkins University. My doctoral committee comprised Mauro Maggioni (Bloomberg Distinguished Professor), Jeremias Sulam, Trac Tran, Laurent Younes and Jason Eisner. (During my Ph.D., the professor whose courses I attended the most was Prof. Mauro Maggioni!)
Between 2016 and 2018 I collaborated on papers with Trac D. Tran, Raman Arora and Dan Roy. From winter 2020 to early 2022, I executed multiple projects on deep-learning theory with Sayar Karmakar. Most recently I have also collaborated with Theodore Papamarkou.
During my Ph.D. I had multiple fruitful collaborations with other grad students, like Akshay Rangamani (now post-doc at MIT), Soham De (now at DeepMind, London), Poorya Mianjy and Enayat Ullah (current grad student at J.H.U.). I have exciting ongoing collaborations with Soham Dan (grad student at UPenn) and Pulkit Gopalani (an undergrad at IIT-Kanpur).
The direction of my mathematical interests got moulded during my undergrad when I read the books by Jurgen Jost, Gregory Naber and S. Kumaresan - these three books almost permanently determined my career directions! This critical reading during my undergrad was largely under the mentorship of Prof. S. Ramanan, and he had a fundamental influence on the nature of my science. Much later I was highly influenced by reading the books by Sanjeev Arora and Boaz Barak, Roman Vershynin, Philippe Rigollet, Phillip A. Griffiths, Cheng and Li, and Yvonne Choquet-Bruhat.
My original training is in Quantum Field Theory (while at the Chennai Mathematical Institute and the Tata Institute of Fundamental Research). And for years before that I trained as an artist - these background tiles are from oil and pastel paintings of mine from almost a decade and a half ago!
Our UKRI AI Center for Doctoral Training (CDT) is Now Live!
- its theme is "Decision-Making in Complex Systems"
- and these projects will be running for years to come!
This is a joint venture with the University of Cambridge.
I am a co-PI on this grant and the Director of Admissions.
NEWS
16th September 2024 :
Sebastien Andre-Sloan joins our group as the second PhD student.
He did a pretty large third-year undergraduate project with us before taking this leap :)
1st July 2024 :
Our amazing co-author Dibyakanti Kumar joins our group as a PhD student.
He will be co-advised by Prof. Alex Frangi and me.
Dibyakanti becomes the first student to begin their doctoral studies in our group!
11th June 2024 :
[IOP-MLST!] Investigating the Ability of PINNs to Solve Burgers' PDE Near Finite-Time Blowup
What is the relationship between an ML model's error in approximating the PDE solution and the risk of its PINN loss function? Recall that it's only the latter that the code tries to minimize. This is in general quite unclear - and in this latest work with my amazing collaborator Dibyakanti Kumar, we try to prove relationships between these two critical quantities for inviscid, pressureless fluids (Burgers' PDE in d dimensions), while allowing for divergence of the flow.
This is an interesting edge case because it allows for PDE solutions that blow up in finite time while starting from smooth initial conditions. Our way of analyzing this population risk leads to indications of why penalizing the gradients of the surrogate (the net) has previously helped in such experiments.
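For orientation, here is the generic shape of the objects involved - the d-dimensional inviscid, pressureless Burgers' flow and a typical PINN-style empirical risk built from its residuals at sampled collocation points. This is only a schematic of the standard PINN setup; the exact surrogate risk and the error-vs-risk inequalities we prove are more refined than this.

```latex
% d-dimensional inviscid, pressureless Burgers' PDE for the velocity field u(t,x)
\partial_t u + (u \cdot \nabla) u = 0, \qquad u(0,\cdot) = u_0

% A typical PINN empirical risk for a neural surrogate u_\theta :
% interior collocation points (t_i, x_i) and initial-time points x_j
\widehat{\mathcal{R}}(\theta) =
\frac{1}{N}\sum_{i=1}^{N}
  \big\| \partial_t u_\theta(t_i,x_i) + \big(u_\theta \cdot \nabla\big) u_\theta(t_i,x_i) \big\|^2
\;+\; \frac{\lambda}{M}\sum_{j=1}^{M}
  \big\| u_\theta(0,x_j) - u_0(x_j) \big\|^2
```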
27th May 2024 :
Got the UniCS undergraduate project advising award!
25th February 2024 :
[TMLR!] Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
Here we give a first-of-its-kind proof of SGD convergence on finitely large neural nets - for logistic loss in the binary classification setting. This continues our investigation showing that neural loss functions can be "Villani functions", and that uncovering this almost magical mathematical property of neural loss functions can help prove convergence of gradient-based algorithms to their global minima. This is work with Pulkit Gopalani (PhD student at UMichigan) and Samyak Jha (undergrad at IIT-Bombay).
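For concreteness, the object of study looks like the following minimal sketch (an illustrative toy, not our exact parameterization) - a finitely wide two-layer net with a smooth gate, trained by vanilla mini-batch SGD on the logistic loss for +/-1 labels. The width, step-size and regularity assumptions under which the theorem actually holds are spelled out in the paper.

```python
# Minimal sketch (not the paper's exact setup): SGD on the logistic loss of a
# finitely wide two-layer net for binary classification with labels y in {-1, +1}.
import torch

d, width, n, lr = 10, 64, 512, 1e-2
X = torch.randn(n, d)
y = torch.sign(torch.randn(n))                  # placeholder +/-1 labels

W1 = torch.randn(width, d, requires_grad=True)  # first-layer weights
a = torch.randn(width, requires_grad=True)      # outer-layer weights

def net(x):
    # two-layer net with a smooth gate, as a stand-in for the class of gates in the paper
    return torch.nn.functional.softplus(x @ W1.T) @ a / width

for step in range(1000):
    i = torch.randint(0, n, (32,))              # a random mini-batch
    margin = y[i] * net(X[i])
    loss = torch.nn.functional.softplus(-margin).mean()  # logistic loss log(1 + e^{-y f(x)})
    loss.backward()
    with torch.no_grad():
        for p in (W1, a):
            p -= lr * p.grad                    # plain SGD step
            p.grad = None
```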
1st February 2024 :
[TMLR!] Size Lowerbounds on DeepONets
This work is with Amartya Roy @ Bosch, India. As far as we know, this is among the rare few proofs of any kind of architectural constraint for training performance that have ever been derived for any neural architecture. And it is almost entirely data independent -- hence a "universal lower bound"; in particular, the lower bounds we derive do not depend on the Partial Differential Equation that the DeepONet is being targeted to solve. This analysis also leads to certain "scaling law" conjectures for DeepONets, as we touch upon in the short experimental section.
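For readers unfamiliar with the architecture: a DeepONet approximates an operator G by pairing a "branch" net, which sees the input function sampled at m sensor locations, with a "trunk" net, which sees the query point y, and taking an inner product of their outputs. Here is a minimal illustrative sketch (the hyperparameters are arbitrary and not from our paper); our lower bounds constrain how small such architectures can be while still training well.

```python
# A minimal, illustrative DeepONet: G(u)(y) ~ <branch(u(x_1..x_m)), trunk(y)>
# (hyperparameters here are arbitrary; they are not the ones from the paper)
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, m_sensors=100, y_dim=1, p=32, hidden=64):
        super().__init__()
        # branch net: sees the input function u sampled at m sensor locations
        self.branch = nn.Sequential(
            nn.Linear(m_sensors, hidden), nn.ReLU(), nn.Linear(hidden, p))
        # trunk net: sees the query location y at which G(u) is evaluated
        self.trunk = nn.Sequential(
            nn.Linear(y_dim, hidden), nn.ReLU(), nn.Linear(hidden, p))

    def forward(self, u_samples, y):
        # inner product of the two p-dimensional feature vectors
        return (self.branch(u_samples) * self.trunk(y)).sum(dim=-1)

model = DeepONet()
u = torch.randn(8, 100)   # a batch of 8 input functions sampled on 100 sensors
y = torch.rand(8, 1)      # one query point per function
print(model(u, y).shape)  # torch.Size([8])
```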
2022-2023 News
27th October 2023 :
Speaking at Prof. Luca Magri's group at Imperial on our work, https://arxiv.org/abs/2310.05169
We have put out a bunch of new results over the last quarter of 2023 :)
1) https://arxiv.org/abs/2310.05169,
Here we give a first-of-its-kind mathematical framework to understand how well neural nets can solve PDEs that blow up in finite time. This is work with Dibyakanti.
2) https://arxiv.org/abs/2310.04856
LIPEx is a wholly new way to do explainable AI in multi-class classification settings.
It "explains the distribution" and beats many kinds of XAI methods in every test we have done across both text and image.
This is work with Angelo Cangelosi, our co-advisee Hongbo Zhu, and Procheta Sen.
3) https://arxiv.org/abs/2309.09258,
Here we give a first-of-its-kind proof of SGD convergence on finitely large neural nets, for logistic loss in the binary classification setting.
This continues our investigation showing that neural loss functions can be of the Villani type, which helps prove convergence of SGD.
This is work with Pulkit Gopalani (PhD student at UMichigan) and Samyak Jha (undergrad at IIT-Bombay)
Visiting ETH, Zurich, Department of Applied Mathematics till 21st September, 2023.
The visit is being jointly funded by a Royal Society Grant and ETH, Zurich.
Intern Application Deadline of 8th May 2023 for Students in Israel and Germany to Visit and Work With Me.
https://www.helmholtz-hida.de/en/events/trilateral-data-science-exchange-program/
https://www.helmholtz-hida.de/en/new-horizons/israel-exchange-program/israel-exchange-projects/
Got a grant from The Royal Society as a P.I. - towards building international collaborations for our projects.
24th February 2023 :
After years of incubation, my ``NeuroTron'' algorithm (from my PhD days in late 2020) finds a home!
See our recent paper at the Neurocomputing journal, https://doi.org/10.1016/j.neucom.2023.02.034
This is a first-of-its-kind provable training guarantee for any multi-gate net trying to do regression under a data-poisoning attack.
24th November 2022 :
I got selected as a member of ELLIS. Now I have a photo on this page! :D
A lot of thanks to Francis Bach, Arno Solin, Sayan Mukherjee and Samuel Kaski for their help!
23rd September 2022 :
Two of our extended abstracts have been accepted to DeepMath2022.
- The more interesting of the two summarizes this paper,
The other one is based on our paper arXiv:2205.11359 - on proving generalization bounds for DeepONets.
29th March 2022 :
Our work on learning a ReLU gate with and without a data-poisoning attack got accepted to the journal Neural Networks.
https://doi.org/10.1016/j.neunet.2022.03.040
This is the first linear-time convergence result for any stochastic algorithm on a ReLU gate without distributional assumptions.
2021 News Pre-UManchester
18th June 2021 : talk at the CRUNCH group at Brown University
21st May 2021 : talk at IISc. Bangalore, Department of Mathematics
27th April 2021 : talk at TIFR, Bangalore, Center For Applicable Mathematics
21st April 2021 : talk at IMSc., Chennai
31st March 2021 : talk at ETH, Zurich, Maths
19th March 2021 : speaking at the "LIONS Lunch Seminar" at the ECE, Arizona State University
A Path Into Deep-Learning Theory
In here we assume that the reader has a working familiarity with a sufficient amount of real analysis as in, say, this book by John Hunter and Bruno Nachtergaele, and some familiarity with basic learning theory as in, say, these notes by Lorenzo Rosasco. I keep updating this file listing the best (and freely available) expository references that I have found on topics of my interest.
But apart from the above, here is a summarized list of what I think are the most immediate references that a seriously intentioned person can get started with - the reader is encouraged to choose their most comfortable combination of resources from each of these 1+4 groups below.
Note: Though these very beautiful lecture notes linked above are almost entirely self-contained, it's still likely that they might be hard to follow unless one has some familiarity with the kind of content covered in one of the ``Mathematics of Machine Learning" courses listed below.
High-Dimensional Statistics/Geometrical Functional Analysis (Introduction)
Various ``Mathematics of Machine Learning" courses :
by Afonso Bandeira, by Philippe Rigollet, by Yuxin Chen
High-Dimensional Statistics/Geometrical Functional Analysis (Details)
Lecture notes by Bodhisattva Sen on Non-Parametric Statistics,
Lecture notes by Larry Wasserman ("Intermediate Statistics")
Learning Theory
Lecture notes by Francis Bach and his book
Personally I am hugely indebted to the lecture notes of Sham Kakade and Ambuj Tewari for getting me started - I have hardly seen such a beautiful cruise directly into the core concepts - and fast! Roi Livni's notes are a very beautiful path through the subject, whose initial parts cover a lot of material that is not covered in the other sources mentioned above. Francis Bach's lectures, towards the end, cover some very modern topics which aren't covered in the rest of the references given here.
Continuous Optimization Theory
3. Lecture notes by Geoff Gordon and Ryan Tibshirani
4. Lecture notes by Robert Freund on constrained non-linear optimization
5. Lecture notes by Francis Bach
6. Lecture notes by Yuxin Chen
Sébastien Bubeck's lectures above are possibly the best first tour of the subject that I have seen yet!
---------------------------------------------------------------------------------------------------------------------
Niche Techniques
a.
One of the most succinct introductions to PDEs is this pair of courses at Stanford, Math 220A and Math 220B,
and also see 18.152 and 18.303 at MIT.
(These lectures by Evy Kersale seem to be a more beginner-friendly introduction to P.D.E.s)
Towards what's often needed in research see these P.D.E notes,
by Gerald Teschl, by John Hunter, by Gustav Holzegel, by Lenya Ryzhik (general), by Lenya Ryzhik (fluids)
b.
For O.D.E.s see these comprehensive lectures by Christopher P. Grant
(For a more beginner friendly approach see the lectures by Simon J Malham)
c.
A specialized book on measure concentration by Maxim Raginsky and Igal Sason
d.
Lecture notes by Philip Clement on gradient flows (based on the book by Luigi Ambrosio, Nicola Gigli, Giuseppe Savaré)
e.
Lecture notes by Bodhisattva Sen on empirical processes
f.
Lecture notes by Bruce Hajek on random processes
g.
h.
Lecture notes by John Thickstun on generative models
i.
Lecture notes by Zico Kolter, David Duvenaud, and Matt Johnson on deep implicit layers.
Why Deep-Learning?
"Deep Learning"/"Deep Neural Nets" are a technological marvel and they are being increasingly deployed at the cutting-edge of artificial intelligence tasks. This ongoing revolution can be said to have been ignited by the iconic 2012 paper from the University of Toronto titled ``ImageNet Classification with Deep Convolutional Neural Networks'' by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. This showed that deep nets can be used to classify images into meaningful categories with almost human-like accuracies! As of 2019 this approach continues to produce unprecedented performance for an ever widening variety of novel purposes ranging from playing chess to self-driving cars to experimental astrophysics and high-energy physics.
As theoreticians we might as well look beyond immediate practical implications and also focus on certain mind-boggling experiments like this one running live on the browser - which for all we know might also have concrete uses in the long run! In here every time we refresh the page we are shown a seemingly human photograph (which sometimes might have minor defects), except that this photograph is completely artificially generated by a neural network! :-o The picture one sees on every refresh of the page is essentially a sample from a distribution generated by pushing forward the standard normal distribution via a certain neural function. In a sense this person is purely the net's imagination and he/she does not actually exist! So how did the net learn to ``draw" such realistic human faces? This mechanism is still highly ill-understood and the best efforts yet by the community (see section 3.4 here of Sanjeev Arora's talk at the International Congress of Mathematicians) hint towards needing to go into far deeper waters than ever before - possibly into entirely uncharted territories in high-dimensional probability.
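For the curious, the sampling mechanism behind such a demo really is just this push-forward: draw a Gaussian latent vector and pass it through a trained generator net. A schematic sketch follows (with a tiny untrained stand-in generator; the actual face-generating nets are, of course, vastly larger and trained on huge image datasets).

```python
# Schematic of sampling from a push-forward of the standard normal:
# every "face" is just generator(z) for a fresh z ~ N(0, I).
# The generator below is an untrained stand-in; real face generators are vastly bigger.
import torch
import torch.nn as nn

latent_dim, image_pixels = 128, 64 * 64

generator = nn.Sequential(
    nn.Linear(latent_dim, 512), nn.ReLU(),
    nn.Linear(512, image_pixels), nn.Tanh())

z = torch.randn(1, latent_dim)   # one sample from the standard normal
fake_image = generator(z)        # its push-forward through the net
print(fake_image.shape)          # torch.Size([1, 4096])
```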
This plethora of astonishing new successes of deep neural nets in the last few years has turned out to be extremely challenging to explain with mathematical rigour. The study of neural networks is a very rapidly evolving field and there is an urgent need for mathematically coherent and yet accessible introductions to it. Thus motivated, I wrote this exposition on neural nets aimed at high-school students and beginning undergrads. My essay was picked up by this international consortium of scientists who have created this translation of my article into Bangla, my mother tongue.
A Summary Of Our Research In Deep-Learning
In our works we have taken several steps towards building strong theoretical foundations for deep learning.
Our proofs so far can be broadly grouped into the following 5 categories.
1. Understanding Neural Function Spaces
In our ICLR 2018 paper ``Understanding Deep Neural Networks with Rectified Linear Units" (with Amitabh Basu, Raman Arora and Poorya Mianjy, all at JHU) we observed that the gauge functions of a rare kind of polytopes called ``zonotopes" are exactly representable by ReLU nets. We used this to construct neural functions with some of the highest known numbers of affine pieces in certain architectural regimes. We have also given constructive proofs of the existence of a continuum of neural functions which are super-exponentially (in the smaller depth) harder in size to represent at quadratically smaller depths. We also show characterizations such as every continuous piecewise linear function from R^n -> R being representable by an O(log(n)) depth ReLU net. We leverage our architecture-specific descriptions of the corresponding neural function spaces to get complexity-theoretic insights into exact empirical risk minimization over them.
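Two elementary ReLU identities convey the flavour of such constructions, and below them is my informal restatement of the representability claim above (see the paper for the precise statement and constants).

```latex
% Building blocks used throughout such constructions:
\vert x \vert = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x),
\qquad
\max(a, b) = a + \mathrm{ReLU}(b - a)

% Flavour of the representability statement (see the paper for the precise form):
% every continuous piecewise linear f : \mathbb{R}^n \to \mathbb{R} is computed
% exactly by a ReLU net of depth at most \lceil \log_2(n+1) \rceil + 1.
```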
In a subsequent paper (with Amitabh Basu) we have built upon recent developments in the theory of the sign-rank of Boolean functions and on random restriction techniques for LTF gates to show various depth-hierarchy theorems for ReLU nets over Boolean functions.
2. Solving Differential Equations Via Neural Nets
For the foreseeable future this is going to be a predominant part of my research program. One can see here a preliminary experimental study in this theme that my intern Pulkit, from IIT-Kanpur, published at a NeurIPS workshop. After this, Pulkit, Sayar Karmakar (at UFlorida) and I have done a lot more theoretical work on this theme and details shall be coming in here over the next few months. Keep watching this space :)
3. Investigating the Properties of Neural Network Training Dynamics
In this paper with Soham Dan (UPenn) and Phanideep Gampa we have tried to redefine and make mathematically precise an idea originally from Weijie Su and Hangfeng He about what they called the local elasticity of neural nets. One can read a summary of this work in these slides that I presented at a couple of recent talks, like at Google Mountain View and CMStatistics 2020.
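Roughly speaking - and this is my loose paraphrase rather than the exact definition used by He-Su or in our paper - local elasticity asks how much a single SGD step taken on a sample x perturbs the net's prediction at a different point x', relative to how much it perturbs the prediction at x itself:

```latex
% A loose paraphrase of the quantity of interest (the precise normalization
% used by He-Su and in our paper may differ):
S(x, x') \;=\;
\frac{\big| f_{w^{+}}(x') - f_{w}(x') \big|}
     {\big| f_{w^{+}}(x)  - f_{w}(x)  \big|},
\qquad
w^{+} \;=\; w - \eta \, \nabla_w \, \ell\big(f_w(x), y\big)
```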
Most recently with Sayar Karmakar we have done detailed experimental investigations into the emergence of heavy-tailed behaviour in the stochastic learning algorithms for even the simplest of neural nets. This has led to a number of surprising conjectures and we shall soon be making our experiments public.
(the current draft of the above upcoming paper is available on request)
4. Proving Neural Training Algorithms
In this work (link), a part of which was presented at DeepMath 2020 (in collaboration with Jiayao Zhang), we showed how one could choose distributions for the stochastic gradient so as to get RMSProp to converge at sub-linear rates for smooth non-convex functions, while not using any time-varying hyper-parameters and without modifying the standard pseudocode that is implemented in software like TensorFlow.
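For reference, the unmodified RMSProp step that this result is about looks like the following sketch (the hyperparameter values are just common illustrative defaults, not the choices from our analysis):

```python
# The standard (unmodified) RMSProp update referred to above;
# lr, beta, eps values here are illustrative defaults, not the paper's choices.
import numpy as np

def rmsprop_step(x, grad, v, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp step on iterate x with stochastic gradient `grad`."""
    v = beta * v + (1.0 - beta) * grad ** 2   # running second-moment estimate
    x = x - lr * grad / (np.sqrt(v) + eps)    # per-coordinate scaled step
    return x, v

# usage on a toy quadratic f(x) = ||x||^2 / 2, whose gradient is x itself
x, v = np.ones(5), np.zeros(5)
for _ in range(100):
    x, v = rmsprop_step(x, grad=x, v=v)
```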
We explore iterative non-gradient algorithms for neural training in this pair of recently released papers,
Part 1 : https://arxiv.org/abs/2005.01699
(with Sayar Karmakar and Ramachandran Muthukumar)
(A deterministic version of it was presented at DeepMath 2020)
Part 2 : https://arxiv.org/abs/2005.04211 (see here for the version that appeared in the Neural Networks journal)
(with Sayar Karmakar)
The question of provable training of neural nets is mathematically extremely challenging and vastly open. During 2020-2021 we investigated certain special cases at depth 2 and have given provable guarantees in regimes hitherto unexplored. Our results probe the particularly challenging trifecta of having finitely large nets, while not tying the data to any specific distribution and while having an adversarial attack.
We give 2 kinds of results,
We give a simple stochastic algorithm that can train a ReLU gate in the realizable setting in linear time with significantly milder conditions on the data distribution than in previous results. Leveraging some additional distributional assumptions we also show near-optimal guarantees for training a ReLU gate when an adversary is allowed to corrupt the true labels, and we show that our guarantees degrade gracefully with the magnitude and the probability of the attack. Further, all the guarantees that we give above hold while simultaneously keeping track of the effect of mini-batching on the algorithm - and we give experimental evidence as to how it seems to be tracking the S.G.D. dynamics, which isn't yet provable!
We exhibit a non-gradient stochastic iterative algorithm "Neuro-Tron" which shows how far the above ideas can be pushed to multi-gate scenarios. To the best of my knowledge this is probably the only example where for some kind of an adversarial setting - here for a label poisoning attack - a proof has been written for a finitely large neural net.
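Schematically - and only schematically, since the exact update, the choice of the matrix and the handling of the poisoned labels are as specified in the paper - a "Tron"-style iteration replaces the gradient with a simple error-weighted step:

```latex
% Schematic of a Tron-style (non-gradient) update on samples (x_t, y_t),
% where f_{w}(x) is the net's output and M is a fixed matrix chosen in the analysis:
w_{t+1} \;=\; w_t \;+\; \eta \,\big( y_t - f_{w_t}(x_t) \big)\, M\, x_t
```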
In collaboration with Soham De (DeepMind, UK) and Enayat Ullah (JHU), we have shown the first proofs of convergence, under any conditions, of adaptive gradient algorithms like RMSProp and ADAM. Among other things this gives (a) the first moment-based control on the stochastic gradient oracle which can get RMSProp to converge on smooth non-convex objectives at the same speed as SGD on convex objectives and (b) a proof of convergence for deterministic ADAM which naturally reveals a choice of step-size that has a strong resemblance to the experimental heuristic called the ``bias correction term"! Thus we are led closer to explaining the mysteries of the most fundamental deep-learning algorithms.
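For context, the ``bias correction term" referred to above is the 1/(1-\beta^t) rescaling in the standard ADAM update, whose generic form is:

```latex
% Standard ADAM update with hyperparameters \beta_1, \beta_2, \alpha, \epsilon;
% the hatted quantities are the ``bias-corrected" moment estimates.
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}

\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{t}}, \qquad
w_{t+1} = w_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```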
We have also demonstrated extensive experiments on autoencoders to identify a scheme of tuning of the ADAM's parameters (beyond the conventional ranges) which help it supersede other alternatives in neural training performance.
The latest copy of this paper can be seen here. A preliminary version (arXiv:1807.06766), ``Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration", appeared in the ICML 2018 Workshop, Modern Trends in Nonconvex Optimization for Machine Learning.
In our ISIT 2018 paper ``Sparse Coding and Autoencoders", arXiv:1708.03735, (with Akshay Rangamani, Amitabh Basu, Tejaswini Ganapathy (Salesforce, San Francisco), Ashish Arora, Trac D. Tran and Sang (Peter) Chin (BU)) we have shown that the landscape of autoencoders under the standard sparse-coding generative model is asymptotically (in the sparse-code dimension) critical in a neighbourhood of the true dictionary.
To the best of our knowledge this is among the very few proofs in the literature about unsupervised learning using neural nets.
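A schematic of the setting, in my shorthand (see the paper for the exact architecture, thresholds and generative assumptions): the data are sparse combinations of the columns of an unknown dictionary, and the object under study is a weight-tied ReLU autoencoder and its squared reconstruction loss.

```latex
% Sparse-coding generative model: an unknown dictionary A^* and sparse codes x^*
y \;=\; A^{*} x^{*}, \qquad x^{*} \ \text{sparse (and suitably distributed)}

% A weight-tied ReLU autoencoder (schematic) and its reconstruction loss
\hat{y} \;=\; W^{\top}\, \mathrm{ReLU}(W y + b), \qquad
\mathcal{L}(W, b) \;=\; \mathbb{E}\,\big\| y - \hat{y} \big\|^{2}
```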
5. PAC-Bayesian Risk Function for Neural Networks
This is the last chapter of my Ph.D. thesis (work done with Pushpendre Rastogi at JHU (now at Amazon), and Dan Roy and Jun Yang at the Vector Institute for Artificial Intelligence, Department of Statistics at UToronto). A preliminary version of this work appeared at the ICML 2019 Workshop, ``Understanding and Improving Generalization in Deep Learning".
In here we have derived a new PAC-Bayesian bound for the stochastic risk of neural nets, i.e. the expected population risk (over the data distribution) of a neural net which is in turn sampled from a distribution over nets. Natural sources of such distributions are the ones induced by the output of any stochastic algorithm optimizing over the neural function space. This bound of ours is capable of leveraging fine-grained geometric data about the training algorithm. We can empirically show that our bounds supersede existing theoretical PAC-Bayesian neural risk bounds, not just in the tightness of the numerical value of the derived bound but also in giving better/slower rates of dependence on architectural parameters like the width and depth of the net.
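For readers new to this line of work, the classical template that all such results refine is the McAllester-type PAC-Bayes inequality below, where P is a prior over nets fixed before seeing the m training samples, Q is any posterior (e.g. one induced by the training algorithm), and L, \hat{L} are the population and empirical risks. Our bound is a refinement of this template that additionally exploits data- and algorithm-dependent structure.

```latex
% Classical PAC-Bayes template: with probability at least 1 - \delta over the
% m training samples, simultaneously for all posteriors Q over nets,
\mathbb{E}_{h \sim Q}\big[ L(h) \big]
\;\le\;
\mathbb{E}_{h \sim Q}\big[ \hat{L}(h) \big]
\;+\;
\sqrt{ \frac{ \mathrm{KL}(Q \,\|\, P) + \ln \frac{2\sqrt{m}}{\delta} }{ 2m } }
```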
These risk-bound proofs at their core rely on our new theorems constructing large families of noise distributions to which the nets can be provably resilient. This question of provable resilience of neural nets to noise distributions is intimately connected to the question of compressibility of nets - and this is a theme that we intend to explore further.
Also importantly, this work includes in its Appendix a re-derivation of the first neural PAC-Bayes bound by Neyshabur-Bhojanapalli-Srebro. We believe that our careful re-derivation not only better elucidates the use of data-dependent priors than their proof does but also fills in many of the missing details there, and thus makes it amenable for direct computational comparison against other contemporary bounds.
Slides From Talks Till 2019
Previously I have given review talks on deep-learning theory at the Vector Institute for Artificial Intelligence (Toronto, 2018), the SIAM Annual Meeting (Portland, 2018), the International Symposium on Mathematical Programming (ISMP) (Bordeaux, France, 2018), INFORMS 2018, the Massachusetts Institute of Technology (MIT) in 2017 and MOPTA 2017. The contents of these talks have now been absorbed into the slides above along with newer content.
This background tile is from a painting by my sister, Shubhalaxmi, who just finished her mathematics undergrad (BSc. + MSc.) at IISER, Pune