Anirbit

I am a Lecturer (Assistant Professor) in Computer Science at The University of Manchester. I am also a member of The Centre for A.I. Fundamentals. Here is my official profile.

 anirbit.mukherjee@manchester.ac.uk

I am deeply intrigued by how deep learning and neural nets seem to land us at exciting new questions about differential equations and functional analysis. I aspire to unravel these emerging questions in mathematics.


My Google Scholar Profile  

YouTube Playlist of Talks on Our Works

My Ph.D. Thesis: A Study of the Mathematics of Deep Learning


If you have a strong background in mathematics, statistics, theoretical physics or E.C.E., then feel free to email me about Ph.D. positions at our department. Potential candidates starting in September 2024 could go up for the "Dean's Doctoral Scholarship" or the "President's Doctoral Scholarship".

Look at the "Apply" tab above for more details! 

The Journey So Far

Till summer 2021 I was a post-doc in Statistics at Wharton with Weijie Su. I did my Ph.D. in applied mathematics with Amitabh Basu at the Department of Applied Mathematics and Statistics, Johns Hopkins University. My doctoral committee consisted of Mauro Maggioni (Bloomberg Distinguished Professor), Jeremias Sulam, Trac Tran, Laurent Younes and Jason Eisner. (During my Ph.D., the professor whose courses I attended the most was Prof. Mauro Maggioni!)

Between 2016 and 2018 I collaborated on papers with Trac D. Tran, Raman Arora and Dan Roy. From winter 2020 to early 2022, I executed multiple projects on deep-learning theory with Sayar Karmakar. Most recently I have also collaborated with Theodore Papamarkou.

During my Ph.D. I had multiple fruitful collaborations with other grad students, like Akshay Rangamani (now a post-doc at MIT), Soham De (now at DeepMind, London), Poorya Mianji and Enayat Ullah (current grad student at J.H.U.). I have exciting ongoing collaborations with Soham Dan (grad student at UPenn) and Pulkit Gopalani (an undergrad at IIT-Kanpur).

The direction of my mathematical interests got moulded during my undergrad when I read the books by Jurgen Jost, Gregory Naber and S. Kumaresan - these three books almost permanently determined my career directions! This critical reading during my undergrad was largely under the mentorship of Prof. S. Ramanan, and he had a fundamental influence on the nature of my science. Much later I was highly influenced by reading the books by Sanjeev Arora and Boaz Barak, Roman Vershynin, Phillip Rigollet, Phillip A. Griffiths, Cheng and Li, and Yvonne Choquet-Bruhat.

My original training is in Quantum Field Theory (while at the Chennai Mathematical Institute and the Tata Institute of Fundamental Research). And for years before that I trained as an artist - these background tiles are from oil and pastel paintings of mine from almost a decade and a half ago!


Our UKRI AI Center for Doctoral Training (CDT) is Now Live!

- its theme is titled "Decision-Making in Complex Systems"

- and these projects will be running for years to come!

This is a joint venture with the University of Cambridge. 

I am a co-PI on this grant and the Director of Admissions.

 On my CDT profile page, see the 2 Ph.D. projects that I have in this program!




NEWS


Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets appears at TMLR! 

Here we give a first-of-its-kind proof of SGD convergence on finitely large neural nets - for logistic loss in the binary classification setting. This continues our investigation into how neural loss functions can be "Villani functions" and how uncovering this almost magical mathematical property of neural loss functions can help prove convergence of gradient-based algorithms to their global minima. This is work with Pulkit Gopalani (PhD student at UMichigan) and Samyak Jha (undergrad at IIT-Bombay).
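For readers new to the term, here is a rough schematic of the conditions that go by the name "Villani function" in this line of work (this is a paraphrase, up to the exact smoothness assumptions and constant conventions - please see the paper for the precise definition). A smooth loss f on R^d is of the Villani type if, for every s > 0,

\[
\lim_{\|w\|\to\infty} f(w) = +\infty, \qquad
\int_{\mathbb{R}^d} e^{-\tfrac{2 f(w)}{s}} \, \mathrm{d}w < \infty, \qquad
\lim_{\|w\|\to\infty} \Big( \tfrac{\|\nabla f(w)\|^2}{s} - \Delta f(w) \Big) = +\infty .
\]

Roughly speaking, such conditions ensure that the associated Gibbs measure satisfies a Poincaré-type functional inequality, and that is the handle which lets one prove convergence of gradient-based algorithms to global minima.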


Size Lowerbounds on DeepONets appears at TMLR!

This work is with Amartya Roy @ Bosch, India. As far as we know, this is among the rare few proofs of any kind of architectural constraint for training performance that have ever been derived for any neural architecture. And it is almost entirely data independent -- hence a "universal lower bound"; in particular, the lower bounds we derive do not depend on the Partial Differential Equation that the DeepONet is being targeted to solve. Also, this analysis leads to certain "scaling law" conjectures for DeepONets, as we touch upon in the short experimental section.

 

1) https://arxiv.org/abs/2310.05169

Here we give a first-of-its-kind mathematical framework to understand how well neural nets can solve PDEs that blow up in finite time. This is work with Dibyakanti.

2) https://arxiv.org/abs/2310.04856  

LIPEx is a wholly new way to do explainable AI in multi-class classification settings.

It "explains the distribution" and beats many kinds of XAI methods in every test we have done across both text and images.

This is work with Angelo Cangelosi, our co-advisee Hongbo Zhu and Procheta Sen.

3) https://arxiv.org/abs/2309.09258

Here we give a first-of-its-kind proof of SGD convergence on finitely large neural nets - for logistic loss in the binary classification setting.

This continues our investigation into how neural loss functions can be of the Villani type and how that helps prove convergence of SGD.

This is work with Pulkit Gopalani (PhD student at UMichigan) and Samyak Jha (undergrad at IIT-Bombay).

The visit is being jointly funded by a Royal Society grant and ETH Zurich.

After years of incubation, my "NeuroTron" algorithm (from my PhD days in late 2020) finds a home!

See our recent paper at the Neurocomputing journal, https://doi.org/10.1016/j.neucom.2023.02.034

This is a first-of-its-kind provable training guarantee for any multi-gate net trying to do regression under a data-poisoning attack.

I got selected as a member of ELLIS. Now I have a photo on this page! :D 

Many thanks to Francis Bach, Arno Solin, Sayan Mukherjee and Samuel Kaski for their help!

Two of our extended abstracts have been accepted to DeepMath 2022.

Our work on learning a ReLU gate with and without a data-poisoning attack got accepted to the journal Neural Networks.

https://doi.org/10.1016/j.neunet.2022.03.040

This is the first linear-time convergence result for any stochastic algorithm on a ReLU gate without distributional assumptions.

 

      Pre-UManchester

A Path Into Deep-Learning Theory  

Here we assume that the reader has a working familiarity with a sufficient amount of real analysis, as in, say, this book by John Hunter and Bruno Nachtergaele, and some familiarity with basic learning theory, as in, say, these notes by Lorenzo Rosasco. I keep updating this file listing the best (and freely available) expository references that I have found on topics of my interest.

But apart from the above, here is a summarized list of what I think are the most immediate references that a seriously intentioned person can get started with - the reader is encouraged to choose their most comfortable combination of sources for each of the 5+1 groups below.


Note: Though the very beautiful lecture notes linked above are almost entirely self-contained, they might still be hard to follow unless one has some familiarity with the kind of content covered in one of the "Mathematics of Machine Learning" courses listed below.


        Various "Mathematics of Machine Learning" courses:

        by Afonso Bandeira , by Phillip Rigollet, by Yuxin Chen



Personally I am hugely indebted to the lecture notes of Sham Kakade and Ambuj Tewari for getting me started - I have hardly seen such a beautiful cruise directly into the core concepts, and so fast! Roi Livni's notes are a very beautiful path through the subject, whose initial parts cover a lot of material that is not covered in the other sources mentioned above. Francis Bach's lectures, towards the end, cover some very modern topics which aren't covered in the rest of the references given here.


3. Lecture notes by Geoff Gordon and Ryan Tibshirani

4. Lecture notes by Robert Freund on constrained non-linear optimization 

5. Lecture notes by Francis Bach  

6. Lecture notes by Yuxin Chen 


        Sebastian Bubeck's above lectures are possibly the best first tour of the subject that I have seen yet!

---------------------------------------------------------------------------------------------------------------------

Niche Techniques 


          a. 

One of the most succinct introductions to PDEs is this set of 2 courses at Stanford, Math 220A and Math 220B,

and also see 18.152 and 18.303 at MIT.

       (These lectures by Evy Kersale seem to be a more beginner-friendly introduction to P.D.E.s)

              Towards what's often needed in research, see these P.D.E. notes:

 by Gerald Teschl, by John Hunter, by Gustav Holzegel, by Lenya Ryzhik (general), by Lenya Ryzhik (fluids)

          b.  

For O.D.E.s see these comprehensive lectures by Christopher P. Grant

                        (For a more beginner-friendly approach see the lectures by Simon J Malham)

          c. 

A specialized book on measure concentration by Maxim Raginsky and Igal Sason 

          d.

Lecture notes by Philip Clement on gradient flows (based on the book by Luigi Ambrosio, Nicola Gigli and Giuseppe Savaré)

          e. 

Lecture notes by Bodhisattva Sen on empirical processes

         f.

Lecture notes by Bruce Hajek on random processes 

         g. 

Geometric Deep-Learning 

          h. 

Lecture notes by John Thickstun on generative models 

         i. 

Lecture notes by Zico Kolter, David Duvenaud, and Matt Johnson on deep implicit layers.


Why Deep-Learning?

"Deep Learning"/"Deep Neural Nets" are a technological marvel and they are being increasingly deployed at the cutting-edge of artificial intelligence tasks. This ongoing revolution can be said to have been ignited by the iconic 2012 paper from the University of Toronto titled ``ImageNet Classification with Deep Convolutional Neural Networks'' by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. This showed that deep nets can be used to classify images into meaningful categories with almost human-like accuracies! As of 2019 this approach continues to produce unprecedented performance for an ever widening variety of novel purposes ranging from playing chess to self-driving cars to  experimental astrophysics and high-energy physics. 

As theoreticians we might as well look beyond immediate practical implications and also focus on certain mind-boggling experiments like this one running live on the browser - which for all we know might also have concrete uses in the long run! Here, every time we refresh the page we are shown a seemingly human photograph (which sometimes might have minor defects), except that this photograph is completely artificially generated by a neural network! :-o The picture one sees on every refresh of the page is essentially a sample from a distribution generated by pushing forward the standard normal distribution via a certain neural function. In a sense this person is purely the net's imagination and he/she does not actually exist! So how did the net learn to "draw" such realistic human faces? This mechanism is still highly ill-understood and the best efforts yet by the community (see section 3.4 here of Sanjeev Arora's talk at the International Congress of Mathematicians) hint towards needing to go into far deeper waters than ever before - possibly into entirely uncharted territories of high-dimensional probability.
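To make the "pushforward" picture above concrete, here is a minimal, purely illustrative PyTorch sketch (this toy generator is untrained and is of course not the actual model behind the linked demo): refreshing the page corresponds to drawing a fresh z from N(0, I) and displaying g(z).

    # A minimal, purely illustrative sketch of "pushing forward" a standard normal
    # through a neural net -- NOT the (much larger, trained) generator behind the linked demo.
    import torch
    import torch.nn as nn

    latent_dim = 64          # dimension of the noise z (illustrative choice)
    image_dim = 28 * 28      # dimension of the generated "image" (illustrative choice)

    # An untrained toy generator; in a real GAN its weights are learned adversarially.
    generator = nn.Sequential(
        nn.Linear(latent_dim, 256),
        nn.ReLU(),
        nn.Linear(256, image_dim),
        nn.Tanh(),           # outputs in [-1, 1], a common convention for image pixels
    )

    # "Refreshing the page": sample z from the standard normal and push it through the net.
    z = torch.randn(8, latent_dim)        # 8 independent draws from N(0, I)
    fake_images = generator(z)            # 8 samples from the pushforward distribution
    print(fake_images.shape)              # torch.Size([8, 784])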

This plethora of astonishing new successes of deep neural nets in the last few years has turned out to be extremely challenging to explain with mathematical rigour. The study of neural networks is a very rapidly evolving field and there is an urgent need for mathematically coherent and yet accessible introductions to it. Thus motivated, I wrote this exposition on neural nets aimed at high-school students and beginning undergrads. My essay was picked up by this international consortium of scientists who have created this translation of my article into Bangla, my mother tongue.

A Summary Of Our Research In Deep-Learning 

In our works we have taken several steps towards building strong theoretical foundations for deep learning. 

Our proofs so far can be broadly grouped into the following 5 categories. 

1. Understanding Neural Function Spaces

 

2. Solving Differential Equations Via Neural Nets 

For the foreseeable future this is going to be a predominant part of my research program. One can see here a preliminary experimental study in this theme that my intern Pulkit, from IIT-Kanpur, published at a NeurIPS workshop. After this, Pulkit, Sayar Karmakar (at UFlorida) and I have done a lot more theoretical work on this theme, and details shall be coming in here over the next few months. Keep watching this space :)
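As a toy illustration of what "solving a differential equation via a neural net" means operationally (this is a generic physics-informed-style sketch with my own choice of ODE and hyperparameters, not the specific setup analysed in our papers): train a small net u to drive the residual of u' = u, u(0) = 1 to zero at sampled points, so that u approximates e^x on [0, 1].

    # A generic toy sketch of solving an ODE with a neural net (a "PINN"-style loss).
    # This only illustrates the idea; it is not the setup analysed in our papers.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # u : R -> R, a small network meant to approximate the solution of u' = u, u(0) = 1.
    u = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
    opt = torch.optim.Adam(u.parameters(), lr=1e-3)

    for step in range(5000):
        x = torch.rand(128, 1, requires_grad=True)            # collocation points in [0, 1]
        u_x = u(x)
        du_dx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]   # du/dx via autograd
        residual = du_dx - u_x                                 # ODE residual: u' - u
        bc = u(torch.zeros(1, 1)) - 1.0                        # boundary condition: u(0) = 1
        loss = (residual ** 2).mean() + (bc ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The trained net should now roughly match exp(x) on [0, 1].
    print(u(torch.tensor([[1.0]])).item())                    # should be close to e ≈ 2.718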

3. Investigating the Properties of Neural Network Training Dynamics


(the current draft of the above upcoming paper is available on request) 

4. Proving Neural Training Algorithms 



Part 1 : https://arxiv.org/abs/2005.01699 

(with Sayar Karmakar and Ramachandran Muthukumar) 

(A deterministic version of it was presented at DeepMath 2020)   


               Part 2 : https://arxiv.org/abs/2005.04211  (see here for the version that appeared in the Neural Networks journal)

(with Sayar Karmakar)


The question of provable training of neural nets is mathematically extremely challenging and vastly open. During 2020-2021 we investigated certain special cases at depth 2 and gave provable guarantees in regimes hitherto unexplored. Our results probe the particularly challenging trifecta of having finitely large nets, while not tying the data to any specific distribution and while having an adversarial attack.

We give 2 kinds of results, 



We have also demonstrated extensive experiments on autoencoders to identify a scheme of tuning ADAM's parameters (beyond the conventional ranges) which helps it supersede other alternatives in neural training performance.

The latest copy of this paper can be seen here. A preliminary version (arXiv:1807.06766), "Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration", appeared in the ICML 2018 Workshop, Modern Trends in Nonconvex Optimization for Machine Learning.


To the best of our knowledge this is among the very few proofs in the literature about unsupervised learning using neural nets.
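For concreteness, here is a minimal PyTorch sketch of the kind of setup the ADAM experiments above refer to: a small autoencoder trained with ADAM, with the hyperparameters that one would sweep exposed. The specific values below are hypothetical placeholders for what "beyond the conventional ranges" could look like; the actual ranges and findings are in the paper.

    # Minimal autoencoder + ADAM sketch; the hyperparameter values below are illustrative
    # placeholders for a sweep "beyond conventional ranges" -- see the paper for the actual ones.
    import torch
    import torch.nn as nn

    autoencoder = nn.Sequential(            # a small fully-connected autoencoder
        nn.Linear(784, 64), nn.ReLU(),      # encoder
        nn.Linear(64, 784), nn.Sigmoid(),   # decoder
    )

    # Conventional defaults are betas=(0.9, 0.999), eps=1e-8; a sweep would move these around.
    opt = torch.optim.Adam(
        autoencoder.parameters(),
        lr=1e-3,
        betas=(0.99, 0.999),   # hypothetical non-default value for beta_1
        eps=1e-4,              # hypothetical larger-than-default damping term
    )

    x = torch.rand(256, 784)                # stand-in for a batch of flattened images
    for _ in range(100):
        loss = nn.functional.mse_loss(autoencoder(x), x)   # reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()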

5. PAC-Bayesian Risk Function for Neural Networks 

This is the last chapter of my Ph.D. thesis (work done with Pushpendre Rastogi at JHU (now Amazon), and Dan Roy and Jun Yang at the Vector Institute of Artificial Intelligence, Department of Statistics at UToronto). A preliminary version of this work appeared at the ICML 2019 Workshop, "Understanding and Improving Generalization in Deep Learning".

Here we have derived a new PAC-Bayesian bound on the stochastic risk of neural nets, i.e. the expected population risk (over the data distribution) of a neural net which is in turn sampled from a distribution over nets. Natural sources of such distributions are the ones induced by the output of any stochastic algorithm optimizing over the neural function space. This bound of ours is capable of leveraging fine-grained geometric data about the training algorithm. We can empirically show that our bounds supersede existing theoretical PAC-Bayesian neural risk bounds, not just in the tightness of the numerical value of the derived bound but also in giving better/slower rates of dependence on architectural parameters like the width and depth of the net.
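For orientation, this is the classical McAllester/Maurer-style template from which PAC-Bayesian risk bounds of this kind typically start (the generic statement, not the specific bound derived in our work): for a loss taking values in [0, 1], a prior P over nets fixed before seeing the sample S of size m, any posterior Q, and any δ in (0, 1), with probability at least 1 - δ over the draw of S,

\[
\mathbb{E}_{h \sim Q}\big[L(h)\big]
\;\le\;
\mathbb{E}_{h \sim Q}\big[\widehat{L}_S(h)\big]
+ \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}},
\]

where L is the population risk and \widehat{L}_S the empirical risk on S. Our bound belongs to this family and, as described above, can additionally leverage fine-grained geometric data about the training algorithm.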

These risk-bound proofs at their core rely on our new theorems constructing large families of noise distributions to which the nets can be provably resilient. This question of provable resilience of neural nets to noise distributions is intimately connected to the question of compressibility of nets - and this is a theme that we intend to explore further.

Also, importantly, this work includes in its Appendix a re-derivation of the first neural PAC-Bayes bound by Neyshabur-Bhojanapalli-Srebro. We believe that our careful re-derivation not only better elucidates the use of data-dependent priors than their proof does, but also fixes many of the missing details there and thus makes it amenable for direct computational comparison against other contemporary bounds.

Slides From Talks Till 2019  

Previously I have given review talks on deep-learning theory at the Vector Institute of Artificial Intelligence (Toronto, 2018), the SIAM Annual Meeting (Portland, 2018), the International Symposium on Mathematical Programming (ISMP) (Bordeaux, France, 2018), INFORMS 2018, the Massachusetts Institute of Technology (MIT) in 2017 and MOPTA 2017. The contents of these talks have now been absorbed into the slides above along with newer content.




This background tile is from a painting by my sister, Shubhalaxmi, who just finished her mathematics undergrad (BSc. + MSc.) at IISER, Pune.