A theory of feature learning in neural networks

We have broken a decades-old bottleneck in machine learning by providing an analytic theory of feature learning in deep linear-width neural networks (the width scales linearly with the input dimension, and the model size and the number of training examples scale proportionally). In this regime the network is highly expressive, yet its parameter count remains constrained relative to the amount of data it is trained on. The model is thus forced to be "smart" in order to solve the task, i.e., to learn task-relevant features, which is the crux of deep learning.