BN "nat" Kausik

Interests: Computer Science & Economics

Venture: Arcot, NEA, Fineground, Trubates, Bitglass

Research & Teaching: Carnegie Mellon, HP Labs, Stanford, Illinois

Education: B.Tech., IIT-Madras; M.S., Princeton; Ph.D., Cornell.

GitHub  RePEc  SSRN  LinkedIn  Papers, Books, Patents

Recent 

Occam Gradient Descent

Deep learning neural network models must be large enough to adapt to their problem domain, yet small enough to avoid overfitting training data during gradient descent. To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large data sets, and are hence inefficient with both computing resources and training data. In response to these inefficiencies, we exploit learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size to minimize generalization error with gradient descent on model weights to minimize fitting error. In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error. Our algorithm simultaneously descends the space of weights and the topological size of any neural network without modification. With respect to loss, compute, and model size, our experiments show (a) on image classification benchmarks, linear and convolutional neural networks trained with Occam Gradient Descent outperform traditional gradient descent with or without post-training pruning; (b) on a range of tabular data classification tasks, neural networks trained with Occam Gradient Descent outperform traditional gradient descent, as well as Random Forests; (c) on natural language transformers, Occam Gradient Descent outperforms traditional gradient descent.
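
For intuition, here is a minimal PyTorch-style sketch of the interleaving idea: ordinary gradient steps on the weights, punctuated by steps that shrink the model. The magnitude-based pruning rule and fixed schedule below are placeholder assumptions for illustration only; the paper's algorithm drives the size reduction from generalization-error considerations, not from this heuristic.

```python
# Illustrative sketch only, not the paper's algorithm: interleave gradient steps
# on the weights with periodic model-size reduction steps.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def occam_style_training(model, loader, epochs=5, prune_every=100, prune_frac=0.05):
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    step = 0
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)      # gradient descent on fitting error
            loss.backward()
            opt.step()
            step += 1
            if step % prune_every == 0:      # interleaved model-size reduction step
                for m in model.modules():
                    if isinstance(m, (nn.Linear, nn.Conv2d)):
                        # masking weights stands in for actually shrinking the model;
                        # the criterion and schedule here are placeholders
                        prune.l1_unstructured(m, name="weight", amount=prune_frac)
    return model
```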

Scaling Efficient LLMs

Trained LLMs are typically sparse in that most of the parameters are zero, raising questions about efficiency. In response, we inquire into efficient LLMs, i.e., those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, we compare theoretical and empirical estimates for training loss at current scale to obtain upper and lower bounds on the number of unique sequences in a natural training corpus as a function of its size. Our result implies that (1) to double the number of skills represented in a training corpus, the corpus must scale roughly three- to five-fold; (2) for efficient LLMs, the number of parameters N and the size D of a natural training corpus scale as N ∼ D^0.58; and (3) if the number of parameters of an LLM is smaller than the number of unique sequences in the training corpus, scaling up can uncover emergent skills.
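
A back-of-the-envelope reading of these scaling relations, with illustrative numbers (the constant factor and corpus size below are assumptions, not results from the paper):

```python
# Plugging the stated relations into a toy calculation:
# (1) doubling the skills in a corpus requires roughly 3x-5x more data;
# (2) efficient-LLM parameters scale as N ~ D**0.58.

def params_for_corpus(D_tokens, k=1.0, alpha=0.58):
    """Efficient parameter count implied by N ~ D^0.58 (k is an unknown constant)."""
    return k * D_tokens ** alpha

D = 1.0e12                        # a 1-trillion-token corpus (example value)
for growth in (3, 5):             # corpus growth needed to double represented skills
    ratio = params_for_corpus(growth * D) / params_for_corpus(D)
    print(f"corpus x{growth} (skills roughly doubled): efficient N grows ~{ratio:.2f}x")
```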

Equity Premium in Efficient Markets

Equity premium, the surplus returns of stocks over bonds, has been an enduring puzzle. While numerous prior works approach the problem assuming the utility of money is invariant across contexts, our approach implies that in efficient markets the utility of money is polymorphic, with risk aversion dependent on the information available in each context, i.e. the discount on each future cash flow depends on all information available on that cash flow. Specifically, we prove that in efficient markets, informed investors maximize return on volatility by being risk-neutral with riskless bonds, and risk-averse with equities, thereby resolving the puzzle. We validate our results on historical data with surprising consistency. Code & Data
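
As a toy illustration of the "return on volatility" quantity (not the paper's derivation; the cash flows, riskless rate, and risk-averse discount below are made-up values), pricing a risky payoff below its risk-neutral value is what produces a positive excess return per unit of volatility, i.e., an equity premium over riskless bonds:

```python
# Toy illustration only: return on volatility under risk-neutral vs. risk-averse pricing.
import numpy as np

rf = 0.02                                    # riskless rate (example value)
cash_flows = np.array([80.0, 100.0, 120.0])  # equally likely payoffs next year (example)
exp_cf = cash_flows.mean()

price_neutral = exp_cf / (1 + rf)            # risk-neutral price: expected return equals rf
price_averse = price_neutral * 0.94          # illustrative risk-averse discount

for label, price in [("risk-neutral", price_neutral), ("risk-averse", price_averse)]:
    returns = cash_flows / price - 1
    ret_on_vol = (returns.mean() - rf) / returns.std()
    print(f"{label:12s} price={price:6.2f}  E[r]={returns.mean():.3f}  "
          f"excess return per volatility={ret_on_vol:.2f}")
```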

Cognitive Aging and Labor Share

Labor share, the fraction of economic output accrued as wages, is inexplicably declining in industrialized countries. While numerous prior works attempt to explain the decline via economic factors, our novel approach links the decline to biological factors. Specifically, we propose a theoretical macroeconomic model where labor share reflects a dynamic equilibrium between the workforce automating existing outputs, and consumers demanding new output variants that require human labor. Industrialization leads to an aging population, and while cognitive performance is stable during the working years, it drops sharply thereafter. Consequently, the declining cognitive performance of aging consumers reduces the demand for new output variants, leading to a decline in labor share. Our model expresses labor share as an algebraic function of median age, and is validated with surprising accuracy on historical data across industrialized economies via non-linear stochastic regression. Code & Data
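
A minimal sketch of this kind of validation step, fitting labor share as a nonlinear function of median age by least squares. The functional form and data points below are placeholders, not the paper's model or data (see the Code & Data link for the actual ones):

```python
# Placeholder sketch of nonlinear regression of labor share on median age.
import numpy as np
from scipy.optimize import curve_fit

def labor_share(median_age, a, b, c):
    # hypothetical declining (logistic) form: share falls as the population ages
    return a / (1.0 + np.exp(b * (median_age - c)))

median_age = np.array([30.0, 33.0, 36.0, 39.0, 42.0, 45.0])  # placeholder values
share = np.array([0.64, 0.63, 0.61, 0.59, 0.57, 0.55])       # placeholder values

params, _ = curve_fit(labor_share, median_age, share, p0=[0.66, 0.15, 55.0])
print("fitted (a, b, c):", params)
```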

Long Tails & the Impact of GPT on Labor

Recent advances in AI technologies renew urgency to the question of whether automation will cause mass unemployment and a reduction in standards of living. While prior work analyzes historical economic data for the impact of automation on labor, we seek a test to predict the impact of emerging automation technologies such as Generative Pre-trained Transformers (GPT). Towards that goal, we observe that human needs favor long tail distributions, i.e., a long list of niche items that are substantial in aggregate popularity. In turn, the long tails are reflected in the products and services that fulfill those needs. Technologies that address a small portion of the distribution, typically the head, free up human labor to focus on more complex tasks in the long tail, thereby improving productivity and potentially lifting wages. In contrast, technologies that cover substantial portions of the long tail can squeeze wages or displace humans entirely. With this in mind, we propose a long tail test for automation technologies to predict their impact on labor. We find that popular GPTs perform poorly on such tests in that they are erratic on straightforward long tail tasks; hence, absent breakthroughs, they will augment human productivity rather than cause mass displacement of human labor. Going forward, we believe that to have a broad impact on displacing or devaluing human labor, AI must at least be capable of the long-tail tasks that humans perform with ease.
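
To see why the tail matters in aggregate, a small sketch with a Zipf-like popularity distribution (the exponent, catalog size, and head cutoff are illustrative choices, not figures from the paper):

```python
# Under a Zipf-like popularity distribution, the many niche items beyond the
# "head" still carry a large share of aggregate demand.
import numpy as np

n_items = 100_000
ranks = np.arange(1, n_items + 1)
popularity = 1.0 / ranks            # Zipf exponent 1 (illustrative assumption)
popularity /= popularity.sum()

head = 100                          # "head": the 100 most popular items (illustrative cutoff)
print(f"head ({head} items) share of demand: {popularity[:head].sum():.2f}")
print(f"tail ({n_items - head} items) share of demand: {popularity[head:].sum():.2f}")
# A technology that automates only the head leaves most aggregate demand to human
# labor; one that also covers the tail displaces far more of it.
```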

Accelerating Machine Learning via the Weber-Fechner Law

The Weber-Fechner Law observes that human perception scales as the logarithm of the stimulus. We argue that learning algorithms for human concepts could benefit from the Weber-Fechner Law. Specifically, we impose Weber-Fechner on simple neural networks, with or without convolution, via the logarithmic power series of their sorted output. Our experiments show surprising performance and accuracy on the MNIST data set within a few training iterations and with limited computational resources, suggesting that Weber-Fechner can accelerate machine learning of human concepts.
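
One plausible reading of the output transform, sketched in PyTorch below. The exact "logarithmic power series of the sorted output" construction is defined in the paper; the rank-damped log series here is an assumption for illustration only.

```python
# Illustrative sketch only: score each example by a log series over its sorted outputs.
import torch
import torch.nn as nn

class LogSortedScore(nn.Module):
    """Scores each example by a log series over its sorted class outputs."""
    def __init__(self, num_terms=10, eps=1e-6):
        super().__init__()
        self.num_terms = num_terms   # assumes the network has at least this many outputs
        self.eps = eps

    def forward(self, logits):
        probs = torch.softmax(logits, dim=-1)
        top, _ = torch.sort(probs, dim=-1, descending=True)
        k = torch.arange(1, self.num_terms + 1, device=logits.device)
        # Weber-Fechner: log(stimulus), damped by rank (illustrative series weights)
        terms = torch.log(top[..., : self.num_terms] + self.eps) / k
        return terms.sum(dim=-1)     # one scalar score per example

# usage (hypothetical): score = LogSortedScore()(backbone(x))
```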