Neural Scaling Laws
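For orientation before the year-by-year lists: several entries below, including Scaling Laws for Neural Language Models (Kaplan et al.) and the Chinchilla paper, fit held-out loss with a saturating power law in parameter count N and training tokens D. A minimal sketch of that parametric form, with E, A, B, \alpha, \beta as fitted constants:

L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

Compute-optimal training in the Chinchilla sense then minimizes L(N, D) under the approximate budget constraint C \approx 6ND.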


2024

Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit


Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Resolving discrepancies in compute-optimal scaling of language models

Power scheduler: A batch size and token number agnostic learning rate scheduler

A Practitioner's Guide to Continual Multimodal Pretraining

A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models

Better schedules for low precision training of deep neural networks

Learning with random learning rates

Learning to learn learning-rate schedules



Investigating Continual Pretraining in Large Language Models 



Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling

Scaling Laws in Linear Regression: Compute, Parameters, and Data

How Feature Learning Can Improve Neural Scaling Laws

Random matrix methods for high-dimensional machine learning models

How predictable is language model benchmark performance?

The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images

2023

Extrapolating performance in language modeling benchmarks

DataComp: In search of the next generation of multimodal datasets

Emergent and predictable memorization in large language models

LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models

Uncovering Neural Scaling Laws in Molecular Representation Learning

Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension

Unmonitorability of Artificial Intelligence


Scaling Data-Constrained Language Models


Are emergent abilities of Large Language Models a mirage?

Neural scaling of deep chemical models

Data efficient neural scaling law via model reusing

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

Scaling laws for single-agent reinforcement learning 

Broken Neural Scaling Laws


Scaling Laws for Generative Mixed-Modal Language Models

Training Trajectories of Language Models Across Scales




2022

Holistic Evaluation of Language Models (HELM) - leaderboard

Reproducible scaling laws for contrastive language-image learning (LAION CLIP)

Scaling Laws Beyond Backpropagation

Beyond neural scaling laws: beating power law scaling via data pruning

Training compute-optimal large language models ("Chinchilla")

What Language Model to Train if You Have One Million GPU Hours?

  • Revisiting neural scaling laws in language and vision

  • A Solvable Model of Neural Scaling Laws

  • Transcending scaling laws with 0.1% extra compute

  • Unified Scaling Laws for Routed Language Models - scaling laws for MoEs (mixture-of-experts models)

  • Scaling laws and persistence in human brain activity

  • Scaling Scaling Laws with Board Games - Scaling laws for AlphaZero on Hex

  • Scaling Laws for Neural Language Models (Kaplan et al., the original scaling-laws paper)

  • Scaling Laws for Autoregressive Generative Modeling

  • Scaling Laws for Transfer

  • Explaining Neural Scaling Laws

  • A Neural Scaling Law from the Dimension of the Data Manifold

  • Scaling vision transformers 

  • Deep Learning Scaling is Predictable, Empirically

  • Learning Curve Theory

  • On Power Laws in Deep Ensembles

  • A constructive prediction of the generalization error across scales  

  • Jonathan Rosenfeld's PhD thesis on Scaling Laws for Deep Learning

  • Learning Curves: Asymptotic Values and Rate of Convergence


Scaling and Reinforcement Learning

  • Human-Timescale Adaptation in an Open-Ended Task Space

  • Online Decision Transformer 

  • Can Wikipedia Help Offline Reinforcement Learning? 

  • Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

  • Training language models to follow instructions with human feedback

  • Offline Pre-trained Multi-Agent Decision Transformer

  • Fine-Tuning Language Models from Human Preferences  

  • Learning to summarize from human feedback 

  • Recursively Summarizing Books with Human Feedback  



Miscellaneous

Synergy and symmetry in deep learning: Interactions between the data, model, and inference algorithm


Surveys:

EpochAI Scaling Laws Literature Review and A database of papers on scaling laws

The efficiency spectrum of large language models: An algorithmic survey


History of Scaling Laws:
Learning Curves: Asymptotic Values and Rate of Convergence (Cortes et al., 1994)



BNSL (Broken Neural Scaling Laws):

Multiply broken power-law densities as survival functions
