For an interactive page to browse the papers, please visit https://moss-workshop.vercel.app/
Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning
An Empirical Investigation of Initialization Strategies for Kolmogorov–Arnold Networks
Is Visual Prompting the Right Setup for Knowledge Transfer in new Foundation Models?
Emergence of Hebbian Dynamics in Regularized Non-Local Learners
Quantitative Bounds for Length Generalization in Transformers
Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks
LiteByte: Efficient and Fast-Adapting MLPs for Online Byte-Level Prediction
Koopman Autoencoders Learn Neural Representation Dynamics
Effective Reinforcement Learning for Reasoning in Language Models
Dynamic Low-Rank Training with Spectral Regularization: Achieving Robustness in Compressed Representations
Efficient B-Tree Insertions Using Proximal Policy Optimization and Hierarchical Attention Models
Dataset Distillation for Memorized Data: Soft Labels can Leak Held-Out Teacher Knowledge
What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
Encoding Domain Insights into Multi-modal Fusion: Improved Performance at the Cost of Robustness
Cross-Validation Error Dynamics in Smaller Datasets
Transformers May Learn to Classify In-Context by Context-Adaptive Kernel Gradient Descent
Exploring Diverse Solutions for Underdetermined Problems
Measuring Memorization and Generalization in Forecasting Models via Structured Perturbations of Chaotic Systems
Evaluating Generalization and Representation Stability in Small LMs via Prompting, Fine-Tuning and Out-of-Distribution Prompts
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
TinyServe: Query-Aware Cache Selection for Efficient LLM Inference
Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
Neural Stochastic Differential Equations on Compact State-Spaces
Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning
Understanding How Chess-Playing Language Models Compute Linear Board Representations
AdaptMI: Adaptive Skill-based In-context Math Instructions for Small Language Models
CaliPSo: Calibrated Predictive Models with Sharpness as Loss Function
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Extrapolation by Association: Length Generalization Transfer in Transformers
Performance Plateaus in Inference-Time Scaling for Text-to-Image Diffusion Without External Models
Emergence, pretraining loss and associative recall: a toy model
Permutations as a testbed for studying the effect of input representations on learning
Foundation Models on a Budget: Approximating Blocks in Large Vision Models
On the Emergence of Position Bias in Transformers
Optimizing Explanations: Nuances Matter When Evaluation Metrics Become Loss Functions
The Necessity for Intervention Fidelity: Unintended Side Effects When Steering LLMs
Continuous Chain of Thought Enables Parallel Exploration and Reasoning
Why Loss Re-weighting Works If You Stop Early: Training Dynamics of Unconstrained Features
In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly
Stats or Facts: Decomposing Generalization in Language Models with Small-Scale Models
Pruning Increases Orderedness in Weight-Tied Recurrent Computation
Decomposed Learning: An Avenue for Mitigating Grokking
How Much Context Does Natural Language Actually Require? An Analysis Using LLMs as Statistical Oracles
Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO
Geometry of Rank Constraints in Shallow Polynomial Neural Networks
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
Towards Understanding Self-Pretraining for Sequence Classification
Personalizing AI Interventions in Multiple Health Behavioral Change Settings
Improving Pathfinding with Anchoring Tokens
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
Approximate Message Passing on General Factor Graphs using Shallow Neural Networks
Gradient descent in presence of extreme flatness and steepness
From SGD to Spectra: A Theory of Neural Network Weight Dynamics
Restoring Task-Relevant Information in Synthetic Data: A Small-Scale V-Information View
SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference
Learning Gaussian Mixture Models via Transformer Measure Flows
Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs
Review, Remask, Refine: Process-Guided Block Diffusion for Text Generation
Understanding Attention Glitches with Threshold Relative Attention
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training