Accelerating Protein Generation with K-mer Guided Speculative Decoding (10/24/2025)
Presenter: Tom Walton
Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet, in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24-32% speedup over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries (10/17/2025)
Presenter: Darin Tsui
The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for model explanation. While effective for local explanations of individual input sequences, scaling Shapley-based interpretability to extract global biological insights requires evaluating thousands of sequences--incurring exponential computational cost per query. We introduce SHAP zero, a novel algorithm that amortizes the cost of Shapley value computation across large-scale biological datasets. After a one-time model sketching step, SHAP zero enables near-zero marginal cost for future queries by uncovering an underexplored connection between Shapley values, high-order feature interactions, and the sparse Fourier transform of the model. Applied to models of guide RNA efficacy, DNA repair outcomes, and protein fitness, SHAP zero explains predictions orders of magnitude faster than existing methods, recovering rich combinatorial interactions previously inaccessible at scale. This work opens the door to principled, efficient, and scalable interpretability for black-box sequence models in biology.
PDPO: Parametric Density Path Optimization (10/03/2025)
Presenter: Sebastián Gutiérrez Hernández
We introduce Parametric Density Path Optimization (PDPO), a novel method for computing action-minimizing paths between probability densities. The core idea is to represent the target probability path as the pushforward of a reference density through a parametric map, transforming the original infinite-dimensional optimization over densities to a finite-dimensional one over the parameters of the map. We derive a static formulation of the dynamic problem of action minimization and propose cubic spline interpolation of the path in parameter space to solve the static problem. Theoretically, we establish an error bound of the action under proper assumptions on the regularity of the parameter path. Empirically, we find that using 3–5 control points of the spline interpolation suffices to accurately resolve both multimodal and high-dimensional problems. We demonstrate that PDPO can flexibly accommodate a wide range of potential terms, including those modeling obstacles, mean-field interactions, stochastic control, and higher-order dynamics. Our method outperforms existing state-of-the-art approaches in benchmark tasks, demonstrating superior computational efficiency and solution quality.
Multi-modal Foundation Model for Cosmological Simulation Data (09/19/2025)
Presenter: Bin Xia
We present a multi-modal foundation model for astrophysical galaxy data, designed to map between simulation- and observation-based galactic features. Our encoder-only transformer flexibly ingests scalar quantities (e.g., redshifts, galaxy masses) and vectors (e.g., star formation histories, spectra), supporting multi-task training that includes within-modality reconstruction and cross-modality prediction. With a dynamic masking strategy, the model can query arbitrary galaxy properties from partial inputs—including predicting spectra from redshift and mass, or estimating photometric redshifts from broadband magnitudes—while also recovering missing segments within a modality. Trained on 185,000 simulated galaxies from a gigaparsec-scale cosmology simulation, the model yields a 50% improvement in redshift estimation when combining LSST and SPHEREx photometry over LSST photometry alone, and a 63% improvement in stellar mass inference when combining late-time SFH with LSST photometry over early-time SFH with LSST photometry. The model demonstrates strong generalization across multi-modal tasks and lays the groundwork for future integration of higher-dimensional and structured data such as images, merger trees, and 3D fields. This approach provides a unified framework for connecting simulations and observations, advancing the development of generalizable astrophysical foundation models.