Machine Learning Publications (Peer-reviewed)

Neural Training of Additive Models and Tree Ensembles:

Shibal Ibrahim (MIT), Gabriel Isaac Aftriat (MIT), Kayhan Behdin (MIT), and Rahul Mazumder (MIT)

NeurIPS’23 

[Paper][Slides][Video][Code]

Generalized Additive Models (GAMs) are a family of flexible and interpretable models with old roots in statistics. GAMs are often used with pairwise interactions to improve model accuracy while still retaining flexibility and interpretability but lead to computational challenges as we are dealing with order of p^2 terms. It is desirable to restrict the number of components (i.e., encourage sparsity) for easier interpretability, and better computational and statistical properties. Earlier approaches, considering sparse pairwise interactions, have limited scalability, especially when imposing additional structural interpretability constraints. We propose a flexible GRAND-SLAMIN framework that can learn GAMs with interactions under sparsity and additional structural constraints in a differentiable end-to-end fashion. We customize first-order gradient-based optimization to perform sparse backpropagation to exploit sparsity in additive effects for any differentiable loss function in a GPU-compatible manner. Additionally, we establish novel non-asymptotic prediction bounds for our estimators with tree-based shape functions. Numerical experiments on real-world datasets show that our toolkit performs favorably in terms of performance, variable selection and scalability when compared with popular toolkits to fit GAMs with interactions. Our work expands the landscape of interpretable modeling while maintaining prediction accuracy competitive with non-interpretable black-box models. 

Shibal Ibrahim (MIT), Hussein Hazimeh (MIT), and Rahul Mazumder (MIT)

KDD'22 (Research Track) [Best Student Paper Award]

[Paper][Slides][Video][Code][Snippet]

Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support a limited number of loss functions and are restricted to single task learning. We propose a flexible framework for learning tree ensembles, which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We introduce FASTEL: a new toolkit (based on Tensorflow 2) for learning differentiable tree ensembles. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those obtained by popular toolkits.

Shibal Ibrahim (MIT), Kayhan Behdin (MIT), and Rahul Mazumder (MIT)

AISTATS'24 [Oral, Student Paper Highlight Award]

[Paper][Slides][Video][Code]

Joint feature selection and tree ensemble learning is a challenging task. Popular tree ensemble toolkits e.g., Gradient Boosted Trees and Random Forests support feature selection post-training based on feature importances, which are known to be misleading, and can significantly hurt performance. We propose Skinny Trees: a toolkit for feature selection in tree ensembles, such that feature selection and tree ensemble learning occurs simultaneously. It is based on an end-to-end optimization approach that considers feature selection in differentiable trees with Group L0−L2 regularization. We optimize with a first-order proximal method and present convergence guarantees for a non-convex and non-smooth objective. Interestingly, dense-to-sparse regularization scheduling can lead to more expressive and sparser tree ensembles than vanilla proximal method. On 15 synthetic and real-world datasets, Skinny Trees can achieve 1.5x - 620x feature compression rates, leading up to 10x faster inference over dense trees, without any loss in performance. Skinny Trees lead to superior feature selection than many existing toolkits e.g., in terms of AUC performance for 25% feature budget, Skinny Trees outperforms LightGBM by 10.2% (up to 37.7%), and Random Forests by 3% (up to 12.5%).

Sparse Mixture of Experts (MoE) and Routing for Scaling Vision and Language Models:

Shibal Ibrahim (MIT), Wenyu Chen (MIT), Hussein Hazimeh (Google), Natalia Ponomareva (Google), Zhe Zhao (Google), and Rahul Mazumder (MIT)

KDD'23 (Research Track)

[Paper][Slides][Video][Code][COMET-BERT Code]

The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains, such as natural language processing and vision. Sparse-MoEs select a subset of the "experts" (thus, only a portion of the overall network) for each input sample using a sparse, trainable gate. Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods. In this paper, we introduce two improvements to current MoE approaches. First, we propose a new sparse gate: COMET, which relies on a novel tree-based mechanism. COMET is differentiable, can exploit sparsity to speed up computation, and outperforms state-of-the-art gates. Second, due to the challenging combinatorial nature of sparse expert selection, first-order methods are typically prone to low-quality solutions. To deal with this challenge, we propose a novel, permutation-based local search method that can complement first-order methods in training any sparse gate, e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can help networks escape bad initializations or solutions. We performed large-scale experiments on various domains, including recommender systems, vision, and natural language processing. On standard vision and recommender systems benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over prior differentiable gates e.g., DSelect-k. When Top-k and Hash gates are combined with local search, we see up to 100x reduction in the budget needed for hyperparameter tuning. Moreover, for language modeling, our approach improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7 GLUE benchmarks as well as SQuAD dataset.

Shibal Ibrahim (MIT), Wenyu Chen (MIT), Hussein Hazimeh (Google), Natalia Ponomareva (Google), and Rahul Mazumder (MIT).

Under Review

[Paper][Slides][Video][Code]

The sparse Mixture-of-Experts (Sparse-MoE) is a promising framework for efficiently scaling up model capacity. This framework consists of a set of experts (subnetworks) and one or more routers. The routers activate only a small subset of the experts on a per-example basis, which can save on resources. Among the most widely used sparse routers are Top-k and its variants, which activate k experts for each example during training. While very effective at model scaling, these routers are prone to performance issues because of discontinuous nature of the routing problem. Differentiable routers have been shown to mitigate the performance issues of Top-k, but these are not k-sparse during training, which limits their utility. To address this challenge, we propose MOESART: a novel k-sparse routing approach, which maintains k-sparsity during both training and inference. Unlike existing routers, MOESART aims at learning a good k-sparse approximation of the classical, softmax router. We achieve this through carefully designed sampling and expert weighting strategies. We compare MOESART with state-of-the-art MoE routers, through large-scale experiments on 14 datasets from various domains, including recommender systems, vision, and natural language processing. MOESART achieves up to 16% (relative) reduction in out-of-sample loss on standard image datasets, and up to 15% (relative) improvement in AUC on standard recommender systems, over popular k-sparse routers, e.g., Top-k, V-MoE, Expert Choice Router and X-MoE. Moreover, for distilling natural language processing models, MOESART can improve predictive performance by 0.5% (absolute) on average over the Top-k router across 7 GLUE and 2 SQuAD benchmarks.

Pruning Large Vision and Language Models:

Gabriel Afriat(MIT), Xiang Meng(MIT), Shibal Ibrahim (MIT), Hussein Hazimeh (Google), and Rahul Mazumder (MIT)

Under Review

[Paper][Slides][Video][Code]

Weight pruning is a common technique for compressing large neural networks. Popular pruning approaches involve optimizing a single objective function under sparsity constraints. Various objective functions have been proposed in the pruning literature, including loss functions based on second-order approximation and layer-wise reconstruction errors. In this work, we demonstrate that pruning by optimizing multiple objectives simultaneously in a multi-objective framework (MOONSHOT) can significantly improve model performance, compared to existing single-objective approaches. Specifically, we formulate pruning as a multi-objective optimization problem and develop a corresponding pruning algorithm that can scale to large models. The pruning algorithm requires specifying the sparsity level per layer, and simple approaches with uniform sparsity across layers may not work well. We propose a new approach to optimally allocate the sparsity budget across layers. To allocate sparsity budget across layers, we set up an optimization problem that accounts for layer interactions when allocating sparsity budgets. To address this challenging optimization problem, we propose specialized algorithms that result in improvements over existing approaches in terms of accuracy of the pruned model. We validate our proposals on a range of pre-trained vision transformer models, a type of architecture relatively understudied in the context of pruning. For one-shot pruning of Deit-Base vision transformer, MOONSHOT attains 72.8% accuracy at 80% sparsity. This represents a 24-29% improvement over state-of-the-art Correlation Aware Pruner and Optimal Brain Compression approaches.

Xiang Meng(MIT), Shibal Ibrahim (MIT), Kayhan Behdin (MIT), Hussein Hazimeh (Google), Natalia Ponomareva (Google) and Rahul Mazumder (MIT)

ICML'24

[Paper][Slides][Video][Code]

Structured pruning is a promising approach for reducing the inference costs of large vision and language models. By removing carefully chosen structures, e.g., neurons or attention heads, the improvements from this approach can be realized on standard deep learning hardware. In this work, we focus on structured pruning in the one-shot (post-training) setting, which does not require model retraining after pruning. We propose a novel combinatorial optimization framework for this problem, based on a layer-wise reconstruction objective and a careful reformulation that allows for scalable optimization. Moreover, we design a new local combinatorial optimization algorithm, which exploits low-rank updates for efficient local search. Our framework is time and memory-efficient and considerably improves upon state-of-the-art one-shot methods on vision models (e.g., ResNet50, MobileNet) and language models (e.g., OPT-1.3B – OPT-30B). For language models, e.g., OPT-2.7B, OSSCAR can lead to 125x lower test perplexity on WikiText with 2× inference time speedup in comparison to the state-of-the-art ZipLM approach. Our framework is also 6x – 8x faster. Notably, our work considers models with tens of billions of parameters, which is up to 100x larger than what has been previously considered in the structured pruning literature.

Transfer Learning:

Shibal Ibrahim (MIT), Natalia Ponomareva (Google), and Rahul Mazumder (MIT)

Short Paper in NeurIPS'21 Distribution Shifts Workshop.

Full Paper in ECMLPKDD'22.

[Paper][Slides]

Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identification of best models to transfer-learn from and quantifying transferability prevents expensive re-training on all of the candidate models/tasks pairs. In this paper, we show that the statistical problems with covariance estimation drive the poor performance of H-score — a common baseline for newer metrics — and propose shrinkage-based estimator. This results in up to 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure. Our shrinkage-based H-score is 3x − 10x faster to compute compared to LogME. Additionally, we look into a less common setting of target (as opposed to source) task selection. We demonstrate previously overlooked problems in such settings with different number of labels, class-imbalance ratios etc. for some recent metrics e.g., NCE, LEEP that resulted in them being misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We support our findings with ~164,000 (fine-tuning trials) experiments on both vision models and graph neural networks.

Time Series Forecasting and Network Analysis with Graph Neural Networks (Finance Applications):

Shibal Ibrahim (MIT), Wenyu Chen (MIT), Yada Zhu (IBM), Pin-Yu Chen (IBM), Yang Zhang (IBM), and Rahul Mazumder (MIT)

Short Paper in KDD’21 ML in Finance Workshop.

Full Paper in ICAIF'22.

[Paper][Slides][Code

Financial time series forecasting is challenging due to limited sample size, correlated samples, low signal strengths, among others. Additional information with knowledge graphs (KGs) can allow for improved prediction and decision making. In this work, we explore a framework GregNets for jointly learning forecasting models and correlations structures that exploit graph connectivity from KGs. We propose novel regularizers based on KG relations to guide estimation of correlation structure. We develop a pseudo-likelihood layer that can learn the error residual structure for any multivariate time-series forecasting architecture in deep learning APIs (e.g. Tensorflow). We evaluate our modeling and algorithmic proposals in small sample regimes in real-world financial markets with two types of KGs. Our empirical results demonstrate sparser connectivity structures, runtime improvements and high-quality predictions.

Shibal Ibrahim (MIT), Max Tell (MIT), and Rahul Mazumder (MIT)

Short Paper in KDD’22 Time Series Workshop.

Full Paper in ICAIF'23 [Oral].

[Paper][Slides][Code]

Spatio-temporal modeling is an essential lens to understand many real-world phenomena from traffic to finance. There has been exciting work that explores spatio-temporal modeling with temporal graph convolutional networks. Often these methods assume that the spatial structure is static. We propose a new model Dyn-GWN for spatio-temporal learning from time-varying graphs. Our model relies on a novel module called the Tensor Graph Convolutional Module (TGCM), which captures dynamic trends in graphs effectively in the time-varying graph representations. This module has two components: (i) it applies temporal dilated convolutions both on the time-varying graph adjacency space and the time-varying features. (ii) it aggregates the higher-level latent representations from both time-varying components through a proposed layer TGCL. Experiments demonstrate the efficacy of these model across time-series data from finance and traffic domains. Dyn-GWN  can give up to better out-of-sample performance than prior methods that learn from time-varying graphs, e.g., EvolveGCN and TM-GCN. Interestingly, Dyn-GWN  can be ~300x faster than EvolveGCN, which is the more competitive baseline from state-of-the-art models that cater to time-varying graphs.

Other Topics on Interpretable Statistical Learning:

Shibal Ibrahim (MIT), Rahul Mazumder (MIT), Peter Radchenko (University of Sydney), and Emmanuel Ben-David (US Census Bureau) 

Annals of Applied Statistics (to appear)

[Paper][Slides][Code

In this paper we consider the problem of predicting survey response rates using a family of flexible and interpretable nonparametric models. The study is motivated by the US Census Bureau's well-known ROAM application which uses a linear regression model trained on the US Census Planning Database data to identify hard-to-survey areas. A crowdsourcing competition organized around ten years ago revealed that machine learning methods based on ensembles of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to their black-box nature. We consider nonparametric additive models with small number of main and pairwise interaction effects using  L0-based penalization. From a methodological viewpoint, we study both computational and statistical aspects of our estimator; and discuss variants that incorporate strong hierarchical interactions. Our algorithms (opensourced on github) extend the computational frontiers of existing algorithms for sparse additive models, to be able to handle datasets relevant for the application we consider. We discuss and interpret findings from our model on the US Census Planning Database. In addition to being useful from an interpretability standpoint, our models lead to predictions that appear to be better than popular black-box machine learning methods based on gradient boosting and feedforward neural networks -- suggesting that it is possible to have models that have the best of both worlds: good model accuracy and interpretability.

Haoyue Wang (MIT), Shibal Ibrahim (MIT), and Rahul Mazumder (MIT)

SIAM Journal on Mathematics of Data Science (SIMODS) (to appear)

[Paper][Code]

We explore computational aspects of maximum likelihood estimation of the mixture proportions of a nonparametric finite mixture model---a convex optimization problem with old roots in statistics and a key member of the modern data analysis toolkit. Motivated by problems in shape-constrained inference, we consider structured variants of this problem with additional convex polyhedral constraints.  We propose a new cubic regularized Newton method for this problem and present novel worst-case and local computational guarantees for our algorithm. We extend earlier work by Nesterov and Polyak to the case of a self-concordant objective with polyhedral constraints, such as the ones considered herein. We propose a Frank-Wolfe method to solve the cubic regularized Newton subproblem;  and derive efficient solutions for the linear optimization oracles that may be of independent interest. In the particular case of Gaussian mixtures without shape constraints, we derive bounds on how well the finite mixture problem approximates the infinite-dimensional Kiefer-Wolfowitz maximum likelihood estimator. Experiments on synthetic and real datasets suggest that our proposed algorithms exhibit improved runtimes and scalability features over existing benchmarks.