Schedule

 

"When something is important enough, you do it even if the odds are not in your favor."

~ Elon Musk, Engineer

All times below are in EST.

Day 1

November 27th

2pm EST

in KT 244

Title: Clustering a mixture of Gaussians with unknown covariance.

Speaker: Mateo Díaz

Abstract: Clustering is a fundamental task in data science with broad applications. This talk investigates a simple clustering problem with data drawn from a mixture of Gaussians that share a common but unknown, and potentially ill-conditioned, covariance matrix. We start by considering Gaussian mixtures with two equally sized components and derive a Max-Cut integer program based on maximum likelihood estimation. We show that its solutions achieve the optimal misclassification rate when the number of samples grows linearly in the dimension, up to a logarithmic factor. However, solving the Max-Cut problem appears to be computationally intractable. To overcome this, we develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size. Although this sample complexity is worse than that of the Max-Cut problem, we conjecture that no polynomial-time method can perform better. Furthermore, we present numerical and theoretical evidence that supports the existence of a statistical-computational gap.
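
The mixture model in the abstract is easy to simulate. The sketch below is a toy illustration of the problem setup only, not the speaker's Max-Cut or spectral method: it draws a balanced two-component mixture with a shared non-spherical covariance and scores a naive classifier that, unlike the setting of the talk, is allowed to know the mean direction. All parameter choices are invented.

```python
import random

random.seed(1)

def sample_mixture(n, mu, chol):
    """Draw n points from a balanced mixture of N(+mu, L L^T) and N(-mu, L L^T),
    where L = chol is a lower-triangular factor of the shared covariance."""
    pts, labels = [], []
    for _ in range(n):
        s = random.choice([-1, 1])                    # hidden component label
        z = [random.gauss(0, 1) for _ in mu]          # standard normal vector
        x = [s * m + sum(chol[i][j] * z[j] for j in range(i + 1))
             for i, m in enumerate(mu)]
        pts.append(x)
        labels.append(s)
    return pts, labels

mu = [2.0, 0.0]
chol = [[1.0, 0.0], [0.5, 2.0]]          # shared, ill-conditioned-ish covariance
pts, labels = sample_mixture(500, mu, chol)

# Naive oracle classifier: sign of the projection onto the (known here) mean axis.
pred = [1 if x[0] >= 0 else -1 for x in pts]
err = sum(p != s for p, s in zip(pred, labels)) / len(labels)
rate = min(err, 1 - err)                 # misclassification up to label swap
```

With the means two standard deviations apart along the first coordinate, the oracle misclassification rate is roughly 2%; the talk concerns what is achievable when neither the means nor the covariance are known.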

Short Talks and Posters

Day 2

November 28th

2pm EST

in LB 211

Title: Simulation-Calibrated Scientific Machine Learning

Speaker: Yiping Lu

Abstract: Machine learning (ML) has achieved great success in a variety of applications, suggesting a new way to build flexible, universal, and efficient approximators for complex high-dimensional data. These successes have inspired many researchers to apply ML to other scientific applications such as industrial engineering, scientific computing, and operational research, where similar challenges often occur. However, the remarkable success of ML is overshadowed by persistent concerns that the mathematical theory of large-scale machine learning, especially deep learning, is still lacking and that the trained ML predictor is always biased. In this talk, I'll introduce a novel framework of (S)imulation-(Ca)librated (S)cientific (M)achine (L)earning (SCaSML), which can leverage the structure of physical models to achieve the following goals: 1) make unbiased predictions even based on biased machine learning predictors; 2) beat the curse of dimensionality with an estimator that suffers from it. The SCaSML paradigm combines a (possibly) biased machine learning algorithm with a de-biasing step designed using rigorous numerical analysis and stochastic simulation. Theoretically, I'll try to understand whether the SCaSML algorithms are optimal and what factors (e.g., smoothness, dimension, and boundedness) determine the improvement of the convergence rate. Empirically, I'll introduce different estimators that enable unbiased and trustworthy estimation of physical quantities with a biased machine learning estimator. Applications include but are not limited to estimating the moment of a function, simulating high-dimensional stochastic processes, uncertainty quantification using bootstrap methods, and randomized linear algebra.
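
The de-biasing idea, a cheap but biased predictor corrected by an unbiased simulation term, can be illustrated with a classical control-variate-style estimator. This is a generic sketch, not the SCaSML algorithm itself; the functions f and g and the Gaussian input are invented stand-ins for a physics solver and a trained ML surrogate.

```python
import math
import random

random.seed(2)

def f(x):
    """Expensive 'exact' quantity of interest (stand-in for a physics solver)."""
    return math.exp(x)

def g(x):
    """Cheap surrogate with systematic bias (stand-in for an ML predictor):
    the first-order Taylor expansion of exp around 0."""
    return 1.0 + x

n = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

surrogate_only = sum(g(x) for x in xs) / n          # converges to 1.0: biased
correction = sum(f(x) - g(x) for x in xs) / n       # Monte Carlo bias estimate
debiased = surrogate_only + correction              # unbiased for E[f(X)]
```

Here the true value is E[exp(X)] = exp(1/2) ≈ 1.649, which the surrogate alone misses by about 0.65. In this toy version the correction simply restores unbiasedness; the point of a calibrated framework is to make that correction term cheap and low-variance.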

3pm EST

in KT 128

Data Science for the Biological Sciences - Workshop 1

Speakers: Kathleen Lois Foster & Alessandro Maria Selvitella

Abstract: This workshop is aimed at students and practitioners in the biological sciences who are interested in developing coding and data science skills to solve concrete problems emerging in their field of study. By the end of the workshops, participants will have gathered technical and theoretical skills in data science. They will have learned how to install R and RStudio, use the statistical software R to perform basic statistical analysis of a biological question, and visualize the biological information hidden in the data under study. Furthermore, they will have gained knowledge about the structure of different types of data; descriptive statistics, including the mean, standard deviation, confidence intervals, and probability distributions; how to perform hypothesis tests, including the t-test and ANOVA; and the difference between statistical and biological significance.
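
The workshop itself uses R; as a language-agnostic preview of one listed topic, here is the pooled two-sample t statistic written from scratch in Python. The data and group names are invented for illustration.

```python
import math
from statistics import mean, stdev

def two_sample_t(a, b):
    """Pooled two-sample t statistic (equal-variance assumption), the quantity
    behind R's t.test(a, b, var.equal = TRUE)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2  # statistic and degrees of freedom

# Invented example: sprint speeds (m/s) of animals on two substrates.
sand = [5.1, 5.3, 5.2, 5.0]
rock = [4.1, 4.0, 4.2, 4.3]
t, df = two_sample_t(sand, rock)
```

A large |t| relative to the t distribution with df degrees of freedom indicates a statistically significant difference in means; whether that difference is biologically meaningful is a separate question, as the workshop emphasizes.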

Slides

R-code

Short Talks and Posters

Day 3

November 29th

2pm EST

in KT 244

Title: Survival Analysis via Ordinary Differential Equations

Speaker: Weijing Tang

Abstract: Survival analysis is an extensively studied branch of statistics with wide applications in various fields. Despite the rich literature on survival analysis, the growing scale and complexity of modern data create new challenges that existing statistical models and estimation methods cannot meet. In the first part of this talk, I will introduce a novel and unified ordinary differential equation (ODE) framework for survival analysis. I will show that this ODE framework allows flexible modeling and enables a computationally and statistically efficient procedure for estimation and inference. In particular, the proposed estimation procedure is scalable, easy to implement, and applicable to a wide range of survival models. In the second part, I will present how the proposed ODE framework can be used to address the intrinsic optimization challenge in deep learning survival analysis, so as to accommodate data in diverse formats.
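
The basic link between hazard and survival is itself an ODE: S'(t) = -h(t) S(t) with S(0) = 1, whose solution is S(t) = exp(-∫₀ᵗ h(u) du). The sketch below is a generic illustration of that classical identity, not the speaker's framework: it integrates the ODE with forward Euler and checks it against the closed form for a constant hazard.

```python
import math

def survival_euler(hazard, t_max, n_steps):
    """Integrate S'(t) = -h(t) * S(t), S(0) = 1, by forward Euler."""
    dt = t_max / n_steps
    s, t = 1.0, 0.0
    for _ in range(n_steps):
        s -= dt * hazard(t) * s
        t += dt
    return s

# Constant hazard h = 0.5 gives exponential survival, S(t) = exp(-0.5 t).
approx = survival_euler(lambda t: 0.5, t_max=2.0, n_steps=10_000)
exact = math.exp(-1.0)
```

Treating the survival function as the solution of an ODE is what lets standard numerical integrators, and their error analysis, be brought to bear on estimation.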

Links: [Part 1] [Part 2] [Code]

4pm EST

in KT 247

Title: Blob Method for Optimal Transport

Speaker: Harlin Lee

Abstract: Optimal transport (OT) aims to find the most efficient way of moving mass from one distribution to another with minimum cost. In this joint work with Katy Craig (UCSB) and Karthik Elamvazhuthi (UC Riverside), we apply the regularized particle method (i.e., the blob method) to dynamic OT, which results in a regularized and discretized optimization problem. This is shown to converge to the original problem formulation, with the added benefit of being able to flexibly handle state and control constraints. I will demonstrate a few numerical experiments on control theory, as well as potential applications to sampling.

Short Talks and Posters

Day 4

November 30th

2pm EST

in KT 118

Title: On finding natural groupings among images and other high-dimensional datasets

Speaker: Mireille Boutin

Abstract: Sample points in a high-dimensional space do not typically form agglomerations or "clusters" in the traditional sense. Current methods for clustering high-dimensional datasets typically project the data to a lower-dimensional space and use a clustering found in that lower-dimensional space to define a natural grouping among the original points. Thus, projection can cause clusters to appear among the sample points, even though these clusters may not exist in the original dataset. Still, a clustering found in projected data does define a natural grouping of the high-dimensional data points. But since there are many ways to project a dataset, there can be many ways to group the high-dimensional data points, and these different groupings may not be consistent. We propose a basic model for the overlapping grouping structures of a high-dimensional dataset, provide experimental evidence supporting this model, and show how to use it to generate synthetic images. This is joint work with Sangchun Han, Tarun Yellamraju, Alden Bradford, and Evzenie Coupkova.
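
The phenomenon that different projections of the same data induce different, inconsistent groupings is easy to see on toy data. The sketch below (a generic illustration, not the speaker's model) splits the same isotropic Gaussian cloud along two different coordinate projections and counts how often the two groupings disagree.

```python
import random

random.seed(0)

# Toy "high-dimensional" dataset: 100 isotropic Gaussian points in 10 dimensions.
dim, n = 10, 100
points = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

def median_split(points, direction):
    """Project onto `direction` and split at the median projection value,
    giving a binary grouping of the points."""
    proj = [sum(p_i * d_i for p_i, d_i in zip(p, direction)) for p in points]
    cutoff = sorted(proj)[len(proj) // 2]
    return [1 if v >= cutoff else 0 for v in proj]

axis_1 = [1.0] + [0.0] * (dim - 1)         # project onto coordinate 1
axis_2 = [0.0, 1.0] + [0.0] * (dim - 2)    # project onto coordinate 2
g1 = median_split(points, axis_1)
g2 = median_split(points, axis_2)
disagreements = sum(a != b for a, b in zip(g1, g2))
```

Each grouping is perfectly "natural" for its own projection, yet for orthogonal projections of isotropic data they disagree on roughly half the points, which is exactly the inconsistency the abstract describes.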

3pm EST

in KT 118

Data Science for the Biological Sciences - Workshop 2

Speakers: Kathleen Lois Foster & Alessandro Maria Selvitella

Abstract: See Workshop 1 (Day 2) above.

Slides

R-code

6:00pm - 7:30pm EST

Virtual

The ChatGPT Revolution

Look for the documentary in the main hall of Gather "Data" Town!

Short Talks and Posters

Day 5

December 1st

10:45am EST

in KT 118

Title: Clustering probability distributions using the Fisher-Rao metric

Speaker: Alice Le Brigant

Abstract: The Fisher-Rao metric is a Riemannian metric defined on the parameter space of families of probability distributions, such as normal or beta distributions. It provides geometric tools that are useful to perform learning on probability distributions inside a given parametric family, in particular clustering. We will explore the Riemannian geometries of specific families of distributions and see how they can be used to cluster complex data such as point clouds and trajectories.
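
For univariate normal distributions the Fisher-Rao distance has a closed form, because the (μ, σ) parameter half-plane with the Fisher metric is hyperbolic. The sketch below computes it; feeding such pairwise distances into any metric-based clustering routine (e.g. k-medoids) is one way to cluster distributions within this family. The speaker's constructions for other families are not shown here.

```python
import math

def fisher_rao_normal(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).
    The Fisher metric ds^2 = (dmu^2 + 2 dsigma^2) / sigma^2 becomes, after
    the change of variables mu -> mu / sqrt(2), twice the Poincare
    half-plane metric, which yields the closed form below."""
    num = (mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + num / (2.0 * sigma1 * sigma2))
```

For example, the distance between N(0, 1) and N(0, 4) comes out to √2·ln 2, and like any metric it is symmetric and vanishes only between identical distributions.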

12:00pm - 1:30pm EST

Helmke Library - LB 440a

Panel Discussion

TBA

1:30pm - 2:00pm EST

Helmke Library - LB 440a

Poster and Short Talk Awards