Schedule

"When something is important enough, you do it even if the odds are not in your favor."

~ Elon Musk, Engineer

All times below are in EST.

Day 1

November 28th

18:00 - 19:00
KT 216

Video

Title: From classical statistics to modern deep learning

Speaker: Mikhail Belkin

Abstract: Recent empirical successes of deep learning have exposed significant gaps in our fundamental understanding of learning and optimization mechanisms. Modern best practices for model selection are in direct contradiction to the methodologies suggested by classical analyses. Similarly, the efficiency of the SGD-based local methods used to train modern models appears at odds with standard intuitions about optimization.

First, I will present evidence, empirical and mathematical, that necessitates revisiting classical statistical notions such as over-fitting. I will then discuss the emerging understanding of generalization and, in particular, the "double descent" risk curve, which extends the classical U-shaped generalization curve beyond the point of interpolation.

Second, I will discuss why the landscapes of over-parameterized neural networks are generically never convex, even locally. Instead, they satisfy the Polyak-Lojasiewicz (PL) condition across most of the parameter space, which provides a powerful framework for optimization in general over-parameterized models and allows SGD-type methods to converge to a global minimum.
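As a toy illustration of the over-parameterized regime (my own sketch, not material from the talk): with more parameters than data points, plain gradient descent on a least-squares problem drives the training loss all the way to zero, i.e. the model interpolates the data. Over-parameterized least squares is convex but degenerate; it satisfies a PL inequality with constant equal to the smallest nonzero eigenvalue of XXᵀ, which is what guarantees the linear convergence seen below. The dimensions, step size, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                      # fewer samples than parameters: over-parameterized
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)         # arbitrary targets -- an interpolating w still exists

w = np.zeros(d)
lr = 0.005                         # step size safely below 2 / lambda_max(X X^T)
for _ in range(2000):
    w -= lr * X.T @ (X @ w - y)    # gradient of 0.5 * ||Xw - y||^2

loss = 0.5 * np.linalg.norm(X @ w - y) ** 2
print(loss)                        # numerically zero: gradient descent finds a global minimum
```

The same convergence-to-interpolation behavior, under the PL condition rather than convexity, is what the talk's second part establishes for genuinely non-convex over-parameterized models.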

While our understanding has significantly grown in the last few years, a key piece of the puzzle remains -- how does optimization align with statistics to form the complete mathematical picture of modern ML?


Short Talks and Posters

Day 2

November 29th

12:00 - 13:15

KT G29

Data Science for the Biological Sciences - Workshop 1

Speakers: Kathleen Lois Foster & Alessandro Maria Selvitella

Abstract: This workshop is aimed at students and practitioners in the biological sciences who are interested in developing coding and data science skills to solve concrete problems emerging in their field of study. By the end of the workshops, participants will have gathered technical and theoretical skills in data science. They will have learned how to install R and RStudio, use R to perform basic statistical analysis of a biological question, and visualize the biological information hidden in the data under study. Furthermore, they will have gained knowledge about the structure of different types of data; descriptive statistics, including the mean, standard deviation, confidence intervals, and probability distributions; how to perform hypothesis tests, including the t-test and ANOVA; and the difference between statistical and biological significance.
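The workshop itself works in R; purely as a language-agnostic illustration of the two-sample t-test mentioned in the abstract, here is the Welch t statistic computed from scratch in Python. The two samples are made-up numbers chosen so the arithmetic comes out cleanly.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)     # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb               # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / se2 ** 0.5
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# hypothetical measurements from two groups of five organisms
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treated = [5.6, 5.4, 5.7, 5.5, 5.8]
t, df = welch_t(control, treated)
print(t, df)    # t = -6.0, df = 8.0 for these values
```

A large |t| relative to a t distribution with `df` degrees of freedom is what signals statistical (though not necessarily biological) significance, the distinction the workshop closes with.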

Slides

R-code

14:00 - 15:00

KT 239

Video

Title: Early detection of fake news on social media

Speaker: Yang Liu

Abstract: The fast and wide spread of fake news on social media can cause severe social harm. Early detection of fake news is crucial to mitigating that harm but remains challenging. Manual fact-checking takes considerable time and yields delayed results, so it is not suited to this task. Machine learning-based detection approaches must instead rely on data that are available and reliable at the early stage of news propagation. This presentation introduces several state-of-the-art fake news early detection approaches based on deep learning and neural networks, including one proposed by the presenter’s team that is based on news propagation path classification.

Short Talks and Posters

Day 3

November 30th

13:00 - 14:00

KT 239

Video

Title: Reliable AI: Successes, Challenges, and Limitations

Speaker: Gitta Kutyniok

Abstract:

Artificial intelligence is currently leading to one breakthrough after the other, both in public life with, for instance, autonomous driving and speech recognition, and in the sciences in areas such as medical diagnostics or molecular dynamics. However, one current major drawback is the lack of reliability of such methodologies.

In this lecture we will first provide an introduction to this vibrant research area, focusing specifically on deep neural networks. We will then survey recent advances, in particular concerning generalization guarantees and explainability. Finally, we will discuss fundamental limitations of deep neural networks and related approaches in terms of computability, which seriously affect their reliability.


18:00 - 19:00
KT 216

Title: To split or not to split, that is the question: From cross validation to debiased machine learning. - Rescheduled to Friday, Dec. 2nd, 10:30am, KT 241

Speaker: Morgane Austern


Short Talks and Posters

Day 4

December 1st

12:00 - 13:15

KT G29

Data Science for the Biological Sciences - Workshop 2

Speakers: Kathleen Lois Foster & Alessandro Maria Selvitella

Abstract: This workshop is aimed at students and practitioners in the biological sciences who are interested in developing coding and data science skills to solve concrete problems emerging in their field of study. By the end of the workshops, participants will have gathered technical and theoretical skills in data science. They will have learned how to install R and RStudio, use R to perform basic statistical analysis of a biological question, and visualize the biological information hidden in the data under study. Furthermore, they will have gained knowledge about the structure of different types of data; descriptive statistics, including the mean, standard deviation, confidence intervals, and probability distributions; how to perform hypothesis tests, including the t-test and ANOVA; and the difference between statistical and biological significance.

Slides

R-code

14:00 - 15:00

KT 239

Data Hub Club

15:00 - 16:00

KT 150

Video

Title: Doing Some Good with Machine Learning

Speaker: Lester Mackey

Abstract:

This is the story of my assorted attempts to do some good with machine learning. Through its telling, I’ll highlight several models of organizing social good efforts, describe half a dozen social good problems that would benefit from our community's attention, and present both resources and challenges for those looking to do some good with ML.

17:00 - 18:00

KT 241

Video

Title: Causal inference in medical records: applications to drug repurposing for dementia

Speaker: Marie-Laure Charpignon

Abstract: Metformin, a diabetes drug with anti-aging cellular responses, has complex actions that may alter dementia onset. Mixed results are emerging from prior observational studies. To address this complexity, we deploy a causal inference approach accounting for the competing risk of death in emulated clinical trials using two distinct electronic health record systems. In intention-to-treat analyses, metformin use associates with lower hazard of all-cause mortality and lower cause-specific hazard of dementia onset, after accounting for prolonged survival, relative to sulfonylureas. In parallel systems pharmacology studies, the expression of two proteins related to Alzheimer’s disease, APOE and SPP1, was suppressed by pharmacologic concentrations of metformin in differentiated human neural cells, relative to a sulfonylurea. Together, our findings suggest that metformin might reduce the risk of dementia in diabetes patients through mechanisms beyond glycemic control, and that SPP1 is a potential biomarker for metformin’s action in the brain. In this talk, I will share ongoing work involving the use of physician prescription preferences as candidate instrumental variables to draw inference about the effects of antidiabetic treatment initiation.

18:00 - 19:30

Virtual

Future of Humanity: AI & Robotics

Look for the documentary in the main hall of Gather "Data" Town!

Short Talks and Posters

Day 5

December 2nd

10:30 - 11:30
KT 241

Title: To split or not to split, that is the question: From cross validation to debiased machine learning.

Speaker: Morgane Austern

Abstract: Data splitting is a ubiquitous method in statistics, with examples ranging from cross validation to cross-fitting. However, despite its prevalence, theoretical guidance regarding its use is still lacking. In this talk we will explore two examples and establish an asymptotic theory for them.

In the first part of this talk, we study the cross-validation method, a ubiquitous method for risk estimation, and establish its asymptotic properties for a large class of models and with an arbitrary number of folds. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds for the cross-validated risk, which enable us to compute asymptotically accurate confidence intervals. Using our results, we study the statistical speed-up offered by cross validation compared to a train-test split procedure. We reveal some surprising behavior of the cross-validated risk and establish the statistically optimal choice for the number of folds.

In the second part of this talk, we study the role of cross fitting in the generalized method of moments with moments that also depend on some auxiliary functions. Recent lines of work show how one can use generic machine learning estimators for these auxiliary problems, while maintaining asymptotic normality and root-n consistency of the target parameter of interest. The literature typically requires that these auxiliary problems are fitted on a separate sample or in a cross-fitting manner. We show that when these auxiliary estimation algorithms satisfy natural leave-one-out stability properties, then sample splitting is not required. This allows for sample re-use, which can be beneficial in moderately sized sample regimes.
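The object analyzed in the first part of the talk, the k-fold cross-validated risk, can be sketched in a few lines. This is my own minimal illustration on synthetic data with ordinary least squares as the learner; the sample sizes, number of folds, and noise level are arbitrary choices, not values from the talk.

```python
import numpy as np

def kfold_cv_risk(X, y, k):
    """Average held-out squared error of OLS over k folds."""
    idx = np.arange(len(y))
    risks = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                    # all indices outside this fold
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = X[fold] @ w - y[fold]                      # held-out residuals
        risks.append(np.mean(resid ** 2))
    return np.mean(risks)

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)    # linear signal plus unit-variance noise

risk = kfold_cv_risk(X, y, k=10)
print(risk)    # concentrates near the irreducible noise level of 1.0
```

The talk's central limit theorem and Berry-Esseen bounds describe exactly how `risk` fluctuates around the true risk as n grows, and hence how wide a confidence interval built from it should be and how the choice of k matters.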


12:00 - 13:30

Helmke Library - LB 440a

Panel Discussion

Connie Kracher

Shannon Bischoff

Alexander Schultz

Jeffrey Sawalha

Francesca Mancini

Brent Hepner

Lindsey Payne

Rachel Ruble

13:30 - 14:00

Helmke Library - LB 440a

Poster and Short Talk Awards