Learning Seminar in Fundamentals of Data Analysis

In our learning seminar, we will work together to learn a broad range of fundamental topics in modern data science and its applications. Tentative topics include probability on graphs, continuous optimization, interactive machine learning, and randomized linear algebra.

Organizers: Wenjian Liu, Fei Ye

Time: Tuesday, 11:45am-12:45pm

Location (in-person and hybrid): GC: 4214-03

Zoom Meeting ID: 5977829609

Much of the material covered can be found in the following excellent texts (conveniently available online):

Random Graphs and Complex Networks. Vol. I by van der Hofstad
Probability on Trees and Networks by Lyons and Peres
Markov Chains and Mixing Times by Levin, Peres and Wilmer
Concentration-of-measure Inequalities by Lugosi
High-Dimensional Probability: An Introduction with Applications in Data Science by Vershynin

Current Schedule (Spring 2025)

May 6 (online only)

Ning Ning (Texas A&M University)

Title: Convergence of Dirichlet forms for MCMC optimal scaling with dependent target distributions on large graphs

Abstract: Markov chain Monte Carlo (MCMC) algorithms have played a significant role in statistics, physics, machine learning, and other fields, and they are the only known general and efficient approach for some high-dimensional problems. The random walk Metropolis (RWM) algorithm, the most classical MCMC algorithm, has had a great influence on the development and practice of science and engineering. The behavior of the RWM algorithm in high-dimensional problems is typically investigated through a weak convergence result for diffusion processes. In this paper, we use the Mosco convergence of Dirichlet forms to analyze the RWM algorithm on large graphs, whose target distribution is the Gibbs measure that includes any probability measure satisfying a Markov property. The abstract and powerful theory of Dirichlet forms allows us to work directly and naturally on the infinite-dimensional space, and our notion of Mosco convergence allows Dirichlet forms associated with the RWM chains to lie on changing Hilbert spaces. Through the optimal scaling problem, we demonstrate the strengths of the Dirichlet form approach over the standard diffusion approach.

April 29

Wenjian Liu (Queensborough Community College, CUNY)

Title: Summarizing and Exploring Data: Key Concepts in Data Analysis

Abstract: In today's data-driven world, effectively summarizing and visualizing data is essential for informed decision-making. This presentation provides an overview of fundamental techniques in data analysis, including different types of variables, measures of central tendency and dispersion, and graphical methods such as histograms and box plots. Through practical examples, we will explore how to interpret data distributions and identify key patterns, ensuring a deeper understanding of statistical summaries. This session will equip you with essential tools for data exploration and analysis.

April 22

Wenjian Liu (Queensborough Community College, CUNY)

Title: Inference for Two Proportions and Matched Pair Data

Abstract: This presentation focuses on statistical methods for comparing two proportions across independent or matched samples. We explore key techniques including Fisher’s Exact Test for small sample sizes, large-sample Z-tests, and chi-squared tests for two-by-two tables. Special attention is given to paired data analysis using McNemar’s test, highlighting how matched designs impact inference strategies. Through real-world examples such as clinical trials and voter preference studies, participants will learn how to select appropriate tests, compute confidence intervals, and interpret results accurately. This session is ideal for anyone working with categorical data and interested in rigorous hypothesis testing methods.

April 8

Wenjian Liu (Queensborough Community College, CUNY)

Title: Modeling Recurrent Events with Inverse-Gaussian Frailty

Abstract: This presentation explores advanced techniques for modeling recurrent event data using a semiparametric approach based on the inverse-Gaussian frailty model. We introduce the multiplicative intensity framework for recurrent events, where individual heterogeneity is captured through a frailty term, allowing for flexible and realistic modeling of event times. The talk outlines the derivation of the full likelihood, discusses estimation strategies, and examines the asymptotic properties of the estimators. To ground the theory in practice, we present an application to a bladder cancer recurrence dataset, demonstrating how the proposed model can reveal meaningful insights in medical research. This session will be of interest to those working with longitudinal or event-based data, particularly in biostatistics and reliability engineering.

April 1

Wenjian Liu (Queensborough Community College, CUNY)

Title: Statistical Inference in Random Graphs and Community Detection

Abstract: This presentation delves into modern statistical methods for analyzing and making inferences in network data, with a special focus on random graph models such as the Stochastic Block Model (SBM). We will explore how communities within networks can be identified using maximum likelihood estimation, spectral clustering, and other advanced techniques. Special attention is given to the behavior of estimators, identifiability issues, and the theoretical guarantees of consistency and asymptotic properties. The session combines mathematical rigor with practical insights, including simulation studies that illustrate the performance of various methods under different conditions. This talk is ideal for those interested in statistical learning, network science, and high-dimensional inference.

March 25

Yanqiu Guo (Brown University)

Title: Practical Approaches to Statistical Inference

Abstract: This presentation provides a comprehensive look at statistical inference techniques, focusing on estimation methods, hypothesis testing, and confidence intervals. We will discuss how to interpret sample data effectively and make data-driven decisions with statistical rigor. Key topics include the role of variability, the impact of sample size, and applications in different fields such as business, healthcare, and engineering. Whether you are new to statistics or looking to refine your analytical skills, this session will equip you with essential tools for making informed conclusions.

March 18

Wenjian Liu (Queensborough Community College, CUNY)

Title: Statistical Inference and Sample Size Determination

Abstract: This presentation covers key statistical inference techniques, focusing on hypothesis testing, confidence intervals, and sample size determination for population means and variances. We will explore power calculations, Type I and Type II errors, and strategies for ensuring reliable statistical conclusions. Additionally, we will discuss practical applications, including quality control, clinical trials, and business decision-making.

March 11

Wenjian Liu (Queensborough Community College, CUNY)

Title: Statistical Inference for Single Samples

Abstract: This presentation explores statistical inference techniques for analyzing single samples. Topics include confidence intervals, hypothesis testing, and methods for estimating population parameters when variance is known or unknown. We will discuss the role of the Central Limit Theorem in inference, Slutsky’s Theorem, and how different sample sizes impact statistical conclusions. Additionally, we will introduce prediction and tolerance intervals for estimating future observations. Whether you are interested in quality control, experimental analysis, or data-driven decision-making, this session provides essential tools for statistical reasoning.

March 4

Wenjian Liu (Queensborough Community College, CUNY)

Title: Foundations of Statistical Inference and Sampling Distributions

Abstract: How can we make reliable conclusions about an entire population using only a sample? This presentation delves into the principles of statistical inference, covering key topics such as point estimation, confidence intervals, and hypothesis testing. We will explore sampling techniques, the behavior of sample statistics, and methods for evaluating estimators. Through real-world applications and theoretical discussions, this session will provide valuable insights into how data-driven conclusions are made with confidence.

February 25

Wenjian Liu (Queensborough Community College, CUNY)

Title: Understanding Statistical Inference and Data Analysis

Abstract: This presentation explores key concepts in statistical inference, focusing on confidence intervals, hypothesis testing, and prediction intervals. We will discuss essential techniques for drawing conclusions from sample data, including methods for estimating population means and variances. Using practical examples, we will demonstrate how confidence levels, p-values, and power calculations impact decision-making. This session provides a comprehensive guide to statistical inference for single samples.

Fall 2024 Schedule

November 7

Yanqiu Guo (Brown University)

Title: Foundations of Bayesian Inference: From Theory to Real-World Applications

Abstract: Bayesian inference offers a powerful framework for understanding and making predictions in the face of uncertainty by combining prior knowledge with observed data. This talk provides an accessible introduction to Bayesian methods, focusing on fundamental principles such as Bayes' theorem, prior and posterior distributions, and the concept of updating beliefs with evidence. We will explore practical applications across various fields, from data science and machine learning to scientific research, highlighting how Bayesian approaches can enhance decision-making and model interpretation.

November 14

Wenjian Liu (Queensborough Community College, CUNY)

Title: Introduction to Probabilistic Inference in Graphical Models

Abstract: We describe a few discrete probability models to which we will return repeatedly throughout the seminar. Our focus is primarily on graph-based processes, including percolation, random graphs, the Ising model, and random walks on graphs. After a brief review of graph basics and Markov chain theory, we formally introduce Bayesian inference on graphical models.

November 21 (online)

Wenjian Liu (Queensborough Community College, CUNY)

Title: Martingales and Potential Theory: Applications in Stochastic Processes

Abstract: This talk reviews stopping times and basic martingale results and examines their utility in deriving advanced concentration inequalities, with applications in random graphs and machine learning. Additionally, we introduce potential theory and its relation to Markov chains, including the analysis of hitting times and recurrence properties. This session aims to provide a robust framework for leveraging martingales and potentials in research and applications.

December 5

Wenxue Li (Columbia University)

Title: Diffusion Models in Data Reconstruction: From Theory to Applications

Abstract: Diffusion models have emerged as a powerful new family of deep generative models with record-breaking performance in reconstructing complex data from noisy or corrupted inputs. This seminar will provide an introduction to diffusion models, highlighting their mathematical foundations and exploring three predominant formulations of the models: Denoising Diffusion Probabilistic Models (DDPMs), Score-Based Generative Models (SGMs), and Stochastic Differential Equations (Score SDEs). Particular attention will be given to their statistical properties and sampling methodologies. Much of the existing research has focused on continuous data, and diffusion models have fallen short on discrete data domains. A novel loss, score entropy, naturally extends score matching to discrete spaces and integrates seamlessly into the construction of discrete diffusion models. Inspired by this idea, our team is actively developing algorithms for permutation-sensitive datasets, such as rankings, to help bridge the gap. The seminar will conclude with a discussion of further directions for advancing diffusion models.

December 12

Wenjian Liu (Queensborough Community College, CUNY)

Title: Moments and Concentration Inequalities: Applications in Data Science

Abstract: This presentation focuses on the role of moments in understanding the tail behavior of random variables and introduces concentration inequalities as powerful tools to quantify deviations from the mean. We begin with foundational inequalities—Markov's, Chebyshev's, and Chernoff-Cramér bounds—and extend to techniques for handling sub-Gaussian and sub-exponential variables. Using methods such as the probabilistic approach, first and second moment principles, and chaining, we demonstrate how concentration results can dismiss rare "bad events" and achieve sharp probabilistic guarantees. Classical examples include random k-SAT thresholds, percolation theory, and Erdős-Rényi graph properties, while modern applications in sparse recovery and empirical risk minimization illustrate their relevance to data science. By bridging inequalities, moments, and tails, this talk highlights the central role of concentration phenomena in probability, combinatorics, and learning theory.
