# Group meeting

For spring 2020, the group meeting is organized by Jason Pacheco (pachecoj@cs.arizona.edu) and Ning Hao (nhao@math.arizona.edu). The main theme is sparse structure learning. We meet on **Friday 1-2pm** (up to 2:30pm when necessary) in **ENR2 S375**.

# Spring 2020 Schedule.

**Mar. 13. **No meeting (spring break).

**Mar. 6. **Speaker: Jason Pacheco

Title: Bayesian Approaches for Learning Sparse Gaussian Graphical Models

Abstract: I will present the problem of learning sparse Gaussian graphical model structure in a Bayesian context. I will review the connection between sparsity of the precision matrix and the edge structure of the corresponding Markov random field. I will then introduce the Wishart conjugate prior distribution for the Gaussian precision matrix, followed by the G-Wishart distribution, which is conjugate for a known sparsity pattern. We will then discuss algorithms for computation in these sparse models, including inference and model selection.
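
As a minimal reference point for the conjugacy discussion, the following NumPy sketch (an illustration, not material from the talk) draws from a Wishart prior with integer degrees of freedom via its definition as the scatter matrix of Gaussian vectors; the helper name and parameter values are hypothetical:

```python
import numpy as np

def wishart_sample(df, V, rng):
    """Draw from Wishart(df, V) for integer df >= dim(V), using the
    definition: the scatter matrix of df i.i.d. N(0, V) vectors."""
    L = np.linalg.cholesky(V)
    X = rng.normal(size=(df, V.shape[0])) @ L.T  # rows are N(0, V)
    return X.T @ X

# Sanity check of the prior mean: E[W] = df * V.
rng = np.random.default_rng(4)
V = np.array([[1.0, 0.3], [0.3, 1.0]])
W_bar = np.mean([wishart_sample(50, V, rng) for _ in range(4000)], axis=0)
```

Conjugacy means that, given Gaussian data, the posterior over the precision matrix is again Wishart with updated degrees of freedom and scale; the G-Wishart additionally restricts the support to matrices with a fixed zero pattern.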

**Feb. 28. **Speaker: Mike Hughes (Tufts University)

Location: **ENR2 S395 (Unusual location!)**

Title: Overcoming model misspecification in structured clustering and reinforcement learning with prediction constrained training

Abstract: We present a new optimization objective, prediction constrained (PC) training, for bringing supervised and semi-supervised learning to structured clustering models such as mixture models, topic models, and hidden Markov models for sequential data. We show how PC training helps overcome issues with maximum likelihood that arise when models are inevitably misspecified, and also delivers benefits over several other modern approaches. We will highlight recent work applying these methods to clinical intervention forecasting in the intensive care unit and to batch reinforcement learning for blood pressure management.

**Feb. 21. **No meeting (Rodeo).

**Feb. 14. **Speaker: Ning Hao

Title: Covariance and precision matrix estimation

Abstract: We will review recent methods on high dimensional covariance and precision matrix estimation.
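
As one concrete example of the kind of method under review, here is a minimal NumPy sketch of entrywise soft-thresholding of the sample covariance matrix, a classical estimator for sparse high dimensional covariance; the function name and threshold value are illustrative:

```python
import numpy as np

def threshold_cov(X, tau):
    """Entrywise soft-thresholding of the sample covariance matrix:
    keep the diagonal intact, shrink small off-diagonal entries to zero."""
    S = np.cov(X, rowvar=False)                         # p x p sample covariance
    T = np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)   # soft-threshold entries
    np.fill_diagonal(T, np.diag(S))                     # never threshold variances
    return T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # n = 200 samples, p = 10, true cov = I
Sigma_hat = threshold_cov(X, tau=0.1)
```

With an identity ground truth, most small off-diagonal sampling noise is zeroed out, giving a sparse estimate.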

**Feb. 7. **Speaker: Xueying Tang

Title: An Overview of Bayesian Methods for Learning Sparse Structures

Abstract: In this talk, we will review two classes of priors for modeling sparsity in a Bayesian framework.

**Jan. 31. **Speaker: Ning Hao

Title: An overview of sparse regression (Frequentist approach) Part II

Abstract: We will continue our review of sparse regression and, if time permits, introduce a refitted cross-validation method for variance estimation in high dimensional linear regression.

**Jan. 24. **Speaker: Ning Hao

Title: An overview of sparse regression (Frequentist approach) Part I

Abstract: We will review various methods for fitting high dimensional sparse models and their statistical properties.

**Notice**: We plan to meet every **Friday 1-2pm** (up to 2:30pm when necessary) starting **Jan 24**. The main theme will be **sparse structure learning**. A list of potential topics and a tentative schedule for spring 2020 can be found here: https://docs.google.com/document/d/1UCFhrIS3Y9VI6hXqipACJZHHHQy3Dg1ECVYUg5hWd78/edit?usp=sharing

Email nhao@math.arizona.edu if you want to edit it.

# Fall 2019 Schedule.

**Sep 6. **Speaker: Kwang-Sung Jun

Title: Accelerating discovery rate in adaptive experiments via bandits with low-rank structure.

Abstract: In many applications, a unit of an experiment is to test a pair of objects from two different entity types. For example, imagine the drug discovery application where one has to repeatedly test drug-protein pairs via expensive experiments in order to find those pairs with the desired interaction. How can one maximize the discovery rate with the least experiment cost? In this talk, we show how one can leverage the model structure (low rank) to accelerate the discovery rate for these problems involving two entity types. At the heart of the solution is a new multi-armed bandit formulation called bilinear low-rank bandits, which leverages feature information of the two entities (e.g., drug and protein) in order to maximize the rewards (e.g., discovery rate). We first show that existing linear bandit algorithms can solve the problem via a reduction, but their convergence rate to the optimal strategy is very slow. To improve the convergence, we exploit the low-rank structure of the model -- commonly observed in real-world data -- with a two-stage algorithm that first identifies important subspaces and then invokes linear bandits with a novel use of a data-dependent regularizer to bias the algorithm toward those identified subspaces. The convergence of the proposed algorithm is now sensitive to the rank of the unknown matrix: never worse than linear bandits, and significantly better in some cases. Our result sheds light on solving bandit problems with complex but structured models.

**Sep 13. **Speaker: Jason Pacheco

Title: Probabilistic Reasoning in Complex Systems: Algorithms and Applications

Abstract: Statistical machine learning in scientific applications is complicated by high-dimensional, continuous, and nonlinear interactions that arise in such domains. Consequently, reasoning tasks pose computational challenges that prohibit accurate statistical models in favor of low-fidelity approximations that fail to capture complex global phenomena. Motivated by real-world applications throughout, this talk will explore general-purpose methods for probabilistic inference and decision making in such settings. I will introduce extensions of belief propagation (BP) inference, which utilize diverse hypothesis selection to identify multiple distinct inferences from observed data in stochastic processes of high-dimensional continuous data. I will show how these methods can be flexibly adapted to problems of articulated human pose estimation in images and video, as well as protein structure prediction from low-resolution experimental data.

I will conclude this talk with a vignette of my more recent work on robust and efficient methods for sequential decision making in information gathering systems. Using gene regulatory network inference as an example, I will demonstrate how these algorithms meet or exceed the performance of domain-specific approaches to Bayesian experimental design.

**Sep 20. **Speaker: Xueying Tang

Title: An Introduction to Process Data Analysis

Abstract: Computer simulations have become a popular tool for assessing complex skills such as problem-solving skills. In such computer-based items, the entire human-computer interactive process for each respondent is recorded in log files and is known as the process data. Researchers in education and psychology are paying increasing attention to process data as these data contain substantially richer information about the respondents than traditional item responses. The additional information can help improve the accuracy of educational assessments and cognitive diagnoses. In this talk, I will introduce process data, some recent development in process data analysis, and some future directions.

References: https://arxiv.org/abs/1904.09699, https://arxiv.org/abs/1908.06075

**Sep 27. **Speaker: Chicheng Zhang

Title: Efficient active learning of sparse halfspaces

Abstract: We study the problem of PAC active learning of homogeneous linear classifiers (halfspaces) in R^d, where the goal is to learn a halfspace with low error using as few label queries as possible. Under the extra assumption that there is an s-sparse halfspace that performs well on the data (s << d), we would like our active learning algorithm to be attribute efficient, i.e. to have label requirements sublinear in d. In this talk, we present a computationally efficient algorithm that achieves this goal. Under certain distributional assumptions on the data, our algorithm achieves a label requirement of O(s polylog(d, 1/epsilon)). In contrast, existing algorithms in this setting are either computationally inefficient, or subject to label requirements polynomial in d or 1/epsilon.

References: https://arxiv.org/abs/1805.02350

**Oct 4. **Speaker: Vahan Huroyan

Title: Theoretically Guaranteed Projected Power Method for the Multi-way Matching Problem

Abstract: We propose an iterative algorithm, together with its theoretical analysis, for the multi-way matching problem. The input of the multi-way matching problem includes multiple sets with the same number of objects, along with noisy measurements of fixed one-to-one correspondence maps between the objects of each pair of sets. Given only these noisy measurements, the problem asks to recover the original fixed correspondence maps between all pairs of sets. Our proposed algorithm iteratively solves a non-convex optimization formulation of the multi-way matching problem. We prove that, for a specific noise model, if the initial point of our iterative algorithm is good enough, the algorithm converges linearly to the unique solution. Furthermore, we show how to find such an initial point. Numerical experiments demonstrate that our method is much faster and more accurate than state-of-the-art methods.

**Oct 11. **Speaker: Ruiyang Wu

Title: Quadratic Discriminant Analysis by Projection

Abstract: Discriminant analysis includes a class of simple but powerful classification algorithms. Linear Discriminant Analysis (LDA) is the most commonly used, but it fails when the equal covariance structure assumption is violated. Quadratic Discriminant Analysis (QDA) allows variance heterogeneity, but it is often not as robust. In this talk, I will present a new method incorporating the basic ideas of QDA and linear projection, then compare its performance to other popular classification algorithms.
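
For background, here is a minimal NumPy sketch of plain QDA (per-class Gaussian fits plus the quadratic log-density rule); this is the textbook method, not the projection-based variant of the talk, and all names are illustrative:

```python
import numpy as np

def qda_fit(X, y):
    """Estimate per-class mean, covariance, and prior for the QDA rule."""
    return {c: (np.mean(X[y == c], axis=0),
                np.cov(X[y == c], rowvar=False),
                np.mean(y == c)) for c in np.unique(y)}

def qda_predict(model, x):
    """Assign x to the class maximizing the Gaussian log-density plus log prior."""
    def score(mu, S, pi):
        d = x - mu
        return (-0.5 * np.linalg.slogdet(S)[1]
                - 0.5 * d @ np.linalg.solve(S, d) + np.log(pi))
    return max(model, key=lambda c: score(*model[c]))

# Two classes with equal means but unequal variances: LDA cannot
# separate these, while QDA can, via the quadratic boundary.
rng = np.random.default_rng(5)
X0 = rng.normal(0, 1.0, size=(200, 2))   # class 0: small variance
X1 = rng.normal(0, 3.0, size=(200, 2))   # class 1: large variance
X = np.vstack([X0, X1]); y = np.array([0] * 200 + [1] * 200)
model = qda_fit(X, y)
```

Points near the origin are attributed to the low-variance class, distant points to the high-variance one, illustrating why heterogeneity matters.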

**Oct 18. **Speaker: Yiwen Liu

Title: Trajectory inference using single-cell transcriptomic data

Abstract: Recent advancements in single-cell transcriptomics have brought new opportunities and challenges for studying the dynamic processes of cell development. For many biological systems, such as developmental disorders and pathologies, there are no clear distinctions between cellular states. Cells in those systems change status through gradual transcriptional changes. Trajectory inference methods aim to computationally infer the order of those cells along the underlying developmental path in an unsupervised manner. In this talk, I will introduce the single-cell trajectory inference problem and discuss some recent studies as well as several future developments in trajectory inference.

**Oct 25. **Speaker: Chi-kwan Chan

Title: Imaging the Supermassive Black Hole at the Center of the M87 Galaxy: A Data Analysis Perspective

Abstract: The Event Horizon Telescope experiment recently revealed the first image of a black hole---the supermassive black hole at the center of the M87 Galaxy. I will provide theoretical background of black holes, give an overview of the observation and data processing procedure, and explain how this image is used to measure the mass of the black hole and to test Einstein's general theory of relativity. I will put special emphasis on the statistical tools used in calibration and error analysis, which helped to make this image possible.

**Nov 1. **Speaker: Christina Duron

Title: Network Data Analysis Techniques on Sets of DESeq and RNAseq Expression Data

Abstract: Conventional differential expression analyses have been successfully employed to identify genes whose levels change across experimental conditions. Yet one limitation of this approach is the inability to discover central regulators that control gene expression networks. After providing an overview of previous analysis techniques applied to DESeq and RNAseq expression data, I will discuss a methodology that leverages betweenness centrality network analysis and RNAseq data. This network analysis approach serves as a valuable tool for identifying central genes unique to a tumor ecosystem.

**Nov 8. **Speaker: Alon Efrat

Title: Are Friends of My Friends Too Social? Limitations of Location Privacy in a Socially-Connected World

Abstract: Location-Based Services (LBS) such as Uber and pizza delivery offer multiple benefits from having users share their locations at certain times. However, side information might leak and reveal users' (exact or approximate) locations at other times as well. We will describe this phenomenon and show how to use it when localization accuracy is desired (e.g., in a GPS-jammed environment). On the other hand, when this phenomenon results in an undesired loss of privacy, we will show ways to quantify the lost privacy and propose how to protect the privacy of those who seek it, without degrading the quality of the data needed for successful LBS.

**Nov 15. **Speaker: Xiaoxiao Sun

Title: Theory informs practice: smoothing parameters selection for smoothing spline ANOVA models in large samples.

Abstract: Large samples have been generated routinely from various sources. Classic statistical models, such as smoothing spline ANOVA models, are not well equipped to analyze such large samples due to expensive computational costs. In particular, the daunting computational costs of selecting smoothing parameters render smoothing spline ANOVA models impractical. In this talk, I will present an asympirical, i.e., asymptotic and empirical, smoothing parameters selection approach for smoothing spline ANOVA models in large samples. The idea of this approach is to use asymptotic analysis to show that the optimal smoothing parameter is a polynomial function of the sample size and an unknown constant. The unknown constant is then estimated through empirical subsample extrapolation. The proposed method can significantly reduce the computational cost of selecting smoothing parameters in high-dimensional and large samples. We show that the smoothing parameters chosen by the proposed method tend to the optimal smoothing parameters that minimize a specific risk function. In addition, the estimator based on the proposed smoothing parameters achieves the optimal convergence rate. Extensive simulation studies demonstrate the numerical advantage of the proposed method over competing methods in terms of relative efficiencies and running time. An application to molecular dynamics data with nearly one million observations shows that the proposed method has the best prediction performance.

**Nov 22. **Speaker: Lingling An

Title: A Neural-network Based Imputation for Single-cell RNA-seq Data

Abstract: Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at a cellular resolution. However, noise due to amplification and dropout may obstruct downstream analyses. The existing imputation methods cannot handle data with extremely high sparsity well. We propose a deep learning approach that employs an autoencoder architecture to impute the dropouts in scRNA-seq data with high sparsity. Through comprehensive simulation studies we demonstrate that the new method outperforms the existing approaches in denoising scRNA-seq data.

**Nov 29. **No meeting (Thanksgiving).

# Spring 2019 Schedule.

**May 1. **Speaker: Helen Zhang

Title: Tensor Regression and Regularization

Abstract: I will present the recent paper of Zhou, Li, and Zhu (2013), which proposes a new family of tensor regression models for high dimensional data analysis. Both theory and scalable estimation algorithms will be discussed. The method is then illustrated with applications to MRI imaging data.

Reference:

1. Zhou, H., Li, L., and Zhu, H. (2013) Tensor Regression with Applications in Neuroimaging Data Analysis. JASA, 108, 540-552.

**April 24. **Speaker: Ruiyang Wu

Title: A fast algorithm for matrix balancing

Abstract: In regard to the Hi-C data normalization problem, I will introduce a fast matrix balancing algorithm based on Newton's method and compare it to the older but widely used Sinkhorn-Knopp algorithm.

Reference: Knight, P. A. & Ruiz, D. (2012). A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis.
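
For context, here is a minimal NumPy sketch of the classical Sinkhorn-Knopp fixed-point iteration that the Knight-Ruiz method accelerates with an inexact Newton scheme; iteration counts and tolerances are illustrative:

```python
import numpy as np

def sinkhorn_knopp(A, n_iter=1000, tol=1e-10):
    """Balance a nonnegative matrix A so that D1 @ A @ D2 is doubly
    stochastic, by alternately rescaling rows and columns. Converges
    for matrices with total support (e.g., strictly positive A)."""
    r = np.ones(A.shape[0])
    for _ in range(n_iter):
        c = 1.0 / (A.T @ r)        # column scaling factors
        r_new = 1.0 / (A @ c)      # row scaling factors
        if np.max(np.abs(r_new - r)) < tol:
            r = r_new
            break
        r = r_new
    return np.diag(r) @ A @ np.diag(c)

rng = np.random.default_rng(1)
A = rng.random((5, 5)) + 0.1       # strictly positive, so iteration converges
B = sinkhorn_knopp(A)              # row and column sums of B are all ~1
```

This fixed-point scheme converges only linearly, which is the motivation for the Newton-based Knight-Ruiz algorithm on large Hi-C contact matrices.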

**April 17. **Speaker: Yiwen Liu

Title: Adaptive dimension reduction for high dimensional data

Abstract: For high dimensional data clustering, conventional algorithms such as K-means often suffer from the local minima problem and the curse of dimensionality. To address the first problem, various initialization methods have been proposed, but with limited success, while dimension reduction techniques are often adopted as a preprocessing step to tackle the second. However, since the dimension reduction subspace is selected beforehand and fixed during the clustering process, it may deviate from the optimal one. It has become increasingly clear that the clustering process can be coupled with the subspace selection process: the data are clustered while the dimension reduction subspaces are selected adaptively. In this talk, I will review current work on adaptive dimension reduction and address its challenges and limitations. These methods provide a rich and flexible framework for further exploration.

**April 10. **U2 can UQ workshop

**April 3. **Speaker: Ning Hao

Title: Introduction to Hi-C Data II

Abstract: We will continue the introduction to Hi-C data. In particular, I will introduce a non-parametric model and related statistical problems.

**March 27. **Speaker: Xiaoxiao Sun

Title: Supervised Functional Principal Component Analysis

Abstract: Functional principal component analysis (FPCA) is an important tool for dimension reduction in functional data analysis. In this talk, I will introduce supervised FPCA approaches which make use of response information. Compared to the classical FPCA methods, the supervised ones can incorporate supervision information to recover more interpretable underlying structures.

**March 20. **Speaker: Amy Kim and Faryad Sahneh

Title: Automatic Feature Selection for High-Dimensional Climate Data in Hurricane Predictive Modeling Part 2

Abstract: Modern machine learning techniques for sparse regression models, such as the Lasso and Elastic Net, can improve model performance and make more accurate predictions by choosing relevant features. When these methods are applied to screen high dimensional climate data to predict the number of Atlantic hurricanes, we are faced with two challenges. First, the standard Lasso (or Elastic Net) regression does not explicitly account for the strong spatial and temporal correlations in climatic data. This challenge can be addressed by utilizing the “fused Lasso.” However, most existing fused Lasso techniques are not developed for generalized linear models such as Poisson regression, which is well suited for representing the number of hurricane incidences. Second, even in the case of the linear fused Lasso, the requirements on the row/column ranks of the underlying fusion matrix are rather restrictive, and in particular not suitable for our problem at hand.

In this talk, we show that directly formulating the problem as a generalized linear model with a generic fusion matrix leads to a convex optimization problem. Specifically, we show that the resulting objective function can be expressed in CVXPY, a Python-embedded modeling language for efficiently solving convex optimization problems. Finally, we conclude by demonstrating how machine learning techniques for sparse regression models can perform feature selection for our hurricane prediction problem.

References:

[1] Agrawal, A., Verschueren, R., Diamond, S., & Boyd, S. (2018). A rewriting system for convex optimization problems. *Journal of Control and Decision*, *5*(1), 42-60.

[2] Diamond, S., & Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. *The Journal of Machine Learning Research*, *17*(1), 2909-2913.

**March 13. https://www.math.arizona.edu/events/46**

**March 6. **No meeting (Spring break).

**February 27. **Speaker: Chi-Kwan Chan

Title: Review of PLAsTiCC Top Solutions

Abstract: The Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC) is an online competition for finding accurate machine learning algorithms to classify astronomical objects based on time series data. The winner of the challenge, Kyle Boone, and several top Kagglers have posted notes and/or source code of their solutions. In this informal talk, I will give a summary of their solutions and go through some of their data analysis code. I will provide a list of lessons learned and discuss with the audience how we can improve in future Kaggle challenges.

**February 20. **Speaker: Amy Kim

Title: Cancer prevention & control meet statistical machine learning: karyometric studies

Abstract: Karyometric studies based on computer analysis of high resolution images of nuclei provide a useful tool for early detection and prevention of cancer. We want to derive an objective characterization of nuclei from the karyometric features of cancer and normal cells, apply modern statistical machine learning algorithms to increase the discrimination ability with smaller errors, and develop a new method to effectively select the features that best discriminate cancer cells. In this talk, we will first introduce one of the previous studies (Bartels et al., 2012) and reproduce its results to understand the original analysis procedures. Then, we will discuss the main challenges of this study, possible solutions, and directions for future work.

References:

[1] Anderson, N., Houghton, J., Kirk, S. J., Frank, D., Ranger-Moore, J., Alberts, D. S., ... & Bartels, P. H. (2003). Malignancy-associated changes in lactiferous duct epithelium. *Analytical and quantitative cytology and histology*, *25*(2), 63-72.

[2] Bartels, P. H., Garcia, F. A., Trimble, C. L., Kauderer, J., Curtin, J., Lim, P. C., ... & Bartels, H. G. (2012). Karyometry in atypical endometrial hyperplasia: A Gynecologic Oncology Group study. *Gynecologic oncology*, *125*(1), 129-135.

**February 13. **Speaker 1: Kyungmi Chung (45 min)

Title: Introduction to a Group-Specific Recommender System

Abstract: A recommender system is an algorithm that predicts user responses to options. In this talk, we will introduce recommender systems under the latent factorization framework. The main reference is Bi et al. *A Group-Specific Recommender System*. (2017) JASA.

Speaker 2: Bjorn Wastvedt (15 min)

Title: Stylometry in Aristotle’s *Ethics*

Abstract: Aristotle most likely wrote the material contained in two ethical works, *Nicomachean Ethics* (NE) and the *Eudemian Ethics* (EE), though their current status – two publications, each between two covers, packaged and titled separately – very likely does not reflect any intention of Aristotle’s (Barnes 1997). Nevertheless the books (the “chapters”) unique to each treatise do cohere enough for us to usefully label them as parts of their respective works. Aside from these “special books” (so-called because they are unique to their respective treatises) are three “common books,” printed both in *NE* manuscripts and in *EE* manuscripts. The question of the “original” provenance of the common books has long been a favorite of those interested in the *Eudemian Ethics*; in this paper I address that question, the current answers to it, and what we may say about it today.

Interest in Aristotle’s less well-known ethical treatise is undoubtedly growing, but statistical discussion of the status of the work in Aristotle’s corpus remains surprisingly thin. In 1978, Anthony Kenny's *The Aristotelian Ethics* influentially answered two problems regarding the *EE*: its philosophical value (on par with the *NE*) and the original home of the common books (the *EE*). Kenny has since published addenda to his work, in 1992 and 2016, but even in the 2016 edition of his study he does not expand significantly on his original statistical analyses. Since 1978, without exception (so far as I know), no one has updated Kenny’s statistical inquiry: the vast majority of inquiries into the problem of the common books focus their attention on philosophical arguments, not statistical ones, and those few that address Kenny’s statistical tests are critical rather than constructive. I aim to bring contemporary stylometric methods to the problem of the common books.

**February 6. **Speaker: Marina Kiseleva and Yujing Qin

Title: Predicting Astronomical Transients for LSST

Abstract: Each night starting in 2023, the Large Synoptic Survey Telescope (LSST) will send out alerts for 10 million astrophysical transients, including supernovae, gamma ray bursts, and the tidal disruption of stars by the central, supermassive black holes in galaxies. Due to the high volume, most of these transients will fade away before they can even be classified. Our vision is to predict what each transient is based on the host galaxy’s properties. Thus, even before it occurs, we may predict the most likely transient to occur in any particular galaxy. We explore machine learning algorithms on the largest, most complete database of known transients and corresponding host galaxy features. Currently, we are exploring Bayesian models and decision trees, trying to determine which modeling technique best suits our needs. Our challenges include disparate datasets, feature measurement errors, severe class imbalance, and redshift bias. Furthermore, LSST will detect transients at different rates from previous sky surveys, thus requiring us to mold our data and problem to these unprecedented rates.

**January 30. **Speaker: Selena Niu

Title: Introduction to Hi-C Data I

Abstract: Hi-C is a genome-wide sequencing technique which is used to study the 3D chromatin structure inside the nucleus. In this talk, we will briefly introduce Hi-C experiments and data, and related scientific problems. The main reference is Lieberman-Aiden et al. *Comprehensive mapping of long-range interactions reveals folding principles of the human genome.* (2009) Science.

**January 23.** (3:15-4:00pm ENR2 S395, unusual time and location) Organization meeting.

# Fall 2018 Schedule.

**December 6.** Discussion: PLAsTiCC Astronomical Classification

**November 29.** Discussion: PLAsTiCC Astronomical Classification

**November 22. **No meeting (Thanksgiving).

**November 15.** Discussion: PLAsTiCC Astronomical Classification

**November 8.** Discussion: PLAsTiCC Astronomical Classification

**November 5. **Statistics GIDP Colloquium, see http://math.arizona.edu/events/9295

**November 1.** Discussion: PLAsTiCC Astronomical Classification

**October 25. **Speaker: Ning Hao

Title: Non-sparse modeling (continued)

**October 18. **No meeting this week.

**October 11. **Speaker: Ning Hao

Title: Non-sparse modeling

Abstract: Sparse modeling has been popular in the last 20 years. Recently, some researchers have observed that sparse models may not always be realistic in applications. Nonetheless, non-sparse high dimensional models have rarely been studied. We will discuss some alternative strategies and related open problems.

**October 4.** Speaker: Keaton Hamm.

Title: CUR Decompositions and Subspace Clustering

Abstract: The subspace clustering problem seeks to cluster data in a high-dimensional space that is drawn from a union of much smaller dimensional subspaces. One method of attack for this problem is to find a similarity matrix from the data which identifies the clusters. This talk will discuss an intriguing matrix decomposition method called the CUR decomposition, and describe how this decomposition gives many new similarity matrices for data which solve the clustering problem. An algorithm based on the proposed theory that is capable of handling noisy data will be presented, and it is shown that the algorithm achieves the best classification error to date on the Hopkins155 motion data set.
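
As a minimal illustration of the decomposition itself (not the noisy-data algorithm of the talk), the following NumPy sketch uses the fact that A = C U R holds exactly when the selected intersection block has the same rank as A and U is its pseudoinverse; the chosen indices are illustrative:

```python
import numpy as np

def cur_decomposition(A, cols, rows):
    """A simple CUR sketch: A ~ C @ U @ R, where C and R are actual
    columns and rows of A, and U is the pseudoinverse of their
    intersection block. Exact when rank(A[rows][:, cols]) = rank(A)."""
    C = A[:, cols]
    R = A[rows, :]
    U = np.linalg.pinv(A[np.ix_(rows, cols)])
    return C, U, R

# Rank-2 matrix: generically, any 2 columns and 2 rows recover it exactly.
rng = np.random.default_rng(2)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
C, U, R = cur_decomposition(A, cols=[0, 1], rows=[0, 1])
err = np.linalg.norm(A - C @ U @ R)
```

Because C and R consist of actual columns and rows of the data, the factors inherit interpretability (and, for clustering, the data's subspace structure) that SVD factors lack.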

**September 27.** Speaker: Amy Kim.

Title: The Overview of the Generalized Lasso and Other Lasso Type Problems (Slides)

Abstract: Different types of Lasso problems, and algorithms to solve them, have been extensively studied over many years. Depending on the structure of the data, the dimensions of the design matrices, or the distributional assumptions, several different methods that have stemmed from the Lasso can be used. We will explore different types of Lasso problems, focusing especially on the generalized lasso and the fused lasso.
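
As a baseline for the variants discussed, here is a minimal NumPy sketch of the standard Lasso solved by cyclic coordinate descent with soft-thresholding; the penalty level and iteration count are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso:
        minimize (1/2n) ||y - X b||^2 + lam * ||b||_1
    Each coordinate update is a one-dimensional soft-thresholding step."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual without b_j
            rho = X[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

# Sparse ground truth: only the first 2 of 10 coefficients are nonzero.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
beta = np.zeros(10); beta[:2] = [3.0, -2.0]
y = X @ beta + 0.1 * rng.normal(size=100)
b_hat = lasso_cd(X, y, lam=0.1)
```

The generalized lasso replaces ||b||_1 with ||D b||_1 for a penalty matrix D (the fused lasso takes D to be a difference operator), which breaks the coordinate-wise separability this simple solver relies on.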

**September 20.** Speaker: Xiaoxiao Sun.

Title: Introduction to Single-Cell RNA Sequencing (Slides)

Abstract: As a new sequencing technology, single-cell RNA sequencing (scRNA-seq) measures the distribution of expression levels for individual cells. It allows researchers to study new biological questions, such as cell type identification and the stochasticity of gene expression. In this talk, I will first introduce RNA sequencing (RNA-seq) and tasks in RNA-seq data analysis. Second, I will introduce the workflow and the major differences between RNA-seq and single-cell RNA-seq. Third, I will discuss three major problems in scRNA-seq data analysis: systematic bias, cell-cycle effects, and pseudotime.

**September 13.** Speaker: Faryad Sahneh.

Title: Automatic Feature Selection for High-Dimensional Climate Data in Hurricane Predictive Modeling

Abstract: Climatic data are temporal and spatial, and are extremely high dimensional in nature. Feature selection for predictive modeling is cumbersome and requires deep understanding of the underlying physics of the phenomenon under study. This talk proposes how modern machine learning techniques for sparse regression models, such as the Lasso and Elastic Net, can augment predictive modeling efforts, showing improved performance for feature selection in climatic data.