MATH 5472. Computer-Age Statistical Inference and its applications

Synopsis

This course is designed for first-year PhD students in applied mathematics, statistics, and engineering who are interested in learning from data. It covers advanced topics in statistical machine learning, with an emphasis on integrating statistical models and algorithms for statistical inference. The course first makes connections among classical topics, then moves on to modern topics, including a statistical view of deep learning. Various applications will be discussed, such as computer vision, human genetics, and text mining.


Note: On the one hand, this course can be challenging for some non-math students, as some homework requires mathematical derivation. On the other hand, it can be challenging for some math students, as it requires coding. If you are still interested, then let's suffer to learn! Of course, students are also welcome to audit the course.

Lecture information

Tuesday and Thursday, 03:00PM - 04:20PM, Room 5506 (Lifts 25-26), Main Academic Building, HKUST.

Reference books

Grading policy: Assignments (60%) + Project (40%)

Assignments (60%)

Assignment 1 [pdf]

Assignment 2 [pdf]

Assignment 3 [pdf]

Assignment 4 [pdf]

Project (40%)

For this project, you will choose one paper from the "Project list" below. The purpose of the project is to learn to critically read and discuss papers in statistics and machine learning. These papers may be new and potentially influential works, or older important works that you may not have seen in other classes. Please inform the instructor once you decide which paper to work on. No more than two students may work on the same paper. Topics are assigned on a "first come, first served" basis.

Requirement: each student needs to submit a report on his/her chosen paper. Rough format: an overview of the paper; simulations or examples illustrating the paper's key results (based on your own implementation); and a summary of the main points. Your report should also include a GitHub link to your code so that your results can be easily reproduced. Aim for 6-10 pages. Click here for the LaTeX template. You can check the scribed notes of the journal club at CMU and use them as an example when preparing your own report.

Remark: There will be a grading discount if you use an existing implementation to reproduce the key results. You must make an explicit statement in your report if you use an existing implementation. Due to the university's regulations on academic integrity, there will be a substantial penalty if you do not make such a statement.

Deadline: Dec. 15, 2023

Please choose your project here [link]

Project list: 

1. Weighted Low Rank Matrix Approximation and Acceleration. 

2. Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.

3. Flexible signal denoising via flexible empirical Bayes shrinkage. Journal of Machine Learning Research 22(93): 1-28.

4. Empirical Bayes estimation of normal means, accounting for uncertainty in estimated standard errors. arXiv:1901.10679.

5. False discovery rates: a new deal. Biostatistics 18(2): 275-294.

6. Finding scientific topics. PNAS [link] (Gibbs sampling for topic models)

7. Non-negative matrix factorization algorithms greatly improve topic model fits. arXiv:2105.13440.

8. Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse. [link]

9. varbvs: fast variable selection for large-scale regression. arXiv:1709.06597.

10. Empirical Bayes matrix factorization. Journal of Machine Learning Research 22(120): 1-40.

11. Variational Inference for Latent Variables and Uncertain Inputs in Gaussian Processes. Journal of Machine Learning Research. [link]

12. Maximum Likelihood for Gaussian Process Classification and Generalized Linear Mixed Models under Case-Control Sampling. Journal of Machine Learning Research. [link]

13. The Implicit Regularization of Stochastic Gradient Flow for Least Squares. International Conference on Machine Learning, 2020.

14. Generalizing RNA velocity to transient cell states through dynamical modelling. [link]

15. SPICEMIX enables integrative single-cell spatial modeling of cell identity [link]

16. ebnm: an R package for solving the empirical Bayes normal means problem using a variety of prior families. arXiv:2110.00152.

17. Latent Dirichlet Allocation. [link]

18. Gaussian Process Boosting. [link]

19. Diffusion Posterior Sampling for General Noisy Inverse Problems. [link]

20. Construction of a 3D whole organism spatial atlas by joint modelling of multiple slices with deep neural networks. [link]

21. XMAP: Cross-population fine-mapping by leveraging genetic diversity and accounting for confounding bias. [link]