DSC214
Topological Data Analysis
Spring' 2025, Tue/Thu: 8:00am -- 9:20am
Location: Central Hall 222
Spring' 2025, Tue/Thu: 8:00am -- 9:20am
Location: Central Hall 222
Course Description
Topological data analysis: algorithm and applications
Topological methods provide powerful tools for characterizing structures/features behind data and analyzing diverse complex data (images, graphs, point sets, etc). This course introduces basic concepts and topological structures, as well as recent topological tools, algorithms, as well as examples to applications. Some topics include: basics in topology, simplicial complexes to model data, persistent homology, discrete Morse theory, topology inference, the Mapper methodology, hierarchical clustering, and integration of topological methods with machine learning.
We will be using the campuswire as a platform for students to connect to each other, and share references / insights etc. You should have received an email inviting you to campuswire, and you can also join by an access code available in Canvas.
Prerequisite: Linear algebra, algorithms (DSC40B or equivalent)
Class time: Sp'2025, Tue/Thu 8:00am -- 9:20pm PDT.
Office Hour: Thu 9:30am -- 10:20am PDT (HDSI 446), or by appointment (yusuwang@ucsd.edu ).
Tentative schedule: download here
Reference books:
Much of the material is from a up-coming book Computational Topology for Data Analysis co-authored by Dr. T. K. Dey and myself, and you can download an e-version of the book here (for academic use), or on my website.
I will distribute (in Canvas) class notes and slides throughout the course.
Other references that might be useful:
Computational Topology: An Introduction, by H. Edelsbrunner and J. Harer, AMS Press, 2009.
Algebraic Topology, by A. Hatcher, Cambridge University Press, 2002. (Online version is available at the author's webpage.)
An Introduction to Morse Theory, by Y. Matsumoto, Amer. Math. Soc., Providence, Rhode Island, 2002.
Elements of Algebraic Topology, by J. R. Munkres, Perseus, Cambridge, Massachusetts, 1984.
Grading policy:
There is one project. The final grade is based on: Project / survey (proposal, project, report, short presentation with Q&A): 100%.
Project requirements:
Each student is required to meet briefly with me around 3rd-4th week, so that we can discuss to find a good potential topic for your final project. You can choose to do a project or a survey -- however, note that the bar for survey is high: it has to be reasonable broad, and there has to be new insights / reflections in addition to aggregating existing work.
New!: To help to develop your project, a list of potential topics is given here -- note that this is only meant to serve as a starter, to help you develop your own project.
After a topic is chosen, you are required to submit a pre-proposal about what you intend to work on (at least one page without reference).
After you finish your project, you will submit a final report (7+ pages excluding reference, single spacing, font New Times Roman 11-point or equivalent), which should clearly describe the motivation of your project, what you have done, and your findings. A final presentation is also needed. A survey needs to be 15+ pages (single spacing, same font requirement as above). Important: I will be asking questions related to the projects as well as concepts from the class after your presentation. Your grade will be partially based on your answer to my questions as well.
In addition to the final project report, you also need to submit all your code and a simple documentation in github repository shared with me.
(NEW) Important: Instructions for final project submissions:
Your final project submission should include the following items: (1) A 7 to 10 minutes presentation of your project, which includes motivation, approaches, and observations. (2) Final report or survey (see above for requirements) in pdf form. (3) A link of the github repository of your code together with a simple documentation (eg how to run the code, and where are test datasets). The repository can be private if you don't feel comfortable sharing publicly. I will put my github handler name in Canvas.
The deadline of all submissions Tuesday (June 10th) of the Final Exam Week. Your submissions will be via emails to me (yusuwang@ucsd.edu) : Please send me a pdf of your report, a link where you put the recording of your presentation, and a Github link of your code. (The link for your presentation could be in your github.) You are responsible to make sure that I receive your submissions -- in particular, I will acknowledge receiving your email by noon on Wed (June 11). If you didn't receive an acknowledge email from me by then, that means that I didn't receive your email -- please make sure to reach out to me in that case.
I might reach out and ask to meet briefly with you if I have questions regarding your work. Please make sure to check emails from me on Wed + Thu (June 11 + 12).
Timeline:
Week 2-4: Students meet with me to discuss project topic .
Week 5: Finalization of project topic. Project proposal due by end of May 4th, 2022 (Sunday).
Week 10: Finish up your final project
Final Exam week:
Tuesday June 10th by 11:59pm : Final project report as well as your recorded presentation are due (there is no late submission accepted as I need to assign final grades). Please submit the video to the google drive I shared with you (information will be available in canvas then), and send the report to me by email.
Lecture materials: (Lecture notes in pptx are distributed in canvas. Below we have pdf versions of slides; note that all animations are lost in the pdf slides. )
Topic 0: Introduction to computational topology (slides: pptx)
Topic 1: Basics. (slides: pptx; Optional reading: Chapter 1 of CTDA textbook.)
Topic 2: Simplicial complexes; aka, how do we model space of interests? (slides: pdf; Optional reading: Chapter 2.1-2.3 of CTDA textbook.)
Topic 3: Introduction to homology groups; aka, how do we quantify topological features? (slides: pdf; Optional reading: Chapter 2.4-2.5 of CTDA textbook.)
Topic 4 - A: Introduction to persistent homology; aka, a modern extension of homology. (slides: pdf. Optional reading: Chapter 3.1 - 3.3 of CTDA textbook.)
Topic 4 - B: Persistence homology for common data types (point clouds and functions). (slides: pdf. Optional reading: Chapter 6.1, and Chapter 3.4 of CTDA textbook.)
Topic 4 - C: Distances for PHs, and stability. (slides: pdf. Optional reading: Chapter 4.1 of CTDA textbook.)
Topic 5: PH in practice. (slides: pdf. Optional reading: Chapter 8 of CTDA textbook.)
Topic 6: Introduction of the Reeb graphs and Contour trees. (slides: pdf. Optional reading: Chapter 7 of CTDA textbook.)
Topic 7: Introduction to Mapper methodology, and Multiscale Mapper. (slides: pdf. Optional reading; Chapter 9.3 of CTDA textbook.)
Topic 8: Discrete Morse theory, and applications. (slides: pdf. Optimal reading: Chapter 10 of CTDA textbook.)
Topic 9: Topology inference for PCD; data sparsification. (slides: pdf. Optimal reading: Chapter 6.2--6.4 of CTDA textbook.)
Topic 10: TDA in ML. (slides: pdf. Optimal reading: Chapter 13 of CTDA textbook.)
Other course materials: (Optional assignments, potential project topics, announcement etc. )
See the following position paper on Topological Deep Learning (appeared in ICML 2024).
See the following survey paper on "Open problems in mechanistic interpretability", and think about how TDA might be useful in this regard
Check out the Chapter 13 of the textbook that I co-authored (which can be downloaded on my website), to get a quick idea of where topological information can be used with Machine learning.
Announcement:
More details on the project is provided above. A list of potential project topics is also given.
The office hour on April 10 is moved to April 11 (Fri) at 3pm in HDSI 446.
The office hour on April 17 is moved to April 18 (Fri) at 3pm in HDSI 446.
The office hour on May 22 (Thu) is shifted by 30 minutes to 10am -- 10:50am in HDSI 446.
New: Please see above for more detailed instructions regarding the Final Project Submissions.
New: There will be no class in Tue (June 3rd). The class on Thu (June 5) will be on zoom, where we have a guest lecture by Dr. Chao Chen on TDA+ML in Medical Image Analysis. See canvas and campuswire for zoom link.
New: The last class will be on zoom too, at the same zoom link that I sent you for class.
Software / Tools resources
Computing Persistent homology:
The Ripser library is available here . It contains a memory+time efficient algorithm for computing persistent homology for Rips filtration.
The Gudhi library is available here . This includes currently the most efficient algorithm for computing persistent homology for arbitrary simplicial complex. The references for the algorithm is available on the Gudhi webpage.
The PHAT library is available here . This algorithm is also efficient, and the algorithm is closer to the Gaussian elimination we described in class, thus potentially easier to play with if you need to modify the code. The references for the code is available on the website. There is also a parallel version DIPHA for effecient computation of PH on distributed systems, which is available here.
The Dionysus library is one of the most comprehensive packages, which contains many persistence related computation, including levelset zigzag persistence. It is available here. It is written in C++ and reasonably efficient.
The JavaPles library is a Matlab and Java-based system. It is available here . It can automatically compute Vietoris-Rips, witness and lazy-witness complexes from input finite metric spaces (e.g point clouds data in a feature space) to further build persistent homology. The webpage also contains tutorial etc.
The SimPers package is specifically for persistence of a filtration connected by simplicial maps. It is available here .
The Sophia package is also specifically for persistence of a filtration connected by simplicial maps. Its practical performance can be better than SimPers. It is available here .
The scikit-tda package (available here) is a python library that can compute persistent homology induced from points data, as well as other tools such as visualization of persistence diagrams.
Topological data analysis (TDA) and machine learning
The sklearn-tda python package (available here) by Mathieu Carriere is a package combing TDA with machine learning (including most kernels for persistence diagrams)
The TopoLayer python package (available here) by Rickard Bruel Gabrielsson implements a topological layer that can be inserted in a neural network
Computing Bottleneck / Wasserstein distance between persistence diagrams:
The Hera package is available here. It computes the bottleneck or Wasserstein distances between two persistence diagrams efficiently.
Sparsification / approximation:
The SimBa package is available here . It produces approximate persistence diagrams for Rips filtration of a set of points in the Euclidean space or in general metric space. It is originally developed by Dayu Shi, working with Dr. Tamal Dey and myself, and currently maintained by Sayan Mandal.
The Sophia package is available here. It produces approximate persistence diagrams for Rips filtration of a set of points in the Euclidean space.
Graphs, Reeb graph / contour tree related:
The code to compute persistence diagrams for a function defined on a graph can be downloaded here. The code to compute the persistence-distortion distance for comparing two metric graphs can be downloaded here. Both are developed by Dayu Shi, working with Dr. Tamal Dey and myself.
The Denali software package for visualizing terrain metaphor is available here. The package includes code to compute contour trees for a scalar field.
The randomized algorithm for computing Reeb graph for arbitrary simplicial complex can be downloaded here.
Mapper:
The Python Mapper package is available here . It is developed by Daniel Muller.
Discrete Morse complex:
The graph reconstruction based on discrete Morse complex is available here. It takes arbitrary simplicial complex + a vertex-wise function, or a 2D / 3D cubic grid with vertex-wise function, as input. Examples for road network reconstructions are included.
The DisPerse package is available here . The software only takes 2D or 3D volumetric data, or Delaunay triangulations.