A companion site for students and teachers of the MLBDS course
Course director: Li Shen, PhD
Email: li.shen@mssm.edu
Website: Shen lab@Mount Sinai
Guest lecturers for Spring 2024: Parijat Dube, Adam Catto, Pei Wang, Vikas Pejaver, Avner Schlessinger, Riccardo Miotto
There is a Mathematics for Machine Learning Specialization on Coursera that will help you gain the prerequisite mathematical knowledge for machine learning. If linear algebra or calculus feels unfamiliar to you, you may find it very useful. Tip: you can always audit a Coursera course. The specialization has three courses: linear algebra and multivariate calculus are the most useful for this course; principal component analysis is less relevant but still an important topic.
If you still want more, this book provides a nice treatment of many topics in machine learning. The book has a companion website with Jupyter notebooks and other resources.
Don't worry. You don't need a PhD in mathematics to do machine learning. We just want you to understand some of the basics. You can do a lot of data science without knowing the underlying math, but knowing the math will allow you to go much further.
calculus cheatsheet
linear algebra cheatsheet
3Blue1Brown, one of my favorite YouTubers, has made several series of videos that cover many topics in math. For example, Essence of linear algebra, Essence of calculus, and Neural networks are all fantastic.
Here are some example videos:
chain rule
linear transformation and matrices
back-propagation
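To see how the chain rule and back-propagation fit together, here is a minimal sketch in plain Python. The single sigmoid neuron, the squared-error loss, and the input values are illustrative assumptions rather than material from the videos. The gradients computed by hand with the chain rule are compared against a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_backward(w, b, x, y):
    """Return the loss of one sigmoid neuron and its gradients w.r.t. w and b."""
    # Forward pass: keep the intermediate values for the backward pass.
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2

    # Backward pass: multiply local derivatives along the path (chain rule).
    dloss_da = 2.0 * (a - y)      # d(loss)/d(a)
    da_dz = a * (1.0 - a)         # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    dloss_dw = dloss_dz * x       # dz/dw = x
    dloss_db = dloss_dz           # dz/db = 1
    return loss, dloss_dw, dloss_db


def numerical_grad(f, v, eps=1e-6):
    """Central finite-difference estimate of df/dv."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)


# Hypothetical parameter and data values, chosen only for illustration.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
loss, grad_w, grad_b = forward_backward(w, b, x, y)

num_w = numerical_grad(lambda v: forward_backward(v, b, x, y)[0], w)
num_b = numerical_grad(lambda v: forward_backward(w, v, x, y)[0], b)

print("analytic vs numerical dL/dw:", grad_w, num_w)
print("analytic vs numerical dL/db:", grad_b, num_b)
```

The two estimates should agree to several decimal places. Deep learning libraries automate exactly this bookkeeping for networks with many layers.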
Linear algebra plays an important role in many machine learning algorithms. The level of mathematics involved is NOT high. However, many students find it challenging to put the math to "work". numpy is a widely used scientific computing library, and deep learning libraries such as PyTorch and TensorFlow have syntax similar to numpy's. This PDF reviews some basic linear algebra operations in numpy. Make sure you understand them!
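As a quick warm-up, the sketch below runs through a few common linear algebra operations in numpy. The matrix and vector are made-up examples, and this is not a substitute for the PDF review.

```python
import numpy as np

# A small symmetric matrix and a vector, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

Av = A @ v                     # matrix-vector product
AA = A @ A                     # matrix-matrix product
A_elementwise = A * A          # element-wise product (NOT a matrix product)
At = A.T                       # transpose
A_inv = np.linalg.inv(A)       # matrix inverse
x = np.linalg.solve(A, v)      # solve A x = v without forming the inverse
norm_v = np.linalg.norm(v)     # Euclidean (L2) norm

# eigh is intended for symmetric matrices; eigenvectors are the columns.
eigvals, eigvecs = np.linalg.eigh(A)

print("A @ v =", Av)
print("solution of A x = v:", x)
print("eigenvalues:", eigvals)
```

Note the difference between `@` (matrix multiplication) and `*` (element-wise multiplication). Mixing them up is a frequent source of bugs, and the same distinction carries over to PyTorch tensors.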
S3Fs is a Python module that lets you work with files on S3 as if they were local files, using high-level APIs.
PyTorch and TensorFlow are the two most widely used deep learning platforms, and PyTorch is especially popular among academic researchers. In this course, we use PyTorch as a teaching tool to help you understand deep learning.
Here is a great book on PyTorch that lets you dive deep into concepts like tensors and autograd and works through an extended example on lung cancer detection. You can also read the book online.
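For a first taste of tensors and autograd, here is a minimal sketch, assuming PyTorch is installed. The tensors and the function being differentiated are arbitrary choices for illustration and are not taken from the book.

```python
import torch

# Input data and a parameter vector; requires_grad tells autograd to track w.
x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

# A scalar function of w: y = (w . x)^2
y = (w * x).sum() ** 2

# Back-propagate: fills w.grad with dy/dw.
y.backward()

# By the chain rule, dy/dw = 2 * (w . x) * x; compute it by hand to compare.
expected = 2 * (w.detach() * x).sum() * x

print("autograd gradient:", w.grad)
print("hand-derived gradient:", expected)
```

The two gradients should match. In a real model you never derive the gradient by hand; you call `backward()` on the loss and let an optimizer use the `.grad` fields.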
scikit-learn is an extremely popular Python package for machine learning. You may find the following tutorials to be particularly useful:
Model Selection: how to do cross-validation and use grid search to find the best hyperparameters for your model.
Model evaluation using different performance metrics.
Data Preprocessing is extremely important. A lot of models simply won't work without proper data transformation. Some examples: feature scaling and PCA, comparing different scalers.
Basic Imputation Methods. In real-world data, missing values are very common.
How to construct pipelines to put together multiple data preprocessing steps and prediction models.
How to save a model and load it later for prediction, i.e. model persistence.
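The topics above can be combined in one short sketch: imputation, scaling, a pipeline, grid search with cross-validation, evaluation, and model persistence. The dataset (iris), the artificially removed values, the choice of logistic regression, and the parameter grid are all assumptions made for illustration. The model is saved to an in-memory buffer here so the example leaves no files behind; in practice you would pass a file path to `joblib.dump`.

```python
import io

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Artificially blank out about 5% of the values to mimic missing data.
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model in one object, so cross-validation
# refits the imputer and scaler on each training fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search over the regularization strength of the classifier step.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("held-out accuracy:", test_accuracy)

# Model persistence: dump the fitted pipeline, then load it back.
buffer = io.BytesIO()
joblib.dump(search.best_estimator_, buffer)
buffer.seek(0)
restored = joblib.load(buffer)
print("restored model agrees:",
      np.array_equal(restored.predict(X_test), search.predict(X_test)))
```

Putting the imputer and scaler inside the pipeline, rather than transforming the whole dataset up front, keeps information from the validation folds from leaking into the preprocessing steps.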