BDS 3002: Machine Learning for Biomedical Data Science

A companion site for students and teachers of the MLBDS course

Course director: Li Shen, PhD

Email: li.shen@mssm.edu

Website: Shen lab@Mount Sinai

Guest lecturers for Spring 2024: Parijat Dube, Adam Catto, Pei Wang, Vikas Pejaver, Avner Schlessinger, Riccardo Miotto

Useful information:

Math for Machine Learning

Some quick references for math

Some videos to help you understand math

Linear Algebra and numpy

Google Colab

AWS S3 (for Instructors)

PyTorch

scikit-learn

Math for Machine Learning

Coursera offers a Mathematics for Machine Learning Specialization that will help you gain the prerequisite mathematical knowledge for machine learning. If linear algebra or calculus feels unfamiliar to you, you may find it very useful. Tip: you can always audit a Coursera course. The specialization has three courses. The linear algebra and multivariate calculus courses are the most useful for this class; the principal component analysis course is less relevant here, but PCA is still an important topic.

If you still want more, this book provides a nice treatment of many topics in machine learning. The book has a companion website with Jupyter notebooks and other resources.

Don't worry: you don't need a PhD in mathematics to do machine learning. We just want you to understand the basics. You can do a lot of data science without knowing the underlying math, but knowing the math will allow you to go much further.

Some quick references for math

calculus_cheat_sheet_all.pdf

calculus cheatsheet

linalg_cheat_sheet.pdf

linear algebra cheatsheet

Some videos to help you understand math

3Blue1Brown, one of my favorite YouTubers, has made several series of videos that cover many topics in math. For example, Essence of linear algebra, Essence of calculus, and Neural networks are all fantastic.

Here are some example videos:

chain rule

linear transformation and matrices

back-propagation

numpy-demo.pdf

Linear Algebra and numpy

Linear algebra plays an important role in many machine learning algorithms. The level of mathematics involved is NOT high, but many students find it challenging to put the math to work in code. numpy is a widely used scientific computing library, and deep learning libraries such as PyTorch and TensorFlow use syntax similar to numpy's. This PDF reviews some basic linear algebra operations in numpy. Make sure you understand them!
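As a quick warm-up alongside the PDF, here is a minimal sketch of a few basic linear algebra operations in numpy. The matrix and vector values are arbitrary and chosen only for illustration; they are not taken from the course material.

```python
import numpy as np

# A small matrix and vector with arbitrary illustrative values.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, 2.0])

# Matrix-vector product: the @ operator performs matrix multiplication.
y = A @ x

# Transpose and matrix-matrix product (A^T A is always symmetric).
gram = A.T @ A

# Dot product of a vector with itself, and its Euclidean norm.
dot = x @ x
norm = np.linalg.norm(x)

# Solve the linear system A w = y. This is usually preferable to
# computing the inverse of A explicitly.
w = np.linalg.solve(A, y)

# The * operator is element-wise, NOT matrix multiplication.
hadamard = A * A

print("A @ x      =", y)
print("x . x      =", dot)
print("||x||      =", norm)
print("solve(A,y) =", w)
```

A common source of bugs is mixing up `*` (element-wise) and `@` (matrix product), so it is worth checking the shapes and values of intermediate results while you are learning.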

Google Colab

An overview of Colab's features
Dealing with files in Google Colab
A notebook for dealing with external data in Colab
A simple guide to the Markdown that can be used to create rich-text cells in Colab
A Markdown table generator

AWS S3 (for Instructors)

Grant a user access to only one bucket or folder
S3Fs is a Python module that lets you work with S3 files as if they were local files, using a high-level API

PyTorch

PyTorch and TensorFlow are the two most widely used deep learning platforms, and PyTorch is especially popular among academic researchers. In this course, we use PyTorch as a teaching tool to help you understand deep learning.

Here is a great book on PyTorch that lets you dive deep into concepts like Tensors and Autograd and includes an extended example on lung cancer detection. You can also read the book online.
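To give a first taste of Tensors and Autograd, here is a minimal sketch that is not taken from the book. It assumes a toy squared-error loss on made-up numbers and shows PyTorch computing the gradient that the chain rule gives by hand.

```python
import torch

# Parameters we want gradients for (arbitrary illustrative values).
w = torch.tensor([1.0, -2.0], requires_grad=True)

# A fixed input vector; no gradient is tracked for it.
x = torch.tensor([3.0, 0.5])

# Toy model and loss: loss = (w . x - 1)^2
pred = torch.dot(w, x)
loss = (pred - 1.0) ** 2

# Autograd applies the chain rule backwards through the computation.
loss.backward()

# By hand: d(loss)/dw = 2 * (pred - 1) * x
manual_grad = 2.0 * (pred.detach() - 1.0) * x

print("autograd gradient:", w.grad)
print("manual gradient:  ", manual_grad)
```

The two printed gradients should agree, which is a handy sanity check when you start writing your own loss functions.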

scikit-learn

scikit-learn is an extremely popular Python package for machine learning. You may find the following tutorials to be particularly useful:

Model Selection: how to do cross-validation and use grid search to find the best hyperparameters for your model.
Model evaluation using different performance metrics.
Data Preprocessing is extremely important. A lot of models simply won't work without proper data transformation. Some examples: feature scaling and PCA, comparing different scalers.
Basic Imputation Methods. In real-world data, missing values are very common.
How to construct pipelines to put together multiple data preprocessing steps and prediction models.
How to save a model and load it later for prediction, i.e. model persistence.
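The topics above fit together naturally. The following is a minimal sketch, not a recommended recipe: it assumes a synthetic dataset with artificially injected missing values, and the choice of median imputation, standard scaling, logistic regression, and the grid of `C` values is purely illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for a real dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inject roughly 5% missing values to mimic real-world data.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pipeline: imputation -> feature scaling -> classifier.
# Fitting the preprocessing inside the pipeline keeps each
# cross-validation fold free of information from its held-out part.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search over the regularization strength with 5-fold CV.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

print("best C:", search.best_params_["clf__C"])
print("cross-validated ROC AUC:", round(search.best_score_, 3))
# search.score uses the same metric passed as `scoring` (ROC AUC here).
print("test ROC AUC:", round(search.score(X_test, y_test), 3))

# Model persistence: save the fitted pipeline and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)

same = np.array_equal(restored.predict(X_test), search.predict(X_test))
print("restored model gives identical predictions:", same)
```

Saving the whole pipeline, rather than only the classifier, means the imputation and scaling learned on the training data travel with the model, so new data can be passed to `restored.predict` in its raw form.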
