Deep Learning Foundations Starter Track
Overview
This page is for students who are interested in AI but are still at the very beginning: if you are not yet comfortable with deep learning, PyTorch, or neural networks, start here.
The goal is to build enough of a foundation to later study computer vision, multimodal AI, and edge AI with confidence.
Part I. Google Colab
Students who are completely new to deep learning do not need to set up a local environment first. A good way to start is Google Colab, a free browser-based notebook environment with GPU access.
What to learn first in Colab
how to open a notebook
how to run a code cell
how to switch to a GPU runtime (Runtime > Change runtime type)
how to install a package with pip
how to upload a small file or connect Google Drive
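As a first experiment, several of the items above can be tried in one cell. This is a minimal sketch assuming a PyTorch Colab runtime (shell commands such as pip are prefixed with "!" in Colab):

```python
# Minimal first Colab cell: check the runtime and move a tensor to it.
# In Colab, packages are installed with a shell command, e.g.:
#   !pip install wandb
import torch

# Reports "cuda" only if a GPU runtime is active
# (Runtime > Change runtime type > GPU).
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

# Create a small tensor and move it to the selected device.
x = torch.randn(2, 3).to(device)
print(x.shape)  # torch.Size([2, 3])
```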
Recommended resources
Google Colab: https://colab.research.google.com/
Colab FAQ: https://research.google.com/colaboratory/faq.html
PyTorch in Colab guide: https://pytorch.org/tutorials/beginner/colab.html
Weights & Biases: https://wandb.ai/site
W&B Docs: https://docs.wandb.ai/
W&B Tutorial: https://docs.wandb.ai/tutorials/
Part II. Lecture materials for beginning students
1. Deep Learning Zero To All / 모두를 위한 딥러닝 시즌 2 (Deep Learning for Everyone, Season 2)
Website: https://deeplearningzerotoall.github.io/season2/
PyTorch page: https://deeplearningzerotoall.github.io/season2/lec_pytorch.html
Code: https://github.com/deeplearningzerotoall/PyTorch
Why start here: a clear and accessible starting point for students who are new to deep learning.
2. PyTorch Korean Tutorials
Website: https://tutorials.pytorch.kr/index.html
Beginner basics: https://tutorials.pytorch.kr/beginner/basics/intro.html
GitHub: https://github.com/PyTorchKorea/tutorials-kr
Why use this: a good next step for students who want to learn PyTorch in a more standard and up-to-date way.
3. 모두의 딥러닝 개정 2판 (Deep Learning for Everyone, Revised 2nd Edition)
Code: https://github.com/taehojo/deeplearning-for-everyone-2nd
Why use this: useful for students who prefer learning from a book and following code step by step.
Part III. Papers and code
These are a few landmark papers to read after becoming comfortable with basic deep learning and PyTorch. The goal is not to read many papers at once, but to begin practicing how to read important papers in deep learning and computer vision.
1. AlexNet (NeurIPS 2012)
Paper: ImageNet Classification with Deep Convolutional Neural Networks
Why read it: this is one of the papers that made deep learning for computer vision take off.
Focus on: ReLU, dropout, data augmentation, and large-scale image classification.
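Two of those ideas, ReLU and dropout, are easy to see in a tiny sketch (the layer sizes here are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

# A toy layer using two ideas AlexNet popularized: ReLU activations
# and dropout for regularization.
layer = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5))

# Dropout randomly zeroes activations during training but is disabled
# in evaluation mode.
layer.eval()
x = torch.randn(1, 8)
out = layer(x)
print(out.shape)  # torch.Size([1, 8])
```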
2. VGG (ICLR 2015)
Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
Why read it: this paper is simple and easy to follow, and it shows why deeper convolutional networks matter.
Focus on: repeated 3x3 convolutions, depth, and simple architecture design.
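The repeated-3x3 idea can be sketched as a small block; this is a hypothetical helper, not the full VGG architecture:

```python
import torch
import torch.nn as nn

# A VGG-style block: stacked 3x3 convolutions followed by 2x2 max pooling.
# Two stacked 3x3 convolutions cover a 5x5 receptive field with fewer
# parameters than one 5x5 convolution.
def vgg_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

block = vgg_block(3, 64)
x = torch.randn(1, 3, 32, 32)
y = block(x)
print(y.shape)  # (1, 64, 16, 16): channels grow, spatial size halves
```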
3. ResNet (CVPR 2016)
Paper: Deep Residual Learning for Image Recognition
Why read it: one of the most important papers in modern deep learning.
Focus on: the degradation problem, residual connections, and why skip connections help optimization.
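The core idea is small enough to sketch. This is a simplified basic block (omitting stride and channel changes from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A simplified residual block: the skip connection adds the input back
# to the convolutional output, so the block learns a residual F(x)
# rather than a full mapping, which eases optimization in deep networks.
class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # the skip connection

x = torch.randn(1, 16, 8, 8)
y = BasicBlock(16)(x)
print(y.shape)  # same shape as the input
```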
4. U-Net (MICCAI 2015)
Paper: U-Net: Convolutional Networks for Biomedical Image Segmentation
Why read it: this is a very good first paper for understanding segmentation and encoder-decoder structure.
Focus on: contracting path, expanding path, skip connections, and localization.
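The skip connection in U-Net is a channel-wise concatenation rather than an addition; a minimal sketch with made-up feature-map sizes:

```python
import torch

# U-Net skip connection sketch: a feature map saved from the contracting
# path is concatenated with the upsampled feature map in the expanding
# path along the channel dimension, recovering spatial detail lost to
# downsampling.
encoder_feat = torch.randn(1, 64, 56, 56)  # saved from the contracting path
decoder_feat = torch.randn(1, 64, 56, 56)  # upsampled in the expanding path

merged = torch.cat([encoder_feat, decoder_feat], dim=1)
print(merged.shape)  # (1, 128, 56, 56): channels double after the merge
```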
5. Transformer (NeurIPS 2017) / ViT (ICLR 2021)
Paper 1: Attention Is All You Need
Paper 2: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Why read these: the original Transformer paper introduced an architecture based purely on attention, and ViT brought that idea into image recognition by treating images as sequences of patches.
Focus on: self-attention, token representation, patch embedding, and how ViT differs from CNNs.
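Patch embedding, the step that turns an image into a token sequence, is commonly implemented as a single strided convolution; a minimal sketch with ViT-like sizes:

```python
import torch
import torch.nn as nn

# ViT patch embedding sketch: a Conv2d with kernel size equal to stride
# splits the image into non-overlapping patches and projects each patch
# to an embedding vector, producing a sequence of tokens.
patch_size, embed_dim = 16, 192
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)
tokens = proj(img).flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
print(tokens.shape)  # (1, 196, 192): a 14 x 14 grid of patches
```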
Part IV. Self-check project
Build a Vision Transformer from scratch in PyTorch and train it on a small dataset
Goal
This project is intended for students who have already finished the lecture materials.
The goal is not to get the best score. The goal is to become comfortable with reading a paper, tracing an implementation, modifying a model, and observing what happens.
Reference code: https://github.com/tintn/vision-transformer-from-scratch
This repository is a simplified PyTorch implementation of the ViT paper and is designed to be easier to understand than a large production codebase.
What to do
Read the ViT paper at a high level. You do not need to understand every equation perfectly at first.
Run the reference implementation in Google Colab.
Identify the main building blocks:
patch embedding
positional embedding
multi-head self-attention
MLP block
classification head
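Before tracing the repository's own implementation, it can help to see the attention block in isolation. This sketch uses PyTorch's built-in module rather than the repository's code, with arbitrary sizes:

```python
import torch
import torch.nn as nn

# Multi-head self-attention sketch: in self-attention, the query, key,
# and value inputs are all the same token sequence, so every token
# attends to every other token.
embed_dim, num_heads = 64, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

tokens = torch.randn(1, 196, embed_dim)      # (batch, num_patches, embed_dim)
out, weights = attn(tokens, tokens, tokens)  # self-attention: q = k = v
print(out.shape)      # (1, 196, 64): same shape as the input sequence
print(weights.shape)  # (1, 196, 196): one attention row per token
```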
Train the model on a small dataset, such as CIFAR-10.
Change one or two settings yourself, such as:
patch size
embedding dimension
number of transformer layers
number of heads
learning rate
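One practical way to keep track of these settings is a single config dictionary, changing one value at a time so it is clear which change caused which effect. The values below are hypothetical starting points, not recommendations:

```python
# Hypothetical experiment config for the settings listed above.
config = {
    "patch_size": 4,       # e.g. try 4 vs 8 on a 32x32 dataset
    "embed_dim": 128,
    "num_layers": 6,
    "num_heads": 4,
    "learning_rate": 3e-4,
}

# The embedding dimension must divide evenly across attention heads.
assert config["embed_dim"] % config["num_heads"] == 0
print(config)
```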
Record the results with Weights & Biases (W&B).
Write a short memo answering:
What part was hardest to understand?
What changed when you modified the model?
Why is ViT different from CNNs?
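The training and logging steps above follow a standard pattern. This is a minimal sketch on random stand-in data with a toy model, not the ViT itself; in practice the print would be replaced by wandb.log(...) after calling wandb.init():

```python
import torch
import torch.nn as nn

# Toy model and random data standing in for the real model and dataset.
model = nn.Linear(10, 2)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))

# Standard training loop: forward pass, loss, backward pass, update.
for step in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    # In a real run: wandb.log({"loss": loss.item()})
    print(f"step {step}: loss {loss.item():.4f}")
```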