Completed:
Coursera Week 8
EX 7
Coursera Week 9 part 1
Lessons Learned:
K-means algorithm
Initialize cluster centroids (e.g. pick K random training examples)
Assign each data point to its closest centroid
Move each centroid to the mean of the points assigned to it
Repeat until the assignments stop changing (see the sketch below)
K - the number of clusters
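A minimal Octave sketch of the loop above (the synthetic data and variable names are my own, not from ex7):

    % two synthetic 2-D blobs, K = 2 clusters
    X = [randn(50,2); randn(50,2) + 5];
    K = 2;
    p = randperm(size(X,1));
    centroids = X(p(1:K), :);           % init: K random training examples
    idx = zeros(size(X,1), 1);
    for iter = 1:10
      for i = 1:size(X,1)               % assignment step: closest centroid
        d = sum(bsxfun(@minus, centroids, X(i,:)).^2, 2);
        [~, idx(i)] = min(d);
      end
      for k = 1:K                       % move step: mean of assigned points
        centroids(k,:) = mean(X(idx == k, :), 1);
      end
    end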
Applications:
well-separated clusters
not well-separated clusters, e.g. height/weight → t-shirt sizes S/M/L
Optimization objective (distortion): the average squared distance between each point and its assigned centroid (formula below)
Choose the number of clusters K with the elbow method: plot distortion vs. K and pick the K where the curve flattens out
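For reference, the distortion cost as defined in the lectures, where c^(i) is the index of the centroid assigned to x^(i):

    J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2

The assignment step minimizes J over the c^(i) with the centroids fixed; the move step minimizes J over the mu_k with the assignments fixed.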
Dimensionality Reduction
Data Compression
reduce data from 2D → 1D, 3D → 2D, etc
reduce training time
Data Visualization
Principal Component Analysis (PCA)
Find a lower-dimensional surface onto which to project the data, such that the sum of squared distances between the actual points and their projections (the projection error) is minimized
PCA is NOT Linear Regression
Should NOT be used to address overfitting; use regularization instead.
Principal Component Analysis (PCA) Algorithm
Data preprocessing: feature scaling / mean normalization
Compute the covariance matrix (n x n): Sigma = (1/m) * X' * X
Compute its eigenvectors via singular value decomposition (SVD): [U, S, V] = svd(Sigma)
The first K columns of U form Ureduced
Project: z = trans(Ureduced) * x
Reconstruction from the compressed representation: Xapprox = Ureduced * z (no transpose here; see the sketch below)
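A compact Octave sketch of the pipeline, assuming X is m x n with one example per row (the stand-in data is my own):

    X = randn(100, 5);                          % stand-in data: m = 100, n = 5
    mu = mean(X);
    X_norm = bsxfun(@minus, X, mu);             % mean normalization
    Sigma = (X_norm' * X_norm) / size(X, 1);    % n x n covariance matrix
    [U, S, V] = svd(Sigma);                     % columns of U = principal components
    K = 2;                                      % number of components to keep
    Ureduced = U(:, 1:K);                       % first K columns of U
    Z = X_norm * Ureduced;                      % z = trans(Ureduced) * x, row-wise
    X_approx = Z * Ureduced';                   % reconstruction (approximates X_norm)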
K - number of principal components
K should be the smallest value such that:
(average squared projection error) / (total variation in the data) ≤ 0.01
→ 99% of variance is retained (see the check below)
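In practice the lectures compute this from the diagonal of S returned by svd(Sigma): pick the smallest K with sum_{i=1..K} S(i,i) / sum_{i=1..n} S(i,i) ≥ 0.99. A minimal Octave check:

    s = diag(S);                             % singular values of Sigma
    variance_retained = cumsum(s) / sum(s);  % fraction retained for each K
    K = find(variance_retained >= 0.99, 1);  % smallest K retaining 99%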
Use PCA to speed up supervised learning: maps x to z
Given dataset (x1, y1), …(xm, ym)
Extract input: (x1...xm)
apply PCA to obtain (z1...zm)
new training set (z1, y1), (z2, y2), ..., (zm, ym) (see the sketch below)
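Reusing mu and Ureduced from the sketch above (the mapping should be learned on the training inputs only and then applied unchanged to new data; X_test is a hypothetical held-out set):

    Z = bsxfun(@minus, X, mu) * Ureduced;            % training inputs -> train on (Z, y)
    Z_test = bsxfun(@minus, X_test, mu) * Ureduced;  % same mapping for new data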
randi(range, size) → randi(10, 5): a 5 x 5 matrix of random integers from 1 to 10
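Quick Octave check (the two-element range form also works in both Octave and MATLAB):

    A = randi(10, 5);          % 5 x 5, integers in 1..10
    B = randi([5 10], 2, 3);   % 2 x 3, integers in 5..10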