Think of a sheet of paper crumpled into 3D space. Locally it’s still a flat 2D surface: you could lay a tiny ruler on it and do normal geometry. Globally, though, it looks complicated.
Data often behaves the same way: it lives in very high dimensions (pixels, genes, sensors) but actually sits near a lower-dimensional “surface.” That surface is the manifold.
Extrinsic dimension: where the data is embedded (e.g., 3D).
Intrinsic dimension: the real degrees of freedom on the manifold (e.g., the sheet’s 2D).
Goal: recover the intrinsic structure.
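To make this concrete, here’s a minimal sketch (assuming scikit-learn is installed) using the classic Swiss roll: every point has 3 extrinsic coordinates, but only 2 intrinsic degrees of freedom.

```python
from sklearn.datasets import make_swiss_roll

# 3D points sampled from a rolled-up 2D sheet.
# X has shape (n_samples, 3): the extrinsic (ambient) dimension is 3.
# t is the position along the roll; together with the height axis it
# accounts for the 2 intrinsic degrees of freedom.
X, t = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)
print(X.shape)  # (2000, 3) -> embedded in 3D, but intrinsically 2D
```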
Manifold learning is a family of techniques for nonlinear dimensionality reduction: mapping high-D data to a low-D space (usually 2D/3D) while keeping the meaningful shape. Great for visualization and sometimes for speeding up downstream models. 📉
Neighbor graph: connect each point to its nearest neighbors. “Near” on the manifold ≠ “near” in straight-line Euclidean space, so a good neighbor graph is crucial.
Preserve local or geodesic structure: either
approximate geodesic distances along the graph (paths that follow the surface), or
preserve tiny local neighborhoods so the map unfolds without tearing.
Embed: solve an eigenproblem or optimization to place points in 2D/3D so those distances/neighborhoods are respected.
That’s it: build the graph, respect the manifold, place the points.
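Here’s a minimal sketch of those three steps with scikit-learn (the parameter values are just illustrative starting points):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=1500, random_state=0)

# Step 1: neighbor graph -- connect each point to its k nearest neighbors.
graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

# Steps 2 + 3: Isomap builds this kind of graph internally, approximates
# geodesic distances via shortest paths, then solves an eigenproblem (MDS)
# to place the points in 2D.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1500, 2)
```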
PCA (baseline): linear, fast, keeps global variance; fails on curved manifolds.
Isomap: preserves global geodesic distances using shortest paths + MDS. Good for “unrolling” shapes like the Swiss roll; can struggle with holes or disconnected graphs.
LLE (Locally Linear Embedding): preserves each point as a weighted combo of its neighbors. Great at local structure; global layout can drift.
Laplacian Eigenmaps: spectral method preserving neighborhood smoothness; close cousin to LLE.
t-SNE: focuses on local clusters; gorgeous visuals for high-D embeddings (images, words, cells), but global distances aren’t meaningful. Needs perplexity tuning. ✨
UMAP: like t-SNE’s pragmatic cousin — faster, preserves more global structure, supports transform of new points, and has clear knobs (n_neighbors, min_dist). 🚀
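If you want to try these side by side, here’s a hedged sketch (UMAP lives in the third-party umap-learn package; all parameter values are just starting points, not recommendations):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding, TSNE
import umap  # pip install umap-learn

X, t = make_swiss_roll(n_samples=1500, random_state=0)

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "Isomap": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X),
    "Laplacian Eigenmaps": SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
    "UMAP": umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X),
}
for name, Y in embeddings.items():
    print(name, Y.shape)  # each method maps (1500, 3) -> (1500, 2)
```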
Want a faithful global unfolding (shape matters end-to-end)? → Isomap.
Care about local neighborhoods/clusters (who’s near whom), not global axes? → t-SNE or UMAP.
Need something you can apply to new data easily and use for both viz + modeling? → UMAP (with transform) or autoencoders.
Suspect the data is mostly linear? → PCA first; it’s a strong baseline.
Scale your features first. Unscaled distances wreck neighbor graphs.
Tune neighbors:
Small k/n_neighbors → very local, more fragmented.
Big k → smoother, risks shortcuts across folds.
Check connectivity: if your graph isn’t connected, Isomap/LLE can fail or produce weird islands (see the sketch after this list).
Multiple runs: t-SNE/UMAP can vary with initialization; run a few times to confirm patterns.
Don’t overread axes: in t-SNE/UMAP, distances between far clusters and absolute axis directions are usually not meaningful.
Out-of-sample: vanilla LLE/Isomap don’t naturally map new points. If you need that, use UMAP or learn a parametric map (e.g., an autoencoder).
Sanity checks: color by known labels or continuous variables; neighbors in high-D should usually remain neighbors in the embedding. ✅
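A small sketch for two of the tips above, scaling and connectivity (scipy and scikit-learn assumed; the data here is synthetic just to show the mechanics):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 0] *= 1000  # one feature on a huge scale would dominate distances

# Tip: standardize so no single feature wrecks the neighbor graph.
X_scaled = StandardScaler().fit_transform(X)

# Tip: verify the neighbor graph is connected before running Isomap/LLE.
graph = kneighbors_graph(X_scaled, n_neighbors=10, include_self=False)
n_components, labels = connected_components(graph, directed=False)
print(f"{n_components} connected component(s)")  # you want 1
```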
Biology: single-cell RNA-seq to reveal cell types and trajectories (t-SNE/UMAP everywhere).
Vision: visualize embeddings of images; detect pose/lighting manifolds.
NLP: inspect word/sentence embeddings; discover topic clusters.
Recommenders: map users/items to uncover communities and niches.
Robotics: reduce sensor spaces; learn configuration manifolds for planning.
Medical imaging: organize scans by anatomy or disease progression.
Astronomy: discover structures in spectra or sky surveys.
Anomaly detection: points far from the learned manifold = potential outliers. 🚨
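For the anomaly-detection use case, one common neighbor-based approach (not the only one) is scikit-learn’s LocalOutlierFactor, which flags points whose local density is much lower than their neighbors’, i.e., points sitting off the manifold:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
inliers = rng.normal(size=(300, 5))          # points near the "manifold"
outliers = rng.uniform(-8, 8, size=(10, 5))  # points far from it
X = np.vstack([inliers, outliers])

# LOF compares each point's local density to that of its neighbors;
# fit_predict returns +1 for inliers and -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
print((labels == -1).sum(), "points flagged as potential outliers")
```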
Parameter sensitivity: neighborhood size, perplexity, and min_dist change the story.
Density distortions: t-SNE especially; don’t treat cluster area as sample size.
No magic clustering: embeddings suggest structure, but use proper clustering metrics/tests.
Computation: big datasets need approximations (UMAP helps; so do ANN libraries).
A manifold is a space that’s locally Euclidean: around any point you can use \(d\) coordinates even if it’s embedded in \(\mathbb{R}^D\) with \(d \ll D\).
Methods often solve an eigenproblem of a graph matrix (Laplacian/weights) or minimize a divergence between neighbor probabilities (t-SNE/UMAP).
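In symbols (a hedged sketch; notation varies across papers), Laplacian Eigenmaps solves a constrained trace minimization over the graph Laplacian, while t-SNE minimizes a KL divergence between neighbor distributions:

\[
\min_{Y}\; \operatorname{tr}\!\left(Y^\top L Y\right) \quad \text{s.t. } Y^\top D Y = I
\qquad \text{(Laplacian Eigenmaps)}
\]
\[
\min_{Y}\; \mathrm{KL}(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\qquad \text{(t-SNE)}
\]

where \(L = D - W\) is the graph Laplacian, \(W\) the neighbor weights, \(D\) the degree matrix, and \(p_{ij}, q_{ij}\) the neighbor probabilities in high and low dimensions.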
Standardize features.
PCA to 50–100 dims (denoise/speed).
Choose neighbors (n_neighbors ~ 10–50) and try UMAP; compare with t-SNE.
Validate with labels/metadata; iterate on parameters.
If global geometry matters, test Isomap and inspect geodesic quality.
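Putting the recipe together as one hedged sketch (sizes and parameters are just starting points; UMAP again assumes umap-learn, and the random data is a stand-in for yours):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 500))  # stand-in for your high-D data

# 1. Standardize features.
X = StandardScaler().fit_transform(X)

# 2. PCA to ~50 dims to denoise and speed up the neighbor search.
X50 = PCA(n_components=50).fit_transform(X)

# 3. UMAP embedding; compare with t-SNE on the same reduced data.
emb_umap = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X50)
emb_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X50)

# 4. Validate: color emb_umap / emb_tsne by labels or metadata, then iterate
#    on n_neighbors / perplexity until the patterns are stable across runs.
```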
If you want, tell me your dataset shape and goal, and I’ll give you concrete settings to start with. 😊