
### The Elements of Statistical Learning (Hastie, Tibshirani, Friedman)

| Field | Details |
| --- | --- |
| Author(s) | Trevor Hastie, Robert Tibshirani, Jerome Friedman |
| Title | The Elements of Statistical Learning |
| Edition | Second Edition |
| Year | 2011 |
| Publisher | Springer |
| ISBN | 978-0387848570 |
| Website | http://www-stat.stanford.edu/ElemStatLearn (book website) |

Data, R code, and a PDF version of the book are available at the book's website.
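As a minimal sketch of getting at the book's data sets from R, the companion ElemStatLearn package (an assumption here: it was distributed via CRAN and bundles the book's data, including the prostate cancer data of Section 3.2.1; the same files can also be downloaded directly from the book website above):

```r
# Sketch: load the prostate cancer data (Chapter 3) via the companion
# ElemStatLearn package. Assumption: the package is installable from CRAN
# or the CRAN archive; otherwise fetch the plain-text data from the website.
# install.packages("ElemStatLearn")
library(ElemStatLearn)

data(prostate)     # lcavol, lweight, ..., lpsa plus a logical train/test flag
str(prostate)

# Refit the least-squares model of Section 3.2.1 on the training subset
train_set <- subset(prostate, train == TRUE, select = -train)
fit <- lm(lpsa ~ ., data = train_set)
summary(fit)
```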

Preface to the Second Edition

Preface to the First Edition

1 Introduction

2 Overview of Supervised Learning

2.1 Introduction

2.2 Variable Types and Terminology

2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors
• 2.3.1 Linear Models and Least Squares
• 2.3.2 Nearest-Neighbor Methods
• 2.3.3 From Least Squares to Nearest Neighbors

2.4 Statistical Decision Theory

2.5 Local Methods in High Dimensions

2.6 Statistical Models, Supervised Learning and Function Approximation
• 2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y )
• 2.6.2 Supervised Learning
• 2.6.3 Function Approximation

2.7 Structured Regression Models
• 2.7.1 Difficulty of the Problem

2.8 Classes of Restricted Estimators
• 2.8.1 Roughness Penalty and Bayesian Methods
• 2.8.2 Kernel Methods and Local Regression
• 2.8.3 Basis Functions and Dictionary Methods

2.9 Model Selection and the Bias–Variance Tradeoff

Bibliographic Notes

Exercises

3 Linear Methods for Regression

3.1 Introduction

3.2 Linear Regression Models and Least Squares
• 3.2.1 Example: Prostate Cancer
• 3.2.2 The Gauss–Markov Theorem
• 3.2.3 Multiple Regression from Simple Univariate Regression
• 3.2.4 Multiple Outputs

3.3 Subset Selection
• 3.3.1 Best-Subset Selection
• 3.3.2 Forward- and Backward-Stepwise Selection
• 3.3.3 Forward-Stagewise Regression
• 3.3.4 Prostate Cancer Data Example (Continued)

3.4 Shrinkage Methods
• 3.4.1 Ridge Regression
• 3.4.2 The Lasso
• 3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso
• 3.4.4 Least Angle Regression

3.5 Methods Using Derived Input Directions
• 3.5.1 Principal Components Regression
• 3.5.2 Partial Least Squares

3.6 Discussion: A Comparison of the Selection and Shrinkage Methods

3.7 Multiple Outcome Shrinkage and Selection

3.8 More on the Lasso and Related Path Algorithms
• 3.8.1 Incremental Forward Stagewise Regression
• 3.8.2 Piecewise-Linear Path Algorithms
• 3.8.3 The Dantzig Selector
• 3.8.4 The Grouped Lasso
• 3.8.5 Further Properties of the Lasso
• 3.8.6 Pathwise Coordinate Optimization

3.9 Computational Considerations

Bibliographic Notes

Exercises

4 Linear Methods for Classification

4.1 Introduction

4.2 Linear Regression of an Indicator Matrix

4.3 Linear Discriminant Analysis
• 4.3.1 Regularized Discriminant Analysis
• 4.3.2 Computations for LDA
• 4.3.3 Reduced-Rank Linear Discriminant Analysis

4.4 Logistic Regression
• 4.4.1 Fitting Logistic Regression Models
• 4.4.2 Example: South African Heart Disease
• 4.4.3 Quadratic Approximations and Inference
• 4.4.4 L1 Regularized Logistic Regression
• 4.4.5 Logistic Regression or LDA?

4.5 Separating Hyperplanes
• 4.5.1 Rosenblatt’s Perceptron Learning Algorithm
• 4.5.2 Optimal Separating Hyperplanes

Bibliographic Notes

Exercises

5 Basis Expansions and Regularization

5.1 Introduction

5.2 Piecewise Polynomials and Splines
• 5.2.1 Natural Cubic Splines
• 5.2.2 Example: South African Heart Disease (Continued)
• 5.2.3 Example: Phoneme Recognition

5.3 Filtering and Feature Extraction

5.4 Smoothing Splines
• 5.4.1 Degrees of Freedom and Smoother Matrices

5.5 Automatic Selection of the Smoothing Parameters
• 5.5.1 Fixing the Degrees of Freedom

5.6 Nonparametric Logistic Regression

5.7 Multidimensional Splines

5.8 Regularization and Reproducing Kernel Hilbert Spaces
• 5.8.1 Spaces of Functions Generated by Kernels
• 5.8.2 Examples of RKHS

5.9 Wavelet Smoothing
• 5.9.1 Wavelet Bases and the Wavelet Transform

Bibliographic Notes

Exercises

Appendix: Computational Considerations for Splines

Appendix: B-splines

Appendix: Computations for Smoothing Splines

6 Kernel Smoothing Methods

6.1 One-Dimensional Kernel Smoothers
• 6.1.1 Local Linear Regression
• 6.1.2 Local Polynomial Regression

6.2 Selecting the Width of the Kernel

6.3 Local Regression in ℝ^p

6.4 Structured Local Regression Models in ℝ^p
• 6.4.1 Structured Kernels
• 6.4.2 Structured Regression Functions

6.5 Local Likelihood and Other Models

6.6 Kernel Density Estimation and Classification
• 6.6.1 Kernel Density Estimation
• 6.6.2 Kernel Density Classification
• 6.6.3 The Naive Bayes Classifier

6.7 Radial Basis Functions and Kernels

6.8 Mixture Models for Density Estimation and Classification

6.9 Computational Considerations

Bibliographic Notes

Exercises

7 Model Assessment and Selection

7.1 Introduction

7.2 Bias, Variance and Model Complexity

7.3 The Bias–Variance Decomposition

7.4 Optimism of the Training Error Rate

7.5 Estimates of In-Sample Prediction Error

7.6 The Effective Number of Parameters

7.7 The Bayesian Approach and BIC

7.8 Minimum Description Length

7.9 Vapnik–Chervonenkis Dimension
• 7.9.1 Example (Continued)

7.10 Cross-Validation
• 7.10.1 K-Fold Cross-Validation
• 7.10.2 The Wrong and Right Way to Do Cross-validation
• 7.10.3 Does Cross-Validation Really Work?

7.11 Bootstrap Methods
• 7.11.1 Example (Continued)

7.12 Conditional or Expected Test Error?

Bibliographic Notes

Exercises

8 Model Inference and Averaging

8.1 Introduction

8.2 The Bootstrap and Maximum Likelihood Methods
• 8.2.1 A Smoothing Example
• 8.2.2 Maximum Likelihood Inference
• 8.2.3 Bootstrap versus Maximum Likelihood

8.3 Bayesian Methods

8.4 Relationship Between the Bootstrap and Bayesian Inference

8.5 The EM Algorithm
• 8.5.1 Two-Component Mixture Model
• 8.5.2 The EM Algorithm in General
• 8.5.3 EM as a Maximization–Maximization Procedure

8.6 MCMC for Sampling from the Posterior

8.7 Bagging
• 8.7.1 Example: Trees with Simulated Data

8.8 Model Averaging and Stacking

8.9 Stochastic Search: Bumping

Bibliographic Notes

Exercises

9 Additive Models, Trees, and Related Methods

9.1 Generalized Additive Models
• 9.1.1 Fitting Additive Models
• 9.1.2 Example: Additive Logistic Regression
• 9.1.3 Summary

9.2 Tree-Based Methods
• 9.2.1 Background
• 9.2.2 Regression Trees
• 9.2.3 Classification Trees
• 9.2.4 Other Issues
• 9.2.5 Spam Example (Continued)

9.3 PRIM: Bump Hunting
• 9.3.1 Spam Example (Continued)

9.4 MARS: Multivariate Adaptive Regression Splines
• 9.4.1 Spam Example (Continued)
• 9.4.2 Example (Simulated Data)
• 9.4.3 Other Issues

9.5 Hierarchical Mixtures of Experts

9.6 Missing Data

9.7 Computational Considerations

Bibliographic Notes

Exercises

10 Boosting and Additive Trees

10.1 Boosting Methods
• 10.1.1 Outline of This Chapter

10.2 Boosting Fits an Additive Model

10.3 Forward Stagewise Additive Modeling

10.4 Exponential Loss and AdaBoost

10.5 Why Exponential Loss?

10.6 Loss Functions and Robustness

10.7 “Off-the-Shelf” Procedures for Data Mining

10.8 Example: Spam Data

10.9 Boosting Trees

10.10 Numerical Optimization via Gradient Boosting
• 10.10.1 Steepest Descent
• 10.10.2 Gradient Boosting
• 10.10.3 Implementations of Gradient Boosting

10.11 Right-Sized Trees for Boosting

10.12 Regularization
• 10.12.1 Shrinkage
• 10.12.2 Subsampling

10.13 Interpretation
• 10.13.1 Relative Importance of Predictor Variables
• 10.13.2 Partial Dependence Plots

10.14 Illustrations
• 10.14.1 California Housing
• 10.14.2 New Zealand Fish
• 10.14.3 Demographics Data

Bibliographic Notes

Exercises

11 Neural Networks

11.1 Introduction

11.2 Projection Pursuit Regression

11.3 Neural Networks

11.4 Fitting Neural Networks

11.5 Some Issues in Training Neural Networks
• 11.5.1 Starting Values
• 11.5.2 Overfitting
• 11.5.3 Scaling of the Inputs
• 11.5.4 Number of Hidden Units and Layers
• 11.5.5 Multiple Minima

11.6 Example: Simulated Data

11.7 Example: ZIP Code Data

11.8 Discussion

11.9 Bayesian Neural Nets and the NIPS 2003 Challenge
• 11.9.1 Bayes, Boosting and Bagging
• 11.9.2 Performance Comparisons

11.10 Computational Considerations

Bibliographic Notes

Exercises

12 Support Vector Machines and Flexible Discriminants

12.1 Introduction

12.2 The Support Vector Classifier
• 12.2.1 Computing the Support Vector Classifier
• 12.2.2 Mixture Example (Continued)

12.3 Support Vector Machines and Kernels
• 12.3.1 Computing the SVM for Classification
• 12.3.2 The SVM as a Penalization Method
• 12.3.3 Function Estimation and Reproducing Kernels
• 12.3.4 SVMs and the Curse of Dimensionality
• 12.3.5 A Path Algorithm for the SVM Classifier
• 12.3.6 Support Vector Machines for Regression
• 12.3.7 Regression and Kernels
• 12.3.8 Discussion

12.4 Generalizing Linear Discriminant Analysis

12.5 Flexible Discriminant Analysis
• 12.5.1 Computing the FDA Estimates

12.6 Penalized Discriminant Analysis

12.7 Mixture Discriminant Analysis
• 12.7.1 Example: Waveform Data

Bibliographic Notes

Exercises

13 Prototype Methods and Nearest-Neighbors

13.1 Introduction

13.2 Prototype Methods
• 13.2.1 K-means Clustering
• 13.2.2 Learning Vector Quantization
• 13.2.3 Gaussian Mixtures

13.3 k-Nearest-Neighbor Classifiers
• 13.3.1 Example: A Comparative Study
• 13.3.2 Example: k-Nearest-Neighbors and Image Scene Classification
• 13.3.3 Invariant Metrics and Tangent Distance

13.4 Adaptive Nearest-Neighbor Methods
• 13.4.1 Example
• 13.4.2 Global Dimension Reduction for Nearest-Neighbors

13.5 Computational Considerations

Bibliographic Notes

Exercises

14 Unsupervised Learning

14.1 Introduction

14.2 Association Rules
• 14.2.1 Market Basket Analysis
• 14.2.2 The Apriori Algorithm
• 14.2.3 Example: Market Basket Analysis
• 14.2.4 Unsupervised as Supervised Learning
• 14.2.5 Generalized Association Rules
• 14.2.6 Choice of Supervised Learning Method
• 14.2.7 Example: Market Basket Analysis (Continued)

14.3 Cluster Analysis
• 14.3.1 Proximity Matrices
• 14.3.2 Dissimilarities Based on Attributes
• 14.3.3 Object Dissimilarity
• 14.3.4 Clustering Algorithms
• 14.3.5 Combinatorial Algorithms
• 14.3.6 K-means
• 14.3.7 Gaussian Mixtures as Soft K-means Clustering
• 14.3.8 Example: Human Tumor Microarray Data
• 14.3.9 Vector Quantization
• 14.3.10 K-medoids
• 14.3.11 Practical Issues
• 14.3.12 Hierarchical Clustering

14.4 Self-Organizing Maps

14.5 Principal Components, Curves and Surfaces
• 14.5.1 Principal Components
• 14.5.2 Principal Curves and Surfaces
• 14.5.3 Spectral Clustering
• 14.5.4 Kernel Principal Components
• 14.5.5 Sparse Principal Components

14.6 Non-negative Matrix Factorization
• 14.6.1 Archetypal Analysis

14.7 Independent Component Analysis and Exploratory Projection Pursuit
• 14.7.1 Latent Variables and Factor Analysis
• 14.7.2 Independent Component Analysis
• 14.7.3 Exploratory Projection Pursuit
• 14.7.4 A Direct Approach to ICA

14.8 Multidimensional Scaling

14.9 Nonlinear Dimension Reduction and Local Multidimensional Scaling

Bibliographic Notes

Exercises

15 Random Forests

15.1 Introduction

15.2 Definition of Random Forests

15.3 Details of Random Forests
• 15.3.1 Out of Bag Samples
• 15.3.2 Variable Importance
• 15.3.3 Proximity Plots
• 15.3.4 Random Forests and Overfitting

15.4 Analysis of Random Forests
• 15.4.1 Variance and the De-Correlation Effect
• 15.4.2 Bias

Bibliographic Notes

Exercises

16 Ensemble Learning

16.1 Introduction

16.2 Boosting and Regularization Paths
• 16.2.1 Penalized Regression
• 16.2.2 The “Bet on Sparsity” Principle
• 16.2.3 Regularization Paths, Over-fitting and Margins

16.3 Learning Ensembles
• 16.3.1 Learning a Good Ensemble
• 16.3.2 Rule Ensembles

Bibliographic Notes

Exercises

17 Undirected Graphical Models

17.1 Introduction

17.2 Markov Graphs and Their Properties

17.3 Undirected Graphical Models for Continuous Variables
• 17.3.1 Estimation of the Parameters when the Graph Structure is Known
• 17.3.2 Estimation of the Graph Structure

17.4 Undirected Graphical Models for Discrete Variables
• 17.4.1 Estimation of the Parameters when the Graph Structure is Known
• 17.4.2 Hidden Nodes
• 17.4.3 Estimation of the Graph Structure
• 17.4.4 Restricted Boltzmann Machines

Exercises

18 High-Dimensional Problems: p ≫ N

18.1 When p is Much Bigger than N

18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids

18.3 Linear Classifiers with Quadratic Regularization
• 18.3.1 Regularized Discriminant Analysis
• 18.3.2 Logistic Regression with Quadratic Regularization
• 18.3.3 The Support Vector Classifier
• 18.3.4 Feature Selection
• 18.3.5 Computational Shortcuts When p ≫ N

18.4 Linear Classifiers with L1 Regularization
• 18.4.1 Application of Lasso to Protein Mass Spectroscopy
• 18.4.2 The Fused Lasso for Functional Data

18.5 Classification When Features are Unavailable
• 18.5.1 Example: String Kernels and Protein Classification
• 18.5.2 Classification and Other Models Using Inner-Product Kernels and Pairwise Distances
• 18.5.3 Example: Abstracts Classification

18.6 High-Dimensional Regression: Supervised Principal Components
• 18.6.1 Connection to Latent-Variable Modeling
• 18.6.2 Relationship with Partial Least Squares
• 18.6.3 Pre-Conditioning for Feature Selection

18.7 Feature Assessment and the Multiple-Testing Problem
• 18.7.1 The False Discovery Rate
• 18.7.2 Asymmetric Cutpoints and the SAM Procedure
• 18.7.3 A Bayesian Interpretation of the FDR

18.8 Bibliographic Notes

Exercises

References

Author Index

Index