I am revisiting the book Introduction to Statistical Learning with Applications in R after 7 years. I read and worked through it for the first time back in 2013. Needless to say, the R ecosystem has expanded greatly since then; I have done many projects in the field of data science and have grown a bit wiser. This blogpost summarizes my re-learnings from this fascinating book, which serves as a bridge to Elements of Statistical Learning, a book considered the bible of statistical learning.

Statistical Learning

  • Brief History of Statistical learning
    • 19th century – Legendre and Gauss method of least squares
    • 1936 – Linear Discriminant Analysis by Fisher
    • 1940s – Logistic Regression
    • Early 1970s – Generalized Linear Models by Nelder and Wedderburn
    • 1980s – By the 1980s, computing technology had improved enough that Breiman, Friedman, Olshen and Stone could introduce classification and regression trees. The principles of cross-validation were also laid out around this time.
    • 1986 – Hastie and Tibshirani coined the term generalized additive models for a class of non-linear extensions to generalized linear models, and also provided a practical software implementation.
  • Looking at an ML model through the lens of interpretability vs. flexibility
  • Distinguishing the purpose of an ML model – inference vs. prediction
  • Bayes classifier – assigns an observation with predictor vector $x_0$ to the class $j$ for which $P(Y=j|X=x_0)$ is largest
  • Bayes error rate – the unattainable gold standard against which to compare other methods:

$$ 1-E(\max_j P(Y=j|X)) $$
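
To make the formula concrete, here is a minimal R sketch (my own, not from the book) that estimates the Bayes error rate by Monte Carlo for a simulated two-class problem in which the true class densities are known:

```r
# Two classes with equal priors: X | Y=0 ~ N(0,1) and X | Y=1 ~ N(2,1)
set.seed(1)
n <- 1e5
y <- rbinom(n, 1, 0.5)
x <- rnorm(n, mean = ifelse(y == 1, 2, 0), sd = 1)

# Posterior P(Y = 1 | X = x) via Bayes' rule using the true densities
post1 <- dnorm(x, 2, 1) * 0.5 / (dnorm(x, 2, 1) * 0.5 + dnorm(x, 0, 1) * 0.5)

# Bayes error rate = 1 - E[max_j P(Y = j | X)]; roughly 0.16 for this setup
mean(1 - pmax(post1, 1 - post1))
```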

Linear Regression

  • If one wants to follow the traditional route of statistical testing and inference, one should compute the standard errors of the coefficients, the residual sum of squares and the total sum of squares for a regression model. These quantities are what one needs to compute the adjusted R-squared and to test for non-zero coefficients in the regression equation.
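
As a quick illustration, a plain lm fit in R already reports all of these quantities; the sketch below uses the built-in mtcars data purely as a stand-in:

```r
# Illustrative only: ordinary least squares on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)        # coefficient estimates, standard errors, t-tests,
                    # residual standard error, (adjusted) R-squared, F-statistic
confint(fit)        # confidence intervals for the coefficients
anova(fit)          # sequential sums of squares
```
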
  • If you regress $Y$ on $X_1$ in one model, and on $X_1$ and $X_2$ in another, it is possible that the coefficient of $X_1$ is statistically significant in the first equation but not in the second. The reason could be that there is a healthy correlation between $X_1$ and $X_2$: in the first model, $X_1$ appears significant because it is partly proxying for the effect of $X_2$, which is not captured in that equation.
  • In a regression equation $Y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$, it is possible that all the true coefficients are 0. If you do an individual test for each coefficient, there is a high chance that you will incorrectly conclude that a relationship exists, purely by chance. The overall F-statistic does not suffer from this problem because it adjusts for the number of predictors, as the simulation below illustrates.
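
A small simulation of that point (my own construction): with 100 pure-noise predictors, a handful of individual t-tests come out significant at the 5% level, while the overall F-test typically does not:

```r
# All true coefficients are zero: y is unrelated to the 100 predictors
set.seed(42)
n <- 200; p <- 100
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

fit  <- lm(y ~ x)
pval <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
sum(pval < 0.05)        # a handful of predictors look "significant" purely by chance

f <- summary(fit)$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)   # overall F-test p-value, typically large
```
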
  • Two of the most important assumptions state that the relationship between the predictors and response are additive and linear. The additive assumption means that the effect of changes in a predictor $X_j$ on the response $Y$ is independent of the values of the other predictors. The linear assumption states that the change in the response $Y$ due to a one-unit change in the $X_j$ is constant, regardless of the value of $X_j$
  • A possible remedy for the problem of heteroscedasticity is a concave transformation of the response variable, such as $\log Y$ or $\sqrt{Y}$. This results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.
  • Weighted least squares is another remedy to solve heteroscedasticity, if you have a prior idea about the variance of the response.
  • A classic method to identify outliers is a studentized residual plot, i.e. a plot of studentized residuals vs. fitted values. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
  • A simple analysis of the correlation matrix of the predictors might not reveal the problem. It is possible for collinearity to exist between three or more variables even if no single pair exhibits a particularly large correlation; this goes by the name multicollinearity. The best way to identify it is via the variance inflation factor (VIF), as in the diagnostics sketch at the end of this section.
  • Two ways to tackle multicollinearity
    • Remove the variable
    • Group all the variables that are jointly collinear and create a single factor
  • Leverage is a characteristic of the predictors: a high-leverage observation has predictor values far removed from the bulk of the predictor space
  • Outlier behavior is detected by looking at the response variable
  • If an observation is both an outlier and a high-leverage point, it is a deadly combination that results in unstable estimates: the response for the observation is extreme, and the observation is also extreme in the predictor space
  • One can compute the leverage statistic for each observation and plot the values to detect high-leverage observations
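
A minimal sketch of the diagnostics mentioned above (studentized residuals, leverage, VIF), assuming the car package is available for the VIF computation; everything else is base R and mtcars is used purely for illustration:

```r
# Diagnostics for a least squares fit (illustrative)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

rs  <- rstudent(fit)                 # studentized residuals; |rs| > 3 flags outliers
lev <- hatvalues(fit)                # leverage statistic for each observation
plot(lev, rs, xlab = "Leverage", ylab = "Studentized residual")

# Variance inflation factors to detect multicollinearity (assumes the car package)
# install.packages("car")
car::vif(fit)                        # values well above ~5-10 are worrying
```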

Classification

  • Linear Discriminant Analysis is popular when we have more than two response classes
  • If the sample size is small and the distribution of predictors is approximately normal, the linear discriminant model is more stable than the logistic regression model
  • The word linear in the classifier’s name stems from the fact that the discriminant functions $\hat \delta_k(x)$ are linear functions of $x$

$$ \hat \delta_k(x) = x \cdot {\hat \mu_k \over \hat \sigma^2} - {\hat \mu_k^2 \over 2 \hat \sigma^2} + \log (\hat \pi_k) $$

  • LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance $\sigma^2$, and plugging estimates for these parameters into the Bayes classifier
  • In the case of $p>1$ predictors, the LDA classifier assumes that the observations in the $k$th class are drawn from a multivariate Gaussian distribution $N(\mu_k, \Sigma)$ where $\mu_k$ is a class-specific mean vector and $\Sigma$ is a covariance matrix that is common to all $K$ classes

$$ \delta_k(x) = x^T \Sigma^{-1} \mu_k - 1/2 \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k $$

  • Default probability modeling - Confusion matrix
    • The sensitivity is the percentage of true defaulters that are correctly identified (the true positive rate)
    • The specificity is the percentage of non-defaulters that are correctly identified ($1 - \text{false positive rate}$)
  • In some cases with unbalanced data, LDA might do a poor job of identifying the defaulters even though the overall accuracy is good. Why is this? LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers
  • Bayes classifier will yield the smallest possible total number of misclassified observations, irrespective of which class the errors come from
  • QDA assumes that each class has its own covariance matrix. Thus QDA is more flexible than LDA.
  • When the true decision boundaries are linear, then LDA and logistic regression approaches will tend to perform well
  • When boundaries are moderately non-linear, QDA may give better results
  • For much more complicated boundaries, a non parametric approach such as KNN can be superior. But the level of smoothness for non parametric method needs to be chosen carefully.
  • One can think of QDA as a middle ground between LDA and KNN-type methods
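
A short illustrative sketch using the MASS package (which ships with R and provides lda() and qda()), fitting both classifiers on a train/test split of the built-in iris data and tabulating a confusion matrix from which sensitivity and specificity can be read off:

```r
library(MASS)                                  # lda() and qda()
set.seed(1)
train <- sample(nrow(iris), 100)

lda_fit <- lda(Species ~ ., data = iris, subset = train)
qda_fit <- qda(Species ~ ., data = iris, subset = train)

lda_pred <- predict(lda_fit, iris[-train, ])$class
table(predicted = lda_pred, actual = iris$Species[-train])   # confusion matrix

qda_pred <- predict(qda_fit, iris[-train, ])$class
mean(qda_pred == iris$Species[-train])                       # test accuracy
```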

Resampling Methods

  • What is the advantage of LOOCV over the validation set approach? In LOOCV, the statistical learning method is repeatedly fit using training sets that contain $n-1$ observations, almost as many as are in the entire data set. Hence LOOCV does not overestimate the test error rate as much as the validation set approach does
  • What is a computational shortcut for evaluating LOOCV in the case of least squares linear (or polynomial) regression? It can be expressed in terms of the leverage statistic of the individual observations:

$$ CV_{(n)} = {1 \over n} \sum^n_{i=1} \left ( y_i - \hat y_i \over 1 - h_i \right)^2 $$

  • The biggest advantage of LOOCV method is that it can be used for a wide variety of methods
  • The advantage of the k-fold cross-validation procedure over LOOCV is that it is not as computationally expensive as the latter. It also tends to have lower variance than the latter
  • When one uses the cv.glm function to do cross-validation, the returned object has two delta values; the first one is the standard k-fold CV estimate and the second one is a bias-corrected version
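
A sketch of LOOCV and 10-fold CV with cv.glm from the boot package, plus the leverage-based shortcut from the formula above (mtcars is used purely for illustration):

```r
library(boot)
fit <- glm(mpg ~ wt + hp, data = mtcars)       # glm with default family = gaussian

cv.glm(mtcars, fit)$delta                      # LOOCV (default K = n): raw and
                                               # bias-corrected estimates of test MSE
cv.glm(mtcars, fit, K = 10)$delta              # 10-fold CV

# LOOCV shortcut for least squares, using the leverage statistic h_i;
# should match the raw LOOCV estimate above
mean(((mtcars$mpg - fitted(fit)) / (1 - hatvalues(fit)))^2)
```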

Linear Model Selection and Regularization

  • When the number of predictors is large, there are several ways to choose the best model
    • Best subset selection
    • Forward selection
    • Backward selection
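
A sketch of these selection strategies using regsubsets from the leaps package (assumed installed); the criteria discussed next ($C_p$, BIC, adjusted R-squared) can be read straight off the summary object:

```r
# install.packages("leaps")
library(leaps)

# Best subset selection (use method = "forward" or "backward" for stepwise variants)
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
s   <- summary(fit)

which.min(s$cp)               # model size chosen by Cp
which.min(s$bic)              # model size chosen by BIC
which.max(s$adjr2)            # model size chosen by adjusted R-squared
coef(fit, which.min(s$bic))   # coefficients of the BIC-selected model
```
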
  • $C_p$ estimate

$$ C_p = {1 \over n} (RSS + 2 d \hat \sigma^2) $$

  • AIC criterion

$$ AIC = {1 \over n \hat \sigma^2} (RSS + 2 d \hat \sigma^2) $$

  • BIC criterion

$$ BIC = {1 \over n \hat \sigma^2} (RSS + \log(n) d \hat \sigma^2) $$

  • BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than $C_p$
  • Adjusted R-squared

$$ \text{Adjusted } R^2 = 1 - {RSS/(n-d-1) \over TSS/(n-1) } $$

  • One-standard-error rule : Calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point of the curve. The rationale here is that if a set of models appears to be more or less equally good, then we might as well choose the simplest model
  • Why use shrinkage methods? Because they reduce the variance of the coefficient estimates
  • Why should predictor variables be standardized before applying shrinkage methods ? Shrinkage methods such as Ridge regression or Lasso work by applying penalty on the coefficients. The value of the coefficients will vary depending on the scale used to measure the predictor variables. By standardizing the predictors, the coefficients are treated on par and hence shrinkage based methods have a better opportunity to perform
  • Ridge regression works best in situations where the least squares estimates have high variance
  • In situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance
  • One can interpret ridge regression and lasso as computationally feasible alternatives to best subset selection.
  • Neither ridge regression nor the lasso will universally dominate the other.
  • One can formulate the estimation of parameters in lasso and ridge regression from a Bayesian perspective. If one assumes that the prior on the $\beta$'s is a Gaussian distribution, the posterior mode for $\beta$ is the ridge regression solution. If one assumes a Laplace (double-exponential) prior, then the posterior mode is the lasso solution
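
A minimal sketch with the glmnet package (assumed installed): alpha = 0 gives ridge, alpha = 1 gives the lasso, predictors are standardized internally by default, and cv.glmnet reports both the minimum-CV lambda and the one-standard-error lambda mentioned earlier:

```r
# install.packages("glmnet")
library(glmnet)

x <- model.matrix(mpg ~ . - 1, data = mtcars)   # predictor matrix (tiny, illustrative)
y <- mtcars$mpg

ridge <- cv.glmnet(x, y, alpha = 0)             # ridge regression, 10-fold CV
lasso <- cv.glmnet(x, y, alpha = 1)             # lasso

coef(ridge, s = "lambda.min")                   # shrunken but dense coefficients
coef(lasso, s = "lambda.1se")                   # sparse fit via the one-SE rule
```
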
  • All the dimension reduction methods work in two steps. First, the transformed predictors are obtained. Second, the model is fit using these predictors. There could be many ways to transform the predictors. Two popular methods are PCR and PLS
  • If the predictor space has multicollinearity, the reported coefficients have high standard errors, and the importance of each variable is masked. One way to get around the collinearity is to construct a small number of components, i.e. linear combinations of the original predictors, and use these as the new predictors.
  • PCR is a method that does not perform feature selection. Hence it is closer to Ridge regression method than lasso
  • Drawback of PCR is that there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response
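
A sketch with the pls package (assumed installed), which implements both PCR and PLS and reports cross-validated error for each number of components:

```r
# install.packages("pls")
library(pls)

pcr_fit <- pcr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")
pls_fit <- plsr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")

summary(pcr_fit)                             # CV error by number of components
validationplot(pcr_fit, val.type = "MSEP")   # pick the number of components visually
```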

Moving beyond Linearity

  • The basis function approach is one where a family of functions or transformations is applied to a variable $X$, such as $b_1(X), b_2(X), b_3(X), \ldots, b_K(X)$. Instead of fitting a linear model in $X$, we fit the model

$$ y_i = \beta_0 + \beta_1 b_1(x_i) + \ldots + \beta_K b_K(x_i) + \epsilon_i $$

  • The term regression spline should be associated with the phrase piecewise polynomial with constraints.
  • A cubic spline with $K$ knots uses a total of $4+K$ degrees of freedom
  • In order to fit a cubic spline to a dataset with $K$ knots, we perform least squares regression with an intercept and $3+K$ predictors, of the form $X, X^2, X^3, h(X, \xi_1), h(X, \xi_2), \ldots h(X, \xi_K)$ where $\xi_1, \xi_2, \ldots, \xi_K$ are the knots. This amounts to estimating a total of $K+4$ regression coefficients; for this reason, fitting a cubic spline with $K$ knots uses $K+4$ degrees of freedom
  • A natural cubic spline is a regression spline with additional boundary constraints; the function is required to be linear at the boundary. This additional constraint means that natural splines generally produce more stable estimates at the boundaries
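
A sketch of regression splines using the splines package that ships with R: bs() builds a cubic-spline basis and ns() a natural-spline basis. The knot locations and dataset below are arbitrary illustrations:

```r
library(splines)

# Cubic spline with K = 3 interior knots -> 3 + K = 6 basis functions plus intercept
fit_bs <- lm(mpg ~ bs(hp, knots = c(100, 150, 200)), data = mtcars)

# Natural cubic spline: linear beyond the boundary knots, more stable at the edges
fit_ns <- lm(mpg ~ ns(hp, df = 4), data = mtcars)

plot(mtcars$hp, mtcars$mpg, xlab = "hp", ylab = "mpg")
ord <- order(mtcars$hp)
lines(mtcars$hp[ord], fitted(fit_bs)[ord], col = "red")
lines(mtcars$hp[ord], fitted(fit_ns)[ord], col = "blue")
```
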
  • Local regression is an approach where one fits flexible non-linear functions, which involve computing the fit at a target point $x_0$ using only the nearby training observations
  • Generalized additive models provide a framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity
  • GAM model

$$ y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \ldots + f_p(x_{ip}) + \epsilon_i $$

  • The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed.
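
A sketch of a GAM fit using the mgcv package that ships with R (the choice of package, formula and data is my own illustration, not the book's):

```r
library(mgcv)   # one of several GAM implementations in R

# Smooth non-linear terms for wt and hp, a linear term for am (illustrative model)
fit <- gam(mpg ~ s(wt, k = 5) + s(hp, k = 5) + am, data = mtcars)

summary(fit)            # approximate significance of each smooth term
plot(fit, pages = 1)    # the fitted functions f_j, one panel per smooth term
```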

Tree Based Methods

  • These models are good for interpretability, but a single tree is typically not competitive with the best supervised learning methods in terms of predictive accuracy
  • It is computationally infeasible to consider every possible partition of the feature space into high-dimensional boxes. Hence a top-down, greedy approach known as recursive binary splitting is taken.
    • It is called top-down because it starts with one region that contains all the points and recursively splits it into further regions
    • It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead
  • Cost complexity pruning is a way to avoid overfitting. The formulation is similar to lasso, i.e. one imposes a penalty parameter $\alpha$ for the number of terminal nodes.
  • When building a classification tree, either the Gini index or the entropy is typically used to evaluate the quality of a particular split, since these two measures are more sensitive to node purity than the classification error rate is
  • Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. Each time a split is considered, only a random subset of the predictors is eligible for the split. Hence if there are subsets of strongly correlated predictors, this method ensures that the resulting ensemble averages over a collection of decorrelated trees
  • Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set
  • Boosting has three tuning parameters
    • The number of trees
    • The shrinkage parameter
    • The number of splits in each tree
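
A sketch contrasting the three approaches, assuming the rpart, randomForest and gbm packages are installed (the book's lab uses the tree package for single trees; rpart is a close substitute). Note how the three boosting tuning parameters listed above map onto gbm arguments:

```r
library(MASS)            # for the Boston housing data
library(rpart); library(randomForest); library(gbm)

# Single regression tree, grown by recursive binary splitting
tree_fit <- rpart(medv ~ ., data = Boston)
plotcp(tree_fit)                          # CV error vs the cost-complexity parameter

# Random forest: only mtry predictors are considered at each split,
# which decorrelates the trees
rf_fit <- randomForest(medv ~ ., data = Boston, mtry = 4, importance = TRUE)

# Boosting: the three tuning parameters are explicit arguments
gbm_fit <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.01, interaction.depth = 2)
```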

SVM

  • Maximal margin hyperplane: It is the separating hyperplane for which the margin is the largest - that is, it is the hyperplane that has the farthest minimum distance to the training observations. We can then classify a test observation based on which side of the maximal margin hyperplane it lies.
  • If the classes are perfectly separable, then there are infinitely many hyperplanes that separate the two classes
  • Maximal margin classifier cast as an optimization problem
  • Support Vector classifier cast as an optimization problem
  • Why use kernel functions?
  • Radial basis kernels and the infinite-dimensional feature space they implicitly correspond to
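
A sketch with the e1071 package (assumed installed), fitting a radial-kernel SVM on simulated data and tuning cost and gamma by cross-validation:

```r
library(e1071)

# Simulated two-class data with a non-linear (roughly circular) boundary
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(rowSums(x^2) > 1.4, "A", "B"))
dat <- data.frame(x = x, y = y)

svm_fit <- svm(y ~ ., data = dat, kernel = "radial", cost = 1, gamma = 1)
plot(svm_fit, dat)                          # decision boundary and support vectors

# Cross-validated tuning over cost and gamma
tuned <- tune(svm, y ~ ., data = dat,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
summary(tuned)
```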

Unsupervised Learning

  • PCA refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data.
  • PCA finds a low-dimensional representation of a data set that contains as much as possible of the variation.
  • Ways to formulate the PCA problem as an optimization problem (maximizing variance subject to a normalization constraint)
  • Results obtained via PCA will depend on whether the individual variables have been standardized or not
  • An $n\times p$ data matrix $X$ has $\min(n-1,p)$ distinct principal components
  • One can decide the number of principal components required by looking at a scree plot
  • Clustering looks to find homogeneous subgroups among the observations
  • Hierarchical clustering is an approach that does not require committing in advance to a particular number of clusters
  • Four commonly used types of linkage in hierarchical clustering
    • complete
    • single
    • average
    • centroid
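
A sketch using base R only: prcomp for PCA with scaling, a scree plot, and hclust for hierarchical clustering with complete linkage (USArrests is just a convenient built-in dataset):

```r
# PCA on the numeric USArrests data
pr <- prcomp(USArrests, scale. = TRUE)    # standardize variables before PCA
summary(pr)                               # proportion of variance explained
screeplot(pr, type = "lines")             # scree plot to pick the number of PCs
biplot(pr, scale = 0)

# Hierarchical clustering with complete linkage on the scaled data
hc <- hclust(dist(scale(USArrests)), method = "complete")
plot(hc)                                  # dendrogram
cutree(hc, k = 4)                         # cut the dendrogram into 4 clusters
```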

Takeaway

This book is brilliantly written and targeted toward someone who wants to get a high-level overview of various ML methods, as well as work through some code that goes with each method. The closest book in the Python world is the one written by Aurelien Geron. However, the math in this book is a little more detailed and acts as a primer to Elements of Statistical Learning. In any case, despite reading it a second time, there are a ton of things that I have learned to see differently.