I strongly believe that you can learn stats by running two processes in parallel: working on understanding the theory AND simulating data to learn the implementation details behind that theory. Hence, while learning about linear models, I used this book to learn the R commands for running them. The book takes you through all the possible nuances of a linear model. Let me summarize it.


Estimation

  • Estimation of various parameters in a linear model from scratch

  • The problem of identifiability arises when the model matrix is not of full rank, so that X'X is not invertible.

  • Check the eigenvalues of X'X (equivalently, the singular values of the design matrix). If any of them is exactly 0 you have an identifiability problem; if any is close to 0 you have a collinearity problem.

  • A clear lack of identifiability is relatively benign, as the software throws an error or a clear warning. A situation that is merely close to unidentifiability is a bigger problem: it becomes the analyst's responsibility to inspect and interpret the standard errors of the model.

  • If I have two models, M1 and M2, and the only difference between them is one explanatory variable, then anova(M1, M2) performs an F test for the significance of the additional variable

  • When a single parameter is being tested, the t statistic equals the square root of the F statistic (see the sketch after this list)
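
To make the last two bullets concrete, here is a minimal sketch on simulated data (the variable names are my own, not from the book):

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

M1 <- lm(y ~ x1)        # smaller model
M2 <- lm(y ~ x1 + x2)   # adds one explanatory variable

anova(M1, M2)                                 # F test for the added variable
summary(M2)$coefficients["x2", "t value"]^2   # equals the F statistic above
```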

Inference

  • Distribution of Beta, mu and variance of beta

  • Use of I() for hypothesis testing

  • Use of offset() for hypothesis testing (both I() and offset() are demonstrated in the sketch after this list)

  • From the t statistic of a standard regression model, one can get a confidence band for the parameter. One can arrive at the same statistic by doing the following:

  • Fit a model with all the variables

  • Fit a model with all the variables except the one that you want to test

  • Do an anova of the two models; the F statistic you get is nothing but the square of the reported t statistic

  • Suppose there is a bookstore that offers a discount depending on the genre of the book. If I pick up a non-fiction finance book, I can ask two questions:

  • Given that I have chosen a non-fiction finance book, what is the average discount that I can expect?

  • Given that I have chosen a non-fiction finance book, in what band will the price I actually pay fall?

  • The thing to note is that for the former question the variance of beta suffices, but the latter question needs to account for the error variance too

  • predict() with interval = "confidence" gives an interval for the mean of the response variable at a specific value of the independent variable

  • predict() with interval = "prediction" gives an interval for an individual response at a specific value of the independent variable (see the sketch after this list)
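
A minimal sketch on simulated data tying together the offset()/I() tests and the two kinds of intervals from predict() (the variable names are illustrative, not from the book):

```r
set.seed(1)
price    <- runif(200, 10, 50)
pages    <- rnorm(200, 300, 50)
discount <- 2 + 0.3 * price + 0.01 * pages + rnorm(200, sd = 2)
full     <- lm(discount ~ price + pages)

# Test H0: coefficient of price equals 0.3, by fixing it with offset()
restricted <- lm(discount ~ pages + offset(0.3 * price))
anova(restricted, full)

# Test H0: the two coefficients are equal, by forcing a common slope with I()
common <- lm(discount ~ I(price + pages))
anova(common, full)

# Mean response vs. individual response at price = 30, pages = 300
new <- data.frame(price = 30, pages = 300)
predict(full, new, interval = "confidence")   # band for the average discount
predict(full, new, interval = "prediction")   # wider band: includes the error variance
```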

Diagnostics

  • Cook's distance – what is it and how to compute it?

  • The hat value measures leverage – what is it and how to compute it?

  • Added-variable plots – how to draw one?

  • Durbin-Watson test – how to compute it?

  • Leverage reflects the spread of a point in the independent variables

  • Cook's distance measures the influence of a specific point on the slope and intercept of the model

  • How to draw an added-variable plot (see the sketch after this list)
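
A minimal sketch of these diagnostics on simulated data; cooks.distance() and hatvalues() are in base R, while the Durbin-Watson test assumes the lmtest package is installed:

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 1 + x1 - x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)

hatvalues(fit)        # leverage: how unusual a point is in the predictor space
cooks.distance(fit)   # influence of a point on the fitted coefficients

# Durbin-Watson test for serial correlation in the residuals
# (requires install.packages("lmtest"))
lmtest::dwtest(fit)

# Added-variable (partial regression) plot for x1, drawn by hand:
# residuals of y ~ x2 against residuals of x1 ~ x2
plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)),
     xlab = "x1 | x2", ylab = "y | x2", main = "Added-variable plot for x1")
```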

Problems with Predictors

  • Measurement error of the independent variable causes a bias in the estimates

  • Changing the scale of a variable also affects the parameter estimates

  • Rescale X – t, F, R-squared, and sigma-squared stay the same, whereas beta is rescaled

  • Rescale Y – t, F, and R-squared stay the same, whereas beta and sigma-squared are rescaled

  • Collinearity

  • Condition index

  • Variance inflation factor (both diagnostics are computed in the sketch after this list)
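
A minimal sketch of the collinearity diagnostics on simulated data with two nearly collinear predictors: the variance inflation factor computed by hand, and the condition number of the model matrix from kappa():

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)   # nearly collinear with x1
x3 <- rnorm(100)
y  <- 1 + x1 + x3 + rnorm(100)
fit <- lm(y ~ x1 + x2 + x3)

# VIF of x1, computed by hand: 1 / (1 - R^2) from regressing x1 on the others
1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)

# Condition number: ratio of largest to smallest singular value of the
# model matrix; large values flag near-unidentifiability
X <- model.matrix(fit)
kappa(X, exact = TRUE)
```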

Problems with Error

  • Bootstrap regression

  • Robust regression

  • Weighted Least Squares (see the sketch after this list)

  • Generalized Least Squares
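
A minimal sketch of two of these remedies on simulated heteroscedastic data: weighted least squares through the weights argument of lm(), and a simple case-resampling bootstrap of the slope. (Robust regression and GLS would typically use MASS::rlm() and nlme::gls(), not shown here.)

```r
set.seed(1)
x <- runif(100, 1, 10)
y <- 1 + 2 * x + rnorm(100, sd = x)    # error variance grows with x

# Weighted least squares: weight inversely to the (assumed) variance
wls <- lm(y ~ x, weights = 1 / x^2)
summary(wls)

# Bootstrap the slope by resampling cases
boot_slope <- replicate(1000, {
  i <- sample(length(y), replace = TRUE)
  coef(lm(y[i] ~ x[i]))[2]
})
quantile(boot_slope, c(0.025, 0.975))  # percentile confidence interval
```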

Transformation

  • In reality, the error variance is often not constant

  • Log transformation is one of the easiest

  • A log-log transformation is also useful, as it removes the nonlinearity in the relationship and makes it linear. Yvonne Bishop is credited with the development of log-linear models

  • The Box-Cox transformation can be applied to the response variable so that it becomes more tractable analytically

  • Build a confidence interval for the Box-Cox parameter lambda; based on where the band lies, you can decide whether you really need a transformation or not (see the sketch after this list)

  • The logit transformation for cases where the y variable is a proportion

  • The Fisher z transformation for cases where the y variable represents a correlation

  • Hockey Stick regression

  • Spline regression

  • Polynomial regression

  • Orthogonal regression
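
A minimal sketch of the Box-Cox step, assuming the MASS package that ships with R:

```r
library(MASS)
set.seed(1)
x <- runif(100, 1, 10)
y <- exp(1 + 0.3 * x + rnorm(100, sd = 0.2))   # positive, right-skewed response
fit <- lm(y ~ x)

# boxcox() plots the profile log-likelihood of lambda with a 95% interval;
# if the interval covers 1, no transformation is really needed,
# if it covers 0, a log transformation is suggested
bc <- boxcox(fit, lambda = seq(-2, 2, 0.1))
bc$x[which.max(bc$y)]                          # point estimate of lambda
```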

Variable Selection

  • Forward + Backward + Stepwise regression

  • Information criteria such as AIC and BIC (see the sketch after this list)

  • Mallows' Cp criterion
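
A minimal sketch of stepwise selection with step(), which uses AIC by default; setting k = log(n) switches the penalty to BIC:

```r
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

full <- lm(y ~ ., data = d)
step(full, direction = "backward")                  # backward elimination by AIC
step(lm(y ~ 1, data = d), scope = formula(full),
     direction = "forward")                         # forward selection by AIC
step(full, direction = "both", k = log(n))          # stepwise selection by BIC
```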

Shrinkage Methods

  • PCA based regression

  • Partial least squares regression

  • Ridge regression (see the sketch after this list)
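
A minimal sketch of two of these shrinkage methods: principal-component regression done by hand with prcomp(), and ridge regression using MASS::lm.ridge():

```r
library(MASS)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.1)        # induce collinearity
y <- 1 + X[, 1] + X[, 3] + rnorm(n)

# PCA-based regression: regress y on the first few principal components
pc  <- prcomp(X, scale. = TRUE)
pcr <- lm(y ~ pc$x[, 1:3])
summary(pcr)

# Ridge regression over a grid of penalties; pick lambda by GCV
rr <- lm.ridge(y ~ X, lambda = seq(0, 10, 0.1))
rr$lambda[which.min(rr$GCV)]
```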

Treatment of Missing Values

  • Removal of outliers

  • Imputing the mean

  • Regression fill-in method (see the sketch after this list)
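
A minimal sketch of mean imputation and the regression fill-in idea on a toy data frame (the names are illustrative):

```r
set.seed(1)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + d$x1 + d$x2 + rnorm(30)
d$x2[sample(30, 5)] <- NA               # knock out a few x2 values

# Mean imputation
d_mean <- d
d_mean$x2[is.na(d$x2)] <- mean(d$x2, na.rm = TRUE)

# Regression fill-in: predict the missing x2 from the fully observed variable
fill <- lm(x2 ~ x1, data = d)           # lm() drops the NA rows by default
d_reg <- d
d_reg$x2[is.na(d$x2)] <- predict(fill, newdata = d[is.na(d$x2), ])
```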

ANCOVA + ANOVA

  • One of the best ways to check for seasonality is to use ANOVA and build confidence bands for each of the parameters that represent the difference of means between two levels (see the sketch after this list)

  • Two-way ANOVA and fractional factorial designs
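
A minimal sketch of that idea on simulated quarterly data (the seasonal setup is my own illustration): TukeyHSD() gives confidence bands for the differences of means between levels.

```r
set.seed(1)
season <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), each = 25))
sales  <- 10 + 2 * (season == "Q4") + rnorm(100)   # Q4 has a higher mean
fit <- aov(sales ~ season)
summary(fit)
TukeyHSD(fit)    # confidence bands for pairwise differences of level means
```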

Insurance Redlining Case

  • Aggregation bias – more specifically, when conclusions at the group level are extended to the individual level

  • Steps

  • Diagnostics

  • Skewness

  • Variation of the independent variable

  • Stripcharts to get an idea about the variation

  • pairs() to see the cross-correlations

  • Fit an MLR

  • Hat values reflect the spread in the independent variables

  • Cook's distance measures the influence of a point on the slope and intercept of the model

  • Use partial regression plots and partial residual plots to check whether a transformation is needed (see the sketch below)
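
A minimal sketch of that last step using termplot() from base R for partial residual plots (an added-variable/partial regression plot was drawn by hand in the Diagnostics section above):

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + x1 + x2^2 + rnorm(100)        # relationship is nonlinear in x2
fit <- lm(y ~ x1 + x2)

# Partial residual plots for each term; curvature in the x2 panel
# suggests a transformation is needed
par(mfrow = c(1, 2))
termplot(fit, partial.resid = TRUE)
```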