Linear Models with R
I strongly believe that the best way to learn statistics is to work in parallel on understanding the theory and on simulating data to see how the theory plays out in implementation. So while learning about linear models, I used this book to learn the R commands for fitting them. The book takes you through the many nuances of a linear model. Let me summarize it.
Estimation
- Estimation of the parameters of a linear model from scratch.
- Lack of identifiability arises when the model matrix X is not of full rank, so that X'X is not invertible.
- Check the eigenvalues of X'X. If any eigenvalue is zero or close to zero, you have a problem of identifiability or collinearity.
- Clear lack of identifiability is the better case, since the software throws an error or a clear warning. A situation that is merely close to unidentifiability is a bigger problem, because the fit goes through and it is the analyst's responsibility to notice and interpret the inflated standard errors.
- If two models, M1 and M2, differ only by one explanatory variable, then anova(M1, M2) does an F test for the significance of the additional variable.
- For a test of a single parameter, the square of the t statistic equals the F statistic, i.e. |t| = square root(F statistic).
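The last two points can be checked on simulated data. The following is a minimal sketch, assuming made-up variable names (x1, x2, y) and coefficients that are not from the book; it only illustrates the anova() comparison and the eigenvalue check described above.

```r
# Simulated data: names and coefficients are made up for illustration.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

M1 <- lm(y ~ x1)        # smaller model
M2 <- lm(y ~ x1 + x2)   # same model plus one explanatory variable

# F test for the significance of the additional variable x2
tab    <- anova(M1, M2)
F_stat <- tab$F[2]

# t statistic for x2 as reported by the larger model
t_stat <- summary(M2)$coefficients["x2", "t value"]

# The squared t statistic should match the F statistic
print(c(t_squared = t_stat^2, F = F_stat))

# Eigenvalues of X'X: values at or near zero would signal an
# identifiability or collinearity problem
X  <- model.matrix(M2)
ev <- eigen(crossprod(X))$values
print(ev)
```

Making x2 an exact copy of x1 in this sketch is a quick way to see how lm() reports a model that is not identifiable (one coefficient comes back as NA).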
Inference
- Distribution of beta and mu, and the variance of beta
- Use of I() for hypothesis testing
- Use of offset() for hypothesis testing
- From the t statistic of a standard regression fit, one can get a confidence interval for the parameter. The same statistic can be obtained as follows:
  - Fit a model with all the variables
  - Fit a model with all the variables except the one that you want to test
  - Do an anova of the two models. The F statistic you get is nothing but the square of the reported t statistic
- Suppose a bookstore offers a discount that depends on the genre of the book. If I pick up a non-fiction finance book, I can ask two questions:
  - Given that I have chosen a non-fiction finance book, what is the average discount that I can expect?
  - Given that I have chosen a non-fiction finance book, what is the price band that I can expect to pay for this particular book?
- The thing to note is that for the former question the variance of the beta estimates suffices, whereas the latter question needs to take the error variance into account too.
- predict() with interval = "confidence" gives an interval for the mean of the response variable at a specific value of the independent variable
- predict() with interval = "prediction" gives an interval for a new observation of the response variable at a specific value of the independent variable
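To make the difference between the two kinds of interval concrete, here is a small sketch on simulated data. The variable names, the true slope of 1 and the hypothesis being tested are all assumptions chosen for illustration; the offset() usage shows one way to test a slope against a fixed non-zero value.

```r
# Simulated data: names and parameter values are made up for illustration.
set.seed(2)
n <- 60
x <- runif(n, 0, 10)
y <- 3 + 1 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)

# Test H0: slope of x equals 1. offset() fixes the coefficient of x at 1
# in the restricted model, and anova() compares it with the full model.
fit0     <- lm(y ~ offset(1 * x))
test_tab <- anova(fit0, fit)
print(test_tab)

# Intervals at a specific value of the independent variable
new      <- data.frame(x = 5)
conf_int <- predict(fit, new, interval = "confidence")  # mean response
pred_int <- predict(fit, new, interval = "prediction")  # one new response

print(conf_int)
print(pred_int)
```

Both intervals share the same centre, but the prediction interval is wider because it adds the error variance to the uncertainty in the estimated coefficients.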
Diagnostics
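- Cook's distance: what is it and how is it computed?
- Hat values measure leverage: what are they and how are they computed?
- Added variable plots: how do you draw one?
- Durbin-Watson test: how is it computed?
- Leverage reflects how far an observation's independent-variable values lie from the rest. It does not depend on the response
- Cook's distance measures the influence of a specific point on the slope and intercept of the model
- Way to draw an added variable plot: plot the residuals from regressing the response on all the other predictors against the residuals from regressing the predictor of interest on those same predictors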
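The diagnostics listed above can all be computed with base R. The sketch below uses simulated data with made-up names; the Durbin-Watson statistic is computed by hand from the residuals rather than with a packaged test, so it gives the statistic only and no p-value.

```r
# Simulated data: names and coefficients are made up for illustration.
set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Leverage: the diagonal of the hat matrix H = X (X'X)^{-1} X'
h <- hatvalues(fit)
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)
print(all.equal(unname(h), unname(diag(H))))

# Cook's distance: influence of each point on the fitted coefficients
d <- cooks.distance(fit)
print(head(sort(d, decreasing = TRUE), 3))

# Durbin-Watson statistic by hand; values near 2 suggest no
# first-order autocorrelation in the residuals
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)

# Added variable plot for x1: residuals of y on the other predictors
# against residuals of x1 on the other predictors
e_y  <- residuals(lm(y ~ x2))
e_x1 <- residuals(lm(x1 ~ x2))
plot(e_x1, e_y, xlab = "x1 | others", ylab = "y | others")
av_fit <- lm(e_y ~ e_x1)
abline(av_fit)

# The slope of the added variable plot equals the coefficient of x1
# in the full model
print(c(av_slope = unname(coef(av_fit)[2]), full = unname(coef(fit)["x1"])))
```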
Problems with Predictors
- Measurement error in an independent variable biases the estimates
- Changing the scale of a variable also changes the parameters:
  - Change the scale of X: t, F, R squared and sigma squared remain the same, whereas beta gets rescaled
  - Change the scale of Y: t, F and R squared remain the same, whereas beta and sigma squared get rescaled
- Collinearity
  - Condition index
  - Variance inflation factor
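The scaling rules and the two collinearity measures are easy to verify numerically. This is a sketch on simulated data with made-up names and a scale factor of 10 chosen arbitrarily; the variance inflation factor and condition indices are computed by hand from their definitions rather than with an add-on package.

```r
# Simulated data: names, coefficients and the scale factor are made up.
set.seed(4)
n  <- 80
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
y  <- 1 + 2 * x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Rescale X: beta is rescaled, t and sigma are unchanged
x1_s  <- 10 * x1
fit_x <- lm(y ~ x1_s + x2)
beta_ratio <- unname(coef(fit)["x1"] / coef(fit_x)["x1_s"])
t_orig <- summary(fit)$coefficients["x1", "t value"]
t_resc <- summary(fit_x)$coefficients["x1_s", "t value"]
print(c(beta_ratio = beta_ratio, t_orig = t_orig, t_resc = t_resc))

# Rescale Y: beta and sigma are rescaled, R squared is unchanged
y_s   <- 10 * y
fit_y <- lm(y_s ~ x1 + x2)
sigma_ratio <- summary(fit_y)$sigma / summary(fit)$sigma
print(c(sigma_ratio = sigma_ratio,
        r2_orig = summary(fit)$r.squared,
        r2_resc = summary(fit_y)$r.squared))

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing
# x1 on the other predictors
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
print(vif_x1)

# Condition indices: sqrt(largest eigenvalue / each eigenvalue) of X'X,
# with the intercept column left out
Xp  <- cbind(x1, x2)
ev  <- eigen(crossprod(Xp))$values
cond_index <- sqrt(ev[1] / ev)
print(cond_index)
```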
Problems with Error
- Bootstrap regression
- Robust regression
- Weighted Least Squares
- Generalized Least Squares
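Two of these remedies can be sketched in a few lines of base R. The example below assumes simulated data in which the error standard deviation is proportional to x (an assumption made purely for illustration), fits weighted least squares with weights equal to the inverse variance, and builds a bootstrap interval for the slope by resampling cases. Resampling cases is a swapped-in variant chosen here because it does not rely on constant error variance; it is not necessarily the bootstrap scheme the book uses.

```r
# Simulated data: names and the variance structure are made up.
set.seed(5)
n <- 100
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error sd grows with x

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)  # weights proportional to 1 / variance

print(summary(ols)$coefficients)
print(summary(wls)$coefficients)

# Case-resampling bootstrap for the slope of the OLS fit
B <- 500
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)    # resample rows with replacement
  unname(coef(lm(y[idx] ~ x[idx]))[2])
})
boot_ci <- quantile(boot_slopes, c(0.025, 0.975))
print(boot_ci)
```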
Transformation
- In practice, the error variance is often not constant
- Log transformation is one of the easiest remedies
- Log-log is also good, as it can turn a non-linear relationship into a linear one. Yvonne Bishop is credited with the development of log-linear models
- Box-Cox transformation of the response variable, so that it is more tractable analytically
- Build a confidence interval for the Box-Cox parameter lambda. Based on that interval you can decide whether you really need a transformation or not: if it contains 1, no transformation is needed
- Logit transformation in cases where the y variable is a proportion
- Fisher's z transformation in cases where the y variable is a correlation
- Hockey stick regression
- Spline regression
- Polynomial regression
- Orthogonal regression
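The Box-Cox step described above can be sketched with boxcox() from the MASS package, which is assumed to be installed (it ships with standard R distributions). The data are simulated so that log(y) is linear in x; the names, the lambda grid and the 95% level are illustrative choices.

```r
library(MASS)

# Simulated data: log(y) is linear in x, so lambda near 0 is plausible.
set.seed(6)
n <- 100
x <- runif(n, 1, 5)
y <- exp(0.5 + 0.4 * x + rnorm(n, sd = 0.2))
fit <- lm(y ~ x)

# Profile log-likelihood over a grid of lambda values
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: all lambda whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
keep      <- bc$y > max(bc$y) - qchisq(0.95, 1) / 2
lambda_ci <- range(bc$x[keep])

print(lambda_hat)
print(lambda_ci)
```

If the interval contains 1, the data give no strong reason to transform; if it contains 0, a log transformation is a natural choice.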
Variable Selection
- Forward, backward and stepwise regression
- Information criteria such as AIC and BIC
- Mallows' Cp criterion
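A rough sketch of the three search directions using step() from base R follows. The data are simulated, with x3 constructed as pure noise; all names are made up. By default step() uses AIC, and setting k = log(n) switches the penalty to BIC.

```r
# Simulated data: x3 has no effect on y by construction.
set.seed(7)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)
null <- lm(y ~ 1, data = d)

# Backward elimination, forward selection and stepwise search (AIC)
back <- step(full, direction = "backward", trace = 0)
fwd  <- step(null, scope = formula(full), direction = "forward", trace = 0)
both <- step(null, scope = formula(full), direction = "both", trace = 0)

# Backward elimination with the BIC penalty
back_bic <- step(full, direction = "backward", k = log(n), trace = 0)

print(formula(back))
print(formula(fwd))
print(formula(both))
print(formula(back_bic))
print(c(AIC = AIC(full), BIC = BIC(full)))
```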
Shrinkage Methods
- PCA based regression
- Partial least squares regression
- Ridge regression
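Two of these methods can be sketched briefly. The example assumes the MASS package is available for lm.ridge(), and uses simulated, deliberately collinear predictors with made-up names. The lambda grid and the choice of two principal components are arbitrary illustrative settings, not recommendations.

```r
library(MASS)

# Simulated data: x2 is nearly a copy of x1, so the predictors are collinear.
set.seed(8)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + 0.5 * x3 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)

# Ridge regression over a grid of penalties; pick lambda by GCV
rr <- lm.ridge(y ~ x1 + x2 + x3, data = d, lambda = seq(0, 10, by = 0.1))
best       <- which.min(rr$GCV)
lambda_gcv <- rr$lambda[best]
ridge_coef <- coef(rr)[best, ]
print(lambda_gcv)
print(ridge_coef)

# Principal components regression: regress y on the leading components
pc     <- prcomp(d[, c("x1", "x2", "x3")], scale. = TRUE)
scores <- pc$x[, 1:2]
pcr    <- lm(d$y ~ scores)
print(summary(pc)$importance)
print(coef(pcr))
```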
Missing Treatment
- Removal of outliers
- Imputing mean
- Regression fill in method
ANCOVA + ANOVA
- One of the best ways to check for seasonality is to use ANOVA and build confidence bands for the parameters that represent the difference in means between two levels
- Two-way ANOVA and fractional factorial design
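As a sketch of the seasonality check, the example below simulates a quarterly series in which only the fourth quarter differs (an assumption made for illustration) and builds bands for the differences in means. With R's default treatment contrasts, each non-intercept coefficient of the fit is the difference between that quarter's mean and the Q1 mean, so confint() gives the bands directly; TukeyHSD() gives bands for all pairs of levels.

```r
# Simulated data: only Q4 has a different mean, by construction.
set.seed(9)
quarter <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = 10))
effect  <- c(Q1 = 0, Q2 = 0, Q3 = 0, Q4 = 3)
y <- 10 + unname(effect[as.character(quarter)]) + rnorm(40)

fit <- lm(y ~ quarter)
print(anova(fit))      # overall F test for a quarter effect

# Bands for the difference of each quarter's mean from Q1
bands <- confint(fit)
print(bands)

# Bands for every pairwise difference, adjusted for multiple comparisons
tk <- TukeyHSD(aov(y ~ quarter))
print(tk)
```

A band that excludes zero points to a difference between those two levels.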
Insurance Redlining Case
- Aggregation bias, which arises when conclusions drawn at the group level are extended to the individual level
- Steps
  - Diagnostics
  - Skewness
  - Variation of the independent variable
  - Stripcharts to get an idea about the variation
  - Pairs plot to see the cross correlation
  - Fit an MLR
  - Hat values describe how far each observation's independent-variable values lie from the rest
  - Cook's distance measures the influence of a point on the slope and intercept of the model
  - Use partial regression plots and partial residual plots to check for the need for a transformation