Purpose
To get equipped with the multitude of tools available for multivariate analysis Currently the tools which I use are present in base package and hence they are mostly pen and paper mode graphics. Meaning you can overlay one over the other and you end up making a graphic prettier.
However ggplot2 package gives a richer variety of tools. One thing I wanted to check is to give a check of my current skillset level as far as graphics are concerned. I took the graphics data that is present in ggplot2 and did all the possible charts that I can draw for understanding the data.
I had this feeling that my current graphic skillsets suck big time.
> library(ggplot2)
> data(diamonds)
> head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
> dsmall <- diamonds[sample(nrow(diamonds), 100), ] |
In the initial stages, I had to understand various attributes of the dataset. What is x, y, z , table and depth ? Referred to some site on the internet to get a hang of x , y, z ?
Somehow I wanted to learn everything about the price dependencies from the dataset with out having any domain knowledge of diamonds.
I began crunching some basic numbers Basic Summary Stats
> summary(diamonds)
carat cut color clarity depth
Min. :0.200 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
1st Qu.:0.400 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
Median :0.700 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
Mean :0.798 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
3rd Qu.:1.040 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
Max. :5.010 I: 5422 VVS1 : 3655 Max. :79.00
J: 2808 (Other): 2531
table price x y
Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
z
Min. : 0.000
1st Qu.: 2.910
Median : 3.530
Mean : 3.539
3rd Qu.: 4.040
Max. :31.800 |
> par(mfrow = c(2, 3))
> plot(diamonds$x, diamonds$price, main = "x")
> plot(diamonds$y, diamonds$price, main = "y")
> plot(diamonds$z, diamonds$price, main = "z")
> plot(diamonds$depth, diamonds$price, main = "depth")
> plot(diamonds$table, diamonds$price, main = "table")
> plot(diamonds$carat, diamonds$price, main = "carat") |
Then a boxplot of categorical variables
> par(mfrow = c(1, 1))
> boxplot(price ~ clarity, diamonds, main = "clarity")
> boxplot(price ~ cut, diamonds, main = "cut") |
Subsequently a pairs plot of all the continous variables
> z <- diamonds[, c("x", "y", "z", "depth", "table", "carat", "price")]
> pairs(z) |
What else am I capable of doing ?
> summary(lm(price ~ cut, diamonds))
Call:
lm(formula = price ~ cut, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-4258 -2741 -1494 1360 15348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4358.76 98.79 44.122 < 2e-16 ***
cutGood -429.89 113.85 -3.776 0.000160 ***
cutVery Good -377.00 105.16 -3.585 0.000338 ***
cutPremium 225.50 104.40 2.160 0.030772 *
cutIdeal -901.22 102.41 -8.800 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
F-statistic: 175.7 on 4 and 53935 DF, p-value: < 2.2e-16
> summary(lm(price ~ clarity, diamonds))
Call:
lm(formula = price ~ clarity, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-4737 -2727 -1429 1262 16254
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3924.1687 144.5619 27.145 < 2e-16 ***
claritySI2 1138.8599 150.2746 7.579 3.55e-14 ***
claritySI1 71.8325 148.6049 0.483 0.629
clarityVS2 0.8207 148.8672 0.006 0.996
clarityVS1 -84.7133 150.9746 -0.561 0.575
clarityVVS2 -640.4316 154.7737 -4.138 3.51e-05 ***
clarityVVS1 -1401.0541 158.5401 -8.837 < 2e-16 ***
clarityIF -1059.3296 171.8990 -6.163 7.21e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3935 on 53932 degrees of freedom
Multiple R-squared: 0.02715, Adjusted R-squared: 0.02702
F-statistic: 215 on 7 and 53932 DF, p-value: < 2.2e-16
> summary(lm(price ~ x, diamonds))
Call:
lm(formula = price ~ x, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-8426.1 -1263.7 -185.0 973.1 32128.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14094.056 41.732 -337.7 <2e-16 ***
x 3145.413 7.146 440.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1862 on 53938 degrees of freedom
Multiple R-squared: 0.7822, Adjusted R-squared: 0.7822
F-statistic: 1.937e+05 on 1 and 53938 DF, p-value: < 2.2e-16
> summary(lm(price ~ y, diamonds))
Call:
lm(formula = price ~ y, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-152436.0 -1229.3 -241.3 837.7 31436.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13402.027 44.062 -304.2 <2e-16 ***
y 3022.887 7.536 401.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1999 on 53938 degrees of freedom
Multiple R-squared: 0.749, Adjusted R-squared: 0.7489
F-statistic: 1.609e+05 on 1 and 53938 DF, p-value: < 2.2e-16
> summary(lm(price ~ z, diamonds))
Call:
lm(formula = price ~ z, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-139561.1 -1234.9 -239.8 824.5 32084.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13296.57 44.64 -297.9 <2e-16 ***
z 4868.79 12.37 393.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2027 on 53938 degrees of freedom
Multiple R-squared: 0.7418, Adjusted R-squared: 0.7417
F-statistic: 1.549e+05 on 1 and 53938 DF, p-value: < 2.2e-16
> summary(lm(price ~ depth, diamonds))
Call:
lm(formula = price ~ depth, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-3766 -2986 -1521 1396 14937
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5763.67 740.56 7.783 7.21e-15 ***
depth -29.65 11.99 -2.473 0.0134 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3989 on 53938 degrees of freedom
Multiple R-squared: 0.0001134, Adjusted R-squared: 9.483e-05
F-statistic: 6.115 on 1 and 53938 DF, p-value: 0.01340
> summary(lm(price ~ carat, diamonds))
Call:
lm(formula = price ~ carat, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-18585.33 -804.81 -18.92 537.43 12731.68
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2256.36 13.06 -172.8 <2e-16 ***
carat 7756.43 14.07 551.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16 |
Is that all I can do ? Don’t I know any other thing that I can perform on this dataset ? let me think for 2 min .. I can do a multiple reg
> model <- lm(price ~ x + y + z + table + depth + cut + clarity +
+ carat, diamonds)
> summary(model)
Call:
lm(formula = price ~ x + y + z + table + depth + cut + clarity +
carat, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-20741.8 -561.1 -112.0 403.9 11469.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2503.359 455.793 5.492 3.98e-08 ***
x -869.652 36.695 -23.699 < 2e-16 ***
y 14.840 21.597 0.687 0.492
z -41.610 37.405 -1.112 0.266
table -28.878 3.252 -8.881 < 2e-16 ***
depth -78.535 5.061 -15.517 < 2e-16 ***
cutGood 564.548 37.520 15.047 < 2e-16 ***
cutVery Good 705.968 36.009 19.605 < 2e-16 ***
cutPremium 725.949 35.994 20.169 < 2e-16 ***
cutIdeal 807.366 37.310 21.639 < 2e-16 ***
claritySI2 2721.859 48.934 55.623 < 2e-16 ***
claritySI1 3569.061 48.715 73.263 < 2e-16 ***
clarityVS2 4164.446 48.967 85.047 < 2e-16 ***
clarityVS1 4379.709 49.708 88.109 < 2e-16 ***
clarityVVS2 4863.608 51.197 94.998 < 2e-16 ***
clarityVVS1 4831.625 52.630 91.803 < 2e-16 ***
clarityIF 5128.992 56.869 90.189 < 2e-16 ***
carat 10494.676 53.769 195.182 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1262 on 53922 degrees of freedom
Multiple R-squared: 0.8999, Adjusted R-squared: 0.8999
F-statistic: 2.851e+04 on 17 and 53922 DF, p-value: < 2.2e-16 |
Everything looks significant except x and y Let me look at the residuals of the model
> par(mfrow = c(1, 1))
> qqnorm(resid(model))
> qqline(resid(model)) |
Its pathetic….Residuals are nowhere close to normal. The multivariate model that I have fit is a nice piece of bullshit and nothing more thatn that!
Ok, let me think for another 2 minutes …I think it is time that I fall back on Hadley Wickham and see what I can learn from the ggplot2 package
- The first thing that I did not transform the data inspite of seeing a non linear relationship between price vs carat.
> par(mfrow = c(1, 1))
> plot(log(diamonds$carat), log(diamonds$price)) |
It does look linear Let me contrast it with qplot
> print(qplot(log(diamonds$carat), log(diamonds$price))) |
- An additional insight makes a lot of sense carat vs density
> print(qplot(carat, x * y * z, data = dsmall)) |
- The other cutesy thing is to assign color or shape to categorical data
> print(qplot(carat, price, data = dsmall, colour = color))
> print(qplot(carat, price, data = dsmall, shape = cut)) |
- Got introduced to another feature called alpha
> print(qplot(carat, price, data = diamonds, alpha = I(1/10)))
> print(qplot(carat, price, data = diamonds, alpha = I(1/100)))
> print(qplot(carat, price, data = diamonds, alpha = I(1/10))) |
Aesthetics is also an art. For categorical variables, color and shape works well. Whereas shape works well for continous data.
If there is a huge dataset, obviously a loess type fits well. I have no clue about local regression functions. This is onething that I need to figure out as I go along. I have already downloaded a few books relating to it. One by Generalized Additive Models, -→ Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. -→ Hardel, W. (1990) Applied Nonparametric Regression. -→ Loader, C. (1999) Local Regression and Likelihood. -→ Chambers, J.M. and Hasite, T.J. (1993). Statistical Models in S. -→ Venables, W.N. and Ripley, B.D. (1994) Modern Applied Statistics with S-Plus.
God knows when I am going to read these books. Ok, there are 2 books out of the above list that are in my list that I have put on the wall. May be someday. The coming three days are holidays and I would ideally love to spend my entire time doing statistics. However I need to go for a workshop , a graphic workshop …
5.Anyways, coming back to loess provision in ggplot2 There is a way to specifying a set of geom. You can do this with a set of methods in a vector form c(…) etc Default method in ggplot2 is loess.
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+ span = 0.1))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+ span = 0.2))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+ span = 1)) |
However this does not work if there are larger number of points. You can use mgcv library
> library(mgcv)
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+ method = "gam", formula = y ~ s(x)))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+ method = "gam", formula = y ~ s(x, bs = "cs"))) |
One can also use lm to fit a linear model. One can load the splines package and use a natural spline
> library(splines)
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+ method = "lm"))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+ method = "lm", formala = y ~ ns(x, 5))) |
- Now coming to the categorical variables, you can use jitter plot or boxplot
> print(qplot(color, price/carat, data = diamonds, geom = "boxplot",
+ color = "blue"))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter"))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+ alpha = I(1/5)))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+ alpha = I(1/50)))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+ alpha = I(1/200))) |
- For continous variables, histogram and density plots are very important
> print(qplot(carat, data = diamonds, geom = "histogram"))
> print(qplot(carat, data = diamonds, geom = "density")) |
Let me change the binwidth and see how the histogram looks like
> print(qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1))
> print(qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01)) |
- Another cool feature of the density and histogram is that one can create a density and histogram for separate categories in the data
> print(qplot(carat, data = diamonds, geom = "density", colour = color))
> print(qplot(carat, data = diamonds, geom = "histogram", fill = color)) |
- Obviously the next thing is the bar chart
> print(qplot(color, data = diamonds, geom = "bar"))
> print(qplot(color, data = diamonds, geom = "bar", weight = carat) +
+ scale_y_continuous("carat")) |
- Timeseries plots
> print(qplot(date, unemploy/pop, data = economics, geom = "line"))
> print(qplot(date, uempmed, data = economics, geom = "line"))
> year <- function(x) as.POSIXlt(x)$year + 1900
> print(qplot(unemploy/pop, uempmed, data = economics, geom = c("point",
+ "path")))
> print(qplot(unemploy/pop, uempmed, data = economics, geom = "path",
+ colour = year(date)) + scale_area()) |
- Using conditional plots in multivariate data is extremely useful
> print(qplot(carat, data = diamonds, facets = color ~ ., geom = "histogram",
+ binwidth = 0.1, xlim = c(0, 3)))
> print(qplot(carat, ..density.., data = diamonds, facets = color ~
+ ., geom = "histogram", binwidth = 0.1, xlim = c(0, 3))) |
- All other options that are generally available for plot in R are available for qplot
> print(qplot(carat, price, data = dsmall, xlab = "Price ($)",
+ ylab = "Weight (carats)", main = "Price-weight relationship"))
> print(qplot(carat, price/carat, data = dsmall, ylab = expression(frac(price,
+ carat)), xlab = "Weight (carats)", main = "Small diamonds",
+ xlim = c(0.2, 1)))
> print(qplot(carat, price, data = dsmall, log = "xy")) |
There are a few important differences between plot and qplot:
- qplot is not generic: you cannot pass any type of R object to qplot and expect to get some kind of default plot. Note, however, that ggplot() is generic, and may provide a starting point for producing visualisations of arbitrary R objects.
- Usually you will supply a variable to the aesthetic attribute you’re interested in. This is then scaled and displayed with a legend. If you want to set the value, e.g., to make red points, use I(): colour = I(“red”).
- While you can continue to use the base R aesthetic names (col, pch, cex, etc.), it’s a good idea to switch to the more descriptive ggplot2 aesthetic names (colour, shape and size). They’re much easier to remember!
- To add further graphic elements to a plot produced in base graphics, you can use points(), lines() and text(). With ggplot2, you need to add additional layers to the existing plot
Takeaways
- qplot stands for quickplot
- They look more neater than the base plot
- loess can be used for local regression plots
- timeseries plots look publication friendly
- Way to create lattice plots
- jitter plots and boxplots have a new look