Purpose
To get equipped with the multitude of tools available for multivariate analysis Currently the tools which I use are present in base package and hence they are mostly pen and paper mode graphics. Meaning you can overlay one over the other and you end up making a graphic prettier.

However ggplot2 package gives a richer variety of tools. One thing I wanted to check is to give a check of my current skillset level as far as graphics are concerned. I took the graphics data that is present in ggplot2 and did all the possible charts that I can draw for understanding the data.

I had this feeling that my current graphic skillsets suck big time.

> library(ggplot2)
> data(diamonds)
> head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
> dsmall <- diamonds[sample(nrow(diamonds), 100), ]

In the initial stages, I had to understand various attributes of the dataset. What is x, y, z , table and depth ? Referred to some site on the internet to get a hang of x , y, z ?

Somehow I wanted to learn everything about the price dependencies from the dataset with out having any domain knowledge of diamonds.

I began crunching some basic numbers Basic Summary Stats

> summary(diamonds)
     carat              cut        color        clarity          depth
 Min.   :0.200   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
 1st Qu.:0.400   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
 Median :0.700   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
 Mean   :0.798   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
 3rd Qu.:1.040   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
 Max.   :5.010                     I: 5422   VVS1   : 3655   Max.   :79.00
                                   J: 2808   (Other): 2531
     table           price             x                y
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
z Min. : 0.000 1st Qu.: 2.910 Median : 3.530 Mean : 3.539 3rd Qu.: 4.040 Max. :31.800
> par(mfrow = c(2, 3))
> plot(diamonds$x, diamonds$price, main = "x")
> plot(diamonds$y, diamonds$price, main = "y")
> plot(diamonds$z, diamonds$price, main = "z")
> plot(diamonds$depth, diamonds$price, main = "depth")
> plot(diamonds$table, diamonds$price, main = "table")
> plot(diamonds$carat, diamonds$price, main = "carat")

qplot-003.jpg

Then a boxplot of categorical variables

> par(mfrow = c(1, 1))
> boxplot(price ~ clarity, diamonds, main = "clarity")
> boxplot(price ~ cut, diamonds, main = "cut")

qplot-004.jpg

Subsequently a pairs plot of all the continous variables

> z <- diamonds[, c("x", "y", "z", "depth", "table", "carat", "price")]
> pairs(z)

qplot-005.jpg

What else am I capable of doing ?

> summary(lm(price ~ cut, diamonds))
Call:
lm(formula = price ~ cut, data = diamonds)
Residuals: Min 1Q Median 3Q Max -4258 -2741 -1494 1360 15348
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4358.76 98.79 44.122 < 2e-16 *** cutGood -429.89 113.85 -3.776 0.000160 *** cutVery Good -377.00 105.16 -3.585 0.000338 *** cutPremium 225.50 104.40 2.160 0.030772 * cutIdeal -901.22 102.41 -8.800 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3964 on 53935 degrees of freedom Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279 F-statistic: 175.7 on 4 and 53935 DF, p-value: < 2.2e-16 > summary(lm(price ~ clarity, diamonds)) Call: lm(formula = price ~ clarity, data = diamonds)
Residuals: Min 1Q Median 3Q Max -4737 -2727 -1429 1262 16254
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3924.1687 144.5619 27.145 < 2e-16 *** claritySI2 1138.8599 150.2746 7.579 3.55e-14 *** claritySI1 71.8325 148.6049 0.483 0.629 clarityVS2 0.8207 148.8672 0.006 0.996 clarityVS1 -84.7133 150.9746 -0.561 0.575 clarityVVS2 -640.4316 154.7737 -4.138 3.51e-05 *** clarityVVS1 -1401.0541 158.5401 -8.837 < 2e-16 *** clarityIF -1059.3296 171.8990 -6.163 7.21e-10 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3935 on 53932 degrees of freedom Multiple R-squared: 0.02715, Adjusted R-squared: 0.02702 F-statistic: 215 on 7 and 53932 DF, p-value: < 2.2e-16 > summary(lm(price ~ x, diamonds)) Call: lm(formula = price ~ x, data = diamonds)
Residuals: Min 1Q Median 3Q Max -8426.1 -1263.7 -185.0 973.1 32128.1
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -14094.056 41.732 -337.7 <2e-16 *** x 3145.413 7.146 440.2 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1862 on 53938 degrees of freedom Multiple R-squared: 0.7822, Adjusted R-squared: 0.7822 F-statistic: 1.937e+05 on 1 and 53938 DF, p-value: < 2.2e-16 > summary(lm(price ~ y, diamonds)) Call: lm(formula = price ~ y, data = diamonds)
Residuals: Min 1Q Median 3Q Max -152436.0 -1229.3 -241.3 837.7 31436.0
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -13402.027 44.062 -304.2 <2e-16 *** y 3022.887 7.536 401.1 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1999 on 53938 degrees of freedom Multiple R-squared: 0.749, Adjusted R-squared: 0.7489 F-statistic: 1.609e+05 on 1 and 53938 DF, p-value: < 2.2e-16 > summary(lm(price ~ z, diamonds)) Call: lm(formula = price ~ z, data = diamonds)
Residuals: Min 1Q Median 3Q Max -139561.1 -1234.9 -239.8 824.5 32084.6
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -13296.57 44.64 -297.9 <2e-16 *** z 4868.79 12.37 393.6 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2027 on 53938 degrees of freedom Multiple R-squared: 0.7418, Adjusted R-squared: 0.7417 F-statistic: 1.549e+05 on 1 and 53938 DF, p-value: < 2.2e-16 > summary(lm(price ~ depth, diamonds)) Call: lm(formula = price ~ depth, data = diamonds)
Residuals: Min 1Q Median 3Q Max -3766 -2986 -1521 1396 14937
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5763.67 740.56 7.783 7.21e-15 *** depth -29.65 11.99 -2.473 0.0134 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3989 on 53938 degrees of freedom Multiple R-squared: 0.0001134, Adjusted R-squared: 9.483e-05 F-statistic: 6.115 on 1 and 53938 DF, p-value: 0.01340 > summary(lm(price ~ carat, diamonds)) Call: lm(formula = price ~ carat, data = diamonds)
Residuals: Min 1Q Median 3Q Max -18585.33 -804.81 -18.92 537.43 12731.68
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2256.36 13.06 -172.8 <2e-16 *** carat 7756.43 14.07 551.4 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1549 on 53938 degrees of freedom Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493 F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16

Is that all I can do ? Don’t I know any other thing that I can perform on this dataset ? let me think for 2 min .. I can do a multiple reg

> model <- lm(price ~ x + y + z + table + depth + cut + clarity +
+     carat, diamonds)
> summary(model)
Call:
lm(formula = price ~ x + y + z + table + depth + cut + clarity +
    carat, data = diamonds)
Residuals: Min 1Q Median 3Q Max -20741.8 -561.1 -112.0 403.9 11469.0
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2503.359 455.793 5.492 3.98e-08 *** x -869.652 36.695 -23.699 < 2e-16 *** y 14.840 21.597 0.687 0.492 z -41.610 37.405 -1.112 0.266 table -28.878 3.252 -8.881 < 2e-16 *** depth -78.535 5.061 -15.517 < 2e-16 *** cutGood 564.548 37.520 15.047 < 2e-16 *** cutVery Good 705.968 36.009 19.605 < 2e-16 *** cutPremium 725.949 35.994 20.169 < 2e-16 *** cutIdeal 807.366 37.310 21.639 < 2e-16 *** claritySI2 2721.859 48.934 55.623 < 2e-16 *** claritySI1 3569.061 48.715 73.263 < 2e-16 *** clarityVS2 4164.446 48.967 85.047 < 2e-16 *** clarityVS1 4379.709 49.708 88.109 < 2e-16 *** clarityVVS2 4863.608 51.197 94.998 < 2e-16 *** clarityVVS1 4831.625 52.630 91.803 < 2e-16 *** clarityIF 5128.992 56.869 90.189 < 2e-16 *** carat 10494.676 53.769 195.182 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1262 on 53922 degrees of freedom Multiple R-squared: 0.8999, Adjusted R-squared: 0.8999 F-statistic: 2.851e+04 on 17 and 53922 DF, p-value: < 2.2e-16

Everything looks significant except x and y Let me look at the residuals of the model

> par(mfrow = c(1, 1))
> qqnorm(resid(model))
> qqline(resid(model))

qplot-008.jpg

Its pathetic….Residuals are nowhere close to normal. The multivariate model that I have fit is a nice piece of bullshit and nothing more thatn that!

Ok, let me think for another 2 minutes …I think it is time that I fall back on Hadley Wickham and see what I can learn from the ggplot2 package

  1. The first thing that I did not transform the data inspite of seeing a non linear relationship between price vs carat.
> par(mfrow = c(1, 1))
> plot(log(diamonds$carat), log(diamonds$price))

qplot-009.jpg

It does look linear Let me contrast it with qplot

> print(qplot(log(diamonds$carat), log(diamonds$price)))

qplot-010.jpg

  1. An additional insight makes a lot of sense carat vs density
> print(qplot(carat, x * y * z, data = dsmall))

qplot-011.jpg

  1. The other cutesy thing is to assign color or shape to categorical data
> print(qplot(carat, price, data = dsmall, colour = color))
> print(qplot(carat, price, data = dsmall, shape = cut))

qplot-012.jpg

  1. Got introduced to another feature called alpha
> print(qplot(carat, price, data = diamonds, alpha = I(1/10)))
> print(qplot(carat, price, data = diamonds, alpha = I(1/100)))
> print(qplot(carat, price, data = diamonds, alpha = I(1/10)))

qplot-013.jpg

Aesthetics is also an art. For categorical variables, color and shape works well. Whereas shape works well for continous data.

If there is a huge dataset, obviously a loess type fits well. I have no clue about local regression functions. This is onething that I need to figure out as I go along. I have already downloaded a few books relating to it. One by Generalized Additive Models, -→ Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. -→ Hardel, W. (1990) Applied Nonparametric Regression. -→ Loader, C. (1999) Local Regression and Likelihood. -→ Chambers, J.M. and Hasite, T.J. (1993). Statistical Models in S. -→ Venables, W.N. and Ripley, B.D. (1994) Modern Applied Statistics with S-Plus.

God knows when I am going to read these books. Ok, there are 2 books out of the above list that are in my list that I have put on the wall. May be someday. The coming three days are holidays and I would ideally love to spend my entire time doing statistics. However I need to go for a workshop , a graphic workshop …

5.Anyways, coming back to loess provision in ggplot2 There is a way to specifying a set of geom. You can do this with a set of methods in a vector form c(…) etc Default method in ggplot2 is loess.

> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     span = 0.1))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     span = 0.2))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     span = 1))

qplot-014.jpg

However this does not work if there are larger number of points. You can use mgcv library

> library(mgcv)
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "gam", formula = y ~ s(x)))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "gam", formula = y ~ s(x, bs = "cs")))

qplot-015.jpg

One can also use lm to fit a linear model. One can load the splines package and use a natural spline

> library(splines)
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "lm"))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "lm", formala = y ~ ns(x, 5)))

qplot-016.jpg

  1. Now coming to the categorical variables, you can use jitter plot or boxplot
> print(qplot(color, price/carat, data = diamonds, geom = "boxplot",
+     color = "blue"))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter"))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+     alpha = I(1/5)))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+     alpha = I(1/50)))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+     alpha = I(1/200)))

qplot-017.jpg

  1. For continous variables, histogram and density plots are very important
> print(qplot(carat, data = diamonds, geom = "histogram"))
> print(qplot(carat, data = diamonds, geom = "density"))

qplot-018.jpg

Let me change the binwidth and see how the histogram looks like

> print(qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1))
> print(qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01))

qplot-019.jpg

  1. Another cool feature of the density and histogram is that one can create a density and histogram for separate categories in the data
> print(qplot(carat, data = diamonds, geom = "density", colour = color))
> print(qplot(carat, data = diamonds, geom = "histogram", fill = color))

qplot-020.jpg

  1. Obviously the next thing is the bar chart
> print(qplot(color, data = diamonds, geom = "bar"))
> print(qplot(color, data = diamonds, geom = "bar", weight = carat) +
+     scale_y_continuous("carat"))

qplot-021.jpg

  1. Timeseries plots
> print(qplot(date, unemploy/pop, data = economics, geom = "line"))
> print(qplot(date, uempmed, data = economics, geom = "line"))
> year <- function(x) as.POSIXlt(x)$year + 1900
> print(qplot(unemploy/pop, uempmed, data = economics, geom = c("point",
+     "path")))
> print(qplot(unemploy/pop, uempmed, data = economics, geom = "path",
+     colour = year(date)) + scale_area())

qplot-022.jpg

  1. Using conditional plots in multivariate data is extremely useful
> print(qplot(carat, data = diamonds, facets = color ~ ., geom = "histogram",
+     binwidth = 0.1, xlim = c(0, 3)))
> print(qplot(carat, ..density.., data = diamonds, facets = color ~
+     ., geom = "histogram", binwidth = 0.1, xlim = c(0, 3)))

qplot-023.jpg

  1. All other options that are generally available for plot in R are available for qplot
> print(qplot(carat, price, data = dsmall, xlab = "Price ($)",
+     ylab = "Weight (carats)", main = "Price-weight relationship"))
> print(qplot(carat, price/carat, data = dsmall, ylab = expression(frac(price,
+     carat)), xlab = "Weight (carats)", main = "Small diamonds",
+     xlim = c(0.2, 1)))
> print(qplot(carat, price, data = dsmall, log = "xy"))

qplot-024.jpg

There are a few important differences between plot and qplot:

  • qplot is not generic: you cannot pass any type of R object to qplot and expect to get some kind of default plot. Note, however, that ggplot() is generic, and may provide a starting point for producing visualisations of arbitrary R objects.
  • Usually you will supply a variable to the aesthetic attribute you’re interested in. This is then scaled and displayed with a legend. If you want to set the value, e.g., to make red points, use I(): colour = I(“red”).
  • While you can continue to use the base R aesthetic names (col, pch, cex, etc.), it’s a good idea to switch to the more descriptive ggplot2 aesthetic names (colour, shape and size). They’re much easier to remember!
  • To add further graphic elements to a plot produced in base graphics, you can use points(), lines() and text(). With ggplot2, you need to add additional layers to the existing plot

Takeaways

  • qplot stands for quickplot
  • They look more neater than the base plot
  • loess can be used for local regression plots
  • timeseries plots look publication friendly
  • Way to create lattice plots
  • jitter plots and boxplots have a new look