qplot - ggplot2 - RK's Musings

Purpose
To get equipped with the multitude of tools available for multivariate analysis Currently the tools which I use are present in base package and hence they are mostly pen and paper mode graphics. Meaning you can overlay one over the other and you end up making a graphic prettier.

However ggplot2 package gives a richer variety of tools. One thing I wanted to check is to give a check of my current skillset level as far as graphics are concerned. I took the graphics data that is present in ggplot2 and did all the possible charts that I can draw for understanding the data.

I had this feeling that my current graphic skillsets suck big time.

> library(ggplot2)
> data(diamonds)
> head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
> dsmall <- diamonds[sample(nrow(diamonds), 100), ]

In the initial stages, I had to understand various attributes of the dataset. What is x, y, z , table and depth ? Referred to some site on the internet to get a hang of x , y, z ?

Somehow I wanted to learn everything about the price dependencies from the dataset with out having any domain knowledge of diamonds.

I began crunching some basic numbers Basic Summary Stats

> summary(diamonds)
     carat              cut        color        clarity          depth
 Min.   :0.200   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
 1st Qu.:0.400   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
 Median :0.700   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
 Mean   :0.798   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
 3rd Qu.:1.040   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
 Max.   :5.010                     I: 5422   VVS1   : 3655   Max.   :79.00
                                   J: 2808   (Other): 2531
     table           price             x                y
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900

       z
 Min.   : 0.000
 1st Qu.: 2.910
 Median : 3.530
 Mean   : 3.539
 3rd Qu.: 4.040
 Max.   :31.800

> par(mfrow = c(2, 3))
> plot(diamonds$x, diamonds$price, main = "x")
> plot(diamonds$y, diamonds$price, main = "y")
> plot(diamonds$z, diamonds$price, main = "z")
> plot(diamonds$depth, diamonds$price, main = "depth")
> plot(diamonds$table, diamonds$price, main = "table")
> plot(diamonds$carat, diamonds$price, main = "carat")

Then a boxplot of categorical variables

> par(mfrow = c(1, 1))
> boxplot(price ~ clarity, diamonds, main = "clarity")
> boxplot(price ~ cut, diamonds, main = "cut")

Subsequently a pairs plot of all the continous variables

> z <- diamonds[, c("x", "y", "z", "depth", "table", "carat", "price")]
> pairs(z)

What else am I capable of doing ?

> summary(lm(price ~ cut, diamonds))
Call:
lm(formula = price ~ cut, data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max
 -4258  -2741  -1494   1360  15348

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   4358.76      98.79  44.122  < 2e-16 ***
cutGood       -429.89     113.85  -3.776 0.000160 ***
cutVery Good  -377.00     105.16  -3.585 0.000338 ***
cutPremium     225.50     104.40   2.160 0.030772 *
cutIdeal      -901.22     102.41  -8.800  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared: 0.01286,    Adjusted R-squared: 0.01279
F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16
> summary(lm(price ~ clarity, diamonds))
Call:
lm(formula = price ~ clarity, data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max
 -4737  -2727  -1429   1262  16254

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3924.1687   144.5619  27.145  < 2e-16 ***
claritySI2   1138.8599   150.2746   7.579 3.55e-14 ***
claritySI1     71.8325   148.6049   0.483    0.629
clarityVS2      0.8207   148.8672   0.006    0.996
clarityVS1    -84.7133   150.9746  -0.561    0.575
clarityVVS2  -640.4316   154.7737  -4.138 3.51e-05 ***
clarityVVS1 -1401.0541   158.5401  -8.837  < 2e-16 ***
clarityIF   -1059.3296   171.8990  -6.163 7.21e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3935 on 53932 degrees of freedom
Multiple R-squared: 0.02715,    Adjusted R-squared: 0.02702
F-statistic:   215 on 7 and 53932 DF,  p-value: < 2.2e-16
> summary(lm(price ~ x, diamonds))
Call:
lm(formula = price ~ x, data = diamonds)

Residuals:
    Min      1Q  Median      3Q     Max
-8426.1 -1263.7  -185.0   973.1 32128.1

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -14094.056     41.732  -337.7   <2e-16 ***
x             3145.413      7.146   440.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1862 on 53938 degrees of freedom
Multiple R-squared: 0.7822,     Adjusted R-squared: 0.7822
F-statistic: 1.937e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
> summary(lm(price ~ y, diamonds))
Call:
lm(formula = price ~ y, data = diamonds)

Residuals:
      Min        1Q    Median        3Q       Max
-152436.0   -1229.3    -241.3     837.7   31436.0

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -13402.027     44.062  -304.2   <2e-16 ***
y             3022.887      7.536   401.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1999 on 53938 degrees of freedom
Multiple R-squared: 0.749,      Adjusted R-squared: 0.7489
F-statistic: 1.609e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
> summary(lm(price ~ z, diamonds))
Call:
lm(formula = price ~ z, data = diamonds)

Residuals:
      Min        1Q    Median        3Q       Max
-139561.1   -1234.9    -239.8     824.5   32084.6

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -13296.57      44.64  -297.9   <2e-16 ***
z             4868.79      12.37   393.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2027 on 53938 degrees of freedom
Multiple R-squared: 0.7418,     Adjusted R-squared: 0.7417
F-statistic: 1.549e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
> summary(lm(price ~ depth, diamonds))
Call:
lm(formula = price ~ depth, data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max
 -3766  -2986  -1521   1396  14937

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5763.67     740.56   7.783 7.21e-15 ***
depth         -29.65      11.99  -2.473   0.0134 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3989 on 53938 degrees of freedom
Multiple R-squared: 0.0001134,  Adjusted R-squared: 9.483e-05
F-statistic: 6.115 on 1 and 53938 DF,  p-value: 0.01340
> summary(lm(price ~ carat, diamonds))
Call:
lm(formula = price ~ carat, data = diamonds)

Residuals:
      Min        1Q    Median        3Q       Max
-18585.33   -804.81    -18.92    537.43  12731.68

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2256.36      13.06  -172.8   <2e-16 ***
carat        7756.43      14.07   551.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared: 0.8493,     Adjusted R-squared: 0.8493
F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

Is that all I can do ? Don’t I know any other thing that I can perform on this dataset ? let me think for 2 min .. I can do a multiple reg

> model <- lm(price ~ x + y + z + table + depth + cut + clarity +
+     carat, diamonds)
> summary(model)
Call:
lm(formula = price ~ x + y + z + table + depth + cut + clarity +
    carat, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max
-20741.8   -561.1   -112.0    403.9  11469.0

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   2503.359    455.793   5.492 3.98e-08 ***
x             -869.652     36.695 -23.699  < 2e-16 ***
y               14.840     21.597   0.687    0.492
z              -41.610     37.405  -1.112    0.266
table          -28.878      3.252  -8.881  < 2e-16 ***
depth          -78.535      5.061 -15.517  < 2e-16 ***
cutGood        564.548     37.520  15.047  < 2e-16 ***
cutVery Good   705.968     36.009  19.605  < 2e-16 ***
cutPremium     725.949     35.994  20.169  < 2e-16 ***
cutIdeal       807.366     37.310  21.639  < 2e-16 ***
claritySI2    2721.859     48.934  55.623  < 2e-16 ***
claritySI1    3569.061     48.715  73.263  < 2e-16 ***
clarityVS2    4164.446     48.967  85.047  < 2e-16 ***
clarityVS1    4379.709     49.708  88.109  < 2e-16 ***
clarityVVS2   4863.608     51.197  94.998  < 2e-16 ***
clarityVVS1   4831.625     52.630  91.803  < 2e-16 ***
clarityIF     5128.992     56.869  90.189  < 2e-16 ***
carat        10494.676     53.769 195.182  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1262 on 53922 degrees of freedom
Multiple R-squared: 0.8999,     Adjusted R-squared: 0.8999
F-statistic: 2.851e+04 on 17 and 53922 DF,  p-value: < 2.2e-16

Everything looks significant except x and y Let me look at the residuals of the model

> par(mfrow = c(1, 1))
> qqnorm(resid(model))
> qqline(resid(model))

Its pathetic….Residuals are nowhere close to normal. The multivariate model that I have fit is a nice piece of bullshit and nothing more thatn that!

Ok, let me think for another 2 minutes …I think it is time that I fall back on Hadley Wickham and see what I can learn from the ggplot2 package

The first thing that I did not transform the data inspite of seeing a non linear relationship between price vs carat.

> par(mfrow = c(1, 1))
> plot(log(diamonds$carat), log(diamonds$price))

It does look linear Let me contrast it with qplot

> print(qplot(log(diamonds$carat), log(diamonds$price)))

An additional insight makes a lot of sense carat vs density

> print(qplot(carat, x * y * z, data = dsmall))

The other cutesy thing is to assign color or shape to categorical data

> print(qplot(carat, price, data = dsmall, colour = color))
> print(qplot(carat, price, data = dsmall, shape = cut))

Got introduced to another feature called alpha

> print(qplot(carat, price, data = diamonds, alpha = I(1/10)))
> print(qplot(carat, price, data = diamonds, alpha = I(1/100)))
> print(qplot(carat, price, data = diamonds, alpha = I(1/10)))

Aesthetics is also an art. For categorical variables, color and shape works well. Whereas shape works well for continous data.

If there is a huge dataset, obviously a loess type fits well. I have no clue about local regression functions. This is onething that I need to figure out as I go along. I have already downloaded a few books relating to it. One by Generalized Additive Models, -→ Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. -→ Hardel, W. (1990) Applied Nonparametric Regression. -→ Loader, C. (1999) Local Regression and Likelihood. -→ Chambers, J.M. and Hasite, T.J. (1993). Statistical Models in S. -→ Venables, W.N. and Ripley, B.D. (1994) Modern Applied Statistics with S-Plus.

God knows when I am going to read these books. Ok, there are 2 books out of the above list that are in my list that I have put on the wall. May be someday. The coming three days are holidays and I would ideally love to spend my entire time doing statistics. However I need to go for a workshop , a graphic workshop …

5.Anyways, coming back to loess provision in ggplot2 There is a way to specifying a set of geom. You can do this with a set of methods in a vector form c(…) etc Default method in ggplot2 is loess.

> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     span = 0.1))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     span = 0.2))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     span = 1))

However this does not work if there are larger number of points. You can use mgcv library

> library(mgcv)
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "gam", formula = y ~ s(x)))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "gam", formula = y ~ s(x, bs = "cs")))

One can also use lm to fit a linear model. One can load the splines package and use a natural spline

> library(splines)
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "lm"))
> print(qplot(carat, price, data = dsmall, geom = c("point", "smooth"),
+     method = "lm", formala = y ~ ns(x, 5)))

Now coming to the categorical variables, you can use jitter plot or boxplot

> print(qplot(color, price/carat, data = diamonds, geom = "boxplot",
+     color = "blue"))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter"))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+     alpha = I(1/5)))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+     alpha = I(1/50)))
> print(qplot(color, price/carat, data = diamonds, geom = "jitter",
+     alpha = I(1/200)))

For continous variables, histogram and density plots are very important

> print(qplot(carat, data = diamonds, geom = "histogram"))
> print(qplot(carat, data = diamonds, geom = "density"))

Let me change the binwidth and see how the histogram looks like

> print(qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1))
> print(qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01))

Another cool feature of the density and histogram is that one can create a density and histogram for separate categories in the data

> print(qplot(carat, data = diamonds, geom = "density", colour = color))
> print(qplot(carat, data = diamonds, geom = "histogram", fill = color))

Obviously the next thing is the bar chart

> print(qplot(color, data = diamonds, geom = "bar"))
> print(qplot(color, data = diamonds, geom = "bar", weight = carat) +
+     scale_y_continuous("carat"))

Timeseries plots

> print(qplot(date, unemploy/pop, data = economics, geom = "line"))
> print(qplot(date, uempmed, data = economics, geom = "line"))
> year <- function(x) as.POSIXlt(x)$year + 1900
> print(qplot(unemploy/pop, uempmed, data = economics, geom = c("point",
+     "path")))
> print(qplot(unemploy/pop, uempmed, data = economics, geom = "path",
+     colour = year(date)) + scale_area())

Using conditional plots in multivariate data is extremely useful

> print(qplot(carat, data = diamonds, facets = color ~ ., geom = "histogram",
+     binwidth = 0.1, xlim = c(0, 3)))
> print(qplot(carat, ..density.., data = diamonds, facets = color ~
+     ., geom = "histogram", binwidth = 0.1, xlim = c(0, 3)))

All other options that are generally available for plot in R are available for qplot

> print(qplot(carat, price, data = dsmall, xlab = "Price ($)",
+     ylab = "Weight (carats)", main = "Price-weight relationship"))
> print(qplot(carat, price/carat, data = dsmall, ylab = expression(frac(price,
+     carat)), xlab = "Weight (carats)", main = "Small diamonds",
+     xlim = c(0.2, 1)))
> print(qplot(carat, price, data = dsmall, log = "xy"))

There are a few important differences between plot and qplot:

qplot is not generic: you cannot pass any type of R object to qplot and expect to get some kind of default plot. Note, however, that ggplot() is generic, and may provide a starting point for producing visualisations of arbitrary R objects.
Usually you will supply a variable to the aesthetic attribute you’re interested in. This is then scaled and displayed with a legend. If you want to set the value, e.g., to make red points, use I(): colour = I(“red”).
While you can continue to use the base R aesthetic names (col, pch, cex, etc.), it’s a good idea to switch to the more descriptive ggplot2 aesthetic names (colour, shape and size). They’re much easier to remember!
To add further graphic elements to a plot produced in base graphics, you can use points(), lines() and text(). With ggplot2, you need to add additional layers to the existing plot

Takeaways

qplot stands for quickplot
They look more neater than the base plot
loess can be used for local regression plots
timeseries plots look publication friendly
Way to create lattice plots
jitter plots and boxplots have a new look