Boxplot Characteristics
Purpose
I thought I knew everything about boxplots and was even about to skip the first chapter on them. How naive of me! I had recently heard a Stanford professor speak about mindsets.
If there are 10 data points, say 1, 2, 3, …, 10, what is the median?
    > x <- 1:10
    > print((x[5] + x[6])/2)
    [1] 5.5
    > print(median(x))
    [1] 5.5
What are the first and third quartiles?
    > boxplot(x)
    > y <- (boxplot(x))
    > print(y)
    $stats
         [,1]
    [1,]  1.0
    [2,]  3.0
    [3,]  5.5
    [4,]  8.0
    [5,] 10.0
    attr(,"class")
            1
    "integer"

    $n
    [1] 10

    $conf
             [,1]
    [1,] 3.001801
    [2,] 7.998199

    $out
    numeric(0)

    $group
    numeric(0)

    $names
    [1] "1"
Well, at 33 years of age, I have learnt that knowledge about anything is not fixed; it keeps growing.
I was thinking that the first quartile is at 3 and the third quartile is at 8, and the stats component does show 3 and 8 in its second and fourth rows. But the conf component is a little different: it shows 3.001801 and 7.998199. Why? I don't know the answer yet.
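A possible explanation, going by the documentation of boxplot.stats (the function that boxplot.default calls internally): conf does not hold quartiles at all. It holds the lower and upper extremes of the notch, which is drawn around the median at plus or minus 1.58 * IQR / sqrt(n). The sketch below recomputes those two numbers by hand; the variable names s, hinges, iqr and notch are my own choices for illustration.

```r
x <- 1:10
s <- boxplot.stats(x)            # the statistics behind boxplot(), without plotting

hinges <- s$stats[c(2, 4)]       # lower and upper hinge: 3 and 8
iqr    <- diff(hinges)           # length of the box: 5

# Notch extremes: median +/- 1.58 * IQR / sqrt(n)
notch <- s$stats[3] + c(-1.58, 1.58) * iqr / sqrt(s$n)
print(notch)                     # should agree with s$conf
print(s$conf)

# quantile() with its default type = 7 interpolates differently from the hinges
print(quantile(x, c(0.25, 0.75)))
```

If this reading is right, the hinges really are 3 and 8, and the values close to 3 and 8 in conf are a coincidence of this small data set. Note also that quantile() with its default settings gives 3.25 and 7.75, so "the" first and third quartiles depend on which definition is used.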
    > boxplot.default
    function (x, ..., range = 1.5, width = NULL, varwidth = FALSE,
        notch = FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"),
        col = NULL, log = "", pars = list(boxwex = 0.8, staplewex = 0.5,
            outwex = 0.5), horizontal = FALSE, add = FALSE, at = NULL)
    {
        args <- list(x, ...)
        namedargs <- if (!is.null(attributes(args)$names))
            attributes(args)$names != ""
        else rep(FALSE, length.out = length(args))
        groups <- if (is.list(x))
            x
        else args[!namedargs]
        if (0 == (n <- length(groups)))
            stop("invalid first argument")
        if (length(class(groups)))
            groups <- unclass(groups)
        if (!missing(names))
            attr(groups, "names") <- names
        else {
            if (is.null(attr(groups, "names")))
                attr(groups, "names") <- 1:n
            names <- attr(groups, "names")
        }
        cls <- sapply(groups, function(x) class(x)[1])
        cl <- if (all(cls == cls[1]))
            cls[1]
        else NULL
        for (i in 1:n) groups[i] <- list(boxplot.stats(unclass(groups[[i]]),
            range))
        stats <- matrix(0, nrow = 5, ncol = n)
        conf <- matrix(0, nrow = 2, ncol = n)
        ng <- out <- group <- numeric(0)
        ct <- 1
        for (i in groups) {
            stats[, ct] <- i$stats
            conf[, ct] <- i$conf
            ng <- c(ng, i$n)
            if ((lo <- length(i$out))) {
                out <- c(out, i$out)
                group <- c(group, rep.int(ct, lo))
            }
            ct <- ct + 1
        }
        if (length(cl) && cl != "numeric")
            oldClass(stats) <- cl
        z <- list(stats = stats, n = ng, conf = conf, out = out,
            group = group, names = names)
        if (plot) {
            if (is.null(pars$boxfill) && is.null(args$boxfill))
                pars$boxfill <- col
            do.call("bxp", c(list(z, notch = notch, width = width,
                varwidth = varwidth, log = log, border = border,
                pars = pars, outline = outline, horizontal = horizontal,
                add = add, at = at), args[namedargs]))
            invisible(z)
        }
        else z
    }
    <environment: namespace:graphics>
OK, the five-number summary consists of the median, the lower quartile, the upper quartile, and the two extremes.
    > median(x)
    [1] 5.5
    > y$conf
             [,1]
    [1,] 3.001801
    [2,] 7.998199
    > y$conf + c(-1.5, 1.5) * diff(y$conf)
              [,1]
    [1,] -4.492797
    [2,] 15.492797
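The last expression above applies the 1.5 multiplier to conf. As far as I can tell from the boxplot.stats documentation, conf holds the notch extremes, while the whisker limits are measured from the hinges, which sit in rows 2 and 4 of stats. A minimal sketch of that calculation follows; fences and whiskers are names I made up, and coef = 1.5 is the default that boxplot.default passes along through its range argument.

```r
x <- 1:10
s <- boxplot.stats(x, coef = 1.5)   # coef plays the role of boxplot's `range`

hinges <- s$stats[c(2, 4)]          # 3 and 8
fences <- hinges + c(-1.5, 1.5) * diff(hinges)
print(fences)                       # -4.5 and 15.5

# Whiskers stop at the most extreme data points still inside the fences
whiskers <- range(x[x >= fences[1] & x <= fences[2]])
print(whiskers)                     # 1 and 10, rows 1 and 5 of s$stats

# Points outside the fences would be listed as outliers; here there are none
print(s$out)

# fivenum() returns the same five numbers that end up in s$stats
print(fivenum(x))
```

For 1:10 every point lies inside the fences, so the whiskers simply reach the minimum and the maximum and out is empty.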
OK, to end with, here are the basic properties of a boxplot:
- Median and Mean bars are measures of location
- Relative location of the median and the mean in the box is a measure of skewness
- The lengths of the box and the whiskers are measures of spread
- The length of the whiskers indicates the tail length of the distribution
- Outlying points are indicated with * or o
- Boxplots do not indicate multimodality or clusters
- If we compare the relative size and location of the boxes, we are comparing distributions
So, obviously, histograms are better for understanding multimodal distributions.
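To see this in numbers, here is a small sketch with a made-up, deliberately two-peaked data set (the vector bimodal is hypothetical, not taken from any chapter). The boxplot statistics reduce it to five numbers, whereas the histogram counts keep the gap in the middle visible.

```r
# Hypothetical data: two clusters around 2-3 and 7-8, nothing in between
bimodal <- c(rep(2, 40), rep(3, 10), rep(7, 10), rep(8, 40))

bs <- boxplot.stats(bimodal)
print(bs$stats)      # five numbers only; the median falls where there is no data

h <- hist(bimodal, breaks = 0:10, plot = FALSE)
print(h$counts)      # two separate peaks with empty bins between them
```

The median reported by the boxplot statistics is 5, a value that no observation is even close to, and nothing in the five numbers hints at that. The bin counts show the two clusters straight away.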