R cookbook
I am going through the R Cookbook mainly to review the syntax. It has been almost two months since I last wrote an R program, so my hands are rusty.
I will note down whatever is new to me, along with good reminders of things I had forgotten about R.
- I had forgotten about the split function, which breaks a vector into groups defined by a factor
> library(MASS)
> x <- with(Cars93, split(MPG.city, Origin))
> sapply(x, median)
USA non-USA
20 22
> lapply(x, median)
$USA
[1] 20
$`non-USA`
[1] 22
- Difference between sapply and lapply: sapply simplifies its result to a vector (or a matrix) when it can, whereas lapply always returns a list
- If the called function returns a structured object, use lapply, because sapply will try to flatten the result
> z <- list(x = runif(100), y = runif(100), z = runif(100))
> lapply(z, t.test)
$x
One Sample t-test
data: X[[1L]]
t = 16.8501, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.4410959 0.5588463
sample estimates:
mean of x
0.4999711
$y
One Sample t-test
data: X[[2L]]
t = 17.0593, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.4360076 0.5507845
sample estimates:
mean of x
0.493396
$z
One Sample t-test
data: X[[3L]]
t = 19.8378, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.4915784 0.6008450
sample estimates:
mean of x
0.5462117
> sapply(z, t.test)
x y z
statistic 16.85007 17.05927 19.83777
parameter 99 99 99
p.value 7.697001e-31 3.092998e-31 2.876633e-36
conf.int Numeric,2 Numeric,2 Numeric,2
estimate 0.4999711 0.493396 0.5462117
null.value 0 0 0
alternative "two.sided" "two.sided" "two.sided"
method "One Sample t-test" "One Sample t-test" "One Sample t-test"
data.name "X[[1L]]" "X[[2L]]" "X[[3L]]"
> batches = data.frame(f = as.factor(sample.int(10, 20, replace = T)),
v1 = runif(20))
> sapply(batches, class)
f v1
"factor" "numeric"
> lapply(batches, class)
$f
[1] "factor"
$v1
[1] "numeric"
- One very useful way of using sapply is to pass extra arguments after the function; sapply forwards them to every call
> x <- data.frame(matrix(rnorm(1000), 200, 5))
> head(x)
X1 X2 X3 X4 X5
1 0.2462347 -0.2475659 1.0894281 -0.9770150 0.2906779
2 1.7860306 0.2282453 -1.4134570 -0.8327972 -0.6294957
3 -0.1900189 2.0704738 1.2697864 -1.0238222 0.8206145
4 -0.9786056 0.2990741 -1.8738614 -0.8740837 -0.5114038
5 -0.7924509 0.1239347 0.1661355 -0.6518038 -1.2335057
6 -1.4474740 -1.9331664 0.2592431 -0.4734888 1.6203936
> colnames(x) <- letters[1:5]
> sapply(x, cor, y = x[, 5])
a b c d e
0.06175565 -0.03654796 0.02252372 -0.01408585 1.00000000
> lapply(x, cor, y = x[, 5])
$a
[1] 0.06175565
$b
[1] -0.03654796
$c
[1] 0.02252372
$d
[1] -0.01408585
$e
[1] 1
In the above code, I pass a vector as y, and it is used in every call to cor, once per column. sapply returns a vector, whereas lapply returns a list.
- tapply applies a function to groups of data. It takes a vector of data, a grouping vector (factor) that categorizes the elements of that vector, and the function.
> x <- data.frame(matrix(rnorm(1000), 200, 5))
> tapply(x[, 1], sample(letters[1:5], 200, T), summary)
$a
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.4010 -0.6451 -0.1532 -0.1656 0.6008 1.2530
$b
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.69900 -0.50800 -0.08974 -0.05318 0.70380 2.06200
$c
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.7180 -0.8657 -0.1659 -0.1219 0.5416 2.6900
$d
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.9450000 -0.4964000 0.0261200 -0.0003208 0.4923000 2.4350000
$e
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.8950 -0.9386 -0.2520 -0.2421 0.1317 3.1980
> by(x, sample(letters[1:5], 200, T), summary)
sample(letters[1:5], 200, T): a
X1 X2 X3 X4
Min. :-1.894702 Min. :-2.12825 Min. :-2.48184 Min. :-1.81276
1st Qu.:-0.315219 1st Qu.:-0.59759 1st Qu.:-0.51781 1st Qu.:-0.85265
Median :-0.001239 Median :-0.16609 Median :-0.01515 Median :-0.17397
Mean : 0.012802 Mean :-0.06084 Mean :-0.02153 Mean :-0.09181
3rd Qu.: 0.497065 3rd Qu.: 0.55039 3rd Qu.: 0.79936 3rd Qu.: 0.76880
Max. : 1.479541 Max. : 2.15728 Max. : 2.06674 Max. : 2.92034
X5
Min. :-1.66168
1st Qu.:-0.54460
Median : 0.06998
Mean : 0.08078
3rd Qu.: 0.75543
Max. : 1.77108
sample(letters[1:5], 200, T): b
X1 X2 X3 X4
Min. :-2.6988 Min. :-1.919464 Min. :-2.22402 Min. :-2.7828
1st Qu.:-0.8711 1st Qu.:-0.383063 1st Qu.:-0.67027 1st Qu.:-0.4645
Median :-0.2265 Median :-0.007706 Median :-0.19612 Median : 0.1912
Mean :-0.1334 Mean : 0.005452 Mean :-0.08216 Mean : 0.1898
3rd Qu.: 0.7205 3rd Qu.: 0.553677 3rd Qu.: 0.33841 3rd Qu.: 1.1395
Max. : 2.6904 Max. : 1.851667 Max. : 2.21023 Max. : 2.8057
X5
Min. :-2.41833
1st Qu.:-0.52458
Median : 0.31183
Mean : 0.09115
3rd Qu.: 1.00719
Max. : 1.94639
sample(letters[1:5], 200, T): c
X1 X2 X3 X4
Min. :-2.0565 Min. :-2.3364 Min. :-2.0210 Min. :-2.0947
1st Qu.:-0.7051 1st Qu.:-0.6890 1st Qu.:-0.2665 1st Qu.:-0.3118
Median :-0.1054 Median :-0.2009 Median : 0.1013 Median : 0.1938
Mean :-0.1038 Mean :-0.1345 Mean : 0.1338 Mean : 0.2960
3rd Qu.: 0.6657 3rd Qu.: 0.4101 3rd Qu.: 0.6527 3rd Qu.: 1.2366
Max. : 2.0617 Max. : 1.6190 Max. : 1.4966 Max. : 2.7061
X5
Min. :-2.8578
1st Qu.:-0.8699
Median :-0.2601
Mean :-0.1859
3rd Qu.: 0.5110
Max. : 2.1267
sample(letters[1:5], 200, T): d
X1 X2 X3 X4
Min. :-2.4011 Min. :-2.3560 Min. :-2.1363 Min. :-2.54557
1st Qu.:-1.0510 1st Qu.:-0.7428 1st Qu.:-0.4530 1st Qu.:-0.79773
Median :-0.3280 Median :-0.2157 Median : 0.2762 Median : 0.06078
Mean :-0.2108 Mean :-0.2219 Mean : 0.1424 Mean :-0.06605
3rd Qu.: 0.4572 3rd Qu.: 0.3877 3rd Qu.: 0.8852 3rd Qu.: 0.86327
Max. : 2.4768 Max. : 1.4466 Max. : 1.6618 Max. : 1.26067
X5
Min. :-2.2495
1st Qu.:-0.5001
Median :-0.1382
Mean :-0.1613
3rd Qu.: 0.3032
Max. : 1.4193
sample(letters[1:5], 200, T): e
X1 X2 X3 X4
Min. :-1.9366 Min. :-2.5626 Min. :-1.5326 Min. :-2.6420
1st Qu.:-0.8047 1st Qu.:-0.9031 1st Qu.:-0.5513 1st Qu.:-0.2864
Median :-0.3132 Median :-0.3650 Median : 0.3799 Median : 0.2160
Mean :-0.2036 Mean :-0.2917 Mean : 0.2980 Mean : 0.1280
3rd Qu.: 0.3988 3rd Qu.: 0.3309 3rd Qu.: 1.0764 3rd Qu.: 0.5996
Max. : 3.1978 Max. : 2.0349 Max. : 2.7986 Max. : 2.0513
X5
Min. :-1.63434
1st Qu.:-0.51284
Median :-0.04027
Mean : 0.05018
3rd Qu.: 0.57312
Max. : 1.97665
Clearly tapply and by serve different purposes. by splits a data frame into groups of rows and passes each group to the function as a data frame, whereas tapply always takes a single vector and runs the function only on the groups of that vector.
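To make the contrast concrete, here is a minimal sketch on made-up data. The data frame df and its columns g, u and v are hypothetical names chosen for illustration; they are not from the cookbook.

```r
# Hypothetical toy data: a grouping factor and two numeric columns.
df <- data.frame(
  g = factor(c("a", "a", "a", "b", "b", "b")),
  u = c(1, 2, 3, 1, 2, 3),
  v = c(2, 4, 6, 3, 2, 1)
)

# tapply: one vector in, one result per group out.
group_means <- tapply(df$u, df$g, mean)

# by: each group's rows arrive as a data frame, so the function
# can use several columns at once (here, a within-group correlation).
group_cor <- by(df, df$g, function(d) cor(d$u, d$v))

group_means
group_cor
```

A within-group correlation needs two columns per group, so it is the kind of job that suits by rather than tapply.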
- mapply applies a function to parallel vectors or lists. I never knew this until now. Fantastic learning. Say you have written a function that only works on scalar arguments. You can quickly vectorize it by passing it to mapply
> test <- function(a, b, c) {
if (b == "Test") {
return(a + c)
}
else {
return(a * c)
}
}
> test.vectorized <- function(a, b, c) {
mapply(test, a, b, c)
}
> test.df <- data.frame(runif(10), sample(c("Test", "NoTest"),
10, T), runif(10))
> test.vectorized(test.df[, 1], test.df[, 2], test.df[, 3])
[1] 1.79459526 0.32283599 0.02181394 1.15925957 0.68642825 1.18831506
[7] 1.09246148 0.22514183 1.25755210 0.02422697
- Until now I had never thought about vectorizing a simple function. mapply is a convenient way to do it.
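A small deterministic sketch of the same idea may help, since the example above uses random inputs. The function combine and the input vectors are hypothetical stand-ins for test and test.df.

```r
# Scalar-only function in the spirit of test() above: `if` inspects a
# single value, so it cannot branch element by element on a vector.
combine <- function(x, flag, y) {
  if (flag == "Test") x + y else x * y
}

x    <- c(1, 2, 3)
flag <- c("Test", "NoTest", "Test")
y    <- c(10, 20, 30)

# mapply calls combine() once per position:
# (1, "Test", 10), (2, "NoTest", 20), (3, "Test", 30)
result <- mapply(combine, x, flag, y)
result   # 11 40 33
```

Base R also provides Vectorize(), which builds this kind of wrapper around mapply.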
R cookbook - Data Structures
- You can turn a list into a matrix merely by giving it a dim attribute
- stack can be used to combine a list of vectors into a two-column data frame
> x1 = runif(3)
> x2 = runif(3)
> x3 = runif(3)
> stack(list(x1 = x1, x2 = x2, x3 = x3))
values ind
1 0.4405111 x1
2 0.7045491 x1
3 0.8090941 x1
4 0.4549867 x2
5 0.4280318 x2
6 0.6292135 x2
7 0.1691065 x3
8 0.1545277 x3
9 0.6390784 x3
I had never used stack before.
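The first note in this section, about giving a list a dim attribute, can be sketched as follows. The list lst and its values are made up for illustration.

```r
# A plain list of six elements (hypothetical values).
lst <- list(1, 2, 3, 4, 5, 6)

# Assigning dim reshapes it into a 2 x 3 matrix whose cells are list
# elements, filled column by column like any other R matrix.
dim(lst) <- c(2, 3)

is.matrix(lst)   # TRUE
is.list(lst)     # TRUE: it is still a list underneath
lst[[1, 3]]      # 5: row 1, column 3
```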
- Use drop = FALSE when subsetting so that the result is still a data frame, even when only one column is selected
- To initialize a data frame from row data, use do.call(rbind, obs). I came across this in Hadley Wickham’s code and was totally clueless about what it meant. Now I understand it
- Using matrix notation to select columns from a data frame is not the best procedure. Use list operators ([[ ]], $ or single brackets with one index) instead
- Use subset to select or remove certain columns from the data frame
- If you attach a data frame and then assign to one of its variables, the change is not reflected in the original data frame. Only a local copy is changed
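The data frame notes above can be tried out together in one short sketch. The list obs and the column names id, score and grp are hypothetical, picked only to have something to subset.

```r
# Hypothetical row data: each observation is a one-row data frame.
obs <- list(
  data.frame(id = 1, score = 0.5, grp = "a"),
  data.frame(id = 2, score = 0.7, grp = "b"),
  data.frame(id = 3, score = 0.9, grp = "a")
)

# do.call(rbind, obs) is the same as rbind(obs[[1]], obs[[2]], obs[[3]]).
df <- do.call(rbind, obs)

# Matrix-style selection of one column drops to a vector by default;
# drop = FALSE keeps a one-column data frame.
as_vector <- df[, "score"]
as_frame  <- df[, "score", drop = FALSE]

# List operators: [[ ]] and $ return the column itself, while single
# brackets with one index return a data frame.
col_values <- df[["score"]]
col_frame  <- df["score"]

# subset(): select = picks columns, and a minus sign removes them.
kept    <- subset(df, select = c(id, score))
dropped <- subset(df, select = -grp)

# attach(): assigning to an attached name creates a separate copy;
# the data frame itself is untouched.
attach(df)
score <- score * 100
detach(df)
df$score   # still 0.5 0.7 0.9
```

After the last step, score holds the scaled copy while df$score keeps the original values, which is exactly the trap the attach note warns about.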