The Art of R Programming: Summary

This book is written by Norman Matloff, a professor who has worked in both the computer science and statistics departments at the University of California, Davis. Hence this book is markedly different from the other books available on R: you get a nice blend of views about R. The author also clearly states in his preface that this is essentially a book for those “who want to develop software in R”.

If you want to write some ad hoc code for some ad hoc analysis, this book is definitely a stretch. However, if you are considering R for your day-to-day work as well as for research, then this book is an awesome reference. I have been programming in R for some time, and I can remember ONLY the specific points / libraries / packages that come up in my work. It is difficult to keep a lot of stuff in working memory. Books like this help you recall some of the obvious things that you have internalized but never cared to pause and think about.

In this post, I will list all such points (some of them very basic) that I came across in this book: things I had internalized but never given a second thought.

a single function can take a variety of input classes (R is polymorphic).
instances of S3 classes are lists with an attribute called class.
if an S3 object is just a list with an additional attribute, why do we need classes? Well, because generic functions in R look at the class of their input and invoke the function specific to that class.
to do a Google-style search across all your installed packages, use help.search("multivariate").
use sos package to get help. Obviously missing from the list of help resources mentioned in this book is www.stackoverflow.com
there are no scalars in R; a single number is just a vector of length one.
when we assign something to x, x is basically a pointer to a data structure.
try to use the all() and any() functions whenever possible.
sapply returns a matrix when the function being applied returns a vector for each element.
NULL is a special R object with no mode. It is convenient to initialize a variable to NULL and then grow the variable / vector in a loop.
sign function is a damn useful function when used appropriately.
the : operator produces integers whereas c() with numeric literals produces doubles (remember this to avoid some painful bugs in condition checking).
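A minimal sketch of the integer versus double point, using made-up vectors x and y. The values are the same, but the storage types differ, which matters for strict comparisons such as identical().

```r
x <- 1:3          # built with the colon operator
y <- c(1, 2, 3)   # built with c() from numeric literals

type_x <- typeof(x)                        # "integer"
type_y <- typeof(y)                        # "double"

same_strict <- identical(x, y)             # FALSE: same values, different types
same_values <- all(x == y)                 # TRUE: == compares after coercion
same_fixed  <- identical(x, as.integer(y)) # TRUE once the types match
```

If a condition relies on identical(), convert one side explicitly (or use all(x == y)) so the check does not silently fail.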
names(x) <- NULL removes the names of the elements in a vector.
sapply can convert an input vector to an output matrix (one column per input element when the applied function returns a vector).
we can ask many functions to skip over NA values, for example mean(x, na.rm = TRUE).
understand the importance of NULL. If you initialize a variable to NULL, you can dynamically grow the vector in a loop. NULL is a special object with no mode.
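Here is a small sketch of the NULL idiom described above. The variable name squares is only for illustration.

```r
squares <- NULL                 # empty placeholder, length 0
for (i in 1:5) {
  squares <- c(squares, i^2)    # c(NULL, value) is just value, so this appends
}
n_start <- length(NULL)         # 0
```

This is handy when the final length is not known in advance. When the length is known, preallocating (for example with numeric(5)) avoids repeatedly copying the vector.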
R uses column-major order.
learnt about the pixmap package, which gives the grayscale values of an image in matrix format.
apply will generally not speed up the code. It makes for compact code that is easier to read and modify.
when you extract a single row or column from a matrix, you get a vector, and the matrix properties go missing. Use drop = FALSE to preserve the matrix nature when subsetting.
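A short sketch of the drop = FALSE behaviour, using a made-up 2 x 3 matrix m. It also shows column-major filling: the values 1 to 6 go down the columns.

```r
m <- matrix(1:6, nrow = 2)     # columns are (1,2), (3,4), (5,6)

r1 <- m[1, ]                   # dimensions dropped: plain vector 1 3 5
r2 <- m[1, , drop = FALSE]     # still a matrix, now 1 x 3

r1_is_matrix <- is.matrix(r1)  # FALSE
r2_dim <- dim(r2)              # 1 3
```

Code that later calls nrow() or ncol() on the result is a typical place where the dropped dimensions cause trouble.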
component names in a list are called tags.
names of list components can be abbreviated to whatever extent is possible without causing ambiguity.
indexing a list with single brackets gives you another list (a sub-list). Indexing with double brackets gives you the element itself.
you can remove a component from the list by setting it to NULL.
use length function to get the size of the list.
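The list notes above can be sketched as follows. The list emp and its tags are made up for illustration.

```r
emp <- list(name = "Joe", salary = 55000, union = TRUE)

sub_list <- emp["salary"]     # single brackets: a list of length 1
element  <- emp[["salary"]]   # double brackets: the number itself
abbrev   <- emp$sal           # tag abbreviated; "sal" matches only "salary"

emp$union <- NULL             # setting a component to NULL removes it
n_left <- length(emp)         # 2 components remain
```

Abbreviated tags are convenient interactively, but spelling the tag out in full is safer in code that other people will read.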
R chooses the least common denominator type for unlisting operations.
there is an unname() function that can be used to remove names from a vector. I have never had a chance to use this function till date.
lapply() gives you back a list whereas sapply() gives a vector or a matrix.
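A quick sketch of the difference between lapply() and sapply(), with a made-up list x.

```r
x <- list(a = 1:3, b = 4:6)

as_list   <- lapply(x, mean)    # list with components a and b
as_vector <- sapply(x, mean)    # named numeric vector: a = 2, b = 5
as_matrix <- sapply(x, range)   # 2 x 2 matrix, one column per component
```

sapply() simplifies only when it can; if the function returns results of different lengths, you get a list back, just as with lapply().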
lists are heterogeneous counterparts to vector, the same way data frames are heterogeneous counterparts to matrix.
there are usually three ways to access a data frame df$a , df[,1] and df[[1]].
use drop = FALSE so that the extracted elements are still a data frame.
never had a chance to use the complete.cases() function, which flags the rows that have no NA, so that df[complete.cases(df), ] removes every row containing an NA.
learnt how to create dataframes dynamically in a loop using the assign function.
remember that lapply(df, sort) on a data frame will sort each column individually. That is disastrous, as the correspondence between the columns is lost and all the data gets mixed up.
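A small sketch of why sorting columns one at a time is dangerous, using a made-up two-row data frame d. The safe alternative shown is to reorder whole rows with order().

```r
d <- data.frame(kid = c("Jack", "Jill"), age = c(12, 10),
                stringsAsFactors = FALSE)

# Each column is sorted independently, so rows no longer belong together:
scrambled <- as.data.frame(lapply(d, sort))
jack_age_scrambled <- scrambled$age[scrambled$kid == "Jack"]  # 10, but Jack is 12

# Reordering whole rows keeps each kid with the right age:
ordered <- d[order(d$age), ]
```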
factors can be thought of as vectors with a bit more information added, such as levels.
output of split is a list.
use the table() function to get cross tabs.
cut() function is used to generate factors from data.
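The factor-related notes above can be sketched together. The vectors ages and sex and the break points are made up for illustration.

```r
ages <- c(25, 37, 52, 41, 19, 63)
sex  <- c("M", "F", "F", "M", "F", "M")

# cut() bins numeric data into a factor; intervals are closed on the right
grp <- cut(ages, breaks = c(0, 30, 50, 100))

# split() returns a list with one component per group
by_sex <- split(ages, sex)

# table() with two arguments gives a cross tab
tab <- table(sex, grp)
```

Here levels(grp) gives the three interval labels, by_sex$F holds the ages of the "F" group, and tab counts each sex within each age bin.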
one can use ls() in different ways to print objects from various environments.
references to local variable actually go to the same memory location as the global one, until the value of the local changes. In that case, a new memory location is used.
get() is one of the most useful functions in R
you can access the calling environment using the function parent.frame().
use <<- to assign values to global variables.
even though global assignment of variables is scorned by many , R uses a ton of global variable assignments in its internal code.
the sweep() function can be used to apply an operation, such as adding a specific vector, to all rows or columns.
any subsetting assignment on vectors or matrices, such as x[2] <- 8, is nothing but a call to a replacement function. Whenever you write classes of your own, it is always better to write replacement functions for them. Basically it’s the same thing as operator functions for a class in C++.
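A sketch of the replacement-function idea. The first part shows the function form of an ordinary subsetting assignment; the second defines a hypothetical replacement function called second<- purely for illustration.

```r
x <- c(5, 12, 13)
x[2] <- 8                        # the usual form
x <- `[<-`(x, 3, value = 1)      # the same kind of call, written as a function

# A user-defined replacement function: the name ends in <- and the
# last argument must be called value.
`second<-` <- function(x, value) {
  x[2] <- value
  x                              # return the modified object
}

y <- c(1, 2, 3)
second(y) <- 99                  # R evaluates this as y <- `second<-`(y, value = 99)
```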
use anonymous functions wherever possible.
a closure consists of a function’s arguments and body together with its environment, that is, the environment in which the function was created.
use assign function to manipulate non locals.
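A minimal sketch tying together anonymous functions, closures and <<-. The name make_counter is made up. Note that <<- searches the enclosing environments for an existing variable, so here it updates count inside the closure rather than creating a global.

```r
make_counter <- function() {
  count <- 0
  function() {                 # anonymous function returned as a closure
    count <<- count + 1        # updates count in the enclosing environment
    count
  }
}

c1 <- make_counter()
first  <- c1()                 # 1
second <- c1()                 # 2

c2 <- make_counter()           # a separate closure with its own count
other  <- c2()                 # 1

leaked <- exists("count", envir = globalenv(), inherits = FALSE)
```

If no variable of that name exists in any enclosing environment, <<- does create one at the global level, which is the behaviour the note above refers to.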
R promotes encapsulation, allows inheritance, and its classes are polymorphic.
R has two types of classes, S3 and S4. I have a good mnemonic to remember how S3 works: an S3 generic is like a general manager in a company. Basically he does nothing; he merely delegates work. So a print function in R is like a general manager that uses a dispatch mechanism (UseMethod()) to call the relevant method for the class of the input object and make it do the work. These are called generic functions, and they have nothing internally but a dispatch feature.
always remember that, even when you know a method's name, you might get an error by invoking it directly (print.classname, summary.classname), as the method might live in a namespace that does not export it. In that case, you must prefix the method name with its namespace (namespace:::method) to invoke it.
getAnywhere() is a function to get all the namespaces and objects for which a specific function is defined.
S3 classes are written by creating a list and then assigning a class attribute to that list. That’s about it. This simplicity has a flip side: a lot of errors might creep in. This is one of the reasons for the existence of S4 classes, which have a richer structure.
If you want to specify inheritance, a simple vector of names can be assigned to class attribute.
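The S3 notes above can be sketched with a made-up employee class. The class names, fields and print method are illustrative only.

```r
# An S3 object is a list plus a class attribute
emp <- list(name = "Joe", salary = 55000)
class(emp) <- "employee"

# A method for the print generic; print() dispatches to it by class
print.employee <- function(x, ...) {
  cat(x$name, "earns", x$salary, "\n")
  invisible(x)
}

# Inheritance: a vector of class names, most specific first
hourly <- list(name = "Kate", salary = 68000, hours = 30)
class(hourly) <- c("hrlyemployee", "employee")

# No print.hrlyemployee exists, so dispatch falls through to print.employee
out <- capture.output(print(hourly))
is_emp <- inherits(hourly, "employee")   # TRUE
```

Nothing checks that the list actually has a name or salary component, which is exactly the kind of error that can creep in with S3.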
S4 classes are considerably richer than S3 classes
S4 classes are defined using the setClass() function and their methods are defined using setMethod().
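For comparison, a hedged S4 sketch of a similar made-up employee class. The slot names and the show method are illustrative; unlike S3, slot types are checked when the object is created.

```r
library(methods)   # attached by default in most sessions; explicit for safety

setClass("employee",
         representation(name = "character", salary = "numeric"))

joe <- new("employee", name = "Joe", salary = 55000)

# show() is the S4 counterpart of print() for displaying an object
setMethod("show", "employee", function(object) {
  cat(object@name, "earns", object@salary, "\n")
})

pay <- joe@salary   # slots are accessed with @

# A wrong slot type is rejected at construction time
bad <- tryCatch(new("employee", name = "Joe", salary = "lots"),
                error = function(e) "rejected")
```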
John Chambers advocates S4 classes, whereas Google's R style guide advocates S3 classes.
If you are writing a general purpose code, then exists might be a very useful function.
Try to use cat instead of print as the latter can print only one expression and its output is numbered, which may be a nuisance.
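A tiny sketch of the difference, with a made-up vector x. capture.output() is used here only to hold the printed text in variables.

```r
x <- c(1, 2, 3)

printed <- capture.output(print(x))        # "[1] 1 2 3"  (numbered output)
catted  <- capture.output(cat(x, "\n"))    # "1 2 3 "     (no numbering)

# cat() accepts several expressions in one call
mixed <- capture.output(cat("mean =", mean(x), "\n"))
```

Remember that cat() does not add a newline by itself, hence the explicit "\n".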
came to know about snow package and Rdsm package that can be used for parallel computation in R.
There are a ton of functions for gleaning directories, files etc. In fact, I revamped my chaotic iTunes collection using the file and directory functions in R.
grep(), grepl(), sub(), gsub(), nchar(), paste(), sprintf(), substr(), strsplit(), regexpr() and gregexpr() are basic functions that are needed for text analysis.
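A quick tour of these string functions on made-up file names, in the spirit of the iTunes clean-up mentioned above.

```r
files <- c("song01.mp3", "notes.txt", "song02.mp3")

idx     <- grep("mp3", files)                 # positions of matches: 1 3
is_txt  <- grepl("\\.txt$", files)            # FALSE TRUE FALSE
renamed <- sub("song", "track", files)        # replace first match in each string
lens    <- nchar(files)                       # 10 9 10
prefix  <- substr(files, 1, 4)                # "song" "note" "song"
parts   <- strsplit("a,b,c", ",")[[1]]        # "a" "b" "c"
labels  <- paste("track", 1:2, sep = "_")     # "track_1" "track_2"
padded  <- sprintf("%02d.mp3", 7L)            # "07.mp3"
first_o <- as.integer(regexpr("o", "foo"))        # 2: first match only
all_o   <- as.integer(gregexpr("o", "foo")[[1]])  # 2 3: every match
```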
starting from R 2.10, there is a new function, debugonce(), that can be used. It's very useful as you don’t need to type debug and undebug every time.
you can also set conditional breakpoints, for example browser(expr = s > 1) or if (s > 1) browser().
starting from R 2.10, you can use setBreakpoint(filename, linenumber) to place breakpoints directly.
trace(gy, browser) is another way of placing a browser() at the start of the function gy without modifying the function.
options(error = recover) is very useful for tracing an error the first time it appears: when an error occurs, R lets you pick a frame of the call stack to inspect.
need to check out debug package by Mark Bravington.
StatET new version has debugging facility. All my R code is written on an older form of StatET version. Looking at the number of people hitting the support list of StatET with questions, I have an inertia to move to the new StatET version that promises debugging.
If you are writing very important code / algo, it is better to set options(warn=2), which instructs R to convert warnings to errors and makes the location of warnings easier to find.
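A small sketch of what options(warn=2) does, using a made-up conversion that normally only warns. The tryCatch() wrapper is there just to show that the warning now arrives as an error.

```r
old <- options(warn = 2)          # promote warnings to errors; keep old settings

res <- tryCatch(
  as.numeric("abc"),              # normally returns NA with a warning
  error = function(e) "stopped"   # with warn = 2 the warning becomes an error
)

options(old)                      # restore the previous warning level
```

Restoring the old option afterwards keeps the stricter behaviour from leaking into the rest of the session.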

For an experienced R user, the last three chapters of the book cover very important topics like

How to enhance performance of the code ?
How to make code efficient ?
How to make R talk to C++, Python and other languages ?
How can one perform parallel computations in R ?

Takeaway:

This book is a valuable addition to the R literature and has something new to offer to any R programmer, be it a beginner or a seasoned programmer.