Reproducible Research with R and RStudio : Summary
The book starts by explaining an example project that one can download from the author’s GitHub account. The project files serve as an introduction to reproducible research. It makes sense to download this project, follow the instructions and create the relevant files. By compiling the example project, one gets a sense of what one can accomplish by reading through the book.
**Introducing Reproducible Research**
The highlight of an RR document is that data, analysis and results are all in one document. There is no separation between announcing the results and doing the number crunching. The author lists the benefits that accrue to any researcher generating RR documents:
- better work habits
- better teamwork
- changes are easier
- higher research impact

The author uses `knitr` / `rmarkdown` in the book to discuss reproducibility. The primary difference between the two is that the former demands that the document be written in the markup language associated with the desired output, whereas the latter is more straightforward in the sense that one markup can be used to produce a variety of outputs.
**Getting Started with Reproducible Research**
The thing to keep in mind is that reproducibility is not an afterthought. It is something you build into the project from the beginning. The author discusses some general aspects of RR. If you are not convinced of the benefits of RR, read this chapter carefully, as it explains them and gives some RR tips for a newbie. This chapter also gives the reader a road map of what to expect from the book. In any research project, there is a data gathering stage, a data analysis stage and a presentation stage. The book contains a set of chapters addressing each stage of the project. More importantly, the book shows ways to tie the stages together so as to produce a single compendium for your entire project.
**Getting started with R, RStudio and knitr/rmarkdown**
This chapter gives a basic introduction to R and subsequently dives into `knitr` and `rmarkdown` commands. It shows how one can create a `.Rnw` or `.Rtex` document and convert it into a PDF, either through RStudio or from the command line. `rmarkdown` documents, on the other hand, are more convenient for reproducing simple projects where there are not many interdependencies between the various tasks. Naturally, the content in this chapter gives only a general idea, and one has to dig through the documentation to make things work. One thing I learned from this chapter is the option of creating `.Rtex` documents, in which the syntax can be less baroque.
**Getting started with File Management**
This chapter gives a basic directory structure that one can follow for organizing the project files. One can use the structure as a guideline for one’s own projects. The example project uses a GNU makefile for data munging. The chapter also gives a crash course in bash.
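
As a rough illustration of the kind of layout the chapter describes, the sketch below creates a project skeleton from R. The folder and file names (`Data`, `Analysis`, `Presentation`, `README.md`) are assumptions chosen for this example, not necessarily the names the book uses, and everything is created under a temporary directory.

```r
# Hypothetical project skeleton; the directory names are illustrative assumptions.
project_root <- file.path(tempdir(), "example-project")
subdirs <- c("Data", "Analysis", "Presentation")

# Create the root and one subdirectory per stage of the project.
for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# A top-level README describing the project is a common convention.
writeLines("Example project: data gathering, analysis and presentation.",
           file.path(project_root, "README.md"))

# List what was created at the top level of the project.
created <- list.files(project_root, include.dirs = TRUE)
print(created)
```

Keeping one folder per stage mirrors the gathering, analysis and presentation split that the book is organized around.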
**Storing, Collaborating, Accessing Files, and Versioning**
The four activities mentioned in the chapter title can be done in many ways. The chapter focuses on Dropbox and GitHub. It is fairly easy to learn to use the limited functionality one gets from Dropbox. GitHub, on the other hand, demands some learning from a newbie, who needs to get to know the basic terminology of git. The author does a commendable job of highlighting the main aspects of git version control and its tight integration with RStudio.
**Gathering Data with R**
This chapter talks about how one can use the GNU make utility to gather data in a systematic way. Using a makefile makes it easy for others to reproduce the data preparation stage of a project. If you have written a makefile for a C++ project or in some other context, it is pretty easy to follow the basic steps mentioned in the chapter. Otherwise, there is a bit of a learning curve. My guess is that once you start writing makefiles for specific tasks, you will realize their tremendous value in any data analysis project. A nice starting point for learning about makefiles is Rob J Hyndman’s site.
**Preparing Data for Analysis**
This chapter gives a whirlwind tour of data munging operations and data analysis in R.
**Statistical Modeling and knitr**
The chapter gives a brief description of the chunk options that are frequently used in an RR document. Of all the options, `cache.extra` and `dependson` are the ones I had never used before, so they were new to me. One of the reasons I like `knitr` is its ability to cache objects. In the `Sweave` era, I had to load separate packages and do all sorts of things to run a time-intensive RR document. It was very painful, to say the least. Thanks to `knitr`, it is extremely easy now. Even though the `cache` option is described at the end, I think it is one of the most useful features of the package. Another good thing is that you can combine various languages in an RR document. Currently `knitr` supports the following language engines:
- Awk
- Bash shell
- CoffeeScript
- Gawk
- Haskell
- Highlight
- Python
- R (default)
- Ruby
- SAS
- Bourne shell
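
To make the caching options discussed above a little more concrete, here is a minimal sketch written as a `knitr::spin()`-style R script, where lines starting with `#+` are chunk headers. This format is an assumption made so that the example stays plain R; the book itself works with `.Rnw` and `.Rmd` documents. The chunk labels (`load-data`, `fit-model`, `report`) are hypothetical, and `cache.extra` is not shown.

```r
# Sketch of chunk-level caching in a knitr::spin()-style script.
# Lines starting with "#+" are chunk headers; plain R treats them as comments.
# The chunk labels below are hypothetical.

#+ load-data, cache=TRUE
# Stand-in for a slow data-loading step; `cars` ships with base R.
dat <- cars

#+ fit-model, cache=TRUE, dependson="load-data"
# With dependson, this chunk's cache is refreshed when load-data's cache changes.
fit <- lm(dist ~ speed, data = dat)

#+ report
# Cheap step, left uncached.
round(coef(fit), 2)
```

Run as an ordinary R script, the headers are ignored. Passed through `knitr`, the two cached chunks should only be re-evaluated when their code, or the chunk they depend on, changes.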
**Showing results with tables**
In whatever analysis you do using R, there are always situations where your output is a `data.frame`, a `matrix` or some sort of `list` structure that is formatted to display on the console as a table. One can use `kable` to show `data.frame` and `matrix` structures. It is simple and effective, but limited in scope. The `xtable` package, on the other hand, is extremely powerful. One can pass various fitted statistical model objects to the `xtable` function to obtain the results encoded in LaTeX `table` and `tabular` environments. The chapter also mentions `texreg`, which is far more powerful than the previously mentioned packages. With `texreg`, you can show the output of more than one statistical model as a single table in your RR document. There are times when the output classes are not supported by `xtable`. In such cases, one has to manually hunt down the relevant table, create a data frame or matrix of the relevant results and then use the `xtable` function.
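
The manual route described above, where you pull the results into a matrix yourself and then hand it to a table function, can be sketched as follows. This is only an illustration: it assumes the `knitr` package is installed, uses `kable` rather than `xtable` to keep the dependencies small, and fits a toy model on the built-in `cars` dataset.

```r
library(knitr)  # assumes the knitr package is installed

# Fit a small model on a dataset that ships with R.
fit <- lm(dist ~ speed, data = cars)

# Pull the coefficient table out by hand as a plain numeric matrix.
coef_table <- coef(summary(fit))

# kable() accepts a matrix or data frame; format = "latex" emits a tabular.
latex_table <- kable(coef_table, format = "latex", digits = 3)
print(latex_table)
```

The same matrix could be passed to `xtable` instead if the richer formatting options are needed.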
**Showing results with figures**
It is often better to know the basic LaTeX syntax for embedding graphics before using `knitr`. One problem I have always faced with `knitr` embedded graphics is that all the chunk options have to be given on one single line. You cannot spread chunk options over two lines. I learnt a nice hack from this chapter, where some of the environment-level code can be written as markup rather than as chunk options. This chapter touches upon the main chunk options relating to graphics and does it well, without overwhelming the reader.
**Presentation with knitr/LaTeX**
The author says that much of the LaTeX in the book was written using the Sublime Text editor. I think this is the case with most people who intend to create an RR document. Even though RStudio has a good environment for creating a LaTeX file, I usually go back to my old editor to write LaTeX markup. How do you cite a bibliography in your document? How do you cite R packages in your document? These are questions that every researcher has to think about when producing RR documents. The author does a good job of highlighting the main aspects of this thought process. The chapter ends with a brief discussion of Beamer, giving a 10,000 ft. view of it. I stumbled on a nice link in this chapter that gives the reason for using `fragile` in Beamer.
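
On the question of citing R itself, base R already provides the pieces. The sketch below is one possible approach, not necessarily the one the book recommends: it converts R's own citation to BibTeX and writes it to a `.bib` file. The file name `r-packages.bib` and the use of a temporary directory are assumptions for the example.

```r
# citation() with no arguments returns the reference for R itself.
r_cite <- citation()

# Convert the citation to BibTeX lines.
bib_lines <- toBibtex(r_cite)

# Write them to a .bib file (a temporary, hypothetical path here).
bib_file <- file.path(tempdir(), "r-packages.bib")
writeLines(bib_lines, bib_file)

print(bib_lines)
```

The generated entry has no citation key, so one would need to add a key by hand before citing it from LaTeX. Passing a package name, as in `citation("knitr")`, gives the reference for that package if it is installed.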
**Large knitr/LaTeX Documents: Theses, Books, and Batch Reports**
This chapter is extremely useful for creating long RR documents. In fact, even if your RR document is not large, it makes sense to logically subdivide it into separate child documents. For `knitr`, there are chunk options to specify `parent` and `child` relationships. These options are useful for knitting child documents independently of the other documents embedded in the parent document. You do not have to specify the preamble code again in each of the child documents, as they inherit it from the parent document. The author also shows a way to use `Pandoc` to convert an rmarkdown document to `tex`, which can then be included in the RR document.
The penultimate chapter is on `rmarkdown`. The concluding chapter of the book discusses some general issues of reproducible research.
Takeaway:
This book gives a nice overview of the main tools that one can use in creating an RR document. Even though the title of the book has the term “RStudio” in it, the tools and the hacks mentioned are IDE agnostic. One could read a book-length treatment of each of the tools mentioned and easily get lost in the details. Books such as this one give a nice overview of all the tools and hence motivate the reader to dive into specifics as and when the need arises.