NumPy 1.5 – Summary
Programming in Matlab or R exposes one to a vectorized way of thinking. One rarely writes explicit loops; one tends to think in terms of vectors, arrays, matrices, etc. R, for instance, is designed to facilitate vectorized input and output: almost all functions in base R support vectorization, and most functions in the packages on CRAN are equipped to take vectorized input. In fact, the language design itself makes vectorizing easy. R's recycling rule, for instance, sometimes makes a function handle vectorized input automatically, even though you never meant it to handle such input. For an R newbie, the fact that his or her function does much more than expected is a happy side effect. However, once you have logged a decent number of hours in R, you realize that it is your responsibility to make whatever code you write handle vectorized input. Hence, these days one of the first unit tests I write checks whether my code breaks down on vectorized input.
If you stick to the Python Standard Library alone, you will not get to see the power of thinking in vectors; at least that's my impression. If, like me, you have programmed in R and are now exploring Python, you will in fact be eagerly looking for tools that facilitate vectorization. Thanks to Travis Oliphant, we now have the Numerical Python (NumPy) library, which gives you a lot of R-equivalent functionality in Python.
Historically, Numeric and Numarray were the Python libraries for matrix computations. Numeric was first released in 1995. In 2001, a number of people inspired by Numeric created SciPy, an open source Python scientific computing library that provides functionality similar to Matlab, Maple, and Mathematica. Around this time, people were growing increasingly unhappy with Numeric, and Numarray was created as an alternative that was better in some areas. Soon there were a lot of developments around Numarray, and SciPy, which depended on Numeric, could not take advantage of them.
In 2005, Travis Oliphant, a professor and an early contributor to SciPy, decided to do something about the situation. He tried to integrate some of the Numarray features into Numeric, and a complete rewrite took place that culminated in the release of NumPy 1.0 in 2006. Originally NumPy was part of SciPy, but today it exists as an independent library. This book is meant to teach basic skills for working with NumPy, and it does so with a good balance of theory and practical examples.
I feel NumPy does not have a steep learning curve if you are already exposed to Matlab or R. The reason is obvious: once you are familiar with vectorization, you can easily spot the functions and understand them. I have thoroughly enjoyed reading this book, as it has equipped me with the skills to translate R code into Python while getting all the powerful features of NumPy. NumPy is usually part of the basic toolkit of any researcher using Python. The fact that it has been around for quite some time means that it has matured as a library: its first release was NumPy 1.0 in 2006, and as of today you can get NumPy 1.6, so you have the advantage of a very stable library that has gone through almost six years of development. NumPy is used by scientists, statisticians, researchers, academicians, and quants who typically deal with huge data sets and look for quicker computations. NumPy's USP is that it is the closest you can get to running your code at C speed without actually writing C code.
The author lists some key points of NumPy that make it appealing in the context of Big Data:

- Much cleaner than straight Python code
- Fewer loops required because operations work on arrays
- Underlying algos have stood the test of time
- NumPy's arrays are stored more efficiently than an equivalent data structure in base Python
- Array I/O is significantly faster
- Performance improvement scales with the number of elements; it really pays off to use NumPy for large datasets
- Large portions of NumPy are written in C, which makes NumPy faster than pure Python
The book is organized into 10 chapters in such a manner that it systematically takes a first-time NumPy reader over all the major features of NumPy.
Chapter 1: NumPy Quick Start
The first chapter starts off with a set of screenshots covering the installation of NumPy on Windows, Mac, and other platforms. The highlight of the chapter is a simple example that sums two vectors and compares the speed of NumPy code with that of plain vanilla Python code. I ran a single pass of "generate two separate million-element vectors, do some operations on them, and add them" and found that NumPy was 11 times faster than the plain Python code and 1.5 times faster than a list comprehension. Instead of relying on this single run, I ran similar code 1,000 times to get summary statistics of the speed, comparing NumPy and list comprehensions against a plain vanilla Python loop. On average, NumPy was 14 times faster than the plain Python code and the list comprehension was 2 times faster. So, all in all, NumPy is a clear winner over both simple Python loops and list comprehensions.
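For readers who want to replicate the experiment, here is a minimal sketch of the kind of timing comparison described above; the vector length, the function names, and the number of repetitions are my own choices, not the book's code.

```python
import timeit

import numpy as np

N = 1_000_000  # vector length; my choice, not the book's


def plain_sum(n):
    """Plain vanilla Python: build the result with an explicit loop."""
    a = [i ** 2 for i in range(n)]
    b = [i ** 3 for i in range(n)]
    c = []
    for i in range(n):
        c.append(a[i] + b[i])
    return c


def listcomp_sum(n):
    """Replace the explicit loop with a list comprehension."""
    a = [i ** 2 for i in range(n)]
    b = [i ** 3 for i in range(n)]
    return [x + y for x, y in zip(a, b)]


def numpy_sum(n):
    """NumPy: whole-array operations, no Python-level loop."""
    a = np.arange(n) ** 2
    b = np.arange(n) ** 3
    return a + b


for f in (plain_sum, listcomp_sum, numpy_sum):
    t = timeit.timeit(lambda: f(N), number=3)
    print(f"{f.__name__}: {t:.2f} s for 3 runs")
```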
This chapter also introduced me to IPython. There was some learning curve for me, but once I figured out how to work with the magic commands, I realized that IPython is fantastic for interactive analysis. There was some initial inertia in going over IPython, but I soon realized that the magic commands were indeed magical. I found a few videos that were helpful in understanding IPython, and after this brief encounter I stumbled on to a chapter in a book titled `Python for Unix and Linux System Administration` that describes various IPython commands at length. Now when I look at R programming, I realize that R has a GUI but probably needs something similar to IPython; maybe it is just a matter of time before some open source enthusiast develops an IPython equivalent for R. Some of the features of IPython that I find very useful are
- Searching the history and executing commands from it
- Repeating specific lines from the history and executing them again
- Tab completion
- psearch
- Bookmarking paths
- The ability to call external scripts
Obviously, there are many more magic commands available in IPython that I haven't explored. I think it is hard to leave IPython once you start using it.
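As a quick illustration, here is a transcript-style sketch of the magics listed above; the history line numbers, paths, and script name are made up for illustration.

```
In [1]: %hist -g numpy            # search the history for lines containing "numpy"
In [2]: %rerun 3-5                # re-execute lines 3 to 5 from the history
In [3]: %psearch np.*sum*         # wildcard search for names in a namespace
In [4]: %bookmark data ~/datasets # bookmark a path...
In [5]: %cd -b data               # ...and jump back to it later
In [6]: %run my_script.py         # run an external script inside the session
```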
Chapter 2: Beginning with NumPy Fundamentals
At the heart of NumPy is the ndarray object, which comprises two parts: the actual data and the metadata. An ndarray contains homogeneous data described by the metadata object, dtype. The first thing one needs to learn is how to create an ndarray of the various numerical types. The data in an ndarray can be obtained from an existing Python object such as a list, generated with the arange or linspace functions, or read from an existing array using I/O. The dtype object is used to manipulate the default numerical type of the ndarray data. The chapter starts off with a list of constructors for populating data and metadata, then talks about ways to slice and dice the array; the R equivalents of cbind and rbind are all covered. There is an important point that is not highlighted well enough in this chapter: slicing and dicing an ndarray does not give you a new ndarray but a view into the same memory. This means that if you extract a subset of the data and update it, the original ndarray gets updated too. To avoid this inconvenience, the result of a slice can be passed through the copy function, which creates a new ndarray from the existing one. Very much like the unlist function in R, there are two functions, flatten and ravel, that can be applied to an ndarray to retrieve the data elements as a 1-D array. The difference between the two is that flatten allocates new memory whereas ravel doesn't, so depending on your requirement you might have to use one or the other. By the end of this chapter, a reader has a good idea of creating an ndarray and stacking/resizing/reshaping/splitting ndarrays.
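A minimal sketch of the view-versus-copy behaviour described above; the array values are my own:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

# Slicing returns a view: changing the slice changes the original array.
row = a[0]
row[0] = 99
print(a[0, 0])   # 99 -- the original was updated too

# An explicit copy gives an independent ndarray.
safe = a[0].copy()
safe[1] = -1
print(a[0, 1])   # 1 -- unchanged

# flatten always allocates new memory; ravel returns a view when it can.
f = a.flatten()
r = a.ravel()
f[0] = 0
print(a[0, 0])   # still 99 -- flatten copied the data
r[0] = 7
print(a[0, 0])   # 7 -- ravel aliased the original data
```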
Chapter 3: Get to Terms with Commonly Used Functions
The first thing any data analyst needs to do is get the data into the working environment. Much like the read.csv and read.delim functions in R, you have the loadtxt and genfromtxt functions in NumPy, with more or less similar syntax. genfromtxt has arguments that let you define your own converters while extracting data. My guess is that genfromtxt is far more useful than loadtxt, as real-world data is always messy and needs to be treated in some way or another before it reaches an analytic environment. A sample list of ndarray functions such as sum, cumsum, cumprod, mean, var, std, argmax, and argmin is mentioned. The author illustrates these functions with finance-related examples like calculating stock returns and plotting SMAs, Bollinger bands, trend lines, etc. By the end of this chapter, a reader has a good idea of doing I/O operations in NumPy and using the various built-in functions of an ndarray object.
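Here is a minimal sketch of genfromtxt with a per-column converter; the column layout and the "N/A" marker are my own invention, not an example from the book.

```python
import io

import numpy as np

# Stand-in for a messy CSV on disk.
csv = io.StringIO("2011-01-03,147.21\n"
                  "2011-01-04,N/A\n"
                  "2011-01-05,147.64\n")

prices = np.genfromtxt(
    csv,
    delimiter=",",
    usecols=(1,),    # keep only the price column
    converters={1: lambda s: float(s) if s != "N/A" else np.nan},
    encoding="utf-8",  # so the converter receives str, not bytes
)
print(prices)        # [147.21     nan  147.64]
```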
Chapter 4: Convenience Functions for Your Convenience
This chapter gives a taste of basic statistical functionality in NumPy such as covariance, correlation, polynomial fitting, and smoothing functions. The best thing I came across in this chapter is the vectorize function: write whatever function you want, pass it to NumPy's vectorize, and it is all set to take vectorized input. Coming from an R environment, where it is the coder's responsibility to ensure vectorized input handling, this function is a boon. You write a normal function, apply numpy.vectorize, and your function is ready. Wow! It's so neat.
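A minimal sketch, with a made-up scalar function of my own:

```python
import numpy as np

# A scalar-only function: called on an array, the `if` would raise an
# error because the comparison is ambiguous for arrays.
def payoff(price, strike):
    return price - strike if price > strike else 0.0

vpayoff = np.vectorize(payoff)  # now it accepts (and broadcasts) arrays

prices = np.array([90.0, 100.0, 110.0])
print(vpayoff(prices, 100.0))   # [ 0.  0. 10.]
```

Worth noting: per the NumPy documentation, vectorize is essentially a loop under the hood, so it is a convenience feature rather than a performance one.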
Chapter 5: Working with Matrices and ufuncs
This chapter covers matrices, a subclass of the ndarray object. All the basic matrix operations are covered: transpose, inverse, solving linear equations, etc. The highlight of this chapter is ufuncs, an abbreviation for universal functions. Universal functions are not plain functions but objects representing functions such as add. Now here is where things get interesting: these objects have methods, namely reduce, accumulate, reduceat, and outer. reduce along an axis is the NumPy counterpart of rowSums and colSums in R, while accumulate is the counterpart of cumsum. All four of these methods are very useful in data munging.
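A minimal sketch of the four ufunc methods, with arrays of my own choosing:

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)

# reduce collapses an axis: axis=0 behaves like R's colSums, axis=1 like rowSums.
print(np.add.reduce(a, axis=0))   # column sums: [12 15 18]
print(np.add.reduce(a, axis=1))   # row sums:    [ 6 15 24]

# accumulate keeps the running result -- the ufunc analogue of cumsum.
print(np.add.accumulate(a, axis=1))

# reduceat reduces over index-delimited slices of a flat array.
print(np.add.reduceat(np.arange(8), [0, 4]))  # [ 6 22]

# outer applies the ufunc to every pair of elements.
print(np.multiply.outer(np.arange(1, 4), np.arange(1, 4)))
```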
Chapter 6: Move Further with NumPy Modules
This is the chapter of the book that made me feel I was reading R code. Except for a few cosmetic changes in function signatures, the object and class signatures of NumPy's linalg module and random module look similar to R code. The chapter starts off by describing the linalg module's functionality: all the usual decompositions are supported, i.e. Cholesky, SVD, pseudo-inverse, eigenvalue decomposition, etc. Subsequently the chapter talks about various statistical distributions and the random number generators for those distributions. Overall a very easy read for a reader exposed to R.
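A minimal sketch of the R-to-NumPy correspondence; the matrix and the seed are my own, and default_rng is the modern random API (the book-era equivalent would be calls like np.random.normal(0, 1, 5)).

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])          # symmetric positive definite

# Decompositions live in numpy.linalg, much like chol/svd/eigen in base R.
L = np.linalg.cholesky(A)           # cf. chol(); NumPy returns the lower factor
U, s, Vt = np.linalg.svd(A)         # cf. svd()
w, V = np.linalg.eig(A)             # cf. eigen()
x = np.linalg.solve(A, [1.0, 2.0])  # cf. solve()

# Random draws mirror R's rnorm, rbinom, and friends.
rng = np.random.default_rng(42)
samples = rng.normal(0.0, 1.0, 5)
print(L, x, samples, sep="\n")
```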
Chapter 7 talks about special routines such as sorting, searching, and financial utilities. For testing ndarrays, NumPy has a testing package, and Chapter 8 describes its main functions, which can be used to test the equivalence of two ndarrays or matrices (see the sketch after this paragraph). I have skipped Chapter 9, which gives a 10,000 ft view of matplotlib, the graphics package in Python; I thought I would go over it carefully and learn it properly rather than settle for cookbook-style learning. The last chapter of the book talks about SciPy, where there are many more modules for a researcher looking beyond NumPy. It has made me curious about Scikits.statsmodels. I am certain that the models covered in Scikits.statsmodels are far fewer than the ocean of models available in R, but I am curious to know what sort of models are available there. Will go over it someday.
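A minimal sketch of the testing helpers from Chapter 8; the arrays and tolerances are my own:

```python
import numpy as np
import numpy.testing as npt

a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9                 # equal only up to floating-point noise

npt.assert_array_almost_equal(a, b, decimal=6)  # passes: close enough
npt.assert_allclose(a, b, rtol=1e-7)            # passes as well

try:
    npt.assert_array_equal(a, b)                # exact equality fails
except AssertionError:
    print("arrays differ at full precision")
```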
This book gives a good working knowledge of the Numerical Python library. By explaining the nuts and bolts of the major functions among the 400-odd functions in the NumPy package, it gives the reader enough ammunition to crunch data.