What to Learn in Python
Python is the greatest thing to happen to computer science since the Turing Machine! Well, no, but it has inspired a personal renaissance in my software writing. Its flexibility, widespread community support, and leveraging of legacy C and Fortran code also make it an outstanding language for social science researchers.
If you are a new researcher looking to get started, or an experienced one willing to walk away from your [:,:] lifestyle in Matlab (and its licensing and training fees), then equip yourself with these packages and get to it!
1. NumPy
NumPy, short for Numerical Python, is the cornerstone of Python’s mathematics and statistics operations. All scientific computing in Python starts and ends with NumPy!
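To give a feel for it, here is a minimal sketch of array math in NumPy. The survey scores are made up for illustration; the point is that column means and standardized scores come out without writing a single loop.

```python
import numpy as np

# Hypothetical data: 4 respondents (rows) x 3 survey items (columns).
scores = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
])

# Column-wise mean and (population) standard deviation, no explicit loops.
col_means = scores.mean(axis=0)
col_stds = scores.std(axis=0)

# Broadcasting subtracts each column's mean from every row of that column.
z_scores = (scores - col_means) / col_stds

print(col_means)
print(z_scores)
```

The [:,:] slicing habit carries over directly: scores[:, 0] is the first column and scores[1:3, :] is the middle two rows.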
Download NumPy
2. SciPy
SciPy, short for Scientific Python, is the little brother of NumPy, as it relies on NumPy data types for its operations. To distinguish itself, SciPy adds several of its own sophisticated data types, along with integration and optimization techniques. Many of the packages that follow rely on some combination of NumPy and SciPy.
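As a rough sketch of the integration and optimization side, the snippet below integrates a standard normal density numerically and minimizes a toy loss function. Both functions are my own illustrations, chosen only because their answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, optimize

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Numerically integrate the density over [-1.96, 1.96];
# quad returns the estimate and a bound on the absolute error.
area, abs_error = integrate.quad(normal_pdf, -1.96, 1.96)

def loss(x):
    """Toy objective with its minimum value of 1 at x = 3."""
    return (x - 3.0) ** 2 + 1.0

# Minimize the one-dimensional loss (Brent's method by default).
result = optimize.minimize_scalar(loss)

print(area)               # roughly 0.95
print(result.x, result.fun)
```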
Download SciPy
3. Matplotlib
The third tine on Python’s scientific trident, Matplotlib (pylab) is the standard for 2D plotting. It is highly extensible, and will display your results just the way you like ‘em.
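A minimal plotting sketch, with invented numbers standing in for real results. I select the non-interactive Agg backend here only so the snippet runs on a machine without a display; drop that line if you want a window to pop up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Made-up data: years of education vs. an invented survey score.
education = [8, 10, 12, 14, 16, 18]
score = [2.1, 2.9, 3.2, 4.1, 4.8, 5.5]

fig, ax = plt.subplots()
ax.plot(education, score, marker="o", linestyle="--", label="respondents")
ax.set_xlabel("Years of education")
ax.set_ylabel("Survey score")
ax.set_title("A hypothetical relationship")
ax.legend()

# fig.savefig("education_vs_score.png")  # uncomment to write the plot to disk
```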
Download Matplotlib
4. NetworkX
This package is what motivated me to learn Python. It is the best tool for analyzing network data, period. For novice social network analysts and graph theorists, the learning curve will be steep, but taking the time to learn NX will save you from wasting your time with inferior tools. Oh, and for those of you with accreditation concerns, its Subversion repository is maintained by Los Alamos National Laboratory.
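Here is a small sketch of what a first session with NX might look like. The friendship ties are hypothetical; the calls shown (building a graph from an edge list, centrality measures, shortest paths) are the bread and butter of the library.

```python
import networkx as nx

# Hypothetical friendship ties among five people.
ties = [
    ("Ann", "Bob"),
    ("Ann", "Cat"),
    ("Ann", "Dan"),
    ("Bob", "Cat"),
    ("Dan", "Eve"),
]

G = nx.Graph()
G.add_edges_from(ties)

# Fraction of the other nodes each person is directly tied to.
degree_centrality = nx.degree_centrality(G)

# How often each person sits on shortest paths between others.
betweenness = nx.betweenness_centrality(G)

# The shortest chain of acquaintances from Bob to Eve.
path = nx.shortest_path(G, "Bob", "Eve")

most_central = max(degree_centrality, key=degree_centrality.get)
print(most_central, path)
```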
Download NetworkX
5. PyMC
This one is for all of you Bayesian/MCMC modelers out there. PyMC implements the Metropolis-Hastings algorithm as a Python class, providing flexibility when building your model. PyMC is also highly extensible, and well supported by the community.
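To show what the Metropolis-Hastings algorithm is doing under the hood, here is a hand-rolled sketch in plain Python. To be clear, this is not PyMC's API: the model (normal data with known variance and a wide normal prior on the mean), the data, and every function name below are my own assumptions for illustration.

```python
import math
import random

def log_posterior(mu, data, prior_sd=10.0):
    """Log posterior (up to a constant) for the mean of Normal(mu, 1) data
    under a Normal(0, prior_sd) prior. Hypothetical model for illustration."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = sum(-0.5 * (x - mu) ** 2 for x in data)
    return log_prior + log_lik

def metropolis(data, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis sampler. The proposal is symmetric, so the
    Hastings correction cancels and only the posterior ratio remains."""
    rng = random.Random(seed)
    mu = 0.0
    current_lp = log_posterior(mu, data)
    samples = []
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            mu, current_lp = proposal, proposal_lp
        samples.append(mu)
    return samples

# Made-up observations scattered around 5.
data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 4.7, 5.2]
samples = metropolis(data)

# Discard the first 1000 draws as burn-in before summarizing.
kept = samples[1000:]
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

A package like PyMC spares you from writing the sampler yourself, so you can spend your time on the model instead.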
Download PyMC
6. SimPy
Short for “Simulation in Python”, SimPy is an object-oriented, process-based discrete-event simulation language, making it a wholesale agent-based modeling environment written entirely in Python. While not as robust as REPAST or NetLogo, SimPy provides an excellent tool set for designing experiments, and because it is pure Python, the data can be fed to other analytical packages.
Download SimPy
7. SymPy
Not to be confused with the previous entry, SymPy is a full-featured Python library for symbolic mathematics. Oliver suggested I add Sage to the list, which is an excellent tool, but SymPy offers nearly all of the same functionality (algebraic evaluation, differentiation, expansion, complex numbers, etc.) in a pure Python distribution. This package is great for researchers who want symbolic mathematics support but have no access to mega-expensive computer algebra systems like Mathematica.
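A quick sketch of the features listed above (expansion, differentiation, solving, complex numbers). The expressions are arbitrary examples of mine, picked because the answers are easy to verify by hand.

```python
import sympy as sp

x = sp.symbols("x")

# Expansion: (x + 1)**2 becomes x**2 + 2*x + 1.
expanded = sp.expand((x + 1) ** 2)

# Differentiation with respect to x gives 2*x + 2.
derivative = sp.diff(expanded, x)

# Exact (not floating-point) roots of x**2 - 2 = 0.
roots = sp.solve(sp.Eq(x**2 - 2, 0), x)

# Complex numbers: Euler's identity evaluates exactly to -1.
euler = sp.exp(sp.I * sp.pi)

print(expanded, derivative, roots, euler)
```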
Download SymPy
UPDATE: How to use Python and SymPy to solve optimization problems.
8. html5lib
After the fall of BeautifulSoup, I was desperate for a web data parser that equaled soup’s flexibility and ease of use. Enter html5lib. If you need to download and organize large amounts of data from the Internet in a quick and easy way, then html5lib is the only package you will need. This module also supports the BeautifulSoup tree type, as well as many others, making it incredibly useful across a wide range of tasks. To take advantage of its power, you will need a little background in HTML (or XML, if that happens to be what you are parsing), but there are many tutorials available online to get you up to speed quickly.
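A minimal sketch of pulling links out of a sloppy page. The HTML string is invented (note the unclosed list items), and I am assuming the default ElementTree tree type with namespacing switched off so that tags can be matched by their plain names.

```python
import html5lib

# Hypothetical page fragment with unclosed <li> tags, as found in the wild.
page = """
<ul>
  <li><a href='/u/ann'>Ann</a>
  <li><a href='/u/bob'>Bob</a>
</ul>
"""

# Parse into an xml.etree.ElementTree element; namespaceHTMLElements=False
# keeps tag names plain ("a" rather than a namespaced form).
tree = html5lib.parse(page, namespaceHTMLElements=False)

# Collect (href, link text) for every anchor in the document.
links = [(a.get("href"), a.text) for a in tree.iter("a")]
print(links)
```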
Download html5lib
9. Pycluster
There are many clustering algorithms available for Python, but many of these packages are designed to cluster one-dimensional data. Data collected by social scientists, however, is often of a higher dimension. Enter Pycluster. This package contains efficient implementations of hierarchical and k-means clustering, with several options for measuring distance. I am still waiting for a clever binding to Matplotlib to draw the dendrogram, but in the meantime, you can use their Java program TreeView to display the results.
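To illustrate hierarchical clustering on multi-dimensional data, here is a rough sketch. Note the swap: it uses SciPy's scipy.cluster.hierarchy module rather than Pycluster itself, since that is the interface I can vouch for, and the six data points are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: 6 respondents measured on 3 variables, two obvious groups.
data = np.array([
    [1.0, 1.1, 0.9],
    [0.9, 1.0, 1.2],
    [1.1, 0.8, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
    [4.9, 5.1, 5.2],
])

# Agglomerative clustering with average linkage on Euclidean distances.
tree = linkage(data, method="average", metric="euclidean")

# Cut the tree into two flat clusters; labels are integers starting at 1.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

The same SciPy module also provides a dendrogram function that draws the tree with Matplotlib.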
Download Pycluster
10. cjson
This module implements a very fast JSON encoder/decoder for Python. JSON (JavaScript Object Notation) is useful for many things, but most notable for social scientists is how many social networking sites use JSON to encode public data about their users and their users’ relationships. JSON is also what is returned by Google’s SocialGraph API, so cjson allows researchers to feed this social network data directly into Python data types.
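Here is a sketch of turning a JSON record into Python data types and then into network edges. It uses the standard library's json module as a stand-in rather than cjson, and the record layout is a made-up example, not the output of any real API.

```python
import json

# Hypothetical record, shaped like something a social site might return.
raw = '{"user": "ann", "follows": ["bob", "cat"], "followers": 2}'

# Decode the JSON text into plain dicts, lists, strings, and numbers.
profile = json.loads(raw)

# Turn the "follows" list into directed edges for a network package.
edges = [(profile["user"], other) for other in profile["follows"]]

# Encode back to JSON text, with keys sorted for stable output.
round_trip = json.dumps(profile, sort_keys=True)

print(edges)
print(round_trip)
```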
Download cjson
11. Pyevolve
A complete pure Python genetic algorithm framework. I am wearing my computer science background on my sleeve with this one, but for people serious about designing pure Python agent-based models, Pyevolve provides the tools to create intricate experimental environments.
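For anyone who has not met genetic algorithms before, here is a bare-bones sketch in plain Python. It does not use Pyevolve's API; the toy objective (count the 1 bits), the parameter values, and the function names are all assumptions of mine, there only to show the selection, crossover, and mutation loop that a framework like Pyevolve packages up for you.

```python
import random

def fitness(genome):
    """OneMax: count the 1 bits. A stand-in for a real objective."""
    return sum(genome)

def evolve(n_bits=20, pop_size=30, generations=60, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: keep the best genome unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random genomes, twice.
            mom = max(rng.sample(population, 3), key=fitness)
            dad = max(rng.sample(population, 3), key=fitness)
            # One-point crossover.
            cut = rng.randint(1, n_bits - 1)
            child = mom[:cut] + dad[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, history

best, history = evolve()
print(fitness(best), history[0])
```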
Download Pyevolve
12. MySQL for Python
A Python binding for MySQL, allowing the user to integrate MySQL execution into any Python script. Very straightforward and simple to use, and since many social science data sets are stored on MySQL databases, a necessity.
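A sketch of what database access from a script looks like. Since it has to run without a MySQL server, it uses the standard library's sqlite3 module as a stand-in; both modules follow Python's DB-API 2.0, so the connect/cursor/execute/fetch pattern is the same. The table and its rows are invented.

```python
import sqlite3

# Stand-in for a real server. With MySQL for Python this would be something
# like MySQLdb.connect(host=..., user=..., passwd=..., db=...), and query
# placeholders would be written %s instead of ?.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical survey table.
cur.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, country TEXT, income REAL)"
)
rows = [(1, "US", 52000.0), (2, "US", 48000.0), (3, "DE", 45000.0)]
cur.executemany("INSERT INTO respondents VALUES (?, ?, ?)", rows)
conn.commit()

# Parameterized query: average income per country above a threshold.
cur.execute(
    "SELECT country, AVG(income) FROM respondents "
    "WHERE income > ? GROUP BY country ORDER BY country",
    (40000.0,),
)
averages = cur.fetchall()
conn.close()

print(averages)
```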
Download MySQL for Python
Updated 4/6/2009: I have been negligent. As was pointed out in the comments, RPy has functionally been replaced by RPy2.
13. RPy2
There are very few statistical calculations that the combination of NumPy and SciPy cannot handle, but there are NO statistical operations R cannot do. RPy2 is a simple Python interface for R, able to execute any R function from within a Python script.
Download RPy2
I should also note that most (maybe all by now) of these packages come standard with the Enthought distribution of Python. If you are interested in using Python as a platform for scientific research, I highly recommend installing this distribution, which is free for academics.