Big Data: Review
“Big Data” has entered everyone’s vocabulary, thanks to the wild success of a few companies that have used data to provide valuable information and services. This book gives a bird’s-eye view of the emerging field.
The book starts off with an interesting example of the way Google predicted the spread of flu in real time by analyzing two datasets: the first containing the 50 million most common search terms that Americans type, and the second containing data on the spread of seasonal flu from public health agencies. Google did not start with a hypothesis, test a handful of models, and pick one among them. Instead, Google tested a mammoth 450 million different mathematical models built on the search terms, comparing their predictions against the actual flu cases. When the H1N1 crisis struck in 2009, this model gave more meaningful and valuable real-time information than any official public health system.
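The book stays at the level of the idea rather than the implementation, so the following is only a minimal sketch of that idea under my own assumptions (made-up function names and toy data, and a brute-force search standing in for whatever Google actually ran): generate candidate combinations of search-term frequency series, fit each against reported flu counts, and keep the combination whose fitted predictions track the actual counts best.

```python
# Minimal sketch (not Google's actual pipeline): score candidate sets of
# search-term frequencies against reported flu counts and keep the set
# whose fitted predictions correlate best with reality.
from itertools import combinations

import numpy as np
from numpy.linalg import lstsq


def score_candidate(term_matrix, flu_counts):
    """Fit a linear model (term frequencies -> flu counts) and return the
    correlation between the fitted and the actual counts."""
    X = np.column_stack([term_matrix, np.ones(len(flu_counts))])  # add intercept
    coeffs, *_ = lstsq(X, flu_counts, rcond=None)
    return np.corrcoef(X @ coeffs, flu_counts)[0, 1]


def best_term_combination(term_frequencies, flu_counts, n_terms=2):
    """Brute-force search over small combinations of search terms -- a toy
    stand-in for testing millions of candidate models."""
    best_combo, best_score = None, -np.inf
    for combo in combinations(term_frequencies, n_terms):
        X = np.column_stack([term_frequencies[t] for t in combo])
        score = score_candidate(X, flu_counts)
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score


# Toy weekly data: 52 weeks of flu counts and four made-up search terms.
rng = np.random.default_rng(0)
weeks = 52
flu = 100 + 50 * np.sin(np.linspace(0, 2 * np.pi, weeks)) + rng.normal(0, 5, weeks)
term_frequencies = {
    "flu symptoms": 0.8 * flu + rng.normal(0, 10, weeks),
    "fever remedy": 0.5 * flu + rng.normal(0, 20, weeks),
    "basketball scores": rng.normal(200, 30, weeks),
    "cough medicine": 0.6 * flu + rng.normal(0, 15, weeks),
}
print(best_term_combination(term_frequencies, flu))
```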
There is no formal definition of the term “Big Data”. Informally, it means the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. It is estimated that the total amount of stored information in the world is close to 1,200 exabytes (1 exabyte = 1 billion gigabytes), of which only about 2% is analog. So, after the basic digitization efforts, the next game changer, the book predicts, is going to be “Big Data”. At its core, big data is about predictions. Though it is often described with a term from computer science, “machine learning”, this characterization is misleading. Big data is not about trying to “teach” a computer to “think” like humans. Instead, it is about applying math to huge quantities of data in order to infer probabilities.
The book talks about the three shifts that are happening in the way we analyze information:
- “N = all”: Gone are the days when you had constraints on getting the entire dataset. In the “small data” world, one could start with a few hypotheses, use statistics to draw the right sample from the population, use estimation theory to find the right statistic to summarize the data, and then draw conclusions. This procedure is becoming irrelevant, or at least a predominant part of it is. Once you are dealing with the entire dataset, a much richer structure of data is available, and the variations among subcategories can be studied to one’s heart’s content.
- Messy: Gone are the days when you had to crave exact datasets. Data nowadays is stored across multiple servers and platforms and is inherently messy. Instead of trying to structure it and make it fit a relational database, database technologies are evolving in which “no structure” is the core philosophy. NoSQL, Hadoop, MapReduce, etc. are testimony to the fact that database technology has undergone a radical change. With messy data comes the advantage of breadth: simple models with lots of data perform better than elaborate models with less data (a toy illustration of this claim is sketched just after this list). One of the examples mentioned in this context is the grammar checker in MS Word. Instead of spending effort on developing more efficient algorithms than those already available, the researchers at Microsoft decided to focus on building a large corpus. This shift in big-data thinking dramatically increased the effectiveness of the algorithms: with a large corpus of words as ammunition, simple algorithms performed much better than complicated ones. Google has taken grammar checking to a completely different level by harnessing big data.
- Correlations: “What” matters more than “why”. Causality can be bid goodbye once you have huge datasets. Correlations that are notoriously unstable in small datasets can provide excellent patterns when analyzing big data. Nonlinear relationships can be culled out. Typically, nonlinear relationships have more parameters to estimate, and hence the data needed to make sense of those parameters becomes huge; the parameters also have high standard errors in the “small data” world. Enter the “Big Data” world, where “N = all” means the parameter estimates tend to stabilize (a quick simulation of this stabilizing effect is also sketched after this list). Does this mean the end of theory? Not really. Big data itself is founded on theory. For instance, it employs statistical and mathematical theories, and at times uses computer science theory, too. Yes, these are not theories about the causal dynamics of a particular phenomenon like gravity, but they are theories nonetheless, and models based on them hold very useful predictive power. In fact, big data may offer a fresh look and new insights precisely because it is unencumbered by the conventional thinking and inherent biases implicit in the theories of a specific field.
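On the claim in the “Messy” shift that simple models with lots of data beat elaborate models with less data, here is a toy synthetic experiment of my own (not the Microsoft grammar-checker study the book cites): a plain logistic regression trained on a large sample is compared with a more flexible random forest trained on a small one, both evaluated on the same held-out set.

```python
# Toy experiment: a simple model with lots of data vs. a flexible model
# with little data, on the same held-out set (synthetic task, my own setup).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def make_data(n, rng):
    """Noisy binary task: the label depends linearly on the first 5 of 20 features."""
    X = rng.normal(size=(n, 20))
    y = (X[:, :5].sum(axis=1) + rng.normal(scale=2.0, size=n) > 0).astype(int)
    return X, y


rng = np.random.default_rng(42)
X_test, y_test = make_data(20_000, rng)

# Elaborate model, small training set.
X_small, y_small = make_data(300, rng)
complex_model = RandomForestClassifier(n_estimators=200, random_state=0)
complex_model.fit(X_small, y_small)

# Simple model, large training set.
X_big, y_big = make_data(100_000, rng)
simple_model = LogisticRegression(max_iter=1000)
simple_model.fit(X_big, y_big)

print("flexible model, 300 examples:", accuracy_score(y_test, complex_model.predict(X_test)))
print("simple model, 100k examples: ", accuracy_score(y_test, simple_model.predict(X_test)))
```

On data generated this way, the simple but data-rich model typically comes out ahead, which is the spirit of the book’s point; the exact numbers depend entirely on the synthetic setup.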
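And on the claim that correlations stabilize as the dataset grows, a quick made-up simulation shows how the spread of estimated correlation coefficients shrinks with sample size:

```python
# Quick simulation: re-estimate a correlation of 0.3 on many fresh samples
# of different sizes and look at how the spread of the estimates shrinks.
import numpy as np

rng = np.random.default_rng(1)
true_rho = 0.3
cov = [[1.0, true_rho], [true_rho, 1.0]]

for n in (30, 300, 30_000):
    estimates = []
    for _ in range(500):  # 500 independent re-estimates at this sample size
        x, y = rng.multivariate_normal([0, 0], cov, size=n).T
        estimates.append(np.corrcoef(x, y)[0, 1])
    print(f"n={n:>6}: mean r = {np.mean(estimates):.3f}, "
          f"std of r = {np.std(estimates):.3f}")
```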
The book goes on to describe the various players in the Big Data value chain:
- Data Holders: They may not have done the original collection, but they control access to the information and use it themselves or license it to others who extract its value.
- Data Specialists: Companies with the expertise or technologies to carry out complex analysis.
- Companies and individuals with a big-data mindset: Their strength is that they see opportunities before others do, even if they lack the data or the skills to act upon those opportunities. Indeed, perhaps it is precisely because, as outsiders, they lack these things that their minds are free of imaginary prison bars: they see what is possible rather than being limited by a sense of what is feasible.
Who holds the most value in the big-data value chain? According to the authors,
Today the answer would appear to be those who have the mindset, the innovative ideas. As we saw from the dotcom era, those with a first-mover advantage can really prosper. But this advantage may not hold for very long. As the era of big data moves forward, others will adopt the mindset and the advantage of the early pioneers will diminish, relatively speaking.
Perhaps, then, the crux of the value is really in the skills? After all, a gold mine isn’t worth anything if you can’t extract the gold. Yet the history of computing suggests otherwise. Today expertise in database management, data science, analytics, machine-learning algorithms, and the like are in hot demand. But over time, as big data becomes more a part of everyday life, as the tools get better and easier to use, and as more people acquire the expertise, the value of the skills will also diminish in relative terms. Similarly, computer programming ability became more common between the 1960s and 1980s. Today, offshore outsourcing firms have reduced the value of programming even more; what was once the paragon of technical acumen is now an engine of development for the world’s poor. This isn’t to say that big-data expertise is unimportant. But it isn’t the most crucial source of value, since one can bring it in from the outside.
Today, in big data’s early stages, the ideas and the skills seem to hold the greatest worth. But eventually most value will be in the data itself. This is because we’ll be able to do more with the information, and also because data holders will better appreciate the potential value of the asset they possess. As a result, they’ll probably hold it more tightly than ever, and charge outsiders a high price for access. To continue with the metaphor of the gold mine: the gold itself will matter most.
What skills are needed to work in this Big Data world?
Mathematics and statistics, perhaps with a sprinkle of programming and network science, will be as foundational to the modern workplace as numeracy was a century ago and literacy before that. In the past, to be an excellent biologist one needed to know lots of other biologists. That hasn’t changed entirely. Yet today big-data breadth matters too, not just subject-expertise depth. Solving a puzzling biological problem may be as likely to happen through an association with an astrophysicist or a data-visualization designer.
The book ends with chapters on risks and control, where the authors cover a variety of issues that will have to be dealt with in the “Big Data” world. In trying to explain the field, the book gives a ton of examples. Here are some that I found interesting:
- Google Flu Trends: fitting half a billion models to cull out the 45 search terms that best detect the spread of flu.
- The entire dataset of sumo wrestling results, analyzed by the Freakonomics authors to cull out interesting patterns.
- Farecast: a site that helps predict the direction of airfares over different routes.
- Hadoop: an open-source alternative to Google’s MapReduce, a system for handling gigantic datasets.
- reCAPTCHA: instead of typing in random letters, people type two words from text-scanning projects that a computer’s optical character recognition program couldn’t understand. One word is meant to confirm what other users have typed, and thus is a signal that the person is human; the other is a new word in need of disambiguation. To ensure accuracy, the system presents the same fuzzy word to an average of five different people to type in correctly before it trusts it is right. The data has a primary use, to prove the user is human, but it also has a secondary purpose: to decipher unclear words in digitized texts. (A minimal sketch of this consensus logic appears after this list of examples.)
- 23andMe: DNA sequencing with a big-data mindset.
- The Billion Prices Project: a project that scours the web for price information and gives a real-time indication of the CPI (consumer price index). This kind of information is crucial for policy makers.
- ZestFinance: its technology helps lenders decide whether or not to offer relatively small, short-term loans to people who seem to have poor credit. Yet where traditional credit scoring is based on just a handful of strong signals like previous late payments, ZestFinance analyzes a huge number of “weaker” variables. In 2012 it boasted a loan default rate that was a third less than the industry average. But the only way to make the system work is to embrace messiness.
- Endgames cracked: chess endgames with six or fewer pieces on the board have been completely solved; in those positions there is no way a human can outsmart a computer.
- The New York City manholes problem solved using big-data thinking.
- Nuance made a blunder while licensing its voice-recognition technology to Google for GOOG-411, the local search listings service: Google retained the voice-translation records and reused the data in a whole gamut of services.
- Flyontime: visitors to the site can interactively find out (among many other correlations) how likely it is that inclement weather will delay flights at a particular airport. The website combines flight and weather information from official data sources that are freely available and accessible through the Internet.
- Decide: a price-prediction engine for zillions of consumer products.
- Prismatic: aggregates and ranks content from across the Web on the basis of text analysis, user preferences, social-network-related popularity, and big-data analytics. Importantly, the system does not make a big distinction between a teenager’s blog post, a corporate website, and an article in the Washington Post: if the content is deemed relevant and popular (by how widely it is viewed and how much it is shared), it appears at the top of the screen.
- Zynga: “We are an analytics company masquerading as a gaming company. Everything is run by the numbers,” says a Zynga executive.
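Returning to the reCAPTCHA example above: the book describes the mechanism only at a high level, so the following is a minimal sketch of that consensus logic under my own assumptions (hypothetical function names, a hard-coded agreement threshold of five), not the real reCAPTCHA implementation.

```python
# Hypothetical sketch of the consensus logic described in the book (made-up
# function names; not the real reCAPTCHA implementation). Each challenge
# pairs a known control word with an unknown word from a scanned text; the
# unknown word is trusted once enough independent humans agree on it.
from collections import Counter, defaultdict

REQUIRED_AGREEMENT = 5  # the book cites an average of about five people

# word_image_id -> Counter of transcriptions submitted by verified humans
votes = defaultdict(Counter)


def submit_challenge(control_answer, control_truth, unknown_id, unknown_answer):
    """Return True if the user passed the human check; if so, record their
    transcription of the unknown word as one vote."""
    is_human = control_answer.strip().lower() == control_truth.strip().lower()
    if is_human:
        votes[unknown_id][unknown_answer.strip().lower()] += 1
    return is_human


def resolved_word(unknown_id):
    """Return the agreed transcription once enough users concur, else None."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= REQUIRED_AGREEMENT else None


# Toy usage: five users pass the control word and agree on the unknown word.
for _ in range(5):
    submit_challenge("morning", "morning", "scan-0042", "overlooked")
print(resolved_word("scan-0042"))  # -> "overlooked"
```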
Takeaway:
Big data is a resource and a tool. It is meant to inform, rather than explain; it points us toward understanding, but it can still lead to misunderstanding, depending on how well or poorly it is wielded. This book gives a good overview of the major drivers that are taking us towards the Big Data world.