O’Reilly Data Show: Apache Spark’s Journey from Academia to Industry
Contents
Key learnings from the podcast:
- Ion Stoica's group at UC Berkeley started Mesos as a classroom project
- Mesos provided cluster management to support multiple frameworks, including Hadoop
- The next question was what to build on top of Mesos
- Spark's first commit was about 2,000 lines of code
- For several workloads, Hadoop alone was not good enough
- Spark started as a complementary component in the Hadoop ecosystem
- Startups ran a real-time stack and a separate historical stack, and maintaining two code bases was difficult
- Historical data analysis with Hadoop meant trying to recompute the metrics in batch
- The basic requirement was to enable real-time queries and iterative machine learning on top of Hadoop, which was essentially a distributed batch-processing engine
- Work on Spark started in 2009
- Graduate students made the project possible
- Databricks was launched, and the first Spark Summit was organized in 2014
- The team worked to align the students' incentives with the project
- Spark became an Apache project
- Spark tutorials began in 2012
- Worked with data software companies to sell Spark
- Made it easy to develop on top of Spark
- Added Scala, Java, and Python APIs, plus machine learning (MLlib) and graph (GraphX) libraries