O’Reilly Data Show: Apache Spark’s Journey from Academia to Industry
Contents
Key learnings from the podcast:
- Ion Stoica's group at UC Berkeley started Mesos as a classroom project
- Mesos provided cluster management to support multiple frameworks, including Hadoop
- The next question was what to build on top of Mesos
- Spark's first commit was about 2,000 lines of code
- For several workloads, Hadoop alone was not good enough
- Spark started as a complementary component in the Hadoop ecosystem
- Startups ran a real-time stack and a separate historical stack, and maintaining two code bases was difficult
- Historical data analysis with Hadoop meant trying to recompute the metrics in batch
- The basic requirement was to enable real-time queries and iterative machine learning on top of Hadoop, which was essentially a distributed batch-processing engine
- Work on Spark started in 2009
- Graduate students made the project possible
- Databricks was launched, and the first Spark Summit was organized in 2014
- The team worked to align the students' incentives with the project
- Spark became an Apache project
- Spark tutorials began in 2012
- Worked with data software companies to sell Spark
- Made it easy to develop on top of Spark
- Added Scala, Java, and Python APIs, plus machine learning (MLlib) and graph (GraphX) libraries