From 0 to 1 Spark for Data Science with Python
What have I learnt from this awesome course on Spark?
From Memory
Here is my attempt to write 50 takeaways from the course:
- Emacs configuration: running Spark from Emacs was the first thing I wanted to set up, so that I could quickly work through the various commands
(setq python-shell-interpreter "C:/ProgramData/Anaconda2/envs/sparkl/Scripts/ipython")
- Spark makes it easy to write map-reduce jobs. The Spark engine takes care of generating the map-reduce jobs and submitting them to the storage manager and the resource scheduler
- Spark has APIs in Java, Scala, Python and R and thus makes it easy for a programmer to get up and running
- RDD - Resilient Distributed Dataset - Main data structure of Spark Core
- Basic architecture
- Spark SQL + MLlib + GraphX + Spark Streaming
- Distributed Storage + Resource Scheduler
- Spark Core
- RDD traits : In-memory, distributed, fault-tolerant
- Pair RDD where the data is in the form of a key value pair
- Transformations and Actions can be applied on RDDs (a short sketch follows the lists below)
- Actions realize the RDD whereas Transformations do not. This lazy evaluation makes Spark very powerful
- Common transformations are
- map
- flatMap
- filter
- textFile (strictly an RDD creation method rather than a transformation)
- join
- groupBy
- Common Actions are
- collect
- take
- first
- One can use `persist` to cache the RDD on nodes
- One can use `cache` so that shuffle operations do not create operational bottlenecks
- Install Spark using VirtualBox
- The SparkDriver, the DAG Scheduler and the TaskSetManager are the three components that work together after a Spark job is submitted
- The SparkDriver creates the SparkContext, through which the client communicates with the Spark cluster
- One can use any distributed storage and resource scheduler. If you use Hadoop, the distributed storage is HDFS and the resource scheduler is YARN (a driver sketch follows)
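A sketch of what the driver side looks like, assuming a YARN-backed cluster; the master setting and the HDFS path are placeholders:

```python
# The driver program creates the SparkContext, which is the client's handle to
# the cluster. Swap the master for "local[*]" to run on a single machine.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("driver-sketch")   # hypothetical app name
        .setMaster("yarn"))            # with Hadoop: HDFS for storage, YARN for scheduling

sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///path/to/input.txt")      # hypothetical HDFS path
print(rdd.count())

sc.stop()
```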
- If you write a reduce job, the function you pass should take two arguments, e.g. `lambda x, y: x + y`
- In order to work with SparkStreaming , you need to work with DStreams objects
- Windowed DStream operations take four arguments: two functions that define how the statistics are updated for values entering and leaving the window, and two further arguments that specify the window length and the sliding interval (see the sketch after these points)
- Spark Streaming is not truly event streaming - it is microbatching
- One can feed input from Kafka in to Spark Streaming
- For true event streaming, one might have to use Storm or Flink to get events and then Spark can further process them
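A sketch of the windowed DStream API described above, assuming a socket source on a made-up host/port; the inverse-function form requires a checkpoint directory:

```python
# reduceByKeyAndWindow takes a function for values entering the window, an
# inverse function for values leaving it, plus the window length and slide
# interval. Spark Streaming processes these as micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)        # micro-batches every second
ssc.checkpoint("/tmp/spark-checkpoint")            # hypothetical path

lines = ssc.socketTextStream("localhost", 9999)    # could also be a Kafka source
pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,      # values entering the window
    lambda a, b: a - b,      # values leaving the window
    windowDuration=30,       # window length in seconds
    slideDuration=10)        # sliding interval in seconds

counts.pprint()
ssc.start()
ssc.awaitTermination()
```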
- There are more methods for DataFrames than for RDDs, as mentioned in the implementation of PR (PageRank)
- Spark ships with MLlib, which implements collaborative filtering, regression, clustering and other common algorithms
- Collaborative filtering via Alternating Least Squares is implemented in Spark
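A minimal collaborative-filtering sketch with the RDD-based MLlib API; the ratings and hyperparameters are made up:

```python
# Alternating Least Squares on a handful of toy ratings.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[2]", "als-sketch")

# Rating(user, product, rating) is the class MLlib uses to capture user ratings.
ratings = sc.parallelize([
    Rating(1, 10, 5.0),
    Rating(1, 20, 3.0),
    Rating(2, 10, 4.0),
    Rating(2, 30, 1.0),
])

model = ALS.train(ratings, rank=10, iterations=5)

# Predict how user 2 would rate product 20.
print(model.predict(2, 20))

sc.stop()
```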
- Spark came out of UC Berkeley; Databricks is the company that sells a commercial version of Spark
- Spark can be run in shell mode, local mode or cluster mode
- Most new developments appear in the Java and Scala APIs first; the Python API usually does not incorporate the most recent developments in the Spark APIs
- The `sort` operation on RDDs
- The `accumulate` operation in the context of Spark
- You can specify functions at the node level and the cluster level using the `accumulate` operator (see the sketch below)
- Page Rank - a 10,000 ft overview
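What my notes call `accumulate` looks closest to Spark's `RDD.aggregate`, which takes a zero value, a per-partition (node-level) function and a cross-partition (cluster-level) combine function; a sketch under that assumption (Spark's `Accumulator` variables are a separate feature):

```python
# Computes sum and count in one pass using aggregate.
from pyspark import SparkContext

sc = SparkContext("local[2]", "aggregate-sketch")
nums = sc.parallelize([1, 2, 3, 4, 5])

sum_count = nums.aggregate(
    (0, 0),                                        # zero value: (sum, count)
    lambda acc, x: (acc[0] + x, acc[1] + 1),       # applied within each partition (node level)
    lambda a, b: (a[0] + b[0], a[1] + b[1]))       # combines partition results (cluster level)

print(sum_count)   # (15, 5)
sc.stop()
```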
- Spark bundles the relevant libraries and sends them to each of the nodes that does the data processing
- Spark can be run on a single machine too, with Hadoop + VirtualBox (Ubuntu). Of course, it is not meant to be run on a single machine
- `join` operations create a massive shuffle of data across the cluster
- Spark SQL is a fantastic way to get the power of map-reduce and SQL all in one place
- Spark SQL can be run on Spark DataFrames
- `namedtuple` can be used to create classes on the fly
- One can easily ingest a massive JSON array into HDFS with one line of code. This means that the entire MRN data can be put into HDFS (see the sketch below)
- RDDs are immutable
- One can create an RDD from a text file, streaming system, from memory, from another RDD
- Lazy evaluation makes Spark very powerful
- MLlib has specific classes for specific algos, e.g. the `Rating` class to capture user ratings
- Spark gives you the power of combining the transformation + action paradigm, the SQL paradigm, the DataFrame paradigm and distributed storage
- LDA can be done using MLlib
- Java API looks very painful
- Spark can be used to write data to HDFS/Cassandra/any other place such as local disk
- Spark is written in Scala
- One can use `spark-submit` commands to submit a Spark job
From my notes
- For iterative algorithms, MLlib is super powerful
- DataFrame - in memory data table
- Row object in DataFrame
- The problem with Hadoop MapReduce is that all intermediate steps write their data to HDFS
- Broadcast variables - used to cache read-only data on each of the nodes (sketched below)
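A small sketch of a broadcast variable as a shared lookup table; the country-code data is made up:

```python
# A read-only lookup table shipped once to each node instead of with every task.
from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-sketch")

country_names = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
full = codes.map(lambda c: country_names.value.get(c, "unknown"))
print(full.collect())   # ['India', 'United States', 'India']

sc.stop()
```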
- Functions available with PairRDDs are (a short sketch follows this list)
- keys
- values
- groupByKey
- reduceByKey
- combineByKey
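A sketch of these pair-RDD functions on a made-up dataset:

```python
# keys/values/groupByKey/reduceByKey/combineByKey on a toy pair RDD.
from pyspark import SparkContext

sc = SparkContext("local[2]", "pair-rdd-sketch")
sales = sc.parallelize([("apple", 2), ("banana", 5), ("apple", 3)])

print(sales.keys().collect())                           # ['apple', 'banana', 'apple']
print(sales.values().collect())                         # [2, 5, 3]
print(sales.reduceByKey(lambda a, b: a + b).collect())  # summed per key
print(sales.groupByKey().mapValues(list).collect())     # values grouped per key

# combineByKey: build (sum, count) per key to get an average.
avg = sales.combineByKey(
    lambda v: (v, 1),                         # create a combiner from the first value
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # merge a value into a combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge combiners across partitions
).mapValues(lambda t: t[0] / float(t[1]))
print(avg.collect())

sc.stop()
```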
- Closures - Spark pulls all the relevant objects and classes needed to work on a job and distributes them to each node
- RDDs have lineage
- RDDs are fault tolerant
- No REPL in Hadoop Map Reduce
- Hadoop Map reduce can only do batch processing
- Spark advantages
- expresses computations in an intuitive way
- interactive shell for Python
- data is kept in memory
- can do stream processing
Reflections
Reflecting on the course learnings has helped me understand the content better, and I think I am now better equipped to go over Jose's course.