From 0 to 1 Spark for Data Science with Python
What have I learnt from this awesome course on Spark?
From Memory
Here is my attempt to write 50 takeaways from the course:
- Emacs configuration: running Spark from Emacs was the first thing I wanted to set up, so that I could quickly work through the various commands
(setq python-shell-interpreter "C:/ProgramData/Anaconda2/envs/sparkl/Scripts/ipython")
- Spark makes it easy to write map-reduce jobs. The Spark engine takes care of generating the map-reduce jobs and submitting them to the storage manager and the resource scheduler
- Spark has APIs in Java, Scala, Python and R and thus makes it easy for a programmer to get up and running
- RDD - Resilient Distributed Dataset - Main data structure of Spark Core
- Basic architecture
- Spark SQL + MLlib + GraphX + Spark Streaming
- Distributed Storage + Resource Scheduler
- Spark Core
- RDD traits : In-memory, distributed, fault-tolerant
- Pair RDD where the data is in the form of a key value pair
- Transformations and Actions can be applied on RDDs (a short sketch follows the lists below)
- Actions realize the RDD whereas Transformations do not. This lazy evaluation makes Spark very powerful
- Common transformations are
- map
- flatMap
- filter
- textFile (strictly an RDD creation method rather than a transformation)
- join
- groupBy
- Common Actions are
- collect
- take
- first
- One can use `persist` to cache the RDD on nodes
- One can use `cache` so that shuffle operations do not create operational bottlenecks
- Install Spark using VirtualBox
- The SparkDriver, the DAG Scheduler and the TaskSetManager are the three components that work together after a Spark job is submitted
- The SparkDriver creates the SparkContext, through which the client communicates with the Spark cluster
- One can use any distributed storage and resource scheduler. If you use Hadoop, the distributed storage is HDFS and the resource scheduler is YARN (a driver sketch follows)
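A sketch of what the driver side looks like, assuming a YARN-backed cluster; the master setting and the HDFS path are placeholders:

```python
# The driver program creates the SparkContext, which is the client's handle to
# the cluster. Swap the master for "local[*]" to run on a single machine.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("driver-sketch")   # hypothetical app name
        .setMaster("yarn"))            # with Hadoop: HDFS for storage, YARN for scheduling

sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///path/to/input.txt")      # hypothetical HDFS path
print(rdd.count())

sc.stop()
```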
- If you write a reduce job, the function you pass should take two arguments, e.g. `lambda x, y: x + y`
- In order to work with SparkStreaming , you need to work with DStreams objects
- Windowed DStream operations take four arguments: two functions that define how the statistics are updated for values entering and leaving the window, and two further arguments that specify the window length and the sliding interval (see the sketch after these points)
- Spark Streaming is not truly event streaming - it is microbatching
- One can feed input from Kafka in to Spark Streaming
- For true event streaming, one might have to use Storm or Flink to get events and then Spark can further process them
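A sketch of the windowed DStream API described above, assuming a socket source on a made-up host/port; the inverse-function form requires a checkpoint directory:

```python
# reduceByKeyAndWindow takes a function for values entering the window, an
# inverse function for values leaving it, plus the window length and slide
# interval. Spark Streaming processes these as micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)        # micro-batches every second
ssc.checkpoint("/tmp/spark-checkpoint")            # hypothetical path

lines = ssc.socketTextStream("localhost", 9999)    # could also be a Kafka source
pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,      # values entering the window
    lambda a, b: a - b,      # values leaving the window
    windowDuration=30,       # window length in seconds
    slideDuration=10)        # sliding interval in seconds

counts.pprint()
ssc.start()
ssc.awaitTermination()
```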
- There are more methods for DataFrames than for RDDs, as mentioned in the implementation of PR (PageRank)
- Spark ships with MLlib, which implements collaborative filtering, regression, clustering and other common algorithms
- Collaborative filtering via Alternating Least Squares is implemented in Spark
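A minimal collaborative-filtering sketch with the RDD-based MLlib API; the ratings and hyperparameters are made up:

```python
# Alternating Least Squares on a handful of toy ratings.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[2]", "als-sketch")

# Rating(user, product, rating) is the class MLlib uses to capture user ratings.
ratings = sc.parallelize([
    Rating(1, 10, 5.0),
    Rating(1, 20, 3.0),
    Rating(2, 10, 4.0),
    Rating(2, 30, 1.0),
])

model = ALS.train(ratings, rank=10, iterations=5)

# Predict how user 2 would rate product 20.
print(model.predict(2, 20))

sc.stop()
```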
- Spark came out of UC Berkeley; Databricks is the company that sells a commercial version of Spark
- Spark can be run in shell mode, local mode or cluster mode
- Most new developments appear in the Java and Scala APIs first; the Python API usually does not incorporate the most recent developments in the Spark APIs
- The `sort` operation on RDDs
- The `accumulate` operation in the context of Spark
- You can specify functions at the node level and the cluster level using the `accumulate` operator (see the sketch below)
- Page Rank - a 10,000 ft overview
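What my notes call `accumulate` looks closest to Spark's `RDD.aggregate`, which takes a zero value, a per-partition (node-level) function and a cross-partition (cluster-level) combine function; a sketch under that assumption (Spark's `Accumulator` variables are a separate feature):

```python
# Computes sum and count in one pass using aggregate.
from pyspark import SparkContext

sc = SparkContext("local[2]", "aggregate-sketch")
nums = sc.parallelize([1, 2, 3, 4, 5])

sum_count = nums.aggregate(
    (0, 0),                                        # zero value: (sum, count)
    lambda acc, x: (acc[0] + x, acc[1] + 1),       # applied within each partition (node level)
    lambda a, b: (a[0] + b[0], a[1] + b[1]))       # combines partition results (cluster level)

print(sum_count)   # (15, 5)
sc.stop()
```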
- Spark bundles the relevant libraries and sends them to each of the nodes that does the data processing
- Spark can be run on a single machine too, with Hadoop + VirtualBox (Ubuntu). Of course, it is not meant to be run on a single machine
- `join` operations create a massive shuffle of data across the cluster
- Spark SQL is a fantastic way to get the power of map-reduce and SQL all in one place
- Spark SQL can be run on Spark DataFrames
- `namedtuple` can be used to create classes on the fly
- One can easily ingest a massive JSON array into HDFS with one line of code. This means that the entire MRN data can be put into HDFS (see the sketch below)
- RDDs are immutable
- One can create an RDD from a text file, streaming system, from memory, from another RDD
- Lazy evaluation makes Spark very powerful
- MLlib has specific classes for specific algos, e.g. the `Rating` class to capture user ratings
- Spark gives you the power of combining the transformation + action paradigm, the SQL paradigm, the DataFrame paradigm and distributed storage
- LDA can be done using MLlib
- Java API looks very painful
- Spark can be used to write data to HDFS/Cassandra/any other place such as local disk
- Spark is written in Scala
- One can use `spark-submit` commands to submit a Spark job
From my notes
- For iterative algorithms, MLlib is super powerful
- DataFrame - in memory data table
- Row object in DataFrame
- The problem with Hadoop MapReduce is that all intermediate steps write their data to HDFS
- Broadcast variables - used to cache read-only data on each of the nodes (sketched below)
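A small sketch of a broadcast variable as a shared lookup table; the country-code data is made up:

```python
# A read-only lookup table shipped once to each node instead of with every task.
from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-sketch")

country_names = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
full = codes.map(lambda c: country_names.value.get(c, "unknown"))
print(full.collect())   # ['India', 'United States', 'India']

sc.stop()
```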
- Functions available with PairRDDs are (a short sketch follows this list)
- keys
- values
- groupByKey
- reduceByKey
- combineByKey
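A sketch of these pair-RDD functions on a made-up dataset:

```python
# keys/values/groupByKey/reduceByKey/combineByKey on a toy pair RDD.
from pyspark import SparkContext

sc = SparkContext("local[2]", "pair-rdd-sketch")
sales = sc.parallelize([("apple", 2), ("banana", 5), ("apple", 3)])

print(sales.keys().collect())                           # ['apple', 'banana', 'apple']
print(sales.values().collect())                         # [2, 5, 3]
print(sales.reduceByKey(lambda a, b: a + b).collect())  # summed per key
print(sales.groupByKey().mapValues(list).collect())     # values grouped per key

# combineByKey: build (sum, count) per key to get an average.
avg = sales.combineByKey(
    lambda v: (v, 1),                         # create a combiner from the first value
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # merge a value into a combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge combiners across partitions
).mapValues(lambda t: t[0] / float(t[1]))
print(avg.collect())

sc.stop()
```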
- Closures - Spark pulls all the relevant objects and classes needed to work on a job and distributes them to each node
- RDDs have lineage
- RDDs are fault tolerant
- No REPL in Hadoop Map Reduce
- Hadoop Map reduce can only do batch processing
- Spark advantages
- expresses computations in an intuitive way
- interactive shell for Python
- data is kept in memory
- can do stream processing
Reflections
Reflecting on the course learnings has helped me understand the content better, and I think I am now better equipped to go over Jose's course.