Apache Beam - Immersion
I have spent the last 2 days working on Apache Beam and here are my learnings:
Learning from Data Flow for Dummies
- If one were to build a hashtag auto-complete feature, one might have to create a few MapReduce jobs spread across a big cluster. This would involve creating the cluster, running the MapReduce jobs on it and then collecting the outputs
- Given a problem to solve, it is slightly tedious to convert the job into MapReduce tasks
- MapReduce analogy: Ingredients → Processed Ingredients → Shuffle → Sandwich
- If the reduce step is associative, things become much simpler. The map plus a partial reduce (combine) can be done on each machine before the shuffle phase, saving the massive operational cost of shuffling a larger set of key-value pairs
- If all you want is to run SQL on the data, then there is no need to look beyond BigQuery, as it does everything
- Evolution of technologies
- 2003 - GFS
- 2004 - MapReduce
- 2006 - Bigtable
- 2007 - Paxos
- 2010 - Flume
- 2011 - Dremel
- 2012 - Spanner
- 2013 - MillWheel
- FlumeJava (2010) is the basis for DataFlow
- MillWheel - takes care of the networking nitty-gritties of low-latency stream processing
- PCollection - Parallel Collection
- DataFlow Runner optimizes the DAG
- DataFlow sets up a cluster, does the work and destroys the cluster
- Flatten - synonymous with putting elements from several collections into one bag
- Apache Beam has the potential to do in the open-source world what MapReduce did
Active Recall from JGarg course
- Beam is a way to create framework-agnostic data pipelines that can be written in the language of your choice, i.e. Java, Python or Go
- This beam code can then be run on any runner such as DataFlow, Flink, Spark etc
- The main data abstractions are PCollections and PTransforms (a minimal sketch follows this list).
- PTransforms form the nodes
- PCollections form the edges
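A minimal sketch of these two abstractions, assuming the Python SDK (installed as apache_beam) and made-up element values:

```python
# PTransforms are the nodes of the pipeline graph, PCollections the edges.
import apache_beam as beam

with beam.Pipeline() as p:                                   # DirectRunner by default
    words = p | beam.Create(["beam", "flink", "dataflow"])   # a PCollection
    upper = words | beam.Map(str.upper)                      # a PTransform yields a new PCollection
    upper | beam.Map(print)                                  # print each element
```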
- Beam has a set of IO connectors (a hedged read/write sketch follows these lists). The Python SDK can read from
- Text
- Avro
- MongoDB
- PubSub
- Any file in a Google Cloud Storage bucket
- Python SDK can write the output to
- File
- Big Query
- Avro
- Parquet
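A hedged sketch of reading and writing with the Python SDK IO connectors; the file path, BigQuery table and schema below are made up, and the BigQuery write additionally needs the GCP extras installed:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Read lines from a text file (a GCS path such as gs://my-bucket/input.txt
    # works the same way; this local path is hypothetical).
    lines = p | "ReadText" >> beam.io.ReadFromText("input.txt")

    # Write the lines back out as sharded text files.
    lines | "WriteText" >> beam.io.WriteToText("output", file_name_suffix=".txt")

    # Write rows to BigQuery (project, dataset, table and schema are made up).
    rows = lines | "ToRow" >> beam.Map(lambda line: {"line": line})
    rows | "WriteBQ" >> beam.io.WriteToBigQuery(
        "my-project:my_dataset.my_table",
        schema="line:STRING",
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```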
- ParDo is akin to the map step
- CombinePerKey is akin to the reduce step
- There are some derived transforms such as Filter, Map and FlatMap that can in turn be used to make the code efficient (see the sketch below)
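A small word-count style sketch of these transforms (the input strings are made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create(["the quick brown fox", "the lazy dog"])
        | beam.FlatMap(lambda line: line.split())   # one output element per word
        | beam.Filter(lambda word: len(word) > 2)   # drop very short words
        | beam.Map(lambda word: (word, 1))          # key-value pairs
        | beam.CombinePerKey(sum)                   # the "reduce" step
    )
    counts | beam.Map(print)
```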
- You can write custom datatypes in PCollection
- Step names (comments) can be attached to transforms so that they appear as labelled steps in the DAG
- These step names must be unique and cannot be repeated within a pipeline
- The pipeline can be built inside a with block, which runs the pipeline automatically on exit
- All the runner settings can be customized via the PipelineOptions class (see the sketch below)
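A sketch showing unique step labels, the with block and PipelineOptions; the option values are only illustrative:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runner settings live in PipelineOptions; these values are placeholders.
options = PipelineOptions(
    runner="DirectRunner",   # or DataflowRunner, FlinkRunner, ...
    job_name="label-demo",
)

# Leaving the with block runs the pipeline automatically.
with beam.Pipeline(options=options) as p:
    (
        p
        | "CreateNumbers" >> beam.Create([1, 2, 3])   # labels become DAG step names
        | "SquareThem" >> beam.Map(lambda x: x * x)   # each label must be unique
        | "PrintThem" >> beam.Map(print)
    )
```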
- Using the Beam SDK workers, the respective runners create a DAG and execute the DAG
- Beam can be used to specify streaming data pipelines
- Concept of watermarking, so that Beam knows when it can kick off the computations for a window
- Windowing - Fixed, Sliding, Session, Global
- Triggers - Early, On-time and Late triggers
- Accumulation mode - Discarding or Accumulating
- Basic idea of the interplay between processing time and event time (a windowing/trigger sketch follows this list)
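A sketch of windowing, an early trigger and the accumulation mode; it assumes events is a keyed, timestamped PCollection from an unbounded source:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AccumulationMode)

def windowed_sums(events):
    """Fixed one-minute event-time windows with an early (speculative) firing
    every 30 seconds of processing time and discarding accumulation.
    `events` is assumed to be a keyed, timestamped PCollection."""
    return (
        events
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=AfterWatermark(early=AfterProcessingTime(30)),
            accumulation_mode=AccumulationMode.DISCARDING)
        | beam.CombinePerKey(sum)   # computed per key, per window
    )
```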
- Apache Beam is a portable and unified framework for writing data pipelines
- Apache Beam is based on Flume Java and Millwheel
- Google has taken something very proprietary and worked with the Apache community to make Beam an open-source framework
- Beam + Flink will be increasingly adopted across enterprises
- The Beam capability matrix addresses four questions
- What results are being computed?
- Where in event time are they computed?
- When in processing time are the results emitted?
- How do refinements of the results relate?
- DataFlow/Flink are DAG optimizers
- There are cases where one can give side inputs to PTransforms (see the sketch below)
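A side-input sketch over made-up data: the single maximum length computed from one PCollection is fed into a Filter over another:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Words" >> beam.Create(["beam", "flink", "spark", "dataflow"])

    # A single value computed from a PCollection, to be used as a side input.
    max_len = (
        words
        | "Lengths" >> beam.Map(len)
        | "MaxLen" >> beam.CombineGlobally(max)
    )

    # The side input arrives as an extra (keyword) argument to the transform.
    (
        words
        | "KeepLongest" >> beam.Filter(
            lambda word, longest: len(word) == longest,
            longest=beam.pvalue.AsSingleton(max_len))
        | beam.Map(print)
    )
```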
- ParDo takes a class that inherits from the beam.DoFn class
- Custom PTransforms inherit from beam.PTransform - I did not recollect this part during recall
- One has to override the process method of the DoFn
- ParDo can be used to Filter, Map and many other things - it is a general-purpose parallel operation (see the DoFn sketch below)
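A minimal DoFn sketch; only the process method is overridden:

```python
import apache_beam as beam

class SplitWordsFn(beam.DoFn):
    """Used with ParDo; process() may yield zero, one or many outputs per input."""
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["hello apache beam"])
        | beam.ParDo(SplitWordsFn())
        | beam.Map(print)
    )
```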
- One transform can give rise to multiple PCollections using the tagged-output feature (see the sketch below)
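A tagged-output sketch; the tag names even and odd are made up:

```python
import apache_beam as beam
from apache_beam import pvalue

class ClassifyFn(beam.DoFn):
    """Routes each number to one of two tagged outputs."""
    def process(self, n):
        if n % 2 == 0:
            yield pvalue.TaggedOutput("even", n)
        else:
            yield pvalue.TaggedOutput("odd", n)

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create([1, 2, 3, 4, 5])
        | beam.ParDo(ClassifyFn()).with_outputs("even", "odd")
    )
    results.even | "PrintEven" >> beam.Map(lambda n: print("even", n))
    results.odd | "PrintOdd" >> beam.Map(lambda n: print("odd", n))
```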
- PCollections can be joined - in the Python SDK this is typically done with CoGroupByKey (see the sketch below)
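A CoGroupByKey join sketch over two made-up keyed PCollections:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("amy", "amy@example.com"),
                                          ("raj", "raj@example.com")])
    phones = p | "Phones" >> beam.Create([("amy", "555-1234")])

    # CoGroupByKey joins the keyed PCollections on their keys; each output
    # element looks like ("amy", {"emails": [...], "phones": [...]}).
    joined = {"emails": emails, "phones": phones} | beam.CoGroupByKey()
    joined | beam.Map(print)
```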
- Beam Code can contain branching
- For debugging purposes, one can run Beam in local mode (the DirectRunner)
- A session window stays open until a period of inactivity
- A sliding window has a window duration and a sliding period
- Specifying a global window essentially means batch processing
- The default trigger is the on-time trigger, which fires when the watermark passes the end of the window
- Bounded and Unbounded datasets is a better way to think than batch and stream
- Apache Flink - the stream is at the heart of the framework
- Java SDK has all the joins implemented - Python has only the basic join
- One can send messages to a topic, let a Beam program read the messages from a subscription and publish messages to another topic. This whole pipeline is an interesting way to learn Pub/Sub and Beam (a hedged sketch follows)
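A hedged sketch of that Pub/Sub round trip; the project, subscription and topic names are placeholders, and the pipeline has to run in streaming mode:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True   # Pub/Sub sources are unbounded

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadMessages" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Upper" >> beam.Map(lambda msg: msg.upper())   # messages arrive as bytes
        | "WriteMessages" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/my-output-topic")
    )
```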
- Python 3 support is still underway for Beam
- One needs to provide custom coder classes for serializing and deserializing custom data types in a PCollection (see the sketch below)
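A sketch of a custom coder for a made-up Point type:

```python
import apache_beam as beam

class Point:
    """A custom data type carried inside a PCollection."""
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointCoder(beam.coders.Coder):
    """Tells Beam how to serialize and deserialize Point elements."""
    def encode(self, point):
        return ("%d,%d" % (point.x, point.y)).encode("utf-8")

    def decode(self, encoded):
        x, y = encoded.decode("utf-8").split(",")
        return Point(int(x), int(y))

    def is_deterministic(self):
        return True

# Register the coder so PCollections of Points use it.
beam.coders.registry.register_coder(Point, PointCoder)
```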
- PCollection is immutable
- A PCollection can be spread across multiple machines
- PTransform functions provide an abstraction of map reduce tasks
- One can easily write a program in beam rather than actually think through Map Reduce tasks
What have I scribbled in the notes?
- All elements of PCollection should have the same type
- Beam has functions to assign a timestamp to each element (see the sketch below)
- Pubsub by default assigns a timestamp to every message
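A sketch of assigning event-time timestamps; the events and the field name ts are made up:

```python
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

events = [{"user": "amy", "ts": 1600000000},
          {"user": "raj", "ts": 1600000060}]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach each element's own event time as its Beam timestamp.
        | beam.Map(lambda e: TimestampedValue(e, e["ts"]))
        | beam.Map(print)
    )
```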
- A Combine transform can take a class that inherits from beam.CombineFn and overrides four methods - create_accumulator, add_input, merge_accumulators, extract_output (see the sketch below)
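A sketch of a CombineFn that overrides the four methods to compute a mean:

```python
import apache_beam as beam

class MeanFn(beam.CombineFn):
    """Computes a mean by overriding the four CombineFn methods."""
    def create_accumulator(self):
        return (0.0, 0)                      # (running sum, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")

with beam.Pipeline() as p:
    p | beam.Create([1, 2, 3, 4]) | beam.CombineGlobally(MeanFn()) | beam.Map(print)
```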
- Four questions
- What results are calculated?
- Where in event time are the results calculated?
- When in processing time are the results materialized?
- How do refinements of results relate?
- Early and Speculative firings
- Triggers prompt Beam to emit data
- One can also create composite triggers (a hedged sketch follows)
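A sketch of a composite trigger, assuming events is a keyed, timestamped PCollection; the window size, element count and delays are arbitrary:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AfterAny, AfterCount, AfterProcessingTime, Repeatedly, AccumulationMode)

def windowed_counts(events):
    """Fire repeatedly whenever 100 elements arrive OR 60 seconds of
    processing time pass, within one-minute fixed windows."""
    return (
        events
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=Repeatedly(AfterAny(AfterCount(100),
                                        AfterProcessingTime(60))),
            accumulation_mode=AccumulationMode.ACCUMULATING)
        | beam.CombinePerKey(sum)
    )
```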
- Flink has a stream-first architecture
- Flink is much better than Storm and Spark
- ParDo is a generic function that can be used to filter dataset, do formatting, do type conversion, extract parts, do computations on each element
- The FlumeJava paper is available in the public domain
Takeaway
I have relearnt a ton of aspects by immersing myself in Apache Beam, from Data Flow for Dummies to the JGarg course.