Apache Beam - Immersion
I have spent the last 2 days working on Apache Beam and here are my learnings:
Learning from Data Flow for Dummies
- If one were to build a hashtag auto-complete feature, one might have to create a few MapReduce jobs spread across a big cluster. This would involve creating the cluster, running the MapReduce jobs on it and then collecting the outputs
- Given a problem to solve, it is slightly tedious to convert the job into MapReduce tasks
- MapReduce analogy: Ingredients → Processed Ingredients → Shuffle → Sandwich
- If the reduce step is associative, things become much simpler. The map plus a partial reduce (combine) can be done on each machine before the shuffle phase, saving the massive operational cost of shuffling a larger set of key-value pairs
- If all you want is to run SQL on the data, then there is no need to look beyond BigQuery, as it does everything
- Evolution of technologies
- 2003 - GFS
- 2004 - MapReduce
- 2006 - Bigtable
- 2007 - Paxos
- 2010 - Flume
- 2011 - Dremel
- 2012 - Spanner
- 2013 - MillWheel
- FlumeJava (2010) is the basis for DataFlow
- MillWheel - takes care of the networking nitty-gritties of low-latency stream processing
- PCollection - Parallel Collection
- DataFlow Runner optimizes the DAG
- DataFlow sets up a cluster, does the work and destroys the cluster
- Flatten - synonymous with putting elements from several collections into one bag
- Apache Beam has the potential to do in the open-source world what MapReduce did
Active Recall from JGarg course
- Beam is a way to create framework-agnostic data pipelines that can be written in the language of your choice, i.e. Java, Python or Go
- This beam code can then be run on any runner such as DataFlow, Flink, Spark etc
- The main data abstractions are PCollections and PTransforms (a minimal sketch follows this list).
- PTransforms form the nodes
- PCollections form the edges
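A minimal sketch of these two abstractions, assuming the Python SDK (installed as apache_beam) and made-up element values:

```python
# PTransforms are the nodes of the pipeline graph, PCollections the edges.
import apache_beam as beam

with beam.Pipeline() as p:                                   # DirectRunner by default
    words = p | beam.Create(["beam", "flink", "dataflow"])   # a PCollection
    upper = words | beam.Map(str.upper)                      # a PTransform yields a new PCollection
    upper | beam.Map(print)                                  # print each element
```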
- Beam has a set of IO connectors (a hedged read/write sketch follows these lists). The Python SDK can read from
- Text
- Avro
- MongoDB
- PubSub
- Any file in a Google Cloud Storage bucket
- Python SDK can write the output to
- File
- Big Query
- Avro
- Parquet
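A hedged sketch of reading and writing with the Python SDK IO connectors; the file path, BigQuery table and schema below are made up, and the BigQuery write additionally needs the GCP extras installed:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Read lines from a text file (a GCS path such as gs://my-bucket/input.txt
    # works the same way; this local path is hypothetical).
    lines = p | "ReadText" >> beam.io.ReadFromText("input.txt")

    # Write the lines back out as sharded text files.
    lines | "WriteText" >> beam.io.WriteToText("output", file_name_suffix=".txt")

    # Write rows to BigQuery (project, dataset, table and schema are made up).
    rows = lines | "ToRow" >> beam.Map(lambda line: {"line": line})
    rows | "WriteBQ" >> beam.io.WriteToBigQuery(
        "my-project:my_dataset.my_table",
        schema="line:STRING",
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```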
- ParDo is akin to the map step
- CombinePerKey is akin to the reduce step
- There are some derived transforms such as Filter, Map and FlatMap that can in turn be used to make the code efficient (see the sketch below)
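A small word-count style sketch of these transforms (the input strings are made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create(["the quick brown fox", "the lazy dog"])
        | beam.FlatMap(lambda line: line.split())   # one output element per word
        | beam.Filter(lambda word: len(word) > 2)   # drop very short words
        | beam.Map(lambda word: (word, 1))          # key-value pairs
        | beam.CombinePerKey(sum)                   # the "reduce" step
    )
    counts | beam.Map(print)
```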
- You can write custom datatypes in PCollection
- Step names (comments) can be attached to transforms so that they appear as labelled steps in the DAG
- These step names must be unique and cannot be repeated within a pipeline
- The pipeline can be built inside a with block, which runs the pipeline automatically on exit
- All the runner settings can be customized via the PipelineOptions class (see the sketch below)
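A sketch showing unique step labels, the with block and PipelineOptions; the option values are only illustrative:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runner settings live in PipelineOptions; these values are placeholders.
options = PipelineOptions(
    runner="DirectRunner",   # or DataflowRunner, FlinkRunner, ...
    job_name="label-demo",
)

# Leaving the with block runs the pipeline automatically.
with beam.Pipeline(options=options) as p:
    (
        p
        | "CreateNumbers" >> beam.Create([1, 2, 3])   # labels become DAG step names
        | "SquareThem" >> beam.Map(lambda x: x * x)   # each label must be unique
        | "PrintThem" >> beam.Map(print)
    )
```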
- Using the Beam SDK workers, the respective runners create a DAG and execute the DAG
- Beam can be used to specify streaming data pipelines
- Concept of watermarking, so that Beam knows when it can kick off the computations for a window
- Windowing - Fixed, Sliding, Session, Global
- Triggers - Early, On-time and Late triggers
- Accumulation mode - Discarding or Accumulating
- Basic idea of the interplay between processing time and event time (a windowing/trigger sketch follows this list)
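A sketch of windowing, an early trigger and the accumulation mode; it assumes events is a keyed, timestamped PCollection from an unbounded source:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AccumulationMode)

def windowed_sums(events):
    """Fixed one-minute event-time windows with an early (speculative) firing
    every 30 seconds of processing time and discarding accumulation.
    `events` is assumed to be a keyed, timestamped PCollection."""
    return (
        events
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=AfterWatermark(early=AfterProcessingTime(30)),
            accumulation_mode=AccumulationMode.DISCARDING)
        | beam.CombinePerKey(sum)   # computed per key, per window
    )
```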
- Apache Beam is a portable and unified framework for writing data pipelines
- Apache Beam is based on Flume Java and Millwheel
- Google has taken something very proprietary and worked with the Apache community to make Beam an open-source framework
- Beam + Flink will be increasingly adopted across enterprises
- The Beam capability matrix addresses four questions
- What results are being computed?
- Where in event time are they computed?
- When in processing time are the results emitted?
- How do refinements of the results relate?
- DataFlow/Flink are DAG optimizers
- There are cases where one can give side inputs to PTransforms (see the sketch below)
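A side-input sketch over made-up data: the single maximum length computed from one PCollection is fed into a Filter over another:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Words" >> beam.Create(["beam", "flink", "spark", "dataflow"])

    # A single value computed from a PCollection, to be used as a side input.
    max_len = (
        words
        | "Lengths" >> beam.Map(len)
        | "MaxLen" >> beam.CombineGlobally(max)
    )

    # The side input arrives as an extra (keyword) argument to the transform.
    (
        words
        | "KeepLongest" >> beam.Filter(
            lambda word, longest: len(word) == longest,
            longest=beam.pvalue.AsSingleton(max_len))
        | beam.Map(print)
    )
```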
- ParDo takes a class that inherits from the beam.DoFn class
- Custom PTransforms inherit from beam.PTransform - I did not recollect this part during recall
- One has to override the process method of the DoFn
- ParDo can be used to Filter, Map and many other things - it is a general-purpose parallel operation (see the DoFn sketch below)
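A minimal DoFn sketch; only the process method is overridden:

```python
import apache_beam as beam

class SplitWordsFn(beam.DoFn):
    """Used with ParDo; process() may yield zero, one or many outputs per input."""
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["hello apache beam"])
        | beam.ParDo(SplitWordsFn())
        | beam.Map(print)
    )
```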
- One transform can give rise to multiple PCollections using the tagged-output feature (see the sketch below)
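A tagged-output sketch; the tag names even and odd are made up:

```python
import apache_beam as beam
from apache_beam import pvalue

class ClassifyFn(beam.DoFn):
    """Routes each number to one of two tagged outputs."""
    def process(self, n):
        if n % 2 == 0:
            yield pvalue.TaggedOutput("even", n)
        else:
            yield pvalue.TaggedOutput("odd", n)

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create([1, 2, 3, 4, 5])
        | beam.ParDo(ClassifyFn()).with_outputs("even", "odd")
    )
    results.even | "PrintEven" >> beam.Map(lambda n: print("even", n))
    results.odd | "PrintOdd" >> beam.Map(lambda n: print("odd", n))
```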
- PCollections can be joined - in the Python SDK this is typically done with CoGroupByKey (see the sketch below)
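A CoGroupByKey join sketch over two made-up keyed PCollections:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("amy", "amy@example.com"),
                                          ("raj", "raj@example.com")])
    phones = p | "Phones" >> beam.Create([("amy", "555-1234")])

    # CoGroupByKey joins the keyed PCollections on their keys; each output
    # element looks like ("amy", {"emails": [...], "phones": [...]}).
    joined = {"emails": emails, "phones": phones} | beam.CoGroupByKey()
    joined | beam.Map(print)
```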
- Beam Code can contain branching
- For debugging purposes, one can run Beam in local mode (the DirectRunner)
- A session window stays open until a period of inactivity
- A sliding window has a window duration and a sliding period
- Specifying a global window essentially means batch processing
- The default trigger is the on-time trigger, which fires when the watermark passes the end of the window
- Bounded and Unbounded datasets is a better way to think than batch and stream
- Apache Flink - the stream is at the heart of the framework
- Java SDK has all the joins implemented - Python has only the basic join
- One can send messages to a topic, let a Beam program read the messages from a subscription and publish messages to another topic. This whole pipeline is an interesting way to learn Pub/Sub and Beam (a hedged sketch follows)
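A hedged sketch of that Pub/Sub round trip; the project, subscription and topic names are placeholders, and the pipeline has to run in streaming mode:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True   # Pub/Sub sources are unbounded

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadMessages" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Upper" >> beam.Map(lambda msg: msg.upper())   # messages arrive as bytes
        | "WriteMessages" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/my-output-topic")
    )
```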
- Python 3 support is still underway for Beam
- One needs to provide custom coder classes for serializing and deserializing custom data types in a PCollection (see the sketch below)
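A sketch of a custom coder for a made-up Point type:

```python
import apache_beam as beam

class Point:
    """A custom data type carried inside a PCollection."""
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointCoder(beam.coders.Coder):
    """Tells Beam how to serialize and deserialize Point elements."""
    def encode(self, point):
        return ("%d,%d" % (point.x, point.y)).encode("utf-8")

    def decode(self, encoded):
        x, y = encoded.decode("utf-8").split(",")
        return Point(int(x), int(y))

    def is_deterministic(self):
        return True

# Register the coder so PCollections of Points use it.
beam.coders.registry.register_coder(Point, PointCoder)
```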
- PCollection is immutable
- A PCollection can be spread across multiple machines
- PTransform functions provide an abstraction of map reduce tasks
- One can easily write a program in beam rather than actually think through Map Reduce tasks
What have I scribbled in the notes?
- All elements of PCollection should have the same type
- Beam has functions to assign a timestamp to each element (see the sketch below)
- Pubsub by default assigns a timestamp to every message
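A sketch of assigning event-time timestamps; the events and the field name ts are made up:

```python
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

events = [{"user": "amy", "ts": 1600000000},
          {"user": "raj", "ts": 1600000060}]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach each element's own event time as its Beam timestamp.
        | beam.Map(lambda e: TimestampedValue(e, e["ts"]))
        | beam.Map(print)
    )
```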
- A Combine transform can take a class that inherits from beam.CombineFn and overrides four methods - create_accumulator, add_input, merge_accumulators, extract_output (see the sketch below)
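A sketch of a CombineFn that overrides the four methods to compute a mean:

```python
import apache_beam as beam

class MeanFn(beam.CombineFn):
    """Computes a mean by overriding the four CombineFn methods."""
    def create_accumulator(self):
        return (0.0, 0)                      # (running sum, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")

with beam.Pipeline() as p:
    p | beam.Create([1, 2, 3, 4]) | beam.CombineGlobally(MeanFn()) | beam.Map(print)
```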
- Four questions
- What results are calculated?
- Where in event time are the results calculated?
- When in processing time are the results materialized?
- How do refinements of results relate?
- Early and Speculative firings
- Triggers prompt Beam to emit data
- One can also create composite triggers (a hedged sketch follows)
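A sketch of a composite trigger, assuming events is a keyed, timestamped PCollection; the window size, element count and delays are arbitrary:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AfterAny, AfterCount, AfterProcessingTime, Repeatedly, AccumulationMode)

def windowed_counts(events):
    """Fire repeatedly whenever 100 elements arrive OR 60 seconds of
    processing time pass, within one-minute fixed windows."""
    return (
        events
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=Repeatedly(AfterAny(AfterCount(100),
                                        AfterProcessingTime(60))),
            accumulation_mode=AccumulationMode.ACCUMULATING)
        | beam.CombinePerKey(sum)
    )
```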
- Flink has a stream-first architecture
- Flink is much better than Storm and Spark
- ParDo is a generic function that can be used to filter dataset, do formatting, do type conversion, extract parts, do computations on each element
- The FlumeJava paper is available in the public domain
Takeaway
I have relearnt a ton of aspects by immersing myself in Apache Beam, from Data Flow for Dummies to the JGarg course.