Apache Beam - Active Retrieval
It was two weeks ago that I spent three hours doing Beam Katas. For the three days before that, I had immersed myself in Beam, trying to understand it from JGarg’s course. In this post, I will spend the next 15 minutes doing active recall of all the stuff:
- Beam is a unified programming model for specifying data processing pipelines and data integration logic
- You can write the logic using Java, Python, or Scala (via Scio) and execute it on a runner such as Flink, Spark, or Cloud Dataflow
- There is a set of functions, very similar to map and reduce, that can be used to create data pipelines
- PCollections - the core data abstraction: immutable, distributed collections of elements
- Apache, from its humble Hadoop origins, has now grown into a massive ecosystem with YARN, MapReduce, Spark, ZooKeeper, HBase, Cassandra, Oozie, Flume, Pig, Hive, Flink, Storm, Kafka, and similar projects
- It is interesting to draw parallels with what is available on GCP. For each component of the Hadoop ecosystem, you will find a parallel component in the GCP world and in the AWS world
- A solid understanding of the Hadoop ecosystem will help one understand the various components of the AWS and GCP ecosystems
- Some of the components in AWS and GCP are managed services of Hadoop ecosystem elements (EMR and Dataproc, for example)
- The strange-looking pipe syntax that is used to define a Beam pipeline (a minimal sketch appears after this list)
- A PipelineOptions class is used to configure and instantiate a Beam pipeline
- One can run a Beam pipeline locally on a laptop using the DirectRunner
- A Beam pipeline is essentially a DAG and needs a runner to realize the operations
- Spark vs. Flink - the latter is a true stream processing engine, whereas Spark Streaming is, at heart, a batch processing engine that also does stream processing
- There are some methods that you need to override to come up with custom transformations (for a DoFn it is process(); see the sketch after this list)
- There are also PCollection and PTransform objects
- There are DoFn objects - what do they do? I do not recollect at all
- Almost everything can be done with DoFn objects
- There are lightweight classes such as FlatMap that can do similar stuff to what a DoFn can do
- Reading from a file and writing to a file are also PTransforms (used in the WordCount sketch below)
- There are some join operations that can be done, such as CoGroupByKey (sketched below)
- There is no built-in way to keep track of column metadata. Unlike Spark, which provides DataFrames, in Beam the user has to manage the structure of records manually
- WordCount is the “hello world” example in Beam (sketched after this list)
- You need to run Beam code on some engine - it can indeed be run on Spark, via the SparkRunner
- There are concepts around windowing that can be used to build data pipelines over real-time data (a windowing sketch appears after this list)
- Fantastic podcast - Frances Perry on Apache Beam and real-time data processing
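
To jog my memory, here is a minimal sketch of the pipe syntax, the PipelineOptions class, and running locally, assuming the Python SDK (`pip install apache-beam`); the step labels and values are my own:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Options configure the pipeline; the DirectRunner executes it locally on a laptop
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3, 4])  # yields a PCollection
        | "Square" >> beam.Map(lambda x: x * x)  # a PTransform
        | "Print" >> beam.Map(print)
    )
```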
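A DoFn holds per-element processing logic and is applied with ParDo; the lightweight Map/FlatMap wrappers cover the simple cases. A sketch, again assuming the Python SDK, with invented data:

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    def process(self, element):
        # process() may yield zero or more outputs per input element
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    lines = p | beam.Create(["the quick brown fox", "jumps over"])
    # Full DoFn, applied via ParDo
    words_a = lines | "ViaDoFn" >> beam.ParDo(SplitWords())
    # The equivalent lightweight form
    words_b = lines | "ViaFlatMap" >> beam.FlatMap(lambda line: line.split())
```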
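The WordCount “hello world”, which also shows that reading and writing files are themselves PTransforms. A sketch assuming the Python SDK; `input.txt` and the `counts` output prefix are placeholders:

```python
import re
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

with beam.Pipeline() as p:
    (
        p
        | "Read" >> ReadFromText("input.txt")
        | "ExtractWords" >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        | "Count" >> beam.combiners.Count.PerElement()  # (word, count) pairs
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> WriteToText("counts")
    )
```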
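One of the join operations is CoGroupByKey, which groups keyed PCollections on a shared key. A sketch with made-up data, assuming the Python SDK:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("ana", "ana@x.com"), ("bob", "bob@x.com")])
    phones = p | "Phones" >> beam.Create([("ana", "555-0100")])
    (
        {"emails": emails, "phones": phones}
        | beam.CoGroupByKey()
        # e.g. ('ana', {'emails': ['ana@x.com'], 'phones': ['555-0100']})
        | beam.Map(print)
    )
```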
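And the windowing idea for real-time data: attach event timestamps to elements, then window the stream before aggregating. A sketch with fixed 60-second windows, assuming the Python SDK; the events and timestamps are invented:

```python
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("click", 10), ("click", 70), ("click", 75)])
        # Attach the second field as the element's event timestamp (in seconds)
        | beam.MapTuple(lambda event, ts: window.TimestampedValue(event, ts))
        | beam.WindowInto(window.FixedWindows(60))  # 60-second fixed windows
        | beam.combiners.Count.PerElement()         # counts are now per window
        | beam.Map(print)  # ('click', 1) for the first window, ('click', 2) for the second
    )
```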
Sadly, these bullets are all that I remember about Beam from the three-day immersion I had two weeks ago. My next step will be to go over Beam again.