Apache Beam - Active Retrieval
It was two weeks ago that I spent three hours doing Beam Katas. For the three days before that, I had immersed myself in Beam, trying to understand it from JGarg’s course. In this post, I will spend the next 15 minutes doing active recall of all the stuff:
- Beam is a unified programming model for specifying data processing pipelines and data integration logic
- You can write the logic using Java, Python, or Scala (via Scio) and execute it on a runner such as Flink, Spark, or Cloud Dataflow
- There is a set of functions, very similar to map and reduce, that can be used to create data pipelines
- PCollections - the core data abstraction: immutable, distributed collections of elements
- Apache, from its humble Hadoop origins, has now grown into a massive ecosystem with YARN, MapReduce, Spark, ZooKeeper, HBase, Cassandra, Oozie, Flume, Pig, Hive, Flink, Storm, Kafka, and similar projects
- It is interesting to draw parallels with what is available on GCP. For each component of the Hadoop ecosystem, you will find a parallel component in the GCP world and in the AWS world
- A solid understanding of the Hadoop ecosystem will help one understand the various components of the AWS and GCP ecosystems
- Some of the components in AWS and GCP are managed services of Hadoop ecosystem elements (EMR and Dataproc, for example)
- The strange-looking pipe syntax that is used to define a Beam pipeline (a minimal sketch appears after this list)
- A PipelineOptions class is used to configure and instantiate a Beam pipeline
- One can run a Beam pipeline locally on a laptop using the DirectRunner
- A Beam pipeline is essentially a DAG and needs a runner to realize the operations
- Spark vs. Flink - the latter is a true stream processing engine, whereas Spark Streaming is, at heart, a batch processing engine that also does stream processing
- There are some methods that you need to override to come up with custom transformations (for a DoFn it is process(); see the sketch after this list)
- There are also PCollection and PTransform objects
- There are DoFn objects - what do they do? I do not recollect at all
- Almost everything can be done with DoFn objects
- There are lightweight classes such as FlatMap that can do similar stuff to what a DoFn can do
- Reading from a file and writing to a file are also PTransforms (used in the WordCount sketch below)
- There are some join operations that can be done, such as CoGroupByKey (sketched below)
- There is no built-in way to keep track of column metadata. Unlike Spark, which provides DataFrames, in Beam the user has to manage the structure of records manually
- WordCount is the “hello world” example in Beam (sketched after this list)
- You need to run Beam code on some engine - it can indeed be run on Spark, via the SparkRunner
- There are concepts around windowing that can be used to build data pipelines over real-time data (a windowing sketch appears after this list)
- Fantastic podcast - Frances Perry on Apache Beam and real-time data processing
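
To jog my memory, here is a minimal sketch of the pipe syntax, the PipelineOptions class, and running locally, assuming the Python SDK (`pip install apache-beam`); the step labels and values are my own:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Options configure the pipeline; the DirectRunner executes it locally on a laptop
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3, 4])  # yields a PCollection
        | "Square" >> beam.Map(lambda x: x * x)  # a PTransform
        | "Print" >> beam.Map(print)
    )
```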
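A DoFn holds per-element processing logic and is applied with ParDo; the lightweight Map/FlatMap wrappers cover the simple cases. A sketch, again assuming the Python SDK, with invented data:

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    def process(self, element):
        # process() may yield zero or more outputs per input element
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    lines = p | beam.Create(["the quick brown fox", "jumps over"])
    # Full DoFn, applied via ParDo
    words_a = lines | "ViaDoFn" >> beam.ParDo(SplitWords())
    # The equivalent lightweight form
    words_b = lines | "ViaFlatMap" >> beam.FlatMap(lambda line: line.split())
```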
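The WordCount “hello world”, which also shows that reading and writing files are themselves PTransforms. A sketch assuming the Python SDK; `input.txt` and the `counts` output prefix are placeholders:

```python
import re
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

with beam.Pipeline() as p:
    (
        p
        | "Read" >> ReadFromText("input.txt")
        | "ExtractWords" >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        | "Count" >> beam.combiners.Count.PerElement()  # (word, count) pairs
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> WriteToText("counts")
    )
```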
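One of the join operations is CoGroupByKey, which groups keyed PCollections on a shared key. A sketch with made-up data, assuming the Python SDK:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "Emails" >> beam.Create([("ana", "ana@x.com"), ("bob", "bob@x.com")])
    phones = p | "Phones" >> beam.Create([("ana", "555-0100")])
    (
        {"emails": emails, "phones": phones}
        | beam.CoGroupByKey()
        # e.g. ('ana', {'emails': ['ana@x.com'], 'phones': ['555-0100']})
        | beam.Map(print)
    )
```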
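And the windowing idea for real-time data: attach event timestamps to elements, then window the stream before aggregating. A sketch with fixed 60-second windows, assuming the Python SDK; the events and timestamps are invented:

```python
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("click", 10), ("click", 70), ("click", 75)])
        # Attach the second field as the element's event timestamp (in seconds)
        | beam.MapTuple(lambda event, ts: window.TimestampedValue(event, ts))
        | beam.WindowInto(window.FixedWindows(60))  # 60-second fixed windows
        | beam.combiners.Count.PerElement()         # counts are now per window
        | beam.Map(print)  # ('click', 1) for the first window, ('click', 2) for the second
    )
```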
Sadly, these bullets are all that I remember about Beam from the three-day immersion I had two weeks ago. My next step will be to go over Beam again.