Apache Beam - Learnings - APAC PyCon
The main points from the Beam talk are:
- Extra packages can be specified via pipeline options (SetupOptions / StandardOptions); see the pipeline-options sketch after this list
- PCollection: an iterable collection of rows (elements)
- Beam does automatic de-duplication of messages from Pub/Sub
- Windowing applies only when an aggregation function is called (see the windowing sketch after this list)
- Logging is enabled automatically
- Interesting example of combining barcode messages with metadata from a cloud document store
- Python 3 support is coming
- Use case (see the streaming + batch sketch after this list)
- Take the barcode data and pump it into a Pub/Sub topic
- Take the metadata from a cloud document store
- Write a pipeline that combines the barcode data and metadata and pushes it into BigQuery
- A simple way to combine streaming and batch data
- Kinesis and Kinesis Analytics (SQL) are other products that do similar things but are not open source
- Kafka support for Beam is coming
- You can write your own transformations by inheriting from the beam.DoFn class (see the DoFn sketch after this list)
- beam.io has an out-of-the-box transform to write data to BigQuery
- Spreading a dictionary out across many machines is done easily via Beam (see the dictionary sketch after this list)
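A minimal sketch of the pipeline-options point, assuming the Beam Python SDK; the package path is a placeholder, but SetupOptions.extra_packages and StandardOptions.streaming are the option views this bullet appears to refer to.

```python
from apache_beam.options.pipeline_options import (
    PipelineOptions, SetupOptions, StandardOptions)

# Build pipeline options and view them as specific option classes.
options = PipelineOptions()

# Ship extra Python packages to the workers (the tarball path is a placeholder).
options.view_as(SetupOptions).extra_packages = ['./dist/my_extra_lib-0.1.tar.gz']

# Switch the pipeline into streaming mode.
options.view_as(StandardOptions).streaming = True
```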
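A quick sketch of the "windows only take effect when you aggregate" point. The keys, counts, and the 60-second window size are made up; FixedWindows and CombinePerKey are standard Beam transforms.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (p
     | beam.Create([('barcode-1', 1, 10), ('barcode-2', 1, 20), ('barcode-1', 1, 75)])
     | beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))  # attach event timestamps
     | beam.WindowInto(FixedWindows(60))    # assign elements to 60-second windows
     | beam.CombinePerKey(sum)              # aggregation is computed per key, per window
     | beam.Map(print))
```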
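A sketch of the use case under stated assumptions: the topic, bucket, table name, and JSON layout are all made up, and the BigQuery table is assumed to exist. ReadFromPubSub, ReadFromText, a dict side input, and WriteToBigQuery are the standard Beam pieces for combining the streaming and batch sides.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def key_by_barcode(line):
    """Key a metadata record by its barcode (JSON layout is an assumption)."""
    record = json.loads(line)
    return record['barcode'], record

def enrich(scan, metadata):
    """Hypothetical join: attach product metadata to a barcode scan."""
    return {**scan, **metadata.get(scan['barcode'], {})}

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Batch side: product metadata exported from the document store (path is made up).
    metadata = (
        p
        | 'ReadMetadata' >> beam.io.ReadFromText('gs://example-bucket/product_metadata.jsonl')
        | 'KeyByBarcode' >> beam.Map(key_by_barcode))

    # Streaming side: barcode scans arriving on Pub/Sub (topic is made up).
    scans = (
        p
        | 'ReadScans' >> beam.io.ReadFromPubSub(topic='projects/example-project/topics/barcodes')
        | 'ParseScans' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8'))))

    # Join the streaming scans against the batch metadata (passed as a dict side input)
    # and push the enriched rows into BigQuery.
    (scans
     | 'Enrich' >> beam.Map(enrich, metadata=beam.pvalue.AsDict(metadata))
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
         'example-project:retail.enriched_scans',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```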
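A minimal custom DoFn sketch; the message format (a comma-separated "barcode,quantity" string) is an assumption, but subclassing beam.DoFn and yielding from process() is the standard pattern.

```python
import apache_beam as beam

class ParseBarcodeScan(beam.DoFn):
    """Hypothetical transform: parse a raw 'barcode,quantity' message into a dict."""
    def process(self, element):
        barcode, quantity = element.decode('utf-8').split(',')
        yield {'barcode': barcode, 'quantity': int(quantity)}

# Usage inside a pipeline: | 'Parse' >> beam.ParDo(ParseBarcodeScan())
```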
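One way to read the "spread a dictionary across many machines" note, assuming it means fanning the dict's entries out as elements of a PCollection; the dict contents are made up.

```python
import apache_beam as beam

lookup = {'123': 'widget', '456': 'gadget'}  # example in-memory dict (contents are made up)

with beam.Pipeline() as p:
    (p
     | beam.Create(list(lookup.items()))     # each (key, value) pair becomes one element
     | beam.Map(lambda kv: f'{kv[0]} -> {kv[1]}')
     | beam.Map(print))
```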
Learnt that one can easily write data to BigQuery once it is processed in Beam (minimal sketch below).
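A minimal sketch of writing to BigQuery with beam.io.WriteToBigQuery; the project, dataset, table, and schema are placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([{'barcode': '123', 'quantity': 2}])
     | beam.io.WriteToBigQuery(
         'example-project:retail.scans',              # placeholder table
         schema='barcode:STRING, quantity:INTEGER',   # placeholder schema
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```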