Apache Beam - Learnings - APAC PyCon
The main points from the Beam talk are:
- Extra packages can be specified via pipeline options (SetupOptions / StandardOptions); see the pipeline-options sketch after this list
- PCollection: an iterable collection of rows (elements)
- Beam does automatic de-duplication of messages from Pub/Sub
- Windowing applies only when an aggregation function is called (see the windowing sketch after this list)
- Logging is enabled automatically
- Interesting example of combining barcode messages with metadata from a cloud document store
- Python 3 support is coming
- Use case (see the streaming + batch sketch after this list)
- Take the barcode data and pump it into a Pub/Sub topic
- Take the metadata from a cloud document store
- Write a pipeline that combines the barcode data and metadata and pushes it into BigQuery
- A simple way to combine streaming and batch data
- Kinesis and Kinesis Analytics (SQL) are other products that do similar things but are not open source
- Kafka support for Beam is coming
- You can write your own transformations by inheriting from the beam.DoFn class (see the DoFn sketch after this list)
- beam.io has an out-of-the-box transform to write data to BigQuery
- Spreading a dictionary out across many machines is done easily via Beam (see the dictionary sketch after this list)
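A minimal sketch of the pipeline-options point, assuming the Beam Python SDK; the package path is a placeholder, but SetupOptions.extra_packages and StandardOptions.streaming are the option views this bullet appears to refer to.

```python
from apache_beam.options.pipeline_options import (
    PipelineOptions, SetupOptions, StandardOptions)

# Build pipeline options and view them as specific option classes.
options = PipelineOptions()

# Ship extra Python packages to the workers (the tarball path is a placeholder).
options.view_as(SetupOptions).extra_packages = ['./dist/my_extra_lib-0.1.tar.gz']

# Switch the pipeline into streaming mode.
options.view_as(StandardOptions).streaming = True
```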
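A quick sketch of the "windows only take effect when you aggregate" point. The keys, counts, and the 60-second window size are made up; FixedWindows and CombinePerKey are standard Beam transforms.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (p
     | beam.Create([('barcode-1', 1, 10), ('barcode-2', 1, 20), ('barcode-1', 1, 75)])
     | beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))  # attach event timestamps
     | beam.WindowInto(FixedWindows(60))    # assign elements to 60-second windows
     | beam.CombinePerKey(sum)              # aggregation is computed per key, per window
     | beam.Map(print))
```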
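A sketch of the use case under stated assumptions: the topic, bucket, table name, and JSON layout are all made up, and the BigQuery table is assumed to exist. ReadFromPubSub, ReadFromText, a dict side input, and WriteToBigQuery are the standard Beam pieces for combining the streaming and batch sides.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def key_by_barcode(line):
    """Key a metadata record by its barcode (JSON layout is an assumption)."""
    record = json.loads(line)
    return record['barcode'], record

def enrich(scan, metadata):
    """Hypothetical join: attach product metadata to a barcode scan."""
    return {**scan, **metadata.get(scan['barcode'], {})}

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Batch side: product metadata exported from the document store (path is made up).
    metadata = (
        p
        | 'ReadMetadata' >> beam.io.ReadFromText('gs://example-bucket/product_metadata.jsonl')
        | 'KeyByBarcode' >> beam.Map(key_by_barcode))

    # Streaming side: barcode scans arriving on Pub/Sub (topic is made up).
    scans = (
        p
        | 'ReadScans' >> beam.io.ReadFromPubSub(topic='projects/example-project/topics/barcodes')
        | 'ParseScans' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8'))))

    # Join the streaming scans against the batch metadata (passed as a dict side input)
    # and push the enriched rows into BigQuery.
    (scans
     | 'Enrich' >> beam.Map(enrich, metadata=beam.pvalue.AsDict(metadata))
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
         'example-project:retail.enriched_scans',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```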
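A minimal custom DoFn sketch; the message format (a comma-separated "barcode,quantity" string) is an assumption, but subclassing beam.DoFn and yielding from process() is the standard pattern.

```python
import apache_beam as beam

class ParseBarcodeScan(beam.DoFn):
    """Hypothetical transform: parse a raw 'barcode,quantity' message into a dict."""
    def process(self, element):
        barcode, quantity = element.decode('utf-8').split(',')
        yield {'barcode': barcode, 'quantity': int(quantity)}

# Usage inside a pipeline: | 'Parse' >> beam.ParDo(ParseBarcodeScan())
```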
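One way to read the "spread a dictionary across many machines" note, assuming it means fanning the dict's entries out as elements of a PCollection; the dict contents are made up.

```python
import apache_beam as beam

lookup = {'123': 'widget', '456': 'gadget'}  # example in-memory dict (contents are made up)

with beam.Pipeline() as p:
    (p
     | beam.Create(list(lookup.items()))     # each (key, value) pair becomes one element
     | beam.Map(lambda kv: f'{kv[0]} -> {kv[1]}')
     | beam.Map(print))
```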
Learnt that one can easily write data to BigQuery once it is processed in Beam (minimal sketch below).
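A minimal sketch of writing to BigQuery with beam.io.WriteToBigQuery; the project, dataset, table, and schema are placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([{'barcode': '123', 'quantity': 2}])
     | beam.io.WriteToBigQuery(
         'example-project:retail.scans',              # placeholder table
         schema='barcode:STRING, quantity:INTEGER',   # placeholder schema
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```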