The main points from the Beam session are:

  • Extra packages can be specified via pipeline options (see the sketches after this list)
  • Set up options via StandardOptions
  • PCollection: an iterable collection of elements (e.g. rows)
  • Beam does automatic de-duplication of messages from Pub/Sub
  • Windowing only takes effect when an aggregation is applied (sketch after this list)
  • Logging is enabled automatically
  • Interesting example of combining barcode messages with metadata from a cloud document store
  • Python 3 support is coming
  • Use case (see the pipeline sketch at the end of these notes)
    • Take the barcode data and pump it into a Pub/Sub topic
    • Take the metadata from a cloud document store
    • Write a pipeline that combines the barcode data and metadata and pushes the result into BigQuery
    • A simple way to combine streaming and batch data
  • Kinesis and Kinesis Analytics (SQL) are comparable products, but they are not open source
  • Kafka support for Beam is coming
  • You can write your own transforms by inheriting from the beam.DoFn class (shown in the pipeline sketch at the end)
  • beam.io has an out-of-the-box transform to write data to BigQuery
  • Spreading a dictionary across many machines (e.g. as a side input) is easy with Beam
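
A minimal sketch of the options points above, assuming the Python SDK's standard option classes (StandardOptions for streaming, SetupOptions for the extra-package flag); the package file name is a placeholder:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import (
        PipelineOptions, SetupOptions, StandardOptions)

    options = PipelineOptions()
    # Run in streaming mode, e.g. for a Pub/Sub source.
    options.view_as(StandardOptions).streaming = True
    # Ship an extra local package to the workers (placeholder file name).
    options.view_as(SetupOptions).extra_packages = ['./my_helpers-0.1.tar.gz']

    pipeline = beam.Pipeline(options=options)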
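
A minimal sketch of the windowing point, with made-up event timestamps: the 60-second windows only take effect at the aggregation step (the per-key count), not at the WindowInto itself:

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    with beam.Pipeline() as p:
        _ = (
            p
            | 'Events' >> beam.Create([1, 5, 65, 70])        # fake event times (seconds)
            | 'Stamp' >> beam.Map(lambda t: TimestampedValue('scan', t))
            | 'Window' >> beam.WindowInto(FixedWindows(60))  # 60-second fixed windows
            | 'Pair' >> beam.Map(lambda e: (e, 1))
            | 'Count' >> beam.CombinePerKey(sum)             # windows apply here
            | 'Print' >> beam.Map(print)                     # ('scan', 2) twice: [0,60) and [60,120)
        )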

Learnt that one can easily write data to BigQuery once it has been processed in Beam; the sketch below pulls the pieces together.
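
A hedged, end-to-end sketch of the barcode use case (topic, table, and field names are placeholders, and the metadata is created in memory rather than read from the document store):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.pvalue import AsDict
    from apache_beam.transforms.window import FixedWindows

    class EnrichBarcode(beam.DoFn):
        """Custom transform: join each barcode event with its metadata."""
        def process(self, element, metadata):
            barcode = element.decode('utf-8').strip()
            yield {'barcode': barcode,
                   'description': metadata.get(barcode, 'unknown')}

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        # Batch side: metadata keyed by barcode (in reality read from the
        # cloud document store; an in-memory dict here for brevity).
        metadata = p | 'Metadata' >> beam.Create(
            [('12345', 'widget'), ('67890', 'gadget')])

        # Streaming side: barcode scans arriving on Pub/Sub (placeholder topic).
        enriched = (
            p
            | 'ReadBarcodes' >> beam.io.ReadFromPubSub(
                topic='projects/my-project/topics/barcodes')
            | 'Window' >> beam.WindowInto(FixedWindows(60))
            | 'Enrich' >> beam.ParDo(EnrichBarcode(), metadata=AsDict(metadata))
        )

        # Out-of-the-box BigQuery sink (placeholder table and schema).
        enriched | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            'my-project:barcodes.scans',
            schema='barcode:STRING,description:STRING')

One way to read the "dictionary spread across many machines" point: a side input such as AsDict is materialised and made available to every worker running the DoFn.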