The following are my learnings for the day:

Link : https://www.youtube.com/watch?v=y4B3rLbXIAY&t=2s

  • Learnt about Avro format

  • Implemented Sliding Timewindows in Apache Beam

  • CAP - The tradeoff between consistency and availability (when the network partitions) is widely acknowledged in the industry

  • Cassandra - Use it when the database is too large for read replication alone to solve the scaling problem

  • Cassandra hashes the partition key to distribute load across the cluster

  • Cassandra replicates each partition across several nodes; unlike Kafka, there is no leader replica for a partition, and any replica can accept reads and writes

  • I still have not understood the internals of Cassandra - Is it a pure peer-to-peer database?

  • Kafka

    • Messaging
    • Partitioned Messages
    • Leader-Follower architecture
    • Kafka Connect API
    • Kafka SQL API called KSQL that can be used to query Kafka
    • Replace the old way of doing ETL with Stream at the center
    • NY Times stores all its data on Kafka since 1970
  • I think getting a connection to ERT is a wonderful way to explore real-time streaming analytics

    • Should make the most of it
  • Some of the projects that I can think of

    • Pump ERT data into Kafka and then use KSQL to fire SQL queries
    • Pump ERT data to Pub/Sub and then move the data to GCP
  • Tim Berglund: Kafka as a Platform: the Ecosystem from the Ground Up

    • Kafka is a log of events
    • In Kafka - Topic is a persistent log of events - Events are key value pairs
    • Logs have strict ordering
    • Constant time reads and writes in Kafka
    • What separates Kafka from other messaging systems is the idea of partitions
    • Split a topic into partitions
    • Run the key through a hashing algorithm; the output decides which partition the data goes to
    • Key selection becomes a data modeling concern
    • Ordering is within a partition only
    • Replication - Key part of Kafka
    • Producers
    • Consumers
    • Reads and writes go to the leader replica of a partition
    • Schema registry
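
The key-to-partition routing from the talk can be sketched in a few lines of plain Python. This is only an illustration, not Kafka's actual partitioner: the function names `partition_for` and `route` are made up for this note, and MD5 is used as a stand-in for whatever hash the real client applies. The point it shows is that the same key always lands in the same partition, so ordering holds per key but not across partitions.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (MD5 is a stand-in here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to the log of the partition its key hashes to."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
partitions = route(events, num_partitions=3)

# All events for one key sit in one partition, in the order they were produced.
user1_log = partitions[partition_for("user-1", 3)]
user1_values = [value for key, value in user1_log if key == "user-1"]
```

This is why key selection is a data modeling concern: events that must stay ordered relative to each other need to share a key.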
  • Job tracker and task tracker in the context of Hadoop ecosystem

  • Name Node and Data Node

  • You should spend at least three times as much time on output as you do on input

  • Consistent Hashing - That’s how Cassandra works and makes it a distributed database
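
To make the consistent hashing idea concrete for myself, here is a minimal sketch of a hash ring in plain Python. It is not how Cassandra is implemented internally: the `HashRing` class, the node names, the use of MD5, and the choice of 64 virtual nodes per node are all assumptions made for illustration. What it shows is the core property: when a node joins, only the keys that now fall on the new node's tokens move, and everything else stays where it was.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit token for a string (MD5 is an arbitrary choice for the sketch)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted token positions on the ring
        self._owner = {}   # token -> node that owns it
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed at several positions to even out the load.
        for i in range(self.vnodes):
            token = _hash(f"{node}#{i}")
            self._owner[token] = node
            bisect.insort(self._tokens, token)

    def node_for(self, key):
        # Walk clockwise to the first token after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._tokens, _hash(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

With a plain `hash(key) % number_of_nodes` scheme, going from three nodes to four would remap most keys; on the ring, every key in `moved` ends up on `node-d` and the rest are untouched.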