My Learnings - Kafka as a Platform
The following are my learnings for the day:
Link : https://www.youtube.com/watch?v=y4B3rLbXIAY&t=2s
-
Learnt about Avro format
-
Implemented sliding time windows in Apache Beam
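To remind myself what a sliding window actually does, here is a minimal plain-Python sketch (not Beam code) of how one timestamp lands in several overlapping windows. The function name `sliding_windows` and the sample values are made up for illustration. In the Beam Python SDK the equivalent is, as far as I understand, `beam.WindowInto(beam.window.SlidingWindows(size, period))`.

```python
from collections import Counter


def sliding_windows(timestamp: int, size: int, period: int):
    """Return the [start, end) windows of length `size`, starting every
    `period` seconds, that contain `timestamp`."""
    # Latest window start that is <= timestamp (starts are multiples of period).
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# Hypothetical event timestamps in seconds; 60 s windows sliding every 30 s.
events = [5, 20, 65, 70, 95]
counts = Counter()
for ts in events:
    for window in sliding_windows(ts, size=60, period=30):
        counts[window] += 1

for window in sorted(counts):
    print(window, counts[window])
```

With `size` twice the `period`, every event is counted in exactly two windows, which is the main difference from fixed (tumbling) windows.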
-
CAP - The tradeoff between consistency and availability (when a network partition happens) is pretty much acknowledged by many in the industry
-
Cassandra - Use it when you need to work with a large database whose load cannot be handled merely by adding read replicas
-
Cassandra uses a hashing mechanism to distribute load across the cluster
-
Cassandra replicates each partition across several nodes but, unlike Kafka, has no leader-follower roles for a partition: any replica can accept reads and writes
-
I have still not understood the internals of Cassandra. Is it a pure peer-to-peer database?
-
Kafka
- Messaging
- Partitioned Messages
- Leader-Follower architecture
- Kafka Connect API
- KSQL, a SQL API that can be used to query data in Kafka
- Replace the old way of doing ETL with a design that puts the stream at the center
- NY Times stores all of its published content, going back to 1851, in Kafka
-
I think getting a connection to ERT is a wonderful way to explore real-time streaming analytics
- Should make the most of it
-
Some of the projects that I can think of
- Pump ERT data into Kafka and then use KSQL to fire SQL queries
- Pump ERT data into Pub/Sub and then move the data to GCP
-
Tim Berglund: Kafka as a Platform: the Ecosystem from the Ground Up
- Kafka is a log of events
- In Kafka, a topic is a persistent log of events; events are key-value pairs
- Logs have strict ordering
- Constant time reads and writes in Kafka
- What separates Kafka from other messaging systems is the idea of partitions
- Split the topic into partitions
- Run the key through a hashing algorithm; the output decides which partition the event goes to
- Key selection becomes a data modeling concern
- Ordering is within a partition only
- Replication - Key part of Kafka
- Producers
- Consumers
- Reads and writes go to the leader replica of a partition
- Schema registry
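
The partitioning notes above (hash the key, pick a partition, ordering only within a partition) can be sketched in plain Python. This is a toy model, not the Kafka client: `Topic`, `partition_for` and the partition count are invented for illustration, and MD5 stands in for the hash (Kafka's default Java partitioner uses murmur2).

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class Topic:
    """Toy topic: one append-only list (log) per partition."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def produce(self, key: str, value: str):
        """Append the event to its partition and return (partition, offset)."""
        p = partition_for(key, self.num_partitions)
        log = self.partitions[p]
        log.append((key, value))
        return p, len(log) - 1


topic = Topic()
for key, value in [("user-1", "login"), ("user-2", "login"),
                   ("user-1", "click"), ("user-1", "logout")]:
    partition, offset = topic.produce(key, value)
    print(f"{key}:{value} -> partition {partition}, offset {offset}")

# All events for one key sit in one partition, in the order they were produced.
user1_partition = partition_for("user-1")
user1_events = [v for k, v in topic.partitions[user1_partition] if k == "user-1"]
print(user1_events)
```

This is why key selection is a data modeling concern: events that must be read in order need to share a key, since nothing orders events across partitions.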
-
JobTracker and TaskTracker in the context of the Hadoop ecosystem
-
NameNode and DataNode
-
You should spend at least three times as much time on output as you do on input
-
Consistent hashing - that's how Cassandra spreads data across nodes, and it is what makes it work as a distributed database
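
A minimal sketch of a consistent hash ring, to make the idea concrete. This is not Cassandra's implementation: `HashRing`, the node names and the virtual-node count are assumptions for illustration, and MD5 stands in for the hash (Cassandra's default partitioner is Murmur3-based).

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Place a string on the ring as a 64-bit integer (MD5 as a stand-in)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        self.vnodes = vnodes
        self._tokens = []  # sorted token positions on the ring
        self._owner = {}   # token -> node that owns it
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Give the node `vnodes` token positions on the ring."""
        for i in range(self.vnodes):
            token = ring_hash(f"{node}#{i}")
            self._owner[token] = node
            bisect.insort(self._tokens, token)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next token, wrapping around."""
        idx = bisect.bisect_right(self._tokens, ring_hash(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

The point of the ring is that adding a node only moves the keys that the new node takes over; with plain `hash(key) % number_of_nodes`, most keys would change owner.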