In distributed computing, identical workloads or services run on different computers but present themselves to the developer as a single system. Why would that be useful?
Well, say that you go to the library and want to check out a book, like I <3 Logs. Unfortunately, your local library is hosting a fundraising event and is closed to patrons! Luckily, you can head over to a branch on the other side of town and check out a copy of the same book. Distributed computing solves problems in a similar way: if one computer has a problem, other computers in the group can pick up the slack. A distributed computer system can also easily handle more load by transparently integrating more computers into the system.
Apache Kafka® is more than a data streaming technology; it’s also an example of a distributed computing system. Handy! Let’s go into the details.
Within the context of Kafka, a cluster is a group of servers working together for three reasons: speed (low latency), durability, and scalability. Several data streams can be processed by separate servers, which decreases the latency of data delivery. Data is replicated across multiple servers, so if one fails, another server has a backup, which gives the system durability and availability. Kafka also balances the load across the servers in the cluster, which provides scalability. But what are the mechanics?
Kafka brokers are servers with special jobs to do: managing the load balancing, replication, and stream decoupling within the Kafka cluster. How do they get these jobs done? Well, first of all, in order to connect to a Kafka cluster, a client contacts a bootstrap server (or a few). Bootstrap servers are the cluster’s entry points: they return metadata describing every broker, so a client only needs to know a handful of addresses up front. From there, the brokers balance the load and handle replication, and those two features are key to Kafka’s speed, scalability, and durability.
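To make the bootstrapping step concrete, here’s a minimal Java producer sketch. The broker hostnames and the orders topic are hypothetical stand-ins, not values from this article:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BootstrapExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The client only needs one or two reachable brokers to bootstrap;
        // it discovers the rest of the cluster from the metadata they return.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // hypothetical hosts
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "created")); // "orders" is illustrative
        }
    }
}
```

Notice that the client never needs a complete server list; listing two or three brokers simply gives it more than one chance to reach the cluster.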
How does Kafka balance load, exactly? This is where Kafka brokers and partitions come into play. In Kafka, data (an individual piece of data is called an event) is stored within logical groupings called topics. Those topics are split into partitions, and the underlying structure of each partition is a log.
Each of those logs can live on a different broker or server in the cluster! This balances the work of writing and reading from a topic across the cluster. It’s also part of what makes Kafka fast and uniquely scalable: in any distributed system, whole topics could be assigned to different brokers, but partitions add another layer of scalability.
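As a sketch of how a partitioned topic comes to be, you might create one with Kafka’s Java Admin client. The topic name, partition count, and replication factor below are illustrative choices, not recommendations:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical address
        try (Admin admin = Admin.create(props)) {
            // Six partitions spread the topic's read/write work across brokers;
            // a replication factor of 3 keeps three copies of each partition.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```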
In order to keep data durable across a distributed system, it is replicated across the cluster. For each partition, one broker acts as the leader, and the brokers holding the other copies act as followers. Followers stay in sync with the leader in order to keep up to date on the newest data. For example, if Topic A has a replication factor of two, the leader holds one copy of each partition and a follower holds the second copy on another server.
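One way to see this leader/follower layout for yourself is to describe a topic with the Admin client. This sketch assumes the hypothetical orders topic from above; each partition reports its current leader plus its replica set:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeReplicasExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical address
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            // Each partition reports its leader, the full replica set, and the
            // in-sync replicas (ISR): the followers that are fully caught up.
            desc.partitions().forEach(p ->
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                    p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```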
Every time we’re introduced to a new type of distributed computing, we should ask, “What happens in the case of failure?” In this case, we might wonder what happens when a follower fails, as well as what happens when a leader fails.
We might think of leaders and followers as volleyball players on a team. If the team captain or one of the players gets tired (fails), who decides which player takes over that role? That’s the job of the coach, who is external to the team, has information about each player, and makes decisions based on that information. In Kafka, KRaft plays the role of the coach.
Note: KRaft is a metadata management system (introduced by KIP-500) that uses event sourcing to manage the data in the cluster. It simplifies Kafka’s architecture by managing the metadata itself, removing the need for an external system. Before Kafka 3.4, ZooKeeper was the only available metadata source. In a future release (estimated to be 4.0), ZooKeeper will be removed in favor of KRaft. Read this introductory blog post to learn more about the change, or dive deeper with Jack Vanlightly’s resource on KRaft and Raft: Why Apache Kafka Doesn’t Need fsync to Be Safe
OK, so what happens when a follower replica fails? In Kafka, the failed replica has a chance of rejoining the group by sending the leader a request to rejoin the cluster. Of course, once it rejoins, its data may be out of date, so the follower catches up by replicating the writes stored on the leader; only once it has caught up does it count as in sync again.
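That in-sync replica set matters to clients, too. As a hedged example, a producer configured with acks=all only receives an acknowledgment once every in-sync replica has stored the write, so a failed or lagging follower can’t silently lose acknowledged data (the topic name is again hypothetical):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all: the leader acknowledges a write only after every in-sync
        // replica has stored it, trading a little latency for durability.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "shipped"));
        }
    }
}
```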
And what happens when a leader fails?
In this case, KRaft triggers what’s called a leader election, in which a new leader for the group is chosen. In KRaft, a broker that’s responsible for metadata is called a controller, and there’s usually more than one of these in a cluster. When a controller recognizes that the leader has failed, it triggers a voting process in which it proposes itself as the new leader. Once the state in each broker is updated to reflect the result, the controller is confirmed as the new leader. You can learn more about this process in the Apache Kafka Control Plane module of Confluent Developer’s Kafka Internals course.
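Failover elections like this happen automatically, but for illustration, the Java Admin client can also ask the controller to re-run a partition leader election, for example, to restore the preferred leader after a failed broker comes back. This sketch assumes the hypothetical orders topic from earlier:

```java
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class PreferredLeaderElectionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical address
        try (Admin admin = Admin.create(props)) {
            // Ask the controller to re-elect the preferred leader for
            // partition 0 of "orders" (e.g., after a failed broker rejoins).
            admin.electLeaders(ElectionType.PREFERRED,
                    Set.of(new TopicPartition("orders", 0)))
                 .partitions().get();
        }
    }
}
```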
Let’s talk for a moment about data centers. A data center is a physical place for storing data. (You can tour a Microsoft data center in this video, or a Google one here!) When a company or person houses servers on their own property, that’s called on-premises. This has its benefits, but it can get expensive and requires a separate set of firmware and hardware skills to maintain the servers, so many developers choose to run their software in the cloud instead. This means they’re renting space on other people’s servers.
With Apache Kafka, you have both options available to you: you can run open source Kafka on your own servers, or you can use Confluent Cloud.
Hopefully, this introduction to Apache Kafka has been helpful to you! If you’d like to learn more about how Kafka implements a distributed computing system, I highly recommend these resources from my colleagues and from the open-source community:
Learn Kafka Architecture: Gain more confidence with clusters in this course from Confluent Developer
Making Kafka Cloud Native: A talk by Jay Kreps on the work involved in making a cloud offering for Kafka
A Kafkaesque Raft Protocol: A talk by Jason Gustafson on the benefits of KRaft as a metadata system internal to Kafka
How Does Kafka Work In a Nutshell?: Details about the Kafka network from the official docs