Can you explain leader and follower replicas and how leader election works in Kafka?
In Kafka, each partition has one leader replica and zero or more follower replicas. Producers and consumers talk to the leader only. Followers continuously fetch data from the leader and try to stay in sync. Kafka tracks the in-sync replicas, or ISR, which are followers that are caught up enough to be considered safe for failover.
Leader election happens when the current leader dies or becomes unavailable. The controller picks a new leader from the ISR, not from any out-of-date replica, so Kafka avoids data loss in normal cases. If unclean.leader.election.enable=true, Kafka can elect a non-ISR replica, which improves availability but risks losing committed data. A good answer also mentions replication factor, ISR, and acks=all, because they all work together to determine durability and failover behavior.
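The election preference described above can be sketched as a toy model (illustrative only, not broker code): prefer a live ISR member, and only fall back to an out-of-sync replica when unclean election is allowed.

```python
# Toy model of Kafka leader election: the controller prefers a replica from
# the ISR; unclean election falls back to any live replica, risking data loss.

def elect_leader(replicas, isr, live, unclean_ok=False):
    """Return a new leader broker id, or None if the partition goes offline."""
    # Prefer an in-sync replica that is still alive: no committed data is lost.
    for broker in replicas:
        if broker in isr and broker in live:
            return broker
    # Optionally fall back to an out-of-sync replica: availability over durability.
    if unclean_ok:
        for broker in replicas:
            if broker in live:
                return broker
    return None  # partition stays unavailable until an ISR member returns

# Broker 1 (old leader) died; broker 2 is in the ISR, broker 3 lags behind.
print(elect_leader([1, 2, 3], isr={1, 2}, live={2, 3}))             # 2
print(elect_leader([1, 2, 3], isr={1}, live={3}))                   # None
print(elect_leader([1, 2, 3], isr={1}, live={3}, unclean_ok=True))  # 3
```

The last call shows the unclean.leader.election.enable tradeoff: the partition comes back, but broker 3 may be missing committed records.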
How does Kafka achieve high throughput compared with more traditional messaging systems?
Kafka gets high throughput by optimizing for sequential disk and batched network I/O, instead of treating every message like a tiny independent transaction.
It writes to an append-only log, so disk access is mostly sequential, which is much faster than random writes.
Producers batch records and compress them, cutting syscall, network, and storage overhead.
Consumers read sequentially too, often in large chunks, which is cache-friendly and efficient.
It relies heavily on the OS page cache and uses zero-copy transfer with sendfile, reducing extra memory copies.
Brokers stay simple: they do not track per-message acknowledgments for each consumer, because consumers manage their own offsets.
Partitioning spreads load across brokers, disks, and consumers, so throughput scales horizontally.
How do replication factor and in-sync replicas affect durability and availability in Kafka?
They control the tradeoff between not losing data and keeping writes available.
Replication factor is how many copies of a partition exist. Higher RF improves durability and failover, but costs more storage and network.
In-sync replicas (ISR) are the replicas fully caught up with the leader; the leader itself is always part of the ISR. Kafka only considers ISR members safe for leader election.
With acks=all, the leader waits for all ISR members to confirm before acknowledging. That gives stronger durability.
min.insync.replicas sets the minimum ISR count required to accept writes. If ISR drops below that, producers using acks=all get errors instead of risking weaker durability.
Example, RF=3 and min.insync.replicas=2 lets Kafka survive one broker loss without losing committed data, while still accepting writes.
If you set RF=3 but min.insync.replicas=1, availability is higher, but durability is weaker because one leader ack can be enough.
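The interaction above reduces to a simple broker-side check, sketched here for illustration (real brokers return a NotEnoughReplicas error rather than a boolean):

```python
# Sketch of the durability guardrail: acks=all writes are rejected when the
# current ISR size falls below min.insync.replicas.

def accept_write(isr_size, min_insync, acks):
    if acks == "all":
        return isr_size >= min_insync  # False models a NotEnoughReplicas error
    return True  # acks=0 and acks=1 do not consult min.insync.replicas

# RF=3, min.insync.replicas=2: survives one broker loss, rejects after two.
print(accept_write(isr_size=3, min_insync=2, acks="all"))  # True
print(accept_write(isr_size=2, min_insync=2, acks="all"))  # True
print(accept_write(isr_size=1, min_insync=2, acks="all"))  # False
print(accept_write(isr_size=1, min_insync=2, acks="1"))    # True, weaker durability
```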
What is the difference between at-most-once, at-least-once, and exactly-once delivery semantics in Kafka?
Here's the clean way to explain it:
At-most-once: messages may be lost, but never duplicated. This happens if you commit offsets before processing, or if a producer sends without retries.
At-least-once: messages are not lost, but they can be processed more than once. Typical setup is producer retries plus consumer commits offsets after successful processing.
Exactly-once: each message is processed once, with no loss and no duplicates, end to end within Kafka-supported workflows.
In Kafka, exactly-once usually means idempotent producers plus transactions, and consumers using read_committed. Important nuance, exactly-once is strongest when you stay inside Kafka, like consume-process-produce pipelines with Kafka Streams. If your consumer writes to an external DB, you still need idempotency or transactional coordination there.
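The at-most-once versus at-least-once distinction comes down to where the offset commit sits relative to processing. A tiny crash simulation (illustrative, no real consumer) makes it concrete:

```python
# Commit-before-processing loses a message on crash (at-most-once);
# commit-after-processing replays it instead (at-least-once).

def run(messages, crash_at, commit_before):
    committed, processed = 0, []
    for i, msg in enumerate(messages):
        if commit_before:
            committed = i + 1          # commit first: a crash now loses this message
        if i == crash_at:
            break                      # simulated crash before processing finishes
        processed.append(msg)
        if not commit_before:
            committed = i + 1          # commit after: a crash causes a replay instead
    replayed = messages[committed:]    # restart resumes from the committed offset
    return processed, replayed

msgs = ["a", "b", "c"]
print(run(msgs, crash_at=1, commit_before=True))   # (['a'], ['c']) -> 'b' lost
print(run(msgs, crash_at=1, commit_before=False))  # (['a'], ['b', 'c']) -> 'b' redelivered
```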
How does transactional messaging in Kafka work, and what types of problems does it solve?
Kafka transactions let a producer write to multiple partitions, and even commit consumer offsets, as one atomic unit. The producer gets a transactional.id, initializes transactions, calls beginTransaction(), sends records, then either commitTransaction() or abortTransaction(). Kafka uses producer IDs and sequence numbers for idempotence, plus a transaction coordinator and special transaction markers so consumers with isolation.level=read_committed only see committed data.
It solves a few big problems:
- Exactly-once processing in stream apps, especially Kafka Streams and consume-transform-produce loops.
- Avoiding partial writes, where some partitions get data and others do not.
- Preventing duplicates caused by retries or producer restarts.
- Atomically writing output records and consumer offsets, so you do not reprocess after a crash.
- Hiding aborted records from downstream consumers, which keeps results consistent.
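The last point, hiding aborted records, can be pictured as a filter over the log. This toy model (not the real marker format, which is a control batch per partition) shows what isolation.level=read_committed effectively does:

```python
# Toy view of read_committed: consumers only see records whose transaction
# ended with a COMMIT marker; records from aborted transactions are hidden.

def read_committed(log):
    """log: list of (txn_id, payload) records plus (txn_id, 'COMMIT'/'ABORT') markers."""
    outcome = {t: m for t, m in log if m in ("COMMIT", "ABORT")}
    return [p for t, p in log
            if p not in ("COMMIT", "ABORT") and outcome.get(t) == "COMMIT"]

log = [("tx1", "order-1"), ("tx2", "order-2"),
       ("tx1", "COMMIT"), ("tx2", "ABORT")]
print(read_committed(log))  # ['order-1'] -- tx2's record is hidden
```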
What is the difference between offsets managed automatically by consumers and offsets committed manually?
The difference is really about control versus convenience.
With automatic offset commits, the consumer periodically commits the latest polled offsets in the background, usually based on enable.auto.commit=true.
It is simple to use, but risky, because offsets may be committed before message processing actually finishes.
With manual commits, your code calls commitSync() or commitAsync() after processing succeeds.
That gives you tighter delivery guarantees, especially if you want at-least-once behavior.
The tradeoff is more application logic and responsibility for handling failures, retries, and commit timing.
In practice, auto-commit is fine for simple, low-risk consumers. Manual commit is better when processing is important and you need to avoid losing messages or marking them done too early.
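The manual-commit pattern is a poll/process/commit loop. A stubbed sketch (no real broker; FakeConsumer stands in for a Kafka consumer, and commit() stands in for commitSync()) shows the ordering that gives at-least-once behavior:

```python
# At-least-once consume loop shape: poll, process, then commit only after
# processing succeeds, so a crash replays work instead of losing it.

class FakeConsumer:
    def __init__(self, records):
        self.records, self.pos, self.committed = records, 0, 0
    def poll(self):
        if self.pos < len(self.records):
            rec = self.records[self.pos]
            self.pos += 1
            return rec
        return None
    def commit(self):
        self.committed = self.pos  # real code would call commitSync() here

consumer = FakeConsumer(["evt-1", "evt-2"])
handled = []
while (rec := consumer.poll()) is not None:
    handled.append(rec.upper())  # the actual business processing
    consumer.commit()            # commit last: failures replay, never lose
print(handled, consumer.committed)  # ['EVT-1', 'EVT-2'] 2
```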
What happens when a broker fails, and how does Kafka recover from that failure?
When a broker fails, clients notice first through timeouts and metadata refreshes, and the cluster reacts based on partition leadership and replication.
Each partition has a leader and follower replicas, tracked by the controller.
If the failed broker was a leader, the controller elects a new leader from the in-sync replicas, ISR.
Producers and consumers refresh metadata, then automatically talk to the new leader.
If the broker only hosted follower replicas, traffic usually continues with no visible impact.
When the broker comes back, it rejoins, catches up by replicating missing data, and can re-enter the ISR once fully synced.
If replication factor is too low, or no ISR is available, that partition can become unavailable temporarily. Recovery is fast when replication is healthy and min.insync.replicas is configured properly.
How do idempotent producers work, and when would you enable them?
Idempotent producers prevent duplicates caused by retries. Kafka assigns each producer a Producer ID, or PID, and every record gets a per-partition sequence number. The broker tracks the latest sequence it has accepted for that producer and partition, so if the producer retries the same send, the broker recognizes the duplicate and drops it instead of appending it again.
I'd enable it basically anytime I care about avoiding duplicate writes from transient failures, which is most production cases. It is especially useful with acks=all, retries, and unstable networks or broker failover. It gives per-partition exactly-once write semantics, but not end-to-end exactly-once by itself. For multi-partition atomicity or consume-process-produce exactly-once, you pair it with transactions. In practice, I'd leave enable.idempotence=true on by default unless I had a very specific reason not to.
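The dedup mechanism described above can be sketched as broker-side bookkeeping (simplified: real brokers track a window of recent batches, not just one sequence number):

```python
# Illustrative dedup: track the last sequence accepted per (producer id,
# partition); a retried send with an old sequence is dropped, not appended.

def append(log, last_seq, pid, partition, seq, record):
    key = (pid, partition)
    if seq <= last_seq.get(key, -1):
        return "duplicate-dropped"
    last_seq[key] = seq
    log.append(record)
    return "appended"

log, last_seq = [], {}
print(append(log, last_seq, pid=7, partition=0, seq=0, record="a"))  # appended
print(append(log, last_seq, pid=7, partition=0, seq=0, record="a"))  # duplicate-dropped (retry)
print(append(log, last_seq, pid=7, partition=0, seq=1, record="b"))  # appended
print(log)  # ['a', 'b'] -- no duplicate despite the retry
```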
What is the difference between eager rebalancing and cooperative rebalancing?
The core difference is how disruptive the rebalance is.
Eager rebalancing stops all consumers in the group, revokes all partitions, then reassigns everything at once.
Cooperative rebalancing is incremental, consumers keep partitions they can safely keep, and only move the partitions that actually need to change.
Eager is simpler but causes a full stop-the-world pause, which can increase latency and duplicate processing during rebalances.
Cooperative reduces downtime and churn, so it is better for large groups or stateful consumers like Kafka Streams.
Cooperative needs assignors that support it, like CooperativeStickyAssignor, and usually takes multiple rebalance rounds to converge.
So in an interview, I'd say eager is all-at-once and disruptive, while cooperative is gradual and less disruptive.
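The difference in disruption is easy to quantify with a toy assignment diff (simplified; real assignors also weigh stickiness and balance):

```python
# Eager revokes everything and reassigns; cooperative only moves the delta.

def moved_partitions(old, new, eager):
    if eager:
        return sum(len(parts) for parts in new.values())  # all revoked, all reassigned
    return sum(len(set(new[c]) - set(old.get(c, []))) for c in new)

old = {"c1": [0, 1, 2, 3]}          # c1 owned everything
new = {"c1": [0, 1], "c2": [2, 3]}  # c2 joins and takes half
print(moved_partitions(old, new, eager=True))   # 4 -- every partition pauses
print(moved_partitions(old, new, eager=False))  # 2 -- only 2 and 3 change hands
```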
How would you choose the right number of partitions for a topic in a production system?
I'd size partitions from three angles: throughput, parallelism, and operational cost.
Start with target throughput, producers write rate and consumers read rate. One partition has a practical ceiling, so estimate MB/s per partition from benchmark data, not theory.
Match consumer parallelism. In a consumer group, max active consumers is basically the partition count, so choose enough partitions for peak concurrency plus headroom.
Plan for future growth, because increasing partitions later is easy technically but can break key ordering and trigger data skew.
Avoid over-partitioning. Too many partitions increase file handles, metadata, controller load, rebalance time, and replication overhead.
Use key distribution to validate. If keys are skewed, more partitions will not fix hotspot partitions without a better partitioning strategy.
In practice, I benchmark, add 30 to 50 percent headroom, then check broker limits and consumer group size.
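That back-of-envelope process can be written down directly (the numbers below are assumed examples, not recommendations; per-partition throughput must come from your own benchmarks):

```python
# Partition sizing sketch: take the max of throughput-driven and
# parallelism-driven needs, then add headroom for growth.

import math

def partition_count(target_mbps, per_partition_mbps, peak_consumers, headroom=0.5):
    for_throughput = math.ceil(target_mbps / per_partition_mbps)
    base = max(for_throughput, peak_consumers)
    return math.ceil(base * (1 + headroom))

# e.g. 200 MB/s target, ~25 MB/s per partition measured, 6 peak consumers
print(partition_count(200, 25, 6))  # 12
```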
What are the tradeoffs between having too few partitions and too many partitions?
It is mostly a balance between parallelism and overhead.
Too few partitions limits consumer parallelism, so one slow consumer can bottleneck throughput.
It also reduces flexibility, because you cannot scale a consumer group past the partition count.
Too many partitions increases broker overhead, metadata, file handles, memory use, and controller work.
It can make leader elections, rebalances, and recovery slower, especially in large clusters.
Producers may see more batching inefficiency, and consumers may spend more time coordinating than processing.
In practice, size partitions to match expected throughput, retention, and consumer parallelism, not just peak scale. I usually start with enough partitions for near-term growth, then avoid extreme counts unless there is a clear throughput need.
How would you explain Apache Kafka to a team that currently uses a traditional message queue and is unsure why Kafka would be a better fit?
I'd frame Kafka as an event streaming platform, not just a message queue. A traditional queue is usually about one producer, one consumer, and once a message is consumed it is gone. Kafka is built for high throughput, durable logs, and multiple consumers reading the same data independently.
Kafka stores events on disk for a retention period, so you can replay data.
Multiple consumer groups can read the same topic without interfering with each other.
It scales horizontally really well with partitions and brokers.
It is great when data needs to feed several systems, like analytics, services, and monitoring.
It is less ideal if you only need simple task distribution with strict per-message deletion.
So if your use case is just background jobs, a queue may be enough. If you want a central event backbone, Kafka is usually the better fit.
What are the core components of Kafka, and how do topics, partitions, brokers, producers, consumers, and consumer groups work together?
Kafka is basically a distributed commit log, and these pieces fit together to move data reliably and at scale.
Topic: a named stream of records, like orders or payments.
Partition: a topic is split into partitions, each an ordered, append-only log for parallelism and scale.
Broker: a Kafka server that stores partitions and serves reads and writes.
Producer: writes records to a topic, often choosing a partition by key so related events stay ordered.
Consumer: reads records from partitions and tracks its position with offsets.
Consumer group: multiple consumers share the work; each partition in a group is read by only one consumer at a time.
Put together, producers send events to topic partitions on brokers, and consumers in a group read those partitions in parallel. Kafka replicates partitions across brokers for fault tolerance.
What is the role of partitions in Kafka, and how do they affect scalability, ordering, and parallelism?
Partitions are Kafka's unit of storage and work distribution. A topic is split into partitions, and each partition is an ordered, append-only log. That design is what lets Kafka scale horizontally.
Scalability: partitions spread data across brokers, so storage and throughput grow as you add brokers and partitions.
Ordering: Kafka only guarantees order within a single partition, not across the whole topic.
Parallelism: consumers in the same consumer group can read different partitions at the same time, which increases throughput.
Limitation: one partition can be consumed by only one consumer in a group at a time, so partition count caps consumer parallelism.
Tradeoff: more partitions improve throughput, but add overhead for metadata, rebalancing, and file handles.
A common pattern is choosing a key, like customerId, so all events for that key land in the same partition and keep order.
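The keyed-partitioning pattern is just a deterministic hash. Kafka's default partitioner uses murmur2; this sketch substitutes a stable stdlib hash to show the idea, not the exact algorithm:

```python
# Same key -> same partition, so per-key ordering holds within that partition.

import zlib

def partition_for(key, num_partitions):
    # stand-in for Kafka's murmur2 hash; the point is determinism
    return zlib.crc32(key.encode()) % num_partitions

p1 = partition_for("customer-42", 12)
p2 = partition_for("customer-42", 12)
print(p1 == p2)  # True -- every event for customer-42 lands together
# Caveat: changing num_partitions remaps keys, so new events for the same
# customer may land on a different partition than older ones.
```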
How do you handle offset management in applications where message processing must be reliable and recoverable?
I treat offsets as part of the processing contract, not just a Kafka detail. The rule is simple, only commit when the work is truly done, so a restart replays safely instead of losing data.
Disable auto commit for anything important, use manual commits after successful processing.
Prefer idempotent processing, so if a record is replayed, the side effect is safe.
If writing to a database, store the business result and the consumed offset in the same transaction when possible.
For Kafka-to-Kafka flows, use transactions for exactly-once semantics, producer writes and offset commits together.
On failures, do not commit, send poison messages to a DLQ after retries, and alert on repeated errors.
Keep ordering in mind, commit per partition progress, not some global checkpoint.
During rebalances, use a consumer rebalance listener to flush in-flight work before partitions are revoked.
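The "business result and offset in the same transaction" point deserves a concrete shape. This sketch uses sqlite3 as a stand-in for the real database; table names and the record shape are made up for illustration:

```python
# Store the processing result and the consumed offset atomically: both rows
# commit, or neither does, so a crash can never mark unfinished work as done.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (id TEXT PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE offsets (tp TEXT PRIMARY KEY, next_offset INTEGER)")

def process(record, tp, offset):
    with db:  # sqlite3 context manager = one transaction
        db.execute("INSERT OR REPLACE INTO results VALUES (?, ?)",
                   (record["id"], record["total"]))
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                   (tp, offset + 1))

process({"id": "order-9", "total": 42.0}, tp="orders-0", offset=17)
# On restart, seek() to the DB-stored offset instead of trusting Kafka's commit.
print(db.execute("SELECT next_offset FROM offsets").fetchone())  # (18,)
```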
What happens during a consumer group rebalance, and why can rebalances become disruptive?
During a rebalance, Kafka pauses normal consumption so the group coordinator can reassign partitions across the active consumers in the group. It happens when a consumer joins, leaves, crashes, or when topic partition counts change.
Consumers revoke their current partitions, then get a new assignment.
During that window, processing can pause, so latency spikes.
If offsets are not committed carefully, you can see duplicate processing or brief gaps.
Eager rebalancing is more disruptive because everyone gives up partitions at once.
Frequent membership changes, slow consumers, long GC pauses, or session timeouts can trigger repeated rebalances.
This gets disruptive because stop-the-world reassignments hurt throughput and stability. Cooperative rebalancing helps by moving partitions more gradually, reducing full-group pauses.
How have you reduced the impact of consumer group rebalancing in a real system?
I usually answer this with a quick framework: identify why rebalances happen, reduce how often they happen, then shorten the ones you cannot avoid.
In a Kafka-based payments pipeline, we had frequent rebalances causing lag spikes during deploys and traffic bursts. I reduced it by:
- Moving to static membership with group.instance.id, so rolling restarts stopped triggering full churn.
- Using cooperative sticky assignment, which avoided revoking all partitions at once.
- Increasing max.poll.interval.ms and tuning batch size, so long processing did not look like a dead consumer.
- Decoupling poll from processing, we polled continuously and handed work to an internal worker pool.
- Making shutdown graceful, commit final offsets, drain in-flight work, then leave the group cleanly.
That cut rebalance frequency a lot and reduced recovery time from minutes to seconds.
What is the difference between log retention and log compaction, and when would you use each?
Kafka has two different cleanup strategies, and they solve different problems.
Log retention deletes old segments based on time or size, like retention.ms or retention.bytes.
It keeps data for a fixed window, so consumers must read it before it expires.
Use retention for event streams, audit-style logs, clickstreams, metrics, or anything where event history matters.
Log compaction keeps the latest value per key, removing older records for the same key, but not necessarily immediately.
Use compaction for changelog topics, CDC, user profiles, account state, or cache rebuilds, where you want the current state, not full history.
You can also combine them. For example, compacted topics with retention let you keep the latest state, while also aging out very old tombstones or stale data.
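The compaction behavior above can be modeled in a few lines (simplified: real compaction runs per segment and keeps tombstones around for a configurable period before removing them):

```python
# Toy compaction pass: keep the newest value per key; a None value is a
# tombstone that eventually removes the key entirely.

def compact(log):
    latest = {}
    for key, value in log:  # later records win
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [("user-1", "alice@old"), ("user-2", "bob@x"),
       ("user-1", "alice@new"), ("user-2", None)]  # tombstone for user-2
print(compact(log))  # {'user-1': 'alice@new'} -- current state, not history
```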
How does Kafka handle large messages, and what are the risks of increasing message size limits?
Kafka can handle large messages, but it is optimized for lots of smaller records. If you must send big payloads, you usually tune broker, producer, and consumer limits together, like message.max.bytes, max.request.size, and fetch.max.bytes. A common pattern is to store the large object in external storage, like S3, and put only a pointer in Kafka.
Large messages increase end-to-end latency, memory pressure, and GC pauses.
They reduce throughput because fewer records fit in a batch or page cache.
Replication gets heavier, so follower lag and ISR instability can increase.
Consumers may hit fetch or deserialization bottlenecks, especially during rebalancing.
Bigger limits raise failure blast radius, one oversized record can block retries or cause request timeouts.
So yes, Kafka supports it, but raising size limits is usually the last option, not the first.
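The pointer-in-Kafka approach mentioned above is often called the claim-check pattern. A sketch, with a plain dict standing in for S3 and an invented record shape:

```python
# Claim-check sketch: upload the heavy payload to object storage and send
# only a small reference through Kafka, keeping broker limits untouched.

import uuid

object_store = {}  # stand-in for S3 / blob storage

def produce_large(payload: bytes):
    key = f"blobs/{uuid.uuid4()}"
    object_store[key] = payload  # upload the heavy part out of band
    return {"type": "claim-check", "ref": key, "size": len(payload)}  # the Kafka record

def consume(record):
    return object_store[record["ref"]]  # fetch the payload on the consumer side

msg = produce_large(b"x" * 10_000_000)  # ~10 MB never touches the broker
print(msg["size"], consume(msg) == b"x" * 10_000_000)  # 10000000 True
```

One design note: the pointer and the blob now have separate lifecycles, so retention and deletion must be coordinated between Kafka and the object store.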
How do you preserve message ordering in Kafka, and what design choices can break ordering guarantees?
Kafka only guarantees ordering within a single partition, so the main way to preserve order is to make sure all related events always go to the same partition.
Use a stable message key, like customerId or orderId, so Kafka hashes it to the same partition.
Keep the partition count stable when possible, because increasing partitions can change key-to-partition mapping for new records.
On the producer, retries plus max.in.flight.requests.per.connection > 1 can reorder messages after failures; enabling idempotence preserves order with up to 5 in-flight requests, so turn it on (or cap in-flight at 1) if strict order matters.
Multiple producers writing the same key without coordination can break logical ordering.
On the consumer side, parallel processing of one partition can reorder completion unless you serialize handling.
If an event chain spans multiple partitions, topics, or services, Kafka itself is not giving you global order anymore.
If you need strict per-entity order, partition by that entity and process each partition sequentially.
If a business requirement says all events for the same customer must be processed in order, how would you design the topic and producer strategy?
I'd make ordering a partition-level guarantee and ensure every event for a given customer always lands in the same partition.
Use the customerId as the Kafka message key, so Kafka hashes it to a consistent partition.
Create the topic with enough partitions for throughput, knowing order is only guaranteed within each partition.
Keep the partition count stable if possible, because increasing partitions can remap keys and affect ordering for new events.
Configure the producer for safety, enable.idempotence=true, acks=all, and usually limit in-flight requests if strict ordering under retries matters.
On the consumer side, process records sequentially per partition, or preserve per-key order if doing async work.
If one customer is extremely hot, I'd call out the tradeoff: strict per-customer ordering can limit parallelism for that customer.
How would you troubleshoot a consumer that is consistently lagging behind production traffic?
I'd work it from outside in: verify whether the issue is capacity, bad consumer behavior, or downstream slowness.
Check lag by partition and consumer group in kafka-consumer-groups --describe, not just total lag.
Compare production rate vs consumption rate. If producers outpace consumers, you need more throughput, not just tuning.
Confirm partition count vs consumer count. More consumers than partitions gives no extra parallelism.
Look for slow processing, DB calls, retries, or blocking I/O in the consumer app. This is often the real bottleneck.
Review poll settings like max.poll.records, fetch sizes, session.timeout.ms, and max.poll.interval.ms.
Check for rebalances, GC pauses, CPU, memory, network, and broker-side fetch throttling.
Verify commits are healthy. Bad commit logic can make lag look worse or cause reprocessing.
If needed, scale consumers, increase partitions, batch work, or decouple slow downstream processing.
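The first step, per-partition lag, is a simple subtraction, but doing it per partition instead of in total is what surfaces hot partitions:

```python
# Per-partition lag = log end offset minus committed offset.
# Total lag can hide one hot partition, so always break it down.

def lag_by_partition(end_offsets, committed):
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end = {0: 1000, 1: 1000, 2: 5000}
committed = {0: 990, 1: 995, 2: 1200}
lags = lag_by_partition(end, committed)
print(lags)                     # {0: 10, 1: 5, 2: 3800}
print(max(lags, key=lags.get))  # 2 -- the partition to investigate first
```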
What Kafka metrics do you monitor most closely in production, and why?
I group them into producer, broker, consumer, and health signals, because each one catches a different failure mode.
Producer: record-send-rate, request-latency-avg, error and retry rates, these show throughput, broker pressure, and whether acks are slowing down.
Broker: under-replicated partitions, offline partitions, ISR shrink/expand, request handler idle, these are my first checks for cluster risk.
Consumer: lag, lag growth rate, rebalance count, fetch latency, these tell me if consumers are keeping up or thrashing.
Storage and network: disk usage, flush time, network in/out, saturation here usually becomes a cluster-wide problem fast.
End-to-end: message age or end-to-end latency, because healthy broker metrics can still hide user-visible delay.
If I had to prioritize, I'd start with consumer lag, under-replicated partitions, and request latency.
How do you define, detect, and respond to consumer lag in a business-critical Kafka pipeline?
Consumer lag is the gap between the latest produced offset and the last committed or processed offset for a consumer group. In a business-critical pipeline, I define it two ways: technical lag in offsets, and business lag in time, like "events must be processed within 30 seconds." Offsets alone can be misleading if message sizes or traffic patterns change.
Define SLOs per topic, for example max age of oldest unprocessed message, max lag per partition, and recovery time after spikes.
Detect with broker and client metrics, records-lag, records-lag-max, consumer throughput, rebalance frequency, and age of last successful processing.
Alert on sustained lag growth, not one-off spikes, and segment by consumer group, partition, and dependency.
Respond by checking if the bottleneck is consumer CPU, downstream I/O, hot partitions, slow deserialization, or rebalances.
Mitigate with more consumers if partitions allow, better batching, backpressure, pause or resume, DLQs for poison pills, and partition rebalancing or key redesign.
Have you worked with Kafka Connect in distributed mode, and what operational issues did you encounter?
Yes. I've run Kafka Connect in distributed mode for CDC and sink pipelines, usually 3 to 6 workers behind a load balancer, with configs and offsets stored in Kafka topics.
A few operational issues came up a lot:
- Rebalances during connector changes or worker restarts, which briefly paused tasks and caused lag spikes.
- Mis-sized internal topics, especially low replication on config, offset, and status, which made the cluster fragile.
- Bad converters or schema mismatches, leading to poison pill records and repeated task failures.
- Sink backpressure, where downstream systems slowed writes, tasks piled up, and retries increased.
- Plugin/version drift across workers, causing classloading errors when one node had a different connector build.
What helped was standardizing plugin deployment, tuning tasks.max, setting proper DLQs and error tolerance, monitoring rebalance frequency and task failure rates, and carefully sizing internal topics and consumer settings.
What are some common causes of uneven partition distribution or data skew, and how would you fix them?
A few things usually cause skew in Kafka, and the fix depends on whether the problem is on the producer side, the topic layout, or the cluster itself.
Bad key choice, like using customerId when a few customers dominate traffic, creates hot partitions. Pick a higher-cardinality key or add salting.
Null keys with sticky partitioning can still look uneven in bursts. If ordering is not required, use a better partitioning strategy.
Too few partitions means not enough spread. Increase partition count, but plan for ordering and consumer rebalance impact.
Custom partitioners sometimes hash poorly or route disproportionately. Review the logic and test key distribution.
Uneven broker leader placement or partition reassignment can create broker-level hotspots. Rebalance leaders and partitions with Cruise Control or reassignment tools.
Consumer-side skew also happens when one partition has much more data. Scale partitions, optimize processing, or isolate hot keys into a separate topic.
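The salting fix mentioned above means splitting one hot key into several sub-keys. A sketch, using a round-robin salt and a stdlib hash as a stand-in for Kafka's default partitioner:

```python
# Salting a hot key: spread one dominant key over N sub-keys so its traffic
# can map to several partitions instead of one hot partition.

import zlib

def partition_for(key, num_partitions=12):
    # stand-in for Kafka's murmur2-based default partitioner
    return zlib.crc32(key.encode()) % num_partitions

def salted_keys(key, n_events, salts=4):
    # round-robin salt; real systems often derive the salt from another field
    return [f"{key}-{i % salts}" for i in range(n_events)]

keys = salted_keys("megacorp", 8, salts=4)
print(sorted(set(keys)))  # ['megacorp-0', 'megacorp-1', 'megacorp-2', 'megacorp-3']
# Each salted key hashes independently, so the hot key's traffic can land on
# several partitions -- at the cost of only per-salt ordering for that key.
```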
What factors influence producer performance, and how do batch size, linger time, compression, and acknowledgments affect throughput and latency?
Producer performance mainly comes down to how efficiently it batches records, how much network round-trip it waits on, and how much durability you ask Kafka to guarantee.
batch.size: Larger batches improve throughput by sending more records per request, but they can add latency if traffic is low and batches take time to fill.
linger.ms: Higher linger lets the producer wait briefly for more records, which improves batching and throughput, but increases end-to-end latency.
compression.type: Compression like snappy, lz4, or zstd usually improves throughput and reduces network and disk usage, but adds CPU overhead. zstd compresses best, often with more CPU cost.
acks: acks=0 gives lowest latency, weakest durability. acks=1 is a common middle ground. acks=all gives strongest durability, but usually higher latency and lower throughput.
Other factors: record size, key distribution, number of partitions, retries, max.in.flight.requests.per.connection, broker load, and network bandwidth.
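One way to lay these knobs out together is a baseline config with per-workload overrides. The property names are real producer configs; the values are illustrative starting points, not recommendations for every workload:

```python
# Throughput-leaning baseline, with an override set for latency-sensitive paths.

producer_config = {
    "batch.size": 65536,         # bigger batches -> fewer requests, more throughput
    "linger.ms": 10,             # wait up to 10 ms to fill batches; adds latency
    "compression.type": "lz4",   # cheaper CPU than zstd, still a good ratio
    "acks": "all",               # strongest durability, slowest acknowledgment
    "enable.idempotence": True,  # safe retries without duplicates
}
low_latency_overrides = {"linger.ms": 0, "acks": "1"}

effective = {**producer_config, **low_latency_overrides}
print(effective["linger.ms"], effective["acks"])  # 0 1
```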
How do acknowledgments settings such as acks=0, acks=1, and acks=all change the reliability characteristics of a producer?
acks controls when the broker tells the producer "your write is accepted," so it is a direct tradeoff between latency and durability.
acks=0: fire-and-forget. Producer does not wait for any broker response, fastest, but messages can be lost with no visibility if the broker is down or the request fails.
acks=1: leader-only ack. The leader writes the record locally and responds, but if it crashes before followers replicate, acknowledged data can still be lost.
acks=all or -1: strongest durability. The leader waits for all in-sync replicas to persist before acknowledging, so loss risk is much lower.
With acks=all, reliability also depends on min.insync.replicas, if too few replicas are in sync, the write fails instead of accepting weaker durability.
For highest safety, pair acks=all with idempotence enabled and retries.
What is min.insync.replicas, and how does it interact with producer acknowledgments?
min.insync.replicas is the minimum number of in-sync replicas, including the leader, that must have the record for Kafka to accept a write. It is a durability guardrail at the topic or broker level.
With acks=all, the producer waits for all required in-sync replicas, and if ISR count drops below min.insync.replicas, the broker rejects writes, typically with NotEnoughReplicas.
With acks=1, only the leader ack matters, so min.insync.replicas is effectively not protecting that write path.
With acks=0, the producer does not wait at all, so there is no durability guarantee.
Example: replication factor 3, min.insync.replicas=2, acks=all, one broker can fail and writes still succeed; if two replicas are unavailable, writes fail instead of risking data loss.
For strong durability, pair replication factor at least 3 with acks=all and min.insync.replicas=2.
What is the difference between SSL, SASL, ACLs, and RBAC in the Kafka ecosystem?
They solve different layers of security in Kafka:
SSL, or TLS, encrypts traffic and can also do mutual auth with client certs. It answers, "is the connection private, and who is this cert holder?"
SASL is an authentication framework on top of the connection, using mechanisms like SCRAM, PLAIN, GSSAPI, or OAUTHBEARER. It answers, "who is this user or service?"
ACLs are authorization rules in Kafka, like "User A can read topic X" or "Service B can write to group Y."
RBAC is higher-level authorization, assigning roles to users or groups, then roles map to permissions. Easier to manage at scale than lots of individual ACLs.
In practice, SSL and SASL handle secure transport and identity, ACLs and RBAC handle what that identity is allowed to do.
A common setup is TLS for encryption, SASL SCRAM for login, then ACLs or RBAC for access control.
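That common setup can be sketched as broker settings. Hostnames and paths are placeholders, and the authorizer class shown is the ZooKeeper-era one (KRaft clusters use a different authorizer class):

```properties
# Illustrative broker settings (server.properties); hosts/paths are placeholders
listeners=SASL_SSL://broker1.internal:9093
security.inter.broker.protocol=SASL_SSL
sasl.enabled.mechanisms=SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
ssl.keystore.location=/etc/kafka/ssl/broker.keystore.jks
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
allow.everyone.if.no.acl.found=false
```

The last line is the important guardrail: without it, resources with no ACLs are open to everyone.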
How do you secure a Kafka cluster using authentication, authorization, encryption, and network controls?
I'd answer it in layers, because Kafka security is really defense in depth.
Authentication: use SASL, typically SCRAM for users or mTLS with client certs, and separate principals for apps, admins, brokers, and Connect.
Authorization: enable ACLs or RBAC, follow least privilege, and scope access to specific topics, consumer groups, and transactional IDs.
Encryption: enable TLS for client-broker and broker-broker traffic, validate certs properly, and encrypt data at rest with disk or volume encryption.
Network controls: put brokers on private subnets, restrict listener exposure, lock down ports with security groups or firewalls, and isolate internal vs external listeners.
Ops hardening: rotate secrets and certs, audit access, centralize logs, disable anonymous access, and secure ZooKeeper or KRaft controller communication too.
In interviews, I'd also mention testing misconfigurations regularly, because weak ACLs or open listeners are common failure points.
How would you design Kafka access controls for multiple teams sharing the same cluster?
I'd do it with strict tenant boundaries, least privilege, and automation, so teams can move fast without stepping on each other.
Use naming conventions by team and environment, like teamA.dev.*, teamA.prod.*, and map ACLs to those prefixes.
Create principals per app, not shared users, via mTLS or SASL/OAUTH, so every producer and consumer has its own identity.
Grant only required permissions, WRITE on producer topics, READ on consumer topics and groups, DESCRIBE where needed, no wildcard admin access.
Separate platform admins from application teams using RBAC if available, or tightly scoped ACLs plus IaC workflows and approvals.
Isolate sensitive workloads further with quotas, dedicated topics, and possibly separate clusters for highly regulated data.
Audit continuously, log auth failures, ACL changes, unusual access patterns, and review permissions on a schedule.
I'd also standardize topic creation through templates, so retention, replication, and ACL defaults are consistent.
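The prefix-based approach can be illustrated with a toy model. This is not Kafka's authorizer implementation, just a sketch of how prefixed resource patterns scope access by naming convention:

```python
# Toy illustration (not Kafka's implementation) of prefixed ACLs:
# each rule grants an operation on every topic matching a prefix,
# mirroring Kafka's "prefixed" resource pattern type.

acls = [
    {"principal": "User:teamA-app", "operation": "WRITE", "prefix": "teamA.prod."},
    {"principal": "User:teamA-app", "operation": "READ",  "prefix": "teamA.prod."},
]

def is_allowed(principal: str, operation: str, topic: str) -> bool:
    return any(
        a["principal"] == principal
        and a["operation"] == operation
        and topic.startswith(a["prefix"])
        for a in acls
    )

print(is_allowed("User:teamA-app", "WRITE", "teamA.prod.orders"))  # True
print(is_allowed("User:teamA-app", "WRITE", "teamB.prod.orders"))  # False
```

One prefixed rule per team replaces dozens of per-topic grants, which is what makes the naming convention load-bearing.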
What are the implications of running Kafka across availability zones or across regions?
Running Kafka across AZs is common and usually the sweet spot for HA. Running it across regions is a much bigger tradeoff.
Across availability zones: good fault tolerance, but higher network latency and cross-AZ traffic cost.
Replication settings matter more, min.insync.replicas, replication factor, and acks=all protect durability during an AZ loss.
Leader placement affects latency, producers and consumers pay more if they talk to brokers in other AZs.
Watch partition reassignments and inter-broker replication, they can get expensive and slower across AZs.
Across regions: usually not recommended for one Kafka cluster, latency hurts throughput, elections get slower, and failures can feel messy.
For multi-region, the common pattern is separate clusters per region with MirrorMaker 2 or Cluster Linking. That gives better isolation, lower local latency, and cleaner disaster recovery.
How would you design a disaster recovery strategy for Kafka?
I'd design Kafka DR around clear RPO and RTO targets first, because that drives whether you need active-passive or active-active.
Use cross-region replication, typically MirrorMaker 2 or Cluster Linking, to copy critical topics, configs, and consumer offsets.
Keep the primary cluster in one region, DR in another, with separate brokers, ZooKeeper or KRaft quorum, networking, and storage failure domains.
Set topic configs intentionally, like acks=all, strong replication factor, min.insync.replicas, and retention long enough to absorb failover delays.
Automate failover with DNS, client config switching, and runbooks, but treat consumer offset validation as a key step.
Test regularly with controlled failovers, broker loss, region isolation, and restore drills, because untested DR is mostly paperwork.
I'd also classify topics by business criticality. Not every topic needs the same replication cost or recovery priority.
What is MirrorMaker, and when would you use it instead of building your own replication approach?
MirrorMaker is Kafka's built-in tool for copying topics, partitions, and offsets between clusters. Think of it as managed consumer-plus-producer replication, useful when you want cluster-to-cluster data movement without writing custom apps.
Use MirrorMaker when you need DR, cross-region replication, cloud migration, or data sharing between environments.
Prefer MirrorMaker 2 over MM1, it handles topic config sync, consumer group offsets, and active-active patterns much better.
It saves you from owning retry logic, partition mapping, failover behavior, monitoring, and operational edge cases.
Build your own only if you need custom transforms, strict filtering, non-Kafka destinations, or very specific replication semantics.
In interviews, I'd say MirrorMaker is the default for Kafka-to-Kafka replication, custom code is justified only for specialized business logic.
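A minimal active-passive MirrorMaker 2 setup might look like the sketch below. Cluster aliases and bootstrap addresses are placeholders; the property names follow MM2's properties format:

```properties
# mm2.properties (illustrative); aliases and addresses are placeholders
clusters = primary, dr
primary.bootstrap.servers = primary-kafka:9092
dr.bootstrap.servers = dr-kafka:9092

# One-way replication from primary to the DR cluster
primary->dr.enabled = true
primary->dr.topics = .*
primary->dr.groups = .*
replication.factor = 3
sync.group.offsets.enabled = true
```

Syncing consumer group offsets is what lets consumers resume near where they left off after a failover.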
What makes a good Kafka connector configuration, and how do you validate reliability, error handling, and schema compatibility?
A good connector config is boring in the best way: explicit, observable, and easy to recover. I'd focus on correctness first, then throughput.
Set clear delivery semantics, idempotent sink behavior where possible, and stable keys or partitioning.
Tune retries, backoff, timeouts, max.poll.interval.ms, batch sizes, and task parallelism to match source and sink limits.
Use errors.tolerance, dead letter queues, and structured logging so bad records are isolated, not hidden.
Keep converters and schema settings explicit, like Avro or Protobuf with Schema Registry, subject naming, and compatibility mode.
Lock down secrets, ACLs, TLS, and use config providers instead of hardcoding credentials.
For validation, I'd run happy path, malformed data, schema evolution, duplicate, replay, and downstream outage tests. Then check lag, DLQ volume, retry behavior, offset commits, and whether schema changes preserve backward or forward compatibility before promoting to production.
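As an illustration, a sink connector config with explicit error handling and schema settings might look like this. The connector class, topic names, and URLs are placeholders; the `errors.*` and converter keys are standard Kafka Connect settings:

```json
{
  "name": "orders-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "tasks.max": "2",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "orders-sink-dlq",
    "errors.deadletterqueue.context.headers.enable": "true",
    "errors.retry.timeout": "60000",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

Enabling DLQ context headers records why each record failed, which makes the dead letter queue debuggable rather than just a dumping ground.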
What are the most common reasons Kafka performance degrades, and how do you diagnose the root cause?
Kafka slowdowns usually come from a few predictable buckets, so I diagnose by narrowing it to producers, brokers, consumers, or the network.
Producer side: tiny batches, acks=all with slow replicas, compression overhead, high retries, or too many in-flight requests.
Broker side: disk I/O saturation, page cache misses, too many partitions, compaction pressure, GC pauses, controller churn, or under-replicated partitions.
Consumer side: lag from slow processing, bad poll loop design, rebalances, fetch settings, or skewed partitions.
Cluster health: replication traffic, ISR shrink, uneven leadership, hot partitions, and network latency between brokers or clients.
Diagnosis: start with producer latency, broker request metrics, consumer lag, disk and network utilization, JVM GC, partition skew, and under-replicated partitions. Correlate Kafka metrics with OS metrics and recent config or deployment changes.
I usually check RequestHandlerAvgIdlePercent, UnderReplicatedPartitions, fetch/produce request latency, consumer lag, and disk wait first.
How do you approach capacity planning for a Kafka cluster in terms of storage, network, CPU, partitions, and replication?
I start from workload, not brokers. Estimate peak ingress and egress in MB/s, average record size, retention, consumer fan-out, and acceptable recovery time. Then map that into storage, network, CPU, partition count, and replication overhead with headroom.
Storage: daily data x retention x replication factor, then add 30 to 50 percent free space for compaction, rebalancing, and broker failure.
Network: budget for producer writes, consumer reads, and replication traffic. RF=3 means each write is sent across the network multiple times.
CPU: compression, TLS, batching, fetches, and consumer groups drive CPU. Benchmark your actual serializer and compression codec.
Partitions: size for throughput and parallelism, but avoid over-partitioning because it increases metadata, open files, and leader elections.
Replication: usually RF=3 in production, then check if the cluster can tolerate one broker down without saturating storage or network.
Validate with load tests, then keep 20 to 30 percent capacity headroom.
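The storage estimate above can be made concrete with a worked example. The numbers are illustrative, not a recommendation:

```python
# Worked example of the storage estimate from the text:
# daily ingress x retention x replication factor, plus headroom
# for compaction, rebalancing, and broker failure.

ingress_mb_per_s = 50      # illustrative peak producer ingress
retention_days = 7
replication_factor = 3
headroom = 1.4             # ~40% free-space buffer

daily_gb = ingress_mb_per_s * 86_400 / 1024          # one copy, per day
raw_gb = daily_gb * retention_days * replication_factor
total_gb = raw_gb * headroom

print(round(daily_gb, 1))         # GB written per day (single copy)
print(round(total_gb / 1024, 1))  # total disk needed across the cluster, in TB
```

At 50 MB/s this already lands well over 100 TB of provisioned disk, which is why retention and replication factor dominate storage planning.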
What is the purpose of Kafka Connect, and how is it different from writing custom producers and consumers?
Kafka Connect is Kafka's integration framework for moving data between Kafka and external systems, like databases, S3, Elasticsearch, or SaaS apps. Its main purpose is to standardize and simplify data ingestion and export without writing and operating custom app code.
Connectors are reusable plugins, sources pull data into Kafka, sinks push data out.
It gives you offset management, scaling, fault tolerance, retries, and config-driven deployment out of the box.
It supports Single Message Transforms, schema converters, and works well with Schema Registry.
Custom producers and consumers give maximum flexibility for business logic, but you own code, deployments, retries, error handling, and monitoring.
Use Kafka Connect for common integration patterns, use custom apps when you need complex processing, enrichment, or nonstandard behavior.
How do Kafka Streams applications manage state, and what are state stores, changelog topics, and standby replicas?
Kafka Streams manages state locally on each instance, so stateful operations like aggregations, joins, and windowing can read and update data fast without calling an external database.
State stores are the local storage layer, usually RocksDB or in-memory, keyed by record key and used by processors.
Changelog topics are Kafka topics where every state update is written, so the local store can be rebuilt after a crash or rebalance.
On recovery, Streams restores the state store by replaying the changelog, which gives fault tolerance.
Standby replicas are extra, passive copies of a taskโs state on other instances, kept warm by consuming the changelog.
If an active instance dies, a standby can be promoted quickly, reducing restore time and failover impact.
So the pattern is, local state for speed, changelogs for durability, standby replicas for faster high availability.
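That pattern can be sketched in a few lines. This is a toy simulation, not the Kafka Streams API: a dict stands in for the state store, a list stands in for the changelog topic:

```python
# Toy sketch (not the Kafka Streams API) of rebuilding a local state
# store by replaying a changelog topic: every update is appended to
# the changelog, and recovery replays it from the beginning.

changelog = []   # stands in for the changelog topic in Kafka
store = {}       # stands in for the local state store (e.g. RocksDB)

def update(key, value):
    store[key] = value
    changelog.append((key, value))   # every write is also logged

update("user-1", 5)
update("user-2", 3)
update("user-1", 8)   # later update overwrites the key locally

# Simulate a crash: local state is lost, the changelog survives in Kafka
store = {}

# Recovery: replay the changelog to rebuild the store
for key, value in changelog:
    store[key] = value

print(store)  # {'user-1': 8, 'user-2': 3}
```

A standby replica is essentially another instance running that replay loop continuously, so promotion skips the restore step.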
What is the role of event time, stream time, windows, and grace periods in Kafka Streams or similar Kafka-based stream processing?
These concepts are about handling late and out of order data correctly, instead of just using wall clock time.
Event time is when the event actually happened, usually from a timestamp in the record.
Stream time is the processor's notion of progress, typically the max event timestamp seen so far in a partition or task.
Windows group records into bounded time buckets, like tumbling, hopping, or session windows, so you can aggregate by time range.
Grace period is extra allowed lateness after a window's end, so late events can still update results.
Once stream time passes window end + grace, the window is closed and later records for it are dropped.
Example: if a click happened at 10:01 but arrives at 10:06, event time puts it in the 10:00 to 10:05 window. A 2 minute grace accepts it, a 0 grace drops it.
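The click example can be checked with a toy model of tumbling windows and grace. This is an illustration of the rule, not Kafka Streams code; timestamps are minutes since an arbitrary epoch:

```python
# Toy model of event-time windowing with a grace period:
# 5-minute tumbling windows; a record is accepted while
# stream time < window end + grace.

WINDOW_MS = 5 * 60_000

def window_start(event_time_ms: int) -> int:
    return event_time_ms - (event_time_ms % WINDOW_MS)

def accepts(event_time_ms: int, stream_time_ms: int, grace_ms: int) -> bool:
    window_end = window_start(event_time_ms) + WINDOW_MS
    # The window closes once stream time passes window end + grace
    return stream_time_ms < window_end + grace_ms

minute = 60_000
event_time = 601 * minute    # click happened at 10:01
stream_time = 606 * minute   # it arrives when stream time is 10:06

# The event falls in the 10:00-10:05 window (window end = 10:05)
print(accepts(event_time, stream_time, grace_ms=2 * minute))  # True: grace accepts it
print(accepts(event_time, stream_time, grace_ms=0))           # False: zero grace drops it
```

The key detail is that closing depends on stream time, not wall clock: if no new events arrive, windows stay open.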
What is Kafka Streams, and how does it differ from using plain consumers and producers?
Kafka Streams is Kafka's client library for building stream processing apps directly on top of Kafka. You write normal Java code, and it handles the hard parts like state management, repartitioning, fault tolerance, and scaling across instances.
With plain consumers and producers, you manually poll, transform, manage offsets, write output, and handle retries or state yourself.
Kafka Streams gives you high-level primitives like map, filter, join, aggregate, windowing, and stream-table processing.
It supports local state stores backed by changelog topics, so stateful operations are much easier and recoverable.
It runs as a library inside your app, not a separate cluster like Spark or Flink.
Use plain consumers/producers for simple pipelines; use Kafka Streams when you need stateful processing, joins, aggregations, or time-windowed logic.
What is Schema Registry, and why is schema evolution important in Kafka environments?
Schema Registry is a centralized service that stores and versions message schemas, usually for Avro, Protobuf, or JSON Schema. Producers register a schema, and consumers use the schema ID in the Kafka message to deserialize data correctly. It helps enforce compatibility rules so teams do not accidentally break downstream consumers.
Schema evolution matters because Kafka topics often have long-lived data and many independent producers and consumers. If you change a field, like adding, removing, or renaming it, older services may still need to read new messages, or new services may need to read old messages. With compatibility settings like backward, forward, or full, Schema Registry lets you evolve schemas safely, reduce deployment risk, and avoid runtime parsing failures across distributed systems.
How do you communicate Kafka tradeoffs to non-experts, especially when reliability, cost, and delivery guarantees are pulling in different directions?
I translate Kafka tradeoffs into business language first: what risk are we reducing, what latency can we tolerate, and what does failure cost us? Non-experts usually do not care about acks=all or idempotence, they care about questions like, "Could we lose an order?" or "How much more are we paying to make that nearly impossible?"
Frame it as a triangle: reliability, cost, speed. You can optimize two more easily than all three.
Use concrete scenarios: losing analytics events is different from losing payment events.
Explain guarantees plainly: at-most-once can lose data, at-least-once can duplicate, exactly-once costs more and adds complexity.
Quantify the price: more replicas, cross-zone traffic, storage retention, and operational overhead.
Recommend by tier: critical topics get stronger guarantees, low-value topics get cheaper settings.
That keeps the conversation decision-focused, not config-focused.
How do you handle duplicate events, late-arriving events, or out-of-order events in a Kafka-based architecture?
I handle them at a few layers, because Kafka gives you ordering per partition, not globally, and delivery can still be at-least-once.
Duplicates: make consumers idempotent, use a business key or event ID, and store processed IDs or use upserts/compaction.
Producer side: enable idempotent producers and, if needed, transactions for exactly-once within Kafka pipelines.
Out-of-order events: partition by the entity key so related events stay in one partition, then use sequence numbers, versions, or event timestamps to detect stale updates.
Late-arriving events: define a lateness window in stream processing, use event time not processing time, and send very late data to a side topic or reconciliation flow.
Operations: add DLQs only for poison messages, not normal lateness, and monitor duplicate rate, lag, and reorder frequency.
In practice, I usually combine idempotent writes plus keyed partitioning plus watermark/window logic.
How do you manage backward, forward, and full compatibility when evolving Avro, Protobuf, or JSON schemas?
I handle it by setting clear compatibility rules in a schema registry, then making only additive, well-governed changes unless there is a planned breaking version.
Backward compatibility: new consumers can read old data, so I can add optional fields with defaults, but avoid removing or renaming fields casually.
Forward compatibility: old consumers can read new data, so producers should not start requiring new fields that older readers cannot ignore.
Full compatibility: both directions work, which usually means additive changes only, with defaults and stable field identities.
Avro: use defaults carefully, never reuse field names for different meanings.
Protobuf: never reuse field numbers, reserve removed fields, prefer adding new optional fields.
JSON Schema: harder to enforce, so I rely on registry checks, versioning, and consumer-driven contract tests.
In practice, I use BACKWARD or FULL in Schema Registry, run CI compatibility checks, and treat renames as add plus deprecate plus later remove.
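For example, the classic backward-compatible Avro change is adding an optional field with a default. The record and field names below are hypothetical; the point is that new readers can still decode old records because the default fills the missing field:

```json
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    {"name": "orderId", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}
```

Removing `currency` later, or adding a field without a default, is what would break backward compatibility and get rejected by a registry check.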
Tell me about a Kafka production incident you were involved in. What happened, how did you diagnose it, and what changes did you make afterward?
I'd answer this with a tight STAR flow: situation, impact, actions, result, then what changed permanently.
At one company, consumer lag suddenly spiked for a payments event stream during peak traffic, and downstream reconciliation started missing SLAs. I checked broker health first, then compared producer rate, consumer lag, rebalance frequency, and processing latency. The clue was frequent consumer group rebalances plus long GC pauses on one consumer node. That node was under-memory sized, causing missed heartbeats and partition thrashing. We stabilized it by scaling out consumers, increasing heap carefully, tuning max.poll.interval.ms and session settings, and reducing per-message processing time by moving a slow external call out of the hot path. Afterward, we added lag and rebalance alerts, load tests for burst traffic, stricter capacity baselines, and a runbook so on-call could isolate broker issues versus consumer issues fast.
Describe a time when a Kafka-based design you implemented did not work as expected. What assumptions were wrong, and what did you learn?
I'd answer this with a quick STAR structure: situation, wrong assumption, fix, lesson.
At one company, I helped design an event pipeline where services published order updates to Kafka, and downstream consumers built read models and triggered notifications. We assumed partitioning by customerId was good enough because most queries were customer-centric. That was wrong. Order state transitions needed strict per-order ordering, and different orders for the same customer created hotspots, lag, and some out-of-order effects in downstream materializations.
We fixed it by repartitioning on orderId, adding idempotent consumers, and tightening our schema and event contract around state transitions. The big lesson was that partition keys should follow the unit of ordering, not just access patterns. I also learned to test with real traffic distributions, because "even enough" in staging can collapse in production.
Imagine a consumer accidentally commits offsets before processing messages successfully. What failure modes would this create, and how would you correct the design?
That creates classic at-most-once behavior by accident. The main failure mode is silent data loss: if the consumer crashes after committing but before finishing the work, Kafka thinks those records are done, so they will not be re-read after restart.
Lost messages, offsets advance past records that never actually completed
Partial side effects, like writing to a DB for some records in a batch, then crashing
Inconsistent downstream state, especially if processing touches multiple systems
Harder recovery, because replay from committed offsets skips the failed work
I'd fix it by committing offsets only after successful processing, usually after the DB transaction or external side effect completes. If possible, make processing idempotent so retries are safe. For stronger guarantees, use Kafka transactions for consume-process-produce flows, or persist both result and offset atomically in the same database transaction.
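The difference can be shown with a toy simulation, not a real consumer, contrasting commit-before and commit-after behavior when a crash hits mid-batch:

```python
# Toy simulation (not a real Kafka consumer) contrasting
# commit-before-processing vs commit-after-processing when a
# crash happens partway through a batch.

def run(records, commit_first: bool, crash_at: int):
    processed, committed_offset = [], -1
    for offset, record in enumerate(records):
        if commit_first:
            committed_offset = offset       # offset marked done up front
        if offset == crash_at:
            return processed, committed_offset  # simulate a crash here
        processed.append(record)
        if not commit_first:
            committed_offset = offset       # commit only after success
    return processed, committed_offset

records = ["a", "b", "c"]

# Commit-before: crash during "b" -> offset 1 is committed but "b" was
# never processed, so a restart resumes at offset 2 and silently loses it.
print(run(records, commit_first=True, crash_at=1))   # (['a'], 1)

# Commit-after: crash during "b" -> only offset 0 is committed, so a
# restart re-reads "b": at-least-once, safe if processing is idempotent.
print(run(records, commit_first=False, crash_at=1))  # (['a'], 0)
```

This is the same tradeoff as at-most-once vs at-least-once, just made visible at the level of one loop.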
Suppose a topic suddenly shows explosive growth in storage usage. How would you investigate whether the issue is retention, compaction, producer behavior, or consumer behavior?
I'd triage it in this order: confirm what changed, identify whether growth is from more bytes in, slower cleanup, or consumers pinning data, then narrow by topic config and broker metrics.
Check topic configs first: retention.ms, retention.bytes, cleanup.policy, segment.bytes, min.cleanable.dirty.ratio, and any recent config changes.
Compare produce rate vs delete/compact rate using broker JMX or monitoring: if bytes in spiked, suspect producers; if cleanup lagged, suspect retention or compaction.
Inspect log dirs and segment age/size. Old segments not deleting usually means retention settings, active segments, or broker clock/config issues.
For compacted topics, look at cleaner metrics like max compaction lag, dirty ratio, cleaner backlog, and tombstone retention. A growing backlog points to compaction falling behind.
Validate producer behavior: message size, key cardinality, duplicate sends, retries, compression changes, or a bad loop flooding the topic.
Check consumer lag and offsets, but note consumers do not control topic storage unless retention is tied to their progress externally.
If a product team wants to use Kafka as both a queue and a database, how would you respond, and what architectural guidance would you provide?
I'd say Kafka can act like both, but you should be precise about what role it plays. It is excellent as a durable event log and a decoupling layer. It can behave like a queue with consumer groups, and like a replayable source of truth for event-driven state. But it is not a general-purpose OLTP database.
As a queue, Kafka works well for high-throughput async processing, retries, and fan-out.
As a database, use it as an event store or system of record for immutable facts, not for ad hoc queries or complex transactions.
If you need current state lookups, put a materialized view on top, like Kafka Streams state stores, ksqlDB tables, Elasticsearch, or a relational DB.
Keep retention, compaction, schemas, keys, and idempotency explicit.
Guidance: let Kafka own event history, let specialized databases own serving queries and transactional business state.
Have you ever had to migrate from ZooKeeper-based Kafka to KRaft mode, or plan for it? What technical and operational considerations mattered most?
Yes. I have planned and helped execute the move conceptually and operationally, and the biggest lesson is to treat it as both a metadata migration and an ops model change, not just a version upgrade.
First, verify version compatibility and the exact migration path, because KRaft migration is supported only in specific Kafka versions and mixed assumptions can break the rollout.
Separate controller and broker roles deliberately. KRaft makes controllers first-class, so quorum sizing, hardware, placement, and fault domains matter a lot.
Inventory every ZooKeeper dependency, including scripts, monitoring, ACL workflows, and admin tooling, because many teams forget these hidden operational ties.
Rehearse in lower environments with realistic topic counts, ACLs, and broker failures, then validate leader elections, metadata propagation, and rollback options.
Update observability and runbooks. You monitor controller quorum health now, not ZooKeeper, and on-call procedures need to reflect that.
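As a sketch of the new ops model, a dedicated KRaft controller's configuration might look like this. Node IDs and hostnames are placeholders; the property names are standard KRaft settings:

```properties
# Illustrative KRaft controller settings; IDs and hosts are placeholders
process.roles=controller
node.id=1
controller.quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093
listeners=CONTROLLER://controller-1:9093
controller.listener.names=CONTROLLER
```

The quorum voters list replaces the ZooKeeper connection string, and a three-node quorum tolerates one controller failure.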
1. Can you explain leader and follower replicas and how leader election works in Kafka?
In Kafka, each partition has one leader replica and zero or more follower replicas. Producers and consumers talk to the leader only. Followers continuously fetch data from the leader and try to stay in sync. Kafka tracks the in-sync replicas, or ISR, which are followers that are caught up enough to be considered safe for failover.
Leader election happens when the current leader dies or becomes unavailable. The controller picks a new leader from the ISR, not from any out-of-date replica, so Kafka avoids data loss in normal cases. If unclean.leader.election.enable=true, Kafka can elect a non-ISR replica, which improves availability but risks losing committed data. A good answer also mentions replication factor, ISR, and acks=all, because they all work together to determine durability and failover behavior.
2. How does Kafka achieve high throughput compared with more traditional messaging systems?
Kafka gets high throughput by optimizing for sequential disk and batched network I/O, instead of treating every message like a tiny independent transaction.
It writes to an append-only log, so disk access is mostly sequential, which is much faster than random writes.
Producers batch records and compress them, cutting syscall, network, and storage overhead.
Consumers read sequentially too, often in large chunks, which is cache-friendly and efficient.
It relies heavily on the OS page cache and uses zero-copy transfer with sendfile, reducing extra memory copies.
Brokers are simple, they do not track per-message acknowledgments for each consumer, because consumers manage offsets.
Partitioning spreads load across brokers, disks, and consumers, so throughput scales horizontally.
3. How do replication factor and in-sync replicas affect durability and availability in Kafka?
They control the tradeoff between not losing data and keeping writes available.
Replication factor is how many copies of a partition exist. Higher RF improves durability and failover, but costs more storage and network.
In-sync replicas, ISR, are replicas fully caught up with the leader. Kafka only considers these safe for leader election.
With acks=all, the leader waits for all ISR members to confirm before acknowledging. That gives stronger durability.
min.insync.replicas sets the minimum ISR count required to accept writes. If ISR drops below that, producers using acks=all get errors instead of risking weaker durability.
Example: RF=3 and min.insync.replicas=2 lets Kafka survive one broker loss without losing committed data, while still accepting writes.
If you set RF=3 but min.insync.replicas=1, availability is higher, but durability is weaker because one leader ack can be enough.
4. What is the difference between at-most-once, at-least-once, and exactly-once delivery semantics in Kafka?
Here's the clean way to explain it:
At-most-once: messages may be lost, but never duplicated. This happens if you commit offsets before processing, or if a producer sends without retries.
At-least-once: messages are not lost, but they can be processed more than once. Typical setup is producer retries plus consumer commits offsets after successful processing.
Exactly-once: each message is processed once, with no loss and no duplicates, end to end within Kafka-supported workflows.
In Kafka, exactly-once usually means idempotent producers plus transactions, and consumers using read_committed. Important nuance: exactly-once is strongest when you stay inside Kafka, like consume-process-produce pipelines with Kafka Streams. If your consumer writes to an external DB, you still need idempotency or transactional coordination there.
5. How does transactional messaging in Kafka work, and what types of problems does it solve?
Kafka transactions let a producer write to multiple partitions, and even commit consumer offsets, as one atomic unit. The producer gets a transactional.id, initializes transactions, calls beginTransaction(), sends records, then either commitTransaction() or abortTransaction(). Kafka uses producer IDs and sequence numbers for idempotence, plus a transaction coordinator and special transaction markers so consumers with isolation.level=read_committed only see committed data.
It solves a few big problems:
- Exactly-once processing in stream apps, especially Kafka Streams and consume-transform-produce loops.
- Avoiding partial writes, where some partitions get data and others do not.
- Preventing duplicates caused by retries or producer restarts.
- Atomically writing output records and consumer offsets, so you do not reprocess after a crash.
- Hiding aborted records from downstream consumers, which keeps results consistent.
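The atomicity guarantee can be sketched with a toy model. This is not the real client API, just an illustration of how buffered sends become visible to read_committed consumers only on commit:

```python
# Toy model of transactional atomicity: sends are buffered and become
# visible to read_committed consumers only on commit; abort discards
# everything. Not the real Kafka client API.

class ToyTransactionalProducer:
    def __init__(self):
        self._buffer = []
        self.committed = []   # what a read_committed consumer would see

    def begin_transaction(self):
        self._buffer = []

    def send(self, partition, record):
        self._buffer.append((partition, record))

    def commit_transaction(self):
        self.committed.extend(self._buffer)   # all partitions, atomically
        self._buffer = []

    def abort_transaction(self):
        self._buffer = []                     # nothing becomes visible

p = ToyTransactionalProducer()
p.begin_transaction()
p.send("orders-0", "order created")
p.send("payments-1", "payment captured")
p.commit_transaction()        # both records commit together

p.begin_transaction()
p.send("orders-0", "order updated")
p.abort_transaction()         # aborted writes stay invisible

print(p.committed)
```

In real Kafka the broker achieves this with transaction markers in the log rather than buffering, but the visibility rule for read_committed consumers is the same.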
6. What is the difference between offsets managed automatically by consumers and offsets committed manually?
The difference is really about control versus convenience.
With automatic offset commits, the consumer periodically commits the latest polled offsets in the background, usually based on enable.auto.commit=true.
It is simple to use, but risky, because offsets may be committed before message processing actually finishes.
With manual commits, your code calls commitSync() or commitAsync() after processing succeeds.
That gives you tighter delivery guarantees, especially if you want at-least-once behavior.
The tradeoff is more application logic and responsibility for handling failures, retries, and commit timing.
In practice, auto-commit is fine for simple, low-risk consumers. Manual commit is better when processing is important and you need to avoid losing messages or marking them done too early.
7. What happens when a broker fails, and how does Kafka recover from that failure?
When a broker fails, clients notice first through timeouts and metadata refreshes, and the cluster reacts based on partition leadership and replication.
Each partition has a leader and follower replicas, tracked by the controller.
If the failed broker was a leader, the controller elects a new leader from the in-sync replicas, ISR.
Producers and consumers refresh metadata, then automatically talk to the new leader.
If the broker only hosted follower replicas, traffic usually continues with no visible impact.
When the broker comes back, it rejoins, catches up by replicating missing data, and can re-enter the ISR once fully synced.
If replication factor is too low, or no ISR is available, that partition can become unavailable temporarily. Recovery is fast when replication is healthy and min.insync.replicas is configured properly.
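The election rule can be sketched as a toy function (illustrative, not the controller's actual algorithm): the new leader must come from the ISR, and if no ISR member survives, the partition goes offline unless unclean election is explicitly enabled.

```python
# Toy leader-election rule (illustrative): prefer ISR members; unclean
# election trades durability for availability.

def elect_leader(isr, all_replicas, failed, unclean=False):
    candidates = [r for r in isr if r != failed]
    if candidates:
        return candidates[0]
    if unclean:
        # availability over durability: may lose committed data
        return next(r for r in all_replicas if r != failed)
    return None  # partition offline until an ISR member returns

replicas = ["broker-1", "broker-2", "broker-3"]
print(elect_leader(isr=["broker-1", "broker-2"], all_replicas=replicas, failed="broker-1"))
# broker-2
print(elect_leader(isr=["broker-1"], all_replicas=replicas, failed="broker-1"))
# None (offline, because unclean election is off)
```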
8. How do idempotent producers work, and when would you enable them?
Idempotent producers prevent duplicates caused by retries. Kafka assigns each producer a Producer ID, or PID, and every record gets a per-partition sequence number. The broker tracks the latest sequence it has accepted for that producer and partition, so if the producer retries the same send, the broker recognizes the duplicate and drops it instead of appending it again.
I'd enable it basically anytime I care about avoiding duplicate writes from transient failures, which is most production cases. It is especially useful with acks=all, retries, and unstable networks or broker failover. It gives per-partition exactly-once write semantics, but not end-to-end exactly-once by itself. For multi-partition atomicity or consume-process-produce exactly-once, you pair it with transactions. In practice, I'd leave enable.idempotence=true on by default unless I had a very specific reason not to.
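The broker-side dedup can be sketched in a few lines (a toy model, not the broker's implementation; the real broker also rejects sequence gaps as out-of-order rather than just dropping repeats):

```python
# Toy broker-side dedup: track the highest accepted sequence number per
# producer and drop retries that replay an already-accepted sequence.

class Partition:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest accepted sequence

    def append(self, producer_id, seq, record):
        if seq <= self.last_seq.get(producer_id, -1):
            return "duplicate-dropped"
        self.last_seq[producer_id] = seq
        self.log.append(record)
        return "appended"

p = Partition()
print(p.append("pid-7", 0, "a"))  # appended
print(p.append("pid-7", 1, "b"))  # appended
print(p.append("pid-7", 1, "b"))  # duplicate-dropped (retry of seq 1)
print(p.log)                      # ['a', 'b']
```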
9. What is the difference between eager rebalancing and cooperative rebalancing?
The core difference is how disruptive the rebalance is.
Eager rebalancing stops all consumers in the group, revokes all partitions, then reassigns everything at once.
Cooperative rebalancing is incremental, consumers keep partitions they can safely keep, and only move the partitions that actually need to change.
Eager is simpler but causes a full stop-the-world pause, which can increase latency and duplicate processing during rebalances.
Cooperative reduces downtime and churn, so it is better for large groups or stateful consumers like Kafka Streams.
Cooperative needs assignors that support it, like CooperativeStickyAssignor, and usually takes multiple rebalance rounds to converge.
So in an interview, I'd say eager is all-at-once and disruptive, cooperative is gradual and less disruptive.
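The difference is easy to show with a toy assignment diff (illustrative, not a real assignor): when a third consumer joins a group owning six partitions, eager revokes everything, cooperative revokes only what changes owner.

```python
# Toy comparison of revocations during a rebalance (illustrative).

def eager_revoked(old_assignment):
    # everyone gives up everything, then receives a fresh assignment
    return sorted(p for parts in old_assignment.values() for p in parts)

def cooperative_revoked(old_assignment, new_assignment):
    # only partitions that actually change owner are revoked
    return sorted(p for c, parts in old_assignment.items()
                  for p in parts if p not in new_assignment.get(c, []))

old = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
new = {"c1": [0, 1], "c2": [3, 4], "c3": [2, 5]}   # c3 joins the group

print(eager_revoked(old))              # [0, 1, 2, 3, 4, 5] - full stop-the-world
print(cooperative_revoked(old, new))   # [2, 5] - only the moved partitions pause
```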
10. How would you choose the right number of partitions for a topic in a production system?
I'd size partitions from three angles: throughput, parallelism, and operational cost.
Start with target throughput: the producers' write rate and the consumers' read rate. One partition has a practical ceiling, so estimate MB/s per partition from benchmark data, not theory.
Match consumer parallelism. In a consumer group, max active consumers is basically the partition count, so choose enough partitions for peak concurrency plus headroom.
Plan for future growth, because increasing partitions later is easy technically but can break key ordering and trigger data skew.
Avoid over-partitioning. Too many partitions increase file handles, metadata, controller load, rebalance time, and replication overhead.
Use key distribution to validate. If keys are skewed, more partitions will not fix hotspot partitions without a better partitioning strategy.
In practice, I benchmark, add 30 to 50 percent headroom, then check broker limits and consumer group size.
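That back-of-envelope sizing can be written down directly (the numbers below are illustrative, not recommendations; per-partition rates should come from your own benchmarks):

```python
import math

# Rough partition-count estimate: size for the worse of producer and consumer
# throughput, then add headroom. All rates are assumed benchmark results.

def partition_count(ingress_mb_s, egress_mb_s,
                    per_partition_write_mb_s, per_partition_read_mb_s,
                    headroom=0.4):
    needed = max(ingress_mb_s / per_partition_write_mb_s,
                 egress_mb_s / per_partition_read_mb_s)
    return math.ceil(needed * (1 + headroom))

# e.g. 200 MB/s in, 400 MB/s out (two consumer groups), benchmarked at
# 20 MB/s write and 40 MB/s read per partition:
print(partition_count(200, 400, 20, 40))  # ceil(10 * 1.4) = 14
```

After this estimate you would still sanity-check against broker limits and planned consumer group size, as described above.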
11. What are the tradeoffs between having too few partitions and too many partitions?
It is mostly a balance between parallelism and overhead.
Too few partitions limit consumer parallelism, so one slow consumer can bottleneck throughput.
It also reduces flexibility, because you cannot scale a consumer group past the partition count.
Too many partitions increase broker overhead: metadata, file handles, memory use, and controller work.
It can make leader elections, rebalances, and recovery slower, especially in large clusters.
Producers may see more batching inefficiency, and consumers may spend more time coordinating than processing.
In practice, size partitions to match expected throughput, retention, and consumer parallelism, not just peak scale. I usually start with enough partitions for near-term growth, then avoid extreme counts unless there is a clear throughput need.
12. How would you explain Apache Kafka to a team that currently uses a traditional message queue and is unsure why Kafka would be a better fit?
I'd frame Kafka as an event streaming platform, not just a message queue. A traditional queue is usually about one producer, one consumer, and once a message is consumed it is gone. Kafka is built for high throughput, durable logs, and multiple consumers reading the same data independently.
Kafka stores events on disk for a retention period, so you can replay data.
Multiple consumer groups can read the same topic without interfering with each other.
It scales horizontally really well with partitions and brokers.
It is great when data needs to feed several systems, like analytics, services, and monitoring.
It is less ideal if you only need simple task distribution with strict per-message deletion.
So if your use case is just background jobs, a queue may be enough. If you want a central event backbone, Kafka is usually the better fit.
13. What are the core components of Kafka, and how do topics, partitions, brokers, producers, consumers, and consumer groups work together?
Kafka is basically a distributed commit log, and these pieces fit together to move data reliably and at scale.
Topic: a named stream of records, like orders or payments.
Partition: a topic is split into partitions, each an ordered, append-only log for parallelism and scale.
Broker: a Kafka server that stores partitions and serves reads and writes.
Producer: writes records to a topic, often choosing a partition by key so related events stay ordered.
Consumer: reads records from partitions and tracks its position with offsets.
Consumer group: multiple consumers share the work; each partition in a group is read by only one consumer at a time.
Put together, producers send events to topic partitions on brokers, and consumers in a group read those partitions in parallel. Kafka replicates partitions across brokers for fault tolerance.
14. What is the role of partitions in Kafka, and how do they affect scalability, ordering, and parallelism?
Partitions are Kafka's unit of storage and work distribution. A topic is split into partitions, and each partition is an ordered, append-only log. That design is what lets Kafka scale horizontally.
Scalability: partitions spread data across brokers, so storage and throughput grow as you add brokers and partitions.
Ordering: Kafka only guarantees order within a single partition, not across the whole topic.
Parallelism: consumers in the same consumer group can read different partitions at the same time, which increases throughput.
Limitation: one partition can be consumed by only one consumer in a group at a time, so partition count caps consumer parallelism.
Tradeoff: more partitions improve throughput, but add overhead for metadata, rebalancing, and file handles.
A common pattern is choosing a key, like customerId, so all events for that key land in the same partition and keep order.
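The key-to-partition routing is just a deterministic hash (sketched here with a CRC standing in for the real client's murmur2 hash, so actual partition numbers will differ):

```python
import zlib

# Simplified key-to-partition routing. The Java client uses murmur2; zlib.crc32
# is used here only because it is deterministic across runs.

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

# The same key always lands on the same partition, preserving per-key order:
p1 = partition_for("customer-42", 6)
p2 = partition_for("customer-42", 6)
print(p1 == p2)  # True
```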
15. How do you handle offset management in applications where message processing must be reliable and recoverable?
I treat offsets as part of the processing contract, not just a Kafka detail. The rule is simple, only commit when the work is truly done, so a restart replays safely instead of losing data.
Disable auto commit for anything important, use manual commits after successful processing.
Prefer idempotent processing, so if a record is replayed, the side effect is safe.
If writing to a database, store the business result and the consumed offset in the same transaction when possible.
For Kafka-to-Kafka flows, use transactions for exactly-once semantics, producer writes and offset commits together.
On failures, do not commit, send poison messages to a DLQ after retries, and alert on repeated errors.
Keep ordering in mind, commit per partition progress, not some global checkpoint.
During rebalances, use a consumer rebalance listener to flush in-flight work before partitions are revoked.
16. What happens during a consumer group rebalance, and why can rebalances become disruptive?
During a rebalance, Kafka pauses normal consumption so the group coordinator can reassign partitions across the active consumers in the group. It happens when a consumer joins, leaves, crashes, or when topic partition counts change.
Consumers revoke their current partitions, then get a new assignment.
During that window, processing can pause, so latency spikes.
If offsets are not committed carefully, you can see duplicate processing or brief gaps.
Eager rebalancing is more disruptive because everyone gives up partitions at once.
Frequent membership changes, slow consumers, long GC pauses, or session timeouts can trigger repeated rebalances.
This gets disruptive because stop-the-world reassignments hurt throughput and stability. Cooperative rebalancing helps by moving partitions more gradually, reducing full-group pauses.
17. How have you reduced the impact of consumer group rebalancing in a real system?
I usually answer this with a quick framework: identify why rebalances happen, reduce how often they happen, then shorten the ones you cannot avoid.
In a Kafka-based payments pipeline, we had frequent rebalances causing lag spikes during deploys and traffic bursts. I reduced it by:
- Moving to static membership with group.instance.id, so rolling restarts stopped triggering full churn.
- Using cooperative sticky assignment, which avoided revoking all partitions at once.
- Increasing max.poll.interval.ms and tuning batch size, so long processing did not look like a dead consumer.
- Decoupling poll from processing, we polled continuously and handed work to an internal worker pool.
- Making shutdown graceful, commit final offsets, drain in-flight work, then leave the group cleanly.
That cut rebalance frequency a lot and reduced recovery time from minutes to seconds.
18. What is the difference between log retention and log compaction, and when would you use each?
Kafka has two different cleanup strategies, and they solve different problems.
Log retention deletes old segments based on time or size, like retention.ms or retention.bytes.
It keeps data for a fixed window, so consumers must read it before it expires.
Use retention for event streams, audit-style logs, clickstreams, metrics, or anything where event history matters.
Log compaction keeps the latest value per key, removing older records for the same key, but not necessarily immediately.
Use compaction for changelog topics, CDC, user profiles, account state, or cache rebuilds, where you want the current state, not full history.
You can also combine them. For example, compacted topics with retention let you keep the latest state, while also aging out very old tombstones or stale data.
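Compaction's "latest value per key" behavior fits in a few lines (a toy model; the real log cleaner works on segments and keeps tombstones around for a configurable window before removing them):

```python
# Toy log compaction: keep the latest value per key, and drop keys whose
# latest record is a tombstone (value None).

def compact(log):
    latest = {}
    for key, value in log:          # later records win
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user-1", "email=a@x.com"),
    ("user-2", "email=b@x.com"),
    ("user-1", "email=new@x.com"),  # supersedes the first user-1 record
    ("user-2", None),               # tombstone: user-2 deleted
]
print(compact(log))  # {'user-1': 'email=new@x.com'}
```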
19. How does Kafka handle large messages, and what are the risks of increasing message size limits?
Kafka can handle large messages, but it is optimized for lots of smaller records. If you must send big payloads, you usually tune broker, producer, and consumer limits together, like message.max.bytes, max.request.size, and fetch.max.bytes. A common pattern is to store the large object in external storage, like S3, and put only a pointer in Kafka.
Large messages increase end-to-end latency, memory pressure, and GC pauses.
They reduce throughput because fewer records fit in a batch or page cache.
Replication gets heavier, so follower lag and ISR instability can increase.
Consumers may hit fetch or deserialization bottlenecks, especially during rebalancing.
Bigger limits raise failure blast radius, one oversized record can block retries or cause request timeouts.
So yes, Kafka supports it, but raising size limits is usually the last option, not the first.
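The pointer pattern (often called claim-check) can be sketched like this, with a dict standing in for external object storage such as S3; names and the threshold are hypothetical:

```python
# Claim-check sketch: keep big payloads outside Kafka and send only a small
# reference. BLOB_STORE is a stand-in for real object storage.

BLOB_STORE = {}

def publish_large(payload: bytes, threshold=1_000_000):
    if len(payload) <= threshold:
        return {"inline": payload}           # small enough to send directly
    key = f"blob-{len(BLOB_STORE)}"
    BLOB_STORE[key] = payload                # "upload" to external storage
    return {"ref": key}                      # tiny message actually sent to Kafka

def consume(message):
    return message["inline"] if "inline" in message else BLOB_STORE[message["ref"]]

big = b"x" * 5_000_000
msg = publish_large(big)
print("ref" in msg, consume(msg) == big)  # True True
```

This keeps broker limits at their defaults while still moving large objects end to end.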
20. How do you preserve message ordering in Kafka, and what design choices can break ordering guarantees?
Kafka only guarantees ordering within a single partition, so the main way to preserve order is to make sure all related events always go to the same partition.
Use a stable message key, like customerId or orderId, so Kafka hashes it to the same partition.
Keep the partition count stable when possible, because increasing partitions can change key-to-partition mapping for new records.
On the producer, retries plus max.in.flight.requests.per.connection > 1 can reorder messages after failures; use idempotence and keep in-flight low if strict order matters.
Multiple producers writing the same key without coordination can break logical ordering.
On the consumer side, parallel processing of one partition can reorder completion unless you serialize handling.
If an event chain spans multiple partitions, topics, or services, Kafka itself is not giving you global order anymore.
If you need strict per-entity order, partition by that entity and process each partition sequentially.
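The remapping risk from changing partition counts is easy to demonstrate (illustrative hash; the real client uses murmur2, but the effect is identical):

```python
import zlib

# Why increasing partitions can break key ordering: the same key can map to a
# different partition under the new count, so old and new events for that key
# end up on different partitions.

def partition_for(key: str, n: int) -> int:
    return zlib.crc32(key.encode()) % n

keys = [f"customer-{i}" for i in range(100)]
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]
print(f"{len(moved)} of 100 keys map to a different partition after the increase")
```

Any key in `moved` can now have events on two partitions, and a consumer may observe them out of order.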
21. If a business requirement says all events for the same customer must be processed in order, how would you design the topic and producer strategy?
I'd make ordering a partition-level guarantee and ensure every event for a given customer always lands in the same partition.
Use the customerId as the Kafka message key, so Kafka hashes it to a consistent partition.
Create the topic with enough partitions for throughput, knowing order is only guaranteed within each partition.
Keep the partition count stable if possible, because increasing partitions can remap keys and affect ordering for new events.
Configure the producer for safety, enable.idempotence=true, acks=all, and usually limit in-flight requests if strict ordering under retries matters.
On the consumer side, process records sequentially per partition, or preserve per-key order if doing async work.
If one customer is extremely hot, I'd call out the tradeoff: strict per-customer ordering can limit parallelism for that customer.
22. How would you troubleshoot a consumer that is consistently lagging behind production traffic?
I'd work it from outside in: verify whether the issue is capacity, bad consumer behavior, or downstream slowness.
Check lag by partition and consumer group with kafka-consumer-groups --describe, not just total lag.
Compare production rate vs consumption rate. If producers outpace consumers, you need more throughput, not just tuning.
Confirm partition count vs consumer count. More consumers than partitions gives no extra parallelism.
Look for slow processing, DB calls, retries, or blocking I/O in the consumer app. This is often the real bottleneck.
Review poll settings like max.poll.records, fetch sizes, session.timeout.ms, and max.poll.interval.ms.
Check for rebalances, GC pauses, CPU, memory, network, and broker-side fetch throttling.
Verify commits are healthy. Bad commit logic can make lag look worse or cause reprocessing.
If needed, scale consumers, increase partitions, batch work, or decouple slow downstream processing.
23. What Kafka metrics do you monitor most closely in production, and why?
I group them into producer, broker, consumer, and health signals, because each one catches a different failure mode.
Producer: record-send-rate, request-latency-avg, error and retry rates, these show throughput, broker pressure, and whether acks are slowing down.
Broker: under-replicated partitions, offline partitions, ISR shrink/expand, request handler idle, these are my first checks for cluster risk.
Consumer: lag, lag growth rate, rebalance count, fetch latency, these tell me if consumers are keeping up or thrashing.
Storage and network: disk usage, flush time, network in/out, saturation here usually becomes a cluster-wide problem fast.
End-to-end: message age or end-to-end latency, because healthy broker metrics can still hide user-visible delay.
If I had to prioritize, I'd start with consumer lag, under-replicated partitions, and request latency.
24. How do you define, detect, and respond to consumer lag in a business-critical Kafka pipeline?
Consumer lag is the gap between the latest produced offset and the last committed or processed offset for a consumer group. In a business-critical pipeline, I define it two ways: technical lag in offsets, and business lag in time, like "events must be processed within 30 seconds." Offsets alone can be misleading if message sizes or traffic patterns change.
Define SLOs per topic, for example max age of oldest unprocessed message, max lag per partition, and recovery time after spikes.
Detect with broker and client metrics, records-lag, records-lag-max, consumer throughput, rebalance frequency, and age of last successful processing.
Alert on sustained lag growth, not one-off spikes, and segment by consumer group, partition, and dependency.
Respond by checking if the bottleneck is consumer CPU, downstream I/O, hot partitions, slow deserialization, or rebalances.
Mitigate with more consumers if partitions allow, better batching, backpressure, pause or resume, DLQs for poison pills, and partition rebalancing or key redesign.
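Segmenting lag by partition, rather than summing it, can be sketched as (toy computation with made-up offsets):

```python
# Per-partition lag for a consumer group: the worst partition matters more
# than the total, because one hot partition can hide behind a healthy sum.

def group_lag(end_offsets, committed):
    """Return per-partition lag and the partition with the most lag."""
    lag = {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}
    worst = max(lag, key=lag.get)
    return lag, worst

end = {0: 5000, 1: 5000, 2: 5000}     # log-end offsets
done = {0: 4990, 1: 4200, 2: 5000}    # committed offsets
lag, worst = group_lag(end, done)
print(lag)    # {0: 10, 1: 800, 2: 0}
print(worst)  # 1  <- alert on the hot partition, not just total lag
```

Pairing this with a time-based measure (age of the oldest unprocessed record) covers the business SLO side.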
25. Have you worked with Kafka Connect in distributed mode, and what operational issues did you encounter?
Yes. I've run Kafka Connect in distributed mode for CDC and sink pipelines, usually 3 to 6 workers behind a load balancer, with configs and offsets stored in Kafka topics.
A few operational issues came up a lot:
- Rebalances during connector changes or worker restarts, which briefly paused tasks and caused lag spikes.
- Mis-sized internal topics, especially low replication on config, offset, and status, which made the cluster fragile.
- Bad converters or schema mismatches, leading to poison pill records and repeated task failures.
- Sink backpressure, where downstream systems slowed writes, tasks piled up, and retries increased.
- Plugin/version drift across workers, causing classloading errors when one node had a different connector build.
What helped was standardizing plugin deployment, tuning tasks.max, setting proper DLQs and error tolerance, monitoring rebalance frequency and task failure rates, and carefully sizing internal topics and consumer settings.
26. What are some common causes of uneven partition distribution or data skew, and how would you fix them?
A few things usually cause skew in Kafka, and the fix depends on whether the problem is on the producer side, the topic layout, or the cluster itself.
Bad key choice, like using customerId when a few customers dominate traffic, creates hot partitions. Pick a higher-cardinality key or add salting.
Null keys with sticky partitioning can still look uneven in bursts. If ordering is not required, use a better partitioning strategy.
Too few partitions means not enough spread. Increase partition count, but plan for ordering and consumer rebalance impact.
Custom partitioners sometimes hash poorly or route disproportionately. Review the logic and test key distribution.
Uneven broker leader placement or partition reassignment can create broker-level hotspots. Rebalance leaders and partitions with Cruise Control or reassignment tools.
Consumer-side skew also happens when one partition has much more data. Scale partitions, optimize processing, or isolate hot keys into a separate topic.
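Key salting, mentioned above as a fix for hot keys, can be sketched like this (illustrative hash and bucket count; the tradeoff is that strict ordering now only holds per salted sub-key):

```python
import zlib

# Key-salting sketch: spread one hot key across a few sub-keys so it no longer
# pins a single partition. zlib.crc32 stands in for the client's murmur2 hash.

def salted_key(key: str, seq: int, buckets: int = 4) -> str:
    return f"{key}#{seq % buckets}"   # deterministic salt per event sequence

def partition_for(key: str, n: int) -> int:
    return zlib.crc32(key.encode()) % n

hot = "hot-customer"
plain = {partition_for(hot, 12) for _ in range(100)}                  # always 1 partition
salted = {partition_for(salted_key(hot, i), 12) for i in range(100)}  # up to 4 partitions
print(len(plain), len(salted))
```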
27. What factors influence producer performance, and how do batch size, linger time, compression, and acknowledgments affect throughput and latency?
Producer performance mainly comes down to how efficiently it batches records, how much network round-trip it waits on, and how much durability you ask Kafka to guarantee.
batch.size: Larger batches improve throughput by sending more records per request, but they can add latency if traffic is low and batches take time to fill.
linger.ms: Higher linger lets the producer wait briefly for more records, which improves batching and throughput, but increases end-to-end latency.
compression.type: Compression like snappy, lz4, or zstd usually improves throughput and reduces network and disk usage, but adds CPU overhead. zstd compresses best, often with more CPU cost.
acks: acks=0 gives lowest latency, weakest durability. acks=1 is a common middle ground. acks=all gives strongest durability, but usually higher latency and lower throughput.
Other factors: record size, key distribution, number of partitions, retries, max.in.flight.requests.per.connection, broker load, and network bandwidth.
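A throughput-leaning starting point might look like the following (the values are illustrative defaults to benchmark from, not universal recommendations; the keys are standard producer config names):

```python
# Example producer tuning, expressed as a config dict in confluent-kafka /
# librdkafka style. Every value here is a starting point to benchmark, not a
# recommendation.
config = {
    "batch.size": 131072,        # larger batches -> fewer, fuller requests
    "linger.ms": 10,             # wait up to 10 ms to fill a batch
    "compression.type": "lz4",   # cheap CPU, good network/disk savings
    "acks": "all",               # strongest durability; adds latency
    "enable.idempotence": True,  # safe retries without duplicates
}
print(sorted(config))
```

Lowering linger.ms and batch.size shifts the same knobs toward latency instead of throughput.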
28. How do acknowledgments settings such as acks=0, acks=1, and acks=all change the reliability characteristics of a producer?
acks controls when the broker tells the producer "your write is accepted," so it is a direct tradeoff between latency and durability.
acks=0: fire-and-forget. Producer does not wait for any broker response, fastest, but messages can be lost with no visibility if the broker is down or the request fails.
acks=1: leader-only ack. The leader writes the record locally and responds, but if it crashes before followers replicate, acknowledged data can still be lost.
acks=all or -1: strongest durability. The leader waits for all in-sync replicas to persist before acknowledging, so loss risk is much lower.
With acks=all, reliability also depends on min.insync.replicas, if too few replicas are in sync, the write fails instead of accepting weaker durability.
For highest safety, pair acks=all with idempotence enabled and retries.
29. What is min.insync.replicas, and how does it interact with producer acknowledgments?
min.insync.replicas is the minimum number of in-sync replicas, including the leader, that must have the record for Kafka to accept a write. It is a durability guardrail at the topic or broker level.
With acks=all, the producer waits for all required in-sync replicas, and if ISR count drops below min.insync.replicas, the broker rejects writes, typically with NotEnoughReplicas.
With acks=1, only the leader ack matters, so min.insync.replicas is effectively not protecting that write path.
With acks=0, the producer does not wait at all, so there is no durability guarantee.
Example: replication factor 3, min.insync.replicas=2, acks=all, one broker can fail and writes still succeed; if two replicas are unavailable, writes fail instead of risking data loss.
For strong durability, pair replication factor at least 3 with acks=all and min.insync.replicas=2.
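The interaction reduces to a simple acceptance check (toy logic, mirroring the example above):

```python
# Toy write-acceptance rule for min.insync.replicas (illustrative).

def accept_write(isr_size, min_insync_replicas, acks):
    if acks == "all":
        # broker rejects with a NotEnoughReplicas-style error otherwise
        return isr_size >= min_insync_replicas
    return True  # acks=0 and acks=1 do not consult min.insync.replicas

# RF=3, min.insync.replicas=2:
print(accept_write(isr_size=3, min_insync_replicas=2, acks="all"))  # True
print(accept_write(isr_size=2, min_insync_replicas=2, acks="all"))  # True, one broker down
print(accept_write(isr_size=1, min_insync_replicas=2, acks="all"))  # False, write rejected
```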
30. What is the difference between SSL, SASL, ACLs, and RBAC in the Kafka ecosystem?
They solve different layers of security in Kafka:
SSL, or TLS, encrypts traffic and can also do mutual auth with client certs. It answers, "is the connection private, and who is this cert holder?"
SASL is an authentication framework on top of the connection, using mechanisms like SCRAM, PLAIN, GSSAPI, or OAUTHBEARER. It answers, "who is this user or service?"
ACLs are authorization rules in Kafka, like "User A can read topic X" or "Service B can write to group Y."
RBAC is higher-level authorization, assigning roles to users or groups, then roles map to permissions. Easier to manage at scale than lots of individual ACLs.
In practice, SSL and SASL handle secure transport and identity, ACLs and RBAC handle what that identity is allowed to do.
A common setup is TLS for encryption, SASL SCRAM for login, then ACLs or RBAC for access control.
31. How do you secure a Kafka cluster using authentication, authorization, encryption, and network controls?
I'd answer it in layers, because Kafka security is really defense in depth.
Authentication: use SASL, typically SCRAM for users or mTLS with client certs, and separate principals for apps, admins, brokers, and Connect.
Authorization: enable ACLs or RBAC, follow least privilege, and scope access to specific topics, consumer groups, and transactional IDs.
Encryption: enable TLS for client-broker and broker-broker traffic, validate certs properly, and encrypt data at rest with disk or volume encryption.
Network controls: put brokers on private subnets, restrict listener exposure, lock down ports with security groups or firewalls, and isolate internal vs external listeners.
Ops hardening: rotate secrets and certs, audit access, centralize logs, disable anonymous access, and secure ZooKeeper or KRaft controller communication too.
In interviews, I'd also mention testing misconfigurations regularly, because weak ACLs or open listeners are common failure points.
32. How would you design Kafka access controls for multiple teams sharing the same cluster?
I'd do it with strict tenant boundaries, least privilege, and automation, so teams can move fast without stepping on each other.
Use naming conventions by team and environment, like teamA.dev.*, teamA.prod.*, and map ACLs to those prefixes.
Create principals per app, not shared users, via mTLS or SASL/OAUTH, so every producer and consumer has its own identity.
Grant only required permissions, WRITE on producer topics, READ on consumer topics and groups, DESCRIBE where needed, no wildcard admin access.
Separate platform admins from application teams using RBAC if available, or tightly scoped ACLs plus IaC workflows and approvals.
Isolate sensitive workloads further with quotas, dedicated topics, and possibly separate clusters for highly regulated data.
Audit continuously, log auth failures, ACL changes, unusual access patterns, and review permissions on a schedule.
I'd also standardize topic creation through templates, so retention, replication, and ACL defaults are consistent.
33. What are the implications of running Kafka across availability zones or across regions?
Running Kafka across AZs is common and usually the sweet spot for HA. Running it across regions is a much bigger tradeoff.
Across availability zones: good fault tolerance, but higher network latency and cross-AZ traffic cost.
Replication settings matter more, min.insync.replicas, replication factor, and acks=all protect durability during an AZ loss.
Leader placement affects latency, producers and consumers pay more if they talk to brokers in other AZs.
Watch partition reassignments and inter-broker replication, they can get expensive and slower across AZs.
Across regions: usually not recommended for one Kafka cluster, latency hurts throughput, elections get slower, and failures can feel messy.
For multi-region, the common pattern is separate clusters per region with MirrorMaker 2 or Cluster Linking. That gives better isolation, lower local latency, and cleaner disaster recovery.
34. How would you design a disaster recovery strategy for Kafka?
I'd design Kafka DR around clear RPO and RTO targets first, because that drives whether you need active-passive or active-active.
Use cross-region replication, typically MirrorMaker 2 or Cluster Linking, to copy critical topics, configs, and consumer offsets.
Keep the primary cluster in one region, DR in another, with separate brokers, ZooKeeper or KRaft quorum, networking, and storage failure domains.
Set topic configs intentionally, like acks=all, strong replication factor, min.insync.replicas, and retention long enough to absorb failover delays.
Automate failover with DNS, client config switching, and runbooks, but treat consumer offset validation as a key step.
Test regularly with controlled failovers, broker loss, region isolation, and restore drills, because untested DR is mostly paperwork.
I'd also classify topics by business criticality. Not every topic needs the same replication cost or recovery priority.
35. What is MirrorMaker, and when would you use it instead of building your own replication approach?
MirrorMaker is Kafka's built-in tool for copying topics, partitions, and offsets between clusters. Think of it as managed consumer-plus-producer replication, useful when you want cluster-to-cluster data movement without writing custom apps.
Use MirrorMaker when you need DR, cross-region replication, cloud migration, or data sharing between environments.
Prefer MirrorMaker 2 over MM1, it handles topic config sync, consumer group offsets, and active-active patterns much better.
It saves you from owning retry logic, partition mapping, failover behavior, monitoring, and operational edge cases.
Build your own only if you need custom transforms, strict filtering, non-Kafka destinations, or very specific replication semantics.
In interviews, I'd say MirrorMaker is the default for Kafka-to-Kafka replication, custom code is justified only for specialized business logic.
36. What makes a good Kafka connector configuration, and how do you validate reliability, error handling, and schema compatibility?
A good connector config is boring in the best way: explicit, observable, and easy to recover. I'd focus on correctness first, then throughput.
Set clear delivery semantics, idempotent sink behavior where possible, and stable keys or partitioning.
Tune retries, backoff, timeouts, max.poll.interval.ms, batch sizes, and task parallelism to match source and sink limits.
Use errors.tolerance, dead letter queues, and structured logging so bad records are isolated, not hidden.
Keep converters and schema settings explicit, like Avro or Protobuf with Schema Registry, subject naming, and compatibility mode.
Lock down secrets, ACLs, TLS, and use config providers instead of hardcoding credentials.
For validation, I'd run happy path, malformed data, schema evolution, duplicate, replay, and downstream outage tests. Then check lag, DLQ volume, retry behavior, offset commits, and whether schema changes preserve backward or forward compatibility before promoting to production.
37. What are the most common reasons Kafka performance degrades, and how do you diagnose the root cause?
Kafka slowdowns usually come from a few predictable buckets, so I diagnose by narrowing it to producers, brokers, consumers, or the network.
Producer side: tiny batches, acks=all with slow replicas, compression overhead, high retries, or too many in-flight requests.
Broker side: disk I/O saturation, page cache misses, too many partitions, compaction pressure, GC pauses, controller churn, or under-replicated partitions.
Consumer side: lag from slow processing, bad poll loop design, rebalances, fetch settings, or skewed partitions.
Cluster health: replication traffic, ISR shrink, uneven leadership, hot partitions, and network latency between brokers or clients.
Diagnosis: start with producer latency, broker request metrics, consumer lag, disk and network utilization, JVM GC, partition skew, and under-replicated partitions. Correlate Kafka metrics with OS metrics and recent config or deployment changes.
I usually check RequestHandlerAvgIdlePercent, UnderReplicatedPartitions, fetch/produce request latency, consumer lag, and disk wait first.
38. How do you approach capacity planning for a Kafka cluster in terms of storage, network, CPU, partitions, and replication?
I start from workload, not brokers. Estimate peak ingress and egress in MB/s, average record size, retention, consumer fan-out, and acceptable recovery time. Then map that into storage, network, CPU, partition count, and replication overhead with headroom.
Storage: daily data x retention x replication factor, then add 30 to 50 percent free space for compaction, rebalancing, and broker failure.
Network: budget for producer writes, consumer reads, and replication traffic. With RF=3, each write crosses the network again for every follower copy.
CPU: compression, TLS, batching, fetches, and consumer groups drive CPU. Benchmark your actual serializer and compression codec.
Partitions: size for throughput and parallelism, but avoid over-partitioning because it increases metadata, open files, and leader elections.
Replication: usually RF=3 in production, then check if the cluster can tolerate one broker down without saturating storage or network.
Validate with load tests, then keep 20 to 30 percent capacity headroom.
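The arithmetic above is worth showing concretely. Here is a back-of-envelope sizing sketch following those rules; all inputs are example numbers, not recommendations:

```python
# Rough capacity sizing: storage from ingest x retention x RF plus headroom,
# and replication network traffic from the replication factor.

def size_cluster(ingress_mb_s, retention_days, rf=3, headroom=0.4):
    """Return rough storage (TB) and replication network (MB/s) needs."""
    daily_gb = ingress_mb_s * 86_400 / 1024          # MB/s -> GB/day
    storage_tb = daily_gb * retention_days * rf / 1024
    storage_tb *= (1 + headroom)                     # 30-50% free-space buffer
    # Each write is shipped to rf-1 followers on top of the original produce.
    replication_mb_s = ingress_mb_s * (rf - 1)
    return round(storage_tb, 1), replication_mb_s

storage, repl = size_cluster(ingress_mb_s=50, retention_days=7)
print(f"~{storage} TB storage, ~{repl} MB/s replication traffic")
# ~121.1 TB storage, ~100 MB/s replication traffic
```

Numbers like these are a starting point for the load test, not a substitute for it.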
39. What is the purpose of Kafka Connect, and how is it different from writing custom producers and consumers?
Kafka Connect is Kafka's integration framework for moving data between Kafka and external systems, like databases, S3, Elasticsearch, or SaaS apps. Its main purpose is to standardize and simplify data ingestion and export without writing and operating custom app code.
Connectors are reusable plugins: sources pull data into Kafka, sinks push data out.
It gives you offset management, scaling, fault tolerance, retries, and config-driven deployment out of the box.
It supports Single Message Transforms, schema converters, and works well with Schema Registry.
Custom producers and consumers give maximum flexibility for business logic, but you own code, deployments, retries, error handling, and monitoring.
Use Kafka Connect for common integration patterns, use custom apps when you need complex processing, enrichment, or nonstandard behavior.
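"Config-driven deployment" is easiest to see with an example. The sketch below builds a sink connector config as a Python dict; the connector class and keys follow Confluent's JDBC sink connector as one common example, but exact keys vary per connector, and the names and URLs here are illustrative:

```python
# Instead of writing a consumer, you describe the integration as config and
# POST it to the Connect REST API. Hypothetical connector name and DB URL.

import json

jdbc_sink = {
    "name": "orders-to-postgres",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "2",                        # parallelism, managed by Connect
        "topics": "orders",                      # topic(s) to drain into the DB
        "connection.url": "jdbc:postgresql://db:5432/shop",
        "insert.mode": "upsert",                 # idempotent writes on the sink side
        "pk.mode": "record_key",                 # use the Kafka record key as the PK
    },
}

# Deployment is then a single HTTP call, e.g.:
#   curl -X POST http://connect:8083/connectors \
#        -H 'Content-Type: application/json' -d @orders-to-postgres.json
print(json.dumps(jdbc_sink, indent=2))
```

Everything a custom consumer would have to implement — offsets, retries, scaling — is handled by the framework around this config.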
40. How do Kafka Streams applications manage state, and what are state stores, changelog topics, and standby replicas?
Kafka Streams manages state locally on each instance, so stateful operations like aggregations, joins, and windowing can read and update data fast without calling an external database.
State stores are the local storage layer, usually RocksDB or in-memory, keyed by record key and used by processors.
Changelog topics are Kafka topics where every state update is written, so the local store can be rebuilt after a crash or rebalance.
On recovery, Streams restores the state store by replaying the changelog, which gives fault tolerance.
Standby replicas are extra, passive copies of a task's state on other instances, kept warm by consuming the changelog.
If an active instance dies, a standby can be promoted quickly, reducing restore time and failover impact.
So the pattern is: local state for speed, changelogs for durability, standby replicas for faster high availability.
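The changelog-restore pattern can be modeled in a few lines. Plain dicts and lists stand in for RocksDB and the changelog topic; this is a sketch of the idea, not the Kafka Streams API:

```python
# Every local state update is also appended to a changelog (a compacted Kafka
# topic in reality); after a crash, the store is rebuilt by replaying it.

changelog = []          # stands in for the changelog topic

def update_store(store, key, value):
    store[key] = value              # local write (fast path)
    changelog.append((key, value))  # durable record of the update

def restore_store():
    """What a restoring task (or a warm standby) does: replay the changelog."""
    store = {}
    for key, value in changelog:
        store[key] = value          # later records overwrite earlier ones
    return store

live = {}
update_store(live, "user-1", {"count": 1})
update_store(live, "user-2", {"count": 5})
update_store(live, "user-1", {"count": 2})   # overwrite: compaction keeps only this

rebuilt = restore_store()                    # e.g. after a crash or rebalance
print(rebuilt == live)  # True: replay reproduces the latest state
```

A standby replica is just this replay running continuously, so promotion skips the restore entirely.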
41. What is the role of event time, stream time, windows, and grace periods in Kafka Streams or similar Kafka-based stream processing?
These concepts are about handling late and out-of-order data correctly, instead of just using wall-clock time.
Event time is when the event actually happened, usually from a timestamp in the record.
Stream time is the processor's notion of progress, typically the max event timestamp seen so far in a partition or task.
Windows group records into bounded time buckets, like tumbling, hopping, or session windows, so you can aggregate by time range.
Grace period is extra allowed lateness after a window's end, so late events can still update results.
Once stream time passes window end + grace, the window is closed and later records for it are dropped.
Example: if a click happened at 10:01 but arrives at 10:06, event time puts it in the 10:00 to 10:05 window. A 2-minute grace accepts it; a 0 grace drops it.
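The click example above works out as follows in code. Timestamps are in minutes for readability; this is a hypothetical sketch of the semantics, not the Kafka Streams API:

```python
# Assign records to 5-minute tumbling windows by event time, then accept or
# drop late records based on stream time (max event timestamp seen) plus grace.

WINDOW = 5  # tumbling window size, in minutes

def window_for(event_ts):
    start = (event_ts // WINDOW) * WINDOW
    return (start, start + WINDOW)

def accept(event_ts, stream_time, grace):
    """A late record is kept only while stream time <= window end + grace."""
    _, end = window_for(event_ts)
    return stream_time <= end + grace

# A click at minute 601 (10:01) arrives when stream time is already 606 (10:06).
assert window_for(601) == (600, 605)          # lands in the 10:00-10:05 window
print(accept(601, stream_time=606, grace=2))  # True: within the 2-minute grace
print(accept(601, stream_time=606, grace=0))  # False: window already closed
```

Note that the decision uses stream time, not the wall clock: a stalled partition does not close windows prematurely.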
42. What is Kafka Streams, and how does it differ from using plain consumers and producers?
Kafka Streams is Kafka's client library for building stream processing apps directly on top of Kafka. You write normal Java code, and it handles the hard parts like state management, repartitioning, fault tolerance, and scaling across instances.
With plain consumers and producers, you manually poll, transform, manage offsets, write output, and handle retries or state yourself.
Kafka Streams gives you high-level primitives like map, filter, join, aggregate, windowing, and stream-table processing.
It supports local state stores backed by changelog topics, so stateful operations are much easier and recoverable.
It runs as a library inside your app, not a separate cluster like Spark or Flink.
Use plain consumers/producers for simple pipelines; use Kafka Streams when you need stateful processing, joins, aggregations, or time-windowed logic.
43. What is Schema Registry, and why is schema evolution important in Kafka environments?
Schema Registry is a centralized service that stores and versions message schemas, usually for Avro, Protobuf, or JSON Schema. Producers register a schema, and consumers use the schema ID in the Kafka message to deserialize data correctly. It helps enforce compatibility rules so teams do not accidentally break downstream consumers.
Schema evolution matters because Kafka topics often have long-lived data and many independent producers and consumers. If you change a field, like adding, removing, or renaming it, older services may still need to read new messages, or new services may need to read old messages. With compatibility settings like backward, forward, or full, Schema Registry lets you evolve schemas safely, reduce deployment risk, and avoid runtime parsing failures across distributed systems.
44. How do you communicate Kafka tradeoffs to non-experts, especially when reliability, cost, and delivery guarantees are pulling in different directions?
I translate Kafka tradeoffs into business language first: what risk are we reducing, what latency can we tolerate, and what does failure cost us? Non-experts usually do not care about acks=all or idempotence, they care about questions like, "Could we lose an order?" or "How much more are we paying to make that nearly impossible?"
Frame it as a triangle: reliability, cost, speed. You can optimize two more easily than all three.
Use concrete scenarios: losing analytics events is different from losing payment events.
Explain guarantees plainly: at-most-once can lose data, at-least-once can duplicate, exactly-once costs more and adds complexity.
Quantify the price: more replicas, cross-zone traffic, storage retention, and operational overhead.
Recommend by tier: critical topics get stronger guarantees, low-value topics get cheaper settings.
That keeps the conversation decision-focused, not config-focused.
45. How do you handle duplicate events, late-arriving events, or out-of-order events in a Kafka-based architecture?
I handle them at a few layers, because Kafka gives you ordering per partition, not globally, and delivery can still be at-least-once.
Duplicates: make consumers idempotent, use a business key or event ID, and store processed IDs or use upserts/compaction.
Producer side: enable idempotent producers and, if needed, transactions for exactly-once within Kafka pipelines.
Out-of-order events: partition by the entity key so related events stay in one partition, then use sequence numbers, versions, or event timestamps to detect stale updates.
Late-arriving events: define a lateness window in stream processing, use event time not processing time, and send very late data to a side topic or reconciliation flow.
Operations: add DLQs only for poison messages, not normal lateness, and monitor duplicate rate, lag, and reorder frequency.
In practice, I usually combine idempotent writes plus keyed partitioning plus watermark/window logic.
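The "idempotent writes plus stale-update detection" combination can be sketched in a few lines. In production the seen-IDs set and state would live in a database or compacted topic; the names here are illustrative:

```python
# Dedupe on a unique event ID (handles at-least-once redelivery), and use a
# per-entity sequence number to ignore out-of-order (stale) updates.

seen_ids = set()
state = {}   # entity_key -> {"seq": int, "value": ...}

def apply_event(event):
    # 1. Duplicate? Drop it: redelivery is normal under at-least-once.
    if event["id"] in seen_ids:
        return "duplicate"
    seen_ids.add(event["id"])
    # 2. Stale? An older sequence number must not overwrite newer state.
    current = state.get(event["key"])
    if current and event["seq"] <= current["seq"]:
        return "stale"
    state[event["key"]] = {"seq": event["seq"], "value": event["value"]}
    return "applied"

print(apply_event({"id": "e1", "key": "order-9", "seq": 1, "value": "CREATED"}))  # applied
print(apply_event({"id": "e1", "key": "order-9", "seq": 1, "value": "CREATED"}))  # duplicate
print(apply_event({"id": "e3", "key": "order-9", "seq": 3, "value": "SHIPPED"}))  # applied
print(apply_event({"id": "e2", "key": "order-9", "seq": 2, "value": "PAID"}))     # stale
```

Because the checks are keyed per entity, this works with keyed partitioning: all events for `order-9` land on one partition, so the sequence comparison sees them in order except for genuine reordering.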
46. How do you manage backward, forward, and full compatibility when evolving Avro, Protobuf, or JSON schemas?
I handle it by setting clear compatibility rules in a schema registry, then making only additive, well-governed changes unless there is a planned breaking version.
Backward compatibility: new consumers can read old data, so I can add optional fields with defaults, but avoid removing or renaming fields casually.
Forward compatibility: old consumers can read new data, so producers should not start requiring new fields that older readers cannot ignore.
Full compatibility: both directions work, which usually means additive changes only, with defaults and stable field identities.
Avro: use defaults carefully, never reuse field names for different meanings.
Protobuf: never reuse field numbers, reserve removed fields, prefer adding new optional fields.
JSON Schema: harder to enforce, so I rely on registry checks, versioning, and consumer-driven contract tests.
In practice, I use BACKWARD or FULL in Schema Registry, run CI compatibility checks, and treat renames as add plus deprecate plus later remove.
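A toy version of the backward-compatibility rule makes the registry check concrete. Schemas are simplified here to a map of field name to "has a default"; real registries apply far more rules per format, so this is a sketch of the principle only:

```python
# Backward compatibility: a new (reader) schema can read old data only if
# every field it adds carries a default the reader can fall back on.

def backward_compatible(old_schema, new_schema):
    added = set(new_schema) - set(old_schema)
    # Old records lack the added fields, so each one needs a default.
    return all(new_schema[f] for f in added)

v1     = {"order_id": False, "amount": False}
v2_ok  = {"order_id": False, "amount": False, "currency": True}   # added with default
v2_bad = {"order_id": False, "amount": False, "currency": False}  # added, no default

print(backward_compatible(v1, v2_ok))   # True
print(backward_compatible(v1, v2_bad))  # False
```

Running exactly this kind of check in CI, before a schema ever reaches the registry, is what turns compatibility from a runtime incident into a failed build.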
47. Tell me about a Kafka production incident you were involved in. What happened, how did you diagnose it, and what changes did you make afterward?
I'd answer this with a tight STAR flow: situation, impact, actions, result, then what changed permanently.
At one company, consumer lag suddenly spiked for a payments event stream during peak traffic, and downstream reconciliation started missing SLAs. I checked broker health first, then compared producer rate, consumer lag, rebalance frequency, and processing latency. The clue was frequent consumer group rebalances plus long GC pauses on one consumer node. That node was under-memory sized, causing missed heartbeats and partition thrashing. We stabilized it by scaling out consumers, increasing heap carefully, tuning max.poll.interval.ms and session settings, and reducing per-message processing time by moving a slow external call out of the hot path. Afterward, we added lag and rebalance alerts, load tests for burst traffic, stricter capacity baselines, and a runbook so on-call could isolate broker issues versus consumer issues fast.
48. Describe a time when a Kafka-based design you implemented did not work as expected. What assumptions were wrong, and what did you learn?
I'd answer this with a quick STAR structure: situation, wrong assumption, fix, lesson.
At one company, I helped design an event pipeline where services published order updates to Kafka, and downstream consumers built read models and triggered notifications. We assumed partitioning by customerId was good enough because most queries were customer-centric. That was wrong. Order state transitions needed strict per-order ordering, and different orders for the same customer created hotspots, lag, and some out-of-order effects in downstream materializations.
We fixed it by repartitioning on orderId, adding idempotent consumers, and tightening our schema and event contract around state transitions. The big lesson was that partition keys should follow the unit of ordering, not just access patterns. I also learned to test with real traffic distributions, because "even enough" in staging can collapse in production.
49. Imagine a consumer accidentally commits offsets before processing messages successfully. What failure modes would this create, and how would you correct the design?
That creates classic at-most-once behavior by accident. The main failure mode is silent data loss: if the consumer crashes after committing but before finishing the work, Kafka thinks those records are done, so they will not be re-read after restart.
Lost messages: offsets advance past records that never actually completed.
Partial side effects: some records in a batch written to the DB, then a crash.
Inconsistent downstream state: especially when processing touches multiple systems.
Harder recovery: replay from committed offsets skips the failed work.
I'd fix it by committing offsets only after successful processing, usually after the DB transaction or external side effect completes. If possible, make processing idempotent so retries are safe. For stronger guarantees, use Kafka transactions for consume-process-produce flows, or persist both result and offset atomically in the same database transaction.
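The difference in commit ordering is easy to demonstrate with a simulation. A fake consumer crashes mid-batch and restarts from its committed offset; this is a pure-Python model of the semantics, not the real client API:

```python
# commit-before-processing skips the failed record on restart (data loss);
# commit-after-processing re-reads it (at-least-once, needs idempotency).

def run(records, commit_first, crash_at):
    committed, processed = 0, []
    for attempt in range(2):                      # first run, then one restart
        for offset in range(committed, len(records)):
            if commit_first:
                committed = offset + 1            # commit BEFORE the work
            if attempt == 0 and offset == crash_at:
                break                             # crash before processing finishes
            processed.append(records[offset])
            if not commit_first:
                committed = offset + 1            # commit AFTER the work
        else:
            break                                 # finished cleanly, no restart needed
    return processed

print(run(["a", "b", "c"], commit_first=True, crash_at=1))   # ['a', 'c'] - 'b' lost forever
print(run(["a", "b", "c"], commit_first=False, crash_at=1))  # ['a', 'b', 'c'] - 'b' retried
```

The retried run shows why idempotency matters: under commit-after, a crash between the side effect and the commit replays the record, so the side effect must be safe to repeat.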
50. Suppose a topic suddenly shows explosive growth in storage usage. How would you investigate whether the issue is retention, compaction, producer behavior, or consumer behavior?
I'd triage it in this order: confirm what changed, identify whether growth is from more bytes in, slower cleanup, or consumers pinning data, then narrow by topic config and broker metrics.
Check topic configs first: retention.ms, retention.bytes, cleanup.policy, segment.bytes, min.cleanable.dirty.ratio, and any recent config changes.
Compare produce rate vs delete/compact rate using broker JMX or monitoring: if bytes in spiked, suspect producers; if cleanup lagged, suspect retention or compaction.
Inspect log dirs and segment age/size. Old segments not deleting usually means retention settings, active segments, or broker clock/config issues.
For compacted topics, look at cleaner metrics like max compaction lag, dirty ratio, cleaner backlog, and tombstone retention. A growing backlog points to compaction falling behind.
Validate producer behavior: message size, key cardinality, duplicate sends, retries, compression changes, or a bad loop flooding the topic.
Check consumer lag and offsets, but note consumers do not control topic storage unless retention is tied to their progress externally.
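A useful first pass on the "bytes in vs cleanup" question is comparing the storage a topic should occupy under its retention settings with what it actually uses. The numbers and the 1.5x slack factor below are illustrative assumptions:

```python
# If actual usage far exceeds ingest x retention x RF, cleanup is lagging;
# if it merely tracks ingest, the producers themselves got louder.

def expected_gb(bytes_in_per_s, retention_hours, rf):
    return bytes_in_per_s * 3600 * retention_hours * rf / 1024**3

def diagnose(bytes_in_per_s, retention_hours, rf, actual_gb, slack=1.5):
    exp = expected_gb(bytes_in_per_s, retention_hours, rf)
    if actual_gb > exp * slack:
        return "cleanup lagging: check retention config, segment roll, compaction backlog"
    return "size matches ingest: check producer rate, message size, retries"

# 2 MiB/s topic, 24h retention, RF=3 -> roughly 506 GB expected on disk.
print(round(expected_gb(2 * 1024**2, 24, 3)))    # 506
print(diagnose(2 * 1024**2, 24, 3, actual_gb=2000))
```

This split decides which half of the checklist above to walk first: topic configs and cleaner metrics, or producer behavior.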
51. If a product team wants to use Kafka as both a queue and a database, how would you respond, and what architectural guidance would you provide?
I'd say Kafka can act like both, but you should be precise about what role it plays. It is excellent as a durable event log and a decoupling layer. It can behave like a queue with consumer groups, and like a replayable source of truth for event-driven state. But it is not a general-purpose OLTP database.
As a queue, Kafka works well for high-throughput async processing, retries, and fan-out.
As a database, use it as an event store or system of record for immutable facts, not for ad hoc queries or complex transactions.
If you need current state lookups, put a materialized view on top, like Kafka Streams state stores, ksqlDB tables, Elasticsearch, or a relational DB.
Keep retention, compaction, schemas, keys, and idempotency explicit.
Guidance: let Kafka own event history, let specialized databases own serving queries and transactional business state.
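The "materialized view on top" guidance comes down to a fold over the event log. This is what Kafka Streams tables or ksqlDB do for you; here it is as a plain in-memory reduction, with hypothetical event shapes:

```python
# Kafka keeps the immutable event history; current state for point lookups is
# derived by folding that history into a table.

events = [  # immutable facts, in per-key order as Kafka would deliver them
    {"key": "order-1", "type": "CREATED", "total": 40},
    {"key": "order-2", "type": "CREATED", "total": 15},
    {"key": "order-1", "type": "PAID",    "total": 40},
    {"key": "order-1", "type": "SHIPPED", "total": 40},
]

def materialize(events):
    """Fold the log into a current-state table keyed for lookups."""
    table = {}
    for e in events:
        table[e["key"]] = {"status": e["type"], "total": e["total"]}
    return table

view = materialize(events)
print(view["order-1"]["status"])   # SHIPPED - latest state, served from the view
# Replaying the events from offset 0 rebuilds the view at any time; that
# replayability is why Kafka makes a good system of record but a poor query engine.
```

The serving side of this view belongs in a real store (state store, ksqlDB table, Elasticsearch, relational DB), while Kafka owns the history it was derived from.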
52. Have you ever had to migrate from ZooKeeper-based Kafka to KRaft mode, or plan for it? What technical and operational considerations mattered most?
Yes. I have planned and helped execute this migration, and the biggest lesson is to treat it as both a metadata migration and an ops model change, not just a version upgrade.
First, verify version compatibility and the exact migration path, because KRaft migration is supported only in specific Kafka versions and mixed assumptions can break the rollout.
Separate controller and broker roles deliberately. KRaft makes controllers first-class, so quorum sizing, hardware, placement, and fault domains matter a lot.
Inventory every ZooKeeper dependency, including scripts, monitoring, ACL workflows, and admin tooling, because many teams forget these hidden operational ties.
Rehearse in lower environments with realistic topic counts, ACLs, and broker failures, then validate leader elections, metadata propagation, and rollback options.
Update observability and runbooks. You monitor controller quorum health now, not ZooKeeper, and on-call procedures need to reflect that.