Data Engineering with Apache Kafka in 2026
Why the shift to tiered storage and KRaft is reshaping real-time data pipelines

It is hard to ignore how much the conversation around streaming data has changed in the last couple of years. In 2023, many teams were still debating whether to stick with Kafka or jump to something “simpler” like Redpanda or Pulsar. In 2026, those debates feel quaint. Apache Kafka is still the spine for many large-scale event-driven systems, but the way we design, deploy, and operate it looks different than it did five years ago. If you’ve been away from the Kafka world for a while, the biggest surprises are the quiet removal of ZooKeeper, the normalization of tiered storage, and the ubiquity of exactly-once semantics in production. Kafka is no longer just the “distributed log”; it is almost a cloud-native database with streaming primitives.
I am not here to tell you that Kafka is perfect or that it is the right answer for everything. I am here to share how I have used it in real projects, what broke, what surprised me, and where the ecosystem stands in 2026. We will walk through practical patterns for modern deployments, connect the dots between storage tiers and cost control, and touch on how developers are building with Kafka Streams and ksqlDB in day-to-day work. By the end, you will have a clear picture of when Kafka makes sense and when you might be better off with a simpler option.
Where Kafka fits in 2026
Context and common use cases
In 2026, Kafka sits at the intersection of transactional event streams and analytical pipelines. That is a shift from the era when Kafka was mostly about moving logs to a data lake. Today, teams use Kafka as the backbone for:
- Real-time product features: clickstreams, user presence, and notifications.
- Microservice decoupling: order placement to inventory updates to fraud checks.
- Feeding lakehouse platforms: using Change Data Capture (CDC) to replicate database changes into Iceberg/Delta tables.
- Operational analytics: lightweight transformations with Kafka Streams to power alerting or recommender systems.
For a streaming newcomer, the key is understanding that Kafka is not a replacement for a relational database, but it is often the backbone that lets other systems stay loosely coupled. When you need ordering guarantees and durable history, Kafka is a strong candidate. When your problem is primarily ad hoc analytics on historical data, a lakehouse plus a query engine is usually better.
Compared to alternatives:
- Redpanda is a Kafka-compatible engine that is fast and simple for smaller teams, especially when you want to avoid JVM operational complexity. In 2026, it’s often used at the edge or for high-throughput single-tenant workloads.
- Apache Pulsar offers multi-tenancy and geo-replication features that some enterprises love, but the ecosystem around Kafka is still broader, especially for stream processing frameworks and connectors.
- Cloud-native options (like Amazon MSK, Confluent Cloud, or Azure Event Hubs for Kafka) are the default if your organization prefers managed services over operating your own clusters.
Kafka has matured into a platform that supports a range of deployment models: on-prem metal, Kubernetes via Strimzi, or managed cloud. And while the marketing buzz has moved to LLMs and AI agents, the plumbing is still events, logs, and backpressure. That is where Kafka excels.
Core concepts and practical usage
From partitions to exactly-once, with real-world configuration
If you are coming back to Kafka after a few years, two changes dominate the conversation: KRaft mode (no more ZooKeeper) and Tiered Storage. In KRaft mode, the controller handles metadata, simplifying the control plane. Tiered Storage lets you offload older log segments to S3-compatible storage, drastically shrinking local disk requirements while keeping data accessible. The result is clusters that are easier to scale and cheaper to run.
A typical 2026 cluster topology looks like this:
- Brokers plus a small KRaft controller quorum (the Raft voters that hold cluster metadata), with no ZooKeeper ensemble to operate.
- Tiered Storage enabled to push segments to S3/GCS/Azure Blob.
- Clients using client-side encryption and compression to control cost.
- Connect running as a distributed worker cluster to move data in and out.
- Streams/ksqlDB for lightweight stream processing at the edge.
Let’s ground this in a realistic configuration. In 2026, we often tune the broker to optimize for cost and latency. Below is a broker configuration that enables tiered storage and sets sane defaults for production.
# server.properties
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka-1:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://kafka-1.example.com:9092
# Storage
log.dirs=/data/kafka-logs
log.retention.hours=168
# Tiered Storage to S3-compatible object storage
# Note: the RemoteStorageManager class and the s3.* property names below are
# plugin-specific placeholders -- substitute the class and keys documented by
# your tiered-storage plugin
remote.log.storage.system.enable=true
remote.log.storage.manager.class.name=org.apache.kafka.rsm.s3.S3RemoteStorageManager
remote.log.storage.manager.class.path=/opt/kafka/libs/
# S3 credentials should be injected via environment/secrets
remote.log.storage.manager.s3.bucket.name=kafka-segments-prod
remote.log.storage.manager.s3.region=us-east-1
remote.log.storage.manager.s3.endpoint.override=s3.amazonaws.com
# Replication and performance
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
# Topic defaults for production
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
compression.type=producer
# Exactly-once semantics support: expire idle transactional IDs after 10 minutes
transactional.id.expiration.ms=600000
In practice, you will also want to tune segment sizes and flush policies depending on your latency requirements and backpressure patterns. A fun fact: many teams discovered that enabling tiered storage made it possible to reduce local disk by 70% without sacrificing tail latency, because hot reads still come from local disk and cold fetches are lazily pulled from object storage. The key is to monitor the fetch-from-remote latency, which can spike if your topic has wildly uneven access patterns.
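To make that cost story concrete, here is a back-of-envelope sizing sketch. The throughput, replication, and retention numbers are made up for illustration; the point is that with tiered storage you size local disks for the hot retention window only, not the full retention period.

```python
def local_disk_gb(ingest_mb_s, replication_factor, local_retention_hours, headroom=1.3):
    """Back-of-envelope local disk needed across the cluster, in GB.

    headroom leaves ~30% slack for segment rolling and rebalances.
    """
    gb_per_hour = ingest_mb_s * 3600 / 1024
    return gb_per_hour * local_retention_hours * replication_factor * headroom


# Without tiered storage: all 168h (7 days) of retention lives on local disk
full = local_disk_gb(ingest_mb_s=100, replication_factor=3, local_retention_hours=168)

# With tiered storage: keep ~24h hot locally; older segments live in S3
hot = local_disk_gb(ingest_mb_s=100, replication_factor=3, local_retention_hours=24)

print(f"full={full:.0f} GB  hot={hot:.0f} GB  local-disk savings={1 - hot / full:.0%}")
```

The savings ratio is just `1 - hot_hours / total_hours`, which is how teams end up with local-disk reductions in the 70-90% range while total retention stays unchanged.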
Producer patterns and idempotence
For producers, idempotence is now table stakes, and transactions unlock exactly-once semantics across multiple partitions. In our payments pipeline, we wrap the producer in a small utility that ensures transactional initialization and safe closes. This pattern has saved us from duplicate events during retries.
# example_payment_producer.py
import logging
import time

from kafka import KafkaProducer
from kafka.errors import KafkaError

logging.basicConfig(level=logging.INFO)


class PaymentEventProducer:
    def __init__(self, bootstrap_servers):
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            acks="all",  # Wait for all in-sync replicas
            retries=5,
            enable_idempotence=True,  # Prevent duplicates on retries
            compression_type="snappy",  # Balance of CPU/throughput
            value_serializer=lambda v: v.encode("utf-8"),
            key_serializer=lambda k: k.encode("utf-8"),
            # Transactions require a kafka-python release with transaction support
            transactional_id="payment-producer-1",
            max_in_flight_requests_per_connection=5,  # Safe when idempotence is on
        )
        self.producer.init_transactions()

    def send_payment(self, account_id, amount):
        key = str(account_id)
        value = f"{account_id}:{amount}:{int(time.time())}"
        try:
            self.producer.begin_transaction()
            future = self.producer.send(
                "payments",
                key=key,
                value=value,
                headers=[("source", b"core-api")],
            )
            # Synchronous wait to detect errors early
            record_metadata = future.get(timeout=10)
            logging.info(
                "Sent: partition=%s offset=%s",
                record_metadata.partition,
                record_metadata.offset,
            )
            self.producer.commit_transaction()
        except KafkaError as e:
            logging.error("Transaction failed: %s", e)
            self.producer.abort_transaction()
            # In real scenarios, you might add retry with backoff
            raise

    def close(self):
        self.producer.flush()
        self.producer.close()


if __name__ == "__main__":
    producer = PaymentEventProducer(["kafka-1.example.com:9092"])
    try:
        for i in range(10):
            producer.send_payment(account_id="acct_1234", amount=100 + i)
    finally:
        producer.close()
The trick with transactions is to remember they are not free. If your transaction durations are too long, you will hold onto partitions and reduce concurrency. Keep transactions short and commit frequently.
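One way to keep transactions short is to chunk the event stream and commit one small transaction per chunk. The sketch below assumes a producer-like object exposing the same begin/commit/abort methods as the wrapper above (a hypothetical interface, not a library API); the batch size is a knob you would tune against your latency budget.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items -- one short transaction per chunk."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def send_in_short_transactions(producer, events, batch_size=100):
    """Commit one small transaction per chunk instead of one long-lived one.

    `producer` is any object with Kafka-style begin/commit/abort_transaction
    and send(topic, value=...) methods (illustrative interface).
    """
    for batch in chunked(events, batch_size):
        producer.begin_transaction()
        try:
            for event in batch:
                producer.send("payments", value=event)
            producer.commit_transaction()
        except Exception:
            # Abort only loses this small chunk, not the whole stream
            producer.abort_transaction()
            raise
```

With a batch size of 100, a burst of 250 events turns into three quick commits rather than one transaction that pins partitions for the entire burst.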
Consumer group patterns and backpressure
On the consumer side, the dominant pattern in 2026 is to isolate processing from consumption. You poll, buffer a small window, and process asynchronously with backpressure. This approach avoids long poll loops that block rebalances and cause sticky-assignment issues. Here is a simplified consumer with manual commits and basic error handling.
# example_payment_consumer.py
import logging

from kafka import KafkaConsumer

logging.basicConfig(level=logging.INFO)


def process_event(event):
    # Simulate processing and possible transient failure
    # (ConsumerRecord.value is an attribute, not a method)
    if b"fail" in event.value:
        raise ValueError("Simulated processing failure")
    logging.info("Processed: %s", event.value)


def main():
    consumer = KafkaConsumer(
        "payments",
        bootstrap_servers=["kafka-1.example.com:9092"],
        group_id="fraud-detector",
        auto_offset_reset="earliest",
        enable_auto_commit=False,  # Manual commits for safety
        max_poll_records=50,
        session_timeout_ms=30000,
    )
    try:
        while True:
            batch = consumer.poll(timeout_ms=1000)
            for tp, records in batch.items():
                for event in records:
                    try:
                        process_event(event)
                    except Exception as e:
                        logging.error("Processing error: %s", e)
                        # In real world: send to dead-letter topic
                        continue
            if batch:
                # Commit the positions from this poll once the batch is processed
                consumer.commit_async()
    except KeyboardInterrupt:
        logging.info("Shutting down consumer")
    finally:
        consumer.close()


if __name__ == "__main__":
    main()
If you are operating at higher scales, you will eventually hit consumer lag issues. The most effective solutions are usually a mix of increasing partitions, optimizing consumer logic to remove blocking I/O, and using a dedicated dead-letter topic for unprocessable messages. We once had a consumer that was doing synchronous database writes for each event. Moving that to a small in-memory queue with batch writes improved throughput tenfold.
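That queue-and-batch pattern is easy to sketch. The writer below drains events from a bounded in-memory queue and flushes them to any bulk sink; `write_batch` is a stand-in for your batched database insert, and a `None` item signals shutdown. In production you would run this on its own thread next to the poll loop, with a bounded queue so a slow sink backpressures the consumer.

```python
import queue


def batch_writer(q, write_batch, batch_size=100, timeout=0.5):
    """Drain events from `q`, flushing them to `write_batch` in groups.

    `write_batch` is any callable taking a list (e.g. a bulk INSERT);
    a None item on the queue signals shutdown.
    """
    batch = []
    while True:
        try:
            item = q.get(timeout=timeout)
        except queue.Empty:
            # Stream went quiet: flush the partial batch so events are not stuck
            if batch:
                write_batch(batch)
                batch = []
            continue
        if item is None:  # shutdown signal from the consumer thread
            break
        batch.append(item)
        if len(batch) >= batch_size:
            write_batch(batch)
            batch = []
    if batch:
        write_batch(batch)  # final flush on shutdown
```

The tenfold win in our case came from exactly this change: one database round trip per hundred events instead of one per event.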
Stream processing with Kafka Streams and ksqlDB
Kafka Streams is now a first-class citizen for stateful stream processing. It shines when you need to join streams, maintain state, and emit results with exactly-once. A typical pattern is session windows for user activity or aggregations for dashboards. The developer experience is much better than it used to be, especially with interactive queries to expose current state via a REST endpoint.
Here is a compact Kafka Streams example that builds a simple session window for user clicks. We include error handling and a grace period for late events.
// ClickSessionApp.java
package com.example.kafka.streams;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.processor.WallclockTimestampExtractor;

import java.time.Duration;
import java.util.Properties;

public class ClickSessionApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-sessionizer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1.example.com:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class);
        // Exactly-once processing
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.String()));

        // Group by user and define session windows with a grace period for late events
        SessionWindows windows = SessionWindows.ofInactivityGapAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1));
        KTable<Windowed<String>, Long> sessions = clicks
                .groupByKey()
                .windowedBy(windows)
                .count(Materialized.as("session-counts"));

        // To drive a REST API with interactive queries, you would query the "session-counts" state store
        sessions.toStream().print(Printed.toSysOut());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
ksqlDB continues to be popular for teams that want SQL semantics for simple transformations. It is not a silver bullet, but it is excellent for prototyping or for adding an ETL layer without writing code. A typical use is to filter PII out of a stream before it lands in a lakehouse. I have seen teams cut down development time by weeks using ksqlDB for the initial pipeline and then rewriting only the hot paths in Kafka Streams later.
Evaluating tradeoffs
Strengths, weaknesses, and when to use Kafka
Kafka in 2026 has a mature, stable core with excellent client libraries in many languages. It supports multi-tenant isolation, strong security defaults (SASL/SCRAM, mTLS), and a wide connector ecosystem via Kafka Connect. The developer experience around exactly-once and interactive queries is mature. With KRaft, the operational surface area is smaller than it was in the ZooKeeper era.
However, it is not without rough edges:
- Operational complexity: Even with KRaft and tiered storage, running a production cluster well requires expertise. You need to understand ISR behavior, controller elections, and network saturation.
- Cost surprises: Tiered storage saves on disk but can lead to high egress costs if consumers frequently re-read old segments. Careful topic retention and access patterns matter.
- Latency floor: For ultra-low-latency requirements (consistent sub-millisecond tails), a thread-per-core engine like Redpanda or a custom solution may be a better fit. Kafka’s JVM-based path typically lands in the low milliseconds, which is fine for most systems but not all.
- Stream processing learning curve: Kafka Streams is powerful but state management, interactive queries, and rebalancing logic are not trivial. Teams often underestimate the debugging surface area.
When is Kafka the right choice?
- You need strict ordering and high durability across many consumers.
- You have a diverse set of services that benefit from a unified event backbone.
- You want to leverage a mature ecosystem of connectors and stream processing frameworks.
- You plan to keep a long history of data accessible without manual ETL.
When might you skip Kafka?
- You have simple point-to-point messaging with a single consumer and no ordering requirements.
- Your team is small and cannot invest in operations and observability.
- You require ultra-low latency (sub-millisecond) in a single tenant environment.
- Your primary need is batch analytics rather than real-time events.
A personal take
Lessons from the trenches
The biggest lesson I have learned with Kafka is that the log is a simple idea, but the interfaces around it are not. When I first moved a legacy system to Kafka, I made two mistakes. First, I underestimated the importance of key partitioning. Our naive producer sent everything to the default partition, causing hot partitions and consumer lag spikes. Fixing it meant carefully designing keys (e.g., by user ID) and increasing partitions only during a maintenance window to avoid rehashing.
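To see why key choice matters, here is a toy partitioner showing how a constant key funnels everything into one hot partition while per-user keys spread the load. It hashes with md5 purely for illustration; Kafka's default partitioner actually uses murmur2, so the exact partition numbers will differ, but the skew behavior is the same.

```python
import hashlib


def toy_partition(key: str, num_partitions: int) -> int:
    """Stable key -> partition mapping (md5 here for illustration;
    Kafka's default partitioner uses murmur2)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Per-user keys spread across partitions; a constant key creates one hot partition
spread = {toy_partition(f"user_{i}", 12) for i in range(1000)}
hot = {toy_partition("default", 12) for _ in range(1000)}
print(f"user keys hit {len(spread)} partitions; the constant key hits {len(hot)}")
```

The same-key-same-partition property is also what gives you per-key ordering, which is why changing the partition count rehashes keys and is best done in a maintenance window.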
Second, I assumed that “exactly-once” meant I could ignore idempotence at the application level. While exactly-once semantics helped, the real wins came from idempotent producers, transactional wrapping for multi-partition updates, and a dead-letter topic for poison pills. One Friday afternoon, a bad deploy sent malformed messages that crashed our consumer. Without a dead-letter topic, we would have spent the weekend manually seeking offsets. Instead, we routed bad messages to a side topic, fixed the parser, and replayed the stream.
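A dead-letter route is a small amount of code. The helper below is a hedged sketch: `process` stands in for your business logic, `producer` is anything with a Kafka-style send method, and the `payments.dlq` topic name and error header are illustrative conventions rather than a standard.

```python
import logging


def consume_with_dlq(record, process, producer, dlq_topic="payments.dlq"):
    """Process one record; route poison pills to a dead-letter topic.

    `record` needs .key/.value attributes (like a ConsumerRecord);
    `producer` needs a send(topic, key=..., value=..., headers=...) method.
    Returns True on success, False if the record was dead-lettered.
    """
    try:
        process(record)
        return True
    except Exception as exc:
        logging.exception("Routing poison pill to %s", dlq_topic)
        producer.send(
            dlq_topic,
            key=record.key,
            value=record.value,
            headers=[("error", str(exc).encode("utf-8"))],  # keep the failure reason for replay
        )
        return False
```

Replaying is then just a consumer on the DLQ topic that re-sends fixed messages to the original topic, which is exactly what saved our weekend.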
There were bright moments too. During a product launch, we had to change the schema of an event that ten services consumed. With a schema registry and forward-compatible changes, we rolled out the update without downtime. That kind of coordination would have been a multi-week nightmare in a shared database world. Kafka’s decoupling made it possible to ship quickly with confidence.
Getting started with a modern Kafka project
Tooling, workflow, and mental models
If you are starting a new project in 2026, here is a practical workflow:
- Local development: Use Redpanda for quick prototyping if you do not need the full JVM stack. It is single-binary and Kafka-compatible. For fidelity with production, use the official Kafka image with KRaft.
- Cluster orchestration: If you are running on Kubernetes, use Strimzi to manage Kafka, Connect, and MirrorMaker. Strimzi handles rolling upgrades, certificate management, and CRDs for topic configuration.
- Schema registry: Use a registry (e.g., Confluent Schema Registry or Apicurio) from day one. Enforce schema compatibility rules (backward or full) to avoid breaking consumers.
- Observability: Export JMX metrics to Prometheus and visualize in Grafana. Key dashboards include under-replicated partitions, ISR shrink/expands, broker network saturation, and consumer lag. Also track remote storage metrics if you use tiered storage.
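To build intuition for what the compatibility rules mean, here is a toy presence-only check in the spirit of Avro's backward-compatibility rule: a reader on the new schema can decode old records only if every newly added field has a default. Real registries also validate types, unions, and aliases; this sketch checks field presence only.

```python
def is_backward_compatible(old_fields, new_fields):
    """Toy backward-compatibility check.

    Each schema is a dict of field name -> spec, where spec is {} for a
    required field or {"default": ...} for a field with a default value.
    Removing a field is fine for backward compatibility; adding one is
    only fine if it carries a default the reader can fall back to.
    """
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # new required field: old records cannot satisfy it
    return True


old = {"account_id": {}, "amount": {}}
ok = is_backward_compatible(old, {**old, "currency": {"default": "USD"}})
bad = is_backward_compatible(old, {**old, "currency": {}})
```

Enforcing this rule in the registry (rather than in code review) is what lets ten consumers upgrade on their own schedules.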
A typical repository structure for a service that produces and consumes events might look like this:
my-kafka-service/
├── docker-compose.yml
├── schema/
│   ├── payment.avsc
│   └── click.avsc
├── src/
│   ├── producers/
│   │   └── payment_producer.py
│   ├── consumers/
│   │   └── fraud_consumer.py
│   └── streams/
│       └── click_session_app.java
├── config/
│   ├── broker/
│   │   └── server.properties
│   └── connect/
│       └── s3-sink.json
├── k8s/
│   ├── kafka.yaml       # Strimzi CRDs
│   └── connector.yaml
└── Makefile
The docker-compose below helps you boot a local KRaft broker, a schema registry, and a connect worker. It is not production-ready, but it is perfect for iterative development.
# docker-compose.yml
version: "3.8"
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: controller,broker
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENERS: PLAINTEXT://:29092,PLAINTEXT_HOST://:9092,CONTROLLER://:9093
      # In-network clients (schema-registry, connect) use kafka:29092;
      # clients on your laptop use localhost:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk  # any valid base64-encoded UUID
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      KAFKA_JMX_PORT: 9999
    volumes:
      - kafka-data:/var/lib/kafka/data
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    depends_on:
      - kafka
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
  connect:
    image: confluentinc/cp-kafka-connect:7.6.0
    depends_on:
      - kafka
      - schema-registry
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka:29092
      CONNECT_REST_PORT: 8083
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_GROUP_ID: connect-workers
      CONNECT_CONFIG_STORAGE_TOPIC: connect-configs
      CONNECT_OFFSET_STORAGE_TOPIC: connect-offsets
      CONNECT_STATUS_STORAGE_TOPIC: connect-status
      # Single-broker cluster: internal topics must use replication factor 1
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_PLUGIN_PATH: /usr/share/confluent-hub-components
    volumes:
      - ./config/connect:/etc/connect
volumes:
  kafka-data:
For Connect, you will often configure connectors via REST. Here is a simple S3 sink connector config that archives payments to a lakehouse path. Note the time-based partitioner, which writes hourly prefixes that make downstream analytics easier.
// config/connect/s3-sink.json
{
  "name": "payments-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "payments",
    "s3.region": "us-east-1",
    "s3.bucket.name": "lakehouse-payments",
    "topics.dir": "raw/payments",
    "flush.size": 1000,
    "partition.duration.ms": 3600000,
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "locale": "en-US",
    "timezone": "UTC",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}
Here is how you can apply it to your local connect worker:
curl -X POST -H "Content-Type: application/json" \
--data @config/connect/s3-sink.json \
http://localhost:8083/connectors
What makes Kafka stand out in 2026
Ecosystem strength and developer experience
The power of Kafka in 2026 is not just the broker; it is the ecosystem. You get:
- Connectors for CDC (Debezium), databases (JDBC), cloud services (Pub/Sub, Kinesis), and lakes (Iceberg, Delta).
- Stream processing choices: Kafka Streams for JVM shops, ksqlDB for rapid prototyping, and client libraries for Python, Go, and .NET for non-JVM services.
- Operational tooling: Strimzi for Kubernetes, Cruise Control for self-balancing, and a rich set of monitoring integrations.
Developer experience has improved. The days of confusing rebalances due to long-running tasks are fading with better client defaults and cooperative sticky rebalancing. The introduction of exactly-once semantics across the stack (producers, consumers, and streams) is still the feature I rely on most when building financial or audit trails. The trick is to not overuse it; exactly-once has overhead and should be applied where it matters.
Maintaining reliability
Reliability is a combination of configuration and culture. The best teams I know enforce:
- Schema compatibility for all topics.
- Dead-letter topics for every critical consumer.
- Chaos testing of broker failures and network partitions.
- Limits on the number of partitions per broker to avoid metadata storms.
If you are operating on Kubernetes, define resource requests and limits carefully. JVM memory sizing is still an art, but a good rule of thumb is to keep the heap under 70% of the container memory to leave room for the page cache. Also, monitor the controller queue size during rolling upgrades.
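The heap rule of thumb is easy to encode. The 70% fraction comes from the guidance above; the 6 GB cap below is a common community heuristic rather than an official number, and the point of both is that whatever memory the heap does not take is left to the OS page cache, which Kafka leans on heavily for reads.

```python
def broker_heap_mb(container_mb, heap_fraction=0.7, heap_cap_mb=6144):
    """Rule-of-thumb JVM heap for a Kafka broker container, in MB.

    Keep the heap under ~70% of container memory so the page cache gets
    the rest; the cap (6 GB here, a heuristic) avoids giant heaps that
    invite long GC pauses on big-memory nodes.
    """
    return min(int(container_mb * heap_fraction), heap_cap_mb)


# A 16 GB container: 70% would be ~11.5 GB, but the cap keeps the heap
# at 6 GB, leaving ~10 GB of page cache for log segment reads
print(broker_heap_mb(16 * 1024))
```

Feed the result into `-Xms`/`-Xmx` (set them equal) and revisit the cap only if GC logs tell you to.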
Free learning resources
Where to go next
These resources are all verifiable and widely used in the community:
- Confluent Developer: https://developer.confluent.io/ - A curated set of courses and tutorials covering Kafka, Kafka Streams, and ksqlDB, including hands-on labs.
- Apache Kafka Official Documentation: https://kafka.apache.org/documentation/ - The canonical source for broker configuration, clients, and KRaft details.
- Strimzi Documentation: https://strimzi.io/docs/ - The go-to guide for running Kafka on Kubernetes with operators and CRDs.
- Kafka Streams Documentation: https://kafka.apache.org/37/documentation/streams/ - Deep dives into DSL, processing guarantees, and interactive queries.
- Debezium Documentation: https://debezium.io/documentation/ - Practical CDC patterns for streaming database changes into Kafka.
- Redpanda Blog and Docs: https://redpanda.com/blog - Useful for understanding Kafka-compatible alternatives and performance tuning perspectives.
- Confluent Kafka Client for Python: https://github.com/confluentinc/confluent-kafka-python - A high-performance client with extensive examples and configuration guidance.
For code examples and community patterns, the Confluent GitHub organization and Apache Kafka JIRA are invaluable for seeing how features behave in real workloads.
Summary and final thoughts
Who should use Kafka, and who should consider alternatives
If you are building real-time, event-driven systems that need high durability, multi-consumer support, and strong ordering, Kafka remains an excellent choice in 2026. It is especially strong when your organization can invest in operational excellence and wants to leverage a mature ecosystem. The shift to KRaft and tiered storage has made it more cloud-friendly and cost-effective than it was in the early 2020s.
If you are a small team with simple messaging needs, or you require ultra-low latency in a single-tenant environment, consider alternatives like Redpanda or even managed cloud services that abstract the complexity. For teams that prefer SQL-first transformations, ksqlDB offers a gentle on-ramp, but be prepared to move to Kafka Streams for complex stateful logic.
The biggest takeaway is to design your event topology with an eye toward evolution. Pick keys that distribute load, commit transactions quickly, and enforce schemas. Treat Kafka not as a black box, but as a distributed system with quirks you can learn to harness. When it clicks, Kafka can turn a tangled mess of integrations into a clean, auditable, and scalable backbone for your business.




