Data Streaming Architecture Patterns
As data volumes grow and real-time expectations rise, understanding reliable streaming patterns is essential for building systems that keep up.

I still remember the first time I had to ship a streaming pipeline to production. It felt like juggling chainsaws: the source was chatty, the downstream was picky, and the deployment window was unforgiving. Over the years, I’ve learned that the difference between a system that hums and one that wails often comes down to architecture patterns. The right pattern turns chaos into a rhythm. This post is a practical tour through those patterns, meant for developers and curious engineers who want to build data streams that are resilient, predictable, and useful in real life.
Expect plain talk, real-world trade-offs, and a few code examples you can adapt. I won’t pretend there’s a universal best approach; there isn’t. Instead, I’ll show where patterns shine, where they falter, and how to choose based on what you’re actually building.
Where streaming fits today
Data streaming sits at the heart of many modern systems, from user-facing features like live notifications to backend processes like fraud detection or inventory updates. Teams use it to short-circuit batch delays, decouple services, and turn event logs into actionable signals.
Who typically builds streaming systems?
- Backend engineers instrumenting services with event emitters and consumers.
- Platform teams managing Kafka clusters and connector ecosystems.
- Data engineers wiring streams to warehouses, lakes, and lakehouses.
- SREs/DevOps ensuring reliability and observability.
- ML engineers serving low-latency features to online models.
Compared to batch ETL, streaming emphasizes low latency and continuous processing; compared to request-driven architectures, it favors loose coupling and replayability. Compared to message queues like RabbitMQ, platforms like Apache Kafka and Apache Pulsar provide stronger persistence, partitioning, and scaling guarantees. For simpler needs, managed services (Amazon Kinesis, Google Pub/Sub, Azure Event Hubs) reduce operational burden. For serverless flows, tools like AWS Lambda or Google Cloud Functions can react to stream events without managing long-lived processes. Apache Flink and ksqlDB add stream processing capabilities, enabling joins, aggregations, and stateful computations over windows.
The core promise is this: you can make decisions on data as it arrives, and you can recompute those decisions later if needed.
Core streaming concepts you’ll use repeatedly
Before diving into patterns, a few mental models help:
- Events vs. messages: An event is a fact (immutable) about something that happened. A message is a vehicle that carries events or commands. Events are great for replay and audit; commands are about intent.
- Partitions: Data is divided into ordered sequences that can be processed independently. More partitions increase parallelism and throughput but add coordination and rebalance overhead (a toy sketch after this list makes partitions and offsets concrete).
- Offsets: A pointer to a position in a partition. Consumers track offsets to know what’s been processed.
- Idempotency: Processing the same event twice produces the same result. Essential for retries.
- Backpressure: When downstream systems can’t keep up, upstream must slow down or buffer safely.
- Schema: Contracts between producers and consumers. They make evolution safer and debugging easier.
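To make partitions and offsets concrete before moving on, here is a toy, in-memory sketch. Nothing in it talks to a broker; the hash-based routing merely stands in for the keyed partitioning real clients perform.
# partitions_and_offsets_toy.py
from collections import defaultdict

NUM_PARTITIONS = 4
log = defaultdict(list)              # partition -> append-only list of events
consumer_offsets = defaultdict(int)  # partition -> next offset to read

def partition_for(key):
    # Real clients hash the record key (e.g., murmur2); Python's hash() stands in here.
    return hash(key) % NUM_PARTITIONS

def produce(key, event):
    # Same key -> same partition -> per-key ordering is preserved.
    log[partition_for(key)].append(event)

def poll(partition, max_records=10):
    start = consumer_offsets[partition]
    batch = log[partition][start:start + max_records]
    consumer_offsets[partition] = start + len(batch)  # "commit" the new offset
    return batch

produce("user-42", {"type": "OrderCreated", "order_id": 1})
produce("user-42", {"type": "OrderPaid", "order_id": 1})
print(poll(partition_for("user-42")))  # both events, in order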
Common architecture patterns
1. Log-based messaging (event sourcing foundation)
Concept: The stream is an append-only log. Producers write events, consumers read at their own pace and track offsets.
Real-world use: A purchase service emits OrderCreated, OrderPaid, OrderShipped events. Analytics, inventory, and notifications all consume the same log.
Trade-offs: Great for replayability and decoupling; slower than direct RPC for request/response scenarios.
Example: Python producer/consumer with Kafka (local setup for learning)
# producer.py
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8")
)

for i in range(5):
    event = {
        "event_id": f"evt-{i}",
        "type": "OrderCreated",
        "payload": {"order_id": i, "amount": 19.99}
    }
    future = producer.send("orders", event)
    # For small demos, ensure delivery before exit
    future.get(timeout=5)

producer.flush()
# consumer.py
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=["localhost:9092"],
    group_id="order-analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda x: json.loads(x.decode("utf-8"))
)

for msg in consumer:
    event = msg.value
    # Idempotent by design: ignore duplicates if already processed
    print(f"Consumed: {event}")
2. Pub-sub with selective subscriptions
Concept: One topic, many consumer groups. Each group independently reads the stream.
Real-world use: Checkout events feed a fulfillment group (for shipping) and a finance group (for billing), without coupling those teams.
Trade-offs: Flexible but requires topic hygiene; too many topics can fragment resources; too few can cause noisy neighbors.
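There is no new machinery here; it is the same consumer code with a different group_id. A minimal sketch, assuming a local broker and a checkout topic (names are illustrative):
# group_demo.py — run twice, e.g. with "fulfillment-group" and "finance-group";
# each group keeps its own offsets, so both see every checkout event.
import json
import sys
from kafka import KafkaConsumer

group_id = sys.argv[1]

consumer = KafkaConsumer(
    "checkout",
    bootstrap_servers=["localhost:9092"],
    group_id=group_id,
    auto_offset_reset="earliest",
    value_deserializer=lambda x: json.loads(x.decode("utf-8"))
)

for msg in consumer:
    print(f"[{group_id}] offset={msg.offset} value={msg.value}")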
3. Stream processing with windows and joins
Concept: Process continuous data over time windows (tumbling, sliding, session). Join streams with other streams or with tables (Kafka Streams, Flink, ksqlDB).
Real-world use: Real-time fraud checks using a 60-second window of user actions joined with a profile table.
Trade-offs: Stateful processing adds complexity; careful checkpointing and state backend choices are critical.
Example: ksqlDB sessionization of page views
CREATE STREAM page_views (
  user_id VARCHAR,
  page_id VARCHAR,
  ts BIGINT
) WITH (
  KAFKA_TOPIC='page_views',
  VALUE_FORMAT='json',
  TIMESTAMP='ts'
);

CREATE TABLE user_sessions AS
  SELECT
    user_id,
    COUNT(*) AS events_in_session,
    COLLECT_SET(page_id) AS pages_seen
  FROM page_views
  WINDOW SESSION (2 MINUTES)
  GROUP BY user_id
  EMIT CHANGES;
4. Tiered storage and compaction
Concept: Use compacted topics for derived state (e.g., latest user profile). Combine with long-term storage for raw events.
Real-world use: Keep 90 days of raw clickstream in object storage; keep the latest profile in a compacted topic for low-latency lookups.
Trade-offs: Compacted topics reduce storage but are not a replacement for full retention; choose per topic.
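Compaction is a per-topic setting. A sketch using kafka-python's admin client, assuming a local broker (topic name and config values are illustrative, not recommendations):
# create_compacted_topic.py
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers=["localhost:9092"])

# cleanup.policy=compact keeps only the latest value per key -- the right shape
# for "current state" topics such as user profiles.
profile_topic = NewTopic(
    name="user-profiles",
    num_partitions=6,
    replication_factor=1,  # fine locally; use 3 in production
    topic_configs={
        "cleanup.policy": "compact",
        "min.cleanable.dirty.ratio": "0.1"  # compact more eagerly
    }
)

admin.create_topics([profile_topic])
admin.close()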
5. Outbox pattern for transactional integrity
Concept: When a service updates its database and emits an event, ensure both happen atomically by writing the event to an outbox table in the same transaction; a separate process tails the outbox to Kafka.
Real-world use: Orders created in a relational DB; a CDC (change data capture) connector publishes to Kafka.
Trade-offs: Adds an extra write and a relay process; prevents lost events, though delivery is still at-least-once, so consumers should deduplicate.
Example: Minimal outbox table schema
-- Postgres outbox table
CREATE TABLE outbox (
  id BIGSERIAL PRIMARY KEY,
  aggregate_id VARCHAR(64) NOT NULL,
  event_type VARCHAR(64) NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
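The other half of the pattern is the application write path: the business row and the outbox row land in the same transaction. A sketch with psycopg2 (the DSN and orders table are illustrative; a CDC connector such as Debezium, or a simple poller, then relays outbox rows to Kafka):
# write_with_outbox.py
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=shop user=shop password=shop host=localhost")

def create_order(order_id, amount):
    # Both inserts commit or roll back together, so an order is never saved
    # without its event and an event is never emitted without its order.
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (id, amount) VALUES (%s, %s)",
                (order_id, amount)
            )
            cur.execute(
                "INSERT INTO outbox (aggregate_id, event_type, payload) "
                "VALUES (%s, %s, %s)",
                (str(order_id), "OrderCreated", Json({"order_id": order_id, "amount": amount}))
            )

create_order(101, 19.99)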
6. Sinks and materialized views
Concept: Stream data into specialized sinks for query: search indexes (Elasticsearch), OLAP (ClickHouse, BigQuery), caches (Redis), or databases (Postgres).
Real-world use: Clickstream events feed Elasticsearch for product search relevance and BigQuery for dashboards.
Trade-offs: Choose sinks based on query patterns; maintain schema evolution carefully.
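A Kafka Connect sink connector usually does this wiring for you, but a hand-rolled sink shows the shape of the problem. A sketch that upserts order events into Postgres, assuming an order_revenue table with a unique event_id column (connection details are illustrative); the ON CONFLICT clause keeps replays safe:
# orders_to_postgres_sink.py
import json
import psycopg2
from kafka import KafkaConsumer

conn = psycopg2.connect("dbname=analytics user=analytics password=analytics host=localhost")
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=["localhost:9092"],
    group_id="orders-postgres-sink",
    auto_offset_reset="earliest",
    value_deserializer=lambda x: json.loads(x.decode("utf-8"))
)

for msg in consumer:
    event = msg.value
    with conn:
        with conn.cursor() as cur:
            # Idempotent write: reprocessing the same event updates in place.
            cur.execute(
                "INSERT INTO order_revenue (event_id, order_id, amount) "
                "VALUES (%s, %s, %s) "
                "ON CONFLICT (event_id) DO UPDATE SET amount = EXCLUDED.amount",
                (event["event_id"], event["payload"]["order_id"], event["payload"]["amount"])
            )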
7. Exactly-once semantics (EOS)
Concept: End-to-end consistency via transactional writes and idempotent processing.
Real-world use: Financial ledger updates where duplicates must not cause double spending.
Trade-offs: EOS is expensive and complex; often it’s safer to design for idempotent side effects rather than pure EOS.
Example: Kafka transactional producer
# txn_producer.py
# Note: idempotent and transactional producer settings require a client that
# supports them (recent kafka-python releases, or confluent-kafka-python).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    enable_idempotence=True,
    transactional_id="order-txn-1"
)

producer.init_transactions()
try:
    producer.begin_transaction()
    producer.send("orders", {"type": "OrderCreated", "order_id": 101})
    producer.send("orders", {"type": "OrderPaid", "order_id": 101})
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
8. Backpressure handling
Concept: Prevent overload via flow control: bounded buffers, retries with exponential backoff, and dynamic scaling.
Real-world use: A consumer hitting an API rate limit slows reading from its partition; scaling consumers out (up to the partition count) or rebalancing partitions eases bottlenecks.
Trade-offs: Blind buffering hides problems; backpressure signals allow graceful degradation.
Example: Bounded queue consumer (pseudo-code pattern)
import time, random

def process_event(evt):
    # Simulate variable latency
    time.sleep(random.uniform(0.02, 0.2))
    return True

def consume_with_backpressure(queue, max_in_flight=10):
    in_flight = 0
    while True:
        if in_flight < max_in_flight:
            evt = queue.get()
            # Launch async work and track completion
            in_flight += 1
            # In a real app, use threading/asyncio and callbacks to decrement
            try:
                process_event(evt)
            finally:
                in_flight -= 1
        else:
            time.sleep(0.05)  # Back off when busy
9. Schema evolution and contract management
Concept: Use schemas (Avro/Protobuf/JSON Schema) and a registry to enforce compatibility rules.
Real-world use: Adding an optional field to an event without breaking existing consumers.
Trade-offs: Schemas add overhead but save hours debugging serialization bugs.
Example: Basic JSON Schema for an order event
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "OrderCreated",
  "type": "object",
  "properties": {
    "event_id": {"type": "string"},
    "type": {"type": "string", "enum": ["OrderCreated"]},
    "payload": {
      "type": "object",
      "properties": {
        "order_id": {"type": "integer"},
        "amount": {"type": "number"}
      },
      "required": ["order_id", "amount"]
    }
  },
  "required": ["event_id", "type", "payload"]
}
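Validating at the edge keeps malformed events out of business logic. A sketch using the jsonschema package against the schema above (the dead-letter handling is only hinted at in a comment):
# validate_event.py
import json
from jsonschema import validate, ValidationError

with open("schemas/order_event.json") as f:
    ORDER_CREATED_SCHEMA = json.load(f)

def parse_event(raw):
    try:
        event = json.loads(raw.decode("utf-8"))
        validate(instance=event, schema=ORDER_CREATED_SCHEMA)
        return event
    except (json.JSONDecodeError, ValidationError) as exc:
        # Log loudly and route the raw bytes to a dead-letter topic rather than crashing.
        print(f"Rejected event: {exc}")
        return None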
10. Replay and reprocessing
Concept: Treat the stream as the source of truth. Rebuild derived views by replaying from offsets.
Real-world use: A bug in a consumer is fixed; you replay the last 24 hours to regenerate correct aggregates.
Trade-offs: Reprocessing requires deterministic logic and often isolated side effects (e.g., write to temp tables first).
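Example: rewinding a consumer by timestamp (a sketch with kafka-python; the separate replay group and the 24-hour window are illustrative):
# replay_last_24h.py
import json
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=["localhost:9092"],
    group_id="analytics-replay",  # a separate group avoids disturbing live consumers
    enable_auto_commit=False,
    value_deserializer=lambda x: json.loads(x.decode("utf-8"))
)

partitions = [TopicPartition("orders", p) for p in consumer.partitions_for_topic("orders")]
consumer.assign(partitions)

# Find the offset closest to 24 hours ago in each partition and seek there.
start_ms = int((time.time() - 24 * 3600) * 1000)
offsets = consumer.offsets_for_times({tp: start_ms for tp in partitions})
for tp, ot in offsets.items():
    if ot is not None:
        consumer.seek(tp, ot.offset)
    else:
        consumer.seek_to_end(tp)  # nothing in the window for this partition

for msg in consumer:
    # Re-run the corrected logic here; writing to a staging table first makes
    # it easy to verify results before swapping them in.
    print(f"Replaying offset {msg.offset}: {msg.value}")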
Real-world case: multi-service event flow
Scenario: An e-commerce platform emits order events. Two consumers update different views: an analytics store and an inventory system.
Project structure
/streaming-demo
├── docker-compose.yml
├── schemas/
│ └── order_event.json
├── producer/
│ └── producer.py
├── analytics_consumer/
│ └── analytics.py
├── inventory_consumer/
│ └── inventory.py
└── Makefile
docker-compose.yml
version: "3.8"
services:
zookeeper:
image: confluentinc/cp-zookeeper:7.5.0
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000
kafka:
image: confluentinc/cp-kafka:7.5.0
depends_on:
- zookeeper
ports:
- "9092:9092"
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
schema_registry:
image: confluentinc/cp-schema-registry:7.5.0
depends_on:
- kafka
ports:
- "8081:8081"
environment:
SCHEMA_REGISTRY_HOST_NAME: schema_registry
SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:9092
producer/producer.py
import json, time
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    enable_idempotence=True
)

def safe_send(topic, event, max_retries=3):
    for attempt in range(max_retries):
        try:
            future = producer.send(topic, event)
            record_metadata = future.get(timeout=5)
            print(f"Sent to {record_metadata.topic}[{record_metadata.partition}]@{record_metadata.offset}")
            return
        except KafkaError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff

# Simulate order lifecycle
order_id = 1000
for i in range(10):
    safe_send("orders", {
        "event_id": f"evt-{order_id}-{i}",
        "type": "OrderCreated",
        "payload": {"order_id": order_id, "amount": 99.50}
    })
    time.sleep(0.25)

producer.flush()
analytics_consumer/analytics.py
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=["localhost:9092"],
    group_id="analytics-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda x: json.loads(x.decode("utf-8"))
)

print("Analytics consumer started...")
for msg in consumer:
    event = msg.value
    # Idempotent analytics update: upsert by event_id
    if event["type"] == "OrderCreated":
        print(f"[Analytics] Order {event['payload']['order_id']} revenue: {event['payload']['amount']}")
inventory_consumer/inventory.py
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=["localhost:9092"],
    group_id="inventory-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda x: json.loads(x.decode("utf-8"))
)

print("Inventory consumer started...")
for msg in consumer:
    event = msg.value
    # Deduplicate using event_id
    if event["type"] == "OrderCreated":
        print(f"[Inventory] Reserve for order {event['payload']['order_id']}")
Makefile
.PHONY: up down produce

up:
	docker-compose up -d

down:
	docker-compose down -v

produce:
	python producer/producer.py
This small demo illustrates how producers and consumers are decoupled, how group IDs isolate views, and how idempotent sends reduce risk.
Strengths, weaknesses, and trade-offs
Strengths:
- Resilience: Durable logs allow reprocessing and recovery.
- Scalability: Partitioning enables parallelism across consumers.
- Flexibility: Multiple independent consumers on the same stream.
- Auditability: Immutable history supports compliance and debugging.
Weaknesses:
- Complexity: State, offsets, and backpressure add operational overhead.
- Latency: Not a fit for strict request/response; expect near real-time, not synchronous responses.
- Cost: High throughput can mean high compute and storage costs if not tuned.
- Debugging: Distributed systems produce subtle bugs; observability is mandatory.
When to choose streaming:
- You need to react to data as it arrives.
- You want to decouple services without tight RPC coupling.
- Replayability matters (bug fixes, retraining, historical analysis).
- Events are valuable beyond the immediate consumer.
When to reconsider:
- You only need occasional batch loads; event-driven overhead isn’t justified.
- Strict request/response semantics are core; consider a service API.
- You can’t invest in observability and schema management.
Personal experience: lessons from the trenches
The most common mistake I’ve seen is treating streams like glorified log files, dumping whatever payload is handy. Without schema discipline, you’ll pay for it later. If you can’t enforce a schema at write time, at least validate early and log deserialization errors aggressively. Another trap is over-partitioning. Partitions aren’t free; they increase coordination and rebalance time. I once scaled a topic with modest traffic to 200 partitions, and group rebalances became a source of latency. A moderate number (e.g., 16–64) is often plenty until proven otherwise.
Backpressure has burned me twice: once by letting queues grow until the consumer crashed, and once by not respecting external API limits. Now, I add metrics for consumer lag and throttle writes when lag crosses thresholds. It’s easier to degrade gracefully than to recover from a cascade.
The moment streaming proved its worth was during the rollout of a new pricing engine. We introduced a bug that misclassified users. Because we stored raw events and had schemas, we fixed the logic and replayed the last six hours. No frantic manual corrections, no awkward customer service emails. That single replay saved the team days.
Getting started: workflow and mental models
Focus on these steps:
- Identify the events that represent your domain facts. Name them clearly (e.g., OrderCreated).
- Decide on schema strategy. Start with JSON Schema for simplicity, evolve to Avro/Protobuf as throughput demands.
- Choose a broker and managed components based on your cloud and team skills.
- Define consumer contracts. Use group IDs to separate views. Document expectations and SLAs.
- Instrument early: lag, error rates, throughput, and cost. Use dashboards (Grafana) and logs (a minimal lag-check sketch follows this list).
- Plan for failure: retries, dead-letter topics, idempotent side effects.
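For the lag metric specifically, here is a minimal check you can run by hand (a sketch with kafka-python; in production you would export this from your monitoring stack rather than a script):
# lag_check.py
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=["localhost:9092"],
    group_id="analytics-group",  # the group whose lag we want to inspect
    enable_auto_commit=False
)

partitions = [TopicPartition("orders", p) for p in consumer.partitions_for_topic("orders")]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed
    print(f"{tp.topic}[{tp.partition}] lag={lag}")

consumer.close()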
A typical folder structure for a small service:
/my-stream-service
├── docker-compose.yml
├── schemas/
│ └── event_name.json
├── src/
│ ├── producer/
│ │ └── main.py
│ └── consumer/
│ └── main.py
├── tests/
│ └── test_idempotency.py
├── ops/
│ └── dashboards.json
└── README.md
Mental model: Treat Kafka topics like database tables. Design schemas as API contracts. Consumers are projections. Producers are writers. Think about evolution from day one.
What makes this approach stand out
- Replay as a first-class feature: Reprocessing isn’t an edge case; it’s a superpower.
- Decoupling without chaos: Pub-sub lets teams evolve independently as long as contracts hold.
- Operational maturity: Once you’ve set up dashboards and alerting, incidents become manageable.
- Developer experience: Schema registries and transactional producers reduce subtle bugs and make deployments predictable.
In my experience, the biggest gains come from small, consistent practices: name events precisely, version them, write idempotent consumers, and monitor lag religiously. The stack matters less than the habits.
Free learning resources
- Apache Kafka Concepts (Confluent): https://docs.confluent.io/kafka/latest/introduction.html
- ksqlDB Documentation: https://docs.ksqldb.io/
- Apache Flink Documentation: https://nightlies.apache.org/flink/flink-docs-stable/
- Kafka Consumer Configuration: https://kafka.apache.org/documentation/#consumerconfigs
- JSON Schema Official Site: https://json-schema.org/
- Idempotency in Distributed Systems (Martin Fowler): https://martinfowler.com/articles/idempotency.html
These resources are practical and up-to-date; they’ll help you move from pattern ideas to production-grade implementations.
Summary: who should use streaming and who might skip it
Use streaming if:
- Your domain is event-centric (orders, clicks, sensor telemetry).
- You need low-latency updates and independent consumers.
- You value replayability, audit trails, and resilience.
- Your team can invest in observability and schema management.
Consider skipping or simplifying if:
- Workloads are small and batch-oriented.
- You lack operational capacity to monitor and maintain a broker.
- Strict request/response is the norm and latency requirements are tight.
- You cannot commit to schema discipline.
Streaming is not a silver bullet, but it’s a powerful lever for building responsive, maintainable systems. Start small: choose one event, one producer, one consumer. Measure lag, add replay tests, then expand. Patterns compound over time, and your architecture will evolve from a juggling act into a well-conducted orchestra.




