Log Aggregation Strategies for Cloud-Native Apps
As distributed systems grow, gathering logs efficiently becomes a critical engineering challenge, not just a DevOps afterthought.

When I first moved a monolith into Kubernetes, I thought logging would be simple. I was wrong. The first time we needed to debug a service mesh issue, I found myself exec’ing into pods, tailing logs locally, and hoping the logs hadn’t rotated out. It was a frantic, inefficient process that exposed a gap in our observability. We didn’t have a cohesive logging strategy, and in a cloud-native environment where ephemeral containers vanish in seconds, that strategy is the difference between minutes and hours of Mean Time to Resolution (MTTR).
The move to microservices, serverless functions, and orchestrated containers breaks traditional logging assumptions. Log lines are no longer in a single file on a single server; they are scattered across nodes, namespaces, and clouds. This article walks through the strategies and tools I’ve used to tame that chaos. We will look at architectural patterns, specific tools like Fluent Bit and OpenTelemetry, and practical code examples to build a resilient logging pipeline.
The Context: Why Centralized Logging Matters Now
In the early days of virtual machines, logging meant a cron job rotating a file on a disk, maybe shipping it to an NFS share. Today, cloud-native applications are dynamic. A single user request might traverse an API gateway, an authentication service, a payment processor, and a database—each running in its own container across multiple availability zones. When that request fails, there is no single log file to check.
Centralized log aggregation is the practice of collecting log data from various sources, parsing it into a structured format, and storing it in a central location for analysis. It fits into the "Three Pillars of Observability" (logs, metrics, and traces). While metrics tell you how many requests are failing, and traces tell you where in the call stack the latency is occurring, logs tell you why.
Who uses this? Every engineering team building distributed systems. From startups deploying Node.js microservices on EKS to enterprises running legacy Java workloads on Azure Kubernetes Service (AKS), the requirement is universal. Compared to alternatives like local file logging or simple cloud provider logging (e.g., AWS CloudWatch Logs without subscriptions), a dedicated aggregation strategy offers better searchability, correlation, and long-term retention capabilities.
Architectural Patterns for Cloud-Native Logging
There are three main architectural patterns for log aggregation in containerized environments. Choosing the right one depends on your scale, latency tolerance, and infrastructure constraints.
1. Sidecar Pattern
In the sidecar pattern, a dedicated logging container runs alongside your application container within the same Pod (in Kubernetes terms). The application writes logs to stdout or a file, and the sidecar collects, processes, and forwards them.
- Pros: Isolation. If the logging agent crashes, it doesn’t bring down the app. You can update the logging config without redeploying the application.
- Cons: Resource overhead. Every pod gets its own logging agent, which adds CPU and memory consumption.
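To make the sidecar pattern concrete, here is a minimal sketch of a Pod manifest. The image names, mount paths, and the assumption that the app writes to a file (rather than stdout) are all illustrative placeholders, not a recommended production setup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar   # hypothetical name
spec:
  containers:
    - name: app
      image: my-app:latest          # placeholder image; assumed to write to /var/log/app
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-forwarder           # the sidecar: reads, processes, forwards
      image: fluent/fluent-bit:2.1.0
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: app-logs
      emptyDir: {}                  # shared scratch volume; logs die with the Pod
```

The shared `emptyDir` volume is what couples the two containers: the app writes, the sidecar tails, and each can be configured (or crash) independently.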
2. Node-Level Agent Pattern
A single logging agent runs as a DaemonSet (in Kubernetes) or a system service on every node in the cluster. It collects logs from all containers on that node.
- Pros: Efficient resource usage. One agent per node handles all logs.
- Cons: Coupled lifecycle. Upgrading the agent might require node updates. Configuration changes affect all pods on the node.
3. Direct Ship Pattern
The application itself sends logs directly to the logging backend (e.g., via an HTTP client).
- Pros: No extra infrastructure agents.
- Cons: Tight coupling. Libraries and network latency are injected into the application. Hard to standardize across polyglot environments.
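To show why direct shipping injects coupling, here is a hedged Go sketch. The `shipLog` helper and its JSON field names are hypothetical, and a local `httptest` server stands in for a real backend (Loki, Elasticsearch, or a vendor ingest API, each of which defines its own format):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// shipLog POSTs one structured log record to a logging backend.
// Note that network errors and backend latency now live inside
// the application's request path -- the core tradeoff of this pattern.
func shipLog(client *http.Client, endpoint, level, msg string) error {
	record := map[string]string{
		"level":     level,
		"message":   msg,
		"timestamp": time.Now().UTC().Format(time.RFC3339),
	}
	body, err := json.Marshal(record)
	if err != nil {
		return err
	}
	resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("log backend returned %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// A local test server stands in for the real logging backend.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var rec map[string]string
		json.NewDecoder(r.Body).Decode(&rec)
		fmt.Printf("received level=%s message=%q\n", rec["level"], rec["message"])
	}))
	defer backend.Close()

	if err := shipLog(backend.Client(), backend.URL, "error", "payment timeout"); err != nil {
		fmt.Println("ship failed:", err)
	}
}
```

Every service in a polyglot fleet needs an equivalent of `shipLog`, which is exactly the standardization problem the agent patterns avoid.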
For most Kubernetes clusters, the Node-Level Agent pattern (via DaemonSets) is the standard recommendation for cost efficiency and ease of management.
The Stack: Fluent Bit, OpenTelemetry, and Object Storage
Modern logging stacks have evolved from the heavy ELK (Elasticsearch, Logstash, Kibana) stack to lighter, more modular architectures.
- Fluent Bit: A super-fast, lightweight log processor and forwarder. It runs as a DaemonSet, collects logs from /var/log/containers/, parses them, and sends them to backends like S3 or Loki. It is a CNCF sub-project of the Fluentd ecosystem and has replaced the heavier Fluentd for many use cases.
- OpenTelemetry (OTel): While primarily known for traces, OTel has a Logs API and SDK. It provides a vendor-neutral standard for generating and collecting logs, which is crucial for avoiding vendor lock-in.
- Object Storage (S3/GCS): For cost-effective, long-term retention. Search is slower, but it’s perfect for compliance and audit trails.
- Query Engines (Grafana Loki): For active debugging. Loki indexes only metadata (labels), not the full text, making it cheaper and faster to query than Elasticsearch for many log use cases.
Practical Setup: Building a Fluent Bit Pipeline on Kubernetes
Let’s build a realistic logging pipeline using a Node-Level Agent (Fluent Bit) shipping logs to an S3 bucket for archival. We will use a Kubernetes manifest to deploy Fluent Bit as a DaemonSet.
1. Project Structure
A typical infrastructure-as-code repository for logging might look like this:
logging-infra/
├── fluent-bit/
│ ├── config-map.yaml
│ ├── daemonset.yaml
│ └── service-account.yaml
├── kustomization.yaml
└── README.md
2. Fluent Bit Configuration
We need to configure Fluent Bit to read logs from Docker containers (using the tail input plugin) and ship them to S3 (using the s3 output plugin).
The configuration is defined in a Kubernetes ConfigMap. We will parse the standard Docker log format (JSON) and add metadata (Kubernetes labels) to the log records so we can filter them later.
fluent-bit/config-map.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             1
        Daemon            off
        Log_Level         info
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name              s3
        Match             *
        bucket            my-app-logs-bucket
        region            us-east-1
        total_file_size   100M
        upload_timeout    10m
        use_put_object    On
        s3_key_format     /logs/%Y/%m/%d/%H_%M_%S_$UUID.gz
        compression       gzip

  parsers.conf: |
    [PARSER]
        Name         docker
        Format       json
        Time_Key     time
        Time_Format  %Y-%m-%dT%H:%M:%S.%L
        Time_Keep    On
3. The DaemonSet Deployment
This manifest deploys Fluent Bit on every node. Note the hostPath mounts; these allow Fluent Bit to access the node's log directory and kernel logs.
fluent-bit/daemonset.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    k8s-app: fluent-bit-logging
    version: v1
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1.0
          imagePullPolicy: Always
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
The Evolution: OpenTelemetry and Context Propagation
While Fluent Bit is excellent for log shipping, the industry is moving toward OpenTelemetry (OTel). OTel unifies logs, metrics, and traces. In the past, logs were just strings of text. Today, we want logs to be correlated with traces.
If a request fails, we want to click on the trace ID in our observability dashboard and see all logs associated with that specific request.
Implementing OTel in a Go Application
Here is a simple Go application configured to emit structured JSON logs (so the Fluent Bit parser can handle them) that include the current trace ID.
package main

import (
	"context"
	"log"
	"net/http"
	"os"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// Logger wraps the standard logger to inject trace context.
type Logger struct {
	logger *log.Logger
}

func (l *Logger) Info(ctx context.Context, msg string) {
	span := trace.SpanFromContext(ctx)
	traceID := span.SpanContext().TraceID().String()
	// In a real OTel setup, you'd use the slog or zap OTel handlers.
	// Here we emit structured JSON by hand for the Fluent Bit parser.
	l.logger.Printf(`{"level":"info","trace_id":"%s","message":"%s"}`, traceID, msg)
}

func main() {
	logger := &Logger{logger: log.New(os.Stdout, "", 0)}

	// Set up a basic tracer provider so spans carry real trace IDs.
	// In production, configure an exporter to an OTel Collector.
	tp := sdktrace.NewTracerProvider()
	defer tp.Shutdown(context.Background())
	otel.SetTracerProvider(tp)
	tracer := otel.Tracer("example-app")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		ctx, span := tracer.Start(r.Context(), "handle-request")
		defer span.End()

		logger.Info(ctx, "Request received")
		// Simulate work
		w.Write([]byte("Hello World"))
	})

	log.Println("Server listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Why this matters:
When this application runs in Kubernetes, the logs are emitted as JSON. Fluent Bit reads this JSON, extracts the trace_id field, and can forward it to the backend. If you are using a tool like Grafana Tempo or Jaeger, you can now link the logs to the distributed trace.
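For illustration, a record arriving at the backend might look roughly like this after Fluent Bit's Kubernetes filter has merged the parsed JSON under log_processed. The field values below are made up, and the exact shape depends on your filter settings:

```json
{
  "log_processed": {
    "level": "info",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "message": "Request received"
  },
  "kubernetes": {
    "pod_name": "example-app-7d4b9c-x2x1z",
    "namespace_name": "default",
    "labels": { "app": "example-app" }
  }
}
```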
Honest Evaluation: Strengths, Weaknesses, and Tradeoffs
No logging strategy is perfect. It’s a series of tradeoffs based on volume, cost, and query needs.
Strengths:
- Resilience: In the node-agent pattern, logs are buffered locally if the network fails, preventing backpressure on the application.
- Structured Data: Moving away from regex parsing (common in Syslog) to JSON parsing (common in Fluent Bit) drastically reduces CPU usage and parsing errors.
- Correlation: With OTel, logs are no longer isolated events but part of a request lifecycle.
Weaknesses:
- Cost: Storage costs for high-volume logs can spiral. Indexing every word (like in Elasticsearch) is expensive.
- Complexity: Managing a logging pipeline is an engineering task in itself. Upgrading Fluent Bit versions across a cluster requires automation (e.g., ArgoCD or Helm).
- Latency: Batched logs (waiting to upload to S3) can introduce a delay in log visibility.
When to skip this: If you are building a small monolith or a low-traffic application, managed services like AWS CloudWatch Logs or Azure Monitor might be sufficient. Setting up a custom Fluent Bit pipeline is overkill for applications that don't yet have dedicated DevOps resources.
Personal Experience: Lessons from the Trenches
I recall a production incident where a payment service was timing out sporadically. We had logs, but they were unstructured text. Searching for "timeout" returned millions of hits from background cron jobs, masking the actual issue.
We paused and implemented a structured logging schema across the team. Every log line got a correlation_id. We updated our Fluent Bit parser to extract this ID. The next time the issue occurred, we filtered by the specific ID and watched the request flow through the system in real-time. The culprit was a slow third-party API call that wasn't respecting our timeout settings.
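A sketch of that schema change in Go: the middleware, the X-Correlation-ID header name, and the JSON field names are illustrative conventions we chose, not a standard. A local `httptest` server drives the handler so the example is self-contained:

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"os"
)

type ctxKey string

const correlationKey ctxKey = "correlation_id"

// withCorrelationID reuses an incoming X-Correlation-ID header or mints
// a new one, so every log line for a request shares the same ID.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			b := make([]byte, 8)
			rand.Read(b)
			id = hex.EncodeToString(b)
		}
		ctx := context.WithValue(r.Context(), correlationKey, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// logWithID emits a JSON log line carrying the request's correlation ID,
// which a Fluent Bit JSON parser can then extract as a filterable field.
func logWithID(ctx context.Context, logger *log.Logger, msg string) {
	id, _ := ctx.Value(correlationKey).(string)
	logger.Printf(`{"correlation_id":"%s","message":"%s"}`, id, msg)
}

func main() {
	logger := log.New(os.Stdout, "", 0)
	handler := withCorrelationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		logWithID(r.Context(), logger, "charging card")
		w.WriteHeader(http.StatusNoContent)
	}))

	srv := httptest.NewServer(handler)
	defer srv.Close()

	req, _ := http.NewRequest("GET", srv.URL, nil)
	req.Header.Set("X-Correlation-ID", "abc123")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
}
```

Because the upstream ID is propagated when present, the same `correlation_id` appears in logs across every service the request touches.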
Common mistakes I see:
- Logging sensitive data: Never log headers, tokens, or PII. Use filters in Fluent Bit to scrub data before it leaves the cluster.
- Over-logging: Writing a log line for every loop iteration creates noise and burns CPU cycles. Log state changes, not flow control.
- Ignoring retention policies: Keeping logs in hot storage (like Elasticsearch) forever is a financial ticking time bomb. Always tier your storage: hot (7 days), warm (30 days), cold/archive (S3/GCS for years).
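On the first mistake, scrubbing can be done in the agent itself. As a sketch, Fluent Bit's record_modifier filter drops fields by name before any output runs; the field names below are illustrative and should match your own log schema:

```yaml
# Drop common sensitive fields before logs leave the cluster.
[FILTER]
    Name          record_modifier
    Match         *
    Remove_key    authorization
    Remove_key    password
    Remove_key    credit_card
```

This is a safety net, not a substitute for never logging secrets in the first place.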
Getting Started: Workflow and Mental Models
To get started, don't try to boil the ocean. Start with a mental model of "Collect, Parse, Route."
- Standardize Output: Configure your applications to log JSON to stdout.
- Collect: Deploy a lightweight agent (Fluent Bit) on your nodes.
- Parse: Use the agent to parse the JSON and add Kubernetes metadata (pod name, namespace).
- Route: Send high-value logs (errors, audit trails) to a queryable store (Loki/Elasticsearch) and bulk logs to cheap storage (S3).
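One hedged way to express that routing split in Fluent Bit is a rewrite_tag filter that re-tags error-level records for the hot store, with one output per destination. The Loki host, label values, and bucket name here are assumptions to adapt to your environment:

```yaml
# Re-tag error/fatal records (parsed into log_processed by the
# kubernetes filter) so they can be matched separately.
[FILTER]
    Name    rewrite_tag
    Match   kube.*
    Rule    $log_processed['level'] ^(error|fatal)$ hot.$TAG false

# High-value logs go to a queryable store.
[OUTPUT]
    Name    loki
    Match   hot.*
    host    loki.logging.svc       # hypothetical in-cluster service
    labels  job=fluent-bit

# Everything else goes to cheap archival storage.
[OUTPUT]
    Name    s3
    Match   kube.*
    bucket  my-app-logs-bucket
    region  us-east-1
```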
Tooling Recommendations:
- Helm: Use the fluent/fluent-bit Helm chart for easy deployment.
- VS Code: Install the YAML extension for validating Kubernetes manifests.
- Local Testing: Use minikube or kind to simulate a cluster. Run a busybox pod that generates logs to verify your Fluent Bit configuration before deploying to production.
Free Learning Resources
The ecosystem moves fast, but these resources provide solid foundations:
- Fluent Bit Documentation: fluentbit.io/documentation - The official docs are excellent, specifically the "Administration Guide" for Kubernetes.
- CNCF Whitepapers: The "Cloud Native Logging" whitepaper by the CNCF Technical Oversight Committee provides vendor-neutral architectural guidance.
- OpenTelemetry Spec: opentelemetry.io - Understand the data model for logs to future-proof your implementation.
- Grafana Loki Docs: If you choose the Loki stack, their documentation on log aggregation patterns is top-notch.
Summary: Who Should Use This?
Use this strategy if:
- You are running more than a handful of microservices.
- You need to comply with data retention regulations (GDPR, SOC2).
- Debugging production issues currently requires SSH access to running containers.
- You want to move toward a unified observability platform (logs + traces + metrics).
Skip this (or simplify) if:
- You have a monolithic architecture with a few static servers.
- You are prototyping an MVP and logging to a local file is sufficient.
- You are fully committed to a managed cloud provider's logging solution (CloudWatch, Azure Monitor) and have no requirements for long-term archival or cross-cloud portability.
Log aggregation is not about collecting data for the sake of it. It’s about reducing uncertainty. When the system breaks at 3 AM, the quality of your logging strategy dictates how fast you go back to sleep.