Infrastructure Resilience Patterns

16 min read · DevOps and Infrastructure · Intermediate

Why Resilience Matters When the Cloud Feels Like It’s Built on Sand

[Image: a server rack with neatly routed cables, representing physical and logical infrastructure layers and redundancy considerations.]

I’ve lost count of the times I’ve seen a deployment go green in the CI pipeline, only for the production health check to start flapping five minutes later due to a misbehaving dependency. It’s rarely the core logic that fails; it’s the network partition, the overloaded database, or the third-party API that decides to go on vacation. In modern distributed systems, the infrastructure isn’t just a platform; it’s an active participant in the application’s behavior. Resilience patterns are the guardrails that keep your service running when the foundation shifts beneath it.

In this post, I’ll walk through practical patterns for infrastructure resilience, focusing on patterns you can implement today using familiar tools like Go, Kubernetes, and Terraform. We’ll look at how these patterns behave in real scenarios, where they shine, and where they add unnecessary complexity. I’ll share personal observations from projects where ignoring a pattern cost us, and where applying one saved the day. You’ll come away with a mental model for choosing the right resilience strategy for your context, plus a starter project structure you can adapt.

Where Infrastructure Resilience Fits Today

Most of us deploy to Kubernetes or serverless platforms, with stateful workloads in managed databases and caches. The boundary between “application” and “infrastructure” is blurrier than ever. You configure pod disruption budgets, set up Istio virtual services, and define autoscaling rules in Terraform. Resilience patterns apply across this boundary: they address failures in compute, storage, networking, and external dependencies.

Who uses these patterns? Platform teams, SREs, and backend developers who care about production stability. In practice, you’ll see them in:

  • Kubernetes manifests with liveness/readiness probes and pod anti-affinity rules.
  • Terraform modules that provision multi-AZ databases and read replicas.
  • Application code implementing retries with backoff and circuit breakers.
  • Pipeline stages validating failover scenarios in staging.

Compared to alternatives, infrastructure resilience differs from high-availability architectures that rely purely on hardware redundancy, and it’s more operational than pure chaos engineering. The goal isn’t just “five nines” on paper; it’s graceful degradation under real-world unpredictability. When done well, these patterns reduce alert fatigue and let you sleep through minor blips.

Core Patterns and Practical Implementation

Health Checks and Readiness Gates

Kubernetes readiness and liveness probes are a baseline. Readiness gates extend readiness to external conditions, like whether a database migration has completed. One caveat: a gate's custom condition must be set by something outside the pod, such as a controller or a job patching the pod's status, because the kubelet only evaluates the condition, it never sets it. In one project, we had a service that started accepting traffic before its Redis cache was warmed, causing a spike in latency. Using a readiness gate tied to a warming job fixed it.

Example Kubernetes readiness gate configuration:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-readiness-gate
spec:
  readinessGates:
    - conditionType: "db-migrated"
  containers:
    - name: app
      image: myapp:1.2.0
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
      env:
        - name: DB_MIGRATED_FILE
          value: "/tmp/db-migrated"

In your application, you can expose a readiness endpoint that checks for the migration flag:

package main

import (
    "database/sql"
    "net/http"
    "os"

    _ "github.com/lib/pq" // or your database driver of choice
)

func readyHandler(db *sql.DB) http.HandlerFunc {
    marker := os.Getenv("DB_MIGRATED_FILE") // set in the pod spec above
    if marker == "" {
        marker = "/tmp/db-migrated"
    }
    return func(w http.ResponseWriter, r *http.Request) {
        // Check if the migration marker exists
        if _, err := os.Stat(marker); err != nil {
            http.Error(w, "migration not complete", http.StatusServiceUnavailable)
            return
        }

        // Verify the DB is reachable
        if err := db.Ping(); err != nil {
            http.Error(w, "db not reachable", http.StatusServiceUnavailable)
            return
        }

        w.WriteHeader(http.StatusOK)
    }
}

func main() {
    db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }
    http.Handle("/health/ready", readyHandler(db))
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}

Fun fact: The readiness gate was introduced to solve ordering problems for workloads that depend on external systems. It decouples “container running” from “service ready,” which matters when your app can start but shouldn’t receive traffic yet.

Circuit Breakers for External Dependencies

Circuit breakers protect your service from cascading failures when an external dependency is struggling. Instead of piling on requests, the breaker trips and fails fast, giving the downstream system time to recover. In Go, libraries like gobreaker are straightforward to use.

Here’s a realistic pattern wrapping an external HTTP client:

package main

import (
    "context"
    "errors"
    "fmt"
    "net/http"
    "time"

    "github.com/sony/gobreaker"
)

func newCircuitBreaker() *gobreaker.CircuitBreaker {
    return gobreaker.NewCircuitBreaker(gobreaker.Settings{
        Name:        "payment-api",
        MaxRequests: 3,
        Interval:    10 * time.Second,
        Timeout:     30 * time.Second,
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            // Trip once at least 5 requests have been made and 50% failed
            return counts.Requests >= 5 && counts.TotalFailures*2 >= counts.Requests
        },
    })
}

func callPaymentAPI(ctx context.Context, cb *gobreaker.CircuitBreaker, url string) (string, error) {
    body, err := cb.Execute(func() (interface{}, error) {
        req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 500 {
            return nil, errors.New("server error")
        }
        // In real usage, read and return body
        return "ok", nil
    })
    if err != nil {
        return "", err
    }
    return body.(string), nil
}

func main() {
    cb := newCircuitBreaker()
    ctx := context.Background()
    result, err := callPaymentAPI(ctx, cb, "https://api.example.com/pay")
    if err != nil {
        fmt.Printf("call failed: %v\n", err)
        return
    }
    fmt.Println("result:", result)
}

This pattern works best when you have clear failure semantics and can set meaningful thresholds. In production, combine circuit breakers with metrics and alerts so you know when they trip and why.

Retries with Backoff and Jitter

Retries are a double-edged sword: they help with transient failures but can amplify load if misconfigured. Always pair retries with exponential backoff and jitter to avoid thundering herds. The Go standard library doesn’t include a retry helper, so teams often roll their own or use a small utility.

Here’s a pragmatic retry wrapper:

package main

import (
    "context"
    "fmt"
    "math/rand"
    "time"
)

func retry(ctx context.Context, maxAttempts int, initialDelay time.Duration, fn func() error) error {
    var err error
    delay := initialDelay
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        err = fn()
        if err == nil {
            return nil
        }
        if attempt == maxAttempts {
            break // don't sleep after the final attempt
        }
        // jitter reduces synchronized retries across instances
        jitter := time.Duration(rand.Int63n(int64(delay))) //nolint:gosec
        select {
        case <-time.After(delay + jitter):
            delay *= 2
        case <-ctx.Done():
            return fmt.Errorf("context cancelled: %w", err)
        }
    }
    return fmt.Errorf("after %d attempts: %w", maxAttempts, err)
}

func main() {
    ctx := context.Background()
    err := retry(ctx, 5, 100*time.Millisecond, func() error {
        // Simulate a flaky call
        return fmt.Errorf("temporary failure")
    })
    if err != nil {
        fmt.Println("retry failed:", err)
    }
}

Use retries judiciously. For non-idempotent operations, consider compensating transactions or saga patterns instead.

Bulkheads for Resource Isolation

Bulkheads isolate resources so failures in one area don’t consume all capacity. In Go, you can use semaphores or worker pools to limit concurrency for specific operations. In Kubernetes, resource requests/limits and pod anti-affinity provide bulkheads at the infrastructure level.

Here’s a simple worker pool that isolates external API calls from core processing:

package main

import (
    "context"
    "fmt"
    "sync"
)

func processWithPool(ctx context.Context, tasks []string, maxWorkers int) []error {
    var wg sync.WaitGroup
    taskCh := make(chan string)
    errCh := make(chan error, len(tasks))

    // Start workers
    for i := 0; i < maxWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                select {
                case <-ctx.Done():
                    return
                case task, ok := <-taskCh:
                    if !ok {
                        return
                    }
                    if err := callExternalAPI(task); err != nil {
                        errCh <- err
                    }
                }
            }
        }()
    }

    // Send tasks; stop early if the context is cancelled so this
    // goroutine doesn't leak after the workers have exited
    go func() {
        defer close(taskCh)
        for _, t := range tasks {
            select {
            case taskCh <- t:
            case <-ctx.Done():
                return
            }
        }
    }()

    wg.Wait()
    close(errCh)

    var errs []error
    for e := range errCh {
        errs = append(errs, e)
    }
    return errs
}

func callExternalAPI(task string) error {
    // Simulate external call
    if task == "bad" {
        return fmt.Errorf("api failure for task %s", task)
    }
    return nil
}

func main() {
    errs := processWithPool(context.Background(), []string{"a", "bad", "c"}, 2)
    fmt.Println("errors:", len(errs))
}

This ensures heavy API usage doesn’t starve CPU-bound processing. In Kubernetes, pair with horizontal pod autoscaling tuned to the bulkhead’s resource profile.

Rate Limiting and Quotas

When integrating with third-party APIs, rate limiting protects both sides. Use token bucket or leaky bucket algorithms and apply limits per route or per user. In Go, golang.org/x/time/rate is a reliable choice.

package main

import (
    "net/http"

    "golang.org/x/time/rate"
)

type limiter struct {
    r *rate.Limiter
}

func newLimiter(r rate.Limit, b int) *limiter {
    return &limiter{
        r: rate.NewLimiter(r, b),
    }
}

func (l *limiter) middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Wait blocks until a token is available or the request's
        // context is cancelled
        if err := l.r.Wait(r.Context()); err != nil {
            http.Error(w, "rate limited", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}

In Kubernetes, you can enforce egress rate limits at the network layer using service meshes, but application-level limits give you more control over user experience.

Graceful Shutdown

A pod terminated during a deployment can drop in-flight requests if the process exits abruptly. In Go, trap SIGTERM and wait for active work to complete. Kubernetes sends SIGTERM before killing a pod and waits up to terminationGracePeriodSeconds before following up with SIGKILL.

package main

import (
    "context"
    "fmt"
    "net/http"
    "os"
    "os/signal"
    "sync"
    "syscall"
    "time"
)

type server struct {
    srv    *http.Server
    active sync.WaitGroup
}

func (s *server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    s.active.Add(1)
    defer s.active.Done()

    // Simulate work
    time.Sleep(200 * time.Millisecond)
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "OK")
}

func main() {
    s := &server{}
    s.srv = &http.Server{
        Addr:    ":8080",
        Handler: s,
    }

    // Start server
    go func() {
        if err := s.srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            fmt.Printf("server error: %v\n", err)
        }
    }()

    // Handle graceful shutdown
    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
    <-stop

    fmt.Println("shutting down...")
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    if err := s.srv.Shutdown(ctx); err != nil {
        fmt.Printf("shutdown error: %v\n", err)
    }

    // Wait for in-flight requests
    s.active.Wait()
    fmt.Println("shutdown complete")
}

In Kubernetes, set terminationGracePeriodSeconds long enough for your app to finish active work, but not so long that rollouts drag.

Pod Disruption Budgets and Anti-Affinity

For stateful workloads, pod disruption budgets (PDBs) prevent voluntary disruptions from taking down too many replicas at once. Anti-affinity ensures pods aren’t co-located on the same node or zone.

Example PDB and anti-affinity:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - myapp
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: app
          image: myapp:1.2.0

This pairs well with HPA so you maintain capacity during node drains or zone failures.

Database Failover and Read Replicas

Stateful services need explicit failover strategies. Managed databases (e.g., AWS RDS, GCP Cloud SQL) offer multi-AZ failover; you can add application-level logic to route reads to replicas and writes to the primary. Use connection pools with automatic reconfiguration on failover.

Here’s a simple pattern that detects primary changes (conceptual):

package main

import (
    "context"
    "database/sql"
    "sync"
    "time"

    _ "github.com/lib/pq"
)

type dbPool struct {
    mu       sync.RWMutex
    primary  *sql.DB
    replica  *sql.DB
    detectFn func() (string, error) // returns primary DSN
}

func (p *dbPool) getPrimary() *sql.DB {
    p.mu.RLock()
    defer p.mu.RUnlock()
    return p.primary
}

func (p *dbPool) refresh() {
    dsn, err := p.detectFn()
    if err != nil {
        return
    }
    newPrimary, err := sql.Open("postgres", dsn)
    if err != nil {
        return
    }
    p.mu.Lock()
    if p.primary != nil {
        p.primary.Close()
    }
    p.primary = newPrimary
    p.mu.Unlock()
}

func (p *dbPool) startRefresher(ctx context.Context) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            p.refresh()
        }
    }
}

func main() {
    // Wiring sketch: detectFn would query your cloud provider's API or
    // a service registry; the DSN here is a placeholder.
    pool := &dbPool{
        detectFn: func() (string, error) {
            return "postgres://app@primary:5432/app?sslmode=disable", nil
        },
    }
    pool.refresh()
    go pool.startRefresher(context.Background())
    _ = pool.getPrimary() // hand this to your query layer
}

In practice, rely on your cloud provider’s failover DNS and connection string endpoints, but keep your app resilient to temporary unavailability.

Idempotency and Exactly-Once Semantics

In distributed systems, retries can cause duplicates. Use idempotency keys for writes and deduplicate at the consumer. For message queues like Kafka or SQS, design your consumers to handle repeated messages safely.

Example idempotency middleware for HTTP:

package main

import (
    "fmt"
    "net/http"
    "sync"
)

// idempotencyStore is in-memory for illustration; in production, back
// it with Redis or a database so it survives restarts and is shared
// across replicas.
type idempotencyStore struct {
    mu    sync.Mutex
    cache map[string]struct{}
}

func newIdempotencyStore() *idempotencyStore {
    return &idempotencyStore{cache: make(map[string]struct{})}
}

// firstSeen records the key and reports whether this is its first use.
// Checking and marking under one lock avoids the race where two
// concurrent requests with the same key both pass a separate seen() check.
func (s *idempotencyStore) firstSeen(key string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.cache[key]; ok {
        return false
    }
    s.cache[key] = struct{}{}
    return true
}

func idempotencyMiddleware(store *idempotencyStore, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        key := r.Header.Get("Idempotency-Key")
        if key == "" {
            next.ServeHTTP(w, r)
            return
        }
        if !store.firstSeen(key) {
            w.WriteHeader(http.StatusOK)
            fmt.Fprintln(w, "duplicate")
            return
        }
        next.ServeHTTP(w, r)
    })
}

Combine this with retries to achieve safer “at-least-once” behavior with deduplication.

Chaos Engineering and Game Days

Resilience patterns must be validated. Chaos engineering introduces controlled failures to test resilience. Start small: introduce latency or kill a pod during a game day, observe alerts, and verify failover behavior. Tools like Chaos Mesh or Litmus can help, but manual experiments are just as valuable in early stages.

In one project, we ran a game day simulating a regional database failover. The application handled it well except for one background job that didn’t retry on connection errors. That job caused a backlog that took hours to clear. The fix was simple: add retries with backoff and a circuit breaker. Since then, game days are a scheduled routine, not an ad-hoc panic.

Honest Evaluation: Strengths, Weaknesses, and Tradeoffs

Strengths:

  • When applied thoughtfully, resilience patterns reduce outage duration and blast radius. They often cost less than the downtime they prevent.
  • They encourage better observability: health checks, metrics, and logs become first-class concerns.
  • Many patterns are language-agnostic and fit seamlessly into Kubernetes, Terraform, and service meshes.

Weaknesses:

  • Over-engineering is common. A circuit breaker on every call can hide design issues and add latency.
  • Some patterns increase complexity in development and testing, especially around idempotency and distributed state.
  • Tuning thresholds (e.g., retry counts, circuit breaker trip conditions) requires data and iteration.

Tradeoffs:

  • Retries vs. load: Retries help transient errors but can amplify load. Use backoff and jitter, and limit retries to idempotent operations.
  • Bulkheads vs. efficiency: Isolating resources protects stability but can reduce utilization. Balance with autoscaling.
  • Graceful shutdown vs. rollout speed: Longer grace periods protect in-flight requests but slow deployments. Tune per service profile.

When to use:

  • Use health checks and graceful shutdown everywhere. They’re low-cost and high-value.
  • Circuit breakers are great for calls to unreliable external APIs.
  • Bulkheads and rate limiting are essential for multi-tenant services or third-party integrations.
  • PDBs and anti-affinity are important for stateful workloads or SLA-bound services.

When not to use:

  • Avoid retries for non-idempotent operations without compensating logic.
  • Don’t add complex resilience patterns to a simple internal service with trivial failure modes. It’s okay to keep things straightforward.

Personal Experience: Learning Curves and Gotchas

The biggest lesson for me was that resilience starts with good observability. In a past project, we implemented retries everywhere, only to discover we had no visibility into retry storms. The fix wasn’t removing retries; it was adding metrics: counters for attempts, successes, failures, and durations. Once we could see the shape of retries, tuning became practical.

Another common mistake is conflating readiness and liveness. Marking a pod unhealthy when a non-critical dependency is down can cause cascading restarts. Use liveness to detect deadlocks and readiness to gate traffic. In one service, we had an external rate limiter dependency; if it went down, readiness correctly blocked traffic without restarting the app, preserving state and avoiding restart storms.

Finally, graceful shutdown is not a nice-to-have. During a rolling update, we didn’t respect termination grace periods and saw dropped orders. Setting terminationGracePeriodSeconds to 30 seconds and adding signal handling eliminated the drops. Small changes, big impact.

Getting Started: Workflow and Project Structure

Start by identifying critical paths and dependencies. For each external call, decide if it needs a retry, circuit breaker, or rate limiter. For compute, add health checks and graceful shutdown. For state, plan failover and idempotency. Validate with game days.

A minimal project structure for a Go service on Kubernetes might look like this:

/app
├── cmd
│   └── server
│       └── main.go
├── internal
│   ├── api
│   │   └── handler.go
│   ├── resilience
│   │   ├── breaker.go
│   │   ├── retry.go
│   │   └── idempotency.go
│   └── db
│       └── pool.go
├── Dockerfile
├── go.mod
├── go.sum
└── k8s
    ├── deployment.yaml
    ├── pdb.yaml
    └── service.yaml

Workflow mental model:

  • Define SLOs for critical endpoints: latency and availability.
  • Implement health endpoints: /health/live and /health/ready.
  • Add resilience wrappers for external calls (retry, circuit breaker).
  • Configure Kubernetes probes, resource requests/limits, and PDBs.
  • Run a game day: inject latency, kill pods, failover DB. Observe and iterate.
  • Instrument with metrics: request rate, error rate, retry counts, breaker state.


Summary and Takeaway

Infrastructure resilience patterns are about building services that degrade gracefully rather than failing catastrophically. They apply across the stack: from application code to Kubernetes manifests to Terraform modules. In modern environments, these patterns are essential because the network is unreliable, dependencies are distributed, and users expect continuity.

Who should use these patterns:

  • Backend developers and platform teams building services with external dependencies or strict availability expectations.
  • SREs responsible for operational stability and incident response.
  • Teams deploying on Kubernetes or serverless where infrastructure is part of the failure domain.

Who might skip:

  • Teams building single-tenant internal tools with minimal dependencies and low SLAs.
  • Prototypes and MVPs where observability and fast iteration are more important than resilience polish. You can add patterns later as you scale.

The takeaway: Start with health checks and graceful shutdown. Measure your failure modes. Add retries, circuit breakers, and rate limiting where they address real risks. Validate with game days. Resilience isn’t about perfection; it’s about reducing the blast radius and giving your system room to recover.