Kubernetes Autoscaling Configuration
Why smart scaling matters as your applications face unpredictable traffic and tight budgets

You have probably been there. Your service hums along fine during the day, then a marketing campaign lands, or a partner integration flips on, and requests spike. You scramble to add capacity, only to have things quiet down an hour later while you’re left paying for idle nodes. Autoscaling sounds like magic, but in practice it is a series of configuration decisions with real tradeoffs. It’s the difference between waking up to an incident and waking up to a healthy dashboard.
I have spent enough nights watching dashboards and tweaking resource requests to know that autoscaling is not a set-and-forget switch. It is a feedback loop that you design, tune, and revisit. In the Kubernetes world, the tooling is mature, but the defaults rarely match your workload without thoughtful configuration. This post walks through the concepts, the knobs you can turn, and the patterns I have used in production to keep systems responsive without breaking the budget.
Where autoscaling fits in modern Kubernetes teams
Kubernetes autoscaling is the combination of workload-level scaling (Horizontal Pod Autoscaler, Vertical Pod Autoscaler) and cluster-level scaling (Cluster Autoscaler). You use it when your application’s demand is variable and you want to align resource usage with actual need. In real-world projects, it’s common to see a mix: web services using the Horizontal Pod Autoscaler, batch jobs or stateful workloads experimenting with the Vertical Pod Autoscaler, and node pools scaled automatically by the Cluster Autoscaler.
Who benefits? Platform engineers who want stable clusters, developers who want faster response times, and finance teams who want predictable spend. Compared to alternatives like static capacity planning or manual scaling scripts, autoscaling shifts decisions from guesswork to metrics and policies. Compared to cloud provider auto scaling groups without Kubernetes awareness, the Cluster Autoscaler is Kubernetes-native and understands pod-to-node placement, which matters when pods are pending due to resource fragmentation.
In practice, most teams start with HPA on CPU and memory, then move to custom metrics for business-driven scaling. If you run on managed Kubernetes (EKS, AKS, GKE), the basics are easier, but the configuration choices still belong to you.
Core concepts and practical configuration
Horizontal Pod Autoscaler (HPA): workload-level scaling
HPA increases or decreases replicas of a Deployment or StatefulSet based on observed metrics. It’s the bread-and-butter of scaling stateless services. The classic flow is: metrics server collects metrics, HPA calculates desired replica count, and the controller updates the Deployment.
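The controller's calculation is a simple proportional rule, documented upstream:

```latex
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
```

For example, 4 replicas averaging 90% CPU against a 60% target gives ceil(4 × 90/60) = 6 replicas. Keeping this formula in mind makes target tuning much less mysterious: halving the target roughly doubles the replica count for the same load.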
A realistic HPA looks like this. Note the resource targets and stabilization windows, which matter to avoid flapping:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
What this does in practice:
- Scales up by at most 100% of current replicas per minute (i.e., doubling) when CPU or memory crosses the target, but waits through a five-minute stabilization window before scaling down.
- Dampens rapid oscillation when traffic is jittery.
- Keeps at least two replicas during quiet periods for resilience.
Keep in mind that utilization targets are measured against the pod's resource requests, so HPA only behaves predictably if requests are set realistically. For CPU, a target around 60% is a safe starting point for typical web services. For memory, be careful: memory usage often doesn't fall when load does, and scaling on memory can cause thrashing if your app caches aggressively. Many teams start with CPU only and add memory later only if necessary.
HPA relies on metrics-server for CPU and memory metrics. On managed clusters, it's usually installed by default. To check, run:
kubectl top pods
If this fails, metrics-server isn't running or its RBAC isn't configured. GKE and AKS ship it with the cluster; on EKS you typically install it yourself, either from the upstream manifest or as an EKS add-on.
Vertical Pod Autoscaler (VPA): right-sizing workloads
VPA is less about scaling replicas and more about learning the right resource requests and limits for your pods. It's valuable for stateful workloads where horizontal scaling is difficult or for services with a consistent baseline. VPA's updatePolicy supports four modes: Off, Initial, Recreate, and Auto. In production, many teams start in Off mode to gather sizing recommendations, move to Initial so new pods are right-sized at creation time, and cautiously adopt Auto (which evicts running pods to resize them) for non-critical workloads.
A common VPA configuration for recommendations only:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
  namespace: data
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: postgres
  updatePolicy:
    updateMode: "Off"
After some time, you inspect recommendations:
kubectl describe vpa postgres-vpa -n data
You will see suggested requests for CPU and memory. If you decide to apply them, change updateMode to "Initial" or "Auto". A caution: in Auto mode, VPA evicts running pods to apply new requests. Use PodDisruptionBudgets to control the risk.
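If you move beyond recommendations, it's worth bounding what VPA is allowed to set so a bad recommendation can't over- or under-size a pod. A sketch using VPA's resourcePolicy (the container name and the bounds are illustrative, not prescriptive):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
  namespace: data
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: postgres
  updatePolicy:
    updateMode: "Initial"        # only resize pods at creation time
  resourcePolicy:
    containerPolicies:
    - containerName: postgres    # illustrative container name
      minAllowed:                # floor: never shrink below this
        cpu: "250m"
        memory: "512Mi"
      maxAllowed:                # ceiling: never grow beyond this
        cpu: "2"
        memory: "4Gi"
```

The min/max bounds keep the feedback loop inside a range you've validated, which is especially useful while you're still building trust in the recommendations.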
Cluster Autoscaler: node-level scaling
Cluster Autoscaler increases or decreases the number of nodes in a node pool based on unschedulable pods. It works with cloud provider integrations and respects node labels, taints, and PDBs. It’s the component that links pod demand to actual infrastructure capacity.
Here’s a typical scenario: a Deployment requests 4 CPU cores per pod and has 10 replicas. If your nodes are 4-core, and you already run other workloads, you might need new nodes. The Cluster Autoscaler sees pending pods and triggers a scale-up.
Practical configuration tips:
- Use multiple node pools with different shapes (e.g., a pool for burstable workloads and a pool for memory-heavy jobs).
- Set node labels and taints to steer workloads to the right pool.
- Coordinate with HPA: set HPA maxReplicas to a value your cluster can realistically handle when scaled out.
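As a sketch of the second tip, a memory-heavy job can be steered to a dedicated pool with a taint on the nodes and a matching toleration plus nodeSelector on the workload. The pool name, label, taint key, and image below are all illustrative:

```yaml
# Assumes a node pool created with taint workload=memory-heavy:NoSchedule
# and label pool=memory-optimized (names are illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: analytics-job
spec:
  template:
    spec:
      nodeSelector:
        pool: memory-optimized   # only schedule onto the dedicated pool
      tolerations:
      - key: "workload"          # tolerate the pool's taint
        operator: "Equal"
        value: "memory-heavy"
        effect: "NoSchedule"
      containers:
      - name: analytics
        image: your-registry/analytics:1.0.0
        resources:
          requests:
            cpu: "1"
            memory: "8Gi"        # large request drives pool scale-up
      restartPolicy: Never
```

The taint keeps general workloads off the expensive nodes, so the Cluster Autoscaler can scale that pool down to zero when no jobs are pending.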
On GKE, the cluster autoscaler is a built-in option on node pools. On EKS, you deploy it as a separate Deployment and configure it with AWS tags and Auto Scaling group names. On AKS, it's also a built-in option.
# Check for pending pods and inspect autoscaler decisions
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl logs -n kube-system deployment/cluster-autoscaler
KEDA: event-driven autoscaling
KEDA (Kubernetes Event-driven Autoscaling) extends HPA to scale based on events from queues, streams, or external metrics. It's popular for async workloads: SQS, Kafka, RabbitMQ, Prometheus, and more. You define ScaledObject resources, and KEDA creates and drives the underlying HPA with the corresponding external metrics.
Example for Kafka consumer lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: "kafka-cluster:9092"
      consumerGroup: "payments-processor"
      topic: "payments"
      lagThreshold: "1000"
      tls: "disable"
This scales the worker based on consumer lag. KEDA installs the necessary HPA under the hood; broker credentials are supplied through KEDA's TriggerAuthentication resources rather than baked into the trigger. It's a clean pattern for workloads where demand is defined by backlog rather than CPU.
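For a broker that requires SASL credentials, the TriggerAuthentication resource maps trigger parameters to Secret keys, and the trigger references it by name. A hedged sketch (the Secret name and keys are assumptions about your setup):

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
  namespace: default
spec:
  secretTargetRef:
  - parameter: sasl              # e.g., "plaintext" stored in the Secret
    name: kafka-credentials      # assumed Secret name
    key: sasl
  - parameter: username
    name: kafka-credentials
    key: username
  - parameter: password
    name: kafka-credentials
    key: password
```

The kafka trigger in the ScaledObject then adds `authenticationRef: {name: kafka-auth}` alongside its metadata, keeping credentials out of the scaler spec itself.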
Metrics and the Prometheus connection
HPA supports custom metrics via the metrics API. Prometheus Adapter allows you to expose Prometheus queries as metrics that HPA can consume. This is how you scale on application-specific signals, like requests per second or queue length, rather than only CPU and memory.
A common pattern: you expose a /metrics endpoint with application counters, Prometheus scrapes it, and the adapter translates a query like rate(http_requests_total{service="webapp"}[2m]) into a metric for HPA.
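On the adapter side, a rule translates the Prometheus series into a metric name the custom metrics API can serve. A minimal sketch of a Prometheus Adapter rule (the series and label names are assumptions about your instrumentation):

```yaml
# Snippet from the Prometheus Adapter's rules configuration
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}   # map labels to K8s objects
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"                # rename counter ...
    as: "${1}_per_second"                  # ... to a rate-style metric
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Once this rule is loaded, `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1` should list the new metric, which is a quick way to verify the pipeline before wiring up the HPA.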
Example HPA using an Object metric served by the Prometheus Adapter:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-req-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 40
  metrics:
  - type: Object
    object:
      metric:
        name: http_requests_per_second
      describedObject:
        apiVersion: v1
        kind: Service
        name: webapp
      target:
        type: Value
        value: "300"
You’ll need Prometheus and the Prometheus Adapter properly configured. This is a step up in complexity but pays off when you want to scale based on actual service behavior.
Real-world example: a web service with HPA, VPA, and cluster autoscaler
Let’s ground this in a typical microservice. The deployment targets a stateless web API with moderate CPU usage and occasional memory spikes. We want to avoid manual intervention while staying within budget.
Project structure for configuration management:
k8s-config/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── overlays/
│   ├── dev/
│   │   ├── hpa.yaml
│   │   ├── vpa.yaml
│   │   └── kustomization.yaml
│   └── prod/
│       ├── hpa.yaml
│       ├── vpa.yaml
│       └── kustomization.yaml
└── README.md
Base deployment (relevant snippet):
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: your-registry/webapp:1.4.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "1000m"
            memory: "512Mi"
        env:
        - name: LOG_LEVEL
          value: "info"
Production overlay for HPA (prod/hpa.yaml):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
Production overlay for VPA recommendations (prod/vpa.yaml):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Off"
Apply with Kustomize:
kubectl apply -k overlays/prod
Observe the behavior over a few days. VPA will recommend new requests. If you set updateMode to Initial, VPA will apply recommendations at pod creation time. For critical services, consider a canary rollout using separate deployments and gradually switching traffic.
Common pitfalls in this setup:
- CPU utilization targets set too low trigger aggressive scale-up and wasted capacity. Start around 60–70%.
- Scaling on memory can cause flapping if your app has variable caches.
- Not setting PDBs leads to risky evictions during node drain or VPA updates.
Add a PodDisruptionBudget for safety:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: webapp
Evaluating tradeoffs: strengths, weaknesses, and fit
When autoscaling shines
- Variable traffic patterns with clear signals (CPU, memory, custom metrics, queue lag).
- Stateless services where replicas can be increased horizontally.
- Teams that can invest in observability and SLOs to guide tuning.
When autoscaling is tricky
- Stateful workloads with strict consistency or long startup times. VPA can help with sizing, but horizontal scaling may be limited.
- Latency-sensitive pipelines where cold starts matter. Keep minReplicas higher or use warm pools.
- Environments with strict cost controls. Scaling policies and quotas are essential to avoid bill shock.
- Small clusters with limited node pools. The Cluster Autoscaler might not be able to provision the right shapes quickly enough.
Tradeoffs to weigh
- Responsiveness vs. stability: fast scale-up helps under load, but can cause thrashing without proper stabilization.
- Resource efficiency vs. risk: low utilization targets save money but increase the chance of throttling or CPU saturation during bursts.
- Custom metrics vs. complexity: scaling on queue lag or RPS is powerful, but requires a metrics pipeline and operational maturity.
Personal experience: lessons from real clusters
I learned the hard way that scaling on CPU alone can be misleading for memory-bound apps. One data processing service scaled replicas based on CPU, but the real constraint was memory fragmentation inside the process. The pods thrashed, evictions happened, and the cluster autoscaler kept adding nodes that didn’t actually help. Moving to VPA recommendations and tuning memory requests reduced churn, and we added a soft limit on replicas to prevent runaway scale-up during anomalies.
Another lesson: test scale-up and scale-down separately. We once set scale-down stabilization too short, and pods got evicted right after a burst, leading to a flapping service and noisy neighbors on shared nodes. A five-minute scale-down window and a PDB saved us from repeated restarts.
On the cost side, one team scaled aggressively to hit latency SLOs, only to realize that the node pool didn’t have enough capacity in the target region during peak hours. The fix was a combination of node pool diversity (larger nodes for bursty traffic), realistic maxReplicas, and proactive capacity planning with reserved instances for baseline load.
Getting started: tooling and workflow
If you’re new to autoscaling, start with a single service and a simple HPA on CPU. Make sure metrics-server is installed. Then iterate.
Tooling checklist:
- A cluster with metrics-server and kube-state-metrics.
- Prometheus and the Prometheus Adapter if you plan to scale on custom metrics.
- KEDA if you have event-driven workloads.
- Cloud provider autoscaler components if using Cluster Autoscaler (install via cloud provider docs).
General workflow:
- Define SLOs or clear performance targets (e.g., p95 latency < 250 ms).
- Choose a metric that correlates with load (CPU for stateless APIs, queue lag for async workers).
- Set minReplicas for baseline resilience and maxReplicas based on capacity and budget.
- Use stabilization windows to dampen oscillations.
- Add PDBs and topology spread constraints for placement stability.
- Observe using dashboards and alerts; adjust targets based on real behavior.
- For VPA, start in Off mode to gather recommendations, move to Initial for controlled right-sizing of new pods, and adopt Auto only for workloads that tolerate evictions, after testing.
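The topology spread constraints mentioned above keep replicas from landing on a single node or zone, so a scale-down or node failure doesn't take out most of your capacity at once. A sketch for the webapp's pod template (maxSkew 1 and the zone key are common starting values):

```yaml
# In the Deployment's pod template spec
topologySpreadConstraints:
- maxSkew: 1                                    # allow at most 1 pod imbalance
  topologyKey: topology.kubernetes.io/zone      # spread across zones
  whenUnsatisfiable: ScheduleAnyway             # prefer spreading, don't block
  labelSelector:
    matchLabels:
      app: webapp
```

Using ScheduleAnyway rather than DoNotSchedule is the safer default with autoscaling, since a hard constraint can leave pods Pending while the Cluster Autoscaler provisions capacity in the right zone.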
Free learning resources
- Kubernetes Horizontal Pod Autoscaler documentation: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- Vertical Pod Autoscaler user guide: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#readme
- Cluster Autoscaler official repo: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#readme
- KEDA documentation and examples: https://keda.sh/docs/
- Prometheus Adapter configuration: https://github.com/kubernetes-sigs/prometheus-adapter#readme
- GKE Cluster Autoscaler overview: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
- EKS Cluster Autoscaler setup: https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html
- AKS Cluster Autoscaler: https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler
These resources cover the components, cloud specifics, and configuration patterns you will need. The official docs are the best source for up-to-date flags and IAM permissions, which change frequently across providers.
Who should use autoscaling, and who might skip it
Use autoscaling if:
- Your workload is variable or unpredictable.
- You have basic observability in place and can define meaningful metrics.
- You can afford to spend time tuning behavior and reviewing capacity.
Consider skipping or deferring autoscaling if:
- Your workloads are small, stable, and predictable, and you prefer fixed capacity.
- You run batch jobs with strict resource constraints and no need for responsiveness.
- Your team lacks the bandwidth to maintain monitoring and alerting for scaling behavior.
A grounded takeaway: autoscaling is not a silver bullet. It is a feedback loop that rewards careful measurement and iteration. Start small, choose one service, pick one metric, and observe. The confidence you build from a well-tuned HPA or a well-scoped VPA recommendation will guide you toward more advanced patterns like custom metrics and event-driven scaling. When done right, autoscaling gives you fewer pages at night, faster responses for users, and a bill that matches actual usage.