Monitoring and Alerting Systems
Because waiting for user reports is no longer acceptable in modern software delivery

Monitoring and alerting are the heartbeat of any production system. As developers, we tend to celebrate deployments, but the reality is that most of our value is delivered between deployments when the application is actually running and serving users. Over the years, I’ve learned that the difference between a resilient service and a chaotic one isn’t the absence of failure; it’s how quickly you detect it, understand it, and recover. Writing code is hard; operating it responsibly is harder and often overlooked until an incident wakes us up at 3 a.m. This article is for developers who want to bake observability and alerting into their workflows, not as an afterthought, but as a core part of how we build. We’ll cover the mental models, tradeoffs, and practical implementation details you need to move from “it works on my machine” to “I know exactly what’s happening in production.”
Most teams start monitoring in reactive mode: they add logs, metrics, and alerts only after an incident. The best time to instrument a service is before it handles real traffic. I’ve seen startups that skip monitoring entirely and rely on customer support tickets to detect outages. That might work for a weekend hack project, but it doesn’t scale. Modern applications run across clouds, containers, and serverless platforms. You have services talking to services, queues, databases, caches, and third-party APIs. Each hop introduces potential failure and latency. Without a clear way to observe system behavior and get notified when thresholds are crossed, you’re flying blind. That’s why the topic matters now more than ever: distributed systems are the default, and our tooling must match that reality.
If you’re skeptical about the overhead, I get it. Instrumentation can feel like a tax. It takes time to add metrics, define dashboards, and tune alerts. But the cost of not having observability is higher: longer incidents, user churn, and a team that spends more time firefighting than building. You don’t need a massive observability budget to start. You can begin with a simple metrics pipeline, a small set of meaningful alerts, and a handful of dashboards. Over time, you refine and scale. The goal isn’t to monitor everything; it’s to monitor the right things in a way that gives you actionable signals, not noise.
Context: Where monitoring and alerting fit today
Monitoring and alerting are foundational to DevOps, SRE, and platform engineering. They are used by backend engineers, infrastructure teams, and increasingly by frontend developers who care about user experience. In real-world projects, monitoring is embedded into CI/CD pipelines, deployment strategies, and incident response playbooks. For instance, a canary release needs real-time metrics to decide if the new version should proceed or roll back. A serverless function needs cold start tracking to ensure predictable latency. A Kubernetes cluster requires resource saturation alerts to avoid cascading failures.
Compared to alternatives, pure logging is not enough. Logs are essential for debugging specific requests, but they don’t give you aggregate views or trends. Manual checks don’t scale and create hero culture. Third-party status pages are useful for communication but don’t provide internal insight. Modern monitoring stacks typically combine metrics, logs, traces, and events. Metrics capture numeric time series data, logs capture textual context, traces capture request lifecycles across services, and events mark discrete occurrences like deployments or config changes. Tools like Prometheus for metrics, Grafana for dashboards, Loki for logs, and OpenTelemetry for tracing are common in the ecosystem. Cloud providers offer managed options like AWS CloudWatch, GCP Cloud Monitoring, and Azure Monitor. Alert managers such as Alertmanager or PagerDuty route notifications based on severity and on-call schedules.
The choice of stack depends on constraints. If you’re on Kubernetes, Prometheus and Grafana are natural fits. For serverless, CloudWatch metrics and logs are convenient. For distributed tracing, OpenTelemetry is emerging as the standard. There are tradeoffs: self-hosted stacks offer control but need maintenance; managed services reduce ops burden but can be expensive and vendor-locking. The key is to separate signal from noise. Good alerting means fewer pages with higher urgency. Bad alerting means alert fatigue and ignored incidents.
Concepts and practical implementation
At the core, monitoring and alerting revolve around signals. Signals are measurable indicators of system health. Common signals include error rate, request latency, CPU/memory utilization, queue depth, and saturation points. Alerts are rules that fire when signals cross thresholds or exhibit problematic patterns. The most effective alerts are tied to symptoms that matter to users, like “95th percentile latency above 500 ms for more than five minutes,” rather than “CPU above 80%.” High CPU might be normal during batch jobs; latency spikes usually aren’t.
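As a concrete illustration, that latency symptom can be expressed directly as a PromQL condition (shown here against the `http_request_duration_seconds` histogram we define later in this article):

```promql
# p95 latency above 500 ms — a user-visible symptom, not an internal stat
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
```

Paired with a "for" duration in the alert rule, this fires only when the symptom is sustained, not on a single slow scrape.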
To illustrate, let’s build a minimal Node.js service that emits Prometheus metrics and triggers alerts via a webhook. We’ll create a simple HTTP server with endpoints that simulate work, and we’ll expose a /metrics endpoint for scraping. We’ll also set up Alertmanager to route alerts to a webhook receiver that logs them. This example demonstrates metrics collection and alert routing without relying on specific cloud services.
Project structure and setup
monitoring-demo/
├─ src/
│ ├─ index.js
│ ├─ metrics.js
│ └─ routes.js
├─ alertmanager/
│ ├─ alertmanager.yml
│ └─ webhook.js
├─ prometheus/
│ └─ prometheus.yml
├─ package.json
├─ Dockerfile
└─ docker-compose.yml
In package.json, we add dependencies for Express and Prometheus client.
{
  "name": "monitoring-demo",
  "version": "1.0.0",
  "main": "src/index.js",
  "scripts": {
    "start": "node src/index.js",
    "dev": "nodemon src/index.js"
  },
  "dependencies": {
    "express": "^4.19.2",
    "prom-client": "^15.0.0"
  },
  "devDependencies": {
    "nodemon": "^3.0.1"
  }
}
Instrumenting metrics in code
We’ll create a metrics module that registers Prometheus counters, histograms, and gauges. Counters track cumulative totals, histograms track distributions, and gauges track values that can go up and down. In src/metrics.js, we define metrics for HTTP requests and a simulated job queue size.
// src/metrics.js
const client = require('prom-client');

// Create a Registry to collect metrics
const register = new client.Registry();

// Default metrics (event loop lag, memory, CPU, etc.)
client.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5],
});
register.registerMetric(httpRequestDuration);

const jobQueueSize = new client.Gauge({
  name: 'job_queue_size',
  help: 'Current size of the job queue',
});
register.registerMetric(jobQueueSize);

const errorCounter = new client.Counter({
  name: 'app_errors_total',
  help: 'Total application errors',
  labelNames: ['type'],
});
register.registerMetric(errorCounter);

module.exports = {
  register,
  httpRequestDuration,
  jobQueueSize,
  errorCounter,
};
In src/routes.js, we create endpoints that simulate work and record metrics.
// src/routes.js
const express = require('express');
const { httpRequestDuration, jobQueueSize, errorCounter } = require('./metrics');

const router = express.Router();

// Middleware to measure request duration
router.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status: res.statusCode });
  });
  next();
});

// Health endpoint
router.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});

// Simulate CPU-bound work
router.get('/work/cpu', (req, res) => {
  const start = Date.now();
  // Busy loop to simulate CPU load
  while (Date.now() - start < 200) {
    // no-op
  }
  res.json({ message: 'cpu work done', ms: 200 });
});

// Simulate a job queuing process
let queue = [];

router.post('/jobs', (req, res) => {
  const job = { id: Date.now(), payload: req.body };
  queue.push(job);
  jobQueueSize.set(queue.length);
  res.json({ id: job.id, queued: true });
});

router.delete('/jobs/:id', (req, res) => {
  const id = parseInt(req.params.id, 10);
  const before = queue.length;
  queue = queue.filter(j => j.id !== id);
  jobQueueSize.set(queue.length);
  res.json({ removed: before - queue.length });
});

// Simulate errors
router.get('/error', (req, res) => {
  errorCounter.inc({ type: 'demo' });
  res.status(500).json({ error: 'demo error' });
});

module.exports = router;
In src/index.js, we wire up Express and expose the metrics endpoint.
// src/index.js
const express = require('express');
const routes = require('./routes');
const { register } = require('./metrics');

const app = express();
app.use(express.json());
app.use('/', routes);

// Prometheus scraping endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Demo app listening on http://localhost:${PORT}`);
  console.log(`Metrics available at http://localhost:${PORT}/metrics`);
});
Prometheus configuration
In prometheus/prometheus.yml, we configure scraping and alerting rules.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/alerts.yml"

scrape_configs:
  - job_name: "node-app"
    static_configs:
      # The app runs as the "app" service on the shared Compose network
      - targets: ["app:3000"]
    metrics_path: "/metrics"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
Define alerts in prometheus/alerts.yml.
groups:
  - name: node-app
    rules:
      - alert: HighErrorRate
        expr: rate(app_errors_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec over the last 5 minutes"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "95th percentile latency is {{ $value }} seconds"
      - alert: QueueGrowing
        expr: job_queue_size > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Job queue is growing"
          description: "Queue size is {{ $value }}"
Alertmanager configuration and webhook
In alertmanager/alertmanager.yml, we route alerts to a webhook.
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://webhook:8080/'
        send_resolved: true
A simple webhook receiver in alertmanager/webhook.js to log alerts.
// alertmanager/webhook.js
const express = require('express');

const app = express();
app.use(express.json());

app.post('/', (req, res) => {
  const alerts = req.body.alerts || [];
  alerts.forEach(a => {
    const sev = a.labels?.severity || 'unknown';
    const name = a.labels?.alertname || 'unnamed';
    const desc = a.annotations?.description || '';
    console.log(`[ALERT] ${sev.toUpperCase()} ${name}: ${desc}`);
  });
  res.status(200).send('ok');
});

app.listen(8080, () => {
  console.log('Webhook receiver listening on :8080');
});
Docker Compose for local orchestration
In docker-compose.yml, we bring up the app, Prometheus, Alertmanager, and the webhook.
version: "3.8"
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - PORT=3000
    networks:
      - mon
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    networks:
      - mon
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
    networks:
      - mon
  webhook:
    build:
      context: ./alertmanager
      dockerfile: Dockerfile.webhook
    ports:
      - "8080:8080"
    networks:
      - mon
networks:
  mon:
    driver: bridge
The app Dockerfile is minimal.
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY src ./src
EXPOSE 3000
CMD ["node", "src/index.js"]
The webhook Dockerfile.webhook is similarly simple.
FROM node:20-alpine
WORKDIR /webhook
# The build context is ./alertmanager, so webhook.js sits at the context root
COPY webhook.js .
RUN npm install express
EXPOSE 8080
CMD ["node", "webhook.js"]
With this setup, you can run docker-compose up and then generate load:
# Generate traffic to trigger alerts
for i in {1..100}; do
  curl -s http://localhost:3000/error > /dev/null
  curl -s -X POST -H "Content-Type: application/json" -d '{"task":"demo"}' http://localhost:3000/jobs > /dev/null
  sleep 0.2
done
You’ll see alerts in Alertmanager’s UI (localhost:9093) and logs in the webhook container. This is a minimal but realistic pipeline: instrument app metrics, scrape them, evaluate rules, and route alerts.
Why this matters
You’ve now tied code to operations. When the error rate spikes, Prometheus fires the alert; Alertmanager routes it; your webhook receives it. The next step is connecting this to paging tools or chat systems. For production, you would integrate with PagerDuty or Opsgenie for on-call rotations, and Slack or MS Teams for notifications. The mental model is simple: signals from code, central aggregation, threshold evaluation, and intelligent routing.
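To sketch that last hop, here is a minimal formatter that turns an Alertmanager webhook payload into chat-friendly lines. The payload shape (`status`, `labels`, `annotations`) mirrors Alertmanager's webhook format; `formatAlertsForChat` itself is a hypothetical helper for illustration, not part of any library.

```javascript
// Turn an Alertmanager webhook payload into one message line per alert.
// Input shape mirrors Alertmanager's webhook body:
// { alerts: [{ status, labels: { alertname, severity }, annotations: { description } }] }
function formatAlertsForChat(payload) {
  const alerts = payload.alerts || [];
  return alerts.map((a) => {
    const prefix = a.status === 'resolved' ? 'RESOLVED' : 'FIRING';
    const sev = (a.labels?.severity || 'unknown').toUpperCase();
    const name = a.labels?.alertname || 'unnamed';
    const desc = a.annotations?.description || '';
    return `${prefix} [${sev}] ${name}${desc ? `: ${desc}` : ''}`;
  });
}
```

From here, posting each line to a Slack or Teams incoming webhook is a single HTTP call; keeping the formatting pure and separate makes it trivially testable.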
Fun language facts and patterns
Node.js event loop metrics (provided by prom-client defaults) reveal event loop lag, which can indicate CPU saturation or blocking tasks. In Python, you can use prometheus_client to expose similar metrics; in Go, prometheus/client_golang is common. Languages influence how you instrument: synchronous frameworks can use middleware easily; async runtimes may need context propagation to tie metrics to requests. In Go, you’d typically use http.HandlerFunc wrappers; in Python, FastAPI middleware; in Node.js, Express middleware. The patterns are universal: start timers, record outcomes, label dimensions.
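The pattern is small enough to sketch without any library. This toy recorder (the class and method names are mine, not prom-client's) shows the three moves in plain Node.js: start a timer, record the outcome, attach bounded labels.

```javascript
// A toy latency recorder illustrating the universal pattern:
// start a timer, record the outcome, label the dimensions.
class LatencyRecorder {
  constructor() {
    this.series = new Map(); // labelKey -> { count, totalMs }
  }
  startTimer(labels) {
    const start = process.hrtime.bigint();
    // Return an "end" function, echoing prom-client's startTimer() style.
    return () => {
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      const key = JSON.stringify(labels); // keep labels low-cardinality!
      const s = this.series.get(key) || { count: 0, totalMs: 0 };
      s.count += 1;
      s.totalMs += ms;
      this.series.set(key, s);
    };
  }
  meanMs(labels) {
    const s = this.series.get(JSON.stringify(labels));
    return s ? s.totalMs / s.count : 0;
  }
}
```

A real client adds histogram buckets and an exposition format, but the core loop is exactly this.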
Honest evaluation: strengths, weaknesses, and tradeoffs
Monitoring and alerting are powerful but come with tradeoffs.
Strengths
- Early detection: Alerts reduce mean time to detect (MTTD) and mean time to recover (MTTR).
- Data-driven decisions: Metrics inform capacity planning, tuning, and feature impact analysis.
- Feedback loops: Instrumentation reveals performance bottlenecks and user experience issues that tests might miss.
- Operational maturity: Consistent monitoring leads to calmer incidents and better postmortems.
Weaknesses
- Alert fatigue: Poorly tuned alerts flood on-call engineers and lead to ignored pages.
- Costs: Storage and compute for metrics and logs can be expensive, especially at high cardinality.
- Complexity: Distributed tracing and multi-service setups require careful propagation of context.
- Maintenance: Rules, dashboards, and pipelines need ongoing updates as the system evolves.
Tradeoffs
- Self-hosted vs managed: Self-hosted Prometheus gives control but requires ops work; managed services like Datadog reduce ops but can be pricey and vendor-dependent.
- Cardinality limits: High-cardinality labels (e.g., user IDs) can explode metric series. Use them sparingly.
- Sampling vs full capture: Full tracing gives completeness but high overhead; sampling balances cost and insight.
- Threshold vs anomaly detection: Static thresholds are simple but brittle; anomaly detection is flexible but can be noisy and complex.
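One practical way to respect the cardinality tradeoff is to normalize label values before recording them. The helpers below are an illustrative sketch (the names and regexes are my assumptions, not a library API): raw paths collapse to route templates and raw status codes to classes, so the number of series stays bounded no matter how many users or IDs flow through.

```javascript
// Collapse high-cardinality values into bounded label sets.
// "/users/12345" -> "/users/:id" (thousands of paths -> one series)
function normalizeRoute(path) {
  return path
    .replace(/\/\d+/g, '/:id') // numeric IDs
    .replace(/\/[0-9a-f]{8}-[0-9a-f-]{27}/gi, '/:uuid'); // UUIDs
}

// 204 -> "2xx", 404 -> "4xx": five classes instead of ~60 codes.
function statusClass(code) {
  return `${Math.floor(code / 100)}xx`;
}
```

High-cardinality detail (the actual user ID, the actual request ID) belongs in logs and traces, not in metric labels.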
When to use
- Do use when you run services in production with real users or critical SLAs.
- Do use when you need continuous delivery with safe rollouts and performance guardrails.
- Do use when you want to move from reactive firefighting to proactive reliability.
When to skip or defer
- Skip if you’re building throwaway prototypes with zero users. Logging may be enough.
- Defer advanced distributed tracing if your system is a monolith behind a single entry point.
- Avoid over-alerting early; start with symptoms and expand gradually.
Personal experience: lessons from the trenches
I learned monitoring the hard way: during a data migration that slowed down a critical API. We didn’t have latency histograms, only logs. When users complained, we spent an hour combing logs for errors that weren’t there. The issue wasn’t errors; it was latency. We added histograms and immediately saw p95 jumps during migration windows. The fix was to throttle the migration and alert on latency thresholds. That incident taught me to measure the right things: symptoms over causes.
Another common mistake I’ve made is over-labeling metrics. Early on, I added labels for requestId and userId to “make metrics rich.” That looked great until Prometheus memory spiked and scrapes slowed. We learned to keep labels low-cardinality and use logs for high-cardinality debugging. Tracing helped fill the gap. We instrumented context propagation so each request had a traceId; logs carried that ID, metrics stayed lean. The result was a clean, scalable system that balanced cost and insight.
I’ve also learned that alert design matters. An alert that fires every five minutes becomes noise. Tuning for durations, grouping alerts, and setting severity properly made our pages rare and urgent. We added runbooks and dashboard links to alert annotations. That context helped responders act quickly. In incidents, having a dashboard showing error rate, latency, and saturation together was invaluable. The pattern is consistent: less is more, and clarity beats volume.
Getting started: workflow and mental models
Starting from scratch can feel overwhelming. Here’s a pragmatic workflow that scales:
- Step 1: Identify symptoms. Talk to your team and users. What failures hurt the most? Latency spikes, errors, data loss, or capacity issues? Map symptoms to measurable signals.
- Step 2: Pick a stack. If you’re on Kubernetes, Prometheus and Grafana are a strong default. For serverless, CloudWatch or OpenTelemetry exporters are good. Choose a tracing solution if you have multiple services.
- Step 3: Instrument early. Add metrics to your application code at boundaries: HTTP handlers, database calls, queue operations. Use standard labels like method, route, status. Avoid high-cardinality labels.
- Step 4: Define alerts. Start with symptom-based alerts: error rate, latency, saturation. Add "for" durations to reduce flapping. Tie alerts to dashboards for context.
- Step 5: Build dashboards. Create a small set of dashboards that show the health of each service: request rate, error rate, latency, resource usage, and queue depth. Keep them readable; avoid clutter.
- Step 6: Test alerts. Simulate failures to ensure alerts fire and route correctly. Run tabletop exercises so the team knows how to respond.
- Step 7: Iterate. Review alert quality regularly. Remove noisy alerts, add missing ones, refine thresholds, and update runbooks.
Folder structure for a typical microservice:
service-name/
├─ src/
│ ├─ handlers/
│ ├─ middleware/
│ ├─ models/
│ └─ metrics.js
├─ scripts/
│ └─ seed-data.js
├─ dashboards/
│ └─ service-name.json
├─ alerts/
│ └─ prometheus-rules.yml
├─ Dockerfile
├─ docker-compose.yml
├─ prometheus.yml
└─ README.md
Mental models:
- Signals, not logs alone. Metrics and traces complement logs. Logs are for context, metrics are for aggregates, traces are for journeys.
- Symptoms over causes. Alert on user-visible issues, not internal stats that might be fine.
- Thresholds with hysteresis. Use "for" durations to avoid flapping; consider deadman switches for heartbeats.
- Label carefully. Cardinality is cost and complexity. Keep labels minimal and meaningful.
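To make the hysteresis idea concrete, here is a toy evaluator (an illustrative sketch, not how Prometheus is actually implemented) showing what a "for" duration buys you: the alert fires only after the condition has held for N consecutive evaluation cycles.

```javascript
// Toy "for"-duration evaluator: fire only after the condition
// has been continuously true for `forCount` evaluation cycles.
function makeForEvaluator(forCount) {
  let consecutive = 0; // consecutive true evaluations so far
  return function evaluate(conditionTrue) {
    consecutive = conditionTrue ? consecutive + 1 : 0;
    return consecutive >= forCount; // "firing" only when sustained
  };
}
```

A flapping signal (true, false, true, false, ...) never fires, while a sustained breach does, which is exactly the behavior a `for:` clause gives a Prometheus rule.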
Example of a more advanced alert rule in Prometheus that uses a recording rule to reduce query cost:
groups:
  - name: optimization
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, route) (rate(http_request_duration_seconds_count[5m]))
      - alert: TooMany4xx
        expr: sum by (job, route) (rate(http_request_duration_seconds_count{status=~"4.."}[5m])) / job:http_requests:rate5m > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 4xx rate for {{ $labels.route }}"
Recording rules precompute expensive queries, improving efficiency. This pattern is common in production systems where query load must be controlled.
What makes monitoring and alerting stand out
The standout quality is how monitoring transforms developer ownership. When you instrument your code, you see the consequences of design choices. A blocking loop becomes visible as event loop lag; a bursty queue becomes visible as saturation spikes. The feedback loop is immediate and quantitative. The ecosystem around Prometheus, Grafana, and OpenTelemetry is mature, with broad language support and community dashboards. Developer experience is improved by clear dashboards and actionable alerts, reducing cognitive load during incidents.
Maintainability comes from standardized patterns. Once you adopt consistent instrumentation, it becomes part of the codebase’s DNA. New features naturally include metrics. Onboarding new engineers is easier because dashboards and alerts document expected behavior. The outcome is resilience: faster detection, clearer context, and predictable recovery.
Free learning resources
- Prometheus documentation: The official docs are practical and cover concepts, query language, and alerting. Start with “What is Prometheus?” and the query language examples. See https://prometheus.io/docs/.
- Grafana Labs Tutorials: Grafana provides step-by-step guides on building dashboards and integrating data sources. Their “Introduction to Grafana” is beginner-friendly. See https://grafana.com/tutorials/.
- OpenTelemetry Documentation: If you’re exploring distributed tracing, OpenTelemetry’s language-specific guides are useful. See https://opentelemetry.io/docs/.
- SRE Books by Google: The “Site Reliability Engineering” book and the follow-up “The Site Reliability Workbook” contain practical chapters on monitoring and alerting. See https://sre.google/books/.
- Alertmanager Documentation: Learn routing, grouping, and inhibition rules to tame noise. See https://prometheus.io/docs/alerting/latest/alertmanager/.
These resources are grounded and hands-on. They avoid hype and focus on patterns that work in real systems.
Summary and takeaway
Monitoring and alerting are essential for anyone building software that runs in production. If you’re a backend or platform engineer, they should be part of your core toolkit. If you’re a frontend developer, instrumenting client-side performance and errors matters for user experience. If you’re a solo developer working on a hobby project, start small with metrics and a single alert; you’ll be surprised how much confidence it adds.
You might skip advanced observability if you’re building isolated prototypes or offline tools where failure has no user impact. But as soon as your code serves users or relies on external dependencies, investing in monitoring pays dividends.
The takeaway is simple: measure what matters, alert on symptoms, and iterate. Build a feedback loop between code and operations. The result is software that doesn’t just run, but runs reliably, with fewer surprises and faster fixes. That’s the kind of engineering that users trust and teams enjoy maintaining.
Sources: Prometheus documentation (https://prometheus.io/docs/), Grafana Labs tutorials (https://grafana.com/tutorials/), OpenTelemetry documentation (https://opentelemetry.io/docs/), Google SRE books (https://sre.google/books/).