Once, I debugged a production issue that only happened on Tuesdays. Everything looked fine at first glance: CPU, memory, request count, all normal. But users complained about slow responses. Turns out, the culprit was an unnoticed spike in “active connections,” hitting the server’s limit, causing requests to queue up. A missing alert on a non-standard metric led to hours of frustration.
The Standard Metrics
Web servers like Nginx, Apache, and Caddy expose a set of built-in metrics that cover most operational concerns:
- Requests per second (RPS) - How many requests the server handles per second. High RPS indicates good traffic, but sudden drops may signal an outage, while spikes might overload your backend. Keeping historical trends can help anticipate scaling needs.
- Response time (latency) - How long requests take to process. A server under high load might have increased response times. Tracking p95 and p99 latencies (rather than averages) helps detect performance issues before users complain; the sketch after this list shows one way to record them.
- Error rates - 4xx and 5xx responses, broken down by type. High 4xx errors might indicate bot activity or API misuse, while growing 5xx errors often mean backend failures. Alerting on their rate, rather than just raw counts, avoids noise.
- CPU & Memory usage - Resource consumption of the web server process. CPU spikes could mean inefficient request handling, while high memory use might suggest resource leaks. Correlating this with request volume helps pinpoint the root cause.
- Open connections - Active TCP connections at a given time. A high number of open connections, especially with long durations, may indicate slow client behavior or an overwhelmed backend.
- Throughput - Data transferred per second. Tracking throughput is essential for detecting unexpected bandwidth consumption. Sudden dips might mean network issues, while spikes could indicate abuse, like large file downloads.
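To make a few of these concrete, here's a minimal sketch of recording request counts, latency, and open connections using the Python prometheus_client library (assumed here); the handler wrapper and metric names are illustrative rather than taken from any particular server:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Total requests, labeled so error rates can later be computed per status.
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "status"])

# Latency as a histogram: p95/p99 are derived from the buckets at query
# time instead of relying on a stored average.
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5))

# Open connections rise and fall, so a gauge is the right fit.
CONNECTIONS = Gauge("open_connections", "Currently active connections")

def handle_request(method, handler):
    """Hypothetical wrapper around an application's request handler."""
    CONNECTIONS.inc()
    try:
        with LATENCY.time():               # observes elapsed seconds
            status = handler()             # handler returns an HTTP status
        REQUESTS.labels(method=method, status=str(status)).inc()
        return status
    finally:
        CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)                # exposes /metrics for scraping
```

Scraping port 8000's /metrics endpoint then yields plain-text series in the format shown later in this post.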
These are a great start, but they don’t always tell the full story.
Beyond the Defaults: Custom Metrics
Sometimes, what you really need isn’t included out of the box. Here are a few examples of non-standard metrics that have saved me before:
- Queued Requests - Helps spot when your server isn’t keeping up before outright failures happen. For example, if incoming requests far exceed processing capacity, they pile up in a queue, leading to latency spikes and eventual timeouts.
- TLS Handshake Time - Crucial for diagnosing HTTPS slowdowns. If this metric increases, it might indicate certificate validation delays, an overloaded TLS termination proxy, or even network issues affecting encrypted connections.
- Backend Response Times - If you're proxying requests, measuring upstream services separately is key. Without this, a slow database query or external API call could appear as a web server issue when the real problem lies elsewhere. The sketch after this list shows one way to track this.
- Cache Hit Ratio - When serving static assets or using FastCGI caches, knowing if requests hit cache is invaluable. A low cache hit ratio could mean misconfigured caching rules, leading to unnecessary backend load.
- Rate-limiting Events - Helps debug why users might see “Too Many Requests.” Monitoring when and how often rate limits are triggered can help fine-tune API quotas and protect services from abuse.
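As a rough illustration, here's how two of these, backend response time and cache hit ratio, could be instrumented around a hypothetical proxy function (again assuming prometheus_client; the in-memory cache stand-in and names are made up for the sketch):

```python
import time
from prometheus_client import Counter, Histogram

# Upstream latency tracked separately from total request time, so a slow
# backend shows up as a backend problem, not a web server problem.
BACKEND_LATENCY = Histogram("backend_response_seconds",
                            "Upstream response time", ["upstream"])

# Cache outcomes as a labeled counter; hit ratio = hit / (hit + miss).
CACHE_RESULTS = Counter("cache_requests_total", "Cache lookups", ["result"])

_cache = {}  # stand-in for a real cache (FastCGI cache, Redis, etc.)

def proxy_request(key, upstream, fetch):
    """Hypothetical proxy path: consult the cache, then call the upstream."""
    if key in _cache:
        CACHE_RESULTS.labels(result="hit").inc()
        return _cache[key]
    CACHE_RESULTS.labels(result="miss").inc()

    start = time.perf_counter()
    try:
        response = fetch()                 # call the upstream service
        _cache[key] = response
        return response
    finally:
        BACKEND_LATENCY.labels(upstream=upstream).observe(
            time.perf_counter() - start)
```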
Projects that expose these:
- Prometheus Nginx Exporter - Adds detailed connection and request queue metrics.
- Apache mod_status - Gives real-time statistics on requests.
- Envoy Proxy Stats - Includes advanced L7 metrics.
Understanding Metric Data Structures
Metrics typically follow a structured format. Here’s how Prometheus structures time-series data:
http_requests_total{method="GET",status="200",host="example.com"} 1234
This tells us:
- Metric name: http_requests_total (a counter for total requests)
- Labels: {method="GET", status="200", host="example.com"} (adds context)
- Value: 1234 (the actual count at a given timestamp)
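As a rough illustration of how such a line decomposes, here's a simplified Python parse of it (real scrapers handle escaping, timestamps, and other edge cases that this sketch ignores):

```python
import re

LINE = 'http_requests_total{method="GET",status="200",host="example.com"} 1234'

# Simplified pattern: metric name, optional {labels}, then the sample value.
METRIC_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                       r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

match = METRIC_RE.match(LINE)
name = match.group("name")                        # "http_requests_total"
labels = {
    k: v.strip('"')
    for k, v in (pair.split("=", 1) for pair in match.group("labels").split(","))
}                                                 # {"method": "GET", ...}
value = float(match.group("value"))               # 1234.0

print(name, labels, value)
```

Every distinct combination of labels is its own time series, which is why label cardinality matters so much for storage and query speed.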
Other formats include JSON (used by the ELK stack) or plaintext stats like Varnish's varnishstat output. Visualization tools like Grafana, Kibana, and Datadog help turn these raw numbers into insightful graphs.
Here are additional common types of metrics (a short code sketch of each follows the list):
- Counters: These are metrics that only ever increase. For example, a counter might track the total number of requests processed. They reset when the application restarts but otherwise just keep adding. This is the main metric type for things like http_requests_total.
- Gauges: These metrics can go up or down. A typical example is memory usage, such as memory_usage_bytes; the value fluctuates over time. Gauges are perfect for metrics that reflect current values, like the number of active connections.
- Histograms: These metrics track the distribution of data over a period. For instance, request latency is often tracked using histograms, which group latency into buckets (e.g., 0-100ms, 100-200ms). You can later query these buckets to find the overall distribution and percentiles.
- Summaries: Similar to histograms, summaries also track distributions but focus on calculating percentile values over time, such as an http_request_duration_seconds summary that tracks response durations. Summaries automatically calculate things like the 95th and 99th percentiles, giving a clear picture of the system's responsiveness.
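Here's a minimal sketch of declaring each type with the Python prometheus_client library (the metric names are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: monotonically increasing; resets only when the process restarts.
REQUESTS = Counter("http_requests_total", "Total requests handled")

# Gauge: a value that can go up or down, like current memory usage.
MEMORY = Gauge("memory_usage_bytes", "Resident memory of the process")

# Histogram: observations are sorted into buckets; percentiles such as p95
# are computed later from the bucket counts.
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=(0.1, 0.2, 0.5, 1.0, 2.0, 5.0))

# Summary: tracks a count and a running sum of observations. (Note: the
# Python client does not precompute quantiles for summaries; the percentile
# feature described above is available in some other client libraries.)
UPLOAD_SIZE = Summary("upload_size_bytes", "Size of uploaded payloads")

REQUESTS.inc()                      # one more request handled
MEMORY.set(512 * 1024 * 1024)       # current value, may go down later
LATENCY.observe(0.3)                # a request that took 300 ms
UPLOAD_SIZE.observe(2048)           # a 2 KB upload
```

The choice usually comes down to whether you need a monotonic count (counter), a current value (gauge), or a distribution you can later query for percentiles (histogram or summary).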
Visualization tools like Grafana use these structures to create time-series graphs, bar charts, and heatmaps, helping you monitor and make sense of the raw data collected over time.
How Things Go Wrong
Misconfigurations and bad practices can lead to blind spots. Here are some common mistakes:
- Only monitoring averages - Averages hide spikes. Use percentiles instead (e.g., p95 latency instead of mean latency). A system with an average response time of 200ms might still have p99 requests taking several seconds, as the sketch after this list illustrates.
- Ignoring connection limits - Many assume high CPU means a busy server, but often it’s connection saturation causing slowdowns. If your server reaches its max connections, new requests get queued or dropped. Monitoring this prevents hard-to-debug availability issues.
- Alerting on raw request counts - More traffic isn't always bad; rate-based alerts (e.g., error rate > 5%) are better. If a sudden traffic increase also brings a rise in 5xx errors, that's a red flag.
- Forgetting to clean up old metrics - Unused labels can bloat time-series databases, slowing down queries. A metric with thousands of unique labels (e.g., per-user tracking) can make querying painfully slow.
- No alert for missing data - If your monitoring stops reporting, that’s a failure in itself. A crashed exporter or a broken pipeline can lead to silent blind spots unless you explicitly check for data gaps.
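To put the first point in numbers, here's a toy sketch (synthetic data, not a real workload) where the mean looks healthy while the tail clearly is not:

```python
import random
import statistics

# 1,000 requests: 99% hover around 200 ms, 1% take 4 seconds.
latencies_ms = [random.gauss(200, 20) for _ in range(990)] + [4000.0] * 10

mean = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # 99th percentile

print(f"mean: {mean:.0f} ms")   # roughly 240 ms, looks fine on a dashboard
print(f"p99:  {p99:.0f} ms")    # close to 4000 ms, the tail is suffering
```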
Final Thoughts
Good monitoring isn't just about collecting numbers; it's about knowing which numbers matter. Go beyond defaults, visualize wisely, and keep an eye on what isn't being monitored. Your future self will thank you when a weird Tuesday bug pops up again.