Right-Sizing Kubernetes: Stop Paying for Idle

Requests, limits, autoscaling, and bin-packing. A pragmatic guide to cutting Kubernetes cost without trading away reliability.

UVExcel Tech · 18 Mar 2026 · 11 min read

Most Kubernetes bills are not large because Kubernetes is expensive; they are large because clusters are running mostly empty. Industry surveys repeatedly find that the majority of provisioned CPU sits idle, and the reason is human, not technical: engineers set generous resource requests to avoid getting paged, nobody revisits them, and the scheduler dutifully provisions nodes to satisfy reservations that bear no relationship to actual usage. Right-sizing is the work of closing the gap between what you reserve and what you use, without reintroducing the instability everyone was over-provisioning to avoid.

Requests are the bill; limits are the guardrail

The single most important concept is that requests, not limits, drive cost. A pod's CPU and memory requests are what the scheduler reserves and what determines how many nodes you need. If every pod requests far more than it uses, the scheduler packs fewer pods per node and provisions more nodes than the workload requires. The fix is to set requests from observed usage — a high percentile such as p95 of real consumption, plus modest headroom — rather than from a nervous guess. Tighten requests and the same workload bin-packs onto fewer nodes, and the bill falls without touching the application.
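As a rough illustration of setting a request from observed usage, the sketch below takes a nearest-rank p95 of CPU samples, adds percentage headroom, and estimates how many nodes a deployment needs before and after tuning. The function names, the 15% headroom, the 3900m allocatable figure, and the sample data are all assumptions for illustration. The node estimate treats CPU requests as the only packing constraint, ignoring memory, DaemonSets, and system overhead.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding surprises."""
    return -(-a // b)

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of integer samples."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, ceil_div(pct * len(ordered), 100))  # 1-based rank
    return ordered[rank - 1]

def recommended_request(samples_millicores, pct=95, headroom_pct=15):
    """Request = chosen percentile of observed usage plus percentage headroom."""
    observed = percentile(samples_millicores, pct)
    return ceil_div(observed * (100 + headroom_pct), 100)

def nodes_needed(request_millicores, replicas, allocatable_millicores):
    """Nodes required if CPU requests were the only packing constraint."""
    pods_per_node = allocatable_millicores // request_millicores
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on one node")
    return ceil_div(replicas, pods_per_node)

# Hypothetical usage samples: 1..100 millicores, so the p95 is 95m.
usage = list(range(1, 101))
tuned = recommended_request(usage)  # 95m plus 15% headroom, rounded up
guess_nodes = nodes_needed(500, replicas=40, allocatable_millicores=3900)
tuned_nodes = nodes_needed(tuned, replicas=40, allocatable_millicores=3900)
print(tuned, guess_nodes, tuned_nodes)  # prints: 110 6 2
```

In this made-up case a 500m guess needs six nodes for 40 replicas, while the usage-derived 110m request fits the same replicas on two. Real clusters will land somewhere less dramatic once memory and per-node overhead are counted.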

Limits are a different lever and deserve care. A memory limit is a hard ceiling: a container that exceeds it is OOM-killed, so memory limits should exist and should be set with real headroom. CPU limits are more contentious. Because CPU is compressible, exceeding a CPU limit does not kill anything; it throttles a pod that could otherwise have used spare capacity on the node, often hurting latency for no reliability gain. A common, defensible pattern is to set memory requests and limits, set CPU requests, and omit CPU limits where it is safe, letting bursty workloads use idle cycles and improving effective packing.
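One way to make that pattern the default is to generate the container `resources` stanza from a small helper. The sketch below is a minimal example, assuming a hypothetical helper name and illustrative values; the dictionary keys mirror the `requests` and `limits` fields of a Kubernetes container spec.

```python
def container_resources(cpu_request, memory_request, memory_limit, cpu_limit=None):
    """Build a container `resources` mapping.

    Memory always gets a request and a limit. CPU always gets a request,
    and only gets a limit when the caller explicitly asks for one.
    """
    resources = {
        "requests": {"cpu": cpu_request, "memory": memory_request},
        "limits": {"memory": memory_limit},
    }
    if cpu_limit is not None:
        resources["limits"]["cpu"] = cpu_limit
    return resources

# Illustrative values only; derive real ones from observed usage.
burstable = container_resources("250m", "512Mi", "768Mi")
capped = container_resources("250m", "512Mi", "768Mi", cpu_limit="1")
print(burstable)
print(capped)
```

A pod without a CPU limit lands in the Burstable QoS class rather than Guaranteed, which is part of what "where it is safe" has to account for.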

Right-size before you autoscale. Autoscaling an over-provisioned workload just multiplies the waste — you scale out copies of pods that were too big to begin with.

Make the autoscalers cooperate

The Kubernetes ecosystem has three kinds of autoscaler, and they solve different problems. The Horizontal Pod Autoscaler (HPA) adds and removes pod replicas in response to demand. The Vertical Pod Autoscaler (VPA) tunes the requests of individual pods toward observed usage; it is best run in recommendation-only mode first, with its suggestions applied through your normal rollout process. Node autoscaling, via Cluster Autoscaler or the faster and more flexible Karpenter, adjusts the number and type of nodes so capacity tracks demand and idle nodes are consolidated away.
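To make the horizontal case concrete: the Kubernetes documentation gives the core HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below is a simplified rendering of that formula with a min/max clamp. It deliberately omits the controller's tolerance band, stabilization windows, and handling of unready pods, and the function and parameter names are assumptions.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA-style replica calculation, clamped to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# 4 replicas averaging 90% CPU utilization against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))  # prints: 6
```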

The subtlety is that these can fight each other. Running VPA and HPA against the same metric produces oscillation. Aggressive bin-packing leaves no headroom for VPA to grow a pod, so a resize triggers a disruptive eviction. Tight packing can also place a noisy, bursty workload next to a latency-sensitive one. The workable approach is to give each autoscaler a clear lane: HPA for variable-traffic replica counts, VPA for steady or batch workloads, node autoscaling for capacity — and to leave deliberate headroom so consolidation does not become thrashing.
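A lightweight guard for the "clear lanes" rule is to flag any workload where HPA and VPA act on the same resource. The sketch below works on a hypothetical in-memory inventory; the `Workload` class and its fields are assumptions for illustration, not a Kubernetes API, and in practice the inventory would be built from your own HPA and VPA objects.

```python
from dataclasses import dataclass, field

@dataclass
class Workload:
    name: str
    # Resources the HPA scales on, e.g. {"cpu"} (hypothetical field).
    hpa_metrics: set = field(default_factory=set)
    # Resources VPA is allowed to resize (hypothetical field).
    vpa_resources: set = field(default_factory=set)

def autoscaler_conflicts(workloads):
    """Map workload name -> resources that both HPA and VPA act on."""
    conflicts = {}
    for w in workloads:
        overlap = w.hpa_metrics & w.vpa_resources
        if overlap:
            conflicts[w.name] = sorted(overlap)
    return conflicts

inventory = [
    Workload("api", hpa_metrics={"cpu"}, vpa_resources={"cpu", "memory"}),
    Workload("batch", vpa_resources={"cpu", "memory"}),
    Workload("web", hpa_metrics={"cpu"}),
]
print(autoscaler_conflicts(inventory))  # prints: {'api': ['cpu']}
```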

The levers beyond requests

Run fault-tolerant, stateless, and batch workloads on Spot capacity with proper interruption handling; keep databases and stateful services on on-demand.
Consider ARM-based instances for compatible workloads — they frequently deliver a better price-to-performance ratio.
Use namespace ResourceQuotas and LimitRanges to set sane defaults and stop a single team from hoarding the cluster.
Scale non-production environments down on nights and weekends; idle staging is pure waste.
Eliminate avoidable cross-zone traffic for chatty services — inter-zone data transfer is a quiet, recurring cost.
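The nights-and-weekends lever is easy to size with simple arithmetic. The sketch below assumes a hypothetical schedule where non-production nodes run 12 hours a day, five days a week, and scale to zero otherwise. It counts node-hours only, so storage, control-plane, and other fixed charges are out of scope.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def offhours_saving_fraction(hours_per_day, days_per_week):
    """Fraction of weekly node-hours avoided by running only on schedule."""
    if not (0 <= hours_per_day <= 24 and 0 <= days_per_week <= 7):
        raise ValueError("schedule must fit within a week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK

saving = offhours_saving_fraction(hours_per_day=12, days_per_week=5)
print(f"{saving:.0%}")  # prints: 64%
```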

Treat it as continuous, not a one-off

The painful truth is that a single optimization sprint does not hold. New services launch with default requests nobody tunes, traffic patterns drift, and three to six months later the bill has quietly climbed back. Right-sizing is a standing practice: instrument actual usage, review requests against it on a cadence, and make sensible defaults the path of least resistance so new workloads start efficient. Teams that do this routinely cut compute spend substantially — often by a third or more — while improving stability, because right-sized pods schedule predictably and fail less.
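The recurring review can start as a small report that compares each workload's request with its observed p95 and flags the ones that have drifted. The sketch below assumes a hypothetical mapping of workload name to (request, p95 usage) in millicores and an arbitrary 1.5x threshold; where the numbers come from depends on your metrics stack.

```python
def overprovisioned(workloads, max_ratio=1.5):
    """Return (name, request, p95) rows whose request exceeds max_ratio * p95.

    Rows are sorted by reclaimable millicores, largest first. Workloads with
    zero observed usage are always flagged.
    """
    flagged = []
    for name, (request, p95) in workloads.items():
        if p95 == 0 or request / p95 > max_ratio:
            flagged.append((name, request, p95))
    flagged.sort(key=lambda row: row[1] - row[2], reverse=True)
    return flagged

# Illustrative numbers only.
report = overprovisioned({
    "api": (1000, 200),
    "worker": (500, 450),
    "cron": (250, 100),
})
for name, request, p95 in report:
    print(f"{name}: requests {request}m, p95 usage {p95}m")
```

Sorting by reclaimable capacity puts the largest savings at the top, so each review starts with the workloads that matter most.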

Key takeaways

Requests drive cost; set them from observed p95 usage plus headroom, not from anxiety.
Set memory requests and limits, set CPU requests, and consider omitting CPU limits to reduce throttling and improve packing.
Give HPA, VPA, and node autoscaling clear lanes so they cooperate instead of oscillating.
Right-size before autoscaling, and treat it as a continuous practice — savings decay without maintenance.
