The average company wastes 35% of its Kubernetes cloud spend. It's not because teams are careless — it's because Kubernetes makes it remarkably easy to over-provision resources and remarkably hard to know you're doing it. After optimizing infrastructure for over 100 organizations, from seed-stage startups to public companies, we've distilled our approach down to five high-impact strategies that consistently deliver 30–45% savings.

These aren't theoretical best practices pulled from documentation. They're the exact playbook we run during the first two weeks of every client engagement. Let's break them down.

1. Right-Size Your Resource Requests and Limits

This is the single biggest source of waste in virtually every Kubernetes cluster we audit. Here's what happens: a developer sets up a new deployment, copies resource values from a Stack Overflow answer or an old team template, and ships it. The application runs fine, so nobody revisits those numbers. Meanwhile, the cluster is reserving four times the CPU and eight times the memory that the workload actually needs.

Understanding the difference between requests and limits is critical. Requests are what the scheduler uses to place your pod on a node — they're a guaranteed reservation. Limits are the ceiling your container can hit before it gets throttled (CPU) or killed (memory). When your requests are set far above actual usage, the scheduler reserves capacity that sits idle, and you pay for nodes that are running at 15% utilization.

```yaml
# Before: Over-provisioned (copy-pasted from Stack Overflow)
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"

# After: Right-sized based on actual usage metrics
resources:
  requests:
    cpu: "150m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```

That's a real example from a client's API service. The original requests were consuming a full vCPU and 2 GiB per replica — across 12 replicas, that's 12 vCPUs and 24 GiB reserved for a service that peaked at 1.2 vCPUs total. The fix alone saved them over $2,800/month.

Tools we recommend: Start with kubectl top pods for a quick baseline, then deploy Kubecost for cost attribution, Goldilocks for VPA-based recommendations, and the Vertical Pod Autoscaler in recommendation mode. Never trust the defaults — measure, then set.
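To make that "measure, then set" loop concrete, here's a minimal sketch of running the Vertical Pod Autoscaler in recommendation-only mode. It assumes the VPA CRDs are installed in your cluster and targets a hypothetical Deployment named `api`:

```yaml
# Sketch: VPA in recommendation-only mode.
# Assumes VPA is installed; "api" is a placeholder Deployment name.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize pods
```

With `updateMode: "Off"`, the VPA records CPU and memory recommendations without touching running pods; `kubectl describe vpa api-vpa` surfaces the suggested requests, which you can then apply by hand.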

2. Implement Cluster Autoscaler + Karpenter

Static node pools are the enemy of cost efficiency. If you're running a fixed number of nodes sized for peak traffic, you're paying peak prices 24 hours a day for traffic that probably peaks for two. The solution is dynamic node provisioning, and there are two main approaches: the traditional Cluster Autoscaler and AWS's newer Karpenter.

Cluster Autoscaler works at the node group level. It watches for pods that can't be scheduled due to insufficient resources, then scales up the appropriate node group. It's battle-tested and works across all major cloud providers, but it's limited by the instance types you've pre-configured in your node groups. If you defined your group with m5.xlarge instances, that's what you get — even if a c6g.large would be cheaper and more appropriate.

Karpenter takes a fundamentally different approach. Instead of scaling pre-defined node groups, it provisions exactly the right instance type for the pending workload in real-time. It evaluates the CPU, memory, and architecture requirements of your unschedulable pods and selects from the full catalog of available instances. The result is significantly better bin-packing: fewer nodes, less wasted capacity, and lower bills. In our experience, clients who migrate from static node groups to Karpenter see an additional 15–25% reduction in compute costs on top of right-sizing gains.

The tradeoff is that Karpenter is currently AWS-specific (though GCP and Azure equivalents are maturing). If you're multi-cloud, Cluster Autoscaler with well-tuned node groups is still the right call. Either way, static is the enemy.
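As an illustration of how Karpenter selects from a broad instance catalog, here's a minimal NodePool sketch, assuming Karpenter v1 on AWS and an existing `EC2NodeClass` named `default` (both names are placeholders for your own setup):

```yaml
# Sketch: a flexible Karpenter NodePool (assumes Karpenter v1 on AWS).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        # Leave instance type open so Karpenter can bin-pack optimally
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default   # assumed to exist
  limits:
    cpu: "100"          # hard cap on total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

The key design choice is what's *not* there: no instance-type list. By constraining only architecture and capacity type, Karpenter is free to pick the cheapest instance that fits the pending pods.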

3. Spot and Preemptible Instances for Stateless Workloads

Spot instances offer up to 70% savings on compute compared to on-demand pricing. The catch is that the cloud provider can reclaim them with as little as two minutes' notice. For many workloads, that's not a problem — it's an opportunity.

Workloads that are safe for spot instances include: CI/CD runners (a terminated build simply retries), batch processing jobs (checkpoint your progress and resume), and stateless API replicas behind a load balancer with proper Pod Disruption Budgets. The key requirement is that your application handles graceful shutdown and that you're running enough replicas to absorb the loss of any single instance.

```yaml
# Node affinity for spot instances
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: "karpenter.sh/capacity-type"
              operator: In
              values:
                - "spot"
```

We use preferredDuringSchedulingIgnoredDuringExecution rather than required so that if spot capacity is temporarily unavailable, pods fall back to on-demand nodes rather than staying unscheduled. Combined with Pod Disruption Budgets that ensure at least 75% of replicas remain available during voluntary disruptions, this gives you the cost savings of spot with the reliability your SLAs demand.
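The Pod Disruption Budget half of that pairing might look like the following sketch, where the `app: api` label selector is a placeholder for your own service's labels:

```yaml
# Sketch: keep at least 75% of replicas up during voluntary disruptions
# (node drains, spot consolidation). "app: api" is a placeholder selector.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 75%
  selector:
    matchLabels:
      app: api
```

Note that a PDB only constrains *voluntary* disruptions; a spot reclamation is involuntary, which is why you still need enough replicas to absorb the loss of any single node.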

4. Namespace-Level Resource Quotas and LimitRanges

Without guardrails, a single team can consume unbounded cluster resources. We've seen a single runaway integration test consume 48 vCPUs for six hours because there was no quota in place. Resource Quotas and LimitRanges are Kubernetes-native primitives that prevent this class of cost explosion.

A ResourceQuota caps the total resources a namespace can consume. A LimitRange sets default and maximum values for individual containers, so that even if a developer forgets to specify resource requests, sensible defaults are applied automatically.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: "16Gi"
    limits.cpu: "16"
    limits.memory: "32Gi"
    pods: "40"
```
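A companion LimitRange for the same namespace might look like this sketch (the specific default values are illustrative, not a recommendation — derive yours from measured usage):

```yaml
# Sketch: per-container defaults and ceilings for team-alpha.
# Values are illustrative placeholders.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
    - type: Container
      defaultRequest:      # applied as requests when a container omits them
        cpu: "100m"
        memory: "128Mi"
      default:             # applied as limits when a container omits them
        cpu: "500m"
        memory: "512Mi"
      max:                 # hard per-container ceiling
        cpu: "2"
        memory: "4Gi"
```

Together, the quota caps the namespace's total footprint while the LimitRange guarantees that every individual container lands inside it with sane values, even when developers forget to set any.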

This isn't just about cost control — it's about cluster stability. Without quotas, one team's misconfigured deployment can starve other teams' workloads of resources. We implement quotas as a standard part of our namespace provisioning workflow, typically managed through Terraform or Crossplane so they're version-controlled and auditable.

5. Scheduled Scaling and Non-Production Shutdown

Here's a question we ask every new client: do your dev, staging, and QA environments run 24/7? The answer is almost always yes. The follow-up: does anyone use them between 8 PM and 8 AM, or on weekends? The answer is almost always no.

Non-production environments typically account for 40–60% of total cluster costs, and they're idle for roughly two-thirds of the week. By scheduling these environments to scale down during off-hours — either to zero replicas or to a minimal footprint — you can reclaim 60–70% of non-production spend.

The simplest approach is Kubernetes CronJobs that patch deployment replica counts on a schedule. For more sophisticated setups, KEDA (Kubernetes Event-Driven Autoscaling) can scale workloads based on actual demand signals: queue depth, HTTP request rate, cron schedules, or custom metrics. We pair this with Karpenter so that nodes themselves are deprovisioned when workloads scale down — there's no point scaling pods to zero if the underlying nodes keep running.
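For the KEDA route, the cron scaler expresses business hours declaratively. Here's a sketch assuming KEDA is installed and targeting a hypothetical `api` Deployment in a `staging` namespace (timezone and schedule are placeholders):

```yaml
# Sketch: KEDA cron scaler — run 3 replicas during business hours,
# scale to zero otherwise. Assumes KEDA is installed; names are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: staging-api-hours
  namespace: staging
spec:
  scaleTargetRef:
    name: api                      # placeholder Deployment
  minReplicaCount: 0               # fully off outside the window
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5         # scale up at 8 AM, weekdays
        end: 0 20 * * 1-5          # scale down at 8 PM, weekdays
        desiredReplicas: "3"
```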

Quick win: For most teams, a simple CronJob that runs kubectl scale at 8 PM and 8 AM local time delivers 90% of the value with near-zero implementation effort. Start there, graduate to KEDA when you need event-driven precision.
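That quick-win CronJob might be sketched as follows — note it needs a ServiceAccount with RBAC permission to patch Deployments in the target namespace, and that the names here (`dev`, `scaler`) are placeholders:

```yaml
# Sketch: scale every Deployment in the dev namespace to zero at 8 PM
# on weekdays. A mirror CronJob would scale back up at 8 AM.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
  namespace: dev
spec:
  schedule: "0 20 * * 1-5"
  timeZone: "America/New_York"     # requires Kubernetes 1.27+
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # needs RBAC to patch deployments
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - scale
                - deployment
                - --all
                - --replicas=0
                - -n
                - dev
```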

Client Results

- 40% — average client savings
- 2 weeks — typical payback period
- 0 — performance degradation

Cost Optimization Is a Practice, Not a Project

These five strategies — right-sizing, autoscaling, spot instances, resource quotas, and scheduled scaling — are not one-time fixes. Cloud infrastructure drifts. Teams ship new services. Traffic patterns change. The configurations that were optimal three months ago may be wasting 20% of your budget today.

That's why we build continuous cost observability into every engagement from day one. Automated alerts when resource utilization drops below thresholds. Monthly cost reviews with actionable recommendations. Kubecost dashboards broken down by team, namespace, and workload. The goal isn't just to cut your bill once — it's to make sure it never inflates silently again.

If your Kubernetes bill has been climbing and you're not sure where the waste is, we can usually identify the top savings opportunities within a single 30-minute call. No pitch, no commitment — just a clear picture of what's possible.