When Someone Else's Problem Becomes Your Solution
Two years of Kubernetes operational pain, self-blame, and a three-day migration to ECS that should have happened sooner
The Situation
Three years ago, I adopted Kubernetes for a project running about 20 containers on AWS. I didn't evaluate whether it fit the problem. Well, I thought I did, but I never stopped to reconsider once the team was invested.
The hosting fees were the most visible symptom. Six EKS control planes across three clusters came to $438/month just for the management layer, more than the actual compute they managed. But the real cost never appeared on an invoice: four hours per week of architect time on resource tuning, full weeks lost to upgrade cycles, customer-facing services vanishing during scaling events, and the steady drain of features that didn't ship because IT leadership was consumed by infrastructure operations.
When the Doubt Started
The first real crack appeared during a Kubernetes minor version upgrade. Again, a minor version upgrade. The new version introduced breaking changes with the standard agents that EKS had previously installed on our clusters. Some of those agents were installed automatically by EKS; others we had installed manually. The result was an inconsistent management surface where AWS controlled some components and we controlled others, and the version upgrade put them at odds.
The fix wasn't a patch; it required rebuilding entire clusters from scratch, for every cluster we ran.
I could assign blame three ways: Kubernetes itself, AWS EKS's abstraction layer, and our own gaps in understanding. But that realization was itself the point. If running a container orchestration platform requires deep expertise in both the orchestration layer and the cloud provider's specific abstraction of it, the solution is not sustainable for a team our size.
Compare this to virtually any other cloud service: you set it up once and walk away. Upgrades are usually automatic, and when they're not, they're simple and predictable. A container orchestration solution should not behave like a JavaScript project with a hundred npm packages, where every version bump creates cascading compatibility issues across layers you don't fully control.
The Operational Tax
Scaling That Caused Outages
Kubernetes scaling signals never properly integrated with the cloud-native infrastructure underneath EKS. The speed of K8s scaling decisions was lost in translation to the underlying AWS layer, and during node scale-up and scale-down events, K8s would initiate changes without respecting Pod Disruption Budgets. At least N replicas of a service were supposed to stay running; in practice, they didn't.
Critical customer-facing processes would simply vanish. Not degrade. Vanish.
Before anyone asks: yes, we tuned the health checks, scale-down delays, and scale-up delays to match each container's startup and readiness characteristics. That's a common source of scaling issues and we got it right. It didn't matter. The disruption budgets were still not enforced during node-level scaling events. A platform that occasionally kills the services it's supposed to protect has a fundamental reliability problem regardless of how often it happens.
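For readers who have not worked with disruption budgets, here is a minimal Python sketch of what one declares and the rule it is meant to enforce. The service name `checkout`, the budget name, and the replica counts are hypothetical, not taken from our clusters; the manifest fields follow the `policy/v1` PodDisruptionBudget schema.

```python
import json


def pod_disruption_budget(name: str, app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Voluntary evictions (for example, a node drain during scale-down)
            # should be refused if they would drop below this many pods.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """The rule the budget expresses: evicting one pod must keep the floor."""
    return healthy_pods - 1 >= min_available


# Hypothetical service: keep at least 2 "checkout" pods running at all times.
pdb = pod_disruption_budget("checkout-pdb", "checkout", min_available=2)
print(json.dumps(pdb, indent=2))
print(eviction_allowed(healthy_pods=3, min_available=2))
print(eviction_allowed(healthy_pods=2, min_available=2))
```

The `eviction_allowed` helper is only an illustration of the intended guarantee, not the Kubernetes implementation. The behavior described above is the case where that guarantee did not hold during node-level scaling.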
The workaround was to increase the minimum number of replicas per service, which defeats the entire purpose of auto-scaling. I never found a real fix. EKS requires significantly more operational overhead than GKE or AKS, with less pre-configured automation and more manual configuration for exactly these integration points.
The Time Tax
About four hours per week went to resource allocation tuning: constant small adjustments to CPU and memory allocations per service and per node group. Load testing beforehand was never enough to define optimal defaults; it took weeks of production issues and incremental tweaking to dial in each container.
Each K8s upgrade consumed a full week of sustained research and testing. There was no budget to spin up a parallel environment, so the dev environment served double duty, blocking the rest of the team during upgrade cycles. And again, this was a "managed" EKS solution.
The third cost was version monitoring. Staying current on upcoming K8s changes was mandatory because falling behind meant the cloud provider would charge additional fees for running unsupported releases. Of course, I track upcoming changes for all AWS tools we use, but the difference is that other tools do not pull the rug from under me and run me over with a bus loaded with documents telling me the pain is all in my head.
Why It Took So Long
Two things delayed the migration even after the doubt set in.
First, feature priorities. Certain capabilities had to ship before we could absorb the risk of a migration; the business couldn't wait while we rearchitected infrastructure.
Second, and more honestly, denial. Kubernetes was everywhere: conference talks, blog posts, hiring trends, job descriptions. It felt safer to go with the crowd than to question whether the tool fit our context. Admitting that K8s was wrong felt like admitting incompetence rather than recognizing a mismatch. To a degree, I suppose you might call this resume-driven development. We wanted to learn what others knew, never mind that most of us share the same ignorance.
The Migration
After wasting yet another full week trying to push through a K8s upgrade, the team just wanted to be done. The conversation with leadership was straightforward: operational cost and opportunity cost, more money saved and more features delivered. Every layer of the company had felt the pain by proxy, watching IT leadership consumed by infrastructure operations instead of strategic work.
The migration to AWS ECS Fargate took three days. We moved one service at a time, standing it up on ECS and verifying stability before touching the next. Each K8s service stayed running until its ECS counterpart was confirmed healthy. The cutover was simple AWS plumbing: DNS A records pointing to new ALBs, each ALB routing to ECS target groups. We stood up separate ALBs for each cluster boundary (public services, private services, and workers) so each group could be migrated and validated independently.
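The cutover pattern can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual tooling: the `Service` fields, the example hostnames, the placeholder zone ID, and the `is_healthy` and `apply_change` callbacks are all hypothetical. The change batch follows the shape Route 53 expects for an alias A record.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str          # hypothetical service name
    dns_name: str      # record that clients resolve
    alb_dns_name: str  # DNS name of the new ALB in front of ECS
    alb_zone_id: str   # the ALB's canonical hosted zone ID


def alias_upsert(service: Service) -> dict:
    """Route 53 change batch that repoints an A record at the new ALB."""
    return {
        "Comment": f"Cut {service.name} over to ECS",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": service.dns_name,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": service.alb_zone_id,
                    "DNSName": service.alb_dns_name,
                    "EvaluateTargetHealth": True,
                },
            },
        }],
    }


def migrate(services, is_healthy, apply_change):
    """Move one service at a time; stop at the first unhealthy ECS service."""
    migrated = []
    for service in services:
        if not is_healthy(service):
            break  # DNS is untouched, so the K8s service keeps serving
        apply_change(alias_upsert(service))
        migrated.append(service.name)
    return migrated


# Dry run with stub callbacks and placeholder values.
services = [
    Service("public-api", "api.example.com.", "public-alb.example.com.", "ZEXAMPLE"),
    Service("workers", "workers.example.com.", "workers-alb.example.com.", "ZEXAMPLE"),
]
applied = []
done = migrate(services, is_healthy=lambda s: True, apply_change=applied.append)
print(done)
```

In a real run, `apply_change` would presumably wrap a Route 53 call such as boto3's `change_resource_record_sets`, and `is_healthy` would check the ECS target group. Because nothing is repointed until the health check passes, the old service remains the fallback at every step.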
Task definitions replaced Kubernetes manifests, AWS networking replaced custom overlays, and the application code didn't change at all.
The one friction point was ECS task definitions. There's no separation between pod configuration and container image configuration the way K8s has it; everything is collapsed into the task definition. Getting this to work well in CI/CD took some rework, but it was manageable.
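One way to handle this in a pipeline is to keep a base task definition in the repository and render a copy with the new image on each build. The sketch below assumes that approach; it is not necessarily the rework we did, and the family name, container name, registry URL, and sizing values are hypothetical. The field names match the ECS task definition format.

```python
import copy
import json

# Hypothetical base task definition kept under version control.
BASE_TASK_DEFINITION = {
    "family": "example-service",
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",
    "memory": "512",
    "containerDefinitions": [{
        "name": "app",
        "image": "PLACEHOLDER",  # replaced on every build
        "essential": True,
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    }],
}


def render_task_definition(base: dict, container: str, image: str) -> dict:
    """Return a copy of the task definition with one container's image replaced."""
    rendered = copy.deepcopy(base)  # never mutate the checked-in base
    for definition in rendered["containerDefinitions"]:
        if definition["name"] == container:
            definition["image"] = image
            return rendered
    raise KeyError(f"no container named {container!r} in task definition")


task_def = render_task_definition(
    BASE_TASK_DEFINITION, "app", "registry.example.com/example-service:1.4.2"
)
print(json.dumps(task_def, indent=2))
```

The rendered dict would then be registered as a new task definition revision and the ECS service updated to point at it. Sizing, networking, and image all travel together in that one document, which is exactly the coupling described above.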
CloudFormation stacks made the infrastructure side straightforward: destroy and create, deterministic every time. This was the sharpest contrast with EKS, where third-party component versions and schema conflicts had made every infrastructure change unpredictable.
No rollback moments, no close calls, zero issues. Two years of complexity replaced in three days by something that just worked.
Results
We were running six EKS control planes at $0.10/hour each, roughly $438/month for the management layer. That jumps to $0.60/hour per cluster (roughly $2,628/month total on the same 730-hour month) if you fall behind onto an extended-support K8s version. ECS has no equivalent fee, and at our scale the management layer was the dominant infrastructure expense.
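The arithmetic behind those control plane figures is easy to reproduce. This sketch assumes a 730-hour month (8,760 hours per year divided by 12), which is the basis that yields $73 per cluster at the standard rate; the function and constant names are mine.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12
CLUSTERS = 6
STANDARD_RATE = 0.10   # USD per cluster-hour, standard support
EXTENDED_RATE = 0.60   # USD per cluster-hour, extended support


def monthly_control_plane_cost(clusters: int, hourly_rate: float,
                               hours: int = HOURS_PER_MONTH) -> float:
    """Control plane fee per month, rounded to cents."""
    return round(clusters * hourly_rate * hours, 2)


standard = monthly_control_plane_cost(CLUSTERS, STANDARD_RATE)
extended = monthly_control_plane_cost(CLUSTERS, EXTENDED_RATE)
print(f"standard support: ${standard:,.2f}/month")
print(f"extended support: ${extended:,.2f}/month")
```

Using a 720-hour (30-day) month instead gives $432 and $2,592 respectively. Either way, extended support multiplies the management fee by six.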
| Metric | Before (K8s/EKS) | After (ECS Fargate) |
|---|---|---|
| Monthly control plane cost | ~$438 (6 clusters × $73) | $0 |
| Monthly compute cost | Nearly identical | Nearly identical |
| Architect time on infrastructure ops | ~4 hrs/week + upgrade weeks | Near zero |
| Disruption budget violations | Recurring, never resolved | Zero, ever |
| Customer-facing outages from scaling | Regular | None |
| Spot instance utilization | Difficult, limited | Easy, effective |
| Delivered project value | Baseline | ~20% increase in expected value score |
ECS Fargate has never once violated a disruption budget or caused an outage, and zero configuration was required to achieve this. Spot instances also became much easier and more effective to use, pushing costs down further. The human cost made the case even more overwhelming: four hours per week of architect time reclaimed, full upgrade weeks eliminated, and leadership available for strategic work instead of cluster operations.
Early criticism of Fargate centered on cost, and when it launched that was accurate. AWS cut Fargate prices by up to 50% in 2019 and has since added Fargate Spot (up to 70% savings), Compute Savings Plans (up to 50%), and Graviton support at 20% lower cost. For teams without dedicated platform engineers, the total cost of ownership comparison no longer favors self-managed Kubernetes.
With that said, I do miss kubectl apply. YAML-based declarative changes made versioned service updates genuinely ergonomic, even compared to CloudFormation. But the golden violin wasn't worth the deal with the devil.
Key Lessons
1. Tool selection requires honest requirements analysis
Before adopting Kubernetes, I should have asked five questions honestly: Do we need multi-cloud? Are we running hundreds of microservices? Do we have dedicated platform engineers? Is the operational overhead justified? Do we need fine-grained orchestration control? The answer to every one was no. That analysis should have happened before we committed, not two years in.
2. Self-blame delays obvious decisions
The denial period cost us at least a year. We assumed the problem was our technical competence rather than our tool selection, an assumption the internet reinforces enthusiastically. Talking to other teams who had lived through the same experience was what finally broke through it.
3. Managed cloud services exist to eliminate operational overhead
Kubernetes is not built for that goal; it's built for scale so large that even tiny resource optimizations are material, with dedicated platform teams to absorb the operational complexity. A tool built for someone else's problem at someone else's scale becomes your problem at your scale.
4. The largest cost of the wrong tool rarely appears on an invoice
Four hours per week of architect time, full upgrade weeks, and leadership capacity diverted from strategy to operations don't show up as line items until you add them up yourself. We didn't just ship slower; we shipped fewer projects and lower-value projects for two years.
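Adding them up takes only a few lines. In this sketch the four hours per week comes from the figures above; the 40-hour upgrade week and the number of upgrades per year are assumptions to replace with your own numbers, and the example call uses a purely illustrative count of two upgrades.

```python
WEEKS_PER_YEAR = 52
HOURS_PER_UPGRADE_WEEK = 40  # assumption: one "full week" is 40 working hours


def annual_ops_hours(tuning_hours_per_week: float, upgrades_per_year: int) -> float:
    """Yearly hours spent on tuning plus upgrade weeks."""
    tuning = tuning_hours_per_week * WEEKS_PER_YEAR
    upgrades = upgrades_per_year * HOURS_PER_UPGRADE_WEEK
    return tuning + upgrades


# Illustrative only: 4 hours/week of tuning and a hypothetical 2 upgrades/year.
hours = annual_ops_hours(tuning_hours_per_week=4, upgrades_per_year=2)
print(f"{hours:.0f} hours per year")
```

This counts only the time that can be measured directly. Leadership attention and the features that never shipped are harder to put a number on.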