The hidden cost of 'right' decisions: what 4 years of infrastructure teaches about trade-offs
Every infrastructure decision is a bet on the future. After watching teams make the same mistakes across multiple startups, here's what actually matters when choosing your stack.
A startup infrastructure lead just published their 4-year retrospective of what worked and what didn't. Reading it reminded me of every "we should have..." conversation I've had with teams who scaled from zero to production load.
The piece is valuable not because it's a blueprint (your context is different), but because it shows something most infrastructure guides miss: every decision is a trade-off, and you won't know which trade-offs matter until you're already committed.
The pattern no one talks about
Here's the truth about infrastructure decisions: the "right" choice at 10 engineers becomes the bottleneck at 100. The author regrets not using Lambda more, endorses EKS over ECS, and wishes they'd adopted OpenTelemetry earlier.
But notice what's missing from that list: there's no universal "correct" answer. The author's regrets are specific to their growth trajectory, their team's expertise, and their product's demands.
This is the gap in most infrastructure advice. People share their conclusions without sharing the constraints that made those conclusions correct.
The real cost of managed services
The author endorses RDS with a line that hits hard: "You lose your network: that's downtime. You lose your data: that's a company ending event."
I've seen this play out. A team I worked with tried to save $800/month by self-hosting Postgres. Six months later, a botched migration cost four engineers two days each. Do the math: at a $150k average salary, that's roughly $4,800 in direct cost, plus the opportunity cost of delayed features. Six months of "savings" erased in one incident.
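The arithmetic is worth writing down, because "an engineer-day" hides a lot: whether an incident cost two days each or two days total changes the figure severalfold. A back-of-the-envelope sketch, assuming 250 working days a year and base salary only (fully loaded cost runs higher):

```python
# Back-of-the-envelope incident cost. Assumptions (mine, not the source's):
# 250 working days/year, base salary only. Loaded cost is typically 1.3-1.5x.
WORKING_DAYS_PER_YEAR = 250

def incident_cost(avg_salary: float, engineers: int, days_each: float) -> float:
    """Direct salary cost of an incident, ignoring opportunity cost."""
    daily_rate = avg_salary / WORKING_DAYS_PER_YEAR
    return daily_rate * engineers * days_each

# Four engineers losing two days each at a $150k average salary:
cost = incident_cost(150_000, engineers=4, days_each=2)
print(f"${cost:,.0f}")  # $4,800
```

Against an $800/month hosting saving, one incident like this wipes out half a year of the difference, before you count the features that didn't ship.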
But here's the thing: managed services buy you time, not correctness. If your team doesn't understand database indexing, RDS won't fix your slow queries. It just moves the problem from "server crashed" to "queries timing out."
The decision isn't "managed vs. self-hosted." It's "where do we want to spend our debugging time?"
When "best practices" become technical debt
The author regrets EKS managed add-ons. They started with them because it felt like "the right way," then had to rip them out when they needed customization.
This is the trap of following best practices without understanding their assumptions. Managed add-ons work great if you fit AWS's expected use case. The moment you need to tweak CPU requests or change a ConfigMap, you're fighting the abstraction.
I've made this mistake with "infrastructure as code" tools that promised simplicity but locked us into their workflow. The real lesson: simplicity that doesn't match your needs is just complexity with better marketing.
The GitOps trade-off
The author endorses GitOps but admits: "We've had to invest in tooling to help people answer questions like 'I did a commit: why isn't it deployed yet?'"
This is infrastructure reality distilled. GitOps gives you auditability and reproducibility. It also gives you debugging complexity that traditional pipelines don't have.
Every team I've worked with that adopted GitOps went through the same cycle:
- Excitement: "Everything is declarative!"
- Confusion: "Why isn't my change live?"
- Investment: Building tools to bridge the gap
The teams that succeeded were the ones who budgeted for step 3 from the start. The teams that struggled treated GitOps like a drop-in replacement for deployment scripts.
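That bridging tool doesn't have to be fancy. A minimal sketch of the idea, with the controller's sync state passed in as plain data (in practice you'd query Argo CD or Flux for it; all names here are illustrative, not a real tool):

```python
from dataclasses import dataclass

@dataclass
class SyncState:
    """What a GitOps controller (Argo CD, Flux, ...) knows about one app."""
    app: str
    desired_revision: str   # commit the controller is trying to deploy
    live_revision: str      # commit actually running
    healthy: bool

def why_not_deployed(my_commit: str, state: SyncState) -> str:
    """Answer 'I did a commit: why isn't it deployed yet?' from controller state."""
    if state.live_revision == my_commit:
        return f"{state.app}: your commit IS live."
    if state.desired_revision != my_commit:
        return (f"{state.app}: controller hasn't picked up {my_commit[:7]} yet "
                f"(still targeting {state.desired_revision[:7]}): check the sync interval.")
    if not state.healthy:
        return (f"{state.app}: rollout of {my_commit[:7]} started but is unhealthy: "
                f"check pod events and readiness probes.")
    return f"{state.app}: sync of {my_commit[:7]} is in progress."

state = SyncState("billing-api", desired_revision="abc1234",
                  live_revision="9f0e111", healthy=True)
print(why_not_deployed("abc1234", state))  # billing-api: sync of abc1234 is in progress.
```

The value isn't the code; it's deciding up front that "where is my commit?" is a question your platform must answer in one step.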
The Datadog pricing problem
This one is brutal: Datadog's per-instance pricing model punishes exactly the behavior you want in Kubernetes clusters (rapid scaling, spot instances, GPU nodes with single services).
I've seen teams hit this wall. You tune your infrastructure for cost and reliability, then your observability bill goes up because you're doing infrastructure correctly.
The author regrets Datadog but doesn't suggest an alternative. That's honest. Most observability tools have pricing models that don't match modern infrastructure patterns. OpenTelemetry helps, but only if you invest in it early (which the author also regrets not doing).
The real decision: Do you tune for infrastructure cost or observability clarity? You can't always have both.
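You can see the mismatch in a toy model. Under per-host pricing (the $23/host figure is illustrative, not any vendor's actual rate), splitting the same workload across more, cheaper spot nodes cuts compute cost while multiplying the observability line:

```python
def cluster_costs(nodes: int, compute_per_node: float,
                  per_host_obs: float = 23.0) -> tuple[float, float]:
    """Monthly (compute, observability) cost under per-host observability pricing.
    The $23/host default is illustrative, not a real price list."""
    return nodes * compute_per_node, nodes * per_host_obs

# Same workload, two shapes: 10 large on-demand nodes vs 40 small spot nodes.
ondemand = cluster_costs(10, compute_per_node=560.0)  # (5600.0, 230.0)
spot = cluster_costs(40, compute_per_node=42.0)       # (1680.0, 920.0)
# Compute drops ~70%; the observability line quadruples.
```

This is the "punished for doing it right" effect in two lines of arithmetic: the bill tracks node count, not work done.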
What actually matters
After watching teams scale infrastructure across multiple startups, here's what separates successful decisions from expensive mistakes:
1. Tune for team velocity, not tool sophistication
Bazel might be brilliant, but if only two engineers understand it, you've created a bottleneck. GitHub Actions might be "simpler," but if everyone can debug it, you've raised your bus factor.
2. Standardize on identity early
The author wishes they'd adopted Okta sooner. I've never seen a team regret investing in identity infrastructure too early. It's boring work, but it compounds in value.
3. Track costs in real-time, not quarterly
Cost review isn't just for finance. When engineering sees the bill breakdown as it accrues, not as a quarterly surprise, cost control becomes a cultural habit instead of a panic response.
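Making the bill visible can start as a script that rolls the raw cost export up by service. A sketch assuming data shaped like AWS Cost Explorer's GetCostAndUsage response, grouped by service (the structure here is simplified for illustration):

```python
from collections import defaultdict

def cost_by_service(results_by_time: list[dict]) -> dict[str, float]:
    """Roll a grouped cost report up into service -> total.
    Assumes data shaped like Cost Explorer's GetCostAndUsage output
    grouped by SERVICE; shape simplified for this sketch."""
    totals: dict[str, float] = defaultdict(float)
    for period in results_by_time:
        for group in period["Groups"]:
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)

# Hypothetical one-period report:
report = [{"Groups": [
    {"Keys": ["AmazonRDS"], "Metrics": {"UnblendedCost": {"Amount": "812.40"}}},
    {"Keys": ["AmazonEKS"], "Metrics": {"UnblendedCost": {"Amount": "1203.75"}}},
]}]
for service, total in sorted(cost_by_service(report).items(), key=lambda kv: -kv[1]):
    print(f"{service:12s} ${total:,.2f}")
```

Piped into a weekly Slack message, even something this small changes who feels ownership of the bill.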
4. Every abstraction has a failure mode
EKS managed add-ons. Terraform Cloud. Datadog's Kubernetes integration. They all work great until they don't. The question isn't "is this abstraction good?" It's "when this abstraction fails, can we debug it?"
The framework
Here's the decision framework I use when evaluating infrastructure choices:
1. What are we tuning for? (Cost, reliability, velocity, compliance)
2. What's our escape hatch? (Can we migrate away if this doesn't work?)
3. Who owns this when it breaks? (Do we have the expertise in-house?)
4. What's the cost of being wrong? (Downtime, engineering time, data loss)
Most infrastructure retrospectives focus on #1. The author's piece is valuable because it shows #2-4 in action.
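The framework works better as a forcing function than as prose: if any answer is blank, the evaluation isn't done. A minimal sketch of that idea (field names are mine, not a real tool):

```python
from dataclasses import dataclass, fields

@dataclass
class InfraDecision:
    """One record per infrastructure choice; blank answers block sign-off."""
    choice: str
    tuning_for: str         # cost, reliability, velocity, compliance?
    escape_hatch: str       # how do we migrate away if this doesn't work?
    owner_when_broken: str  # who debugs this when it fails?
    cost_of_wrong: str      # downtime, engineering time, data loss?

def unanswered(d: InfraDecision) -> list[str]:
    """Return the questions still missing an answer."""
    return [f.name for f in fields(d) if not getattr(d, f.name).strip()]

d = InfraDecision(
    choice="self-host Postgres",
    tuning_for="cost",
    escape_hatch="",  # nobody wrote one down: that's the finding
    owner_when_broken="platform team",
    cost_of_wrong="data loss (company-ending)",
)
print(unanswered(d))  # ['escape_hatch']
```

The point isn't the data structure; it's that an empty escape-hatch field is visible before you commit, not two years after.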
The meta-lesson
The author endorses AWS over GCP because "Amazon lives with a customer focus." They regret Bottlerocket because "debugging was much harder than debugging standard EKS AMIs."
Notice the pattern: the decisions that worked were the ones that reduced friction when things went wrong. Not when things went right.
Your infrastructure will fail. The question is: will you be able to fix it?
Choose tools that make failure debuggable. Choose abstractions that don't hide critical information. Choose managed services that buy you time to build features, not complexity.
And when someone shares a retrospective like this, don't copy their conclusions. Study their constraints. Then make your own bets.
Because in 4 years, you'll be writing your own retrospective. And the decisions you're making today will look very different with production load behind you.