The first time a client asked me to look at their AWS bill, I went straight for the usual suspects. Runaway Lambda invocations, oversized RDS instances, forgotten EBS volumes. I found a few small things. The bill barely moved.
The money was hiding in three places nobody had bothered to read. I've now seen the same pattern on enough bills that I'd bet I'll find at least two of these in any AWS account over $20k a month. Here they are.
1. NAT Gateway data transfer
The NAT Gateway looks innocent at $0.045 per hour. Pocket change. The killer is the $0.045 per gigabyte processed on top, which scales linearly with traffic and quietly dwarfs the hourly cost the moment your private subnets start doing anything.
Every private subnet pulling Docker images from ECR over the public endpoint, every cron hitting an external API, every microservice calling S3 the lazy way, through the NAT instead of an endpoint: all of it racks up per-GB charges. A single NAT moving 10 GB a day costs $13.50 a month just on transfer, plus the hourly fee. Multiply by traffic. Multiply by environments. Multiply by AZs (because you have one NAT per AZ for HA). The bill compounds in places nobody is watching.
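If you want to sanity-check the arithmetic against your own flow logs, here it is as a throwaway script. The rates match the prices quoted above; the traffic and AZ count are placeholders:

```python
# Back-of-envelope NAT math. Rates are the quoted us-east-1 list prices;
# gb_per_day and nat_count are placeholders -- plug in your own numbers.
NAT_HOURLY = 0.045   # $/hour per NAT Gateway
NAT_PER_GB = 0.045   # $/GB processed
gb_per_day = 10
nat_count = 3        # one per AZ for HA

transfer = gb_per_day * 30 * NAT_PER_GB * nat_count  # $40.50/mo across 3 AZs
hourly = NAT_HOURLY * 24 * 30 * nat_count            # $97.20/mo just to exist
print(f"transfer ${transfer:.2f}/mo + hourly ${hourly:.2f}/mo")
```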
The tactic is mechanical: VPC endpoints for S3, ECR, DynamoDB, and Secrets Manager. Gateway endpoints are free. Interface endpoints have an hourly cost but pay for themselves quickly on any meaningful traffic. I had one client whose ECR pulls alone justified the migration in eight days.
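For the gateway flavor, the whole fix is one API call per VPC. A minimal boto3 sketch, with placeholder IDs and a region-specific service name:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoints are free. Once the route table entry exists, S3 traffic
# from those subnets stops flowing through the NAT entirely.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0abc123",                       # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",  # region-specific
    RouteTableIds=["rtb-0def456"],             # your private route tables
)
```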
2. Idle ECS and EC2 capacity
Every org has them. The staging cluster from a project that shipped two years ago. The capacity provider with a minimum of three tasks that nobody scales down. The dev environment somebody spun up for a demo in 2023 and forgot.
ECS doesn't bill you for tasks. It bills you for the EC2 instances they run on, 24/7, whether anything is actually using them. I once watched a client pay $1,800 a month for an idle staging cluster that hadn't received a deploy in eleven months. Nobody noticed because the bill grew so slowly.
The tactic: tag every cluster with an owner and an expires-on date, run a weekly Lambda that flags untagged or expired resources, and route the report into the team's Slack. An auto-suspend cron keyed off the same tags that scales clusters to zero outside business hours is even better when latency permits.
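A minimal sketch of that weekly Lambda, assuming clusters carry owner and expires-on tags in ISO date format and the report goes to a Slack incoming webhook (the URL is a placeholder):

```python
import json
import urllib.request
from datetime import date

import boto3

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder
ecs = boto3.client("ecs")

def handler(event, context):
    flagged = []
    # Note: list_clusters paginates past 100 clusters; omitted for brevity.
    for arn in ecs.list_clusters()["clusterArns"]:
        tags = {t["key"]: t["value"]
                for t in ecs.list_tags_for_resource(resourceArn=arn)["tags"]}
        expired = tags.get("expires-on", "9999-12-31") < date.today().isoformat()
        if "owner" not in tags or "expires-on" not in tags or expired:
            flagged.append(f"{arn}: owner={tags.get('owner', 'MISSING')}, "
                           f"expires-on={tags.get('expires-on', 'MISSING')}")
    if flagged:
        msg = json.dumps({"text": "Expired or untagged ECS clusters:\n"
                          + "\n".join(flagged)}).encode()
        urllib.request.urlopen(urllib.request.Request(
            SLACK_WEBHOOK, data=msg,
            headers={"Content-Type": "application/json"}))
```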
3. Cross-AZ data transfer in service meshes
$0.01 per gigabyte sounds like nothing. Then your service mesh fans out 50 calls per request across three availability zones, and that "nothing" becomes a quarter of the compute bill.
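To make that concrete, a toy estimate. Every input here is invented, and note that AWS bills $0.01 on each side of the AZ boundary, so each crossing gigabyte effectively costs $0.02:

```python
# Toy model of the fan-out math above. All inputs are invented.
requests_per_sec = 200
fanout = 50              # internal calls per request
payload_kb = 5           # average payload per hop, both directions combined
cross_az_fraction = 2/3  # random routing across 3 AZs
PER_GB = 0.02            # $0.01 out + $0.01 in

gb_month = (requests_per_sec * fanout * payload_kb * cross_az_fraction
            * 86400 * 30) / 1024 / 1024
print(f"~{gb_month:,.0f} GB/mo crosses AZs -> ${gb_month * PER_GB:,.0f}/mo")
# ~82,000 GB/mo -> ~$1,600/mo, for a workload nobody calls "big data"
```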
The tactic is zone-aware routing. Most modern service meshes support it. Keep traffic in the same AZ when a healthy local replica exists. For latency-tolerant chatty services, you can be more aggressive and pin services to a single AZ entirely, only failing over during a real outage. The savings often dwarf the engineering time within a quarter.
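The per-call decision a zone-aware mesh makes is simple enough to show as a toy. This is an illustration of the idea, not any particular mesh's algorithm:

```python
import random

def pick_replica(replicas, local_az):
    """Prefer a healthy replica in the caller's AZ; cross AZs only when forced."""
    healthy = [r for r in replicas if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy replicas anywhere")
    local = [r for r in healthy if r["az"] == local_az]
    return random.choice(local or healthy)

replicas = [
    {"host": "10.0.1.5", "az": "us-east-1a", "healthy": True},
    {"host": "10.0.2.7", "az": "us-east-1b", "healthy": True},
]
print(pick_replica(replicas, "us-east-1a"))  # stays in 1a while 10.0.1.5 is up
```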
The story I always tell
I once worked with a team that didn't notice their bill had grown 40% over three months. The growth was almost entirely NAT Gateway data transfer from a new ML training pipeline pulling datasets over the public internet. Nobody had alerted on it because the increase was smooth, not spiky. The bill had simply gotten bigger, the way bills do.
The lesson: cost anomalies aren't usually spikes. They're slopes. Set up alerts on the derivative, not the value.
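One cheap way to do that with Cost Explorer: compare the trailing week's average daily spend against the week before and alert on the delta. The 10% threshold below is an arbitrary starting point, and AWS Cost Anomaly Detection does a more sophisticated version of this out of the box:

```python
from datetime import date, timedelta

import boto3

ce = boto3.client("ce")  # Cost Explorer

def daily_costs(days=14):
    end = date.today()
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": (end - timedelta(days=days)).isoformat(),
                    "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return [float(r["Total"]["UnblendedCost"]["Amount"])
            for r in resp["ResultsByTime"]]

costs = daily_costs()
prev, recent = sum(costs[:7]) / 7, sum(costs[7:]) / 7
if recent > prev * 1.10:  # 10% week-over-week: an arbitrary starting point
    print(f"Spend slope alert: ${prev:,.0f}/day -> ${recent:,.0f}/day")
```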
I wrote a whole book on this if you want the long version: AWS Cost Optimization. The short version is on this page.