
Real AWS Bill Autopsy (Anonymised Case Study)

A line-by-line breakdown of a real AI startup's AWS bill — where the money went, what was wasted, and how we cut it by 58%.


This is a real AWS bill from a Series A AI startup processing customer support tickets with RAG. We have anonymised the company but the numbers are real. Monthly spend: $34,200. After our optimisation engagement: $14,400. Here is where the money was going.


The Breakdown

Bedrock / LLM API Calls — $18,500 (54%) — The largest line item. The company was using Claude Sonnet for every single query — including simple ticket classification, status lookups, and FAQ responses that a much smaller model could handle. No caching was implemented, so identical questions generated fresh API calls every time.

EC2 Compute — $6,800 (20%) — Three m5.4xlarge instances running 24/7 for the application layer, plus two p3.2xlarge GPU instances for embedding generation. The GPU instances were running at 12% average utilisation — they were sized for peak load, but peak load only occurred for about two hours per day.
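The waste from those always-on GPUs is easy to put a number on. A rough sketch, using an assumed on-demand hourly rate rather than a figure from the bill:

```python
# Back-of-the-envelope waste estimate for the always-on GPU pair.
# The hourly rate is an assumption for illustration, not a quoted price.
HOURLY_RATE = 3.06        # assumed on-demand $/hr per p3.2xlarge
INSTANCES = 2
HOURS_PER_MONTH = 730

always_on = HOURLY_RATE * INSTANCES * HOURS_PER_MONTH
peak_only = HOURLY_RATE * INSTANCES * 2 * 30   # ~2 busy hours/day

print(f"always-on:  ${always_on:,.0f}/mo")
print(f"peak-only:  ${peak_only:,.0f}/mo")
print(f"idle spend: ${always_on - peak_only:,.0f}/mo")
```

Under those assumptions, over 90% of the GPU spend was paying for idle capacity.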

RDS (PostgreSQL + pgvector) — $4,200 (12%) — A db.r6g.2xlarge instance with 500GB storage. Significantly oversized for the actual data volume. The pgvector indexes were unoptimised, forcing the database to work harder than necessary.

S3 + Data Transfer — $2,800 (8%) — Document storage and inter-service data transfer. The application was downloading full documents from S3 on every request instead of caching them locally.

Other (CloudWatch, NAT Gateway, etc.) — $1,900 (6%) — Verbose logging to CloudWatch at $0.50/GB ingested was adding up. The NAT Gateway charges were inflated by unnecessary cross-AZ traffic.


What We Changed

Model routing — Implemented a classifier that routes simple queries (60% of traffic) to a smaller model. Inference costs dropped 45%.
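The routing idea can be sketched in a few lines. This is a minimal stand-in, not the client's actual classifier — a real deployment would use a small fine-tuned model, and the intent names and model labels here are illustrative:

```python
# Route cheap intents to a small model; everything else to the big one.
SIMPLE_INTENTS = {"status", "faq", "classification"}

def classify_intent(query: str) -> str:
    """Stand-in classifier: keyword rules for illustration only."""
    q = query.lower()
    if "status" in q or "where is my" in q:
        return "status"
    if q.endswith("?") and len(q.split()) < 8:
        return "faq"
    return "complex"

def pick_model(query: str) -> str:
    intent = classify_intent(query)
    return "small-model" if intent in SIMPLE_INTENTS else "large-model"

print(pick_model("Where is my ticket #4521?"))   # routed to the small model
print(pick_model("Summarise this 40-message escalation thread and draft a reply"))
```

The key property is that the classifier itself must be far cheaper than the large model it is gating, otherwise the routing layer eats the savings.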

Semantic caching — Added a Redis-based cache for near-duplicate queries. Hit rate of 28% in the first week.
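The mechanism looks roughly like this. In the sketch below an in-memory list stands in for Redis, a trivial bag-of-words vector stands in for a real embedding model, and the 0.9 similarity threshold is illustrative:

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words 'embedding' — a real system would call an embedding model."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []          # (embedding, cached_answer) pairs

    def get(self, query: str):
        qv = embed(query)
        for ev, answer in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return answer      # near-duplicate: skip the LLM call
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do i reset my password", "Use the 'Forgot password' link.")
print(cache.get("how do i reset my password please"))   # near-duplicate hit
print(cache.get("cancel my subscription"))              # miss -> call the LLM
```

A production version would store embeddings in Redis with a vector index and expire entries, but the hit/miss logic is the same.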

Right-sized compute — Moved embedding generation to spot instances with auto-scaling. Reduced the always-on GPU cost by 70%.
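The autoscaling side reduces to a policy that maps load to worker count, including scaling to zero when the queue is empty. A sketch with illustrative thresholds — a real setup would drive an Auto Scaling group from a CloudWatch queue-depth metric:

```python
def desired_workers(queue_depth: int, per_worker_throughput: int = 200,
                    max_workers: int = 4) -> int:
    """Scale to zero when idle, up to max_workers under load.
    per_worker_throughput = docs each worker can embed per scaling interval
    (an assumed figure for illustration)."""
    if queue_depth == 0:
        return 0
    needed = -(-queue_depth // per_worker_throughput)  # ceiling division
    return min(needed, max_workers)

print(desired_workers(0))      # idle overnight -> 0 workers
print(desired_workers(450))    # modest backlog -> 3 workers
print(desired_workers(5000))   # burst, capped at max_workers
```

Scale-to-zero is what converts "sized for two peak hours" from a liability into a non-issue.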

Database optimisation — Tuned pgvector indexes and downscaled the RDS instance. Cut database costs by 40%.
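For the index tuning specifically, pgvector's documentation suggests a simple sizing heuristic for IVFFlat indexes: `lists` around rows/1000 up to roughly a million rows, and the square root of the row count beyond that. A sketch (the table and column names are illustrative):

```python
import math

def ivfflat_lists(row_count: int) -> int:
    """pgvector's suggested IVFFlat sizing heuristic."""
    if row_count <= 1_000_000:
        return max(row_count // 1000, 1)
    return int(math.sqrt(row_count))

# The DDL this would drive, for a hypothetical 250k-row chunk table:
ddl = (f"CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) "
       f"WITH (lists = {ivfflat_lists(250_000)});")
print(ddl)
```

An unindexed or badly sized index forces sequential scans or poor recall/latency trade-offs, which is exactly the "working harder than necessary" symptom above.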

Caching and transfer — Implemented local document caching and consolidated services to reduce cross-AZ traffic.
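The document cache is the simplest of these changes: a read-through cache on local disk in front of S3. In this sketch `fetch_from_s3` is a placeholder for the real `get_object` call, and the cache directory is an arbitrary choice:

```python
import hashlib
import pathlib
import tempfile

CACHE_DIR = pathlib.Path(tempfile.gettempdir()) / "doc-cache"

def fetch_from_s3(key: str) -> bytes:
    """Stand-in for s3_client.get_object(...)[ 'Body'].read()."""
    return f"contents of {key}".encode()

def get_document(key: str) -> bytes:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / hashlib.sha256(key.encode()).hexdigest()
    if path.exists():
        return path.read_bytes()   # cache hit: no S3 GET, no transfer charge
    data = fetch_from_s3(key)
    path.write_bytes(data)         # populate on first miss
    return data

doc = get_document("tickets/2024/manual.pdf")   # miss: fetches and stores
doc = get_document("tickets/2024/manual.pdf")   # hit: served from disk
```

Every hit avoids both a GET request and the per-GB transfer charge, which is why this line item shrank even though nothing about the documents themselves changed.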


The Result

Monthly bill dropped from $34,200 to $14,400 — a 58% reduction — with no measurable quality degradation. Response latency actually improved, thanks to caching and the faster small model handling simple queries.
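The headline figure is straightforward to verify from the two monthly totals:

```python
before, after = 34_200, 14_400
reduction = (before - after) / before
print(f"{reduction:.0%} reduction, ${before - after:,}/mo saved")
```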

Ready to implement this?

We help founders master vibe coding at scale. Book a Free Technical Triage to unblock your build.
