Why your AWS bill spiked: Causes, fixes & full incident playbook (FinOps + automation)
A single-file, actionable playbook to diagnose spikes, remediate safely, and embed continuous optimization. Includes scripts (AWS CLI, boto3, PowerShell), guardrails (CloudFormation/Terraform), budgets & alarms, FinOps process, and printable checklists.
Author: CloudKnowledge
Quick TL;DR: Most AWS billing spikes are caused by misconfigurations (autoscaling loops, runaway jobs), forgotten resources, data egress, or storage tier mistakes. Use Cost Explorer + Compute Optimizer + budgets + tagging + automation to find and fix root causes. This file gives scripts and guardrails you can copy-paste now.
Visit CloudKnowledge for additional templates and workshops.
1) Why AWS bills spike — high-level causes (expanded)
Understanding the categories helps you prioritize the right diagnostic tool. Below are common categories with practical tips and a severity guide (High / Medium / Low).
Common categories
- Autoscaling & runaway compute (High): bad lifecycle hooks, infinite scaling loops (scale out -> fail health checks -> launch more), or cron jobs that spin up servers per request.
- Orphaned resources (Medium): unattached EBS volumes, unused AMIs and their snapshots, idle load balancers, old snapshots, and idle NAT gateways that bill hourly even with no traffic.
- Data egress (High): cross-region replication, public traffic from S3 or EC2, large downloads by analytics jobs.
- Storage tiering & retrieval (Medium-High): Glacier/Deep Archive retrievals or keeping rarely-accessed data in Standard.
- High-frequency API services (Medium): excessive Lambda invocations, Kinesis shards, DynamoDB read/write costs due to hot keys.
- Poor reserved instance planning (Medium): paying on-demand for long-running workloads when RIs/Savings Plans would save.
- Third-party SaaS/Integrations (Low-Medium): monitoring or analytics tools pulling lots of data from S3 or APIs.
- Lack of visibility (Critical): no tagging or no central billing view makes any analysis slow.
2) First 30-minute incident playbook — stop the bleeding
When a spike occurs, people panic. Follow these steps in order; each step is reversible or conservative where possible.
30-minute checklist (prioritized)
- Identify scope: Sign in to the Billing console or run Cost Explorer queries for the last 24–72 hours. Focus on the region/account/service causing the spike.
- Notify stakeholders: Send an incident message to Slack/email with a one-line summary and owner list (tag environment owners).
- Limit blast radius: Apply temporary throttles: reduce ASG desired capacity to the current running count, apply an IAM deny for new launches in the suspect account, or add WAF/CloudFront limits if the egress comes from public traffic.
- Quickly snapshot & tag resources: Snapshot EBS before deletion and tag discovered orphans as `cost-review`.
- Collect evidence: Export Cost Explorer JSON, CloudWatch alarms, and recent CloudTrail events related to scaling/launch actions.
- Remediate non-critical items: Delete unattached EBS, pause nightly big jobs, temporarily disable cross-region replication if safe.
- Perform root cause analysis: Once immediate costs drop, correlate the spike timestamps to infrastructure actions and deploy long-term fixes (lifecycle hooks, budgets, monitoring).
# 1) Identify top services (CLI; `date -d` is GNU date — on macOS use `date -v-3d +%Y-%m-%d`)
aws ce get-cost-and-usage --time-period Start=$(date -d '3 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) --granularity DAILY --metrics "UnblendedCost" --group-by Type=DIMENSION,Key=SERVICE --query 'ResultsByTime[].Groups' --output json | jq .
# 2) List available (unattached) EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[].{ID:VolumeId,Size:Size,AZ:AvailabilityZone}' --output table
3) Deep diagnostics — mapping cost -> resource -> owner
Use the combination of Cost Explorer, Cost Allocation Tags, Cost Categories, and CloudWatch metrics.
Using Cost Explorer programmatically
# example CLI to get cost by service and region for 7 days
aws ce get-cost-and-usage \
--time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE Type=DIMENSION,Key=REGION \
--query 'ResultsByTime[].Groups' --output json | jq .
Map to resources
- If Cost Explorer shows EC2 in us-east-1: list recent instance launches (`aws ec2 describe-instances --filters Name=launch-time,...`).
- For S3 spikes: check S3 request metrics in CloudWatch and server access logs (if enabled), or CloudTrail data events for GetObject requests.
- For Lambda: list functions with highest invocation counts via CloudWatch metrics.
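For the Lambda case, a minimal boto3 sketch that ranks functions by invocation count (the 24-hour window is illustrative — widen it to match your spike):
#!/usr/bin/env python3
# Rank Lambda functions by invocation count over the last 24h via CloudWatch metrics.
import boto3, datetime

lam = boto3.client('lambda')
cw = boto3.client('cloudwatch')
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=1)

counts = []
for page in lam.get_paginator('list_functions').paginate():
    for fn in page['Functions']:
        resp = cw.get_metric_statistics(
            Namespace='AWS/Lambda', MetricName='Invocations',
            Dimensions=[{'Name': 'FunctionName', 'Value': fn['FunctionName']}],
            StartTime=start, EndTime=end, Period=86400, Statistics=['Sum'])
        total = sum(p['Sum'] for p in resp.get('Datapoints', []))
        counts.append((fn['FunctionName'], total))

for name, total in sorted(counts, key=lambda x: x[1], reverse=True)[:20]:
    print(name, int(total))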
4) Comprehensive scripts & snippets (copy/paste)
Below are conservative, audited scripts and examples. Always review in a staging environment and get approvals when performing destructive operations.
4.1 AWS CLI — Top-10 cost services (last 7 days)
aws ce get-cost-and-usage \
--time-period Start=2025-10-18,End=2025-10-25 \
--granularity DAILY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'ResultsByTime[].Groups' --output json | jq .
4.2 List unattached EBS volumes (safe to delete after snapshot)
aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[].{ID:VolumeId,Size:Size,AZ:AvailabilityZone,CreateTime:CreateTime}' --output table
# Snapshot an EBS before deletion (example)
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pre-delete-snapshot-$(date +%F)"
4.3 Stop autoscaling groups (temporary, conservative)
# Scale the group down to zero — sets both min-size and desired capacity; running instances will be terminated
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg --min-size 0 --desired-capacity 0
# Note: review impact before running; consider lifecycle hooks and draining policies.
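An even less disruptive option is to suspend the group's Launch process so no new instances start while you investigate. A boto3 sketch (the group name my-asg is a placeholder):
import boto3

asg = boto3.client('autoscaling')
# Suspend only the Launch process: running instances keep serving,
# but the group cannot add capacity until you resume.
asg.suspend_processes(AutoScalingGroupName='my-asg', ScalingProcesses=['Launch'])
# Later: asg.resume_processes(AutoScalingGroupName='my-asg', ScalingProcesses=['Launch'])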
4.4 Python boto3 — find top S3 buckets by size
#!/usr/bin/env python3
# Requires: boto3, configured AWS credentials
# Note: BucketSizeBytes is published daily in each bucket's home region;
# point the CloudWatch client at the right region for cross-region buckets.
import boto3, datetime

s3 = boto3.client('s3')
cloudwatch = boto3.client('cloudwatch')
buckets = [b['Name'] for b in s3.list_buckets().get('Buckets', [])]
now = datetime.datetime.utcnow()
start = now - datetime.timedelta(days=3)

sizes = []
for b in buckets:
    try:
        resp = cloudwatch.get_metric_statistics(
            Namespace='AWS/S3',
            MetricName='BucketSizeBytes',
            Dimensions=[{'Name': 'BucketName', 'Value': b},
                        {'Name': 'StorageType', 'Value': 'StandardStorage'}],
            StartTime=start,
            EndTime=now,
            Period=86400,
            Statistics=['Maximum']
        )
        pts = resp.get('Datapoints', [])
        size = max(p['Maximum'] for p in pts) if pts else 0
        sizes.append((b, size))
    except Exception:
        # permissions missing or metric not available for this bucket
        continue

sizes_sorted = sorted(sizes, key=lambda x: x[1], reverse=True)
for bucket, size in sizes_sorted[:20]:
    print(bucket, round(size / 1024 / 1024 / 1024, 2), "GB")
4.5 PowerShell (AWS Tools for PowerShell) — quick cost export
# Requires: Install-Module -Name AWSPowerShell.NetCore
Import-Module AWSPowerShell.NetCore
$end = (Get-Date).ToString("yyyy-MM-dd")
$start = (Get-Date).AddDays(-7).ToString("yyyy-MM-dd")
$group = New-Object Amazon.CostExplorer.Model.GroupDefinition
$group.Type = "DIMENSION"; $group.Key = "SERVICE"
Get-CECostAndUsage -TimePeriod_Start $start -TimePeriod_End $end -Granularity DAILY -Metric "UnblendedCost" -GroupBy $group | ConvertTo-Json -Depth 8 | Out-File costs.json
Get-Content costs.json | ConvertFrom-Json
4.6 CloudWatch alarm & SNS example (CLI)
# Create an SNS topic and subscription, then alarm for EstimatedCharges
aws sns create-topic --name BillingAlerts
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789012:BillingAlerts --protocol email --notification-endpoint ops@example.com
aws cloudwatch put-metric-alarm --alarm-name "Monthly-EstimatedCharges-High" \
  --metric-name EstimatedCharges --namespace AWS/Billing --statistic Maximum \
  --dimensions Name=Currency,Value=USD \
  --period 21600 --threshold 100 --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 --alarm-actions arn:aws:sns:us-east-1:123456789012:BillingAlerts --region us-east-1
# EstimatedCharges is published only in us-east-1 and requires "Receive Billing Alerts" to be enabled.
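Static thresholds miss gradual creep, so pair the alarm with AWS Cost Anomaly Detection. A minimal boto3 sketch creating a per-service monitor and a daily email digest (the address and the $100 impact threshold are placeholders):
import boto3

ce = boto3.client('ce')
monitor = ce.create_anomaly_monitor(AnomalyMonitor={
    'MonitorName': 'per-service-spend',
    'MonitorType': 'DIMENSIONAL',
    'MonitorDimension': 'SERVICE'})
ce.create_anomaly_subscription(AnomalySubscription={
    'SubscriptionName': 'daily-anomaly-digest',
    'MonitorArnList': [monitor['MonitorArn']],
    'Subscribers': [{'Type': 'EMAIL', 'Address': 'ops@example.com'}],
    'Frequency': 'DAILY',
    'Threshold': 100.0})  # alert on anomalies with >= $100 total impact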
5) Guardrails — CloudFormation and Terraform policy examples
Guardrails reduce future risk by enforcing safe defaults: deny public S3 buckets, enforce tagging, limit instance sizes, prevent creation of expensive resources in dev accounts, enforce budgets.
5.1 CloudFormation Guard — tag enforcement (example snippet)
# Illustrative cfn-guard (2.x) rule — run against templates in CI with `cfn-guard validate`:
let ec2_instances = Resources.*[ Type == 'AWS::EC2::Instance' ]

rule require_cost_tags when %ec2_instances !empty {
    %ec2_instances.Properties.Tags !empty
    some %ec2_instances.Properties.Tags[*].Key == "cost_center"
}
# Note: Use AWS Config custom rules or SCPs in Organizations for stronger enforcement.
5.2 Terraform Sentinel-style policy (pseudo) — enforce tags & limit instance sizes
# Terraform policy-as-code pseudocode (implement with Sentinel or OPA in your pipeline)
# 1) require tags: team, env, cost_center
# 2) disallow instance types larger than m5.4xlarge in dev accounts
# 3) disallow public S3 buckets without an explicit bucket policy
5.3 AWS Organizations SCP example — deny specific actions
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyCreateNatGatewayInDev",
"Effect": "Deny",
"Action": ["ec2:CreateNatGateway"],
"Resource": "*",
"Condition": {
"StringEquals": {"aws:PrincipalAccount": "111122223333"} # dev account id
}
}
]
}
6) Rightsizing & long-term savings (Compute Optimizer, RIs, Spot)
Rightsizing is the heart of cost optimization. Use Compute Optimizer recommendations, but validate them before making changes. For steady-state workloads, Savings Plans often beat RIs on flexibility.
Compute Optimizer workflow
- Enable Compute Optimizer for accounts (requires role & permissions).
- Review recommendations (CPU/Memory/Network) for EC2, ASGs, and EBS.
- Tag recommended instances as `co:reviewed` when validated.
- Roll out changes in canary waves and monitor performance & error rates.
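To pull these recommendations programmatically, a boto3 sketch (assumes Compute Optimizer is already enrolled for the account):
import boto3

co = boto3.client('compute-optimizer')
resp = co.get_ec2_instance_recommendations()
for rec in resp['instanceRecommendations']:
    # options carry a rank; pick the top-ranked one
    best = min(rec['recommendationOptions'], key=lambda o: o['rank'])
    print(rec['instanceArn'].split('/')[-1],
          rec['finding'],  # e.g. OVER_PROVISIONED / UNDER_PROVISIONED / OPTIMIZED
          rec['currentInstanceType'], '->', best['instanceType'])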
Purchasing RIs / Savings Plans
Run utilization analysis for 30–90 days. Commit to RIs only after verifying baseline usage. Consider convertible RIs or compute Savings Plans for more flexibility.
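Cost Explorer can generate the purchase recommendation for you. A minimal boto3 sketch (the one-year, no-upfront Compute Savings Plan parameters are one reasonable starting point, not a prescription):
import boto3, json

ce = boto3.client('ce')
resp = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType='COMPUTE_SP',
    TermInYears='ONE_YEAR',
    PaymentOption='NO_UPFRONT',
    LookbackPeriodInDays='THIRTY_DAYS')
summary = resp.get('SavingsPlansPurchaseRecommendation', {}).get(
    'SavingsPlansPurchaseRecommendationSummary', {})
print(json.dumps(summary, indent=2, default=str))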
7) FinOps: Organizing teams, sprints, and cost ownership
FinOps is a cultural practice. Below is a recommended adoption plan across 3 phases.
Phase 1 — Visibility (0–3 months)
- Enable Cost Explorer, enable aws:createdBy and team tags at creation time.
- Set up budgets, alerts, and weekly cost reports emailed to cost owners.
- Run a baseline audit to find top 10 cost drivers.
Phase 2 — Accountability & Optimization (3–9 months)
- Assign cost owners per service/app and include a cost KPI in sprint reviews.
- Start rightsizing cadence and a reserved instance / savings plan review cycle.
- Automate shutdown of dev environments during off hours.
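A scheduled Lambda (EventBridge cron) is enough for the off-hours shutdown. A minimal sketch stopping instances tagged env=dev (the tag key/value is an assumption — match it to your own tagging standard):
import boto3

def handler(event, context):
    ec2 = boto3.client('ec2')
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:env', 'Values': ['dev']},          # assumed tag convention
        {'Name': 'instance-state-name', 'Values': ['running']}])
    ids = [i['InstanceId'] for r in resp['Reservations'] for i in r['Instances']]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {'stopped': ids}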
Phase 3 — Automation & Continuous Improvement (9+ months)
- Policy-as-code enforcement (SCPs, Terraform + pipeline policies).
- Cost-aware CI/CD (estimate resource cost before merge for infrastructure changes).
- Integrate cost for product prioritization decisions.
8) Full incident response playbook — step-by-step (playbook)
Use this playbook for documented, repeatable response. Store it in your runbook/incident management tool.
Playbook (summary)
- Detect: Budget or anomaly alert triggers.
- Triage: Identify accounts/services/regions; create incident in tracker.
- Contain: Scale down ASGs, quarantine accounts, disable cross-region replication if safe.
- Investigate: Correlate logs, CloudTrail, and Cost Explorer; identify the IAM identity or automation that made the changes (see the CloudTrail sketch after this list).
- Remediate: Delete orphaned resources (with snapshots), patch automation bugs, update autoscaler settings.
- Recover: Restore any intentionally halted services after validation; inform stakeholders.
- Post-mortem: Document root cause, impact, timeline, and corrective actions (SOPs & policy changes).
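For the investigate step, CloudTrail's LookupEvents API maps the spike window back to the identity that launched resources. A boto3 sketch (RunInstances and the 24-hour window are examples — swap in whatever action matches your spike):
import boto3, datetime

ct = boto3.client('cloudtrail')
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=24)
resp = ct.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': 'RunInstances'}],
    StartTime=start, EndTime=end)
for ev in resp['Events']:
    print(ev['EventTime'], ev.get('Username', '?'), ev['EventName'])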
Playbook — Example timeline
00:00 Detect - Budget alert triggers (Ops receives email/SNS)
00:05 Triage - Run Cost Explorer queries to determine service & region
00:12 Contain - Set ASG desired capacity to safe value, block new launches temporarily
00:20 Investigate - CloudTrail + CloudWatch logs show a scheduler starting analytics in wrong region
00:45 Remediate - Disable bad cron, snapshot, and delete temporary resources
02:00 Post-mortem - Share notes and update runbook
9) Example policies & enforcement (SCP, Config, Lambda auto-remediation)
Use AWS Config managed rules and Lambda-based auto-remediation for repeatable enforcement.
AWS Config rule example: S3 public access
{
"ConfigRuleName": "s3-buckets-public-read-prohibited",
"Source": {
"Owner": "AWS",
"SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"
}
}
# Auto-remediation: Lambda that tags the bucket and blocks public access
Example Lambda auto-remediation (sketch)
import boto3

def handler(event, context):
    bucket = event['detail']['requestParameters']['bucketName']  # from CloudTrail via EventBridge
    s3 = boto3.client('s3')
    s3.put_bucket_tagging(Bucket=bucket, Tagging={'TagSet': [{'Key': 'owner', 'Value': 'unknown'}]})
    s3.put_public_access_block(Bucket=bucket, PublicAccessBlockConfiguration={
        'BlockPublicAcls': True, 'IgnorePublicAcls': True,
        'BlockPublicPolicy': True, 'RestrictPublicBuckets': True})
    # then notify the owner channel (e.g., an SNS publish)
10) Reporting, dashboards, and runbook artifacts
Build simple dashboards for daily ops: Top 10 services, top buckets, top EC2 costs, anomaly scoring. Send weekly digest to finance and product owners.
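As a starting point for the daily ops view, a boto3 sketch that publishes a one-widget dashboard tracking EstimatedCharges (billing metrics live in us-east-1 and require billing alerts to be enabled; the dashboard name is a placeholder):
import boto3, json

cw = boto3.client('cloudwatch', region_name='us-east-1')
body = {'widgets': [{
    'type': 'metric', 'x': 0, 'y': 0, 'width': 12, 'height': 6,
    'properties': {
        'metrics': [['AWS/Billing', 'EstimatedCharges', 'Currency', 'USD']],
        'period': 21600, 'stat': 'Maximum', 'region': 'us-east-1',
        'title': 'Estimated monthly charges (USD)'}}]}
cw.put_dashboard(DashboardName='cost-ops', DashboardBody=json.dumps(body))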
11) Printable checklist (one-page)
First-Response Checklist — Cost Spike
- Open Cost Explorer & find top service/region (24–72h).
- Notify cost owner & create incident ticket.
- Run `aws ce get-cost-and-usage ...` and save the output.
- List and snapshot unattached EBS volumes.
- Temporarily reduce ASG desired capacity (if safe).
- Check S3 access logs & CloudTrail for unexpected GET/PUT spikes.
- Check cross-region replication & stop if safe.
- Document remediation steps & owners in post-mortem.
12) Appendices — Extra scripts, sample CloudFormation & Terraform policies
Appendix A — Cost Explorer SDK (Python) sample
import boto3, datetime, json
client = boto3.client('ce')
end = datetime.date.today()
start = end - datetime.timedelta(days=7)
resp = client.get_cost_and_usage(
TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
Granularity='DAILY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type':'DIMENSION','Key':'SERVICE'}]
)
print(json.dumps(resp, indent=2))
Appendix B — Terraform snippet: deny public S3 via bucket policy resource
resource "aws_s3_bucket_public_access_block" "deny_public" {
bucket = aws_s3_bucket.mybucket.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
Appendix C — Example CloudFormation to create budget (JSON simplified)
{
"AWSTemplateFormatVersion": "2010-09-09",
"Resources": {
"MonthlyBudget": {
"Type": "AWS::Budgets::Budget",
"Properties": {
"Budget": {
"BudgetType": "COST",
"TimeUnit": "MONTHLY",
"BudgetLimit": {"Amount": 1000, "Unit": "USD"},
"CostFilters": {}
},
"NotificationsWithSubscribers": [
{
"Notification": {"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80},
"Subscribers":[{"SubscriptionType":"EMAIL","Address":"ops@example.com"}]
}
]
}
}
}
}
Conclusion — People, process & platform
Cost spikes are frightening but manageable. The fastest path from alert -> remediation -> prevention is:
- Alerting & visibility (budgets, CE, tags)
- Fast containment (conservative throttles & snapshots)
- Root-cause fix (automation bug fixes, lifecycle policies)
- Preventive guardrails (SCPs, Config rules, policy-as-code)
- Cultural embed (FinOps: owners, cadences, sprint ops)
Use the scripts above as starting points and adapt them to your organization's processes and approval flows.