Why your AWS bill spiked: Causes, fixes & full incident playbook

1) Why AWS bills spike — high-level causes (expanded)

Understanding the categories helps you prioritize the right diagnostic tool. Below are common categories with practical tips and a severity guide (Critical / High / Medium / Low).

Common categories

Autoscaling & runaway compute (High): bad lifecycle hooks, infinite scaling loops (scale out -> fail health checks -> launch more), or cron jobs that spin up servers per request.
Orphaned resources (Medium): unattached EBS volumes, unused AMIs, unused ELBs, old snapshots, idle NAT gateways (billed hourly whether or not traffic flows through them).
Data egress (High): cross-region replication, public traffic from S3 or EC2, large downloads by analytics jobs.
Storage tiering & retrieval (Medium-High): Glacier/Deep Archive retrievals or keeping rarely-accessed data in Standard.
High-frequency API services (Medium): excessive Lambda invocations, Kinesis shards, DynamoDB read/write costs due to hot keys.
Poor reserved instance planning (Medium): paying on-demand for long-running workloads when RIs/Savings Plans would save.
Third-party SaaS/Integrations (Low-Medium): monitoring or analytics tools pulling lots of data from S3 or APIs.
Lack of visibility (Critical): no tagging or no central billing view makes any analysis slow.

2) First 30-minute incident playbook — stop the bleeding

When a spike occurs, people panic. Follow these steps in order; each step is reversible or conservative where possible.

30-minute checklist (prioritized)

Identify scope: Sign in to the Billing console or run Cost Explorer queries for the last 24–72 hours. Focus on the region, account, and service causing the spike.
Notify stakeholders: Send an incident message to Slack/email with a one-line summary and owner list (tag environment owners).
Limit blast radius: Apply a temporary throttle: cap the ASG maximum size at the current running count, apply an IAM deny for new launches in the suspect account, or add WAF/CloudFront rate limits if the egress comes from public traffic.
Quickly snapshot & tag resources: Snapshot EBS before deletion and tag discovered orphans as `cost-review`.
Collect evidence: Export Cost Explorer JSON, CloudWatch alarms, and recent CloudTrail events related to scaling/launch actions.
Remediate non-critical items: Delete unattached EBS volumes (after snapshotting), pause large nightly jobs, and temporarily disable cross-region replication if safe.
Perform root cause analysis: Once immediate costs drop, correlate the spike timestamps to infrastructure actions and deploy long-term fixes (lifecycle hooks, budgets, monitoring).
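
The "Identify scope" step can be partly automated. The sketch below assumes you already have the `ResultsByTime` list from a Cost Explorer query grouped by SERVICE at DAILY granularity, and ranks services by how much their cost rose between the first and last day in the window. The function name `rank_cost_increases` is illustrative, not part of any AWS SDK.

```python
from collections import defaultdict


def rank_cost_increases(results_by_time, metric="UnblendedCost"):
    """Rank services by cost increase between the first and last day.

    `results_by_time` is assumed to be the ResultsByTime list from a
    Cost Explorer get_cost_and_usage call grouped by SERVICE (DAILY).
    """
    daily = defaultdict(dict)  # service -> {date: cost}
    for day in results_by_time:
        date = day["TimePeriod"]["Start"]
        for group in day.get("Groups", []):
            service = group["Keys"][0]
            # Cost Explorer returns amounts as strings
            daily[service][date] = float(group["Metrics"][metric]["Amount"])

    dates = sorted(day["TimePeriod"]["Start"] for day in results_by_time)
    if len(dates) < 2:
        return []  # nothing to compare against
    first, last = dates[0], dates[-1]
    deltas = [
        (service, costs.get(last, 0.0) - costs.get(first, 0.0))
        for service, costs in daily.items()
    ]
    # Largest increase first
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

The most recent day in Cost Explorer is often incomplete, so a window that ends one day earlier may give a fairer comparison.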

Automated first-responder — low-risk commands

# 1) Identify top services (CLI)
aws ce get-cost-and-usage --time-period Start=$(date -d '3 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) --granularity DAILY --metrics "UnblendedCost" --group-by Type=DIMENSION,Key=SERVICE --query 'ResultsByTime[].Groups' --output json | jq .

# 2) List available (unattached) EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[].{ID:VolumeId,Size:Size,AZ:AvailabilityZone}' --output table
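
Tagging discovered orphans as `cost-review` can be scripted in the same spirit. This is a minimal sketch that assumes `ec2` behaves like `boto3.client('ec2')`; the function name, the `pending` tag value, and the `dry_run` flag (a local safety switch, not the EC2 `DryRun` parameter) are assumptions for illustration.

```python
def tag_unattached_volumes(ec2, tag_key="cost-review", tag_value="pending",
                           dry_run=True, batch_size=200):
    """Find volumes in the 'available' state and optionally tag them.

    Returns the list of volume IDs found. Nothing is modified while
    dry_run is True.
    """
    volume_ids = []
    paginator = ec2.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    for page in pages:
        volume_ids.extend(v["VolumeId"] for v in page["Volumes"])

    if not dry_run:
        # Tag in batches to stay well under per-request resource limits
        for i in range(0, len(volume_ids), batch_size):
            ec2.create_tags(
                Resources=volume_ids[i:i + batch_size],
                Tags=[{"Key": tag_key, "Value": tag_value}],
            )
    return volume_ids
```

Running it once with the default `dry_run=True` and reviewing the returned IDs before tagging keeps the step conservative.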

3) Deep diagnostics — mapping cost -> resource -> owner

Use the combination of Cost Explorer, Cost Allocation Tags, Cost Categories, and CloudWatch metrics.

Using Cost Explorer programmatically

# example CLI to get cost by service and region for 7 days
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE Type=DIMENSION,Key=REGION \
  --query 'ResultsByTime[].Groups' --output json | jq .
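
The daily groups returned by a query like the one above are easier to read once summed over the whole window. The following sketch assumes each group's `Keys` list holds the service first and the region second, matching the order of the two `--group-by` arguments; `total_by_service_region` is a hypothetical helper name.

```python
from collections import defaultdict


def total_by_service_region(results_by_time, metric="UnblendedCost"):
    """Sum daily costs per (service, region) pair, largest total first."""
    totals = defaultdict(float)
    for day in results_by_time:
        for group in day.get("Groups", []):
            # Assumes Keys follow the group-by order: SERVICE, then REGION
            service, region = group["Keys"]
            totals[(service, region)] += float(group["Metrics"][metric]["Amount"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The first few entries of the result are usually enough to decide which service and region to inspect at the resource level.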

Map to resources

If Cost Explorer shows EC2 in us-east-1: list recent instance launches (`aws ec2 describe-instances --filters Name=launch-time,…`).
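For S3 spikes: check server access logs (if enabled) or CloudTrail data events for GetObject requests (data events must be enabled on the trail), and review any storage class analysis configured on the bucket (`aws s3api get-bucket-analytics-configuration`).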
For Lambda: list functions with highest invocation counts via CloudWatch metrics.
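
For the EC2 case, filtering launches by time can also be done client-side. This sketch assumes `reservations` is the `Reservations` list from a boto3 `describe_instances` response, where `LaunchTime` is a timezone-aware `datetime`; the helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def since_hours_ago(hours: float) -> datetime:
    """Timezone-aware UTC timestamp `hours` before now."""
    return datetime.now(timezone.utc) - timedelta(hours=hours)


def recent_launches(reservations, since: datetime):
    """Return (launch_time, instance_id, instance_type), newest first,
    for instances launched at or after `since`."""
    found = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            if inst["LaunchTime"] >= since:
                found.append(
                    (inst["LaunchTime"], inst["InstanceId"], inst["InstanceType"])
                )
    return sorted(found, reverse=True)
```

A burst of launches clustered around the start of the cost spike is a strong hint that autoscaling or a scheduler is the cause.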

4) Comprehensive scripts & snippets (copy/paste)

Below are conservative, audited scripts and examples. Always review in a staging environment and get approvals when performing destructive operations.

4.1 AWS CLI — Top-10 cost services (last 7 days)

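# End is exclusive; replace the example dates with your own 7-day window
aws ce get-cost-and-usage \
  --time-period Start=2025-10-18,End=2025-10-25 \
  --granularity DAILY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --output json \
| jq -r '[.ResultsByTime[].Groups[] | {service: .Keys[0], cost: (.Metrics.UnblendedCost.Amount | tonumber)}]
    | group_by(.service)
    | map({service: .[0].service, cost: (map(.cost) | add)})
    | sort_by(-.cost) | .[:10][] | "\(.cost)\t\(.service)"'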

4.2 List unattached EBS volumes (safe to delete after snapshot)

aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[].{ID:VolumeId,Size:Size,AZ:AvailabilityZone,CreateTime:CreateTime}' --output table

# Snapshot an EBS before deletion (example)
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pre-delete-snapshot-$(date +%F)"

4.3 Stop autoscaling groups (temporary, conservative)

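# Conservative: stop new launches but keep the instances that are already running
aws autoscaling suspend-processes --auto-scaling-group-name my-asg --scaling-processes Launch
# Drastic: scale to zero. This terminates every instance in the group.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg --min-size 0 --desired-capacity 0
# Note: review impact before running; consider lifecycle hooks and draining policies.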
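
A gentler containment than scaling to zero is to cap the group at its current size and record the previous settings so the change can be reversed. The sketch below assumes `autoscaling` behaves like `boto3.client('autoscaling')`; the function name is hypothetical, and counting every entry in `Instances` (including ones still launching or terminating) is a simplification.

```python
def cap_asg_at_current_size(autoscaling, group_name):
    """Cap MaxSize at the current instance count.

    Returns the previous MinSize/MaxSize/DesiredCapacity so the
    caller can restore them after the incident.
    """
    resp = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = resp["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"Auto Scaling group not found: {group_name}")
    group = groups[0]

    previous = {key: group[key] for key in ("MinSize", "MaxSize", "DesiredCapacity")}
    current = len(group["Instances"])
    # MaxSize cannot be lower than MinSize
    new_max = max(current, group["MinSize"])
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MaxSize=new_max,
        # Desired capacity must stay within the new bounds
        DesiredCapacity=min(group["DesiredCapacity"], new_max),
    )
    return previous
```

Storing the returned dictionary in the incident ticket gives the recovery step the exact values to restore.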

4.4 Python boto3 — find top S3 buckets by size

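#!/usr/bin/env python3
# Requires: boto3, configured AWS credentials
# Note: BucketSizeBytes is reported about once a day in the bucket's own region;
# only the StandardStorage class is counted here.
import datetime
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
buckets = [b['Name'] for b in s3.list_buckets().get('Buckets', [])]
now = datetime.datetime.now(datetime.timezone.utc)
start = now - datetime.timedelta(days=3)
cw_clients = {}  # one CloudWatch client per region
sizes = []
for b in buckets:
    try:
        # LocationConstraint is None for us-east-1 and 'EU' for legacy eu-west-1 buckets
        region = s3.get_bucket_location(Bucket=b).get('LocationConstraint') or 'us-east-1'
        region = 'eu-west-1' if region == 'EU' else region
        if region not in cw_clients:
            cw_clients[region] = boto3.client('cloudwatch', region_name=region)
        resp = cw_clients[region].get_metric_statistics(
            Namespace='AWS/S3',
            MetricName='BucketSizeBytes',
            Dimensions=[{'Name': 'BucketName', 'Value': b}, {'Name': 'StorageType', 'Value': 'StandardStorage'}],
            StartTime=start,
            EndTime=now,
            Period=86400,
            Statistics=['Maximum']
        )
        # Datapoints are not returned in time order, so pick the most recent one
        pts = sorted(resp.get('Datapoints', []), key=lambda p: p['Timestamp'])
        size = pts[-1]['Maximum'] if pts else 0
        sizes.append((b, size))
    except ClientError as e:
        # permissions or metric not available
        print(f"skipping {b}: {e}")
        continue
sizes_sorted = sorted(sizes, key=lambda x: x[1], reverse=True)
for bucket, size in sizes_sorted[:20]:
    print(bucket, round(size / 1024**3, 2), "GiB")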

4.5 PowerShell (AWS Tools for PowerShell) — quick cost export

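# Requires: Install-Module -Name AWSPowerShell.NetCore
Import-Module AWSPowerShell.NetCore
$end = (Get-Date).ToString("yyyy-MM-dd")
$start = (Get-Date).AddDays(-7).ToString("yyyy-MM-dd")
# GroupBy takes GroupDefinition objects rather than flattened parameters
$groupBy = New-Object Amazon.CostExplorer.Model.GroupDefinition
$groupBy.Type = "DIMENSION"
$groupBy.Key = "SERVICE"
# -Depth keeps nested groups and metrics from being truncated (ConvertTo-Json defaults to depth 2)
Get-CECostAndUsage -TimePeriod_Start $start -TimePeriod_End $end -Granularity DAILY -Metric "UnblendedCost" -GroupBy $groupBy | ConvertTo-Json -Depth 10 | Out-File costs.json
Get-Content costs.json -Raw | ConvertFrom-Json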

4.6 CloudWatch alarm & SNS example (CLI)

# Create an SNS topic and subscription, then alarm for EstimatedCharges
# Billing metrics are published only in us-east-1, so the topic and alarm are created there.
# Prerequisite: enable "Receive Billing Alerts" in the Billing preferences.
aws sns create-topic --name BillingAlerts --region us-east-1
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789012:BillingAlerts --protocol email --notification-endpoint ops@example.com --region us-east-1

aws cloudwatch put-metric-alarm --alarm-name "Monthly-EstimatedCharges-High" \
  --metric-name EstimatedCharges --namespace AWS/Billing --statistic Maximum \
  --dimensions Name=Currency,Value=USD \
  --period 21600 --threshold 100 --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 --alarm-actions arn:aws:sns:us-east-1:123456789012:BillingAlerts --region us-east-1

5) Guardrails — CloudFormation and Terraform policy examples

Guardrails reduce future risk by enforcing safe defaults: deny public S3 buckets, enforce tagging, limit instance sizes, prevent creation of expensive resources in dev accounts, enforce budgets.

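5.1 CloudFormation — tag enforcement with an AWS Config rule (example snippet)

AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS Config managed rule that flags resources missing required tags (example)'
Resources:
  RequiredTagsRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: required-tags
      InputParameters:
        tag1Key: team
        tag2Key: env
        tag3Key: cost_center
      Source:
        Owner: AWS
        SourceIdentifier: REQUIRED_TAGS
# Note: this rule only reports non-compliance and needs an active AWS Config recorder.
# Use AWS Config custom rules or SCPs in Organizations for stronger enforcement.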

5.2 Terraform Sentinel-style policy (pseudo) — enforce tags & cap instance sizes in dev

# Terraform Policy-like pseudocode (use Sentinel/Policy as Code in your pipeline)
# 1) require tags: team, env, cost_center
# 2) disallow instance types > m5.4xlarge in dev account
# 3) disallow S3 buckets that allow public access without an explicit bucket policy
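
If Sentinel is not available, rules 1 and 2 can be approximated in a pipeline step that inspects the JSON form of a plan (the output of `terraform show -json`). This is a sketch under assumptions: the allow-list of instance types is hypothetical and stands in for a real size comparison, and `check_plan` is an illustrative name. An instance type that is unknown at plan time is also reported, which errs on the side of caution.

```python
REQUIRED_TAGS = frozenset({"team", "env", "cost_center"})
# Hypothetical allow-list for a dev account; adjust to your own policy.
ALLOWED_DEV_INSTANCE_TYPES = frozenset({"t3.micro", "t3.small", "t3.medium", "m5.large"})


def check_plan(plan, required_tags=REQUIRED_TAGS,
               allowed_types=ALLOWED_DEV_INSTANCE_TYPES):
    """Return a list of human-readable violations found in a plan dict."""
    violations = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after")
        if after is None:
            continue  # resource is being destroyed
        # Only resources that expose a tags attribute are checked
        if "tags" in after:
            missing = required_tags - set(after.get("tags") or {})
            if missing:
                violations.append(
                    f"{rc['address']}: missing tags {sorted(missing)}"
                )
        if rc.get("type") == "aws_instance":
            instance_type = after.get("instance_type")
            if instance_type not in allowed_types:
                violations.append(
                    f"{rc['address']}: instance type {instance_type} "
                    "is not on the allow-list"
                )
    return violations
```

A CI job could load the plan with `json.load`, call `check_plan`, and fail the build when the returned list is non-empty.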

5.3 AWS Organizations SCP example — deny specific actions (111122223333 stands in for the dev account ID)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateNatGatewayInDev",
      "Effect": "Deny",
      "Action": ["ec2:CreateNatGateway"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {"aws:PrincipalAccount": "111122223333"}
      }
    }
  ]
}

6) Rightsizing & long-term savings (Compute Optimizer, RIs, Spot)

Rightsizing is the heart of cost optimization. Use Compute Optimizer recommendations, but validate before change. For steady-state workloads, Savings Plans often beat RIs in flexibility.

Compute Optimizer workflow

Enable Compute Optimizer for accounts (requires role & permissions).
Review recommendations (CPU/Memory/Network) for EC2, ASGs, and EBS.
Tag recommended instances as `co:reviewed` when validated.
Roll out changes in canary waves and monitor performance & error rates.

Purchasing RIs / Savings Plans

Run utilization analysis for 30–90 days. Commit to RIs only after verifying baseline usage. Consider convertible RIs or compute Savings Plans for more flexibility.
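
Two small calculations help frame the commitment decision. A commitment billed at `(1 - discount)` of the on-demand rate for every hour only beats on-demand when it is used for more than `(1 - discount)` of those hours, and a low quantile of historical usage is one conservative way to pick the baseline. The sketch below takes the discount and the usage series as inputs you supply; it does not encode any actual AWS price, and both function names are illustrative.

```python
def break_even_utilization(discount):
    """Fraction of hours a commitment must be used to beat on-demand.

    Committed cost per hour is (1 - discount) * on_demand, paid for all
    hours. On-demand cost is on_demand * utilization. The commitment
    wins when utilization > (1 - discount).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def baseline_usage(hourly_usage, quantile=0.1):
    """Usage level met or exceeded in roughly (1 - quantile) of the hours."""
    if not hourly_usage:
        raise ValueError("hourly_usage is empty")
    ordered = sorted(hourly_usage)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]
```

For example, with a 25% discount the commitment has to be in use at least 75% of the time, so committing only to the level returned by `baseline_usage` over 30 to 90 days of data keeps that risk low.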

7) FinOps: Organizing teams, sprints, and cost ownership

FinOps is a cultural practice. Below is a recommended adoption plan across 3 phases.

Phase 1 — Visibility (0–3 months)

Enable Cost Explorer, activate the AWS-generated `aws:createdBy` cost allocation tag, and apply team tags at creation time.
Set up budgets, alerts, and weekly cost reports emailed to cost owners.
Run a baseline audit to find top 10 cost drivers.

Phase 2 — Accountability & Optimization (3–9 months)

Assign cost owners per service/app and include a cost KPI in sprint reviews.
Start rightsizing cadence and a reserved instance / savings plan review cycle.
Automate shutdown of dev environments during off hours.
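
The off-hours shutdown in Phase 2 comes down to two decisions: is it currently outside working hours, and which instances count as dev. A minimal sketch, assuming a Monday to Friday 08:00-19:00 window in the clock of whatever `datetime` is passed in, and an `env=dev` tag convention; both are assumptions to adapt, as are the function names.

```python
from datetime import datetime, time


def is_off_hours(now: datetime, start=time(8, 0), end=time(19, 0),
                 workdays=(0, 1, 2, 3, 4)) -> bool:
    """True when `now` is outside the working window (Mon-Fri by default)."""
    if now.weekday() not in workdays:
        return True
    return not (start <= now.time() < end)


def dev_instance_ids(reservations, env_tag="env", env_value="dev"):
    """IDs of running instances whose env tag marks them as dev.

    `reservations` is assumed to be the Reservations list from a
    boto3 describe_instances response.
    """
    ids = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if tags.get(env_tag) == env_value and inst["State"]["Name"] == "running":
                ids.append(inst["InstanceId"])
    return ids
```

A scheduled job could call `is_off_hours` and, when it returns true, pass the result of `dev_instance_ids` to `stop_instances` on an EC2 client.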

Phase 3 — Automation & Continuous Improvement (9+ months)

Policy-as-code enforcement (SCPs, Terraform + pipeline policies).
Cost-aware CI/CD (estimate resource cost before merge for infrastructure changes).
Integrate cost into product prioritization decisions.

8) Full incident response playbook — step-by-step (playbook)

Use this playbook for documented, repeatable response. Store it in your runbook/incident management tool.

Playbook (summary)

Detect: Budget or anomaly alert triggers.
Triage: Identify accounts/services/regions; create incident in tracker.
Contain: Scale down ASGs, quarantine accounts, disable cross-region replication if safe.
Investigate: Correlate logs, CloudTrail, and Cost Explorer; identify the IAM principal or automation that made the changes.
Remediate: Delete orphaned resources (with snapshots), patch automation bugs, update autoscaler settings.
Recover: Restore any intentionally halted services after validation; inform stakeholders.
Post-mortem: Document root cause, impact, timeline, and corrective actions (SOPs & policy changes).

Playbook — Example timeline

00:00 Detect - Budget alert triggers (Ops receives email/SNS)
00:05 Triage - Run Cost Explorer queries to determine service & region
00:12 Contain - Set ASG desired capacity to safe value, block new launches temporarily
00:20 Investigate - CloudTrail + CloudWatch logs show a scheduler starting analytics in wrong region
00:45 Remediate - Disable bad cron, snapshot, and delete temporary resources
02:00 Post-mortem - Share notes and update runbook

9) Example policies & enforcement (SCP, Config, Lambda auto-remediation)

Use AWS Config managed rules and Lambda-based auto-remediation for repeatable enforcement.

AWS Config rule example: S3 public access

{
  "ConfigRuleName": "s3-buckets-public-read-prohibited",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  }
}
# Auto-remediation: Lambda that tags owner & applies bucket policy to block public access

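Example Lambda auto-remediate (sketch)

import boto3
def handler(event, context):
    bucket = event['detail']['requestParameters']['bucketName']  # assumes a CloudTrail-sourced EventBridge event
    flags = dict.fromkeys(('BlockPublicAcls', 'IgnorePublicAcls', 'BlockPublicPolicy', 'RestrictPublicBuckets'), True)
    boto3.client('s3').put_public_access_block(Bucket=bucket, PublicAccessBlockConfiguration=flags)
    # still to do: add tag 'owner:unknown' and notify the owner channel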

10) Reporting, dashboards, and runbook artifacts

Build simple dashboards for daily ops: Top 10 services, top buckets, top EC2 costs, anomaly scoring. Send weekly digest to finance and product owners.
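
"Anomaly scoring" can start very simply. The sketch below scores each day's cost as a z-score against the preceding days; it is one basic approach among many, not the method used by AWS Cost Anomaly Detection, and the 7-day window and the function name are assumptions.

```python
from statistics import mean, pstdev


def anomaly_scores(daily_costs, window=7):
    """Z-score of each day's cost against the preceding `window` days.

    Returns (index, score) pairs for every day that has a full window
    of history before it.
    """
    scores = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Flat history: any change at all is reported as an infinite score
            score = 0.0 if daily_costs[i] == mu else float("inf")
        else:
            score = (daily_costs[i] - mu) / sigma
        scores.append((i, score))
    return scores
```

A dashboard might flag days whose score exceeds a chosen threshold such as 3, tuned to how noisy the account's spend normally is.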

11) Printable checklist (one-page)

First-Response Checklist — Cost Spike

Open Cost Explorer & find top service/region (24–72h).
Notify cost owner & create incident ticket.
Run CLI command: aws ce get-cost-and-usage ... and save output.
List and snapshot unattached EBS volumes.
Temporarily reduce ASG desired capacity (if safe).
Check S3 access logs & CloudTrail for unexpected GET/PUT spikes.
Check cross-region replication & stop if safe.
Document remediation steps & owners in post-mortem.

12) Appendices — Extra scripts, sample CloudFormation & Terraform policies

Appendix A — Cost Explorer SDK (Python) sample

import boto3, datetime, json
client = boto3.client('ce')
end = datetime.date.today()
start = end - datetime.timedelta(days=7)
resp = client.get_cost_and_usage(
    TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
    Granularity='DAILY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type':'DIMENSION','Key':'SERVICE'}]
)
print(json.dumps(resp, indent=2))
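
Cost Explorer responses can be paginated, in which case the response carries a `NextPageToken` that has to be passed back on the next call. A minimal sketch of that loop, assuming `client` behaves like `boto3.client('ce')`; the helper name `get_all_cost_results` is illustrative.

```python
def get_all_cost_results(client, **kwargs):
    """Collect ResultsByTime across every page of get_cost_and_usage."""
    results = []
    token = None
    while True:
        if token:
            kwargs["NextPageToken"] = token
        resp = client.get_cost_and_usage(**kwargs)
        results.extend(resp.get("ResultsByTime", []))
        token = resp.get("NextPageToken")
        if not token:
            return results
```

It takes the same keyword arguments as the call in the sample above (`TimePeriod`, `Granularity`, `Metrics`, `GroupBy`).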

Appendix B — Terraform snippet: block public S3 access via the public access block resource

resource "aws_s3_bucket_public_access_block" "deny_public" {
  bucket = aws_s3_bucket.mybucket.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Appendix C — Example CloudFormation to create budget (JSON simplified)

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "MonthlyBudget": {
      "Type": "AWS::Budgets::Budget",
      "Properties": {
        "Budget": {
          "BudgetType": "COST",
          "TimeUnit": "MONTHLY",
          "BudgetLimit": {"Amount": 1000, "Unit": "USD"},
          "CostFilters": {}
        },
        "NotificationsWithSubscribers": [
          {
            "Notification": {"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80},
            "Subscribers":[{"SubscriptionType":"EMAIL","Address":"ops@example.com"}]
          }
        ]
      }
    }
  }
}

Conclusion — People, process & platform

Cost spikes are frightening but manageable. The fastest path from alert -> remediation -> prevention is:

Alerting & visibility (budgets, CE, tags)
Fast containment (conservative throttles & snapshots)
Root-cause fix (automation bug fixes, lifecycle policies)
Preventive guardrails (SCPs, Config rules, policy-as-code)
Cultural adoption (FinOps: owners, cadences, sprint ops)

Use the scripts above as starting points and adapt them to your organization’s processes and approval flows.