Cloud Knowledge

Your Go-To Hub for Cloud Solutions & Insights

Why your AWS bill spiked: Causes, fixes & full incident playbook (FinOps + automation)

A single-file, actionable playbook to diagnose spikes, remediate safely, and embed continuous optimization. Includes scripts (AWS CLI, boto3, PowerShell), guardrails (CloudFormation/Terraform), budgets & alarms, FinOps process, and printable checklists.

Author: CloudKnowledge · Updated:

Quick TL;DR: Most AWS billing spikes are caused by misconfigurations (autoscaling loops, runaway jobs), forgotten resources, data egress, or storage tier mistakes. Use Cost Explorer + Compute Optimizer + budgets + tagging + automation to find and fix root causes. This file gives scripts and guardrails you can copy-paste now.

Visit CloudKnowledge for additional templates and workshops.

1) Why AWS bills spike — high-level causes (expanded)

Understanding the categories helps you prioritize the right diagnostic tool. Below are common categories with practical tips and a severity guide (Critical / High / Medium / Low).

Common categories

  • Autoscaling & runaway compute (High): bad lifecycle hooks, infinite scaling loops (scale out -> fail health checks -> launch more), or cron jobs that spin up servers per request.
  • Orphaned resources (Medium): unattached EBS volumes, unused AMIs, unused ELBs, old snapshots, and idle NAT gateways, which bill hourly even when no traffic flows through them.
  • Data egress (High): cross-region replication, public traffic from S3 or EC2, large downloads by analytics jobs.
  • Storage tiering & retrieval (Medium-High): Glacier/Deep Archive retrievals or keeping rarely-accessed data in Standard.
  • High-frequency API services (Medium): excessive Lambda invocations, Kinesis shards, DynamoDB read/write costs due to hot keys.
  • Poor reserved instance planning (Medium): paying on-demand for long-running workloads when RIs/Savings Plans would save.
  • Third-party SaaS/Integrations (Low-Medium): monitoring or analytics tools pulling lots of data from S3 or APIs.
  • Lack of visibility (Critical): no tagging or no central billing view makes any analysis slow.

2) First 30-minute incident playbook — stop the bleeding

When a spike occurs, people panic. Follow these steps in order; each step is reversible or conservative where possible.

30-minute checklist (prioritized)

  1. Identify scope: Sign in to the Billing console or run Cost Explorer queries for the last 24–72 hours. Focus on the region, account, and service driving the spike.
  2. Notify stakeholders: Send an incident message to Slack/email with a one-line summary and owner list (tag environment owners).
  3. Limit blast radius: Apply a temporary throttle: cap the ASG maximum size at the current running count, apply an IAM deny on new launches in the suspect account, or add WAF/CloudFront rate limits if the egress comes from public traffic.
  4. Quickly snapshot & tag resources: Snapshot EBS before deletion and tag discovered orphans as `cost-review`.
  5. Collect evidence: Export Cost Explorer JSON, CloudWatch alarms, and recent CloudTrail events related to scaling/launch actions.
  6. Remediate non-critical items: Delete unattached EBS volumes (after snapshotting them), pause large nightly jobs, and temporarily disable cross-region replication if safe.
  7. Perform root cause analysis: Once immediate costs drop, correlate the spike timestamps with infrastructure actions and deploy long-term fixes (lifecycle hooks, budgets, monitoring).

Automated first-responder — low-risk commands
# 1) Identify top services (CLI)
aws ce get-cost-and-usage --time-period Start=$(date -d '3 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) --granularity DAILY --metrics "UnblendedCost" --group-by Type=DIMENSION,Key=SERVICE --query 'ResultsByTime[].Groups' --output json | jq .

# 2) List available (unattached) EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[].{ID:VolumeId,Size:Size,AZ:AvailabilityZone}' --output table
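
Once the Cost Explorer output is saved, a small helper can point at the services that jumped. The following is a minimal sketch, not part of the original playbook: it assumes the input is the `ResultsByTime` list from a `get-cost-and-usage` call grouped by SERVICE, and the function name `find_spikes`, the `factor` multiplier, and the `min_cost` floor are hypothetical choices for illustration.

```python
def find_spikes(results_by_time, factor=2.0, min_cost=1.0):
    """Flag services whose latest daily cost exceeds `factor` times the
    average of the earlier days in the window.

    `results_by_time` is the ResultsByTime list from Cost Explorer,
    grouped by a single key (for example SERVICE).
    """
    # One {service: cost} dict per day, in the order Cost Explorer returned them.
    days = [
        {g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
         for g in day.get("Groups", [])}
        for day in results_by_time
    ]
    if len(days) < 2:
        return []  # nothing to compare against

    spikes = []
    for service in sorted(set().union(*days)):
        costs = [d.get(service, 0.0) for d in days]  # missing day counts as 0
        baseline = sum(costs[:-1]) / len(costs[:-1])
        latest = costs[-1]
        if latest >= min_cost and latest > factor * baseline:
            spikes.append((service, latest, baseline))
    # Most expensive spike first.
    return sorted(spikes, key=lambda s: s[1], reverse=True)
```

The most recent day in Cost Explorer can still be incomplete, so a quiet result does not rule out a spike that started a few hours ago.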
            

3) Deep diagnostics — mapping cost -> resource -> owner

Use the combination of Cost Explorer, Cost Allocation Tags, Cost Categories, and CloudWatch metrics.

Using Cost Explorer programmatically

# example CLI to get cost by service and region for 7 days
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE Type=DIMENSION,Key=REGION \
  --query 'ResultsByTime[].Groups' --output json | jq .
          

Map to resources

  • If Cost Explorer shows EC2 in us-east-1: list recent instance launches (`aws ec2 describe-instances --filters Name=launch-time,...`).
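  • For S3 spikes: review storage class analysis settings (`aws s3api get-bucket-analytics-configuration`) for tiering problems, and use server access logs (if enabled) or CloudTrail data events to find GetObject request spikes. Object-level calls appear in CloudTrail only when data events are enabled on the trail.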
  • For Lambda: list functions with highest invocation counts via CloudWatch metrics.
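
For the EC2 case, the mapping step can be sketched as a filter over `describe_instances` output. This is an illustrative helper under stated assumptions: `recent_launches` is a hypothetical name, the `team` tag key follows the tagging scheme used later in this playbook, and the input is the `Reservations` list as boto3 returns it (with timezone-aware `LaunchTime` values).

```python
from datetime import datetime, timedelta, timezone


def recent_launches(reservations, since=None, hours=24):
    """Return (launch_time, instance_id, instance_type, team) for instances
    launched at or after `since` (default: the last `hours` hours)."""
    if since is None:
        since = datetime.now(timezone.utc) - timedelta(hours=hours)
    rows = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            if inst["LaunchTime"] >= since:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                rows.append((inst["LaunchTime"], inst["InstanceId"],
                             inst["InstanceType"], tags.get("team", "untagged")))
    return sorted(rows)  # oldest launch first


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   pages = ec2.get_paginator("describe_instances").paginate()
#   reservations = [r for page in pages for r in page["Reservations"]]
#   for row in recent_launches(reservations, hours=24):
#       print(*row)
```

Instances that show up as `untagged` are the ones to chase first, since nobody owns their cost yet.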

4) Comprehensive scripts & snippets (copy/paste)

Below are conservative scripts and examples. Review them in a staging environment first, and get approval before running any destructive operation.

4.1 AWS CLI — Top-10 cost services (last 7 days)

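aws ce get-cost-and-usage \
  --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --output json \
  | jq '[.ResultsByTime[].Groups[]]
        | group_by(.Keys[0])
        | map({service: .[0].Keys[0], cost: (map(.Metrics.UnblendedCost.Amount | tonumber) | add)})
        | sort_by(-.cost) | .[:10]'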

4.2 List unattached EBS volumes (safe to delete after snapshot)

aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[].{ID:VolumeId,Size:Size,AZ:AvailabilityZone,CreateTime:CreateTime}' --output table

# Snapshot an EBS before deletion (example)
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pre-delete-snapshot-$(date +%F)"
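
Before deleting anything, it helps to know how much the unattached volumes actually cost. The sketch below is an illustration, not an audited tool: `stale_volume_report` is a hypothetical name, the 14-day age cutoff is arbitrary, and the default price of 0.08 USD per GiB-month is an assumption (roughly gp3 in us-east-1) that you should replace with your own rate. Note that `CreateTime` is when the volume was created, not when it was detached.

```python
from datetime import datetime, timezone


def stale_volume_report(volumes, now=None, min_age_days=14, usd_per_gib_month=0.08):
    """Summarize unattached ('available') volumes older than `min_age_days`.

    `volumes` is the Volumes list from ec2.describe_volumes().
    `usd_per_gib_month` is an assumed price; adjust for volume type and region.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stale = [
        v for v in volumes
        if v["State"] == "available" and (now - v["CreateTime"]).days >= min_age_days
    ]
    total_gib = sum(v["Size"] for v in stale)  # Size is reported in GiB
    return {
        "volume_ids": [v["VolumeId"] for v in stale],
        "total_gib": total_gib,
        "est_monthly_usd": round(total_gib * usd_per_gib_month, 2),
    }
```

The estimate covers storage only; provisioned IOPS or throughput on io1/io2/gp3 volumes would add to it.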

4.3 Stop autoscaling groups (temporary, conservative)

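# Conservative: stop new launches without terminating running instances
aws autoscaling suspend-processes --auto-scaling-group-name my-asg --scaling-processes Launch
# Aggressive: scale to zero. This terminates every instance in the group.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg --min-size 0 --desired-capacity 0
# Note: review impact before running; consider lifecycle hooks and draining policies.
# Undo the suspension with: aws autoscaling resume-processes --auto-scaling-group-name my-asg --scaling-processes Launch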

4.4 Python boto3 — find top S3 buckets by size

#!/usr/bin/env python3
# Requires: boto3, configured AWS credentials
# Note: S3 storage metrics are published daily in each bucket's own region, so
# this only finds sizes for buckets in the CloudWatch client's region, and it
# counts StandardStorage only.
import datetime

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
cloudwatch = boto3.client('cloudwatch')
buckets = [b['Name'] for b in s3.list_buckets().get('Buckets', [])]
now = datetime.datetime.now(datetime.timezone.utc)
start = now - datetime.timedelta(days=3)
sizes = []
for b in buckets:
    try:
        resp = cloudwatch.get_metric_statistics(
            Namespace='AWS/S3',
            MetricName='BucketSizeBytes',
            Dimensions=[{'Name': 'BucketName', 'Value': b}, {'Name': 'StorageType', 'Value': 'StandardStorage'}],
            StartTime=start,
            EndTime=now,
            Period=86400,
            Statistics=['Maximum']
        )
    except ClientError:
        # permissions or metric not available
        continue
    pts = resp.get('Datapoints', [])
    # Datapoints are not guaranteed to be in time order; use the newest one
    size = max(pts, key=lambda p: p['Timestamp'])['Maximum'] if pts else 0
    sizes.append((b, size))
sizes_sorted = sorted(sizes, key=lambda x: x[1], reverse=True)
for bucket, size in sizes_sorted[:20]:
    print(bucket, round(size / 1024 ** 3, 2), "GiB")

4.5 PowerShell (AWS Tools for PowerShell) — quick cost export

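# Requires: Install-Module -Name AWSPowerShell.NetCore
Import-Module AWSPowerShell.NetCore
$end = (Get-Date).ToString("yyyy-MM-dd")
$start = (Get-Date).AddDays(-7).ToString("yyyy-MM-dd")
# GroupBy takes GroupDefinition objects
$groupBy = New-Object Amazon.CostExplorer.Model.GroupDefinition
$groupBy.Type = "DIMENSION"
$groupBy.Key = "SERVICE"
# -Depth keeps nested Groups/Metrics from being truncated by ConvertTo-Json
Get-CECostAndUsage -TimePeriod_Start $start -TimePeriod_End $end -Granularity DAILY -Metric "UnblendedCost" -GroupBy $groupBy | ConvertTo-Json -Depth 10 | Out-File costs.json
Get-Content costs.json -Raw | ConvertFrom-Json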

4.6 CloudWatch alarm & SNS example (CLI)

# Create an SNS topic and subscription, then alarm for EstimatedCharges
# Billing metrics exist only in us-east-1 and require "Receive Billing Alerts" to be enabled in Billing preferences
aws sns create-topic --name BillingAlerts --region us-east-1
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789012:BillingAlerts --protocol email --notification-endpoint ops@example.com --region us-east-1

aws cloudwatch put-metric-alarm --alarm-name "Monthly-EstimatedCharges-High" \
  --metric-name EstimatedCharges --namespace AWS/Billing --statistic Maximum \
  --dimensions Name=Currency,Value=USD \
  --period 21600 --threshold 100 --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 --alarm-actions arn:aws:sns:us-east-1:123456789012:BillingAlerts --region us-east-1

5) Guardrails — CloudFormation and Terraform policy examples

Guardrails reduce future risk by enforcing safe defaults: deny public S3 buckets, enforce tagging, limit instance sizes, prevent creation of expensive resources in dev accounts, enforce budgets.

5.1 CloudFormation — tag enforcement with an AWS Config rule (example snippet)

AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS Config managed rule that flags resources missing required tags (example)'
Resources:
  RequiredTagsRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: required-tags
      InputParameters:
        tag1Key: team
        tag2Key: env
        tag3Key: cost_center
      Source:
        Owner: AWS
        SourceIdentifier: REQUIRED_TAGS
# Note: requires an AWS Config recorder, and it detects rather than blocks. Use SCPs in Organizations for stronger enforcement.

5.2 Terraform Sentinel-style policy (pseudo) — enforce tags & cap instance sizes in dev

# Terraform Policy-like pseudocode (use Sentinel/Policy as Code in your pipeline)
# 1) require tags: team, env, cost_center
# 2) disallow instance types larger than m5.4xlarge in the dev account
# 3) disallow S3 buckets without a public access block
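
Rule 1 can be approximated without Sentinel by inspecting the JSON form of a plan (`terraform show -json plan.out`). The following is a minimal sketch under stated assumptions: `missing_tags` is a hypothetical helper, it only looks at the `tags` attribute under `resource_changes[].change.after`, and it does not account for provider-level default tags or tag values that are unknown until apply.

```python
REQUIRED_TAGS = frozenset({"team", "env", "cost_center"})


def missing_tags(plan, required=REQUIRED_TAGS):
    """Map resource address -> sorted list of required tags it lacks.

    `plan` is the dict parsed from `terraform show -json`.
    Only resources being created or updated are checked.
    """
    violations = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if not {"create", "update"} & set(change.get("actions", [])):
            continue  # deletes and no-ops are ignored
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # this resource type has no tags attribute
        missing = set(required) - set(after.get("tags") or {})
        if missing:
            violations[rc["address"]] = sorted(missing)
    return violations


# Usage in a pipeline step (file name is an example):
#   import json, sys
#   with open("plan.json") as fh:
#       bad = missing_tags(json.load(fh))
#   if bad:
#       print(bad)
#       sys.exit(1)
```

Failing the pipeline on a non-empty result gives the same effect as a hard-mandatory policy, though a dedicated policy engine is easier to maintain as the rule set grows.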

5.3 AWS Organizations SCP example — deny specific actions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateNatGatewayInDev",
      "Effect": "Deny",
      "Action": ["ec2:CreateNatGateway"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {"aws:PrincipalAccount": "111122223333"}
      }
    }
  ]
}
# Note: 111122223333 stands in for the dev account ID. JSON policies cannot contain comments, so keep notes like this outside the document.

6) Rightsizing & long-term savings (Compute Optimizer, RIs, Spot)

Rightsizing is the heart of cost optimization. Use Compute Optimizer recommendations, but validate before change. For steady-state workloads, Savings Plans often beat RIs in flexibility.

Compute Optimizer workflow

  1. Enable Compute Optimizer for accounts (requires role & permissions).
  2. Review recommendations (CPU/Memory/Network) for EC2, ASGs, and EBS.
  3. Tag recommended instances as `co:reviewed` when validated.
  4. Roll out changes in canary waves and monitor performance & error rates.

Purchasing RIs / Savings Plans

Run utilization analysis for 30–90 days. Commit to RIs only after verifying baseline usage. Consider convertible RIs or compute Savings Plans for more flexibility.
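
One way to turn the utilization analysis into a number is to look at a low percentile of hourly on-demand spend, so that the commitment is covered almost every hour. This is a sketch of that arithmetic only: `commitment_baseline` is a hypothetical helper, the nearest-rank percentile and the default of 10 are assumptions rather than AWS guidance, and the result is on-demand-equivalent spend, not the Savings Plan commitment itself (commitments are expressed at the discounted rate).

```python
def commitment_baseline(hourly_spend, percentile=10):
    """Return the hourly spend at the given percentile (nearest-rank method).

    `hourly_spend` is a list of on-demand USD/hour samples over 30-90 days.
    `percentile` is an integer from 0 to 100; lower values are more cautious.
    """
    if not hourly_spend:
        raise ValueError("need at least one sample")
    ordered = sorted(hourly_spend)
    # Integer ceiling of percentile/100 * n, clamped to at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Example: ten hourly samples with one burst hour at 20 USD.
samples = [4, 5, 5, 6, 6, 6, 7, 8, 9, 20]
print(commitment_baseline(samples, 10))  # 4: covered in every sampled hour
print(commitment_baseline(samples, 50))  # 6: covered in about half the hours
```

The burst hour does not move either figure, which is the point: commit to the floor and leave peaks on on-demand or Spot.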

7) FinOps: Organizing teams, sprints, and cost ownership

FinOps is a cultural practice. Below is a recommended adoption plan across 3 phases.

Phase 1 — Visibility (0–3 months)

  • Enable Cost Explorer, activate the AWS-generated aws:createdBy cost allocation tag, and require team tags at creation time.
  • Set up budgets, alerts, and weekly cost reports emailed to cost owners.
  • Run a baseline audit to find top 10 cost drivers.

Phase 2 — Accountability & Optimization (3–9 months)

  • Assign cost owners per service/app and include a cost KPI in sprint reviews.
  • Start rightsizing cadence and a reserved instance / savings plan review cycle.
  • Automate shutdown of dev environments during off hours.
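
The off-hours shutdown can start as a simple scheduling rule evaluated by a scheduled job. The sketch below shows only the decision logic, under assumptions that are not from the original text: `should_stop` is a hypothetical helper, the `env` tag marks dev resources, a `keep-alive` tag opts a resource out, and working hours are 08:00 to 19:00 on weekdays in the scheduler's local time.

```python
from datetime import datetime


def should_stop(tags, now=None, start_hour=8, end_hour=19):
    """Decide whether a dev resource should be stopped right now.

    `tags` is a {key: value} dict for the resource.
    Only env=dev resources without keep-alive=true are ever stopped.
    """
    if now is None:
        now = datetime.now()
    if tags.get("env") != "dev":
        return False
    if tags.get("keep-alive", "").lower() == "true":
        return False
    is_weekend = now.weekday() >= 5  # Saturday=5, Sunday=6
    in_working_hours = start_hour <= now.hour < end_hour
    return is_weekend or not in_working_hours
```

A scheduled Lambda or cron job could call this for each tagged instance and pass the matching IDs to `ec2.stop_instances`; keeping the rule in a pure function makes it easy to test before it touches anything.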

Phase 3 — Automation & Continuous Improvement (9+ months)

  • Policy-as-code enforcement (SCPs, Terraform + pipeline policies).
  • Cost-aware CI/CD (estimate resource cost before merge for infrastructure changes).
  • Integrate cost into product prioritization decisions.

8) Full incident response playbook — step-by-step (playbook)

Use this playbook for documented, repeatable response. Store it in your runbook/incident management tool.

Playbook (summary)

  1. Detect: Budget or anomaly alert triggers.
  2. Triage: Identify accounts/services/regions; create incident in tracker.
  3. Contain: Scale down ASGs, quarantine accounts, disable cross-region replication if safe.
  4. Investigate: Correlate logs, CloudTrail, and Cost Explorer; identify the IAM principal or automation that made the changes.
  5. Remediate: Delete orphaned resources (with snapshots), patch automation bugs, update autoscaler settings.
  6. Recover: Restore any intentionally halted services after validation; inform stakeholders.
  7. Post-mortem: Document root cause, impact, timeline, and corrective actions (SOPs & policy changes).

Playbook — Example timeline

00:00 Detect - Budget alert triggers (Ops receives email/SNS)
00:05 Triage - Run Cost Explorer queries to determine service & region
00:12 Contain - Set ASG desired capacity to safe value, block new launches temporarily
00:20 Investigate - CloudTrail + CloudWatch logs show a scheduler starting analytics in wrong region
00:45 Remediate - Disable bad cron, snapshot, and delete temporary resources
02:00 Post-mortem - Share notes and update runbook

9) Example policies & enforcement (SCP, Config, Lambda auto-remediation)

Use AWS Config managed rules and Lambda-based auto-remediation for repeatable enforcement.

AWS Config rule example: S3 public access

{
  "ConfigRuleName": "s3-buckets-public-read-prohibited",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  }
}
# Auto-remediation: Lambda that tags owner & applies bucket policy to block public access

Example Lambda auto-remediate (pseudo)

def handler(event, context):
    bucket = event['detail']['requestParameters']['bucketName']
    # add tag 'owner:unknown' and apply deny-public policy, then notify owner channel
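
A runnable version of the blocking step might look like the sketch below. It is a starting point under stated assumptions: the event shape matches the pseudo-handler above (a CloudTrail-style event carrying `requestParameters.bucketName`), `remediate_public_bucket` is a hypothetical helper, and it applies an S3 public access block rather than writing a bucket policy. Tagging and owner notification are left out.

```python
def remediate_public_bucket(event, s3):
    """Turn on all four public-access-block settings for the bucket named
    in the event. `s3` is a boto3 S3 client (or a stand-in for tests)."""
    bucket = event["detail"]["requestParameters"]["bucketName"]
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return bucket


def handler(event, context):
    # boto3 is imported here so the helper above can be tested without AWS.
    import boto3
    return remediate_public_bucket(event, boto3.client("s3"))
```

Passing the client in as a parameter keeps the remediation logic testable with a fake client, which matters for code that changes access settings automatically.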

10) Reporting, dashboards, and runbook artifacts

Build simple dashboards for daily ops: Top 10 services, top buckets, top EC2 costs, anomaly scoring. Send weekly digest to finance and product owners.

11) Printable checklist (one-page)

First-Response Checklist — Cost Spike

  1. Open Cost Explorer & find top service/region (24–72h).
  2. Notify cost owner & create incident ticket.
  3. Run CLI command: aws ce get-cost-and-usage ... and save output.
  4. List and snapshot unattached EBS volumes.
  5. Temporarily reduce ASG desired capacity (if safe).
  6. Check S3 access logs & CloudTrail for unexpected GET/PUT spikes.
  7. Check cross-region replication & stop if safe.
  8. Document remediation steps & owners in post-mortem.

12) Appendices — Extra scripts, sample CloudFormation & Terraform policies

Appendix A — Cost Explorer SDK (Python) sample

import boto3, datetime, json
client = boto3.client('ce')
end = datetime.date.today()
start = end - datetime.timedelta(days=7)
resp = client.get_cost_and_usage(
    TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
    Granularity='DAILY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type':'DIMENSION','Key':'SERVICE'}]
)
print(json.dumps(resp, indent=2))
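
Large accounts can return more groups than fit in one response, in which case Cost Explorer includes a `NextPageToken`. A minimal pagination sketch, assuming a client with the same `get_cost_and_usage` method as above (`fetch_all_results` is a hypothetical helper name):

```python
def fetch_all_results(client, **kwargs):
    """Call get_cost_and_usage repeatedly, following NextPageToken,
    and return the combined ResultsByTime list."""
    params = dict(kwargs)  # avoid mutating the caller's arguments
    results = []
    while True:
        resp = client.get_cost_and_usage(**params)
        results.extend(resp.get("ResultsByTime", []))
        token = resp.get("NextPageToken")
        if not token:
            return results
        params["NextPageToken"] = token


# Usage with the client from the sample above:
#   rows = fetch_all_results(
#       client,
#       TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
#       Granularity='DAILY',
#       Metrics=['UnblendedCost'],
#       GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
#   )
```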

Appendix B — Terraform snippet: block public S3 access with a public access block resource

resource "aws_s3_bucket_public_access_block" "deny_public" {
  bucket = aws_s3_bucket.mybucket.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Appendix C — Example CloudFormation to create budget (JSON simplified)

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "MonthlyBudget": {
      "Type": "AWS::Budgets::Budget",
      "Properties": {
        "Budget": {
          "BudgetType": "COST",
          "TimeUnit": "MONTHLY",
          "BudgetLimit": {"Amount": 1000, "Unit": "USD"},
          "CostFilters": {}
        },
        "NotificationsWithSubscribers": [
          {
            "Notification": {"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80},
            "Subscribers":[{"SubscriptionType":"EMAIL","Address":"ops@example.com"}]
          }
        ]
      }
    }
  }
}
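
The budget above alerts on actual spend at 80% of 1000 USD. To see earlier whether the month is heading past that line, a straight-line projection of month-to-date cost is a rough but useful check. This is an illustrative sketch: both function names are hypothetical, and the projection assumes spend is even across the month and that `mtd_cost` includes the whole of `today`.

```python
import calendar
from datetime import date


def projected_month_end(mtd_cost, today):
    """Linear projection of month-end cost from month-to-date cost."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return mtd_cost / today.day * days_in_month


def exceeds_threshold(projected, budget=1000.0, threshold_pct=80):
    """True if the projection is above threshold_pct percent of the budget."""
    return projected > budget * threshold_pct / 100


# Example: 300 USD spent by 10 October (31-day month).
forecast = projected_month_end(300.0, date(2025, 10, 10))
print(forecast)                     # 930.0
print(exceeds_threshold(forecast))  # True: 930 is above the 800 USD line
```

AWS Budgets can also notify on forecasted spend by setting `NotificationType` to `FORECASTED`, which avoids maintaining this calculation yourself.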

Conclusion — People, process & platform

Cost spikes are frightening but manageable. The fastest path from alert -> remediation -> prevention is:

  1. Alerting & visibility (budgets, CE, tags)
  2. Fast containment (conservative throttles & snapshots)
  3. Root-cause fix (automation bug fixes, lifecycle policies)
  4. Preventive guardrails (SCPs, Config rules, policy-as-code)
  5. Cultural embed (FinOps: owners, cadences, sprint ops)

Use the scripts above as starting points and adapt them to your organization's processes and approval flows.

© CloudKnowledge · Single-file playbook for AWS cost optimization.
