Amazon ECS (Elastic Container Service) — Run Docker Containers Easily

A complete, field-tested guide to planning, deploying, operating, and troubleshooting production containers with Amazon ECS, AWS Fargate, Elastic Load Balancing, and Amazon VPC.

Figure: ECS service behind an ALB with tasks on Fargate, monitored by CloudWatch/X-Ray.

1) Introduction to Amazon ECS

Amazon Elastic Container Service (ECS) is AWS’s fully managed container orchestration platform that lets you run, stop, and manage Docker containers at any scale without building a control plane. With ECS you describe the application once—via a Task Definition—and the service takes care of placement, scaling, rolling updates, and lifecycle. You can choose between two launch types: EC2 (you manage instances) and Fargate (serverless). For most new workloads, Fargate minimizes operational overhead while maintaining performance and security isolation.

Tip: Link ECS with Amazon ECR (the managed image registry) to streamline secure image pulls, scanning, and lifecycle policies.

2) Container Orchestration Made Simple

ECS abstracts cluster complexity: you define tasks, group them into services, and optionally put them behind an Application Load Balancer (ALB). The scheduler handles placement across Availability Zones, health checks, and failure remediation.

Rolling updates built in, and blue/green deployments with CodeDeploy.
Target tracking and step scaling using CloudWatch metrics.
Integration with AWS Systems Manager for ops automation.

3) Native Docker Integration

ECS runs Docker-compatible images from ECR or any registry. Keep images lean using multi-stage builds and scan them regularly. Use immutable tags (e.g., Git SHA) so that overwriting a mutable tag such as latest cannot silently change, or roll back, what a service runs.

Best practice: Use docker buildx build --platform (linux/arm64 or linux/amd64) to build for the CPU architecture of your Fargate/EC2 capacity.

4) Launch Types

EC2: Full control of AMIs, agents, daemon services, and sidecars. Ideal for specialized hardware.
Fargate: Serverless isolation, per-task billing, simplest ops surface for most apps.

Choose Fargate for baseline; switch to EC2 when you need host-level control.

5) Fargate Integration

Fargate runs each task in its own isolated compute environment, patches the underlying infrastructure automatically, and scales on demand, so there are no instances or cluster capacity to manage.

6) Task Definitions

Blueprint for your containers: images, CPU/memory, env vars, secrets, awslogs, health checks, and volumes.

7) ECS Clusters & Services

A cluster groups compute capacity; a service maintains the desired count of tasks and handles rolling updates. With ALB target groups, health checks dictate replacement decisions—keep them realistic (timeout, interval, thresholds) to avoid flapping during cold starts.

99.9%+: Zonal resilience with multi-AZ tasks.
<1 min: Typical failover on task health failure.
Zero-downtime: Rolling/blue-green deploys.

8) Scaling & Load Balancing

Attach an Application Load Balancer for HTTP/HTTPS or a Network Load Balancer for TCP/UDP. Scale with target tracking on CloudWatch metrics such as average CPU (e.g., 60%) or request rate per target (ALBRequestCountPerTarget).

Scenario	ALB	NLB	Notes
HTTP APIs	✅	—	Use path/host-based routing, stickiness when stateful.
gRPC / HTTP2	✅	—	Enable HTTP/2, adjust idle timeout.
TCP microservices	—	✅	Ultra-low latency; fewer L7 features.
WebSockets	✅	—	Consider higher idle timeout and connection draining.
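The target tracking idea above can be sketched with a simple proportional model. This is only an approximation under the assumption that the metric scales inversely with task count; the actual Application Auto Scaling behavior (cooldowns, conservative scale-in) is more involved, and the function and parameter names here are hypothetical.

```python
import math


def target_tracking_desired_count(current_tasks: int, metric_value: float,
                                  target_value: float, min_tasks: int,
                                  max_tasks: int) -> int:
    """Estimate the task count that would bring the metric back to target.

    Assumes load is spread evenly, so the per-task metric is inversely
    proportional to the number of tasks (a simplification).
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    proposed = math.ceil(current_tasks * metric_value / target_value)
    # Clamp to the service's configured scaling bounds.
    return max(min_tasks, min(max_tasks, proposed))


# Four tasks at 90% average CPU with a 60% target.
print(target_tracking_desired_count(4, 90.0, 60.0, min_tasks=2, max_tasks=20))
```

The same arithmetic applies to request rate: substitute requests per target for CPU percent.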

9) Networking with Amazon VPC

awsvpc networking mode gives each task an ENI (own security group, IP). Required for Fargate.
Place tasks in private subnets; route egress via NAT or VPC endpoints for ECR/CloudWatch to reduce cost/egress risk.
Restrict ALB/NLB inbound traffic with security group CIDR rules, and add AWS WAF rules on the ALB.

Misrouting to public subnets with overly open SGs is a common audit finding—lock down ports and CIDRs.
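Because each awsvpc task consumes one IP address from its subnet, it helps to estimate capacity before launch. The sketch below assumes one ENI (one IP) per task and relies on the fact that AWS reserves five addresses in every subnet; the CIDRs and the count of addresses used by other resources are hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of assignable IPv4 addresses in a VPC subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def max_awsvpc_tasks(subnet_cidrs, ips_used_by_other_resources=0):
    """Rough ceiling on concurrently running awsvpc tasks (one IP per task)."""
    total = sum(usable_ips(cidr) for cidr in subnet_cidrs)
    return max(total - ips_used_by_other_resources, 0)


# Hypothetical private subnets in two Availability Zones.
subnets = ["10.0.1.0/24", "10.0.2.0/24"]
print(max_awsvpc_tasks(subnets, ips_used_by_other_resources=20))
```

Remember that rolling deployments briefly run old and new tasks side by side, so leave headroom above the steady-state count.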

10) Security & IAM Integration

ECS embraces least privilege with two roles: the task execution role (pull images, write logs, fetch secrets) and the task role (your app’s AWS API access). Rotate credentials, use short sessions, and prefer VPC Endpoints for Secrets Manager and SSM.

Inject sensitive values through the task definition's secrets field (SSM Parameter Store/Secrets Manager) instead of plain-text environment variables.
Enable ECR image scanning and block vulnerable images on deploy with CodePipeline/CodeBuild quality gates.
Use AWS WAF on ALB for Layer-7 protection and bot control.

11) Observability: Logs, Metrics, Traces

Ship STDOUT/STDERR via the awslogs driver to CloudWatch Logs. Emit application metrics to CloudWatch (or to Prometheus via a sidecar) and use AWS X-Ray for distributed tracing between microservices.

Set structured JSON logging to simplify log insights.
Emit RED/USE metrics (Rate, Errors, Duration / Utilization, Saturation, Errors).
Create AnomalyDetection alarms for sudden spikes in 5XX or latency.

12) CI/CD & DevOps

Use CodePipeline → CodeBuild → ECS blue/green (via CodeDeploy) for zero-downtime updates. For GitHub Actions or GitLab CI, use the aws-actions/amazon-ecr-login and ECS deploy actions to push images and update services automatically.

Build, test, scan image → push to ECR (immutable tag).
Register new task definition revision.
CodeDeploy shifts traffic from the Blue (current) target group to the Green (new) one after health validation.

13) Cost Optimization

Prefer Fargate for bursty loads; tune CPU/Memory to right-size tasks.
For steady 24×7, consider EC2 with Savings Plans or Spot for worker nodes.
Leverage Graviton (ARM64) images; many see 20–30% savings.
Use VPC endpoints for ECR/CloudWatch to reduce NAT egress charges.

Add cost allocation tags (service, env, owner) and build dashboards by team and environment.

14) ECS Anywhere & Hybrid

With ECS Anywhere, register on-prem or edge instances to the ECS control plane. Keep a unified deployment and operations model while complying with data residency or latency constraints.

15) Real-World Use Cases

Public APIs and web frontends behind ALB with path-based routing.
Event-driven workers consuming SQS/Kinesis streams.
ETL/batch jobs scheduled via EventBridge.
Internal microservices with NLB + service discovery.

16) Sample Task Definition (Fargate)

{
  "family": "api-blue",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/appTaskRole",
  "containerDefinitions": [{
    "name": "web",
    "image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/api:${GIT_SHA}",
    "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
    "essential": true,
    "linuxParameters": { "initProcessEnabled": true },
    "environment": [
      {"name":"ENV","value":"prod"},
      {"name":"LOG_LEVEL","value":"info"}
    ],
    "secrets": [
      {"name":"DB_PASSWORD", "valueFrom":"arn:aws:secretsmanager:ap-south-1:123456789012:secret:prod-db-password"}
    ],
    "healthCheck": {
      "command": ["CMD-SHELL","curl -fsS http://localhost:8080/health || exit 1"],
      "interval": 15, "timeout": 5, "retries": 3, "startPeriod": 20
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/api",
        "awslogs-region": "ap-south-1",
        "awslogs-stream-prefix": "web"
      }
    }
  }]
}
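Fargate only accepts certain CPU/memory pairs at the task level, so a quick pre-flight check can catch a bad combination before registration. The sketch below encodes the pairs as documented at the time of writing; treat the table as an assumption to verify against the current Fargate documentation. The function names are hypothetical.

```python
from typing import List

# CPU units -> (min MiB, max MiB, step MiB). 256 CPU units is handled
# separately because its allowed values are not evenly spaced.
_FARGATE_MEMORY_RANGES = {
    512: (1024, 4096, 1024),
    1024: (2048, 8192, 1024),
    2048: (4096, 16384, 1024),
    4096: (8192, 30720, 1024),
    8192: (16384, 61440, 4096),
    16384: (32768, 122880, 8192),
}


def valid_fargate_memory(cpu_units: int) -> List[int]:
    """Return the memory sizes (MiB) allowed for a task-level CPU value."""
    if cpu_units == 256:
        return [512, 1024, 2048]
    if cpu_units not in _FARGATE_MEMORY_RANGES:
        return []
    low, high, step = _FARGATE_MEMORY_RANGES[cpu_units]
    return list(range(low, high + 1, step))


def is_valid_fargate_size(task_definition: dict) -> bool:
    """Check the task-level "cpu" and "memory" strings of a task definition."""
    cpu = int(task_definition["cpu"])
    memory = int(task_definition["memory"])
    return memory in valid_fargate_memory(cpu)


# Same sizing as the sample task definition above.
sample = {"family": "api-blue", "cpu": "512", "memory": "1024"}
print(is_valid_fargate_size(sample))
```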

17) Troubleshooting: A Practical, Script-Backed Playbook

When production pages go red, follow this escalation path. All scripts use AWS Tools for PowerShell (Install-Module AWSPowerShell.NetCore).

Step 1 — Confirm Service & Task Health

# 1) Identify ECS services whose running count lags the desired count
$Region = "ap-south-1"
$Cluster = "prod-cluster"
Initialize-AWSDefaultConfiguration -Region $Region

# Get-ECSClusterService lists service ARNs (ListServices);
# Get-ECSService describes them (DescribeServices, at most 10 per call).
$serviceArns = Get-ECSClusterService -Cluster $Cluster
$serviceArns | ForEach-Object {
  $svc = (Get-ECSService -Cluster $Cluster -Service $_).Services[0]
  [PSCustomObject]@{
    Service = $svc.ServiceName
    RunningCount = $svc.RunningCount
    DesiredCount = $svc.DesiredCount
    PendingCount = $svc.PendingCount
    LastEvent = ($svc.Events | Select-Object -First 1).Message
  }
} | Sort-Object RunningCount | Format-Table -Auto

Step 2 — Inspect Failing Tasks & Last Status

# 2) List STOPPED tasks with reasons to find crash loops
# Task ARNs do not contain the service name, so filter with -ServiceName.
# ECS keeps stopped tasks visible only for a limited time after they stop.
$svc = "api-service"
$stoppedArns = @(Get-ECSTaskList -Cluster $Cluster -ServiceName $svc -DesiredStatus STOPPED |
  Select-Object -First 100)   # DescribeTasks accepts up to 100 tasks per call
if ($stoppedArns.Count -gt 0) {
  (Get-ECSTaskDetail -Cluster $Cluster -Task $stoppedArns).Tasks | ForEach-Object {
    [PSCustomObject]@{
      Task = $_.TaskArn.Split('/')[-1]
      StopCode = $_.StopCode
      StoppedReason = $_.StoppedReason
      ExitCode = $_.Containers[0].ExitCode
      LastStatus = $_.LastStatus
    }
  } | Format-Table -Auto
}

Step 3 — Pull Recent Logs from CloudWatch

# 3) Tail the last 200 events from the most recent log streams (awslogs)
# CloudWatch Logs cmdlets use the CWL prefix; the CW prefix is for CloudWatch metrics.
$group = "/ecs/api"
$streams = Get-CWLLogStream -LogGroupName $group | Sort-Object -Property LastEventTimestamp -Descending | Select-Object -First 3
foreach($s in $streams){
  Write-Host "=== Stream:" $s.LogStreamName "===" -ForegroundColor Cyan
  # The response wraps the log events in its Events property
  (Get-CWLLogEvent -LogGroupName $group -LogStreamName $s.LogStreamName -Limit 200).Events |
    Select-Object -ExpandProperty Message
}

Step 4 — Verify Target Group Health

# 4) Check target health for every target group and report those with problems
# Get-ELB2TargetHealth returns TargetHealthDescription objects directly.
Get-ELB2TargetGroup | ForEach-Object {
  $tg = $_
  # Counts every state other than healthy (unhealthy, initial, draining, ...)
  $unhealthy = @(Get-ELB2TargetHealth -TargetGroupArn $tg.TargetGroupArn |
    Where-Object { $_.TargetHealth.State.Value -ne "healthy" })
  if ($unhealthy.Count -gt 0) {
    [PSCustomObject]@{ TargetGroup = $tg.TargetGroupName; Unhealthy = $unhealthy.Count }
  }
} | Format-Table -Auto

Step 5 — DNS & Service Discovery Sanity

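# 5) Resolve Cloud Map (if used) and ALB DNS
# Private Cloud Map names resolve only through the VPC resolver, so run this
# from a host inside the VPC and do not point -Server at a public resolver.
Resolve-DnsName "api.internal.local" -Type A
Resolve-DnsName "my-alb-123.ap-south-1.elb.amazonaws.com" -Type A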

Step 6 — Rapid Redeploy / Rollback

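# 6) Force new deployment or rollback to previous task def
$service = "api-service"
# -ForceNewDeployment is a Boolean parameter, so pass $true explicitly
Update-ECSService -Cluster $Cluster -Service $service -ForceNewDeployment $true

# rollback: point the service at the prior task definition revision
$tdFamily = "api-blue"
# Sort numerically by revision; as plain text ":10" would sort before ":9"
$defs = @(Get-ECSTaskDefinitionList -FamilyPrefix $tdFamily | Sort-Object { [int](($_ -split ':')[-1]) })
$prev = $defs[-2]   # assumes the revision before the newest is the last known-good one
Update-ECSService -Cluster $Cluster -Service $service -TaskDefinition $prev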

Step 7 — Find Configuration Drift

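# 7) Compare the revision a running task uses with the family's latest revision (env vars)
$taskArn = Get-ECSTaskList -Cluster $Cluster -ServiceName $service -DesiredStatus RUNNING | Select-Object -First 1
$task = (Get-ECSTaskDetail -Cluster $Cluster -Task $taskArn).Tasks[0]
# DescribeTasks does not return environment variables, so read them from the task definitions
$running = (Get-ECSTaskDefinitionDetail -TaskDefinition $task.TaskDefinitionArn).TaskDefinition.ContainerDefinitions[0]
# Passing only the family name resolves to the latest ACTIVE revision
$template = (Get-ECSTaskDefinitionDetail -TaskDefinition $tdFamily).TaskDefinition.ContainerDefinitions[0]

Compare-Object ($running.Environment | % { $_.Name + '=' + $_.Value }) `
               ($template.Environment | % { $_.Name + '=' + $_.Value })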

CloudWatch Logs Insights Queries

# 5XX responses by load balancer and status code
# (assumes ALB access log fields are ingested into CloudWatch Logs; ALB delivers access logs to S3)
fields @timestamp, @message
| filter elb_status_code >= 500 or target_status_code >= 500
| stats count() as errors by elb, elb_status_code, target_status_code

# Application JSON logs: top errors
fields @timestamp, level, msg, path
| filter level in ['ERROR','WARN']
| stats count() as occurrences by msg, path
| sort occurrences desc

If tasks flap between PENDING and RUNNING, check: CPU/memory limits, image pull auth (ECR policy), and security group routes to dependencies (DB/Redis).

18) Networking & Security Troubleshooting Matrix

Symptom	Probable Cause	How to Fix
ALB 502/504	Wrong health check path/timeout or container not listening	Align container containerPort and app port; increase health timeout for cold starts.
Task stuck in PENDING	Subnets/SGs misconfigured; no free IPs (awsvpc); insufficient CPU/mem	Use more private IPs (bigger CIDR) or reduce task ENIs; right-size task.
Image pull fails	ECR access denied; no VPC endpoint; rate limited via NAT	Add ECR VPC endpoints; grant execution role permissions; pre-warm cache.
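Secrets not found	Execution role lacks secretsmanager:GetSecretValue (or ssm:GetParameters) for secrets injected via the task definition	Grant the execution role least-privilege access to the specific secret ARNs; the task role needs it only when the app reads secrets at runtime.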
Intermittent timeouts	NLB idle timeout, SG egress blocked, DB connections exhausted	Increase timeouts; expand DB pool; confirm SG egress to DB port.

19) Blue/Green Deployments with ALB

Use dual target groups (Blue & Green) and let CodeDeploy shift traffic after health checks and optional automated tests.

# appspec.json (snippet for ECS blue/green; hook values are Lambda function names, shown here with hypothetical names)
{
  "version": 0.0,
  "Resources": [
    {
      "TargetService": {
        "Type": "AWS::ECS::Service",
        "Properties": {
          "TaskDefinition": "api-blue:123",
          "LoadBalancerInfo": {
            "ContainerName": "web",
            "ContainerPort": 8080
          }
        }
      }
    }
  ],
  "Hooks": [
    { "BeforeInstall": "ValidateMigrationsFunction" },
    { "AfterAllowTraffic": "SmokeTestsFunction" }
  ]
}

20) Monitoring SLOs & Alarms (Production Checklist)

API SLO: 99.9% success, p95 latency < 300ms.
Alarms on: HTTPCode_ELB_5XX_Count, HTTPCode_Target_5XX_Count, TargetResponseTime, CPU > 80%, Mem > 85%.
Log retention policies (14–90 days) and export to S3 for long-term audit.
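To make the 99.9% SLO in the checklist concrete, the arithmetic below converts it into an error budget. It is a minimal sketch: the 30-day window and the request volume are assumptions, and Decimal is used only to keep the percentages exact. A 99.9% target over 30 days leaves 43.2 minutes of budget.

```python
from decimal import Decimal


def failure_fraction(slo_percent: str) -> Decimal:
    """Fraction of requests (or time) allowed to fail under the SLO."""
    return (Decimal(100) - Decimal(slo_percent)) / Decimal(100)


def error_budget_minutes(slo_percent: str, window_days: int = 30) -> Decimal:
    """Minutes of full outage the SLO tolerates within the window."""
    total_minutes = Decimal(window_days * 24 * 60)
    return total_minutes * failure_fraction(slo_percent)


def allowed_failed_requests(slo_percent: str, total_requests: int) -> int:
    """Failed requests the SLO tolerates for a given request volume."""
    return int(Decimal(total_requests) * failure_fraction(slo_percent))


print(error_budget_minutes("99.9"))                 # minutes per 30 days
print(allowed_failed_requests("99.9", 1_000_000))   # failures per million requests
```

Alarm thresholds can then be set on how fast this budget burns rather than on raw error counts.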

21) Migration Patterns to ECS

L2: Lift-and-shift VMs → containers (define Dockerfile, externalize config).
L3: Introduce ALB + service discovery; split monolith by bounded contexts.
L4: Adopt eventing (SQS/EventBridge), autoscaling, blue/green.

Keep backward compatibility during split; version APIs explicitly.

22) Security Hardening (Actionable)

Block :latest tags in production deploys.
Mandatory TLS 1.2+ on ALB listeners; HSTS headers at app layer.
Limit egress with SGs and NACLs; allowlist dependencies by port and CIDR.
Enable secret rotation for database/API keys.
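The first rule above (no :latest in production) can be enforced with a small check in the pipeline before a task definition is registered. This is a sketch with hypothetical function names; it treats an untagged image as latest, because that is the default Docker applies, and accepts digest-pinned references.

```python
def uses_latest_tag(image: str) -> bool:
    """True if the image reference resolves to the mutable 'latest' tag."""
    if "@sha256:" in image:
        return False  # digest-pinned references are immutable
    # Only inspect the last path segment so a registry port (host:5000/...)
    # is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # no tag means Docker defaults to "latest"
    return last_segment.rsplit(":", 1)[1] == "latest"


def containers_using_latest(task_definition: dict) -> list:
    """Names of containers whose image would resolve to 'latest'."""
    return [
        container["name"]
        for container in task_definition.get("containerDefinitions", [])
        if uses_latest_tag(container["image"])
    ]


# Hypothetical task definition: one pinned image, one untagged sidecar.
task_def = {
    "containerDefinitions": [
        {"name": "web", "image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/api:3f2a1c9"},
        {"name": "proxy", "image": "nginx"},
    ]
}
print(containers_using_latest(task_def))
```

A CI step could fail the build whenever the returned list is not empty.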

23) Performance Tuning

Right-size cpu and memory to minimize throttling and OOM kills.
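Define healthCheck in the container definition alongside ALB target group health checks; ECS does not act on a HEALTHCHECK instruction that exists only in the Dockerfile.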
Batch expensive I/O; cache with ElastiCache; reuse connections (HTTP keep-alive).

24) Frequently Asked Questions

Q: When should I prefer ECS over EKS?
A: If you want minimal control-plane operations and a native AWS experience with simpler constructs, ECS is faster to value. Choose EKS for Kubernetes-specific ecosystem or portability needs.

Q: Can I mix Fargate and EC2?
A: Yes, via separate services or capacity providers; this is common when some tasks need host features and others want serverless simplicity.

Q: Do I need a service mesh?
A: Often not. Start with ALB and X-Ray; add App Mesh only if you need retries, timeouts, traffic splitting at L7 among many services.

Figure: Blue/Green deployment flow over ALB. CodeDeploy switches the ALB target group after health validation.

25) Operational Runbooks (Copy-Paste)

Scale a Service for an Event Spike

$Cluster = "prod-cluster"; $Service = "checkout-service"
# Scale from 4 to 12 tasks temporarily
Update-ECSService -Cluster $Cluster -Service $Service -DesiredCount 12
# Later, revert
Update-ECSService -Cluster $Cluster -Service $Service -DesiredCount 4

Drain Tasks Safely for Maintenance (EC2 launch type)

# Put container instance into DRAINING state so tasks migrate off
# Get-ECSContainerInstanceList returns container instance ARNs
$instance = Get-ECSContainerInstanceList -Cluster $Cluster | Select-Object -First 1
Update-ECSContainerInstancesState -Cluster $Cluster -ContainerInstance $instance -Status DRAINING

Find Largest Images & Reclaim Space (EC2 hosts)

# Via SSM: list image sizes to find bloat; requires SSM Agent on hosts
Send-SSMCommand -InstanceId "i-0123456789abcdef0" -DocumentName "AWS-RunShellScript" -Parameter @{ commands = @(
  "docker images --format '{{.Repository}}:{{.Tag}} {{.Size}}' | sort -hr -k2 | head -n 20"
)}

Force Roll to New Revision After Config Change

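# Register a new revision, then point the service at the family name;
# a family without a revision resolves to the latest ACTIVE revision.
# (Fargate task definitions also need CPU, memory, network mode, and execution role settings.)
Register-ECSTaskDefinition -Family "api-blue" -ContainerDefinition $templateUpdated
Update-ECSService -Cluster $Cluster -Service "api-service" -TaskDefinition "api-blue"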

26) Health Check Tuning Cheatsheet

Workload	Path	Interval	Timeout	Healthy/Unhealthy
HTTP API (fast)	/health	10s	5s	2/2
Cold-starting app	/ready	15s	10s	3/2
Background worker	TCP:8125	20s	5s	2/2
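The settings in the cheatsheet translate into detection and recovery times. The estimate below is a rough upper bound, assuming checks run back to back at the configured interval and the final failing check takes the full timeout; real timings vary slightly. Function names are hypothetical.

```python
def seconds_to_unhealthy(interval_s: int, timeout_s: int, unhealthy_threshold: int) -> int:
    """Approximate worst-case time before a failing target is marked unhealthy."""
    return interval_s * unhealthy_threshold + timeout_s


def seconds_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Approximate time before a new target starts receiving traffic."""
    return interval_s * healthy_threshold


# (workload, interval, timeout, healthy threshold, unhealthy threshold)
rows = [
    ("HTTP API (fast)", 10, 5, 2, 2),
    ("Cold-starting app", 15, 10, 3, 2),
    ("Background worker", 20, 5, 2, 2),
]
for name, interval, timeout, healthy, unhealthy in rows:
    down = seconds_to_unhealthy(interval, timeout, unhealthy)
    up = seconds_to_healthy(interval, healthy)
    print(f"{name}: unhealthy after ~{down}s, healthy after ~{up}s")
```

Shorter intervals detect failures faster but make flapping more likely during slow starts, which is why the cold-starting row uses a longer timeout and a higher healthy threshold.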

27) Common Pitfalls (and How to Avoid Them)

Skipping task role permissions and overloading execution role → breaks principle of least privilege.
No backoff/retry policy in client → ALB spikes cause cascading failures.
Forgetting connection draining → terminate tasks while requests are in flight.
Hardcoding secrets → leaks via logs or images; always use secrets.
Under-provisioned awsvpc IPs → tasks stuck in PENDING.
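For the missing backoff/retry pitfall, one common remedy is capped exponential backoff with full jitter on the client side. The sketch below is one way to do it, not a prescribed implementation; the function names, the defaults, and the choice of ConnectionError as the retryable error are assumptions.

```python
import random
import time


def backoff_delays(max_retries: int, base_s: float = 0.1, cap_s: float = 5.0, rng=None):
    """Delays for each retry: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))  # "full jitter" spreads retries out
    return delays


def call_with_retries(operation, max_retries: int = 4, sleep=time.sleep, rng=None):
    """Run operation, retrying on ConnectionError with jittered backoff."""
    delays = backoff_delays(max_retries, rng=rng)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
```

Randomizing each delay keeps many clients from retrying in lockstep after a brief ALB or task disruption, which is what turns a short blip into a cascading failure.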

28) Governance & Compliance

Tag everything: Service, Env, Owner, CostCenter.
Config conformance: build Config rules to detect public SGs or missing logs.
Backup logs and artifacts to S3 with lifecycle and Object Lock where needed.

29) Reference Architecture (At a Glance)

Public ALB (443) → Private subnets with ECS tasks (awsvpc) → RDS/ElastiCache (no public IP).
VPC Endpoints for ECR (api + dkr), CloudWatch, SSM, Secrets Manager.
Multi-AZ placement, one task per AZ minimum.

30) Copy-Ready IAM Snippets

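# Execution role policy (minimal; if the task definition uses "secrets", also allow secretsmanager:GetSecretValue or ssm:GetParameters on those specific ARNs)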
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect":"Allow","Action":[
      "ecr:GetAuthorizationToken","ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer","ecr:BatchGetImage",
      "logs:CreateLogStream","logs:PutLogEvents"
    ],"Resource":"*"}
  ]
}

# Task role policy example: SSM + Secrets Manager (scope to specific ARNs)
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect":"Allow","Action":["ssm:GetParameter","ssm:GetParameters"],"Resource":"arn:aws:ssm:ap-south-1:123456789012:parameter/prod/*"},
    {"Effect":"Allow","Action":["secretsmanager:GetSecretValue"],"Resource":"arn:aws:secretsmanager:ap-south-1:123456789012:secret:prod/*"}
  ]
}

31) Blueprints for Different Workloads

Public API: Fargate, ALB, WAF, auto scaling on request rate.
Async worker: service polling SQS and scaled on queue depth, or tasks launched on an EventBridge schedule; no ALB; DLQ alarms.
Batch ETL: Fargate Spot, EventBridge cron, higher CPU for short runtime.

Figure: ECS VPC layout with public and private subnets. Place the ALB in public subnets; ECS tasks and databases in private subnets with VPC endpoints for control-plane traffic.

32) Key Terms (Quick Glossary)

Task Definition, Service, Cluster, Capacity Provider, Target Group, awsvpc, Execution Role, Task Role

33) Conclusion

Amazon ECS lets teams ship reliable, scalable services without the overhead of running a control plane. Start simple with Fargate, enforce least privilege through IAM roles, observe with CloudWatch/X-Ray, then iterate with blue/green deploys and cost tuning. With the playbooks and scripts above, you can triage the most common incidents quickly and restore service confidently.
