Amazon ECS (Elastic Container Service) — Run Docker Containers Easily
A complete, field-tested guide to planning, deploying, operating, and troubleshooting production containers with Amazon ECS, AWS Fargate, Elastic Load Balancing, and Amazon VPC.
1) Introduction to Amazon ECS
Amazon Elastic Container Service (ECS) is AWS’s fully managed container orchestration platform that lets you run, stop, and manage Docker containers at any scale without building a control plane. With ECS you describe the application once—via a Task Definition—and the service takes care of placement, scaling, rolling updates, and lifecycle. You can choose between two launch types: EC2 (you manage instances) and Fargate (serverless). For most new workloads, Fargate minimizes operational overhead while maintaining performance and security isolation.
2) Container Orchestration Made Simple
ECS abstracts cluster complexity: you define tasks, group them into services, and optionally put them behind an Application Load Balancer (ALB). The scheduler handles placement across Availability Zones, health checks, and failure remediation.
- Rolling and blue/green deployments with CodeDeploy.
- Target tracking and step scaling using CloudWatch metrics.
- Integration with AWS Systems Manager for ops automation.
3) Native Docker Integration
ECS runs Docker-compatible images from ECR or any registry. Keep images lean using multi-stage builds and scan them regularly. Use immutable tags (e.g., Git SHA) to prevent accidental rollbacks when pushing latest.
4) Launch Types
- EC2: Full control of AMIs, agents, daemonsets, and sidecars. Ideal for specialized hardware.
- Fargate: Serverless isolation, per-task billing, simplest ops surface for most apps.
5) Fargate Integration
Fargate isolates tasks at the runtime boundary, auto patches the infra, and scales on demand. No capacity management.
6) Task Definitions
Blueprint for your containers: images, CPU/memory, env vars, secrets, awslogs, health checks, and volumes.
7) ECS Clusters & Services
A cluster groups compute capacity; a service maintains the desired count of tasks and handles rolling updates. With ALB target groups, health checks dictate replacement decisions—keep them realistic (timeout, interval, thresholds) to avoid flapping during cold starts.
Zonal resilience with multi-AZ tasks
Typical failover on task health failure
Rolling/blue-green deploys
8) Scaling & Load Balancing
Attach an Application Load Balancer for HTTP/HTTPS and Network Load Balancer for TCP/UDP. Scale via CloudWatch target tracking (e.g., average CPU 60%) or request rate (ALB RequestCountPerTarget).
| Scenario | ALB | NLB | Notes |
|---|---|---|---|
| HTTP APIs | ✅ | — | Use path/host-based routing, stickiness when stateful. |
| gRPC / HTTP2 | ✅ | — | Enable HTTP/2, adjust idle timeout. |
| TCP microservices | — | ✅ | Ultra-low latency; fewer L7 features. |
| WebSockets | ✅ | — | Consider higher idle timeout and connection draining. |
9) Networking with Amazon VPC
- awsvpc networking mode gives each task an ENI (own security group, IP). Required for Fargate.
- Place tasks in private subnets; route egress via NAT or VPC endpoints for ECR/CloudWatch to reduce cost/egress risk.
- Restrict ALB/NLB inbound security groups by CIDR or WAF conditions.
10) Security & IAM Integration
ECS embraces least privilege with two roles: the task execution role (pull images, write logs, fetch secrets) and the task role (your app’s AWS API access). Rotate credentials, use short sessions, and prefer VPC Endpoints for Secrets Manager and SSM.
- Encrypt env variables via secrets (SSM/Secrets Manager) instead of plain text.
- Enable ECR image scanning and block vulnerable images on deploy with CodePipeline/CodeBuild quality gates.
- Use AWS WAF on ALB for Layer-7 protection and bot control.
11) Observability: Logs, Metrics, Traces
Ship STDOUT/STDERR via the awslogs driver to CloudWatch Logs. Emit application metrics to CloudWatch (or Prometheus w/Sidecar) and use AWS X-Ray for distributed tracing between microservices.
- Set structured JSON logging to simplify log insights.
- Emit RED/USE metrics (Rate, Errors, Duration / Utilization, Saturation, Errors).
- Create AnomalyDetection alarms for sudden spikes in 5XX or latency.
12) CI/CD & DevOps
Use CodePipeline → CodeBuild → ECS blue/green (via CodeDeploy) for zero-downtime updates. For GitHub Actions or GitLab CI, use the aws-actions/amazon-ecr-login and ECS deploy actions to push images and update services automatically.
- Build, test, scan image → push to ECR (immutable tag).
- Register new task definition revision.
- CodeDeploy switches target group from Green to Blue post health validation.
13) Cost Optimization
- Prefer Fargate for bursty loads; tune CPU/Memory to right-size tasks.
- For steady 24×7, consider EC2 with Savings Plans or Spot for worker nodes.
- Leverage Graviton (ARM64) images; many see 20–30% savings.
- Use VPC endpoints for ECR/CloudWatch to reduce NAT egress charges.
14) ECS Anywhere & Hybrid
With ECS Anywhere, register on-prem or edge instances to the ECS control plane. Keep a unified deployment and operations model while complying with data residency or latency constraints.
15) Real-World Use Cases
- Public APIs and web frontends behind ALB with path-based routing.
- Event-driven workers consuming SQS/Kinesis streams.
- ETL/batch jobs scheduled via EventBridge.
- Internal microservices with NLB + service discovery.
16) Sample Task Definition (Fargate)
{
"family": "api-blue",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/appTaskRole",
"containerDefinitions": [{
"name": "web",
"image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/api:%24%7BGIT_SHA%7D",
"portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
"essential": true,
"linuxParameters": { "initProcessEnabled": true },
"environment": [
{"name":"ENV","value":"prod"},
{"name":"LOG_LEVEL","value":"info"}
],
"secrets": [
{"name":"DB_PASSWORD", "valueFrom":"arn:aws:secretsmanager:ap-south-1:123456789012:secret:prod-db-password"}
],
"healthCheck": {
"command": ["CMD-SHELL","curl -fsS http://localhost:8080/health || exit 1"],
"interval": 15, "timeout": 5, "retries": 3, "startPeriod": 20
},
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/api",
"awslogs-region": "ap-south-1",
"awslogs-stream-prefix": "web"
}
}
}]
}
17) Troubleshooting: A Practical, Script-Backed Playbook
When production pages go red, follow this escalation path. All scripts use AWS Tools for PowerShell (Install-Module AWSPowerShell.NetCore).
Step 1 — Confirm Service & Task Health
# 1) Identify ECS services failing health
$Region = "ap-south-1"
$Cluster = "prod-cluster"
Initialize-AWSDefaultConfiguration -Region $Region
$services = Get-ECSService -Cluster $Cluster -MaxResult 100
$services | ForEach-Object {
$name = $_.ServiceName
$events = (Get-ECSService -Cluster $Cluster -Service $name).Events | Select-Object -First 10
[PSCustomObject]@{
Service = $name
RunningCount = $_.RunningCount
DesiredCount = $_.DesiredCount
PendingCount = $_.PendingCount
LastEvent = $events[0].Message
}
} | Sort-Object RunningCount | Format-Table -Auto
Step 2 — Inspect Failing Tasks & Last Status
# 2) List STOPPED tasks with reasons to find crash loops
$svc = "api-service"
$stopped = Get-ECSTask -Cluster $Cluster -DesiredStatus STOPPED -MaxResult 100 |
Where-Object { $_.TaskArn -match $svc }
$stopped | ForEach-Object {
$desc = (Get-ECSTaskDetail -Cluster $Cluster -Task $_.TaskArn).Tasks[0]
[PSCustomObject]@{
Task = $_.TaskArn.Split('/')[-1]
StopCode = $desc.StopCode
StoppedReason = $desc.StoppedReason
ExitCode = $desc.Containers[0].ExitCode
LastStatus = $desc.LastStatus
}
} | Format-Table -Auto
Step 3 — Pull Recent Logs from CloudWatch
# 3) Tail last 200 lines from container log streams (awslogs)
$group = "/ecs/api"
$streams = Get-CWLogStream -LogGroupName $group | Sort-Object -Property LastEventTimestamp -Descending | Select-Object -First 3
foreach($s in $streams){
Write-Host "=== Stream:" $s.LogStreamName "===" -ForegroundColor Cyan
Get-CWLogEvent -LogGroupName $group -LogStreamName $s.LogStreamName -Limit 200 |
Select-Object -ExpandProperty Message
}
Step 4 — Verify Target Group Health
# 4) Check ALB target health for the service's target group
$lbArns = (Get-ELB2LoadBalancer).LoadBalancers | Where-Object { $_.Type -eq "application" } | Select-Object -ExpandProperty LoadBalancerArn
$tgList = Get-ELB2TargetGroup
foreach($tg in $tgList){
$health = Get-ELB2TargetHealth -TargetGroupArn $tg.TargetGroupArn
if($health.TargetHealthDescriptions.TargetHealth.State -contains "unhealthy"){
[PSCustomObject]@{ TargetGroup = $tg.TargetGroupName; Unhealthy = ($health.TargetHealthDescriptions | Where-Object {$_.TargetHealth.State -ne "healthy"}).Count } |
Format-Table -Auto
}
}
Step 5 — DNS & Service Discovery Sanity
# 5) Resolve Cloud Map (if used) and ALB DNS Resolve-DnsName "api.internal.local" -Type A -Server 8.8.8.8 Resolve-DnsName "my-alb-123.ap-south-1.elb.amazonaws.com" -Type A
Step 6 — Rapid Redeploy / Rollback
# 6) Force new deployment or rollback to previous task def $service = "api-service" Update-ECSService -Cluster $Cluster -Service $service -ForceNewDeployment # rollback (register prior task def revision) $tdFamily = "api-blue" $defs = (Get-ECSTaskDefinitionList -FamilyPrefix $tdFamily).TaskDefinitionArns | Sort-Object $prev = $defs[-2] Update-ECSService -Cluster $Cluster -Service $service -TaskDefinition $prev
Step 7 — Find Configuration Drift
# 7) Compare running task vs template (ports, health, env)
$taskArn = (Get-ECSTask -Cluster $Cluster -ServiceName $service -DesiredStatus RUNNING -MaxResult 1).TaskArns[0]
$running = (Get-ECSTaskDetail -Cluster $Cluster -Task $taskArn).Tasks[0].Containers[0]
$template = (Get-ECSTaskDefinition -TaskDefinition "$tdFamily:latest").TaskDefinition.ContainerDefinitions[0]
Compare-Object ($running.Environment | % { $_.Name + '=' + $_.Value }) `
($template.Environment | % { $_.Name + '=' + $_.Value })
CloudWatch Logs Insights Queries
-- Slow requests > 1s by URL (ALB) fields @timestamp, @message | filter elb_status_code >= 500 or target_status_code >= 500 | stats count() as errors by elb, elb_status_code, target_status_code -- Application JSON logs: top errors fields @timestamp, level, msg, path | filter level in ['ERROR','WARN'] | stats count() by msg, path | sort by count() desc
18) Networking & Security Troubleshooting Matrix
| Symptom | Probable Cause | How to Fix |
|---|---|---|
| ALB 502/504 | Wrong health check path/timeout or container not listening | Align container containerPort and app port; increase health timeout for cold starts. |
| Task stuck in PENDING | Subnets/SGs misconfigured; no free IPs (awsvpc); insufficient CPU/mem | Use more private IPs (bigger CIDR) or reduce task ENIs; right-size task. |
| Image pull fails | ECR access denied; no VPC endpoint; rate limited via NAT | Add ECR VPC endpoints; grant execution role permissions; pre-warm cache. |
| Secrets not found | Task role lacks secretsmanager:GetSecretValue | Attach least-privilege policy to the task role, not just execution role. |
| Intermittent timeouts | NLB idle timeout, SG egress blocked, DB connections exhausted | Increase timeouts; expand DB pool; confirm SG egress to DB port. |
19) Blue/Green Deployments with ALB
Use dual target groups (Blue & Green) and let CodeDeploy shift traffic after health checks and optional automated tests.
# appspec.json (snippet for ECS blue/green)
{
"version": 1,
"Resources": [
{
"TargetService": {
"Type": "AWS::ECS::Service",
"Properties": {
"TaskDefinition": "api-blue:123",
"LoadBalancerInfo": {
"ContainerName": "web",
"ContainerPort": 8080
}
}
}
}
],
"Hooks": [
{ "BeforeInstall": "validate_migrations.sh" },
{ "AfterAllowTraffic": "smoke_tests.sh" }
]
}
20) Monitoring SLOs & Alarms (Production Checklist)
- API SLO: 99.9% success, p95 latency < 300ms.
- Alarms on: target 5XX, HTTPCode_Target_5XX_Count, TargetResponseTime, CPU > 80%, Mem > 85%.
- Log retention policies (14–90 days) and export to S3 for long-term audit.
21) Migration Patterns to ECS
- L2: Lift-and-shift VMs → containers (define Dockerfile, externalize config).
- L3: Introduce ALB + service discovery; split monolith by bounded contexts.
- L4: Adopt eventing (SQS/EventBridge), autoscaling, blue/green.
22) Security Hardening (Actionable)
- Block :latest tags in production deploys.
- Mandatory TLS 1.2+ on ALB listeners; HSTS headers at app layer.
- Limit egress with SGs and NACLs; allowlist dependencies by port and CIDR.
- Enable secret rotation for database/API keys.
23) Performance Tuning
- Right-size cpu and memory to minimize throttling and OOM kills.
- Use HEALTHCHECK in Dockerfile plus ECS health checks.
- Batch expensive I/O; cache with ElastiCache; reuse connections (HTTP keep-alive).
24) Frequently Asked Questions
Q: When should I prefer ECS over EKS?
A: If you want minimal control-plane operations and a native AWS experience with simpler constructs, ECS is faster to value. Choose EKS for Kubernetes-specific ecosystem or portability needs.
Q: Can I mix Fargate and EC2?
A: Yes, via separate services or capacity providers; this is common when some tasks need host features and others want serverless simplicity.
Q: Do I need a service mesh?
A: Often not. Start with ALB and X-Ray; add App Mesh only if you need retries, timeouts, traffic splitting at L7 among many services.
25) Operational Runbooks (Copy-Paste)
Scale a Service for an Event Spike
$Cluster = "prod-cluster"; $Service = "checkout-service" # Scale from 4 to 12 tasks temporarily Update-ECSService -Cluster $Cluster -Service $Service -DesiredCount 12 # Later, revert Update-ECSService -Cluster $Cluster -Service $Service -DesiredCount 4
Drain Tasks Safely for Maintenance (EC2 launch type)
# Put container instance into DRAINING state so tasks migrate off $instance = (Get-ECSContainerInstance -Cluster $Cluster)[0].ContainerInstanceArn Update-ECSContainerInstanceState -Cluster $Cluster -ContainerInstance $instance -Status DRAINING
Find Largest Images & Reclaim Space (EC2 hosts)
# Via SSM: list image sizes to find bloat; requires SSM Agent on hosts
Send-SSMCommand -InstanceId "i-0123456789abcdef0" -DocumentName "AWS-RunShellScript" -Parameter @{ commands = @(
"docker images --format '{{.Repository}}:{{.Tag}} {{.Size}}' | sort -hr -k2 | head -n 20"
)}
Force Roll to New Revision After Config Change
Register-ECSTaskDefinition -Family "api-blue" -ContainerDefinition $templateUpdated Update-ECSService -Cluster $Cluster -Service "api-service" -TaskDefinition "api-blue:$(Get-Date -Format HHmmss)" -ForceNewDeployment
26) Health Check Tuning Cheatsheet
| Workload | Path | Interval | Timeout | Healthy/Unhealthy |
|---|---|---|---|---|
| HTTP API (fast) | /health | 10s | 5s | 2/2 |
| Cold-starting app | /ready | 15s | 10s | 3/2 |
| Background worker | TCP:8125 | 20s | 5s | 2/2 |
27) Common Pitfalls (and How to Avoid Them)
- Skipping task role permissions and overloading execution role → breaks principle of least privilege.
- No backoff/retry policy in client → ALB spikes cause cascading failures.
- Forgetting connection draining → terminate tasks while requests are in flight.
- Hardcoding secrets → leaks via logs or images; always use secrets.
- Under-provisioned awsvpc IPs → tasks stuck in PENDING.
28) Governance & Compliance
- Tag everything: Service, Env, Owner, CostCenter.
- Config conformance: build Config rules to detect public SGs or missing logs.
- Backup logs and artifacts to S3 with lifecycle and Object Lock where needed.
29) Reference Architecture (At a Glance)
- Public ALB (443) → Private subnets with ECS tasks (awsvpc) → RDS/ElastiCache (no public IP).
- VPC Endpoints for ECR (api + dkr), CloudWatch, SSM, Secrets Manager.
- Multi-AZ placement, one task per AZ minimum.
30) Copy-Ready IAM Snippets
# Execution role policy (minimal)
{
"Version": "2012-10-17",
"Statement": [
{"Effect":"Allow","Action":[
"ecr:GetAuthorizationToken","ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer","ecr:BatchGetImage",
"logs:CreateLogStream","logs:PutLogEvents"
],"Resource":"*"}
]
}
# Task role policy example: SSM + Secrets Manager (scope to specific ARNs)
{
"Version": "2012-10-17",
"Statement": [
{"Effect":"Allow","Action":["ssm:GetParameter","ssm:GetParameters"],"Resource":"arn:aws:ssm:ap-south-1:123456789012:parameter/prod/*"},
{"Effect":"Allow","Action":["secretsmanager:GetSecretValue"],"Resource":"arn:aws:secretsmanager:ap-south-1:123456789012:secret:prod/*"}
]
}
31) Blueprints for Different Workloads
- Public API: Fargate, ALB, WAF, auto scaling on request rate.
- Async worker: SQS trigger via EventBridge schedule; no ALB; DLQ alarms.
- Batch ETL: Fargate Spot, EventBridge cron, higher CPU for short runtime.
32) Key Terms (Quick Glossary)
Task Definition Service Cluster Capacity Provider Target Group awsvpc Execution Role Task Role
33) Conclusion
Amazon ECS unblocks teams to ship reliable, scalable services without the overhead of running a control plane. Start simple with Fargate, enforce least privilege through IAM roles, observe with CloudWatch/X-Ray, then iterate with blue/green deploys and cost tuning. With the playbooks and scripts above, you can diagnose 80–90% of incidents within minutes and restore service confidently.
Keep exploring on CloudKnowledge: Amazon ECS, AWS Fargate, Elastic Load Balancing, Amazon VPC, CloudWatch.








注册
Thank you for your sharing. I am worried that I lack creative ideas. It is your article that makes me full of hope. Thank you. But, I have a question, can you help me?