Amazon EKS (Elastic Kubernetes Service): Managed Kubernetes with a Fully Managed Control Plane
Amazon EKS is a fully managed Kubernetes service that simplifies deploying, operating, and scaling containerized applications on AWS — and even on-premises with EKS Anywhere. This guide explains the managed control plane, security, networking, storage, observability, cost optimization, and hands-on troubleshooting playbooks using PowerShell and kubectl.
1) What Is Amazon EKS?
Amazon Elastic Kubernetes Service (EKS) is a managed platform for running upstream Kubernetes clusters. AWS operates the control plane — the kube-apiserver, etcd, the scheduler, and controller managers — across multiple Availability Zones, providing fault tolerance, patching, and version upgrades so your team can focus on deploying applications instead of racking masters or babysitting etcd.
2) Fully Managed Control Plane
The EKS control plane is multi-AZ by default. AWS handles high availability, patching, and scaling of the API server and etcd. This removes operational toil and reduces the blast radius of failures.
- Automatic health checks and failover across AZs.
- Automated patch-version updates with predictable upgrade paths.
- In-place control-plane minor-version upgrades that you initiate; you then schedule node upgrades.
3) Deep AWS Integrations
EKS integrates with IAM (including IRSA), VPC CNI, CloudWatch, ECR, Elastic Load Balancing (ALB/NLB), Route 53, and KMS. These integrations enable secure, observable, and production-ready clusters out of the box.
4) Standard, Portable Kubernetes
EKS runs upstream Kubernetes, so your manifests, controllers, and tools are portable. Moving workloads from another CNCF-compliant cluster usually needs no code changes.
5) Node Management Options
- EKS Managed Node Groups — AWS manages lifecycle: provisioning, health, rolling updates.
- Self-Managed Nodes (EC2) — maximum flexibility for AMIs, daemons, and custom bootstrap.
- AWS Fargate — serverless data plane: run pods without managing EC2 capacity.
Managed Node Groups
Great default for production clusters that need predictable updates and autoscaling.
Self-Managed EC2
Choose this when you need unusual kernels, GPUs, or tight control over systemd units and drivers.
Fargate
Ideal for spiky, small services where ops overhead must be minimal and pod-level isolation is desired.
6) High Availability & Reliability
Control plane spans multiple AZs. For nodes, create multiple node groups per AZ and use PodDisruptionBudgets and TopologySpreadConstraints to avoid single-AZ concentration.
7) Security & Compliance
- IAM for RBAC, plus IRSA for pod credentials.
- KMS encryption at rest; TLS in transit.
- Private clusters via AWS PrivateLink.
- Aligns with common compliance frameworks: HIPAA, SOC 2, ISO, PCI DSS.
8) EKS Anywhere (Hybrid & Edge)
EKS Anywhere extends the EKS operational model to on-premises or edge. Get consistent cluster lifecycle, tooling, and support for hybrid strategies.
9) Scaling: CA, HPA, & Karpenter
- Cluster Autoscaler — scales nodes based on unschedulable pods.
- Horizontal Pod Autoscaler — scales replicas by CPU/memory/custom metrics.
- Karpenter — open-source provisioner that launches just-in-time capacity across flexible instance types.
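As a sketch of the HPA piece, a v2 HorizontalPodAutoscaler that keeps a hypothetical `api` Deployment near 70% CPU utilization might look like this (names and bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api        # hypothetical workload
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Pair this with Karpenter or the Cluster Autoscaler so the node fleet grows when the HPA's new replicas have nowhere to schedule.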
10) Versioning & Upgrades
EKS provides managed control-plane upgrades. Plan quarterly cadences, validate apiVersion changes, and roll nodes using surge parameters on node groups. Canary business workloads first.
11) Networking & Load Balancing
- Amazon VPC CNI provides native VPC IPs for pods (ENIs per node).
- Ingress with the AWS Load Balancer Controller (formerly ALB Ingress Controller); NLB for TCP/UDP or high throughput.
- Service discovery with CoreDNS and AWS Cloud Map.
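A minimal Ingress sketch for the AWS Load Balancer Controller, assuming a hypothetical `web` Service listening on port 8080 (the annotations shown are real controller annotations; names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: web
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # route straight to pod IPs via the VPC CNI
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 8080
```

With `target-type: ip`, the ALB targets pod ENI addresses directly, skipping the NodePort hop.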
12) Cost Optimization
- EKS control plane fee is predictable; data-plane spend dominates. Right-size nodes and requests/limits.
- Use Spot Instances with interruption-tolerant workloads.
- Adopt Savings Plans or RIs for steady-state capacity.
- Bin-pack with Karpenter and consolidate small deployments.
| Area | Levers | Why it helps |
|---|---|---|
| Compute | Spot, Graviton, autoscaling | Lower $/vCPU and scale to demand |
| Storage | EBS GP3, lifecycle policies | Right tier and shrink idle volumes |
| Network | NLB for L4, ALB reuse | Reduce LB over-provisioning |
| Observability | Log sampling, retention | Cut CloudWatch ingest & retention cost |
13) Observability & Monitoring
Integrate with Amazon CloudWatch, Prometheus, and Grafana. Emit application metrics, collect logs via Fluent Bit, and trace critical flows with AWS X-Ray.
14) Logging & Auditing
Enable Kubernetes audit logs to CloudWatch Logs. Ship application and systemd logs from nodes. Use retention and subscription filters to route only what you need to analytics or SIEM.
15) Multi-Cluster & Multi-Region
Use the EKS Connector for visibility across clusters, including clusters running outside AWS. Standardize add-ons and policies with GitOps and templates for repeatability.
16) Service Mesh
EKS supports AWS App Mesh and Istio for traffic shaping, encryption in mesh, and rich telemetry.
17) CI/CD Pipelines
Pair EKS with AWS CodePipeline, GitHub Actions, or Jenkins. Bake immutable images in ECR, scan for CVEs, sign with cosign, and deploy via GitOps.
18) Storage: EBS, EFS, and FSx
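EKS provides persistent storage through CSI drivers: EBS for single-AZ block volumes, EFS for shared POSIX file systems, and FSx for high-performance workloads. As a sketch, a gp3 StorageClass for the EBS CSI driver (assuming the driver add-on is installed; the class name is illustrative) might look like:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer   # bind after scheduling so the volume lands in the pod's AZ
allowVolumeExpansion: true
```

`WaitForFirstConsumer` matters on multi-AZ clusters: EBS volumes are zonal, so binding before scheduling can strand a pod in the wrong AZ.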
19) Security Best Practices
- Adopt Pod Security Standards and network policies (Calico or Cilium).
- Use IRSA; avoid node-wide credentials.
- Enforce signed images and run vulnerability scans.
- Rotate secrets with AWS Secrets Manager or external secret operators.
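To complement the network-policy bullet above, a default-deny ingress policy per namespace is a common baseline (this is standard Kubernetes NetworkPolicy; it takes effect only when a policy engine such as Calico or Cilium is running):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: web
spec:
  podSelector: {}       # selects every pod in the namespace
  policyTypes:
    - Ingress           # no ingress rules listed, so all inbound traffic is denied
```

Workloads then opt in with narrower allow policies per service.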
20) Troubleshooting & Diagnostics: Practical Playbooks
Below are actionable steps and scripts you can run from your workstation or CI runner.
20.1 Cluster Health (kubectl)
# Context & API server reachability
kubectl cluster-info
kubectl get --raw='/readyz?verbose' | head -n 100
# Core components
kubectl -n kube-system get deploy,ds,po
# Node & condition health
kubectl get nodes -o wide
kubectl describe node <node-name>
# Pod scheduling issues
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
kubectl describe pod <pod> -n <ns>
# Logs
kubectl logs -n <ns> <pod> -c <container> --tail=200
20.2 AWS Tools for PowerShell — EKS & CloudWatch
Install the modules once: Install-Module AWS.Tools.EKS, AWS.Tools.CloudWatchLogs -Scope CurrentUser.
# Requires: AWS.Tools.Common, AWS.Tools.EKS, AWS.Tools.EC2, AWS.Tools.CloudWatchLogs
# Set your default region and credentials before running
Set-DefaultAWSRegion -Region ap-south-1
# 1) Inspect EKS clusters
Get-EKSClusterList
# Describe a cluster (status, version, endpoint, OIDC)
$clusterName = "prod-eks"
$cluster = Get-EKSCluster -Name $clusterName
$cluster | Select-Object Name, Status, Version, Endpoint, @{n="OIDC";e={$_.Identity.Oidc.Issuer}}
# 2) Managed node groups & health
Get-EKSNodegroupList -ClusterName $clusterName
Get-EKSNodegroup -ClusterName $clusterName -NodegroupName "prod-ng-1" |
    Select-Object NodegroupName, Status, ScalingConfig, UpdateConfig, CapacityType, Version
# 3) Fargate profiles
Get-EKSFargateProfileList -ClusterName $clusterName
# 4) Cluster and audit logs in CloudWatch
$logGroup = "/aws/eks/$clusterName/cluster"
Get-CWLLogStream -LogGroupName $logGroup | Sort-Object -Property LastEventTimestamp -Descending | Select-Object -First 5
# Tail the last 200 events from the most recent audit log stream (names contain 'kube-apiserver-audit')
$stream = (Get-CWLLogStream -LogGroupName $logGroup | Where-Object {$_.LogStreamName -like "*audit*"} | Sort-Object LastEventTimestamp -Descending | Select-Object -First 1).LogStreamName
(Get-CWLLogEvent -LogGroupName $logGroup -LogStreamName $stream -StartFromHead:$false).Events | Select-Object -Last 200
# 5) Search for throttling or RBAC denies in last hour
$pattern = "Too Many Requests|throttl|Forbidden|RBAC|denied"
$queryId = Start-CWLQuery -LogGroupName $logGroup -QueryString @"
fields @timestamp, @message
| filter @message like /$pattern/
| sort @timestamp desc
| limit 50
"@ -StartTime ([DateTimeOffset](Get-Date).AddHours(-1)).ToUnixTimeSeconds() -EndTime ([DateTimeOffset]::Now).ToUnixTimeSeconds()
# Logs Insights queries are asynchronous: poll until the query completes
do { Start-Sleep -Seconds 2; $result = Get-CWLQueryResult -QueryId $queryId } while ($result.Status -in "Scheduled","Running")
$result.Results | ForEach-Object { ($_ | Where-Object {$_.Field -eq "@message"}).Value }
20.3 PowerShell: Spot & Capacity Shortages (EC2 + Karpenter/CA)
# View recent Spot Fleet interruption history (requires your Spot Fleet request id, if using Spot Fleet)
Get-EC2SpotFleetRequestHistory -SpotFleetRequestId "sfr-your-fleet-id" -StartTime (Get-Date).AddHours(-6) -EventType instanceChange |
    Select-Object Timestamp, EventType, EventInformation
# Check Auto Scaling activity if using Managed Node Groups backed by an ASG (newest first)
Get-ASScalingActivity -AutoScalingGroupName "eks-prod-ng-asg" |
    Select-Object StartTime, StatusCode, StatusMessage | Select-Object -First 10
20.4 PowerShell: ECR Image Issues
# List image identifiers (tag/digest pairs) for a repo
Get-ECRImage -RepositoryName "orders-service" | Select-Object ImageTag, ImageDigest | Select-Object -First 20
# Check image scan findings if scanning is enabled
Get-ECRImageScanFinding -RepositoryName "orders-service" -ImageId_ImageTag "prod" |
    Select-Object -ExpandProperty ImageScanFindings | Select-Object -ExpandProperty Findings |
    Select-Object Name, Severity, Description | Format-Table -AutoSize
20.5 kubectl: DNS & Networking Quick Tests
# Launch an ephemeral busybox debug pod with an interactive shell
kubectl -n default run dnsutils --image=busybox:1.36 --rm -it --command -- sh
# Inside the pod:
nslookup kubernetes.default.svc.cluster.local
wget -S -O - http://<service-name>.<ns>.svc.cluster.local:8080
exit
20.6 PowerShell: Cluster Add-ons (CNI / CoreDNS / kube-proxy) Drift
# Read EKS recommended versions for core add-ons (requires AWS CLI v2 on PATH)
$addons = @("vpc-cni","coredns","kube-proxy")
$region = "ap-south-1"
foreach($a in $addons){
    $json = & aws eks describe-addon-versions --addon-name $a --region $region | ConvertFrom-Json
    # Lexical sort is an approximation; parse the version string if you need exact ordering
    $latest = $json.addons[0].addonVersions | Sort-Object -Property addonVersion -Descending | Select-Object -First 1
    "{0}: latest {1}" -f $a, $latest.addonVersion
}
Hands-On: Production-Ready Blueprint
- Create private subnets and dedicated security groups per node group; enable VPC flow logs for forensics.
- Provision EKS with the right version baseline; pin add-on versions and record SBOMs for base images.
- Choose capacity strategy: one on-demand MNG for baseline, one Spot MNG for burst, plus Fargate for jobs.
- Install the AWS Load Balancer Controller, external-dns, and a metrics stack (Prometheus/Grafana).
- Enforce policies with OPA Gatekeeper or Kyverno; implement Pod Security Standards.
- Adopt GitOps; require signed container images and enforce via admission policy.
- Define SLOs; add dashboards for error budgets, p95 latency, and saturation.
SRE Playbook: Common Failure Modes & Fixes
Unschedulable Pods
Check requests/limits mismatch; verify taints/tolerations and node selectors; CA/Karpenter events for capacity.
CrashLoopBackOff
Inspect container logs, readiness/liveness probes, secrets/configmaps, and image tags; compare with last good deploy.
DNS Problems
CoreDNS deployment health, node CIDR/iptables, VPC DHCP options; try a debug busybox pod.
Networking/Load Balancing
ALB/NLB target health; SGs and NACLs; pod IP exhaustion (ENI limits) — consider prefix delegation or larger nodes.
Auth/RBAC Denies
Use kubectl auth can-i; check AWS IAM mapping/IRSA policies; audit logs for Forbidden events.
Cost Spikes
Look for over-requested CPU/memory, idle DaemonSets, high log ingest, and excessive ALBs; apply rightsizing.
Compliance & Governance
- Centralize logging for audits; define retention per policy.
- Back up etcd (managed by AWS) and persistent volumes (Velero + EBS snapshots or EFS backups).
- Use AWS Organizations and SCPs to restrict regions and services.
Operations Runbook Snippets
Drain & Roll a Node Group (Zero-Downtime)
# Cordon and drain a single node
kubectl cordon ip-10-0-12-34.ec2.internal
kubectl drain ip-10-0-12-34.ec2.internal --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
# Uncordon after health checks
kubectl uncordon ip-10-0-12-34.ec2.internal
Validate IRSA Permissions
# Inside the pod: confirm the IRSA webhook injected the role and token
echo $AWS_ROLE_ARN
echo $AWS_WEB_IDENTITY_TOKEN_FILE   # usually /var/run/secrets/eks.amazonaws.com/serviceaccount/token
# Confirm the pod assumes the expected IAM role, then attempt a scoped AWS call (example: S3 list)
aws sts get-caller-identity
aws s3 ls s3://<bucket>
FAQ: Short Answers to Big Questions
Is EKS locked to AWS?
No. EKS runs upstream Kubernetes. Your manifests and controllers remain portable.
Can I run EKS on-prem?
Use EKS Anywhere for consistent tooling and support.
Do I need a service mesh?
Not always. Start with simple ingress and add mesh when you need traffic shaping, mTLS, or detailed telemetry.
How do I cut costs fast?
Right-size requests, enable Spot for burst, consolidate ALBs, limit log ingestion, and adopt Graviton nodes.
Copy-Paste Templates
PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
Topology Spread
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: web
spec:
  replicas: 6
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: { app: api }
IRSA Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: data
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eks-s3-readonly
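For the service account above to work, the IAM role must trust the cluster's OIDC provider. A sketch of the trust policy, with `EXAMPLE_OIDC_ID` as a placeholder for your cluster's OIDC issuer id (the account id and region match the example annotation):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLE_OIDC_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLE_OIDC_ID:sub": "system:serviceaccount:data:s3-reader"
        }
      }
    }
  ]
}
```

The `sub` condition pins the role to exactly one namespace/service-account pair, which is what makes IRSA safer than node-wide credentials.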
Executive Summary: Why Teams Choose EKS
- Managed Control Plane → Zero master ops.
- Scalable & Secure → Deep AWS integrations.
- Portable & Compatible → Upstream Kubernetes.
- Hybrid Flexibility → EKS Anywhere.
- Cost-Efficient → Pay for usage; optimize capacity.