Cloud Knowledge

Your Go-To Hub for Cloud Solutions & Insights

Amazon EKS explained: managed control plane, scaling, security, costs, and PowerShell troubleshooting playbooks.

Amazon EKS (Elastic Kubernetes Service): Managed Kubernetes with a Fully Managed Control Plane

Amazon EKS is a fully managed Kubernetes service that simplifies deploying, operating, and scaling containerized applications on AWS — and even on-premises with EKS Anywhere. This guide explains the managed control plane, security, networking, storage, observability, cost optimization, and hands-on troubleshooting playbooks using PowerShell and kubectl.

Amazon EKS High-Level Architecture (monochrome)
Amazon EKS high-level view — AWS runs the control plane; you choose the data-plane model (managed, self-managed, or Fargate).

1) What Is Amazon EKS?

Amazon Elastic Kubernetes Service (EKS) is a managed platform for running upstream Kubernetes clusters. AWS operates the control plane — the kube-apiserver, etcd, the scheduler, and controller managers — across multiple Availability Zones, providing fault tolerance, patching, and version upgrades so your team can focus on deploying applications instead of racking masters or babysitting etcd.

Key idea: You manage applications and (optionally) nodes. AWS manages the Kubernetes brain.

2) Fully Managed Control Plane

The EKS control plane is multi-AZ by default. AWS handles high availability, patching, and scaling of the API server and etcd. This removes operational toil and reduces the blast-radius of failures.

  • Automatic health checks and failover across AZs.
  • Automated minor version patching with predictable upgrade paths.
  • Seamless Kubernetes upgrades for the control plane; you schedule node upgrades.

Why it matters: No more masters to provision, snapshot, encrypt, or recover.

3) Deep AWS Integrations

EKS integrates with IAM (including IRSA), VPC CNI, CloudWatch, ECR, Elastic Load Balancing (ALB/NLB), Route 53, and KMS. These integrations enable secure, observable, and production-ready clusters out of the box.

4) Standard, Portable Kubernetes

EKS runs upstream Kubernetes, so your manifests, controllers, and tools are portable. Moving workloads from another CNCF-compliant cluster usually needs no code changes.

5) Node Management Options

  • EKS Managed Node Groups — AWS manages lifecycle: provisioning, health, rolling updates.
  • Self-Managed Nodes (EC2) — maximum flexibility for AMIs, daemons, and custom bootstrap.
  • AWS Fargate — serverless data plane: run pods without managing EC2 capacity.

Managed Node Groups

Great default for production clusters that need predictable updates and autoscaling.

Self-Managed EC2

Choose this when you need unusual kernels, GPUs, or tight control over systemd units and drivers.

Fargate

Ideal for spiky, small services where ops overhead must be minimal and pod-level isolation is desired.

6) High Availability & Reliability

Control plane spans multiple AZs. For nodes, create multiple node groups per AZ and use PodDisruptionBudgets and TopologySpreadConstraints to avoid single-AZ concentration.

7) Security & Compliance

Best practice: Use fine-grained IAM policies per service account and keep node roles minimal.

8) EKS Anywhere (Hybrid & Edge)

EKS Anywhere extends the EKS operational model to on-premises or edge. Get consistent cluster lifecycle, tooling, and support for hybrid strategies.

9) Scaling: CA, HPA, & Karpenter

  • Cluster Autoscaler — scales nodes based on unschedulable pods.
  • Horizontal Pod Autoscaler — scales replicas by CPU/memory/custom metrics.
  • Karpenter — open-source provisioner that launches just-in-time capacity across flexible instance types.
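The HPA in the list above can be expressed as a short manifest. This is a minimal sketch assuming the Kubernetes Metrics Server add-on is installed and a Deployment named `web` exists (both names are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
  namespace: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```

Utilization is measured against container CPU requests, so the HPA only behaves predictably when requests are set on every pod.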

10) Versioning & Upgrades

EKS provides managed control-plane upgrades. Plan quarterly cadences, validate apiVersion changes, and roll nodes using surge parameters on node groups. Canary business workloads first.

Checklist: Read release notes, update aws-vpc-cni, kube-proxy, coredns, and test upgrade on non-prod.

11) Networking & Load Balancing

The Amazon VPC CNI assigns pods routable VPC IP addresses, so standard security groups, flow logs, and VPC routing apply to pod traffic. For ingress, the AWS Load Balancer Controller provisions ALBs for L7 (HTTP/HTTPS) routing and NLBs for L4 services. Watch for pod IP exhaustion on smaller instance types, where ENI limits cap pod density; prefix delegation raises the per-node IP ceiling.

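As a concrete sketch of ALB-backed ingress, the following manifest assumes the AWS Load Balancer Controller is installed in the cluster; the namespace, Ingress name, and the `web` Service it targets are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: web
  annotations:
    # Provision an internet-facing ALB; use "internal" for private-only traffic
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Register pod IPs directly as targets (works with Fargate too)
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
```

Reusing one ALB across multiple Ingress resources (via the controller's group annotations) is a common way to avoid per-service load balancer sprawl.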
12) Cost Optimization

  • EKS control plane fee is predictable; data-plane spend dominates. Right-size nodes and requests/limits.
  • Use Spot Instances with interruption-tolerant workloads.
  • Adopt Savings Plans or RIs for steady-state capacity.
  • Bin-pack with Karpenter and consolidate small deployments.

  Area           Levers                         Why it helps
  Compute        Spot, Graviton, autoscaling    Lower $/vCPU and scale to demand
  Storage        EBS gp3, lifecycle policies    Right tier and shrink idle volumes
  Network        NLB for L4, ALB reuse          Reduce LB over-provisioning
  Observability  Log sampling, retention        Cut CloudWatch ingest & retention cost
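Right-sizing starts with explicit requests and limits on every container, since requests drive scheduling and bin-packing. A sketch of the relevant container-spec fragment (the name, image, and values are illustrative and should come from observed usage):

```yaml
# Fragment of a Deployment's pod template
containers:
- name: api
  image: <account>.dkr.ecr.<region>.amazonaws.com/api:v1   # illustrative
  resources:
    requests:
      cpu: 250m        # the scheduler reserves this; basis for bin-packing
      memory: 256Mi
    limits:
      memory: 512Mi    # cap memory; leaving the CPU limit unset avoids throttling
```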

13) Observability & Monitoring

Integrate with Amazon CloudWatch, Prometheus, and Grafana. Emit application metrics, collect logs via Fluent Bit, and trace critical flows with AWS X-Ray.

14) Logging & Auditing

Enable Kubernetes audit logs to CloudWatch Logs. Ship application and systemd logs from nodes. Use retention and subscription filters to route only what you need to analytics or SIEM.

15) Multi-Cluster & Multi-Region

Use EKS Connector for visibility across clusters (including some external). Standardize add-ons and policies with GitOps and templates for repeatability.

16) Service Mesh

EKS supports AWS App Mesh and Istio for traffic shaping, encryption in mesh, and rich telemetry.

17) CI/CD Pipelines

Pair EKS with AWS CodePipeline, GitHub Actions, or Jenkins. Bake immutable images in ECR, scan for CVEs, sign with cosign, and deploy via GitOps.

docker build → trivy fs (CVE scan) → kubectl apply (or helm upgrade --install)

18) Storage: EBS, EFS, and FSx

  • EBS via CSI for stateful single-AZ volumes.
  • EFS for shared POSIX storage across AZs.
  • FSx for high-performance filesystems (e.g., Lustre).
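For the EBS path above, a gp3-backed StorageClass for the EBS CSI driver can be sketched as follows (the class name and encryption setting are illustrative choices; `ebs.csi.aws.com` is the driver's provisioner name):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3            # gp3 decouples IOPS/throughput from volume size
  encrypted: "true"    # encrypt at rest with the account's default KMS key
volumeBindingMode: WaitForFirstConsumer   # delay binding until the pod's AZ is known
reclaimPolicy: Delete
```

`WaitForFirstConsumer` matters on EKS: EBS volumes are zonal, so binding must wait until the scheduler has picked an AZ for the pod.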

19) Security Best Practices

  • Adopt Pod Security Standards and network policies (Calico or Cilium).
  • Use IRSA; avoid node-wide credentials.
  • Enforce signed images and run vulnerability scans.
  • Rotate secrets with AWS Secrets Manager or external secret operators.
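A common baseline for the network-policy item above is a per-namespace default-deny ingress rule, then explicit allows per workload. A minimal sketch (the `web` namespace is illustrative, and enforcement requires a policy-capable CNI such as Calico or Cilium):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: web
spec:
  podSelector: {}      # empty selector = every pod in the namespace
  policyTypes:
  - Ingress            # no ingress rules listed, so all inbound traffic is denied
```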

20) Troubleshooting & Diagnostics: Practical Playbooks

Below are actionable steps and scripts you can run from your workstation or CI runner.

20.1 Cluster Health (kubectl)

# Context & API server reachability
kubectl cluster-info
kubectl get --raw='/readyz?verbose' | head -n 100

# Core components
kubectl -n kube-system get deploy,ds,po

# Node & condition health
kubectl get nodes -o wide
kubectl describe node <node-name>

# Pod scheduling issues
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
kubectl describe pod <pod> -n <ns>

# Logs
kubectl logs -n <ns> <pod> -c <container> --tail=200

20.2 AWS Tools for PowerShell — EKS & CloudWatch

Install once: Install-Module AWS.Tools.EKS, AWS.Tools.CloudWatchLogs -Scope CurrentUser (plus AWS.Tools.EC2, AWS.Tools.AutoScaling, and AWS.Tools.ECR for the later snippets).

# Requires: AWS.Tools.Common, AWS.Tools.EKS, AWS.Tools.EC2, AWS.Tools.CloudWatchLogs
# Set your default region and credentials before running
Set-DefaultAWSRegion -Region ap-south-1

# 1) Inspect EKS clusters (returns cluster names as strings)
Get-EKSClusterList

# Describe a cluster (status, version, endpoint, OIDC)
$clusterName = "prod-eks"
$cluster = Get-EKSCluster -Name $clusterName   # returns the cluster object directly
$cluster | Select-Object Name, Status, Version, Endpoint, @{n="OIDC";e={$_.Identity.Oidc.Issuer}}

# 2) Managed node groups & health
Get-EKSNodegroupList -ClusterName $clusterName   # returns node group names
Get-EKSNodegroup -ClusterName $clusterName -NodegroupName "prod-ng-1" |
  Select-Object NodegroupName, Status, ScalingConfig, UpdateConfig, CapacityType, Version

# 3) Fargate profiles
Get-EKSFargateProfileList -ClusterName $clusterName   # returns Fargate profile names

# 4) Cluster and audit logs in CloudWatch
$logGroup = "/aws/eks/$clusterName/cluster"
Get-CWLLogStream -LogGroupName $logGroup | Sort-Object -Property LastEventTimestamp -Descending | Select-Object -First 5

# Tail the last 200 events from the newest audit log stream (names contain 'kube-apiserver-audit')
$stream = (Get-CWLLogStream -LogGroupName $logGroup |
  Where-Object { $_.LogStreamName -like "*audit*" } |
  Sort-Object LastEventTimestamp -Descending | Select-Object -First 1).LogStreamName
(Get-CWLLogEvent -LogGroupName $logGroup -LogStreamName $stream).Events | Select-Object -Last 200

# 5) Search for throttling or RBAC denies in the last hour (CloudWatch Logs Insights)
$pattern = "Too Many Requests|throttl|Forbidden|RBAC|denied"
$query = @"
fields @timestamp, @message
| filter @message like /$pattern/
| sort @timestamp desc
| limit 50
"@
# Start-CWLQuery takes epoch seconds; poll Get-CWLQueryResult until the query completes
$queryId = Start-CWLQuery -LogGroupName $logGroup -QueryString $query `
  -StartTime ([DateTimeOffset]::UtcNow.AddHours(-1).ToUnixTimeSeconds()) `
  -EndTime   ([DateTimeOffset]::UtcNow.ToUnixTimeSeconds())
do {
  Start-Sleep -Seconds 2
  $result = Get-CWLQueryResult -QueryId $queryId
} while ($result.Status -in @("Running","Scheduled"))
$result.Results | ForEach-Object { ($_ | Where-Object { $_.Field -eq "@message" }).Value }

20.3 PowerShell: Spot & Capacity Shortages (EC2 + Karpenter/CA)

# View recent Spot Fleet history (only applies if you use Spot Fleet; requires the fleet request ID)
Get-EC2SpotFleetRequestHistory -SpotFleetRequestId "sfr-xxxxxxxx" `
  -StartTime (Get-Date).AddHours(-6) -EventType instanceChange |
  Select-Object Timestamp, EventType, EventInformation

# Check scaling activity for the node group's ASG (Managed Node Groups are backed by an ASG)
Get-ASScalingActivity -AutoScalingGroupName "eks-prod-ng-asg" |
  Sort-Object StartTime -Descending | Select-Object StartTime, StatusCode, StatusMessage -First 10

20.4 PowerShell: ECR Image Issues

# List image tags for a repo (ListImages returns image identifiers)
Get-ECRImage -RepositoryName "orders-service" |
  Select-Object ImageTag, ImageDigest -First 20

# Check image scan findings if scanning is enabled
(Get-ECRImageScanFinding -RepositoryName "orders-service" -ImageId_ImageTag "prod").ImageScanFindings.Findings |
  Select-Object Name, Severity, Description | Format-Table -AutoSize

20.5 kubectl: DNS & Networking Quick Tests

# Exec into a debug pod with busybox
kubectl -n default run dnsutils --image=busybox:1.36 --rm -it --command -- sh
nslookup kubernetes.default.svc.cluster.local
wget -S -O - http://<service-name>.<ns>.svc.cluster.local:8080
exit

20.6 PowerShell: Cluster Add-ons (CNI / CoreDNS / kube-proxy) Drift

# Read EKS recommended versions for core add-ons (requires AWS CLI v2 in PATH)
$addons = @("vpc-cni","coredns","kube-proxy")
$region = "ap-south-1"
foreach ($a in $addons) {
  $json   = aws eks describe-addon-versions --addon-name $a --region $region | ConvertFrom-Json
  $latest = $json.addons[0].addonVersions | Sort-Object -Property addonVersion -Descending | Select-Object -First 1
  "{0}: latest {1}" -f $a, $latest.addonVersion   # note: string sort, not strict semver
}

Note on the Microsoft Graph API: EKS does not use Microsoft Graph. Identity integration relies on AWS IAM (including IRSA) and OIDC providers. If your platform team also manages Entra ID for SSO to developer tools, use Microsoft Graph for that surface area; EKS itself is troubleshot via kubectl, the AWS APIs/CLI, and AWS Tools for PowerShell.

Hands-On: Production-Ready Blueprint

  1. Create private subnets and dedicated security groups per node group; enable VPC flow logs for forensics.
  2. Provision EKS with the right version baseline; pin add-on versions and record SBOMs for base images.
  3. Choose capacity strategy: one on-demand MNG for baseline, one Spot MNG for burst, plus Fargate for jobs.
  4. Install the AWS Load Balancer Controller (formerly ALB Ingress Controller), external-dns, and a metrics stack (Prometheus/Grafana).
  5. Enforce policies with OPA Gatekeeper or Kyverno; implement Pod Security Standards.
  6. Adopt GitOps; require signed container images and enforce via admission policy.
  7. Define SLOs; add dashboards for error budgets, p95 latency, and saturation.

SRE Playbook: Common Failure Modes & Fixes

Unschedulable Pods

Check requests/limits mismatch; verify taints/tolerations and node selectors; CA/Karpenter events for capacity.

CrashLoopBackOff

Inspect container logs, readiness/liveness probes, secrets/configmaps, and image tags; compare with last good deploy.

DNS Problems

CoreDNS deployment health, node CIDR/iptables, VPC DHCP options; try a debug busybox pod.

Networking/Load Balancing

ALB/NLB target health; SGs and NACLs; pod IP exhaustion (ENI limits) — consider prefix delegation or larger nodes.

Auth/RBAC Denies

Use kubectl auth can-i; check AWS IAM mapping/IRSA policies; audit logs for Forbidden events.

Cost Spikes

Look for over-requested CPU/memory, idle DaemonSets, high log ingest, and excessive ALBs; apply rightsizing.

Compliance & Governance

  • Centralize logging for audits; define retention per policy.
  • Back up etcd (managed by AWS) and persistent volumes (Velero + EBS snapshots or EFS backups).
  • Use AWS Organizations and SCPs to restrict regions and services.

Operations Runbook Snippets

Drain & Roll a Node Group (Zero-Downtime)

# Cordon and drain a single node
kubectl cordon ip-10-0-12-34.ec2.internal
kubectl drain ip-10-0-12-34.ec2.internal --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
# Uncordon after health checks
kubectl uncordon ip-10-0-12-34.ec2.internal

Validate IRSA Permissions

# Confirm the projected web identity token is mounted (run inside the pod)
TOKEN=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
test -s "$TOKEN" && echo "token present"

# Attempt an AWS call with the pod's role (example: S3 list)
aws sts get-caller-identity
aws s3 ls s3://<bucket>

FAQ: Short Answers to Big Questions

Is EKS locked to AWS?

No. EKS runs upstream Kubernetes. Your manifests and controllers remain portable.

Can I run EKS on-prem?

Use EKS Anywhere for consistent tooling and support.

Do I need a service mesh?

Not always. Start with simple ingress and add mesh when you need traffic shaping, mTLS, or detailed telemetry.

How do I cut costs fast?

Right-size requests, enable Spot for burst, consolidate ALBs, limit log ingestion, and adopt Graviton nodes.

Copy-Paste Templates

PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web

Topology Spread

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: web
spec:
  replicas: 6
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app: api }

IRSA Service Account

apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: data
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eks-s3-readonly

Executive Summary: Why Teams Choose EKS

  • Managed Control Plane → Zero master ops.
  • Scalable & Secure → Deep AWS integrations.
  • Portable & Compatible → Upstream Kubernetes.
  • Hybrid Flexibility → EKS Anywhere.
  • Cost-Efficient → Pay for usage; optimize capacity.
Kubernetes lifecycle pipeline (monochrome)
A simple monochrome pipeline: Build → Scan → Push → GitOps → Deploy → Observe.
