Cloud Knowledge

Your Go-To Hub for Cloud Solutions & Insights

Compute Services: Preemptible VMs – Low-cost, short-lived VMs for fault-tolerant workloads

A practical, in-depth guide to choosing, architecting, automating and troubleshooting Preemptible / Spot VMs across cloud providers. Includes scripts, diagnostics, and Kubernetes patterns.

Updated technical guide — includes PowerShell examples, CLI/REST calls, monitoring tips, and resilient architecture patterns.

Introduction

Preemptible virtual machines (VMs) are an attractive option for organizations that need large amounts of compute at a fraction of the cost. These instances are low-cost and short-lived, intended for workloads that can tolerate interruptions and for architectures built to be resilient. They are offered natively by cloud providers under different names — for example Preemptible VMs on Google Cloud, Spot Instances on AWS, and Spot VMs on Azure — and are ideal for batch jobs, large-scale training, CI/CD, and more.

Key promise: Deep cost savings (often 70–90%) in exchange for the potential to be preempted (terminated) when cloud capacity is reclaimed.

Definition & core properties

  • Definition: Preemptible or spot VMs are discounted compute instances that can be terminated by the cloud provider with a short notice when resources are needed elsewhere.
  • Cost advantage: Savings of 70–90% versus on-demand VMs are common.
  • Lifecycle limit: Google Cloud preemptible VMs can run up to 24 hours; Azure and AWS have no hard 24-hour limit but can be interrupted anytime.
  • Interruption notice: Providers typically send a short warning (e.g., Google Cloud's ~30-second preemption notice) before termination.
  • No SLA: These instances are best-effort and do not receive the standard uptime SLA.

Key terms: Preemptible VMs, Spot Instances, Spot VMs, Compute Engine, GKE.

Cost & pricing model

Preemptible/spot pricing is significantly lower than on-demand pricing — a fixed discount on some providers, slowly varying with supply and demand on others. Cloud vendors use this market to sell spare capacity. Important pricing characteristics:

  • Predictable discount: Unlike ephemeral bidding markets of the past, many providers now offer fixed deep-discount pricing for preemptible instances.
  • No sustained discounts: Sustained-use or reserved discounts usually do not apply to preemptible instances because they already have deep discounts.
  • Billing granularity: Billing is typically per-second or per-minute depending on the cloud provider.
  • Budget control: Use quotas, budgets, and alerts to avoid cost surprises when scaling.
Keypoint — Finance: Preemptible VMs are ideal for predictable, repeatable workloads that can be restarted; they are not suitable where guaranteed uptime is required.
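
As a worked example of the trade-off, the arithmetic below compares a hypothetical on-demand rate with a discounted preemptible rate, including an allowance for re-run work after preemptions (all rates are illustrative, not quoted from any provider):

```python
# All rates are illustrative, not real provider prices.
ON_DEMAND_HOURLY = 0.19       # $/hour, hypothetical on-demand rate
PREEMPTIBLE_HOURLY = 0.04     # $/hour, hypothetical ~79% discount

hours_needed = 1000           # compute-hours the job requires
rerun_overhead = 1.15         # assume 15% extra hours redoing preempted work

on_demand_cost = hours_needed * ON_DEMAND_HOURLY
preemptible_cost = hours_needed * rerun_overhead * PREEMPTIBLE_HOURLY
savings = 1 - preemptible_cost / on_demand_cost

print(f"on-demand:   ${on_demand_cost:.2f}")
print(f"preemptible: ${preemptible_cost:.2f}")
print(f"net savings: {savings:.0%}")
```

Even with a 15% re-run overhead the discount dominates; the overhead factor is the number worth measuring for your own workload.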

Primary use cases

Typical workloads that benefit the most:

  • Batch processing: Large ETL jobs, video transcoding, genomic pipelines.
  • Machine learning training: Distributed training that supports checkpointing and resume.
  • CI/CD pipelines: Parallel test runners and build agents.
  • High-throughput compute: Simulation runs, rendering, analytics.
  • Stateless microservices: When used in conjunction with resilient load balancing and autoscaling.
Keypoint — Workload fit: If the workload is horizontally scalable and can tolerate instance loss without human intervention, it’s a good candidate.

Cloud providers & equivalent offerings

Common provider names and short notes:

  • Google Cloud: Preemptible VMs — up to 24-hour runtime, 30-second preemption notice.
  • AWS: Spot Instances — very flexible, multiple instance pools; the old bidding model has largely been replaced by a capacity-based pricing model with gradual price changes.
  • Azure: Spot VMs — configurable eviction policy and eviction actions, integration with Scale Sets.
Keypoint — Multi-cloud strategy: Mixing providers can improve capacity availability and reduce correlated preemptions during peak periods.

Lifecycle & interruption behavior

Understanding lifetime and preemption is essential for designing resilient apps:

  • Maximum lifetime: Google preemptible VMs may be terminated after 24 hours; other providers may not impose a strict maximum but will preempt when capacity is needed.
  • Preemption window: Providers usually send a short notice (e.g., ~30 seconds on Google) allowing scripts to checkpoint or flush state.
  • Recreation over restart: Depending on the provider and eviction policy, a preempted instance may be deleted outright or merely stopped; treat it as disposable and design to recreate rather than restart.
  • Eviction policies: Some providers provide eviction policies and allow you to set eviction action (e.g., deallocate vs delete).
FAQ: Can a preempted VM be restarted? — Usually not: with a Delete eviction policy the instance is gone and must be recreated, while a stopped/deallocated instance can sometimes be restarted if capacity returns. Either way, design the system to re-provision quickly.

Storage patterns: ephemeral vs persistent

Because preemptible VMs are ephemeral, follow these guidelines:

  • Avoid relying on local ephemeral disks for any critical state.
  • Persist state to durable storage solutions: cloud persistent disks, cloud object storage (e.g., GCS, S3, Azure Blob), external databases, or distributed filesystems.
  • Checkpoint frequently: For long-running tasks, checkpoint to persistent storage on a schedule or on preemption notice if possible.
Keypoint — Data safety: Use persistent disks or cloud object stores for any output you need to keep after preemption.

Resilient architecture patterns

Designing for preemptible instances requires thinking in terms of fault tolerance and ephemeral compute:

1. Hybrid pools (preemptible + regular)

Maintain a small pool of regular (non-preemptible) instances and scale preemptible workers for cost savings. Keep critical services on regular VMs while using spot/preemptible VMs for scale-out tasks.

  • Use preemptible workers for parallel tasks and fall back to standard VMs if preemptible capacity is unavailable.

2. Queue-driven worker architecture

Use message queues (e.g., Cloud Pub/Sub, SQS, Azure Service Bus) or job schedulers. Workers poll the queue and process jobs; failed/abandoned tasks are returned to the queue.
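
A minimal in-process sketch of this pattern, using Python's standard-library queue as a stand-in for Pub/Sub, SQS, or Service Bus (the job shape is made up for illustration):

```python
import queue

def process(job):
    # Idempotent work: producing the same result twice must be harmless.
    return job["id"], job["x"] * 2

def worker(jobs, results):
    """Drain the queue; failed jobs go back for another worker to retry."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        try:
            job_id, value = process(job)
            results[job_id] = value       # commit the result
        except Exception:
            jobs.put(job)                 # return the failed job to the queue

jobs = queue.Queue()
for i in range(5):
    jobs.put({"id": i, "x": i})

results = {}
worker(jobs, results)
print(results)  # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}
```

With a real queue service the "return to queue" step is implicit: an unacknowledged message reappears after its visibility timeout, which is exactly what happens when a preempted worker never acks.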

3. Checkpoint & resume

Design tasks to flush intermediate results to persistent storage — resume from the last known good checkpoint after preemption.
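
A minimal sketch of the pattern — summing a list with progress persisted after every item. The checkpoint path and JSON schema here are arbitrary choices for illustration; in production the file would live in durable object storage:

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "job-checkpoint.json")

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_item": 0, "partial_sum": 0}

def save_checkpoint(state):
    # Write-then-rename so a preemption mid-write can't corrupt the file.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def run(items):
    state = load_checkpoint()   # resumes from the last good checkpoint
    for i in range(state["next_item"], len(items)):
        state["partial_sum"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(state)  # in production: every N items or on notice
    return state["partial_sum"]
```

The write-then-rename step matters: if the VM disappears mid-write, the previous checkpoint is still intact.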

4. Diversified spot pools

Use multiple instance types, sizes, and zones to reduce the chance all instances are preempted simultaneously.

5. Graceful shutdowns

Handle preemption signals to allow the VM to save state or drain current work.

Keypoint — Design principle: Treat preemptible instances as disposable workers; externalize critical state and make re-provisioning automated.

Kubernetes & GKE integration

Preemptible or spot nodes are commonly used in Kubernetes clusters to host non-critical pods. Best practices:

  • Node pools: Create separate node pools for preemptible/spot instances. Keep critical system pods on stable node pools.
  • Pod disruption budgets (PDBs): Use PDBs to control the rate of disruption for important workloads.
  • Pod priority and eviction: Use Kubernetes priorities so critical pods are scheduled on non-preemptible nodes.
  • Cluster-autoscaler: Configure the autoscaler to consider both regular and spot node pools.

Example: GKE spot node-pool (YAML)

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: gke-spot-pool
spec:
  location: us-central1
  initialNodeCount: 1
  autoscaling:
    enabled: true
    minNodeCount: 0
    maxNodeCount: 10
  management:
    autoRepair: true
    autoUpgrade: true
  config:
    machineType: n1-standard-4
    preemptible: true
    oauthScopes:
      - https://www.googleapis.com/auth/cloud-platform
      # other node config...
      

Note: Different tools (gcloud, Terraform, or GKE UI) can create node pools with preemptible/spot nodes. Use diversification across zones for resilience.

FAQ: Should I run stateful pods on spot nodes? — Avoid stateful critical pods on spot nodes unless you replicate storage and can tolerate restarts.

Checkpointing & graceful shutdown

Because providers will send short interruption notices, build quick checkpointing and shutdown handlers:

  • Short-lived flush: Keep checkpointing intervals small enough to minimize lost progress.
  • On-notice handlers: Implement a handler that reacts to the preemption signal (e.g., system event, metadata change, or instance metadata endpoint) to persist state quickly.
  • Application-level retry: On restart, detect incomplete jobs and resume from the last checkpoint.

Example: Preemption notification on Google Cloud

On Google Cloud, instances can poll the metadata endpoint for preemption notices (the Metadata-Flavor: Google header is required). Runnable Python equivalent of the polling loop:

import time, urllib.request

URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
REQ = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})

while True:
    if urllib.request.urlopen(REQ, timeout=2).read() == b"TRUE":
        # flush checkpoints, notify orchestrator, drain work
        break
    time.sleep(1)
      
Keypoint — Response time: 30 seconds is often enough to save a small checkpoint and send a notification — test this thoroughly.

Automation: provisioning, scaling and orchestration

Automation is essential — manual operations defeat the cost benefit of preemptible VMs. Use Infrastructure-as-Code and orchestration tools:

  • Terraform: Manage instance templates, managed instance groups and autoscaler rules.
  • Managed Instance Groups / Scale Sets: Use provider-managed scaling to replace preempted instances quickly.
  • CI/CD: Use ephemeral runners for tests; they are perfect candidates for preemptible instances.

Example: Terraform snippet (GCP MIG with preemptible)

resource "google_compute_instance_template" "preemptible_template" {
  name_prefix = "preemptible-template"
  machine_type = "n1-standard-4"
  scheduling {
    preemptible = true
  }
  disk {
    # ...
  }
  # other template settings
}

resource "google_compute_instance_group_manager" "preemptible_mig" {
  name = "preemptible-mig"
  base_instance_name = "preemptible"
  instance_template = google_compute_instance_template.preemptible_template.self_link
  target_size = 0
  auto_healing_policies { /* ... */ }
}
      
Keypoint — IaC: Use Infrastructure-as-Code for reproducible instance definitions and fast reprovisioning.

Monitoring & preemption logs

Monitoring is critical for diagnosing preemption patterns:

  • Cloud Logging / Stackdriver: Collect preemption and instance termination events.
  • Metrics: Track spot capacity usage, average runtime, preemption rate, and job completion rate.
  • Alerting: Alert on sudden spikes in preemption or job failures.

Example: gcloud logging query for preemptions (GCP)

gcloud logging read 'resource.type="gce_instance" AND protoPayload.methodName="compute.instances.preempted"' --limit 50
      

For AWS, check CloudTrail/CloudWatch events for Spot Instance interruption notices. For Azure, check Activity Logs for eviction events on Spot VMs.

Keypoint — Observability: Build dashboards showing preemption frequency, recovery time, and job success rates to make informed decisions.

Troubleshooting & automation scripts (PowerShell, CLI, REST)

Below are pragmatic scripts and commands for automation and troubleshooting. They target Azure (PowerShell), GCP (gcloud/REST), and general patterns. You can adapt them to your environment and CI/CD.

1) Azure PowerShell — create and monitor Spot VMs (Spot VMs + eviction)

Prereqs: Az PowerShell module installed and logged in via Connect-AzAccount.

# Create an Azure Spot VM (simple example)
$rg = "rg-spot-demo"
$loc = "eastus"
$vmName = "spot-vm-01"
$vmSize = "Standard_DS2_v2"
New-AzResourceGroup -Name $rg -Location $loc -ErrorAction SilentlyContinue

$subnetConfig = New-AzVirtualNetworkSubnetConfig -Name "default" -AddressPrefix "10.0.0.0/24"
$vnet = New-AzVirtualNetwork -ResourceGroupName $rg -Location $loc -Name "vnet-spot" -AddressPrefix "10.0.0.0/16" -Subnet $subnetConfig

$publicIP = New-AzPublicIpAddress -ResourceGroupName $rg -Location $loc -Name "$vmName-pip" -AllocationMethod Dynamic
$nic = New-AzNetworkInterface -ResourceGroupName $rg -Location $loc -Name "$vmName-nic" -SubnetId $vnet.Subnets[0].Id -PublicIpAddressId $publicIP.Id

# Configure Spot settings (Priority, EvictionPolicy, MaxPrice) at config creation.
# -MaxPrice -1 caps the price at the on-demand rate, so the VM is never evicted for price reasons.
$vmConfig = New-AzVMConfig -VMName $vmName -VMSize $vmSize -Priority "Spot" -EvictionPolicy "Deallocate" -MaxPrice -1  # or -EvictionPolicy "Delete"
$vmConfig = Set-AzVMOperatingSystem -VM $vmConfig -Windows -ComputerName $vmName -Credential (Get-Credential) -ProvisionVMAgent
$vmConfig = Set-AzVMSourceImage -VM $vmConfig -PublisherName "MicrosoftWindowsServer" -Offer "WindowsServer" -Skus "2019-Datacenter" -Version "latest"
$vmConfig = Add-AzVMNetworkInterface -VM $vmConfig -Id $nic.Id

New-AzVM -ResourceGroupName $rg -Location $loc -VM $vmConfig
      

This creates a Spot VM with the Deallocate eviction policy. Use Get-AzActivityLog to find eviction-related events.

# Query Activity Log for spot eviction events
$start = (Get-Date).AddDays(-7)
Get-AzActivityLog -ResourceGroupName $rg -StartTime $start | Where-Object {$_.OperationName -like "*evict*" -or $_.Category -eq "Administrative"} | Select EventTimestamp, OperationName, ResourceId, Status
      
Troubleshooting tip: If Spot VMs are being evicted frequently, diversify VM sizes, add fallback normal VMs, or increase quota.

2) Google Cloud — detect preemption via metadata, checkpoint & notify

Use the metadata endpoint to detect preemption and respond.

# Bash example (run on the instance)
METADATA='http://metadata.google.internal/computeMetadata/v1/instance/preempted'
while true; do
  # Pass the header directly; stashing '-H "..."' in a variable breaks shell quoting
  PREEMPTED=$(curl -s -H "Metadata-Flavor: Google" "$METADATA" || echo "FALSE")
  if [ "$PREEMPTED" = "TRUE" ]; then
    # Perform quick checkpoint
    echo "Preemption detected. Uploading checkpoint..."
    # Example: tar results and upload to GCS
    tar -czf /tmp/checkpoint.tgz /workdir
    gsutil cp /tmp/checkpoint.tgz gs://my-bucket/checkpoints/$(hostname)-$(date -u +%s).tgz
    # Notify orchestrator (HTTP callback)
    curl -X POST -H "Content-Type: application/json" -d '{"host":"'"$(hostname)"'","event":"preempted"}' https://my-orchestrator.example.com/preempt
    break
  fi
  sleep 1
done
      
Keypoint — Speed: Keep checkpoint payloads small and upload to fast stores to complete within the preemption window.

3) GCP CLI: find preempted instances

# List instances that have recently been preempted (example query)
gcloud logging read 'protoPayload.methodName="compute.instances.preempted"' --freshness=7d --format="table(timestamp,resource.labels.instance_id,resource.labels.zone,protoPayload.resourceName)"
      

4) REST API: create an instance template (sample snippet)

Example HTTP payload for Google Compute Engine instance template (shortened):

POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/global/instanceTemplates
Authorization: Bearer $(gcloud auth print-access-token)
Content-Type: application/json

{
  "name": "preemptible-template",
  "properties": {
    "machineType": "n1-standard-4",
    "scheduling": {
      "preemptible": true
    },
    "disks": [
      {
        "boot": true,
        "autoDelete": true,
        "initializeParams": {
          "sourceImage": "projects/debian-cloud/global/images/family/debian-10"
        }
      }
    ],
    "networkInterfaces": [ /* ... */ ]
  }
}
      
FAQ: Can I automate reprovisioning with REST calls? — Yes. Use the provider APIs to programmatically recreate instances on preemption / scale events.

Preemptible GPUs for ML workloads

GPUs attached to preemptible instances are a powerful cost-saving option for model training. Guidelines:

  • Checkpoint frequently: Save model weights to durable storage frequently.
  • Use distributed training: Use frameworks that can re-balance or re-launch workers.
  • Consider mixed-instance types: Mix cheaper CPUs and preemptible GPUs to reduce total cost and improve availability.
Keypoint — ML: Preemptible GPUs are excellent for non-critical training but require robust checkpoint and job orchestration.

Spot pool strategy & zone diversification

Mix instance types, families, and zones to reduce correlated preemptions. Strategies:

  • Use capacity-aware autoscaling: Autoscalers that understand spot pool availability improve reliability.
  • Fallback rules: Have policies to use on-demand instances when spot capacity is insufficient.
Keypoint — Resilience: A diversified spot pool reduces the chance of entire workload interruption.

Metrics & KPIs to track

Measure these KPIs to understand effectiveness and risk:

  • Preemption rate: percent of instances preempted vs launched.
  • Mean time to recovery (MTTR): how long to replace preempted instances and resume workloads.
  • Job success rate: percent of jobs that complete despite preemptions.
  • Cost per successful job: total cost divided by completed tasks (including re-runs due to preemptions).
Keypoint — Business metric: Use "cost per successful job" as a single metric to decide if spot/preemptible approach is worthwhile.
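
The KPIs above are simple ratios once the raw counts are collected. A sketch with made-up numbers:

```python
# Hypothetical numbers from one week of running a spot fleet.
launches = 200
preempted = 38
jobs_submitted = 500
jobs_completed = 480
total_cost = 92.50                         # $ spent, including re-runs
recovery_minutes = [2.0, 3.5, 1.5, 4.0]    # time to replace each lost node

preemption_rate = preempted / launches
job_success_rate = jobs_completed / jobs_submitted
mttr = sum(recovery_minutes) / len(recovery_minutes)
cost_per_successful_job = total_cost / jobs_completed

print(f"preemption rate:  {preemption_rate:.1%}")     # 19.0%
print(f"job success rate: {job_success_rate:.1%}")    # 96.0%
print(f"MTTR:             {mttr:.2f} min")            # 2.75 min
print(f"cost per job:     ${cost_per_successful_job:.3f}")
```

Compare cost per successful job against the same metric on on-demand capacity: if the spot number (re-runs included) is still lower, the approach is paying off.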

Best practices checklist

  1. Design for ephemeral compute — externalize state.
  2. Checkpoint frequently to durable storage and use small checkpoint sizes.
  3. Use hybrid pools — combine preemptible & regular VMs.
  4. Use diversified instance types, zones, and autoscaling policies.
  5. Monitor preemptions, success rates, MTTR, and cost per job.
  6. Automate reprovisioning with IaC (Terraform), autoscalers and managed instance groups.
  7. Use queues and idempotency to make tasks retry-safe.
  8. Test preemption handling — simulate preemptions in staging.
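
Item 8 can be exercised even without cloud access: model preemption as a per-attempt failure probability and check that retry logic still finishes every job. A toy simulation (probabilities and counts are illustrative):

```python
import random

def simulate(num_jobs, preempt_prob, seed=42):
    """Re-run each job until an attempt survives; return total attempts."""
    rng = random.Random(seed)   # fixed seed => reproducible runs
    attempts = 0
    for _ in range(num_jobs):
        while True:
            attempts += 1
            if rng.random() > preempt_prob:
                break   # this attempt completed before being "preempted"
    return attempts

# With a 30% preemption rate, expect roughly num_jobs / 0.7 attempts.
total = simulate(100, 0.30)
print(total)
```

The same idea scales up to real chaos tests: delete live instances in staging and verify the orchestrator recovers without manual steps.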

FAQs (Frequently Asked Questions) & answers

Q: What is the main difference between spot and preemptible?

A: Terminology differs across providers but the concept is the same — deep-discount instances that can be reclaimed. AWS uses "Spot Instances", Google uses "Preemptible VMs", and Azure uses "Spot VMs".

Q: How much do I save?

A: Savings vary by region and time, but discounts of 70–90% are common. Always test with your workload and track "cost per successful run".

Q: Do preemptible instances have an SLA?

A: No. They are best-effort and do not carry standard uptime SLAs.

Q: Are preemptible instances secure?

A: Yes — they have the same security posture as on-demand VMs; security practices remain the same (patching, IAM, network controls).

Q: Can I attach persistent disks/volumes?

A: Yes — attach cloud persistent disks or remote storage. Remember to design for the case where the compute is gone but storage remains.

Troubleshooting checklist

  • Verify preemption logs in provider's logging service (Stackdriver, CloudWatch, Activity Log).
  • Check whether preemptions are correlated with zone capacity or instance type.
  • Confirm checkpointing logic executes within preemption window.
  • Ensure autoscaler policies are replacing preempted nodes.
  • Audit job queue visibility & message redelivery settings in case of worker loss.

End-to-end example scenario

Scenario: You run a distributed training job across 50 preemptible GPU VMs in GCP. Steps to succeed:

  1. Provision an instance template with preemptible: true and GPU config.
  2. Use a Managed Instance Group (MIG) with autoscaling to maintain the worker pool.
  3. Implement frequent model checkpointing to Cloud Storage (GCS) every N minutes/epochs.
  4. On preemption notice (via metadata), upload any final checkpoint and post message to orchestrator.
  5. Orchestrator notices missing workers and starts new workers from the template.
  6. Training resumes from latest checkpoint automatically.
Keypoint — Automation is essential: The system should automatically reprovision and resume without manual steps.
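
Step 5 — the orchestrator replacing missing workers — is essentially a reconciliation loop: compare the desired pool size to what is actually running and create the difference. A minimal sketch (create_worker stands in for a provider API call, such as creating an instance from the template; the names are hypothetical):

```python
def reconcile(desired, running, create_worker):
    """Bring the worker pool back to `desired` size after preemptions.

    `create_worker` stands in for a provider API call (e.g. inserting an
    instance from a template) and returns the new worker's name.
    """
    created = []
    while len(running) + len(created) < desired:
        created.append(create_worker())
    return created

# Example: 50 workers desired, 47 still running after preemptions.
pool = {f"worker-{i}" for i in range(47)}
new = reconcile(50, pool, lambda: "replacement")
print(len(new))  # 3
```

Managed instance groups and Kubernetes controllers implement exactly this loop for you; a custom orchestrator only needs it when the provider's autoscaler can't see your job state.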

Security & governance considerations

  • IAM & least privilege: Preemptible workers should have minimal rights — access only to the buckets/services they need to checkpoint & read jobs.
  • Secrets management: Use cloud secret stores (e.g., Secret Manager, Key Vault) instead of embedding credentials in images or scripts.
  • Audit logs: Monitor and store logs for incident investigation and compliance.

Real-world tips from practitioners

  • Stagger checkpoint timings across workers to avoid spikes to storage systems on preemption windows.
  • Keep checkpoints compact — save only deltas or compressed state.
  • Run chaos tests — simulate instance losses to validate orchestration and resume logic.
  • Use job idempotency to ensure reprocessing does not break downstream systems.

Conclusion

Preemptible and spot VMs provide an effective way to reduce compute costs dramatically when your workloads are architected for failure. The required investment in automation, checkpointing, monitoring, and orchestration usually pays off with significant savings for suitable workloads. Use the hybrid approach — combining preemptible workers with regular instances — to gain balance between cost-efficiency and reliability.

Final keypoint: Evaluate using real workload metrics (preemption rate, MTTR, cost per successful job) to decide whether preemptible VMs are the right fit.

© Cloudknowledge • All code samples are simplified; test before applying in production.
