Compute Services: Preemptible VMs — Low-cost, short-lived VMs for fault-tolerant workloads
A practical, in-depth guide to choosing, architecting, automating and troubleshooting Preemptible / Spot VMs across cloud providers. Includes scripts, diagnostics, and Kubernetes patterns.
Updated technical guide — includes PowerShell examples, CLI/REST calls, monitoring tips, and resilient architecture patterns.
Introduction
Preemptible virtual machines (VMs) are an attractive option for organizations that need large amounts of compute at a fraction of the cost. These instances are low-cost and short-lived, intended for workloads that can tolerate interruptions and for architectures built to be resilient. They are offered natively by cloud providers under different names — for example Preemptible VMs on Google Cloud, Spot Instances on AWS, and Spot VMs on Azure — and are ideal for batch jobs, large-scale training, CI/CD, and more.
Key promise: Deep cost savings (often 70–90%) in exchange for the potential to be preempted (terminated) when cloud capacity is reclaimed.
Definition & core properties
- Definition: Preemptible or spot VMs are discounted compute instances that can be terminated by the cloud provider with a short notice when resources are needed elsewhere.
- Cost advantage: Savings typically range from 60% to 90% versus on-demand VMs, depending on provider, region, and machine type.
- Lifecycle limit: Google Cloud preemptible VMs can run up to 24 hours; Azure and AWS have no hard 24-hour limit but can be interrupted anytime.
- Interruption notice: Providers send a short warning before termination (roughly 30 seconds on Google Cloud and Azure, about two minutes on AWS).
- No SLA: These instances are best-effort and do not receive the standard uptime SLA.
Key terms: Preemptible VMs, Spot Instances, Spot VMs, Compute Engine, GKE.
Cost & pricing model
Preemptible/spot pricing is fixed and significantly lower than on-demand pricing; cloud vendors use these offerings to sell spare capacity. Important pricing characteristics:
- Predictable discount: Unlike the bidding markets of the past, most providers now offer fixed, deep-discount pricing for preemptible instances.
- No sustained discounts: Sustained-use or reserved discounts usually do not apply to preemptible instances because they already have deep discounts.
- Billing granularity: Billing is typically per-second or per-minute depending on the cloud provider.
- Budget control: Use quotas, budgets, and alerts to avoid cost surprises when scaling.
Primary use cases
Typical workloads that benefit the most:
- Batch processing: Large ETL jobs, video transcoding, genomic pipelines.
- Machine learning training: Distributed training that supports checkpointing and resume.
- CI/CD pipelines: Parallel test runners and build agents.
- High-throughput compute: Simulation runs, rendering, analytics.
- Stateless microservices: When used in conjunction with resilient load balancing and autoscaling.
Cloud providers & equivalent offerings
Common provider names and short notes:
- Google Cloud: Preemptible VMs — up to 24-hour runtime, 30-second preemption notice (the newer Spot VMs drop the 24-hour limit).
- AWS: Spot Instances — very flexible; supports diversified instance pools, and the old bidding model has largely been replaced by smoothed, capacity-based pricing.
- Azure: Spot VMs — configurable eviction policy and eviction actions, integration with Scale Sets.
Lifecycle & interruption behavior
Understanding lifetime and preemption is essential for designing resilient apps:
- Maximum lifetime: Google preemptible VMs may be terminated after 24 hours; other providers may not impose a strict maximum but will preempt when capacity is needed.
- Preemption window: Providers usually send a short notice (e.g., ~30 seconds on Google) allowing scripts to checkpoint or flush state.
- Restart behavior: Depending on the provider and policy, a preempted instance is stopped/deallocated or deleted; autoscaled fleets normally recreate replacements from a template rather than restarting the original.
- Eviction policies: Some providers let you configure the eviction action (e.g., Deallocate vs. Delete on Azure Spot VMs).
Storage patterns: ephemeral vs persistent
Because preemptible VMs are ephemeral, follow these guidelines:
- Avoid relying on local ephemeral disks for any critical state.
- Persist state to durable storage solutions: cloud persistent disks, cloud object storage (e.g., GCS, S3, Azure Blob), external databases, or distributed filesystems.
- Checkpoint frequently: For long-running tasks, checkpoint to persistent storage on a schedule or on preemption notice if possible.
Resilient architecture patterns
Designing for preemptible instances requires thinking in terms of fault tolerance and ephemeral compute:
1. Hybrid pools (preemptible + regular)
Maintain a small pool of regular (non-preemptible) instances and scale preemptible workers for cost savings. Keep critical services on regular VMs while using spot/preemptible VMs for scale-out tasks.
- Use preemptible workers for parallel tasks and fall back to standard VMs if preemptible capacity is unavailable.
2. Queue-driven worker architecture
Use message queues (e.g., Cloud Pub/Sub, SQS, Azure Service Bus) or job schedulers. Workers poll the queue and process jobs; failed or abandoned tasks are returned to the queue for redelivery (a minimal worker sketch follows this list).
3. Checkpoint & resume
Design tasks to flush intermediate results to persistent storage — resume from the last known good checkpoint after preemption.
4. Diversified spot pools
Use multiple instance types, sizes, and zones to reduce the chance all instances are preempted simultaneously.
5. Graceful shutdowns
Handle preemption signals to allow the VM to save state or drain current work.
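A minimal sketch of the queue-driven pattern (pattern 2 above), using Google Cloud Pub/Sub as the queue; the project ID, subscription name, and do_work body are placeholders, and any queue with redelivery semantics works the same way. The key property is that a message is acknowledged only after the work is durably complete, so a preempted worker's in-flight job is redelivered automatically:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholders: substitute your own project and subscription
subscription = subscriber.subscription_path("my-project", "jobs-subscription")

def do_work(payload):
    ...  # placeholder job logic; must be idempotent, since jobs can be re-run

def on_message(message):
    do_work(message.data)
    message.ack()  # ack only after results are durably stored

# If this worker is preempted mid-job, the message is never acked and
# Pub/Sub redelivers it to another worker after the ack deadline expires.
streaming_pull = subscriber.subscribe(subscription, callback=on_message)
streaming_pull.result()  # blocks; preemption simply kills the process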
Kubernetes & GKE integration
Preemptible or spot nodes are commonly used in Kubernetes clusters to host non-critical pods. Best practices:
- Node pools: Create separate node pools for preemptible/spot instances. Keep critical system pods on stable node pools.
- Pod disruption budgets (PDBs): Use PDBs to control the rate of disruption for important workloads.
- Pod priority and eviction: Use Kubernetes priorities so critical pods are scheduled on non-preemptible nodes.
- Cluster-autoscaler: Configure the autoscaler to consider both regular and spot node pools.
Example: GKE spot node-pool (YAML)
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: gke-spot-pool
spec:
  clusterRef:
    name: my-gke-cluster  # placeholder: references the target ContainerCluster
  location: us-central1
  initialNodeCount: 1
  autoscaling:
    minNodeCount: 0
    maxNodeCount: 10
  management:
    autoRepair: true
    autoUpgrade: true
  nodeConfig:
    machineType: n1-standard-4
    preemptible: true
    oauthScopes:
      - https://www.googleapis.com/auth/cloud-platform
    # other node config...
Note: Different tools (gcloud, Terraform, or GKE UI) can create node pools with preemptible/spot nodes. Use diversification across zones for resilience.
Checkpointing & graceful shutdown
Because providers will send short interruption notices, build quick checkpointing and shutdown handlers:
- Short-lived flush: Keep checkpointing intervals small enough to minimize lost progress.
- On-notice handlers: Implement a handler that reacts to the preemption signal (e.g., system event, metadata change, or instance metadata endpoint) to persist state quickly.
- Application-level retry: On restart, detect incomplete jobs and resume from the last checkpoint.
Example: Preemption notification on Google Cloud
On Google Cloud, instances can poll the metadata endpoint for preemption notices. A minimal runnable Python example (the mandatory Metadata-Flavor header is easy to forget):

import time
import requests

URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
HEADERS = {"Metadata-Flavor": "Google"}  # required on every metadata request

while True:
    if requests.get(URL, headers=HEADERS).text.strip() == "TRUE":
        # flush checkpoints, notify orchestrator, drain work
        break
    time.sleep(1)
Automation: provisioning, scaling and orchestration
Automation is essential — manual operations defeat the cost benefit of preemptible VMs. Use Infrastructure-as-Code and orchestration tools:
- Terraform: Manage instance templates, managed instance groups and autoscaler rules.
- Managed Instance Groups / Scale Sets: Use provider-managed scaling to replace preempted instances quickly.
- CI/CD: Use ephemeral runners for tests; they are perfect candidates for preemptible instances.
Example: Terraform snippet (GCP MIG with preemptible)
resource "google_compute_instance_template" "preemptible_template" {
name_prefix = "preemptible-template"
machine_type = "n1-standard-4"
scheduling {
preemptible = true
}
disk {
# ...
}
# other template settings
}
resource "google_compute_instance_group_manager" "preemptible_mig" {
name = "preemptible-mig"
base_instance_name = "preemptible"
instance_template = google_compute_instance_template.preemptible_template.self_link
target_size = 0
auto_healing_policies { /* ... */ }
}
Monitoring & preemption logs
Monitoring is critical for diagnosing preemption patterns:
- Cloud Logging (formerly Stackdriver): Collect preemption and instance termination events.
- Metrics: Track spot capacity usage, average runtime, preemption rate, and job completion rate.
- Alerting: Alert on sudden spikes in preemption or job failures.
Example: gcloud logging query for preemptions (GCP)
gcloud logging read 'resource.type="gce_instance" AND protoPayload.methodName="v1.compute.instances.preempted"' --limit 50
For AWS, check CloudTrail/CloudWatch for Spot Instance interruption notices (EventBridge publishes an "EC2 Spot Instance Interruption Warning" event). For Azure, check Activity Logs for eviction events on Spot VMs.
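From inside the instance, AWS also exposes the pending interruption through the instance metadata service. A minimal polling sketch using IMDSv2 (assumes the third-party requests package; the spot/instance-action document returns 404 until a notice is pending, then appears roughly two minutes before reclamation):

import time
import requests

IMDS = "http://169.254.169.254"

def imds_token():
    # IMDSv2: fetch a short-lived session token before reading metadata
    return requests.put(
        f"{IMDS}/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    ).text

while True:
    resp = requests.get(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    if resp.status_code == 200:
        # e.g. {"action": "terminate", "time": "2025-01-01T00:00:00Z"}
        print("Spot interruption notice:", resp.json())
        break  # checkpoint, drain, and hand in-flight work back to the queue
    time.sleep(5)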
Troubleshooting & automation scripts (PowerShell, CLI, REST)
Below are pragmatic scripts and commands for automation and troubleshooting. They target Azure (PowerShell), GCP (gcloud/REST), and general patterns. You can adapt them to your environment and CI/CD.
1) Azure PowerShell — create and monitor Spot VMs (Spot VMs + eviction)
Prereqs: Az PowerShell module installed and logged in via Connect-AzAccount.
# Create an Azure Spot VM (simple example)
$rg = "rg-spot-demo"
$loc = "eastus"
$vmName = "spot-vm-01"
$vmSize = "Standard_DS2_v2"
New-AzResourceGroup -Name $rg -Location $loc -ErrorAction SilentlyContinue
$subnetConfig = New-AzVirtualNetworkSubnetConfig -Name "default" -AddressPrefix "10.0.0.0/24"
$vnet = New-AzVirtualNetwork -ResourceGroupName $rg -Location $loc -Name "vnet-spot" -AddressPrefix "10.0.0.0/16" -Subnet $subnetConfig
$publicIP = New-AzPublicIpAddress -ResourceGroupName $rg -Location $loc -Name "$vmName-pip" -AllocationMethod Dynamic
$nic = New-AzNetworkInterface -ResourceGroupName $rg -Location $loc -Name "$vmName-nic" -SubnetId $vnet.Subnets[0].Id -PublicIpAddressId $publicIP.Id
# Configure Spot settings at creation time: Priority, EvictionPolicy, MaxPrice
# (MaxPrice -1 caps the price at the on-demand rate, so the VM is never evicted for price reasons)
$vmConfig = New-AzVMConfig -VMName $vmName -VMSize $vmSize -Priority "Spot" -EvictionPolicy "Deallocate" -MaxPrice -1
$vmConfig = Set-AzVMOperatingSystem -VM $vmConfig -Windows -ComputerName $vmName -Credential (Get-Credential) -ProvisionVMAgent
$vmConfig = Set-AzVMSourceImage -VM $vmConfig -PublisherName "MicrosoftWindowsServer" -Offer "WindowsServer" -Skus "2022-datacenter-azure-edition" -Version "latest"
$vmConfig = Add-AzVMNetworkInterface -VM $vmConfig -Id $nic.Id
New-AzVM -ResourceGroupName $rg -Location $loc -VM $vmConfig
This creates a Spot VM with deallocate eviction policy. Use Get-AzActivityLog to find eviction-related events.
# Query Activity Log for spot eviction events
$start = (Get-Date).AddDays(-7)
Get-AzActivityLog -ResourceGroupName $rg -StartTime $start | Where-Object {$_.OperationName -like "*evict*" -or $_.Category -eq "Administrative"} | Select EventTimestamp, OperationName, ResourceId, Status
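Eviction warnings also surface inside the VM through the Scheduled Events endpoint of the Azure Instance Metadata Service, which reports an imminent Spot eviction as a Preempt event roughly 30 seconds ahead. A minimal polling sketch (assumes the requests package; handle_eviction is a placeholder):

import time
import requests

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
HEADERS = {"Metadata": "true"}  # required by the Instance Metadata Service

def handle_eviction():
    ...  # placeholder: checkpoint state, drain work, notify the orchestrator

while True:
    events = requests.get(ENDPOINT, headers=HEADERS).json().get("Events", [])
    for event in events:
        if event["EventType"] == "Preempt":  # Spot eviction, ~30s warning
            handle_eviction()
            # Acknowledge so the platform can proceed without waiting out the notice
            requests.post(ENDPOINT, headers=HEADERS,
                          json={"StartRequests": [{"EventId": event["EventId"]}]})
    time.sleep(5)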
2) Google Cloud — detect preemption via metadata, checkpoint & notify
Use the metadata endpoint to detect preemption and respond.
# Bash example (run on the instance)
METADATA='http://metadata.google.internal/computeMetadata/v1/instance/preempted'
# Note: pass the mandatory Metadata-Flavor header directly; stashing it in a
# quoted variable breaks shell word-splitting and the header is never sent.
while true; do
  PREEMPTED=$(curl -s -H "Metadata-Flavor: Google" "$METADATA" || echo "FALSE")
  if [ "$PREEMPTED" = "TRUE" ]; then
    # Perform quick checkpoint
    echo "Preemption detected. Uploading checkpoint..."
    # Example: tar results and upload to GCS
    tar -czf /tmp/checkpoint.tgz /workdir
    gsutil cp /tmp/checkpoint.tgz "gs://my-bucket/checkpoints/$(hostname)-$(date -u +%s).tgz"
    # Notify orchestrator (HTTP callback)
    curl -X POST -H "Content-Type: application/json" -d '{"host":"'"$(hostname)"'","event":"preempted"}' https://my-orchestrator.example.com/preempt
    break
  fi
  sleep 1  # or append "?wait_for_change=true" to the URL to long-poll instead
done
3) GCP CLI: find preempted instances
# List instances that have recently been preempted (example query)
gcloud logging read 'protoPayload.methodName="v1.compute.instances.preempted"' --freshness=7d --format="table(timestamp,resource.labels.instance_id,resource.labels.zone,protoPayload.resourceName)"
4) REST API: create an instance template (sample snippet)
Example HTTP payload for Google Compute Engine instance template (shortened):
POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/global/instanceTemplates
Authorization: Bearer $(gcloud auth print-access-token)
Content-Type: application/json
{
  "name": "preemptible-template",
  "properties": {
    "machineType": "n1-standard-4",
    "scheduling": {
      "preemptible": true,
      "automaticRestart": false
    },
    "disks": [
      {
        "boot": true,
        "autoDelete": true,
        "initializeParams": {
          "sourceImage": "projects/debian-cloud/global/images/family/debian-12"
        }
      }
    ],
    "networkInterfaces": [ /* ... */ ]
  }
}
Preemptible GPUs for ML workloads
GPUs attached to preemptible instances are a powerful cost-saving option for model training. Guidelines:
- Checkpoint frequently: Save model weights to durable storage on a regular cadence (a minimal sketch follows this list).
- Use distributed training: Use frameworks that can re-balance or re-launch workers.
- Consider mixed-instance types: Mix cheaper CPUs and preemptible GPUs to reduce total cost and improve availability.
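As a concrete version of the first guideline, a minimal checkpointing sketch assuming PyTorch and the google-cloud-storage client library (the bucket name, object path, and checkpoint contents are placeholders):

import torch
from google.cloud import storage

BUCKET = "my-training-bucket"  # placeholder

def save_checkpoint(model, optimizer, epoch, local_path="/tmp/ckpt.pt"):
    # Write the checkpoint locally first, then copy it to durable object storage
    torch.save(
        {"epoch": epoch,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        local_path,
    )
    blob = storage.Client().bucket(BUCKET).blob(f"checkpoints/epoch-{epoch:05d}.pt")
    blob.upload_from_filename(local_path)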
Spot pool strategy & zone diversification
Mix instance types, families, and zones to reduce correlated preemptions. Strategies:
- Use capacity-aware autoscaling: Autoscalers that understand spot pool availability improve reliability.
- Fallback rules: Have policies to use on-demand instances when spot capacity is insufficient.
Metrics & KPIs to track
Measure these KPIs to understand effectiveness and risk:
- Preemption rate: percent of instances preempted vs launched.
- Mean time to recovery (MTTR): how long to replace preempted instances and resume workloads.
- Job success rate: percent of jobs that complete despite preemptions.
- Cost per successful job: total cost divided by completed jobs, including the cost of re-runs caused by preemptions (a toy calculation follows this list).
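A toy illustration of how these KPIs are computed (all numbers hypothetical):

# Hypothetical numbers for illustration only
launched, preempted = 200, 38
jobs_submitted, jobs_completed = 500, 480
total_cost_usd = 120.0  # includes re-runs caused by preemptions

preemption_rate = preempted / launched                     # 19%
job_success_rate = jobs_completed / jobs_submitted         # 96%
cost_per_successful_job = total_cost_usd / jobs_completed  # $0.25

print(f"preemption rate: {preemption_rate:.0%}")
print(f"job success rate: {job_success_rate:.0%}")
print(f"cost per successful job: ${cost_per_successful_job:.2f}")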
Best practices checklist
- Design for ephemeral compute — externalize state.
- Checkpoint frequently to durable storage and use small checkpoint sizes.
- Use hybrid pools — combine preemptible & regular VMs.
- Use diversified instance types, zones, and autoscaling policies.
- Monitor preemptions, success rates, MTTR, and cost per job.
- Automate reprovisioning with IaC (Terraform), autoscalers and managed instance groups.
- Use queues and idempotency to make tasks retry-safe.
- Test preemption handling — simulate preemptions in staging.
FAQs (Frequently Asked Questions) & answers
Q: What is the main difference between spot and preemptible?
A: Terminology differs across providers but the concept is the same — deep-discount instances that can be reclaimed. AWS uses "Spot Instances", Google uses "Preemptible VMs" (and, more recently, "Spot VMs"), and Azure uses "Spot VMs".
Q: How much do I save?
A: Savings vary by region and time, but discounts of 70–90% are common. Always test with your workload and track "cost per successful run".
Q: Do preemptible instances have an SLA?
A: No. They are best-effort and do not carry standard uptime SLAs.
Q: Are preemptible instances secure?
A: Yes — they have the same security posture as on-demand VMs; security practices remain the same (patching, IAM, network controls).
Q: Can I attach persistent disks/volumes?
A: Yes — attach cloud persistent disks or remote storage. Remember to design for the case where the compute is gone but storage remains.
Troubleshooting checklist
- Verify preemption logs in provider's logging service (Stackdriver, CloudWatch, Activity Log).
- Check whether preemptions are correlated with zone capacity or instance type.
- Confirm checkpointing logic executes within preemption window.
- Ensure autoscaler policies are replacing preempted nodes.
- Audit job queue visibility & message redelivery settings in case of worker loss.
End-to-end example scenario
Scenario: You run a distributed training job across 50 preemptible GPU VMs in GCP. Steps to succeed:
- Provision an instance template with preemptible: true and GPU config.
- Use a Managed Instance Group (MIG) with autoscaling to maintain the worker pool.
- Implement frequent model checkpointing to Cloud Storage (GCS) every N minutes/epochs.
- On preemption notice (via metadata), upload any final checkpoint and post message to orchestrator.
- Orchestrator notices missing workers and starts new workers from the template.
- Training resumes from the latest checkpoint automatically (see the lookup sketch below).
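A sketch of that final resume step, assuming the google-cloud-storage client and the checkpoint layout from the checkpointing sketch earlier (bucket and prefix are placeholders):

from typing import Optional
from google.cloud import storage

def latest_checkpoint(bucket="my-training-bucket", prefix="checkpoints/") -> Optional[str]:
    # Find the most recently written checkpoint object, if any
    blobs = list(storage.Client().list_blobs(bucket, prefix=prefix))
    if not blobs:
        return None  # fresh start: no checkpoint yet
    newest = max(blobs, key=lambda b: b.time_created)
    newest.download_to_filename("/tmp/ckpt.pt")
    return "/tmp/ckpt.pt"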
Security & governance considerations
- IAM & least privilege: Preemptible workers should have minimal rights — access only to the buckets/services they need to checkpoint & read jobs.
- Secrets management: Use cloud secret stores (e.g., Secret Manager, Key Vault) instead of embedding credentials in images or scripts.
- Audit logs: Monitor and store logs for incident investigation and compliance.
Real-world tips from practitioners
- Stagger checkpoint timings across workers to avoid load spikes on storage systems during preemption windows.
- Keep checkpoints compact — save only deltas or compressed state.
- Run chaos tests — simulate instance losses to validate orchestration and resume logic.
- Use job idempotency to ensure reprocessing does not break downstream systems.
Conclusion
Preemptible and spot VMs provide an effective way to reduce compute costs dramatically when your workloads are architected for failure. The required investment in automation, checkpointing, monitoring, and orchestration usually pays off with significant savings for suitable workloads. Use the hybrid approach — combining preemptible workers with regular instances — to balance cost-efficiency and reliability.