Azure Virtual Machines (VMs): End-to-End Guide for Real-World Deployments
From picking the right size and disk to high availability, security, cost control, and battle-tested troubleshooting.
Definition & Purpose
Azure Virtual Machines (VMs) are scalable, on-demand compute resources in Microsoft Azure that let you deploy Windows or Linux servers without managing physical hardware. You control the OS, disks, and networking, while Azure provides the fabric for elasticity, security, and global availability.
Typical scenarios include hosting enterprise web apps, databases, API backends, batch jobs, and dev/test environments where you need full OS-level control.
Primary Use Cases
Web & API Hosting
- Run IIS, Nginx, Apache, or Node.js with custom middleware.
- Pair with Azure Load Balancer or Application Gateway for L4/L7 traffic.
Databases & Analytics
- Self-managed SQL Server, MySQL, PostgreSQL, or NoSQL engines with fine-grained disk layout.
- Attach Premium SSD or Ultra Disk for predictable latency.
Dev/Test
- Spin up sandboxes with Azure DevTest Labs and cost caps.
- Automate with Azure DevOps, ARM/Bicep, Terraform, or GitHub Actions.
GPU/AI & Graphics
- N-series for CUDA/AI training, inference, and visualization.
- Stream desktops/apps via Azure Virtual Desktop.
Operating System Support
Azure supports Windows Server and major Linux distributions such as Ubuntu, Red Hat, SUSE, Debian, and CentOS (legacy), plus custom images you import from on-premises. Use Azure Compute Gallery (formerly Shared Image Gallery) to replicate standardized images across regions.
VM Sizes & Families
Choose the right size for your workload. You can resize later if telemetry suggests a different fit.
| Family | Best For | Notes |
|---|---|---|
| B-series | Low-cost, burstable | Credit-based CPU; great for dev, small services. |
| D-series | General purpose | Balanced CPU/memory; common for app servers. |
| E-series | Memory-optimized | In-memory databases, analytics engines. |
| F-series | Compute-optimized | High CPU workloads, microservices, batch. |
| M-series | Massive memory | SAP HANA, large in-memory apps. |
| N-series | GPU/AI/Graphics | Training, inference, rendering, visualization. |
Tip: For production, prefer the current-generation SKUs (v5/v6) in your region for better perf/price.
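One way to make the table above actionable is to start from the memory-to-vCPU ratio your telemetry shows. The sketch below is only an illustration: the function name `suggest_family` and the ratio thresholds are assumptions chosen for the example, not Azure rules, and you should confirm against the published specs for the SKUs available in your region.

```python
def suggest_family(vcpus: int, memory_gib: float,
                   needs_gpu: bool = False, bursty: bool = False) -> str:
    """Map a rough workload profile to one of the families in the table.

    The ratio thresholds are illustrative assumptions, not Azure rules.
    """
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    if needs_gpu:
        return "N-series"
    if bursty:
        return "B-series"
    gib_per_vcpu = memory_gib / vcpus
    if gib_per_vcpu <= 2:
        return "F-series"   # compute-optimized
    if gib_per_vcpu <= 4:
        return "D-series"   # general purpose
    if gib_per_vcpu <= 8:
        return "E-series"   # memory-optimized
    return "M-series"       # massive memory


print(suggest_family(vcpus=4, memory_gib=32))  # E-series
```

Treat the result as a starting point for a load test, then resize if the measurements disagree.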
Scaling Patterns
Vertical Scaling
Change the size of a single VM for quick resource adjustments. Resizing restarts the VM, and moving to a size that the current hardware cluster does not offer requires a stop/deallocate first.
Horizontal Scaling with Virtual Machine Scale Sets (VMSS)
- Uniform or Flexible orchestration for rolling upgrades and mixed sizes.
- Autoscale via CPU, memory (with custom metrics), or queue length signals.
- Integrate with Azure Load Balancer or Application Gateway.
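The autoscale behavior described above boils down to a threshold rule with a floor and a ceiling. The following sketch models that decision in plain Python; the function name, the 70%/30% thresholds, and the instance limits are hypothetical values for illustration, not defaults taken from Azure.

```python
def desired_capacity(current: int, avg_cpu: float,
                     min_count: int = 2, max_count: int = 10,
                     scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                     step: int = 1) -> int:
    """Return the instance count a simple CPU-based rule would ask for."""
    if avg_cpu > scale_out_at:
        target = current + step      # under pressure: add instances
    elif avg_cpu < scale_in_at:
        target = current - step      # mostly idle: remove instances
    else:
        target = current             # inside the dead band: hold steady
    # Never leave the configured floor/ceiling
    return max(min_count, min(max_count, target))


print(desired_capacity(current=3, avg_cpu=85))  # 4
print(desired_capacity(current=2, avg_cpu=10))  # 2
```

Keeping a gap between the scale-out and scale-in thresholds helps avoid flapping when load hovers near a single cutoff.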
High Availability (HA)
- Availability Sets distribute VMs across update and fault domains within a datacenter.
- Availability Zones spread VMs across physically separate zones for higher resiliency.
- Combine with zone-redundant load balancers and replicated disks.
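To reason about how much capacity survives a zone outage, it can help to model the placement. This is a minimal sketch that assumes a simple round-robin spread over three zones; the helper names are hypothetical and the real platform decides actual placement.

```python
def spread_across_zones(instance_count: int, zones=("1", "2", "3")) -> list:
    """Assign instances to zones round-robin (an assumed, simplified model)."""
    return [zones[i % len(zones)] for i in range(instance_count)]


def surviving_instances(placement: list, failed_zone: str) -> int:
    """Count instances that remain if one zone becomes unavailable."""
    return sum(1 for zone in placement if zone != failed_zone)


placement = spread_across_zones(5)
print(placement)                            # ['1', '2', '3', '1', '2']
print(surviving_instances(placement, "1"))  # 3
```

If the surviving count cannot carry peak load, add instances or rely on autoscale headroom rather than assuming all zones stay healthy.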
Disks, Throughput & Performance
| Disk Type | Use Case | IOPS/Latency Characteristics |
|---|---|---|
| Standard HDD | Lowest cost, dev/test | Higher latency; avoid for transactional DBs. |
| Standard SSD | Balanced general-purpose | Better latency than HDD; good default for non-critical apps. |
| Premium SSD | Production apps, DBs | Consistent perf; plan for host cache and queue depths. |
| Ultra Disk | Mission-critical low-latency | Provision IOPS/MBps per disk; ideal for top-tier DB workloads. |
Keep the default ReadWrite host caching on OS disks, use ReadOnly for read-heavy data disks, and set caching to None for write-heavy disks such as database logs. Test queue depth and striping across multiple data disks for DBs.
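When striping data disks, the aggregate is bounded by the VM size's own disk limits, so adding disks stops helping past a certain point. The sketch below shows that arithmetic; all numbers passed in are placeholders you would replace with the published limits for your disk tier and VM size.

```python
def striped_limits(disk_count: int, iops_per_disk: int, mbps_per_disk: int,
                   vm_iops_cap: int, vm_mbps_cap: int) -> tuple:
    """Effective (IOPS, MB/s) of a stripe set, capped by the VM-level limits."""
    total_iops = disk_count * iops_per_disk
    total_mbps = disk_count * mbps_per_disk
    return min(total_iops, vm_iops_cap), min(total_mbps, vm_mbps_cap)


# Placeholder figures: four disks at 5,000 IOPS / 200 MB/s each,
# on a VM size assumed to cap at 12,800 IOPS / 288 MB/s.
print(striped_limits(4, 5000, 200, 12800, 288))  # (12800, 288)
```

In this example the VM cap, not the disks, is the bottleneck, which is a signal to pick a larger VM size rather than more disks.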
Networking Integration
- Place VMs in a Virtual Network (VNet) with well-planned subnets and Network Security Groups (NSG).
- Expose public endpoints via Azure Load Balancer or Application Gateway (WAF for Layer 7).
- Use VPN Gateway or ExpressRoute for hybrid connectivity.
- Centralize outbound with NAT Gateway; log flows to Log Analytics.
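Subnet planning is easy to prototype before touching Azure. The sketch below carves a VNet range into equal subnets with Python's standard `ipaddress` module and subtracts the five addresses Azure reserves in every subnet; the address range and tier names are example values.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # Azure reserves five addresses in each subnet


def plan_subnets(vnet_cidr: str, names: list, new_prefix: int) -> dict:
    """Return {name: (cidr, usable_ip_count)} for consecutive equal subnets."""
    vnet = ipaddress.ip_network(vnet_cidr)
    plan = {}
    for name, subnet in zip(names, vnet.subnets(new_prefix=new_prefix)):
        usable = subnet.num_addresses - AZURE_RESERVED_PER_SUBNET
        plan[name] = (str(subnet), usable)
    return plan


for name, (cidr, usable) in plan_subnets("10.10.0.0/16", ["web", "app", "data"], 24).items():
    print(f"{name}: {cidr} ({usable} usable)")
```

Leaving unallocated space in the VNet range makes it easier to add subnets for Bastion, gateways, or private endpoints later.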
Security & Compliance
- Microsoft Defender for Cloud for posture and threat protection.
- Azure Disk Encryption (ADE) or server-side encryption at rest.
- Just-in-time (JIT) VM access to reduce RDP/SSH exposure.
- Restrict with NSGs, Privileged Identity Management (PIM) and RBAC.
- Back compliance with Azure Policy assignments and landing zones; Azure Blueprints is on a deprecation path, so prefer Template Specs or Deployment Stacks for new work.
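NSG rules are evaluated in priority order, lowest number first, and the first match decides the outcome. The sketch below mimics that for a single destination port so you can sanity-check a rule set on paper; the `Rule` class and `evaluate` function are hypothetical helpers, and real NSGs also match on source, destination, and protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int          # lower number = evaluated first
    port: Optional[int]    # None matches any port
    access: str            # "Allow" or "Deny"


def evaluate(rules: list, port: int) -> str:
    """Return the access of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port is None or rule.port == port:
            return rule.access
    return "Deny"  # mirrors the built-in deny-all inbound default


rules = [
    Rule("allow-any", 300, None, "Allow"),
    Rule("allow-https", 100, 443, "Allow"),
    Rule("deny-rdp", 200, 3389, "Deny"),
]
print(evaluate(rules, 3389))  # Deny
print(evaluate(rules, 8080))  # Allow
```

The RDP deny wins over the broader allow only because its priority number is lower, which is the kind of ordering mistake worth checking before deployment.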
Identity & Access Control
Grant access via Microsoft Entra ID (formerly Azure AD) roles and Azure RBAC. Use Managed Identities for apps running on VMs to access Azure services without secrets.
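On a VM with a managed identity, code obtains tokens from the Instance Metadata Service (IMDS) instead of storing a secret. The sketch below builds that request with the standard library only; the helper names are hypothetical, `fetch_token` works only when run on an Azure VM with an identity assigned, and in production you would more likely use the Azure Identity SDK.

```python
import json
import urllib.parse
import urllib.request

IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(resource: str = "https://management.azure.com/") -> urllib.request.Request:
    """Build the IMDS token request; the Metadata header is mandatory."""
    query = urllib.parse.urlencode({"api-version": "2018-02-01", "resource": resource})
    return urllib.request.Request(f"{IMDS_TOKEN_URL}?{query}", headers={"Metadata": "true"})


def fetch_token(resource: str = "https://management.azure.com/") -> str:
    """Call IMDS and return the access token (only reachable from inside the VM)."""
    with urllib.request.urlopen(build_token_request(resource), timeout=5) as resp:
        return json.load(resp)["access_token"]


request = build_token_request("https://vault.azure.net")
print(request.full_url)
```

Because the endpoint is link-local, the request never leaves the host, and no credential appears in code or configuration.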
Automation & IaC
- Portal for one-offs; Azure CLI or PowerShell for scripts.
- Repeatability with ARM/Bicep or Terraform.
- Use VM Extensions for configuration, diagnostics, and agents.
Monitoring & Insights
- Collect metrics/logs via Azure Monitor and the Azure Monitor Agent (AMA); the legacy Log Analytics agent has been retired.
- Enable boot diagnostics and serial console for early boot issues.
- Alert on CPU, memory (via guest), disk queue, packet drops, and heartbeat.
Backup & Disaster Recovery
- Protect with Azure Backup (policy-based, app-consistent snapshots).
- Replicate with Azure Site Recovery (cross-zone/region failover).
- Test restores and failovers regularly; document RPO/RTO.
Cost Models & Optimization
- Pay-as-you-go: On-demand flexibility for spiky workloads.
- Reserved Instances (1/3 years): Big savings for steady state.
- Spot VMs: Lowest price, eviction-tolerant jobs.
- Azure Hybrid Benefit: Reuse Windows/SQL licenses.
- Right-size based on Azure Advisor and performance data.
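Whether a reservation beats pay-as-you-go depends on how many hours the VM actually runs. The sketch below works out the break-even point; the hourly rate and the 40% discount are hypothetical numbers, so substitute figures from the pricing calculator for your SKU and region.

```python
HOURS_PER_MONTH = 730  # common billing approximation for one month


def payg_monthly(hourly_rate: float, hours_running: float) -> float:
    """Pay-as-you-go bills only for the hours the VM is allocated."""
    return hourly_rate * hours_running


def reserved_monthly(hourly_rate: float, discount: float) -> float:
    """A reservation bills for every hour of the term, used or not."""
    return hourly_rate * (1 - discount) * HOURS_PER_MONTH


def break_even_hours(discount: float) -> float:
    """Monthly running hours above which the reservation is cheaper."""
    return HOURS_PER_MONTH * (1 - discount)


rate, discount = 0.20, 0.40  # hypothetical rate (per hour) and discount
print(round(break_even_hours(discount)))  # 438
```

A VM that auto-shuts down outside business hours may sit below the break-even line, in which case pay-as-you-go plus schedules can be the cheaper combination.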
Custom Images & Azure Compute Gallery
Bake gold images with security baselines and agents. Replicate them via Azure Compute Gallery (the renamed Shared Image Gallery) for consistent regional deployments and staged rollouts.
DevTest Labs Integration
Azure DevTest Labs lets teams quickly create controlled environments with quotas, auto-shutdown, and reusable formulas to keep experimentation affordable.
Extensions & Agents
- Custom Script Extension for post-provision config.
- Dependency Agent / Azure Monitor Agent for telemetry.
- Antimalware, DSC, and third-party agents.
Hybrid & Multi-Cloud
Manage servers anywhere with Azure Arc, which brings consistent governance, policy, and monitoring to on-premises machines and other clouds. Extend cloud services to your datacenter with the Azure Stack family.
Migration Support
Azure Migrate discovers on-prem workloads, assesses readiness and cost, and orchestrates agentless or agent-based migration with minimal downtime. Combine with change freezes and DR rehearsals to cut risk.
Future-Ready Capabilities
- Confidential computing with hardware-based TEEs for sensitive data.
- GPU and FPGA acceleration for AI/ML or real-time analytics.
- ARM-based VMs for efficient scale-out apps.
Production Blueprint: A Proven Azure VM Architecture
- Landing Zone: Hub-spoke with centralized shared services (DNS, Firewall, Bastion, Azure Monitor).
- Network: Spoke VNets per app; subnets for web/app/data; NSGs + ASGs; private endpoints.
- Compute: VMSS for stateless tiers; single VMs with clustering for stateful DBs as needed.
- Storage: Premium SSD v2/Ultra for DB; OS disks separate from data; write-ahead logs isolated.
- Security: JIT, Defender for Cloud, PIM, Key Vault; Azure Policy guardrails.
- Reliability: Availability Zones, zone-redundant LB, backup, ASR to paired region.
- Observability: End-to-end logging, distributed tracing, SLO dashboards and alerts.
- Cost: Reservations/Hybrid Benefit; dev/test subscriptions; auto-shutdown; schedules.
Troubleshooting Playbook (Copy-Paste-Ready)
1) Quick Health Snapshot (PowerShell)
# Requires Az PowerShell (Connect-AzAccount first)
$sub = (Get-AzSubscription | Out-GridView -Title 'Pick subscription' -PassThru).Id
Select-AzSubscription -SubscriptionId $sub
$rg = Read-Host "Resource Group"
$vm = Read-Host "VM Name"
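# Basic state, size, zone (the -Status view carries power state; the model carries size and zones)
$model = Get-AzVM -Name $vm -ResourceGroupName $rg
$view = Get-AzVM -Name $vm -ResourceGroupName $rg -Status
$power = ($view.Statuses | Where-Object Code -like 'PowerState/*').DisplayStatus
$model | Select-Object Name, ResourceGroupName, Location,
  @{n='PowerState';e={$power}},
  @{n='VmSize';e={$_.HardwareProfile.VmSize}}, Zones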
# Public IP and private IPs
$nicId = (Get-AzVM -Name $vm -ResourceGroupName $rg).NetworkProfile.NetworkInterfaces[0].Id
$nic = Get-AzNetworkInterface -ResourceId $nicId
$pubIp = if ($nic.IpConfigurations.PublicIpAddress.Id) { Get-AzPublicIpAddress -ResourceId $nic.IpConfigurations.PublicIpAddress.Id }
$privIps = $nic.IpConfigurations.PrivateIpAddress
$pubIp | Select-Object Name, IpAddress
$privIps
2) Boot Diagnostics & Serial Console
# Enable boot diagnostics to a storage account (if not enabled)
$sa = Get-AzStorageAccount | Out-GridView -Title 'Pick storage for boot diag' -PassThru
$vmObj = Get-AzVM -ResourceGroupName $rg -Name $vm
Set-AzVMBootDiagnostic -VM $vmObj -Enable -ResourceGroupName $sa.ResourceGroupName -StorageAccountName $sa.StorageAccountName
Update-AzVM -ResourceGroupName $rg -VM $vmObj
# Download the latest boot screenshot and serial log (use -Linux instead of -Windows for Linux VMs)
Get-AzVMBootDiagnosticsData -ResourceGroupName $rg -Name $vm -Windows -LocalPath $env:TEMP
Write-Host "Saved boot diagnostics to $env:TEMP"
3) Reset Access (Windows & Linux)
# Reset the admin credentials via the VMAccess extension
# (this cmdlet targets Windows VMs; Linux VMs use the VMAccessForLinux extension instead)
$cred = Get-Credential -Message "New admin username and password"
Set-AzVMAccessExtension -ResourceGroupName $rg -VMName $vm `
-Location (Get-AzVM -ResourceGroupName $rg -Name $vm).Location `
-Credential $cred `
-Name "VMAccessAgent" -TypeHandlerVersion "2.4"
4) Repair a Broken OS Disk (Automated Rescue)
# The automated rescue flow ships as the Azure CLI "vm-repair" extension (Az PowerShell has no Repair-AzVM cmdlet)
az extension add --name vm-repair
# Create a repair VM and attach a copy of the broken OS disk
az vm repair create -g <RG> -n <VM> --repair-username <ADMIN_USER> --repair-password '<PASSWORD>' --verbose
# List the available fix-up scripts, then run the one you need with "az vm repair run"
az vm repair list-scripts
# Swap the repaired disk back onto the original VM and clean up the repair resources
az vm repair restore -g <RG> -n <VM> --verbose
5) Effective NSG Rules & Port Tests
# Show effective NSG rules on NIC
Get-AzEffectiveNetworkSecurityGroup -NetworkInterfaceName $nic.Name -ResourceGroupName $rg
# Test reachability from the VM (Run Command)
$cmd = "Test-NetConnection -ComputerName www.bing.com -Port 443"
Invoke-AzVMRunCommand -ResourceGroupName $rg -Name $vm -CommandId RunPowerShellScript -ScriptString $cmd
6) Resize/Change Disk Performance On the Fly
# Grow the first data disk to at least 512 GiB (disks can grow but not shrink; some resizes need a deallocated VM)
$vmObj = Get-AzVM -Name $vm -ResourceGroupName $rg
$diskName = $vmObj.StorageProfile.DataDisks[0].Name
$disk = Get-AzDisk -ResourceGroupName $rg -DiskName $diskName
$disk.DiskSizeGB = [Math]::Max($disk.DiskSizeGB, 512)
# For Premium SSD v2/Ultra, performance is tuned separately via DiskIOPSReadWrite and DiskMBpsReadWrite
Update-AzDisk -ResourceGroupName $rg -DiskName $diskName -Disk $disk
7) Azure CLI One-Liners
# Login and target subscription
az login
az account set --subscription "<SUBSCRIPTION_NAME_OR_ID>"
# Get status quickly
az vm get-instance-view -g <RG> -n <VM> --query "instanceView.statuses[].displayStatus" -o tsv
# Run command to check disk queue length (Windows example)
az vm run-command invoke -g <RG> -n <VM> --command-id RunPowerShellScript --scripts "typeperf -sc 3 \"\\LogicalDisk(_Total)\\Avg. Disk Queue Length\""
8) Azure Resource Graph (ARG) – Fleet-wide Queries
# Requires Az.ResourceGraph
# All running VMs without backup
# Single-quoted here-string so PowerShell does not try to expand anything inside the query
Search-AzGraph -Query @'
resources
| where type =~ 'microsoft.compute/virtualmachines'
| extend power = tostring(properties.extended.instanceView.powerState.code)
| where power has 'running'
| extend vmId = tolower(id)
| join kind=leftouter (
    recoveryservicesresources
    | where type =~ 'microsoft.recoveryservices/vaults/backupfabrics/protectioncontainers/protecteditems'
    | extend vmId = tolower(tostring(properties.sourceResourceId))
    | project vmId, protectedItem = name
) on vmId
| where isempty(protectedItem)
| project id, name, resourceGroup, location, power
'@
# VMs missing critical tags (e.g., Owner, Environment)
Search-AzGraph -Query @"
resources
| where type == 'microsoft.compute/virtualmachines'
| where isnull(tags['Owner']) or isnull(tags['Environment'])
| project name, resourceGroup, location, tags
"@
9) Kusto (Log Analytics) – Performance Hotspots
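// VM heartbeat gaps (possible reboots/outages): largest gap between consecutive heartbeats per computer
Heartbeat
| where TimeGenerated > ago(24h)
| order by Computer asc, TimeGenerated asc
| extend gap = iff(Computer == prev(Computer), TimeGenerated - prev(TimeGenerated), timespan(null))
| summarize maxGap = max(gap) by Computer
| top 50 by maxGap desc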
// CPU & memory pressure
Perf
| where ObjectName in ("Processor","Memory")
| where CounterName in ("% Processor Time","Available MBytes")
| summarize avgValue = avg(CounterValue) by Computer, CounterName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
10) ARM/Compute REST – Scriptable Checks (via PowerShell)
# Call ARM REST for VM instance view (no Microsoft Graph needed for VM metadata)
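$token = (Get-AzAccessToken -ResourceUrl "https://management.azure.com").Token
# Newer Az.Accounts versions return the token as a SecureString; convert it for the header
if ($token -is [securestring]) { $token = [System.Net.NetworkCredential]::new('', $token).Password }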
$subId = (Get-AzContext).Subscription.Id
$uri = "https://management.azure.com/subscriptions/$subId/resourceGroups/$rg/providers/Microsoft.Compute/virtualMachines/$vm/instanceView?api-version=2024-03-01"
$resp = Invoke-RestMethod -Uri $uri -Headers @{Authorization = "Bearer $token"} -Method GET
$resp.statuses | Format-Table code, displayStatus, time
11) Common Fix Patterns
- Boot loops: attach OS disk to helper VM; disable faulty drivers/services; rebuild bootloader; reattach.
- RDP/SSH failures: reset credentials with VMAccess; verify NSG, route table, and effective rules; use Bastion.
- High IO wait: check disk cache mode; spread across multiple data disks; consider Premium v2 or Ultra.
- App slowness: inspect guest metrics, GC settings, TLS offload, and Application Insights.
- Unexpected reboots: review the activity log, host maintenance windows, Update Manager schedules, and Resource Health events.
Operational Checklists
Security Hardening (Weekly)
- Ensure JIT access is on; remove dangling public IPs.
- Rotate local admin creds; verify PIM assignment durations.
- Update guest OS and agent extensions; audit installed software.
Reliability (Monthly)
Cost (Bi-weekly)
- Review Advisor right-size and idle VM recommendations.
- Validate reservation utilization and spot eviction rates.
FAQ – Quick Answers
What’s the simplest HA setup?
Use two or more VMs spread across different Availability Zones behind a zone-redundant load balancer.
Do I need Premium SSD for every VM?
No. Pick disks by workload: for DBs or high-IOPS apps, Premium/Ultra pays off, while dev/test can use Standard tiers.
How do I secure RDP/SSH?
Prefer Azure Bastion + JIT, restrict NSG rules, and log connections.
VMSS vs. single VM?
VMSS for stateless scale-out and autoscaling; single/clustered VMs for certain stateful or licensed workloads.
Where do I see performance?
Azure Monitor, guest metrics, and Log Analytics workbooks; add custom metrics if needed.
Sample IaC (Bicep) – Zone-Redundant VMSS with Premium Disks
param location string = resourceGroup().location
param vmSku string = 'Standard_D4s_v5'
param instanceCount int = 2
param adminUser string
@secure()
param adminPassword string
resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2024-03-01' = {
  name: 'web-vmss'
  location: location
  zones: ['1', '2', '3']
  sku: {
    name: vmSku
    capacity: instanceCount
    tier: 'Standard'
  }
  properties: {
    // Rolling upgrades need a load balancer health probe or the Application Health extension
    upgradePolicy: {
      mode: 'Rolling'
      rollingUpgradePolicy: {
        maxBatchInstancePercent: 20
        maxUnhealthyInstancePercent: 20
        pauseTimeBetweenBatches: 'PT5M'
      }
    }
    virtualMachineProfile: {
      osProfile: {
        computerNamePrefix: 'web'
        adminUsername: adminUser
        adminPassword: adminPassword
      }
      storageProfile: {
        // Assumed image for illustration (the original omitted one); swap in your own marketplace or gallery image
        imageReference: {
          publisher: 'MicrosoftWindowsServer'
          offer: 'WindowsServer'
          sku: '2022-datacenter-azure-edition'
          version: 'latest'
        }
        osDisk: {
          createOption: 'FromImage'
          caching: 'ReadWrite'
          managedDisk: { storageAccountType: 'Premium_LRS' }
        }
        dataDisks: [
          {
            lun: 0
            createOption: 'Empty'
            diskSizeGB: 256
            managedDisk: { storageAccountType: 'Premium_LRS' }
            caching: 'ReadOnly'
          }
        ]
      }
      networkProfile: {
        networkInterfaceConfigurations: [
          {
            name: 'nic'
            properties: {
              primary: true
              ipConfigurations: [
                {
                  name: 'ipconfig'
                  properties: {
                    subnet: {
                      id: '/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>'
                    }
                    loadBalancerBackendAddressPools: [
                      { id: '/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Network/loadBalancers/<lb>/backendAddressPools/<pool>' }
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    }
  }
}
Governance & Policy Guardrails
- Require tags (Owner, CostCenter, Environment) via Azure Policy.
- Block public IP creation by default; enforce JIT on VMs.
- Deny non-compliant disk types in prod subscriptions.
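Before assigning a tag policy in deny mode, it can be useful to dry-run the rule against an inventory export. The sketch below checks a resource's tags against the three required names listed above; the function name is hypothetical, and treating an empty tag value as missing is an assumption of this example rather than Azure Policy behavior.

```python
REQUIRED_TAGS = ("Owner", "CostCenter", "Environment")


def missing_tags(tags, required=REQUIRED_TAGS) -> list:
    """Return the required tag names that are absent or empty.

    Names are compared case-insensitively; empty values count as missing
    (an assumption made for this sketch).
    """
    present = {name.lower() for name, value in (tags or {}).items() if value}
    return [name for name in required if name.lower() not in present]


print(missing_tags({"owner": "ana", "Environment": "prod"}))  # ['CostCenter']
```

Running a check like this over exported inventory shows how many resources a deny policy would block before you enforce it.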
Performance Tuning Cheatsheet
- Prefer accelerated networking on supported SKUs/NICs.
- Right-size NUMA, vCPU counts, and memory for JVM/.NET GC behavior.
- Use Application Gateway with WAF and keep-alive tuning for high RPS workloads.
- Pin critical OS updates to maintenance windows; leverage rolling upgrades.
End-to-End Example: Lift-and-Shift a Line-of-Business App
- Discover with Azure Migrate and capture dependencies.
- Design the target topology (hub-spoke, subnets, NSGs, Bastion, LB).
- Size with perf baselines; choose Premium SSD for DBs.
- Migrate in waves; validate data integrity.
- Harden (JIT, Defender, PIM, Key Vault, private endpoints).
- Optimize with reservations, schedules, and autoscale policies.
- Operate with SLO dashboards, alerts, and monthly DR tests.
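The "migrate in waves" step can be derived from the dependency map captured during discovery. This sketch groups servers into waves with Python's standard `graphlib`, assuming that a server's dependencies should land before the server itself; the server names are made up and the ordering policy is a design choice, not something Azure Migrate prescribes.

```python
from graphlib import TopologicalSorter


def plan_waves(depends_on: dict) -> list:
    """Group servers into waves so each wave only depends on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything with no pending dependency
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency map: each key depends on the servers in its set
dependencies = {"web": {"app"}, "app": {"db", "cache"}, "db": set(), "cache": set()}
print(plan_waves(dependencies))  # [['cache', 'db'], ['app'], ['web']]
```

Circular dependencies surface as an exception, which is a useful prompt to migrate those servers together in a single wave.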
Copy-Ready Runbooks (Windows & Linux)
Restart Stuck Extensions
# List and remove a problematic extension then reinstall
Get-AzVMExtension -ResourceGroupName $rg -VMName $vm
Remove-AzVMExtension -ResourceGroupName $rg -VMName $vm -Name "MicrosoftMonitoringAgent" -Force
# Reinstall as needed...
Collect Sysinfo from Many VMs (Parallel)
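$vms = Get-AzVM -Status | Where-Object { $_.PowerState -match 'running' }
# Separate scripts per OS: Run Command uses a shell on Linux and PowerShell on Windows
$linuxScript = "echo '---'; uname -a; df -h; uptime"
$windowsScript = "Get-PSDrive -PSProvider FileSystem; (Get-CimInstance Win32_OperatingSystem).LastBootUpTime"
# Start one background job per VM so the calls run in parallel
$jobs = foreach ($v in $vms) {
  $isLinux = $v.StorageProfile.OsDisk.OsType -eq 'Linux'
  $commandId = if ($isLinux) { 'RunShellScript' } else { 'RunPowerShellScript' }
  $script = if ($isLinux) { $linuxScript } else { $windowsScript }
  Invoke-AzVMRunCommand -ResourceGroupName $v.ResourceGroupName -VMName $v.Name `
    -CommandId $commandId -ScriptString $script -AsJob
}
# Collect output; failed jobs surface as errors instead of being silently swallowed
$jobs | Wait-Job | Receive-Job | ForEach-Object { $_.Value.Message }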
When to Use PaaS Instead of VMs
If you don’t need OS control, consider Azure App Service, AKS, or Azure SQL to reduce ops overhead, gain auto-patching, and scale more easily.
Key Takeaways
- Pick VM family by workload, not habit.
- Design for zones and automate everything.
- Instrument first; tune with data, then reserve for savings.
- Use the troubleshooting scripts above to shorten MTTR.












