Cloud Knowledge

Your Go-To Hub for Cloud Solutions & Insights

Azure VMs explained: sizing, HA, storage, security, costs—with PowerShell & API troubleshooting

Azure Virtual Machines (VMs): The Complete, Practical Guide for Apps, Databases, and Dev/Test

Azure Virtual Machines (VMs): End-to-End Guide for Real-World Deployments

From picking the right size and disk to high availability, security, cost control, and battle-tested troubleshooting.

Azure VMs | VM Scale Sets | Availability Zones | Premium SSD | Azure Monitor | Azure Backup | Azure Site Recovery | Azure Resource Graph

Definition & Purpose

Azure Virtual Machines (VMs) are scalable, on-demand compute resources in Microsoft Azure that let you deploy Windows or Linux servers without managing physical hardware. You control the OS, disks, and networking, while Azure provides the fabric for elasticity, security, and global availability.

Typical scenarios include hosting enterprise web apps, databases, API backends, batch jobs, and dev/test environments where you need full OS-level control.

Primary Use Cases

Web & API Hosting

Databases & Analytics

  • Self-managed SQL Server, MySQL, PostgreSQL, or NoSQL engines with fine-grained disk layout.
  • Attach Premium SSD or Ultra Disk for predictable latency.

Dev/Test

GPU/AI & Graphics

Operating System Support

Azure supports Windows Server and major Linux distributions such as Ubuntu, Red Hat, SUSE, Debian, and CentOS (legacy), plus custom images you import from on-premises. Use Azure Compute Gallery (formerly Shared Image Gallery) to replicate standardized images across regions.

VM Sizes & Families

Choose the right size for your workload. You can resize later if telemetry suggests a different fit.

Family     Best For              Notes
B-series   Low-cost, burstable   Credit-based CPU; great for dev, small services.
D-series   General purpose       Balanced CPU/memory; common for app servers.
E-series   Memory-optimized      In-memory databases, analytics engines.
F-series   Compute-optimized     High CPU workloads, microservices, batch.
M-series   Massive memory        SAP HANA, large in-memory apps.
N-series   GPU/AI/Graphics       Training, inference, rendering, visualization.

Tip: For production, prefer the current-generation SKUs (v5/v6) in your region for better perf/price.
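
As a rough illustration of reading the table above, here is a minimal Python sketch that maps a few workload traits to a family. The function name and every threshold (memory per vCPU, average CPU) are illustrative assumptions, not Azure guidance; real sizing should come from telemetry.

```python
# Hypothetical helper: thresholds below are illustrative assumptions only.
def suggest_family(needs_gpu: bool, mem_gib_per_vcpu: float, avg_cpu_pct: float) -> str:
    """Return a VM family name from the table, based on coarse workload traits."""
    if needs_gpu:
        return "N-series"          # GPU/AI/graphics
    if mem_gib_per_vcpu > 16:
        return "M-series"          # massive memory
    if mem_gib_per_vcpu > 4:
        return "E-series"          # memory-optimized
    if avg_cpu_pct < 20:
        return "B-series"          # mostly idle, burstable
    if mem_gib_per_vcpu <= 2:
        return "F-series"          # compute-optimized
    return "D-series"              # balanced general purpose


if __name__ == "__main__":
    print(suggest_family(needs_gpu=False, mem_gib_per_vcpu=8, avg_cpu_pct=55))
```

Treat the result as a starting point for a shortlist, then confirm against measured CPU, memory, and disk numbers.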

Scaling Patterns

Vertical Scaling

Change the size of a single VM for quick resource adjustments. Resizing restarts a running VM, and if the target size is not available on the current hardware cluster you must stop (deallocate) the VM first.

Horizontal Scaling with Virtual Machine Scale Sets (VMSS)

  • Uniform or Flexible orchestration for rolling upgrades and mixed sizes.
  • Autoscale via CPU, memory (with custom metrics), or queue length signals.
  • Integrate with Azure Load Balancer or Application Gateway.
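
The autoscale idea in the list above boils down to a threshold rule clamped between a minimum and maximum instance count. The sketch below shows that logic in Python; the function name, thresholds, and step size are assumptions for illustration and do not reflect how the Azure autoscale engine is implemented.

```python
# Hypothetical threshold-based autoscale rule (all defaults are assumptions).
def desired_capacity(current: int, cpu_pct: float, min_count: int = 2,
                     max_count: int = 10, scale_out_at: float = 70.0,
                     scale_in_at: float = 30.0, step: int = 1) -> int:
    """Return the instance count after applying one scale-out/scale-in decision."""
    if cpu_pct >= scale_out_at:
        target = current + step      # under pressure: add instances
    elif cpu_pct <= scale_in_at:
        target = current - step      # idle: remove instances
    else:
        target = current             # inside the dead band: do nothing
    # Never go below the minimum or above the maximum.
    return max(min_count, min(max_count, target))


if __name__ == "__main__":
    print(desired_capacity(current=4, cpu_pct=85.0))
```

Keeping a gap between the scale-out and scale-in thresholds avoids flapping when the metric hovers around a single value.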

High Availability (HA)

Disks, Throughput & Performance

Disk Type      Use Case                       IOPS/Latency Characteristics
Standard HDD   Lowest cost, dev/test          Higher latency; avoid for transactional DBs.
Standard SSD   Balanced general-purpose       Better latency than HDD; good default for non-critical apps.
Premium SSD    Production apps, DBs           Consistent perf; plan for host cache and queue depths.
Ultra Disk     Mission-critical low-latency   Provision IOPS/MBps per disk; ideal for top-tier DB workloads.

Host caching: OS disks default to ReadWrite; use ReadOnly for read-heavy data disks and None for write-heavy disks such as database logs. For databases, test queue depth and striping across multiple data disks.
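
Striping helps only until the VM's own disk limits become the bottleneck. The following Python sketch estimates the usable IOPS and throughput of a striped set; the function name is hypothetical and the numbers in the example are placeholders, so look up the real per-disk and per-VM limits for your SKUs.

```python
# Hypothetical estimate: striped disks add up until the VM-level cap is hit.
def striped_limits(disk_iops: int, disk_mbps: int, disk_count: int,
                   vm_iops_cap: int, vm_mbps_cap: int) -> tuple:
    """Return (usable_iops, usable_mbps) for `disk_count` identical striped disks."""
    total_iops = disk_iops * disk_count
    total_mbps = disk_mbps * disk_count
    # The VM size imposes its own ceiling regardless of how many disks are attached.
    return min(total_iops, vm_iops_cap), min(total_mbps, vm_mbps_cap)


if __name__ == "__main__":
    # Placeholder numbers: four 5000 IOPS / 200 MBps disks on a VM capped at 12800 IOPS / 288 MBps.
    print(striped_limits(5000, 200, 4, 12800, 288))  # (12800, 288)
```

If the result equals the VM cap, adding more disks will not help; a larger VM size is the next lever.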

Networking Integration

Security & Compliance

Identity & Access Control

Grant access via Entra ID (Azure AD) roles and Azure RBAC. Use Managed Identities for apps running on VMs to access Azure services without secrets.

Automation & IaC

Monitoring & Insights

  • Collect metrics/logs via Azure Monitor and the Azure Monitor Agent (AMA), which replaces the retired Log Analytics agent.
  • Enable boot diagnostics and serial console for early boot issues.
  • Alert on CPU, memory (via guest), disk queue, packet drops, and heartbeat.

Backup & Disaster Recovery

  • Protect with Azure Backup (policy-based, app-consistent snapshots).
  • Replicate with Azure Site Recovery (cross-zone/region failover).
  • Test restores and failovers regularly; document RPO/RTO.
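
Documenting RPO/RTO is easier when the targets are checked against measured values. Here is a minimal Python sketch, assuming that worst-case data loss equals the interval between backups and that the restore time comes from your own drills; the function and field names are hypothetical.

```python
from datetime import timedelta


# Hypothetical check: compares a backup schedule and a measured restore time to targets.
def check_recovery_targets(backup_interval: timedelta, measured_restore: timedelta,
                           rpo: timedelta, rto: timedelta) -> dict:
    """Return which recovery objectives are met."""
    return {
        # Worst case, everything since the last backup is lost.
        "rpo_met": backup_interval <= rpo,
        # The restore must finish within the recovery time objective.
        "rto_met": measured_restore <= rto,
    }


if __name__ == "__main__":
    result = check_recovery_targets(
        backup_interval=timedelta(hours=24),
        measured_restore=timedelta(hours=2),
        rpo=timedelta(hours=4),
        rto=timedelta(hours=4),
    )
    print(result)  # {'rpo_met': False, 'rto_met': True}
```

A daily backup cannot satisfy a four-hour RPO, which is the kind of gap a restore drill should surface before an incident does.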

Cost Models & Optimization

Custom Images & Azure Compute Gallery

Bake gold images with security baselines and agents. Replicate them via Azure Compute Gallery (formerly Shared Image Gallery) for consistent regional deployments and staged rollouts.

DevTest Labs Integration

Azure DevTest Labs lets teams quickly create controlled environments with quotas, auto-shutdown, and reusable formulas to keep experimentation affordable.

Extensions & Agents

  • Custom Script Extension for post-provision config.
  • Dependency Agent / Azure Monitor Agent for telemetry.
  • Antimalware, DSC, and third-party agents.

Hybrid & Multi-Cloud

Manage servers anywhere with Azure Arc. Bring governance, policy, and monitoring consistency to on-prem and other clouds. Extend cloud services to your datacenter with the Azure Stack family.

Migration Support

Azure Migrate discovers on-prem workloads, assesses readiness and cost, and orchestrates agentless or agent-based migration with minimal downtime. Combine with change freezes and DR rehearsals to cut risk.

Future-Ready Capabilities

  • Confidential computing with hardware-based TEEs for sensitive data.
  • GPU and FPGA acceleration for AI/ML or real-time analytics.
  • ARM-based VMs for efficient scale-out apps.

Production Blueprint: A Proven Azure VM Architecture

  1. Landing Zone: Hub-spoke with centralized shared services (DNS, Firewall, Bastion, Azure Monitor).
  2. Network: Spoke VNets per app; subnets for web/app/data; NSGs + ASGs; private endpoints.
  3. Compute: VMSS for stateless tiers; single VMs with clustering for stateful DBs as needed.
  4. Storage: Premium SSD v2/Ultra for DB; OS disks separate from data; write-ahead logs isolated.
  5. Security: JIT, Defender for Cloud, PIM, Key Vault; Azure Policy guardrails.
  6. Reliability: Availability Zones, zone-redundant LB, backup, ASR to paired region.
  7. Observability: End-to-end logging, distributed tracing, SLO dashboards and alerts.
  8. Cost: Reservations/Hybrid Benefit; dev/test subscriptions; auto-shutdown; schedules.

Troubleshooting Playbook (Copy-Paste-Ready)

1) Quick Health Snapshot (PowerShell)

# Requires Az PowerShell (Connect-AzAccount first)
$sub = (Get-AzSubscription | Out-GridView -Title 'Pick subscription' -PassThru).Id
Select-AzSubscription -SubscriptionId $sub

$rg = Read-Host "Resource Group"
$vm = Read-Host "VM Name"

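# Basic state, size, zone (the VM model and the instance view are separate calls)
$vmModel  = Get-AzVM -Name $vm -ResourceGroupName $rg
$vmStatus = Get-AzVM -Name $vm -ResourceGroupName $rg -Status
[pscustomobject]@{
  Name       = $vmModel.Name
  Location   = $vmModel.Location
  PowerState = ($vmStatus.Statuses | Where-Object Code -like 'PowerState/*').DisplayStatus
  VmSize     = $vmModel.HardwareProfile.VmSize
  Zones      = $vmModel.Zones -join ','
}

# Public and private IPs of the first NIC
$nic = Get-AzNetworkInterface -ResourceId $vmModel.NetworkProfile.NetworkInterfaces[0].Id
$privIps = $nic.IpConfigurations.PrivateIpAddress
$pubIps = foreach ($id in $nic.IpConfigurations.PublicIpAddress.Id) {
  # Resource ID layout: /subscriptions/<sub>/resourceGroups/<rg>/providers/.../<name>
  $parts = $id -split '/'
  Get-AzPublicIpAddress -ResourceGroupName $parts[4] -Name $parts[-1]
}
$pubIps | Select-Object Name, IpAddress
$privIps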

2) Boot Diagnostics & Serial Console

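# Enable boot diagnostics to a storage account (if not enabled)
$sa = Get-AzStorageAccount | Out-GridView -Title 'Pick storage for boot diag' -PassThru
$vmObj = Get-AzVM -Name $vm -ResourceGroupName $rg
# Assumption: -ResourceGroupName is where the cmdlet looks up the storage account; confirm with Get-Help
Set-AzVMBootDiagnostic -VM $vmObj -Enable -ResourceGroupName $sa.ResourceGroupName -StorageAccountName $sa.StorageAccountName
Update-AzVM -VM $vmObj -ResourceGroupName $rg

# Download boot diagnostics to a local folder (-Windows saves the console screenshot; use -Linux for the serial log)
Get-AzVMBootDiagnosticsData -ResourceGroupName $rg -Name $vm -Windows -LocalPath $env:TEMP
Write-Host "Saved boot diagnostics to $env:TEMP"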

3) Reset Access (Windows & Linux)

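# Reset the local admin password/RDP config on a Windows VM via the VMAccess extension
$cred = Get-Credential -Message "Admin username and new password"
Set-AzVMAccessExtension -ResourceGroupName $rg -VMName $vm `
  -Location (Get-AzVM -ResourceGroupName $rg -Name $vm).Location `
  -Credential $cred `
  -Name "VMAccessAgent"
# Linux VMs: reset a password or SSH key with the CLI instead, for example
# az vm user update -g <RG> -n <VM> -u <USER> --ssh-key-value "<PUBLIC_KEY>"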

4) Repair a Broken OS Disk (Automated Rescue)

# Automated rescue with the Azure CLI vm-repair extension: creates a repair VM,
# attaches a copy of the broken OS disk, runs a fix-up script, then swaps the repaired disk back
az extension add --name vm-repair
az vm repair create -g <RG> -n <VM> --repair-username <USER> --repair-password '<PASSWORD>' --verbose
az vm repair list-scripts                 # pick a run ID that matches the failure
az vm repair run -g <RG> -n <VM> --run-on-repair --run-id <RUN_ID> --verbose
# Review the output, then swap the fixed disk back onto the original VM:
az vm repair restore -g <RG> -n <VM> --verbose

5) Effective NSG Rules & Port Tests

# Show effective NSG rules on the NIC (the VM must be running)
Get-AzEffectiveNetworkSecurityGroup -NetworkInterfaceName $nic.Name -ResourceGroupName $nic.ResourceGroupName

# Test outbound reachability from inside a Windows VM (Run Command)
$cmd = "Test-NetConnection -ComputerName www.bing.com -Port 443"
$result = Invoke-AzVMRunCommand -ResourceGroupName $rg -Name $vm -CommandId RunPowerShellScript -ScriptString $cmd
$result.Value[0].Message
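
When reading effective rules, it helps to remember that NSG rules are evaluated in ascending priority order and the first match decides. The Python sketch below models only that ordering; it is deliberately simplified (port matching only, no source, destination, or protocol, and only the platform's final DenyAllInBound default), and the `Rule` class and rule names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """Simplified inbound rule: a port of None matches any port."""
    name: str
    priority: int
    port: Optional[int]
    access: str  # "Allow" or "Deny"


def evaluate_inbound(rules: Iterable[Rule], port: int) -> Tuple[str, str]:
    """Return (access, rule_name) of the first rule matching `port`, lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port is None or rule.port == port:
            return rule.access, rule.name
    # Nothing matched: fall through to the default deny.
    return "Deny", "DenyAllInBound"


if __name__ == "__main__":
    rules = [
        Rule("deny-everything", 4096, None, "Deny"),
        Rule("allow-https", 100, 443, "Allow"),
    ]
    print(evaluate_inbound(rules, 443))   # ('Allow', 'allow-https')
    print(evaluate_inbound(rules, 3389))  # ('Deny', 'deny-everything')
```

A broad deny at a low priority number will shadow any allow rule placed after it, which is a common cause of unexpected RDP/SSH failures.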

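6) Resize a Data Disk / Change Disk Performance

# Grow the first data disk to at least 512 GiB (disks can grow but not shrink)
$vmObj = Get-AzVM -Name $vm -ResourceGroupName $rg
$diskName = $vmObj.StorageProfile.DataDisks[0].Name
$disk = Get-AzDisk -ResourceGroupName $rg -DiskName $diskName
$disk.DiskSizeGB = [Math]::Max($disk.DiskSizeGB, 512)
# Premium SSD v2/Ultra: IOPS and throughput are set separately, e.g. $disk.DiskIOPSReadWrite / $disk.DiskMBpsReadWrite
Update-AzDisk -ResourceGroupName $rg -DiskName $diskName -Disk $disk
# Some disk types and sizes only accept a resize while the VM is deallocated; extend the volume in the guest afterwards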

7) Azure CLI One-Liners

# Login and target subscription
az login
az account set --subscription "<SUBSCRIPTION_NAME_OR_ID>"

# Get status quickly
az vm get-instance-view -g <RG> -n <VM> --query "instanceView.statuses[].displayStatus" -o tsv

# Run command to check disk queue length (Windows example)
az vm run-command invoke -g <RG> -n <VM> --command-id RunPowerShellScript --scripts "typeperf -sc 3 \"\\LogicalDisk(_Total)\\Avg. Disk Queue Length\""

8) Azure Resource Graph (ARG) – Fleet-wide Queries

# Requires Az.ResourceGraph
# Single-quoted here-strings keep PowerShell from expanding anything inside the query
# All running VMs without an Azure Backup protected item
Search-AzGraph -Query @'
resources
| where type =~ 'microsoft.compute/virtualmachines'
| extend power = tostring(properties.extended.instanceView.powerState.code)
| where power has 'running'
| extend vmId = tolower(id)
| join kind=leftouter (
    recoveryservicesresources
    | where type =~ 'microsoft.recoveryservices/vaults/backupfabrics/protectioncontainers/protecteditems'
    | extend vmId = tolower(tostring(properties.sourceResourceId))
    | project vmId, protectedItem = name
  ) on vmId
| where isempty(protectedItem)
| project id, name, resourceGroup, location, power
'@

# VMs missing critical tags (e.g., Owner, Environment)
Search-AzGraph -Query @'
resources
| where type =~ 'microsoft.compute/virtualmachines'
| where isempty(tostring(tags['Owner'])) or isempty(tostring(tags['Environment']))
| project name, resourceGroup, location, tags
'@

9) Kusto (Log Analytics) – Performance Hotspots

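// VM heartbeat gaps (possible reboots/outages): largest gap per computer in the last 24 hours
Heartbeat
| where TimeGenerated > ago(24h)
| order by Computer asc, TimeGenerated asc
| extend gap = iff(Computer == prev(Computer), TimeGenerated - prev(TimeGenerated), timespan(null))
| summarize maxGap = max(gap) by Computer
| top 50 by maxGap desc

// CPU & memory pressure (Windows counter names)
Perf
| where TimeGenerated > ago(1h)
| where ObjectName in ("Processor","Memory")
| where CounterName in ("% Processor Time","Available MBytes")
| summarize avgValue = avg(CounterValue) by Computer, CounterName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc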
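
The heartbeat-gap idea can also be applied offline, for example to rows exported from Log Analytics. The Python sketch below computes the largest gap between consecutive heartbeats per computer; the function name and the `(computer, timestamp)` row shape are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def max_gap_per_computer(rows: Iterable[Tuple[str, datetime]]) -> Dict[str, timedelta]:
    """Return the largest interval between consecutive heartbeats for each computer."""
    by_computer = defaultdict(list)
    for computer, stamp in rows:
        by_computer[computer].append(stamp)

    result = {}
    for computer, stamps in by_computer.items():
        stamps.sort()
        # Pair each heartbeat with the next one and measure the distance.
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if gaps:  # a single heartbeat has no gap to report
            result[computer] = max(gaps)
    return result


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0)
    rows = [
        ("vm1", base),
        ("vm1", base + timedelta(minutes=1)),
        ("vm1", base + timedelta(minutes=11)),
        ("vm2", base),
    ]
    print(max_gap_per_computer(rows))
```

Gaps much longer than the normal heartbeat interval are the ones worth correlating with reboots or outages.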

10) ARM/Compute REST – Scriptable Checks (via PowerShell)

# Call ARM REST for the VM instance view (no Microsoft Graph needed for VM metadata)
# Invoke-AzRestMethod reuses the current Az context for authentication, so no manual token handling
$subId = (Get-AzContext).Subscription.Id
$path = "/subscriptions/$subId/resourceGroups/$rg/providers/Microsoft.Compute/virtualMachines/$vm/instanceView?api-version=2024-03-01"
$resp = Invoke-AzRestMethod -Path $path -Method GET
($resp.Content | ConvertFrom-Json).statuses | Format-Table code, displayStatus, time

11) Common Fix Patterns

  • Boot loops: attach OS disk to helper VM; disable faulty drivers/services; rebuild bootloader; reattach.
  • RDP/SSH failures: reset credentials with VMAccess; verify NSG, route table, and effective rules; use Bastion.
  • High IO wait: check disk cache mode; spread across multiple data disks; consider Premium v2 or Ultra.
  • App slowness: inspect guest metrics, GC settings, TLS offload, and Application Insights.
  • Unexpected reboots: review the activity log, Resource Health events, host maintenance windows, and Update Manager schedules.

Operational Checklists

Security Hardening (Weekly)

  • Ensure JIT access is on; remove dangling public IPs.
  • Rotate local admin creds; verify PIM assignment durations.
  • Update guest OS and agent extensions; audit installed software.

Reliability (Monthly)

  • DR drill: test ASR failover and Backup restores.
  • Zone capacity check and reservation coverage.

Cost (Bi-weekly)

  • Review Advisor right-size and idle VM recommendations.
  • Validate reservation utilization and spot eviction rates.
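
Reservation utilization matters because a reservation only pays off above a break-even usage level. Here is a minimal Python sketch of that arithmetic; the function names and the hourly prices in the example are hypothetical placeholders, not real Azure rates.

```python
# Hypothetical arithmetic: prices are placeholders, not Azure list prices.
def break_even_utilization(payg_hourly: float, reserved_hourly: float) -> float:
    """Fraction of hours a VM must run for the reservation to match pay-as-you-go cost."""
    if payg_hourly <= 0:
        raise ValueError("payg_hourly must be positive")
    return reserved_hourly / payg_hourly


def reservation_saves_money(payg_hourly: float, reserved_hourly: float,
                            expected_utilization: float) -> bool:
    """True when expected utilization (0.0 to 1.0) is above the break-even point."""
    return expected_utilization > break_even_utilization(payg_hourly, reserved_hourly)


if __name__ == "__main__":
    # Placeholder rates: 0.25 per hour on demand, 0.125 per hour effective when reserved.
    print(break_even_utilization(0.25, 0.125))
    print(reservation_saves_money(0.25, 0.125, expected_utilization=0.8))
```

A dev VM that auto-shuts down every night may sit below break-even, in which case schedules beat reservations.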

FAQ – Quick Answers

What’s the simplest HA setup? Run two or more VMs spread across different Availability Zones behind a zone-redundant Standard Load Balancer.

Do I need Premium SSD for every VM? No—pick disks by workload. For DBs or high IOPS apps, Premium/Ultra pays off; dev/test can use Standard tiers.

How do I secure RDP/SSH? Prefer Azure Bastion + JIT, restrict NSG rules, and log connections.

VMSS vs. single VM? VMSS for stateless scale-out and autoscaling; single/clustered VMs for certain stateful or licensed workloads.

Where do I see performance? Azure Monitor, guest metrics, and Log Analytics workbooks; add custom metrics if needed.

Sample IaC (Bicep) – Zone-Redundant VMSS with Premium Disks

param location string = resourceGroup().location
param vmSku string = 'Standard_D4s_v5'
param instanceCount int = 2
param adminUser string
@secure()
param adminPassword string

resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2024-03-01' = {
  name: 'web-vmss'
  location: location
  zones: ['1','2','3']
  sku: {
    name: vmSku
    capacity: instanceCount
    tier: 'Standard'
  }
  properties: {
    upgradePolicy: {
      mode: 'Rolling'
      rollingUpgradePolicy: {
        maxBatchInstancePercent: 20
        maxUnhealthyInstancePercent: 20
        pauseTimeBetweenBatches: 'PT5M'
      }
    }
    virtualMachineProfile: {
      osProfile: {
        computerNamePrefix: 'web'
        adminUsername: adminUser
        adminPassword: adminPassword
      }
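      storageProfile: {
        // Assumption: Ubuntu 22.04 LTS marketplace image; swap in your own image or gallery reference
        imageReference: {
          publisher: 'Canonical'
          offer: '0001-com-ubuntu-server-jammy'
          sku: '22_04-lts-gen2'
          version: 'latest'
        }
        osDisk: {
          createOption: 'FromImage'
          caching: 'ReadWrite'
          managedDisk: { storageAccountType: 'Premium_LRS' }
        }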
        dataDisks: [
          {
            lun: 0
            createOption: 'Empty'
            diskSizeGB: 256
            managedDisk: { storageAccountType: 'Premium_LRS' }
            caching: 'ReadOnly'
          }
        ]
      }
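      networkProfile: {
        // Rolling upgrades need a health signal: a load balancer probe (as here) or the Application Health extension
        healthProbe: {
          id: '/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Network/loadBalancers/<lb>/probes/<probe>'
        }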
        networkInterfaceConfigurations: [
          {
            name: 'nic'
            properties: {
              primary: true
              ipConfigurations: [
                {
                  name: 'ipconfig'
                  properties: {
                    subnet: {
                      id: '/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>'
                    }
                    loadBalancerBackendAddressPools: [
                      { id: '/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Network/loadBalancers/<lb>/backendAddressPools/<pool>' }
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    }
  }
}
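
To get a feel for what maxBatchInstancePercent and pauseTimeBetweenBatches in the template above imply, the Python sketch below estimates batch size, batch count, and the minimum time spent pausing. The rounding rule (round down, but always upgrade at least one instance per batch) is an assumption made for illustration; the platform's exact batching behavior may differ, and the function name is hypothetical.

```python
import math
from datetime import timedelta


# Assumption: batch size rounds down to a whole instance, with a floor of one instance.
def rolling_upgrade_plan(instance_count: int, max_batch_percent: int = 20,
                         pause: timedelta = timedelta(minutes=5)):
    """Return (batch_size, batch_count, minimum_total_pause) for a rolling upgrade."""
    if instance_count <= 0:
        raise ValueError("instance_count must be positive")
    batch_size = max(1, (instance_count * max_batch_percent) // 100)
    batch_count = math.ceil(instance_count / batch_size)
    # Pauses happen between batches, so there is one fewer pause than batches.
    minimum_total_pause = pause * (batch_count - 1)
    return batch_size, batch_count, minimum_total_pause


if __name__ == "__main__":
    print(rolling_upgrade_plan(10))
```

The estimate covers only the pauses; the time each batch needs to upgrade and report healthy comes on top.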

Governance & Policy Guardrails

  • Require tags (Owner, CostCenter, Environment) via Azure Policy.
  • Block public IP creation by default; enforce JIT on VMs.
  • Deny non-compliant disk types in prod subscriptions.
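
The required-tags guardrail can be prototyped locally before it is encoded as policy. The Python sketch below reports which required tags are missing or blank on a resource's tag dictionary; the function name is hypothetical, and the sketch matches tag names exactly, which is a simplification.

```python
from typing import Dict, List, Optional, Sequence

# Tag names taken from the guardrail list above.
REQUIRED_TAGS = ("Owner", "CostCenter", "Environment")


def missing_tags(tags: Optional[Dict[str, Optional[str]]],
                 required: Sequence[str] = REQUIRED_TAGS) -> List[str]:
    """Return the required tag names that are absent or have an empty value."""
    tags = tags or {}
    missing = []
    for name in required:
        value = tags.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_tags({"Owner": "team-a", "Environment": ""}))
    # ['CostCenter', 'Environment']
```

Feeding this function with exported inventory gives a quick list of resources to fix before a deny policy starts blocking deployments.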

Performance Tuning Cheatsheet

  • Prefer accelerated networking on supported SKUs/NICs.
  • Right-size NUMA, vCPU counts, and memory for JVM/.NET GC behavior.
  • Use Application Gateway with WAF and keep-alive tuning for high RPS workloads.
  • Pin critical OS updates to maintenance windows; leverage rolling upgrades.

End-to-End Example: Lift-and-Shift a Line-of-Business App

  1. Discover with Azure Migrate and capture dependencies.
  2. Design the target topology (hub-spoke, subnets, NSGs, Bastion, LB).
  3. Size with perf baselines; choose Premium SSD for DBs.
  4. Migrate in waves; validate data integrity.
  5. Harden (JIT, Defender, PIM, Key Vault, private endpoints).
  6. Optimize with reservations, schedules, and autoscale policies.
  7. Operate with SLO dashboards, alerts, and monthly DR tests.

Copy-Ready Runbooks (Windows & Linux)

Restart Stuck Extensions

# List and remove a problematic extension then reinstall
Get-AzVMExtension -ResourceGroupName $rg -VMName $vm
Remove-AzVMExtension -ResourceGroupName $rg -VMName $vm -Name "MicrosoftMonitoringAgent" -Force
# Reinstall as needed...

Collect Sysinfo from Many VMs (Parallel)

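# Linux VMs only; for Windows, use -CommandId RunPowerShellScript with PowerShell commands
$vms = Get-AzVM -Status | Where-Object {
  $_.PowerState -match 'running' -and $_.StorageProfile.OsDisk.OsType -eq 'Linux'
}
$script = @'
echo '---'
uname -a
df -h
uptime
'@
# -AsJob starts each Run Command as a background job so the VMs are queried in parallel
$jobs = foreach ($v in $vms) {
  Invoke-AzVMRunCommand -ResourceGroupName $v.ResourceGroupName -Name $v.Name `
    -CommandId RunShellScript -ScriptString $script -AsJob
}
$jobs | Wait-Job | Receive-Job | ForEach-Object { $_.Value[0].Message }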

When to Use PaaS Instead of VMs

If you don’t need OS control, consider Azure App Service, AKS, or Azure SQL to reduce ops overhead, gain auto-patching, and scale more easily.

Key Takeaways

  • Pick VM family by workload, not habit.
  • Design for zones and automate everything.
  • Instrument first; tune with data, then reserve for savings.
  • Use the troubleshooting scripts above to shorten MTTR.
