Azure Virtual Machines: Autoscaling with VMSS, Custom Images, and Hybrid Management via Azure Arc

Definition: Azure Virtual Machines are on-demand, scalable computing resources in Microsoft Azure, giving you full control over operating systems, applications, storage, and networking in an Infrastructure-as-a-Service (IaaS) model.

This guide explains how to operate Azure Virtual Machines at scale using Virtual Machine Scale Sets (VMSS) for autoscaling, how to standardize builds with custom images, and how to extend consistent governance to on-premises and multi-cloud servers with Azure Arc. You’ll also find example PowerShell and CLI snippets, troubleshooting playbooks, and 30 FAQs (10 each for autoscaling, custom images, and Azure Arc), alongside key points for adjacent VM capabilities: availability, networking, backup and disaster recovery, monitoring, identity and access, DevOps, IaC, performance, and cost optimization.

Wide Range of VM Sizes & Families

Azure offers VM families for diverse workloads—General Purpose (Dsv5/Dv5), Compute Optimized (Fsv2), Memory Optimized (Ebdsv5, M-series), Storage Optimized (Lsv3), GPU (NV/NC/ND), and HPC (HBv4/HX). Match the SKU to your workload’s CPU-to-memory ratio, storage throughput, and acceleration needs like SR-IOV with accelerated networking.

Key points: Right-size using observed performance metrics; consider Premium SSD or Ultra Disk for IOPS; prefer Gen2 and Trusted Launch for security.
Use placement in Availability Zones when latency and HA require zonal spread.
For databases and analytics, validate throughput (MB/s), IOPS, and egress costs up front.
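
When validating throughput and IOPS targets, the two are linked by the I/O size: throughput = IOPS × I/O size. The following is a minimal sketch of that arithmetic; the function names and sample numbers are illustrative assumptions, not Azure disk limits.

```python
def throughput_mib_per_s(iops: int, io_size_kib: int) -> float:
    """Return throughput in MiB/s for a given IOPS rate and I/O size in KiB."""
    return iops * io_size_kib / 1024


def iops_needed(target_mib_per_s: float, io_size_kib: int) -> float:
    """Return the IOPS required to sustain a throughput target at a given I/O size."""
    return target_mib_per_s * 1024 / io_size_kib


# Hypothetical workload: 5,000 IOPS at 8 KiB per I/O.
print(throughput_mib_per_s(5000, 8))
# Hypothetical target: 200 MiB/s at 64 KiB per I/O.
print(iops_needed(200, 64))
```

Compare both results against the per-disk and per-VM limits of the SKU you are considering, since whichever limit is lower becomes the ceiling.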

Autoscaling with Virtual Machine Scale Sets (VMSS)

VMSS adds and removes VM instances automatically based on metrics or schedules. Scale on CPU %, memory (via custom metrics), queue length, or HTTP latency. With Orchestration Modes (Uniform vs. Flexible), you can choose golden-image-based uniformity or instance diversity (e.g., mixed SKUs, stateful workloads).

Quick Start: Create a VMSS (Azure CLI)

# Create resource group and VMSS (Uniform)
az group create -n rg-vmss-demo -l eastus
az vmss create \
  -g rg-vmss-demo -n vmss-web \
  --image Ubuntu2204 \
  --orchestration-mode Uniform \
  --upgrade-policy-mode automatic \
  --admin-username azureuser --generate-ssh-keys \
  --instance-count 2 --vm-sku Standard_D2s_v5

# Attach autoscale rule on CPU > 60% (scale out) and < 30% (scale in)
az monitor autoscale create \
  --resource-group rg-vmss-demo \
  --resource vmss-web --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name vmss-web-autoscale --min-count 2 --max-count 10 --count 2

az monitor autoscale rule create \
  --resource-group rg-vmss-demo --autoscale-name vmss-web-autoscale \
  --scale out 1 --condition "Percentage CPU > 60 avg 10m"

az monitor autoscale rule create \
  --resource-group rg-vmss-demo --autoscale-name vmss-web-autoscale \
  --scale in 1 --condition "Percentage CPU < 30 avg 15m"
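
The two rules above can be read as a small decision function. The sketch below is a simplified model for reasoning about thresholds, not the actual autoscale engine: it assumes one CPU sample per minute, ignores cooldown periods, and uses a hypothetical function name.

```python
from statistics import mean

MIN_COUNT, MAX_COUNT = 2, 10  # same bounds as the autoscale setting above


def decide_capacity(current: int, cpu_samples: list[float]) -> int:
    """Return the new instance count given one-minute CPU samples (newest last)."""
    # Scale out by 1 when the 10-minute average exceeds 60%.
    if len(cpu_samples) >= 10 and mean(cpu_samples[-10:]) > 60:
        return min(current + 1, MAX_COUNT)
    # Scale in by 1 when the 15-minute average is below 30%.
    if len(cpu_samples) >= 15 and mean(cpu_samples[-15:]) < 30:
        return max(current - 1, MIN_COUNT)
    # Between the thresholds, or at a bound, capacity stays put.
    return current


print(decide_capacity(2, [70.0] * 10))   # sustained high CPU
print(decide_capacity(3, [20.0] * 15))   # sustained low CPU
```

Keeping a wide gap between the scale-out and scale-in thresholds, as here, leaves a band in which capacity does not change.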

Troubleshooting Autoscale (PowerShell)

# Requires Az PowerShell
Connect-AzAccount
$rg  = "rg-vmss-demo"
$vmss = "vmss-web"

# 1) Inspect VMSS health and instances
Get-AzVmss -ResourceGroupName $rg -VMScaleSetName $vmss | Format-List Name,UpgradePolicy,OrchestrationMode
Get-AzVmssVM -ResourceGroupName $rg -VMScaleSetName $vmss | Select-Object Name,InstanceId,ProvisioningState,LatestModelApplied

# 2) Check autoscale settings and recent operations
# Property names may differ by Az.Monitor version (Profiles/Rules in older releases, Profile/Rule in newer ones)
(Get-AzAutoscaleSetting -ResourceGroupName $rg -Name "vmss-web-autoscale").Profiles.Rules
Get-AzActivityLog -ResourceGroupName $rg -MaxRecord 50 |
  Select-Object -First 20 -Property EventTimestamp,OperationName,Status,Caller | Format-Table -AutoSize

# 3) Validate metric pipeline (CPU)
$vmssId = (Get-AzVmss -ResourceGroupName $rg -VMScaleSetName $vmss).Id
$metric = Get-AzMetric -ResourceId $vmssId -MetricName "Percentage CPU" -TimeGrain 00:05:00 -DetailedOutput
$metric.Data | Sort-Object TimeStamp -Descending | Select-Object -First 10 TimeStamp,Average

Operational Best Practices

Use Application Health Extension or custom health probes for graceful scale-in.
Prefer Flexible mode for heterogeneous SKUs or stateful sets; use Uniform for golden-image fleets.
Pin baseline capacity with minimum instances; cap cost with max instances and spot quotas.
Model scale with schedule profiles (weekday vs. weekend) plus metric-based rules.
Automate deployments with Bicep/Terraform to keep configs drift-free.
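
Combining schedule profiles with metric rules amounts to picking the active profile first and then clamping the metric-driven count to that profile's bounds. A minimal sketch, assuming hypothetical profile names and instance counts:

```python
from datetime import datetime

# Hypothetical profiles: (minimum, default, maximum) instance counts.
PROFILES = {
    "weekday": (4, 4, 20),
    "weekend": (2, 2, 6),
}


def active_profile(now: datetime) -> str:
    """Pick the profile for a timestamp; Saturday and Sunday count as 'weekend'."""
    return "weekend" if now.weekday() >= 5 else "weekday"


def clamp_capacity(desired: int, now: datetime) -> int:
    """Clamp a metric-driven desired count to the active profile's min/max."""
    low, _default, high = PROFILES[active_profile(now)]
    return max(low, min(desired, high))


saturday = datetime(2024, 1, 6)
wednesday = datetime(2024, 1, 3)
print(clamp_capacity(10, saturday))   # capped by the weekend maximum
print(clamp_capacity(1, wednesday))   # raised to the weekday minimum
```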

FAQs — Autoscaling with VMSS (10)

Can VMSS scale on memory? Yes, via custom metrics pushed to Azure Monitor (e.g., Telegraf/AMA) and autoscale rules on those metrics.
Uniform vs. Flexible? Uniform = identical instances from a common model; Flexible = heterogeneous instances, availability set semantics, better for stateful or mixed SKUs.
Rolling upgrades? Use automatic or rolling policy; set health/timeout thresholds and max surge for safe rollouts.
Graceful drain on scale-in? Use health probes and terminate notifications (delivered through Scheduled Events) to run drain scripts; take the instance out of load balancer rotation before it is deleted.
Scale based on queue length? Yes—emit custom metrics from Service Bus/Storage Queue and set thresholds.
Spot VMs with VMSS? Supported; define eviction policy and priority mixing to cut costs.
Per-zone scaling? Create zonal VMSS or use multiple VMSS per zone for fine-grained control.
Blue/green? Run dual VMSS behind a gateway; flip routing when healthy.
Warm instances? Keep a buffer (min instances) to absorb bursts without cold-start penalties.
Autoscale didn’t trigger? Check metric ingestion delay, evaluation period, rule scope, and activity logs.

Custom Images: Standardize Your VM Fleet

Build golden images with your OS hardening, agents (Defender, AMA), and baseline apps. Use Azure Compute Gallery (formerly Shared Image Gallery) to version and replicate images across regions and tenants. Bake with Packer, pipeline with Azure DevOps or GitHub Actions.

Packer + Azure CLI: Create & Publish an Image Version

# Create a gallery and image definition (gallery names cannot contain hyphens)
az group create -n rg-img -l westus2
az sig create -g rg-img -r acgdemo
az sig image-definition create -g rg-img -r acgdemo -i win2019-web \
  --os-type Windows --publisher Contoso --offer WebStack --sku WS2019

# After Packer builds a managed image 'img-win2019-web', version and replicate it
az sig image-version create -g rg-img -r acgdemo -i win2019-web -e 1.0.0 \
  --managed-image img-win2019-web \
  --target-regions "westus2=2" "eastus=2"

Troubleshooting Image Deployments (PowerShell)

Connect-AzAccount
$rg = "rg-img"; $gallery = "acgdemo"; $imgDef = "win2019-web"; $version = "1.0.0"

# 1) Verify gallery replication status
Get-AzGalleryImageVersion -ResourceGroupName $rg -GalleryName $gallery -GalleryImageDefinitionName $imgDef -Name $version |
  Select-Object Name,ProvisioningState,PublishingProfile

# 2) Validate VM creation from the version
$vmrg="rg-workload"; $vmname="vm-from-image"
New-AzResourceGroup -Name $vmrg -Location "eastus" -ErrorAction SilentlyContinue
New-AzVM -ResourceGroupName $vmrg -Name $vmname -Location "eastus" `
  -ImageName "/subscriptions/<subId>/resourceGroups/$rg/providers/Microsoft.Compute/galleries/$gallery/images/$imgDef/versions/$version" `
  -Size "Standard_D2s_v5" -Credential (Get-Credential)

# 3) Health check the VM post-provision
Get-AzVM -ResourceGroupName $vmrg -Name $vmname -Status | Select-Object -ExpandProperty Statuses

Best Practices for Golden Images

Use Gen2, Trusted Launch, Secure Boot, and disk encryption for hardened baselines.
Version images semantically (MAJOR.MINOR.PATCH) and expire old versions safely.
Keep images slim; install app payload at deploy time via extensions for faster patch cycles.
Replicate to active regions near workloads to cut deployment latency.
Record SBOM and CIS hardening diffs for auditability.
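
Semantic versions must be compared numerically, not as strings, or "1.10.0" sorts before "1.9.3". The sketch below shows one way to pick the latest version and list candidates for expiry; the helper names and the keep-two policy are assumptions for illustration.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints so versions sort numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def latest(versions: list[str]) -> str:
    """Return the highest version."""
    return max(versions, key=parse_version)


def to_expire(versions: list[str], keep: int = 2) -> list[str]:
    """Return versions older than the newest `keep`, oldest first."""
    ordered = sorted(versions, key=parse_version)
    return ordered[:-keep] if keep > 0 else ordered


versions = ["1.0.0", "1.2.0", "1.10.0", "1.9.3"]
print(latest(versions))      # 1.10.0
print(to_expire(versions))   # ['1.0.0', '1.2.0']
```

Keeping at least one prior version available is what makes a rollback possible when a new image misbehaves.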

FAQs — Custom Images (10)

Managed image vs. Compute Gallery? Use the gallery for versioning, replication, and scale; managed image is single-region.
How to keep images patched? Rebuild on Patch Tuesday cadence; use pipeline triggers and image templates.
License portability? Check base OS and marketplace terms; BYOL differs by publisher.
Can I mix extensions with golden images? Yes—keep images minimal; inject agents/secrets at deploy time.
Replicate across tenants? Use community/shared access features of the gallery with RBAC.
Rollback image? Pin VMSS to prior version; use staged rings to test before global rollout.
Secure secrets in bake? Use Key Vault for Packer variables and provisioning scripts.
Custom kernel/driver? Validate Gen2 compatibility, SR-IOV, and signed drivers for Secure Boot.
Image size too big? Remove caches/temp, use DISM/waagent clean, compress where safe.
Detect drift? Use Azure Policy and guest configuration baselines.

Hybrid and Multi-Cloud Management with Azure Arc

Azure Arc projects on-premises and multi-cloud servers into Azure as resources, enabling centralized inventory, tags, policy, RBAC, Update Manager, Defender for Cloud, and Azure Monitor. Use it alongside Azure Stack HCI and VPN/ExpressRoute for hybrid coherence.

Onboard Servers to Arc (PowerShell + Scripted)

# Connect and create a service principal for at-scale onboarding (least privilege)
Connect-AzAccount
$rg="rg-arc"; $loc="eastus"
New-AzResourceGroup -Name $rg -Location $loc -ErrorAction SilentlyContinue

# Current Az.Resources releases generate the client secret; it is readable only on this returned object
$sp = New-AzADServicePrincipal -DisplayName "sp-arc-onboard"
New-AzRoleAssignment -ObjectId $sp.Id -RoleDefinitionName "Azure Connected Machine Onboarding" -ResourceGroupName $rg

# Windows server onboarding (example)
$tenantId = (Get-AzContext).Tenant.Id
$subId = (Get-AzContext).Subscription.Id
$spId = $sp.AppId
$spSecret = $sp.PasswordCredentials.SecretText   # keep in a vault; do not log or commit it

# Download the Windows agent MSI and wait for the silent install to finish
Invoke-WebRequest -Uri https://aka.ms/AzureConnectedMachineAgent -OutFile azcmagent.msi
Start-Process msiexec.exe -ArgumentList '/i','azcmagent.msi','/l*v','azcmagent.log','/qn' -Wait

# Tags are passed as a comma-delimited list
& "C:\Program Files\AzureConnectedMachineAgent\azcmagent.exe" connect `
  --service-principal-id $spId `
  --service-principal-secret $spSecret `
  --resource-group $rg --tenant-id $tenantId --location $loc --subscription-id $subId `
  --tags "env=prod,platform=vmware"

Policy, Updates, and Defender (at Scale)

Apply Azure Policy initiatives to Arc servers for configuration drift control.
Standardize patching with Azure Update Manager schedules and maintenance windows.
Onboard to Microsoft Defender for Cloud for vulnerability and threat protection.
Centralize logs via Azure Monitor Agent (AMA) and Log Analytics.

Troubleshooting Arc Connectivity

# On the server:
azcmagent show
azcmagent check
azcmagent logs

# In Azure:
# Verify resource projection and policy compliance
Get-AzConnectedMachine -ResourceGroupName $rg | Select-Object Name,Status,Location,OSName,LastStatusChange
Get-AzPolicyState -Filter "ResourceGroup eq '$rg'" | Select-Object -First 20 -Property Timestamp,ComplianceState,PolicyDefinitionName
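
Once the machine list is exported, flagging servers that have stopped reporting is a simple filter on status and last status change. A minimal sketch, assuming hypothetical records shaped like the Name, Status, and LastStatusChange columns above and an arbitrary 7-day cutoff:

```python
from datetime import datetime, timedelta, timezone


def stale_machines(machines: list[dict], now: datetime, max_age_days: int = 7) -> list[str]:
    """Return names of machines that are not Connected and changed status before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        m["name"]
        for m in machines
        if m["status"] != "Connected" and m["last_status_change"] < cutoff
    ]


# Hypothetical fleet; names and dates are made up for illustration.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
fleet = [
    {"name": "srv-a", "status": "Connected", "last_status_change": now - timedelta(days=40)},
    {"name": "srv-b", "status": "Disconnected", "last_status_change": now - timedelta(days=10)},
    {"name": "srv-c", "status": "Disconnected", "last_status_change": now - timedelta(days=1)},
]
print(stale_machines(fleet, now))   # ['srv-b']
```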

FAQs — Azure Arc (10)

Which servers can I onboard? Windows/Linux across on-prem, VMware, AWS, GCP—treated as Azure resources.
Do Arc servers need public IPs? No; outbound access to Azure endpoints is sufficient (via proxy if needed).
How is RBAC applied? Through projected resources; use least-privilege roles and scoped assignments.
Policy on non-Azure? Yes—guest configuration and initiatives evaluate Arc servers the same way.
Patch orchestration? Use Update Manager for schedules, pre/post scripts, and maintenance windows.
Billing? Core-based billing for certain capabilities; basic resource projection is free.
Agent conflicts? Consolidate to AMA; remove legacy agents where required.
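Disconnected sites? Arc-enabled servers have no separate offline mode: the agent needs periodic outbound connectivity and resumes reporting when the link returns, and a machine that stays disconnected for 45 days may be marked Expired.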
Security posture? Leverage Defender for Cloud, auto-provisioning, and just-in-time access.
Inventory & tags? Use tags and Azure Resource Graph for fleet queries.

High Availability: Availability Sets & Zones

Protect against rack and datacenter failures using Availability Sets (fault/update domains) and Availability Zones (physically separate facilities). For tiered apps, distribute instances across zones and use zonal services for data/state.

Key points: Design for retry logic; use zone-redundant load balancers; plan cross-zone data replication.
For VMSS, prefer zonal VMSS in multi-AZ regions to meet RTO/RPO needs.
Test zone failure scenarios with chaos experiments to validate resilience.
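
Zonal spread can be checked with simple arithmetic: distribute instances evenly, then ask whether losing the fullest zone still leaves enough capacity. The sketch below is an illustrative model with hypothetical function names; it does not call any Azure API.

```python
def spread_across_zones(instance_count: int, zones: list[str]) -> dict[str, int]:
    """Distribute instances as evenly as possible; earlier zones take the remainder."""
    base, extra = divmod(instance_count, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def survives_zone_loss(placement: dict[str, int], required: int) -> bool:
    """True if losing the most populated zone still leaves `required` instances."""
    return sum(placement.values()) - max(placement.values()) >= required


placement = spread_across_zones(7, ["1", "2", "3"])
print(placement)                                   # {'1': 3, '2': 2, '3': 2}
print(survives_zone_loss(placement, required=4))   # True
```

If the check fails, raise the minimum instance count until the remaining zones can carry the required load on their own.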

Operating Systems & Azure Marketplace

Run Windows, Linux, or custom distros. Speed delivery with Azure Marketplace images (Windows Server, Ubuntu, RHEL, Oracle, hardened builds).

Key points: Verify publisher terms; ensure Secure Boot & Trusted Launch compatibility.
Accept Marketplace plan terms in automation before deploying (for example, az vm image terms accept).
Pin versions for deterministic builds; test upgrades in rings.

Secure Networking with VNets, NSGs, and Private Access

Place VMs in Virtual Networks (VNets), segment with subnets, control traffic with Network Security Groups (NSGs), and use private endpoints to reach PaaS services. Integrate site-to-site with VPN or ExpressRoute.

Key points: Enable Accelerated Networking where supported; model inbound with Azure Firewall or WAF.
Use Just-in-Time (JIT) VM access instead of public RDP/SSH.
Baseline with Azure Policy to block noncompliant network configurations.
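
NSG rules are processed in priority order, lowest number first, and evaluation stops at the first match. The sketch below models only that ordering; it is a deliberate simplification that assumes single ports, exact-match source strings instead of CIDR ranges, and a lone default deny, and the Rule class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int   # lower number is evaluated first
    port: int       # destination port (single port for simplicity)
    source: str     # source address, or "*" for any
    access: str     # "Allow" or "Deny"


def evaluate(rules: list[Rule], source: str, port: int) -> str:
    """Return the access of the first matching rule by priority, else 'Deny'."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port and rule.source in ("*", source):
            return rule.access
    return "Deny"  # stands in for the default deny-all inbound rule


rules = [
    Rule(priority=200, port=22, source="*", access="Deny"),
    Rule(priority=100, port=22, source="10.0.0.4", access="Allow"),
    Rule(priority=300, port=443, source="*", access="Allow"),
]
print(evaluate(rules, "10.0.0.4", 22))      # Allow: priority 100 matches first
print(evaluate(rules, "203.0.113.9", 22))   # Deny: falls through to priority 200
```

The example shows why a narrow allow rule needs a lower priority number than the broad deny it is meant to override.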

Disaster Recovery and Backup

Protect workloads with Azure Backup for point-in-time restore and Azure Site Recovery (ASR) for region-to-region replication and failover drills.

Key points: Define RPO/RTO; test failovers; ensure identity, DNS, and secrets are included in DR plans.
Tag tiers and retention for automated governance.
Encrypt backups and restrict restore scope with RBAC.
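
An RPO target can be checked against real recovery points by measuring the largest gap between consecutive points. A minimal sketch, with hypothetical timestamps and function names:

```python
from datetime import datetime, timedelta


def worst_gap(recovery_points: list[datetime]) -> timedelta:
    """Return the largest interval between consecutive recovery points."""
    ordered = sorted(recovery_points)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=timedelta(0))


def meets_rpo(recovery_points: list[datetime], rpo: timedelta) -> bool:
    """True if no gap between consecutive recovery points exceeds the RPO."""
    return worst_gap(recovery_points) <= rpo


points = [
    datetime(2024, 5, 1, 0, 0),
    datetime(2024, 5, 1, 4, 0),
    datetime(2024, 5, 1, 13, 0),  # 9-hour gap after the previous point
]
print(worst_gap(points))                            # 9:00:00
print(meets_rpo(points, rpo=timedelta(hours=6)))    # False
```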

Monitoring, Logs, and Diagnostics

Use Azure Monitor, Log Analytics, and Application Insights for visibility. Collect guest and platform metrics, set alerts, and trace deployments using activity logs.

Key points: Standardize data collection with Data Collection Rules (DCRs).
Correlate autoscale events with app latency/throughput to validate capacity.
Use Workbooks for fleet visualizations.

Kusto Queries for VMSS & VM Health

// CPU & scale correlation (InsightsMetrics is populated by VM insights)
InsightsMetrics
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize avgCPU = avg(Val) by bin(TimeGenerated, 5m), Computer
| join kind=leftouter (
  AzureActivity
  | where OperationNameValue contains "autoscalesettings" or OperationNameValue =~ "Microsoft.Compute/virtualMachineScaleSets/write"
  | summarize eventCount = count() by bin(TimeGenerated, 5m)
) on TimeGenerated
| order by TimeGenerated desc

Identity & Access with Microsoft Entra ID and RBAC

Control access via Microsoft Entra ID and Azure RBAC. Use managed identities for apps instead of embedded secrets, and restrict elevation with PIM.

Key points: Map least privilege roles; isolate prod/non-prod subscriptions; enforce MFA and Conditional Access.
Enable system-assigned or user-assigned managed identities for VM apps.
Audit with activity logs and Access Reviews.

Microsoft Graph: Inventory VM Role Assignments

# Microsoft Graph PowerShell covers Entra directory roles (requires appropriate scopes); connect only if you need those
Connect-MgGraph -Scopes "RoleManagement.Read.Directory","Directory.Read.All"
# Azure RBAC assignments are served by Azure Resource Manager, so query them with Az PowerShell or Azure Resource Graph.
# Example with Az PowerShell for a VM (run Connect-AzAccount first):
$rg = "rg-workload"; $vm = "vm-from-image"
$vmRes = Get-AzVM -ResourceGroupName $rg -Name $vm
Get-AzRoleAssignment -Scope $vmRes.Id | Select-Object PrincipalName,RoleDefinitionName,Scope

Cost Management: PAYG, Reserved, and Spot

Balance cost and flexibility using pay-as-you-go, 1- or 3-year Reserved Instances, and Spot for interruptible workloads. Combine with autoscale and schedule-based shutdowns for savings.

Key points: Tag cost centers; use budgets/alerts; right-size with performance baselines.
Use Azure Advisor and Cost Management exports.
Evaluate hybrid benefit and bring-your-own license scenarios.
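
Whether a reservation pays off depends on utilization: a reservation is billed for every hour, while pay-as-you-go is billed only for hours used. The sketch below computes the break-even point; the prices are hypothetical placeholders, not Azure list prices.

```python
def reservation_break_even(payg_hourly: float, reserved_hourly: float) -> float:
    """Return the utilization fraction above which a reservation beats pay-as-you-go."""
    return reserved_hourly / payg_hourly


def cheaper_option(payg_hourly: float, reserved_hourly: float, utilization: float) -> str:
    """Compare effective hourly cost at a given utilization (0.0 to 1.0)."""
    return "reserved" if reserved_hourly < payg_hourly * utilization else "payg"


# Hypothetical prices for illustration only.
print(reservation_break_even(payg_hourly=0.10, reserved_hourly=0.06))
print(cheaper_option(0.10, 0.06, utilization=0.40))   # payg
print(cheaper_option(0.10, 0.06, utilization=0.90))   # reserved
```

A VM that autoscale or a shutdown schedule keeps running well below the break-even fraction is usually a poor reservation candidate.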

Automation & Infrastructure as Code (ARM, Bicep, Terraform)

Keep environments declarative and repeatable with ARM templates, Bicep, or Terraform. Use CI/CD with Azure DevOps or GitHub Actions.

// Bicep: Minimal VM with managed identity (snippet)
param vmName string = 'vm-bicep'
param adminUser string = 'azureuser'
param location string = resourceGroup().location

resource nic 'Microsoft.Network/networkInterfaces@2023-09-01' = {
  name: '${vmName}-nic'
  location: location
  properties: {
    ipConfigurations: [{
      name: 'ipconfig1'
      properties: {
        privateIPAllocationMethod: 'Dynamic'
        subnet: { id: '/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>' }
      }
    }]
  }
}

// SSH public key for the admin user; password login is disabled, so a key is required
param sshPublicKey string

resource vm 'Microsoft.Compute/virtualMachines@2023-09-01' = {
  name: vmName
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    hardwareProfile: { vmSize: 'Standard_D2s_v5' }
    storageProfile: { imageReference: { publisher: 'Canonical', offer: '0001-com-ubuntu-server-jammy', sku: '22_04-lts', version: 'latest' } }
    osProfile: {
      computerName: vmName
      adminUsername: adminUser
      linuxConfiguration: {
        disablePasswordAuthentication: true
        ssh: { publicKeys: [{ path: '/home/${adminUser}/.ssh/authorized_keys', keyData: sshPublicKey }] }
      }
    }
    networkProfile: { networkInterfaces: [{ id: nic.id }] }
  }
}

Performance Optimization

Use Premium SSD v2/Ultra Disk where IOPS/latency matter, enable Accelerated Networking, and tune queues and IRQs on Linux. For databases, isolate data/logs/tempdb across disks and set the right caching policy.

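Key points: Benchmark with fio or DiskSpd; align the VM SKU to storage and network throughput targets; consider dedicated hosts or isolated VM sizes if noisy neighbors are a concern, and capacity reservations if you need guaranteed capacity.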
Use NUMA-aware SKUs for big memory workloads; pin interrupts for high packets/s.
Profile app latency after scaling events to catch warm-up effects.

Security Enhancements

Harden with Trusted Launch, Secure Boot, vTPM, full-disk encryption, Defender for Cloud, and JIT. Use Microsoft Entra ID managed identities for secrets-free auth.

Key points: Disable password logins where possible; rotate keys; monitor guest configuration compliance.
Apply CIS baselines during image bake and enforce with Policy.
Segment admin endpoints; prefer private links over public.

Dev/Test, CI/CD, and Modern Workloads

VMs remain ideal for lift-and-shift apps, stateful services, and GPU/HPC. Integrate provisioning with pipelines, canary deploys, and staged rollouts using Azure DevOps or GitHub Actions, and gate releases with health probes.

Key points: Use ephemeral OS disks for stateless fleets; pin images to gallery versions in VMSS.
For data-heavy apps, co-locate compute and storage; cache read-heavy workloads.
Collect build provenance for audit and rollback.

Consolidated FAQs — Azure VMs (10)

What’s the fastest way to start? Use Marketplace image + Bicep/Terraform template, then layer extensions and policy.
How do I pick a VM size? Start from perf baselines; map to family (CPU-, memory-, storage-, or GPU-bound).
How do I reduce costs? Right-size, autoscale, schedule shutdowns, consider RI/Spot, and monitor utilization.
Is lift-and-shift safe? Yes with guardrails: network segmentation, policy, backups, and monitoring from day one.
How do I secure access? Use JIT, private endpoints, bastion hosts, and managed identities.
What about compliance? Enforce with Azure Policy, Defender for Cloud, and logging to immutable storage.
How do I handle secrets? Store in Key Vault; use managed identities where possible.
How to plan DR? Define RPO/RTO, replicate with ASR, test regularly, and include identity/DNS.
When not to use VMs? Prefer PaaS/containers for highly elastic stateless microservices when feasible.
How do I monitor everything? Standardize on AMA + DCR, Workbooks, and alert routing via ITSM/Teams.

Troubleshooting Playbooks: VMSS, Images, and Arc

Playbook 1 — Autoscale Didn’t Trigger

Verify metric scope (correct resource ID) and recent ingestion in Monitor.
Check autoscale rules’ operator, threshold, evaluation windows, and cool-downs.
Inspect AzureActivity for autoscale evaluations and failures.
Validate health probes; unhealthy instances won’t receive traffic, distorting metrics.
Test with synthetic load (k6/ApacheBench) to cross the threshold deliberately.
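
One less obvious reason a scale-in never fires is flapping protection: if removing an instance would push the projected metric past the scale-out threshold, the scale-in is skipped. The sketch below estimates that projection under the assumption that total load stays constant and spreads evenly; the function names are hypothetical.

```python
def projected_cpu_after_scale_in(avg_cpu: float, instances: int, remove: int = 1) -> float:
    """Estimate per-instance CPU if the same total load runs on fewer instances."""
    remaining = instances - remove
    if remaining <= 0:
        raise ValueError("scale-in would leave no instances")
    return avg_cpu * instances / remaining


def scale_in_would_flap(avg_cpu: float, instances: int, scale_out_threshold: float) -> bool:
    """True if scaling in by one would push projected CPU past the scale-out threshold."""
    return projected_cpu_after_scale_in(avg_cpu, instances) > scale_out_threshold


# Hypothetical tight thresholds (scale out above 50%): 28% on 2 instances projects to 56%.
print(scale_in_would_flap(28, 2, scale_out_threshold=50))   # True
# With the 60%/30% thresholds from the Quick Start, 29% on 2 instances projects to 58%.
print(scale_in_would_flap(29, 2, scale_out_threshold=60))   # False
```

If the projection exceeds the scale-out threshold, widen the gap between the two thresholds rather than shortening the evaluation windows.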

Playbook 2 — Image Deploy Fails

Confirm Marketplace terms accepted and gallery version available in the target region.
Check Trusted Launch/Gen2 compatibility and disk encryption settings.
Look for extension failures in VMExtensionProvisioningError.
Republish the version; ensure replication finished across regions.
Try a minimal SKU (D2s_v5) to rule out quota/SKU availability issues.

Playbook 3 — Arc Machine Not Reporting

Run azcmagent check on the host; collect azcmagent logs.
Validate outbound firewall/proxy to required Azure endpoints.
Check policy compliance and update baselines; remediate drift.
Review Defender for Cloud recommendations and agent health.
Re-connect with refreshed service principal if credentials expired.

Mastering Azure Virtual Machines means combining the right VM sizes, gold-standard images, intelligent autoscaling, and hybrid governance with Azure Arc. Use the scripts and playbooks above to accelerate delivery, improve reliability, and control costs—with security and compliance built in.
