Block Storage for Virtual Machines — Architecture, Best Practices, Troubleshooting & Automation
Comprehensive, practical guide with provider comparisons, performance tuning, snapshot & DR procedures, security, automation scripts (PowerShell, Azure Resource Graph, AWS CLI), FAQs and keypoints for each section.
Quick navigation
- Definition & purpose
- How block storage works
- Block vs File vs Object
- Performance characteristics & IOPS tuning
- Volume types by cloud providers (Azure, AWS, GCP)
- Snapshots, backups & DR
- Security & encryption
- Automation & troubleshooting (PowerShell, Azure Resource Graph, AWS CLI)
- Monitoring & capacity planning
- Hybrid, migration & best practices
- FAQs & keypoints
Definition & purpose
Definition of Block Storage — Block storage stores data in fixed-size blocks and exposes these blocks to operating systems or VMs as virtual disks. Each block is addressable, and the storage subsystem treats the collection of blocks as a raw volume that the VM can partition, format, and mount.
Purpose in virtual machines — Block storage provides persistent, low-latency storage volumes for VMs in cloud and virtualized environments. Use cases commonly include: databases, application servers, high I/O workloads, transactional systems, and enterprise applications that require consistent performance and direct file system control.
Keypoints
- Exposed to VM as a raw disk (attach/detach independent of VM lifecycle).
- Predictable latency and IOPS for transactional workloads.
- Supports snapshots, cloning, resizing and replication.
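To make the idea of addressable, fixed-size blocks concrete, here is a minimal in-memory sketch. The class name `ToyBlockDevice` and the 4096-byte block size are assumptions chosen for illustration; real volumes are backed by provider storage and may use other block sizes.

```python
from typing import Tuple

BLOCK_SIZE = 4096  # bytes per block; an illustrative value, real devices vary


class ToyBlockDevice:
    """In-memory stand-in for a raw volume made of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.num_blocks = num_blocks
        self._data = bytearray(num_blocks * block_size)

    def locate(self, byte_offset: int) -> Tuple[int, int]:
        """Translate a byte offset into (block index, offset inside the block)."""
        if not 0 <= byte_offset < len(self._data):
            raise ValueError("offset outside the volume")
        return divmod(byte_offset, self.block_size)

    def _check(self, index: int) -> int:
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index outside the volume")
        return index * self.block_size

    def write_block(self, index: int, payload: bytes) -> None:
        if len(payload) != self.block_size:
            raise ValueError("payload must be exactly one block")
        start = self._check(index)
        self._data[start:start + self.block_size] = payload

    def read_block(self, index: int) -> bytes:
        start = self._check(index)
        return bytes(self._data[start:start + self.block_size])


disk = ToyBlockDevice(num_blocks=8)
disk.write_block(2, b"x" * BLOCK_SIZE)
print(disk.locate(10_000))     # (2, 1808)
print(disk.read_block(2)[:4])  # b'xxxx'
```

A filesystem or database engine sits on top of exactly this kind of interface: it decides which blocks hold which data, while the storage layer only reads and writes whole blocks by index.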
How block storage works (simple)
Block storage systems present volumes (LUNs or virtual disks) that can be attached to VMs via a hypervisor or cloud fabric. The VM's OS formats the block device using a filesystem (NTFS, ext4, XFS) or can use raw block access for database engines that support it.
Lifecycle overview
- Create a volume — allocate capacity and performance tier.
- Attach the volume to a VM — the OS discovers the disk (hot attach supported).
- Format & mount — create file system or use raw device.
- Use — perform reads/writes; the cloud provider handles underlying storage.
- Snapshot/Backup — use snapshots for point-in-time backups.
- Detach/Delete — volumes can persist after VM deletion if configured.
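The lifecycle above can be sketched as a small state machine. The state names loosely follow the wording providers use for volume status, but the `ALLOWED` table is a simplified assumption for illustration and not any provider's actual API.

```python
from enum import Enum


class VolumeState(Enum):
    CREATING = "creating"
    AVAILABLE = "available"  # exists but is not attached to any VM
    IN_USE = "in-use"        # attached to a VM
    DELETED = "deleted"


# Transitions allowed in this simplified model of the lifecycle.
ALLOWED = {
    VolumeState.CREATING: {VolumeState.AVAILABLE},
    VolumeState.AVAILABLE: {VolumeState.IN_USE, VolumeState.DELETED},
    VolumeState.IN_USE: {VolumeState.AVAILABLE},  # detach before deleting
    VolumeState.DELETED: set(),
}


def transition(current: VolumeState, target: VolumeState) -> VolumeState:
    """Return the new state, or raise if the model forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = VolumeState.CREATING
for step in (VolumeState.AVAILABLE, VolumeState.IN_USE, VolumeState.AVAILABLE):
    state = transition(state, step)
print(state.value)  # available
```

Encoding the allowed moves in a table makes it obvious why automation should detach a volume before trying to delete it.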
Block vs File vs Object storage (concise comparison)
| Capability | Block Storage | File Storage | Object Storage |
|---|---|---|---|
| Primary use | VM disks, databases | Shared file systems, home directories | Large-scale archives, blobs |
| Performance | Low latency, high IOPS | Moderate, SMB/NFS overhead | Optimized for throughput; higher per-request latency |
| Access method | Block-level (raw disk) | File-level (POSIX, SMB/NFS) | HTTP-based (REST API) |
| Scalability | Scales in capacity & IOPS | Scales with cluster | Massive scale, object metadata |
Keypoints
- Choose block storage when you need direct disk control and predictable I/O (databases, transactional systems).
- Choose object storage for cold data, backups, and massive content distribution.
Performance characteristics & IOPS optimization
Block storage performance primarily depends on: IOPS (input/output operations per second), throughput (MB/s), and latency (ms). Providers offer performance tiers — e.g., Provisioned IOPS, standard SSD, premium SSD, ultra disk.
What affects performance?
- Disk type (SSD vs HDD)
- Provisioned IOPS and burst credits
- VM size and network bandwidth
- File system and block size alignment
- Queue depth and concurrency
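IOPS, throughput, latency and queue depth are linked by simple arithmetic, which the sketch below illustrates. The function names are hypothetical, and the second function applies Little's Law as an idealized upper bound; real disks add provider caps and burst behavior on top.

```python
def throughput_mib_s(iops: float, io_size_kib: float) -> float:
    """Throughput implied by an IOPS figure at a given I/O size."""
    return iops * io_size_kib / 1024


def max_iops_for_queue_depth(queue_depth: int, latency_ms: float) -> float:
    """Little's Law: outstanding I/Os = IOPS x latency, solved for IOPS."""
    return queue_depth * 1000 / latency_ms


print(throughput_mib_s(16_000, 16))        # 250.0
print(max_iops_for_queue_depth(1, 1.0))    # 1000.0
print(max_iops_for_queue_depth(32, 1.0))   # 32000.0
```

The last two lines show why queue depth matters: at 1 ms per I/O, a single-threaded workload cannot exceed roughly 1,000 IOPS no matter how many IOPS the disk is provisioned for.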
Tuning checklist
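- Use the allocation unit size and partition alignment recommended by your database vendor (commonly 64 KB allocation units and 1 MiB partition alignment).
- Configure the VM's I/O scheduler (noop on older Linux kernels, none on multi-queue kernels) so the guest does not reorder I/O that the storage layer already schedules.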
- Use multiple volumes striped for higher IOPS/throughput where supported.
- Avoid single-threaded IO patterns for very high throughput workloads.
- Monitor queue depth and adjust application concurrency.
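As a small illustration of the alignment item in the checklist, the sketch below checks whether a partition's starting sector falls on a 1 MiB boundary. The helper names are hypothetical and the 512-byte sector size is an assumption; read the real values from your partitioning tool.

```python
MIB = 1024 * 1024


def partition_start_bytes(start_sector: int, sector_size: int = 512) -> int:
    """Byte offset at which the partition begins."""
    return start_sector * sector_size


def is_aligned(start_sector: int, sector_size: int = 512, boundary: int = MIB) -> bool:
    """True when the partition starts on a multiple of `boundary` bytes."""
    return partition_start_bytes(start_sector, sector_size) % boundary == 0


print(is_aligned(2048))  # True  (2048 * 512 bytes = 1 MiB)
print(is_aligned(63))    # False (legacy MBR offset of 63 sectors)
```

A misaligned partition can turn one logical write into two physical ones, which shows up as unexplained extra IOPS and latency.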
Volume types by cloud providers (provider quick reference)
Below are the common block storage offerings from each provider.
Microsoft Azure — Managed Disks
- Standard HDD (cost-effective, sequential).
- Standard SSD (general purpose).
- Premium SSD (low-latency, high IOPS).
- Ultra Disk (customizable IOPS/throughput/latency).
See the Azure Managed Disks documentation for details and performance guidance.
AWS — Elastic Block Store (EBS)
- General Purpose SSD (gp3/gp2).
- Provisioned IOPS SSD (io2/io2 Block Express).
- Throughput Optimized HDD (st1) and Cold HDD (sc1).
See the AWS documentation on EBS volume types for details.
Google Cloud — Persistent Disks
- Standard persistent disk (HDD).
- Balanced PD (SSD).
- Extreme Persistent Disk (high IOPS/throughput).
See the Google Cloud Persistent Disk overview for details.
Keypoints
- Ultra/Extreme disks are best for latency-sensitive DBs.
- Provisioned IOPS solves predictable performance needs but costs more.
Snapshots, backups & disaster recovery
Snapshots are point-in-time, incremental captures of block volumes. Cloud providers optimize storage of snapshots (incremental differences) to minimize cost and speed up restores.
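The incremental idea can be sketched with dictionaries standing in for volumes. This is a toy model under stated assumptions: `take_snapshot` and `restore` are hypothetical helpers, and real providers track changed blocks internally rather than comparing contents.

```python
from typing import Dict, List

Blocks = Dict[int, bytes]  # block index -> block contents


def take_snapshot(volume: Blocks, previous_view: Blocks) -> Blocks:
    """Store only the blocks that differ from the previous snapshot's view."""
    return {i: data for i, data in volume.items() if previous_view.get(i) != data}


def restore(chain: List[Blocks]) -> Blocks:
    """Rebuild a full volume image by applying deltas oldest to newest."""
    view: Blocks = {}
    for delta in chain:
        view.update(delta)
    return view


volume = {0: b"boot", 1: b"data-v1", 2: b"logs"}
snap1 = take_snapshot(volume, {})                # first snapshot copies every block
volume[1] = b"data-v2"                           # one block changes
snap2 = take_snapshot(volume, restore([snap1]))  # second snapshot stores one block

print(len(snap1), len(snap2))               # 3 1
print(restore([snap1, snap2]) == volume)    # True
```

The second snapshot is small because only one block changed, yet a restore still yields the full volume because the chain is replayed in order.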
Snapshot workflows
- Quiesce application (if possible) to flush buffers.
- Initiate snapshot (consistent backup).
- Store snapshot in region or replicate to secondary region.
- Test restore periodically (DR drills).
Snapshot-based disaster recovery patterns
- Cold DR: snapshots copied to another region; restore when needed.
- Warm DR: snapshots used to maintain standby VMs.
- Hot DR (replication): synchronous or near-synchronous replication to other zones/regions for RTO/RPO requirements.
Replication & availability zones
Cloud providers offer different redundancy options: locally redundant, zone-redundant, and geo-redundant. Understand SLA implications for each redundancy model.
Integration with databases
Databases (SQL Server, Oracle, PostgreSQL, MySQL) typically require low-latency, consistent IO. Recommended patterns:
- Place DB data files on Premium SSD or equivalent.
- Separate logs on lower-latency dedicated disks to avoid I/O contention.
- Consider raw devices for certain DB engines that handle volume-level management.
IOPS & throughput optimization — practical examples
Common approaches to increase IOPS/throughput:
- Increase volume performance tier (e.g., gp3 -> io2).
- Increase volume count and stripe across volumes inside VM (software RAID 0) — increases aggregate IOPS but careful with redundancy.
- Use instance types with higher network bandwidth and EBS/attachment performance.
- Enable write-back caches only when safe; ensure battery-backed caches or cloud guarantees.
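The striping approach in the list above can be sketched in two parts: the aggregate IOPS you can expect, and how RAID 0 maps a logical block to a member volume. The function names and the example limits are assumptions for illustration; actual per-volume and per-VM caps come from your provider's documentation.

```python
from typing import Tuple


def striped_iops(volume_count: int, iops_per_volume: int, vm_iops_cap: int) -> int:
    """Aggregate IOPS of a stripe set, limited by the VM's own attachment cap."""
    return min(volume_count * iops_per_volume, vm_iops_cap)


def stripe_location(logical_block: int, volume_count: int,
                    blocks_per_stripe: int) -> Tuple[int, int]:
    """RAID 0 mapping: return (volume index, block index on that volume)."""
    stripe, within = divmod(logical_block, blocks_per_stripe)
    volume = stripe % volume_count
    row = stripe // volume_count
    return volume, row * blocks_per_stripe + within


print(striped_iops(2, 5_000, 16_000))  # 10000
print(striped_iops(4, 5_000, 16_000))  # 16000 (the VM cap is now the limit)
print(stripe_location(16, 4, 16))      # (1, 0)
print(stripe_location(70, 4, 16))      # (0, 22)
```

Note that adding volumes stops helping once the VM cap is reached, and that RAID 0 offers no redundancy inside the guest: losing any member volume loses the whole stripe set.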
Cost models & pricing considerations
Block storage costs are usually: per-GB-month + per-IOPS or throughput tiers + snapshot storage costs. Additional costs may include data transfer for cross-region replication.
Keypoints
- Monitor snapshot growth; delete stale snapshots to control costs.
- Balance provisioned IOPS vs. burst models to optimize spend.
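The cost structure described above (capacity, provisioned performance, snapshot storage) can be sketched as a simple estimator. Every price in `EXAMPLE_PRICES` is a placeholder invented for illustration, not a real provider rate; substitute figures from your provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    per_gb_month: float                # capacity price
    per_provisioned_iops_month: float  # price per IOPS above the free baseline
    free_iops: int                     # IOPS included at no extra charge
    per_snapshot_gb_month: float       # snapshot storage price


# Placeholder numbers only; these are NOT real provider prices.
EXAMPLE_PRICES = PriceSheet(0.10, 0.005, 3000, 0.05)


def monthly_cost(size_gb: float, provisioned_iops: int,
                 snapshot_gb: float, prices: PriceSheet) -> float:
    """Estimate the monthly bill for one volume and its snapshots."""
    billable_iops = max(0, provisioned_iops - prices.free_iops)
    return (size_gb * prices.per_gb_month
            + billable_iops * prices.per_provisioned_iops_month
            + snapshot_gb * prices.per_snapshot_gb_month)


cost = monthly_cost(500, 6000, 200, EXAMPLE_PRICES)
print(f"{cost:.2f}")  # 75.00
```

Running the estimator with and without stale snapshot capacity is a quick way to show how much of the bill comes from snapshots rather than live volumes.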
Automation & management tools (PowerShell, CLI, IaC)
Block storage can be managed via portal, CLI, SDKs, PowerShell, and infrastructure-as-code (Terraform, ARM/Bicep). Below are practical troubleshooting and automation snippets.
Azure: PowerShell examples (Managed Disks)
Prerequisites: Install-Module Az and sign in (Connect-AzAccount).
List unattached managed disks (useful for cost cleanup)
# List all unattached managed disks in subscription
$disks = Get-AzDisk
$unattached = $disks | Where-Object { -not $_.ManagedBy }
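# Sku.Name is a nested property, so expose it through a calculated property
$unattached | Select-Object Name, ResourceGroupName, DiskSizeGB, @{Name = 'Sku'; Expression = { $_.Sku.Name }} | Format-Table -AutoSize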
Resize a managed disk (online for many disk types)
# Resize managed disk to 512 GB
$rg = "myResourceGroup"
$diskName = "myDisk"
$disk = Get-AzDisk -ResourceGroupName $rg -DiskName $diskName
$disk.DiskSizeGB = 512
Update-AzDisk -ResourceGroupName $rg -DiskName $diskName -Disk $disk
Create a snapshot and copy to another region (Azure)
# Create snapshot
$snapshotConfig = New-AzSnapshotConfig -SourceUri $disk.Id -Location "eastus" -CreateOption Copy
New-AzSnapshot -Snapshot $snapshotConfig -SnapshotName "myDiskSnapshot" -ResourceGroupName $rg
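# Cross-region copy works on incremental snapshots: create the snapshot above with -Incremental,
# then create a snapshot in the target region with New-AzSnapshotConfig -CreateOption CopyStart -Incremental (verify against the current Az.Compute documentation)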
Azure Resource Graph query: find large disks (>1 TB) and VM pairing
# Requires the Az.ResourceGraph module
# sku and managedBy are top-level columns in Resource Graph; property names are case-sensitive
Search-AzGraph -Query @"
Resources
| where type =~ 'microsoft.compute/disks'
| extend diskSizeGB = toint(properties.diskSizeGB)
| where diskSizeGB > 1024
| project name, resourceGroup, location, skuName = sku.name, diskSizeGB, managedBy
"@
AWS: CLI examples (EBS)
Prerequisites: AWS CLI configured with credentials and region.
List unattached volumes
aws ec2 describe-volumes --filters Name=status,Values=available --query "Volumes[*].[VolumeId,Size,SnapshotId,AvailabilityZone]" --output table
Create snapshot and copy to another region
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "Prod DB snapshot"
# Once the snapshot has completed, copy it; run the command against the destination region (--region)
aws ec2 copy-snapshot --region us-west-2 --source-region us-east-1 --source-snapshot-id snap-0123456789abcdef0 --description "Prod DB snapshot copy"
GCP: gcloud examples (Persistent Disk)
# Create snapshot
gcloud compute disks snapshot my-disk --snapshot-names=my-disk-snap --zone=us-central1-a
# List snapshots
gcloud compute snapshots list --filter="name~my-disk"
Troubleshooting guide — step-by-step
Common issues and checks when block storage exhibits problems:
Issue: High latency and slow queries
- Check instance host: CPU, memory, network saturation.
- Check disk metrics: IOPS, throughput, latency (provider metrics).
- Compare actual IOPS to provisioned IOPS.
- Inspect queue depth and application IO patterns.
- Consider upgrading disk tier or striping across multiple volumes.
Issue: Disk not attached or not visible in VM
- Confirm volume status via provider console/CLI (attached state).
- In Linux, run lsblk or sudo fdisk -l.
- On Windows, check Disk Management or Get-Disk in PowerShell.
- Verify the VM agent/hypervisor is healthy and up to date.
Issue: Snapshot restore failed or inconsistent
- Ensure application-consistent snapshot was used for DBs.
- Check snapshot chain for corruption; attempt restore to new disk and mount read-only to inspect.
- Use database recovery tools (restore logs, apply transactions) if necessary.
Keypoints — troubleshooting
- Always correlate application metrics with storage metrics when troubleshooting latency.
- Use read replicas (for DBs) to offload reporting from primary disk I/O.
Security features & encryption
Block storage supports encryption-at-rest and encryption-in-transit. Providers allow customer-managed keys (CMK) via KMS/HSM, or provider-managed keys.
Recommended configuration
- Encrypt all volumes in production with CMK where possible.
- Use IAM roles/policies to limit attach/detach privileges.
- Audit disk and snapshot usage; monitor for orphaned snapshots/disks.
Backup & restore capabilities
Best-practice: adopt a 3-2-1 strategy adapted for cloud:
- 3 copies of data
- 2 types of media (block snapshot + object storage export)
- 1 off-site copy (cross-region snapshot or replication)
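The 3-2-1 rule above can be checked mechanically against an inventory of copies. This is a sketch under assumptions: the `BackupCopy` type and the media labels are hypothetical, and the list is assumed to include the live production volume as one of the three copies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    media: str   # e.g. "block-volume", "block-snapshot", "object-export"
    region: str  # region where this copy is stored


def satisfies_3_2_1(copies: List[BackupCopy], primary_region: str) -> bool:
    """True if there are 3+ copies, on 2+ media types, with 1+ outside the primary region."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.region != primary_region for c in copies)
    return enough_copies and enough_media and has_offsite


inventory = [
    BackupCopy("block-volume", "eastus"),     # live data
    BackupCopy("block-snapshot", "eastus"),   # local snapshot
    BackupCopy("object-export", "westus"),    # off-site export
]
print(satisfies_3_2_1(inventory, "eastus"))      # True
print(satisfies_3_2_1(inventory[:2], "eastus"))  # False
```

A check like this fits naturally into a scheduled compliance report alongside the orphaned-disk scans.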
Data migration & hybrid cloud integration
Migration strategies:
- Lift-and-shift: snapshot/replicate disks to cloud region and restore to VM.
- Block-level replication tools (Azure Migrate, AWS Application Migration Service, Storage Gateway).
- Database-native migration (Database Migration Service, replication, logical export/import).
Monitoring & capacity planning
Monitor metrics: latency, IOPS, throughput, queue length, burst credits, and snapshot sizes. Use provider monitoring (Azure Monitor, Amazon CloudWatch, Google Cloud Monitoring, formerly Stackdriver) and integrate with SIEM/observability tools.
Sample metric alerts (example)
- Alert if average latency > 20ms for 5 minutes.
- Alert if IOPS >= 90% of provisioned for 10 minutes.
- Alert if snapshot storage increases > 30% month-over-month.
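The three sample alerts above can be expressed as small predicates over metric samples. This sketch assumes one sample per minute and hypothetical function names; in practice these rules would be configured in the provider's monitoring service rather than hand-coded.

```python
from statistics import mean
from typing import Sequence


def latency_alert(latency_ms: Sequence[float], threshold_ms: float = 20.0,
                  window: int = 5) -> bool:
    """Average latency over the last `window` minutes exceeds the threshold."""
    recent = latency_ms[-window:]
    return len(recent) == window and mean(recent) > threshold_ms


def iops_alert(iops: Sequence[float], provisioned_iops: float,
               ratio: float = 0.90, window: int = 10) -> bool:
    """Every sample in the last `window` minutes is at or above ratio x provisioned."""
    recent = iops[-window:]
    return len(recent) == window and all(s >= ratio * provisioned_iops for s in recent)


def snapshot_growth_alert(prev_gb: float, curr_gb: float,
                          max_growth: float = 0.30) -> bool:
    """Snapshot storage grew by more than `max_growth` month over month."""
    return prev_gb > 0 and (curr_gb - prev_gb) / prev_gb > max_growth


print(latency_alert([5, 25, 30, 22, 21, 24]))  # True
print(iops_alert([2800] * 10, 3000))           # True
print(snapshot_growth_alert(100, 120))         # False
```

Requiring a full window before firing avoids alerts on a single noisy sample, which is the point of the "for 5 minutes" and "for 10 minutes" qualifiers.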
High availability design
High availability for storage requires zoning and replication:
- Use zone-redundant or region-redundant storage for critical volumes.
- Leverage database clustering with shared/dedicated volumes (e.g., clustered file systems or multi-attach where supported).
- Design stateless application tiers where possible and replicate stateful tiers via database replication.
Hybrid & on-prem integration
Hybrid patterns include synchronous/near-synchronous replication to on-prem arrays, or caching/gateway solutions that present cloud volumes locally. Tools: Azure File Sync, AWS Storage Gateway, vendor replication appliances.
Keypoints — hybrid
- Network latency is the major hindrance to synchronous hybrid replication.
- Use asynchronous replication when across long distances; design for eventual consistency accordingly.
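A rough model shows why network latency dominates synchronous replication. The sketch assumes the local and remote writes are issued in parallel and the write is acknowledged only when both finish; the function names and millisecond figures are illustrative assumptions, not measurements.

```python
def sync_write_latency_ms(local_ms: float, rtt_ms: float, remote_ms: float) -> float:
    """Synchronous: acknowledge after both the local and the remote write complete."""
    return max(local_ms, rtt_ms + remote_ms)


def async_write_latency_ms(local_ms: float) -> float:
    """Asynchronous: acknowledge after the local write; the replica lags behind."""
    return local_ms


print(sync_write_latency_ms(1.0, 0.5, 1.0))   # 1.5  (nearby zone)
print(sync_write_latency_ms(1.0, 40.0, 1.0))  # 41.0 (distant site)
print(async_write_latency_ms(1.0))            # 1.0
```

With a 40 ms round trip, every synchronous write takes about 41 ms regardless of disk tier, which is why long-distance links usually pair with asynchronous replication and a non-zero RPO.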
Best practices summary
- Match disk type to workload (SSD for DBs, HDD for archival/sequential reads).
- Separate OS, logs, data on different volumes to isolate I/O.
- Automate snapshot lifecycle (retention, copy, deletion).
- Encrypt volumes and use least-privilege IAM for storage operations.
- Monitor and right-size volumes periodically; remove orphaned disks and snapshots.
- Test restores and DR playbooks regularly.
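For the snapshot lifecycle item above, the selection logic is worth isolating so it can be tested before any deletion runs. This is a sketch with hypothetical names; it only decides which snapshot IDs are candidates and always protects the most recent ones, and the actual delete calls would go through your provider's CLI or SDK.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Snapshot = Tuple[str, datetime]  # (snapshot id, creation time)


def snapshots_to_delete(snapshots: List[Snapshot], now: datetime,
                        retention_days: int, keep_latest: int = 1) -> List[str]:
    """Return IDs older than the retention window, never the newest `keep_latest`."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in newest_first[:keep_latest]}
    cutoff = now - timedelta(days=retention_days)
    return [sid for sid, created in newest_first
            if created < cutoff and sid not in protected]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
snaps = [
    ("snap-a", now - timedelta(days=100)),
    ("snap-b", now - timedelta(days=95)),
    ("snap-c", now - timedelta(days=10)),
]
print(snapshots_to_delete(snaps, now, retention_days=90))      # ['snap-b', 'snap-a']
print(snapshots_to_delete(snaps[:1], now, retention_days=90))  # []
```

The second call returns nothing because the only snapshot is also the newest one, so a stalled backup job cannot cause the last good copy to be deleted.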
Advanced automation & real-world scripts
Below are longer script examples you can adapt for scheduled cleanup, discovery, and health checks.
Azure PowerShell — cleanup script (unattached disks older than 30 days)
# Cleanup unattached managed disks older than X days
Import-Module Az
Connect-AzAccount
$days = 30
$cutoff = (Get-Date).AddDays(-$days)
$disks = Get-AzDisk
$unattached = $disks | Where-Object { -not $_.ManagedBy -and $_.TimeCreated -lt $cutoff }
foreach ($d in $unattached) {
    Write-Output "Found unattached disk: $($d.Name) in RG $($d.ResourceGroupName), size: $($d.DiskSizeGB)GB, created: $($d.TimeCreated)"
    # Optionally remove after approval
    # Remove-AzDisk -ResourceGroupName $d.ResourceGroupName -DiskName $d.Name -Force
}
Azure Resource Graph + PowerShell — inventory all disks + VM pairing
# Use Search-AzGraph to retrieve the disk inventory and cross-check VMs
# -First raises the default of 100 results per call (maximum 1000)
# sku and managedBy are top-level columns in Resource Graph; property names are case-sensitive
$results = Search-AzGraph -First 1000 -Query @"
Resources
| where type =~ 'microsoft.compute/disks'
| project diskName = name, resourceGroup, location, skuName = sku.name, sizeGB = toint(properties.diskSizeGB), managedBy
"@
$results | ConvertTo-Json -Depth 5
AWS Bash snippet — delete snapshots older than 90 days (careful!)
# AWS CLI: delete snapshots older than 90 days owned by this account
# Requires jq and GNU date (date -d); run with the delete line commented out first and review the output
DAYS=90
NOW=$(date +%s)
aws ec2 describe-snapshots --owner-ids self --query "Snapshots[*].[SnapshotId,StartTime,Description]" --output json | \
jq -r '.[] | @base64' | while read -r line; do
  # Decode one base64-encoded row and extract a field from it
  _jq() { echo "${line}" | base64 --decode | jq -r "${1}"; }
  sid=$(_jq '.[0]')
  stime=$(_jq '.[1]')
  stime_s=$(date -d "$stime" +%s)
  age=$(( (NOW - stime_s) / 86400 ))
  if [ "$age" -gt "$DAYS" ]; then
    echo "Deleting snapshot $sid age $age days"
    aws ec2 delete-snapshot --snapshot-id "$sid"
  fi
done
FAQs: frequently asked questions & quick answers
Q: Can I attach a single block volume to multiple VMs?
A: Some providers offer multi-attach volumes (AWS EBS Multi-Attach, Azure Shared Disks) for clustered file systems. Application-level coordination (cluster-aware FS) is required to prevent corruption.
Keypoint: Multi-attach is intended for clustered applications (e.g., clustered databases or clustered file systems) and requires careful testing.
Q: Is resizing a disk destructive?
A: Resizing a managed disk is generally non-destructive — you can increase size online. After resizing the volume at provider level, you must extend the partition and filesystem inside the guest OS.
Keypoint: Shrinking volumes is usually unsupported or risky — snapshot first.
Q: How often should I snapshot?
A: Snapshot frequency depends on RPO requirements. For critical DBs, consider transaction log shipping or continuous replication plus periodic full snapshots. For less critical systems, daily snapshots may suffice.
Q: How to choose between Provisioned IOPS and standard SSD?
A: Choose Provisioned IOPS when you need guaranteed IOPS and consistent low latency. Standard SSD is sufficient for many general-purpose workloads and is cheaper.
Mini case study: Migration of on-prem DB to cloud block storage
Scenario: Enterprise running SQL Server on SAN planning move to Azure. Steps taken:
- Inventory SAN LUNs and their IOPS/throughput using perfmon.
- Map LUNs to managed disks; choose Premium SSD for data, Ultra for high IOPS portions, Standard for backups.
- Use Azure Migrate for lift-and-shift; for the final cutover, restore from a full backup and use log shipping to keep the cutover window (and RTO) short.
- After cutover, observe disk metrics and right-size disks and VM SKU within 2 weeks.
Keypoints
- Measure real I/O before picking disk type; cloud metrics/instrumentation help size accurately.
- Test application behavior on same disk types (staging) before production migration.
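The sizing step in this case study can be sketched as a percentile calculation over measured IOPS followed by a tier lookup. The sample values, the 20% headroom and the tier limits below are all invented for illustration; real tier limits depend on the provider and disk size.

```python
import math
from typing import List, Optional, Sequence, Tuple


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the measured samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pick_tier(required_iops: float, tiers: List[Tuple[str, int]]) -> Optional[str]:
    """Return the first tier (sorted by ascending limit) that covers the requirement."""
    for name, max_iops in tiers:
        if required_iops <= max_iops:
            return name
    return None


# Hypothetical measurements and tier limits, for illustration only.
measured_iops = [800, 900, 1200, 1100, 950, 4000, 1000, 1050, 980, 1020]
tiers = [("small", 500), ("medium", 5000), ("large", 20000)]

p95 = percentile(measured_iops, 95)
print(percentile(measured_iops, 50), p95)  # 1000 4000
print(pick_tier(p95 * 1.2, tiers))         # medium
```

Sizing from a high percentile rather than the average matters here: the median is 1,000 IOPS, but the workload briefly needs 4,000, and a disk sized for the median would throttle during those peaks.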
Operational checklists (copy into runbooks)
Provisioning runbook
- Confirm IOPS/throughput requirements (peak vs sustained).
- Choose disk tier and attach to VM in required AZ.
- Initialize and format with appropriate block size and alignment.
- Ensure monitoring/alerts configured.
- Enable encryption and set IAM/policies.
Pre-snapshot checklist for databases
- Notify stakeholders and freeze write operations if possible.
- Flush DB caches and perform VSS or app-consistent snapshot.
- Verify snapshot chain and replicate copy off-site if required.
Closing — where to go next
This guide covered architecture, provider comparisons, performance tuning and operational scripts you can use immediately to discover, troubleshoot and manage block storage for VMs. For more provider-specific tuning and step-by-step examples, see the provider docs linked in each provider section and adapt the PowerShell/AWS CLI scripts to your environment.
Final FAQs — short answers & pointers
- Encryption: Enable customer-managed keys (CMK, called CMEK on Google Cloud) if compliance requires customer control of keys.
- Orphans: Periodically scan for unattached disks & snapshots — they cost money.
- Testing: Regularly test restores; snapshots alone are not sufficient unless validation is performed.
— End of guide —
Content contains actionable code examples; adapt credentials, resource names and permissions before running in production. Hyperlinks for topic keywords point to cloudknowledge.in for further reading.






