Master of Block Storage for Virtual Machines 2025

Block Storage for Virtual Machines — Architecture, Best Practices, Troubleshooting & Automation

Comprehensive, practical guide with provider comparisons, performance tuning, snapshot & DR procedures, security, automation scripts (PowerShell, Azure Resource Graph, AWS CLI), FAQs and keypoints for each section.

Quick navigation

Definition & purpose
How block storage works
Block vs File vs Object
Performance characteristics & IOPS tuning
Volume types by cloud providers (Azure, AWS, GCP)
Snapshots, backups & DR
Security & encryption
Automation & troubleshooting (PowerShell, Azure Resource Graph, AWS CLI)
Monitoring & capacity planning
Hybrid, migration & best practices
FAQs & keypoints

Definition & purpose

Definition of Block Storage — Block storage stores data in fixed-size blocks and exposes these blocks to operating systems or VMs as virtual disks. Each block is addressable, and the storage subsystem treats the collection of blocks as a raw volume that the VM can partition, format, and mount.
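To make "each block is addressable" concrete, here is a minimal Python sketch that maps a byte range on a volume to the block numbers it touches. The 4096-byte block size and the function name blocks_for_range are illustrative assumptions, not any provider's API.

```python
def blocks_for_range(offset: int, length: int, block_size: int = 4096) -> range:
    """Return the block numbers touched by an I/O of `length` bytes at byte `offset`."""
    if offset < 0 or length <= 0 or block_size <= 0:
        raise ValueError("offset must be >= 0; length and block_size must be > 0")
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


# An 8 KiB write starting at byte 6144 straddles three 4 KiB blocks.
print(list(blocks_for_range(6144, 8192)))  # [1, 2, 3]
```

An I/O that is not aligned to block boundaries touches more blocks than its size suggests, which is one reason alignment matters for performance.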

Purpose in virtual machines — Block storage provides persistent, low-latency storage volumes for VMs in cloud and virtualized environments. Use cases commonly include: databases, application servers, high I/O workloads, transactional systems, and enterprise applications that require consistent performance and direct file system control.

Keypoints

Exposed to VM as a raw disk (attach/detach independent of VM lifecycle).
Predictable latency and IOPS for transactional workloads.
Supports snapshots, cloning, resizing and replication.

How block storage works (simple)

Block storage systems present volumes (LUNs or virtual disks) that can be attached to VMs via a hypervisor or cloud fabric. The VM's OS formats the block device using a filesystem (NTFS, ext4, XFS) or can use raw block access for database engines that support it.

Lifecycle overview

Create a volume — allocate capacity and performance tier.
Attach the volume to a VM — the OS discovers the disk (hot attach supported).
Format & mount — create file system or use raw device.
Use — perform reads/writes; the cloud provider handles underlying storage.
Snapshot/Backup — use snapshots for point-in-time backups.
Detach/Delete — volumes can persist after VM deletion if configured.

Pro tip: Use separate volumes for OS, logs, database files, and backups — simplifies IO isolation and recovery.

Block vs File vs Object storage (concise comparison)

Capability	Block Storage	File Storage	Object Storage
Primary use	VM disks, databases	Shared file systems, home directories	Large-scale archives, blobs
Performance	Low latency, high IOPS	Moderate, SMB/NFS overhead	Optimized for throughput; consistency model varies by provider
Access method	Block-level (raw disk)	File-level (POSIX, SMB/NFS)	HTTP-based (REST API)
Scalability	Scales in capacity & IOPS	Scales with cluster	Massive scale, object metadata

Keypoints

Choose block storage when you need direct disk control and predictable I/O (databases, transactional systems).
Choose object storage for cold data, backups, and massive content distribution.

Performance characteristics & IOPS optimization

Block storage performance primarily depends on: IOPS (input/output operations per second), throughput (MB/s), and latency (ms). Providers offer performance tiers — e.g., Provisioned IOPS, standard SSD, premium SSD, ultra disk.
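IOPS and throughput are linked by the I/O size, so a volume can hit its throughput limit long before its IOPS limit (or the reverse). The sketch below shows the arithmetic; it uses binary units (KiB, MiB), and the function names are illustrative.

```python
def throughput_mib_s(iops: float, io_size_kib: float) -> float:
    """Throughput in MiB/s implied by an IOPS rate at a fixed I/O size."""
    return iops * io_size_kib / 1024


def iops_for_throughput(mib_s: float, io_size_kib: float) -> float:
    """IOPS needed to sustain a throughput at a fixed I/O size."""
    return mib_s * 1024 / io_size_kib


print(throughput_mib_s(16000, 8))    # 125.0  (16,000 IOPS of 8 KiB each)
print(iops_for_throughput(250, 64))  # 4000.0 (250 MiB/s of 64 KiB I/Os)
```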

What affects performance?

Disk type (SSD vs HDD)
Provisioned IOPS and burst credits
VM size and network bandwidth
File system and block size alignment
Queue depth and concurrency
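Queue depth, latency and IOPS are tied together by Little's law: the number of I/Os in flight equals the I/O rate multiplied by the time each I/O takes. The sketch below applies that relationship as a rough sizing aid; it assumes a steady state and a constant per-I/O latency, which real workloads only approximate.

```python
def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS for a given number of outstanding I/Os and per-I/O latency."""
    return queue_depth * 1000.0 / latency_ms


def required_queue_depth(target_iops: float, latency_ms: float) -> float:
    """Outstanding I/Os needed to reach target_iops at the given latency."""
    return target_iops * latency_ms / 1000.0


print(max_iops(1, 1.0))                  # 1000.0  (single-threaded, 1 ms per I/O)
print(max_iops(32, 2.0))                 # 16000.0
print(required_queue_depth(20000, 0.5))  # 10.0
```

This is why a single-threaded application issuing one I/O at a time cannot reach a volume's provisioned IOPS, however fast the disk tier.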

Tuning checklist

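Align partitions and choose the file system allocation unit per your database vendor's guidance (for example, a 64 KB allocation unit for SQL Server data volumes and 1 MiB partition alignment, the default on modern operating systems).
On Linux, set the I/O scheduler to none (noop on older kernels without multi-queue block I/O); for many cloud VMs the storage backend already schedules I/O.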
Use multiple volumes striped for higher IOPS/throughput where supported.
Avoid single-threaded IO patterns for very high throughput workloads.
Monitor queue depth and adjust application concurrency.
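The alignment item in the checklist can be verified with simple arithmetic on the partition's starting offset. This is a minimal sketch assuming a 1 MiB alignment boundary and 512-byte sectors; the helper names are hypothetical.

```python
MIB = 1024 * 1024


def is_aligned(partition_offset_bytes: int, alignment_bytes: int = MIB) -> bool:
    """True if the partition starts on an alignment boundary."""
    return partition_offset_bytes % alignment_bytes == 0


def next_aligned_offset(offset_bytes: int, alignment_bytes: int = MIB) -> int:
    """Smallest aligned offset that is >= offset_bytes."""
    return -(-offset_bytes // alignment_bytes) * alignment_bytes


# Start sector 2048 (512-byte sectors) is the common modern default: aligned.
print(is_aligned(2048 * 512))          # True
# Start sector 63 is the legacy MBR layout: not aligned.
print(is_aligned(63 * 512))            # False
print(next_aligned_offset(63 * 512))   # 1048576
```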

Example: SQL Server on an SSD-backed managed disk with properly aligned volumes typically yields consistent latency under 10 ms for OLTP workloads, though results vary with workload.

Volume types by cloud providers (provider quick reference)

Below are the common block storage offerings; each provider's documentation covers them in more depth.

Microsoft Azure — Managed Disks

Standard HDD (cost-effective, sequential).
Standard SSD (general purpose).
Premium SSD (low-latency, high IOPS).
Ultra Disk (independently configurable IOPS and throughput, sub-millisecond latency).

See the Azure Managed Disks documentation for details and performance guidance.

AWS — Elastic Block Store (EBS)

General Purpose SSD (gp3/gp2).
Provisioned IOPS SSD (io2/io2 Block Express).
Throughput Optimized HDD (st1) and Cold HDD (sc1).

See the AWS documentation on EBS volume types for more detail.

Google Cloud — Persistent Disks

Standard persistent disk (HDD).
Balanced PD (SSD).
Extreme Persistent Disk (high IOPS/throughput).

See the Persistent Disk overview in the Google Cloud documentation.

Keypoints

Ultra/Extreme disks are best for latency-sensitive DBs.
Provisioned IOPS solves predictable performance needs but costs more.

Snapshots, backups & disaster recovery

Snapshots are point-in-time, incremental captures of block volumes. Cloud providers optimize storage of snapshots (incremental differences) to minimize cost and speed up restores.
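The idea of an incremental snapshot can be sketched in a few lines of Python: only blocks that differ from the previous snapshot are stored. This toy version detects changes by hashing every block, which is an assumption made for illustration; providers typically track changed blocks internally rather than rescanning the volume, and the 4-byte block size is only there to keep the example readable.

```python
import hashlib


def incremental_snapshot(prev_hashes: dict, data: bytes, block_size: int = 4):
    """Return (changed_blocks, new_hashes) for `data` relative to the previous snapshot."""
    changed = {}
    hashes = {}
    for idx in range(0, len(data), block_size):
        block = data[idx:idx + block_size]
        number = idx // block_size
        digest = hashlib.sha256(block).hexdigest()
        hashes[number] = digest
        if prev_hashes.get(number) != digest:
            changed[number] = block  # only new or modified blocks are stored
    return changed, hashes


full, h1 = incremental_snapshot({}, b"AAAABBBBCCCC")    # first snapshot stores everything
delta, h2 = incremental_snapshot(h1, b"AAAAXXXXCCCC")   # second stores only block 1
print(sorted(full))  # [0, 1, 2]
print(delta)         # {1: b'XXXX'}
```

Restoring the second snapshot means combining the stored delta with unchanged blocks from earlier snapshots, which is why deleting a snapshot in a chain does not necessarily free all of its space.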

Snapshot workflows

Quiesce application (if possible) to flush buffers.
Initiate snapshot (consistent backup).
Store snapshot in region or replicate to secondary region.
Test restore periodically (DR drills).

Snapshot-based disaster recovery patterns

Cold DR: snapshots copied to another region; restore when needed.
Warm DR: snapshots used to maintain standby VMs.
Hot DR (replication): synchronous or near-synchronous replication to other zones/regions for RTO/RPO requirements.

Important: Application-consistent snapshots (application quiesce + VSS for Windows, or fsfreeze for Linux) reduce recovery issues for databases.

Replication & availability zones

Cloud providers offer different redundancy options: locally redundant, zone-redundant, and geo-redundant. Understand SLA implications for each redundancy model.

Integration with databases

Databases (SQL Server, Oracle, PostgreSQL, MySQL) typically require low-latency, consistent IO. Recommended patterns:

Place DB data files on Premium SSD or equivalent.
Place transaction logs on dedicated low-latency disks to avoid I/O contention with data files.
Consider raw devices for certain DB engines that handle volume-level management.

IOPS & throughput optimization — practical examples

Common approaches to increase IOPS/throughput:

Increase volume performance tier (e.g., gp3 -> io2).
Increase volume count and stripe across volumes inside the VM (software RAID 0). This raises aggregate IOPS, but RAID 0 adds no redundancy: losing one member volume loses the whole stripe set.
Use instance types with higher network bandwidth and EBS/attachment performance.
Enable write-back caches only when safe; ensure battery-backed caches or cloud guarantees.
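Striping only helps until the VM's own storage limits are reached. The sketch below estimates the effective ceiling of a striped set as the smaller of the summed disk limits and the VM-level caps. All numbers in the example are hypothetical placeholders, not published limits for any disk or instance type.

```python
def striped_limits(disk_count: int, per_disk_iops: int, per_disk_mib_s: int,
                   vm_iops_cap: int, vm_mib_s_cap: int) -> tuple:
    """Effective (IOPS, MiB/s) ceiling for a RAID 0 set attached to one VM."""
    iops = min(disk_count * per_disk_iops, vm_iops_cap)
    mib_s = min(disk_count * per_disk_mib_s, vm_mib_s_cap)
    return iops, mib_s


# Hypothetical: 4 disks of 5,000 IOPS / 200 MiB/s on a VM capped at 12,800 IOPS / 1,000 MiB/s.
print(striped_limits(4, 5000, 200, 12800, 1000))  # (12800, 800)
```

In this example IOPS is limited by the VM while throughput is limited by the disks, so adding a fifth disk would raise throughput but not IOPS.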

Cost models & pricing considerations

Block storage costs are usually: per-GB-month + per-IOPS or throughput tiers + snapshot storage costs. Additional costs may include data transfer for cross-region replication.
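The cost structure above can be turned into a rough monthly estimate. This is a sketch only: the unit prices and the included-IOPS allowance in the example are made-up placeholders, not real provider rates, and data transfer charges are left out.

```python
def monthly_cost(size_gb: float, provisioned_iops: int, snapshot_gb: float,
                 price_per_gb: float, price_per_iops: float,
                 price_per_snapshot_gb: float, included_iops: int = 0) -> float:
    """Estimate monthly cost: capacity + IOPS above the included allowance + snapshot storage."""
    billable_iops = max(0, provisioned_iops - included_iops)
    return (size_gb * price_per_gb
            + billable_iops * price_per_iops
            + snapshot_gb * price_per_snapshot_gb)


# Placeholder prices (assumptions for illustration only).
cost = monthly_cost(size_gb=500, provisioned_iops=6000, snapshot_gb=200,
                    price_per_gb=0.08, price_per_iops=0.005,
                    price_per_snapshot_gb=0.05, included_iops=3000)
print(round(cost, 2))  # 65.0
```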

Keypoints

Monitor snapshot growth; delete stale snapshots to control costs.
Balance provisioned IOPS vs. burst models to optimize spend.

Automation & management tools (PowerShell, CLI, IaC)

Block storage can be managed via portal, CLI, SDKs, PowerShell, and infrastructure-as-code (Terraform, ARM/Bicep). Below are practical troubleshooting and automation snippets.

Azure: PowerShell examples (Managed Disks)

Prerequisites: Install-Module Az and sign in (Connect-AzAccount).

List unattached managed disks (useful for cost cleanup)

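# List all unattached managed disks in the current subscription
$disks = Get-AzDisk
$unattached = $disks | Where-Object { -not $_.ManagedBy }
# Sku is a nested object, so project its name with a calculated property
$unattached | Select-Object Name, ResourceGroupName, DiskSizeGB, @{Name = 'Sku'; Expression = { $_.Sku.Name }} | Format-Table -AutoSize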

Resize a managed disk (online for many disk types)

# Resize managed disk to 512 GB
$rg = "myResourceGroup"
$diskName = "myDisk"
$disk = Get-AzDisk -ResourceGroupName $rg -DiskName $diskName
$disk.DiskSizeGB = 512
# Update-AzDisk needs the disk name as well as the updated disk object
Update-AzDisk -ResourceGroupName $rg -DiskName $diskName -Disk $disk

Create a snapshot and copy to another region (Azure)

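# Create a snapshot of the disk retrieved earlier ($disk, $rg); the snapshot is created in the disk's own region
$snapshotConfig = New-AzSnapshotConfig -SourceUri $disk.Id -Location $disk.Location -CreateOption Copy
New-AzSnapshot -Snapshot $snapshotConfig -SnapshotName "myDiskSnapshot" -ResourceGroupName $rg

# For a cross-region copy, take the snapshot as incremental (-Incremental) and create a second snapshot
# from it in the target region (CreateOption CopyStart); confirm the exact steps in the current Azure documentation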

Azure Resource Graph query — find large disks (>1TB) and vm pairing

# Requires the Az.ResourceGraph module
# sku and managedBy are top-level columns in Resource Graph; the size lives under properties.diskSizeGB
Search-AzGraph -Query @"
Resources
| where type =~ 'microsoft.compute/disks'
| extend diskSizeGB = toint(properties.diskSizeGB)
| where diskSizeGB > 1024
| project name, resourceGroup, location, skuName = tostring(sku.name), diskSizeGB, managedBy
"@

Use Resource Graph for subscription-wide discovery — much faster than iterating resources via API.

AWS: CLI examples (EBS)

Prerequisites: AWS CLI configured with credentials and region.

List unattached volumes

aws ec2 describe-volumes --filters Name=status,Values=available --query "Volumes[*].[VolumeId,Size,SnapshotId,AvailabilityZone]" --output table

Create snapshot and copy to another region

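aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "Prod DB snapshot"
# wait until the snapshot completes (use the snapshot ID returned by create-snapshot)
aws ec2 wait snapshot-completed --snapshot-ids snap-0123456789abcdef0
# then copy the snapshot; the copy lands in the region the request is sent to, so set --region to the destination
aws ec2 copy-snapshot --region us-west-2 --source-region us-east-1 --source-snapshot-id snap-0123456789abcdef0 --description "Prod DB snapshot copy"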

GCP: gcloud examples (Persistent Disk)

# Create snapshot
gcloud compute disks snapshot my-disk --snapshot-names=my-disk-snap --zone=us-central1-a

# List snapshots
gcloud compute snapshots list --filter="name~my-disk"

Troubleshooting guide — step-by-step

Common issues and checks when block storage exhibits problems:

Issue: High latency and slow queries

Check instance host: CPU, memory, network saturation.
Check disk metrics: IOPS, throughput, latency (provider metrics).
Compare actual IOPS to provisioned IOPS.
Inspect queue depth and application IO patterns.
Consider upgrading disk tier or striping across multiple volumes.

Issue: Disk not attached or not visible in VM

Confirm volume status via provider console/CLI (attached state).
In Linux, run lsblk or sudo fdisk -l.
On Windows, check Disk Management or Get-Disk in PowerShell.
Verify that the VM agent and hypervisor integration components are healthy and up to date.

Issue: Snapshot restore failed or inconsistent

Ensure application-consistent snapshot was used for DBs.
Check snapshot chain for corruption; attempt restore to new disk and mount read-only to inspect.
Use database recovery tools (restore logs, apply transactions) if necessary.

Keypoints — troubleshooting

Always correlate application metrics with storage metrics when troubleshooting latency.
Use read replicas (for DBs) to offload reporting from primary disk I/O.

Security features & encryption

Block storage supports encryption-at-rest and encryption-in-transit. Providers allow customer-managed keys (CMK) via KMS/HSM, or provider-managed keys.

Recommended configuration

Encrypt all volumes in production with CMK where possible.
Use IAM roles/policies to limit attach/detach privileges.
Audit disk and snapshot usage; monitor for orphaned snapshots/disks.

Backup & restore capabilities

Best-practice: adopt a 3-2-1 strategy adapted for cloud:

3 copies of data
2 types of media (block snapshot + object storage export)
1 off-site copy (cross-region snapshot or replication)

Data migration & hybrid cloud integration

Migration strategies:

Lift-and-shift: snapshot/replicate disks to cloud region and restore to VM.
Block-level replication tools (Azure Migrate, AWS Application Migration Service, which replaced Server Migration Service, and Storage Gateway).
Database-native migration (Data Migration Service, replication, logical export/import).

Monitoring & capacity planning

Monitor metrics: latency, IOPS, throughput, queue length, burst credits, and snapshot sizes. Use provider monitoring (Azure Monitor, Amazon CloudWatch, Google Cloud Monitoring, formerly Stackdriver) and integrate with SIEM/observability tools.

Sample metric alerts (example)

Alert if average latency > 20ms for 5 minutes.
Alert if IOPS >= 90% of provisioned for 10 minutes.
Alert if snapshot storage increases > 30% month-over-month.
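The three sample alerts can be expressed as small predicates over metric samples. The sketch below assumes one sample per minute and uses the thresholds from the list above as defaults; the function names are illustrative, and in practice these rules would be configured in the provider's monitoring service rather than hand-coded.

```python
def avg_latency_alert(latency_ms: list, window: int = 5, threshold_ms: float = 20.0) -> bool:
    """Fire if the average latency over the last `window` minutes exceeds the threshold."""
    recent = latency_ms[-window:]
    return len(recent) == window and sum(recent) / window > threshold_ms


def iops_saturation_alert(iops: list, provisioned: int, window: int = 10, ratio: float = 0.9) -> bool:
    """Fire if IOPS stayed at or above `ratio` of provisioned for the whole window."""
    recent = iops[-window:]
    return len(recent) == window and all(v >= ratio * provisioned for v in recent)


def snapshot_growth_alert(prev_gb: float, curr_gb: float, max_growth: float = 0.30) -> bool:
    """Fire if snapshot storage grew by more than `max_growth` month over month."""
    return prev_gb > 0 and (curr_gb - prev_gb) / prev_gb > max_growth


print(avg_latency_alert([12, 18, 25, 30, 22, 19]))  # True  (last 5 average 22.8 ms)
print(iops_saturation_alert([2900] * 10, 3000))     # True
print(snapshot_growth_alert(100, 125))              # False (25% growth)
```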

High availability design

High availability for storage requires zoning and replication:

Use zone-redundant or region-redundant storage for critical volumes.
Leverage database clustering with shared/dedicated volumes (e.g., clustered file systems or multi-attach where supported).
Design stateless application tiers where possible and replicate stateful tiers via database replication.

Hybrid & on-prem integration

Hybrid patterns include synchronous/near-synchronous replication to on-prem arrays, or caching/gateway solutions that present cloud volumes locally. Tools: Azure File Sync, AWS Storage Gateway, vendor replication appliances.

Keypoints — hybrid

Network latency is the major hindrance to synchronous hybrid replication.
Use asynchronous replication when across long distances; design for eventual consistency accordingly.
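A back-of-the-envelope model shows why distance rules out synchronous replication. It assumes signals travel through fibre at roughly 200,000 km/s (about 1 ms of round trip per 100 km) and ignores routing and equipment delays, so real round-trip times are higher. The function names and the 1 ms disk latencies are assumptions for illustration.

```python
def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on network round-trip time over fibre (about 1 ms per 100 km)."""
    return distance_km / 100.0


def sync_write_latency_ms(local_ms: float, remote_ms: float, rtt_ms: float) -> float:
    """A synchronous write is acknowledged only once both copies are durable."""
    return max(local_ms, rtt_ms + remote_ms)


def async_write_latency_ms(local_ms: float) -> float:
    """An asynchronous write is acknowledged after the local copy; the remote copy lags."""
    return local_ms


rtt = min_rtt_ms(4000)                       # sites 4,000 km apart
print(rtt)                                   # 40.0
print(sync_write_latency_ms(1.0, 1.0, rtt))  # 41.0
print(async_write_latency_ms(1.0))           # 1.0
```

With asynchronous replication the application keeps its 1 ms writes, and the cost moves to the RPO: whatever has not yet been shipped to the remote site is lost in a failover.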

Best practices summary

Match disk type to workload (SSD for DBs, HDD for archival/sequential reads).
Separate OS, logs, data on different volumes to isolate I/O.
Automate snapshot lifecycle (retention, copy, deletion).
Encrypt volumes and use least-privilege IAM for storage operations.
Monitor and right-size volumes periodically; remove orphaned disks and snapshots.
Test restores and DR playbooks regularly.
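Snapshot lifecycle automation usually comes down to a retention rule. Below is a minimal sketch of one such rule: delete snapshots older than a retention period, but always protect the newest few so a volume is never left without a restore point. The snapshot IDs, dates and default values are hypothetical.

```python
from datetime import date, timedelta


def expired_snapshots(snapshots: dict, today: date,
                      retention_days: int = 30, keep_latest: int = 3) -> list:
    """Return IDs of snapshots older than the retention period, never touching the newest `keep_latest`."""
    newest_first = sorted(snapshots, key=lambda sid: snapshots[sid], reverse=True)
    protected = set(newest_first[:keep_latest])
    cutoff = today - timedelta(days=retention_days)
    return sorted(sid for sid, taken in snapshots.items()
                  if taken < cutoff and sid not in protected)


snaps = {
    "snap-a": date(2025, 1, 10),
    "snap-b": date(2025, 3, 1),
    "snap-c": date(2025, 5, 20),
    "snap-d": date(2025, 6, 25),
}
print(expired_snapshots(snaps, today=date(2025, 6, 30)))                 # ['snap-a']
print(expired_snapshots(snaps, today=date(2025, 6, 30), keep_latest=1))  # ['snap-a', 'snap-b', 'snap-c']
```

Returning a list instead of deleting directly keeps the decision separate from the destructive call, which makes the rule easy to review and test.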

Advanced automation & real-world scripts

Below are longer script examples you can adapt for scheduled cleanup, discovery, and health checks.

Azure PowerShell — cleanup script (unattached disks older than 30 days)

# Cleanup unattached managed disks older than X days
Import-Module Az
Connect-AzAccount

$days = 30
$cutoff = (Get-Date).AddDays(-$days)

$disks = Get-AzDisk
$unattached = $disks | Where-Object { -not $_.ManagedBy -and $_.TimeCreated -lt $cutoff }

foreach ($d in $unattached) {
  Write-Output "Found unattached disk: $($d.Name) in RG $($d.ResourceGroupName), size: $($d.DiskSizeGB)GB, created: $($d.TimeCreated)"
  # Optionally remove after approval
  # Remove-AzDisk -ResourceGroupName $d.ResourceGroupName -DiskName $d.Name -Force
}

Azure Resource Graph + PowerShell — inventory all disks + VM pairing

# Use Search-AzGraph to retrieve disk inventory and cross-check VMs
# -First raises the page size from the default of 100; sku and managedBy are top-level columns
$results = Search-AzGraph -First 1000 -Query @"
Resources
| where type =~ 'microsoft.compute/disks'
| project diskName = name, resourceGroup, location, skuName = tostring(sku.name), sizeGB = toint(properties.diskSizeGB), managedBy
"@

$results | ConvertTo-Json -Depth 5

AWS Bash snippet — delete snapshots older than 90 days (careful!)

# AWS CLI - delete snapshots older than 90 days owned by this account
# Requires GNU date (date -d). Comment out the delete line for a dry run first.
# Snapshots backing a registered AMI cannot be deleted and will return an error.
DAYS=90
NOW=$(date +%s)
aws ec2 describe-snapshots --owner-ids self --query "Snapshots[*].[SnapshotId,StartTime]" --output text | \
while read -r sid stime; do
  stime_s=$(date -d "$stime" +%s)
  age=$(( (NOW - stime_s) / 86400 ))
  if [ "$age" -gt "$DAYS" ]; then
    echo "Deleting snapshot $sid age $age days"
    aws ec2 delete-snapshot --snapshot-id "$sid"
  fi
done

FAQs: frequently asked questions & quick answers

Q: Can I attach a single block volume to multiple VMs?

A: Some providers offer multi-attach volumes (AWS EBS Multi-Attach, Azure Shared Disks) for clustered file systems. Application-level coordination (cluster-aware FS) is required to prevent corruption.

Keypoint: Multi-attach is intended for clustered applications (e.g., clustered databases or clustered file systems) and requires careful testing.

Q: Is resizing a disk destructive?

A: Resizing a managed disk is generally non-destructive — you can increase size online. After resizing the volume at provider level, you must extend the partition and filesystem inside the guest OS.

Keypoint: Shrinking volumes is usually unsupported or risky — snapshot first.

Q: How often should I snapshot?

A: Snapshot frequency depends on RPO requirements. For critical DBs, consider transaction log shipping or continuous replication plus periodic full snapshots. For less critical systems, daily snapshots may suffice.
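When snapshots are the only protection, the worst-case data loss equals the interval between them, so the RPO sets a floor on snapshot frequency. A minimal sketch of that arithmetic, with illustrative function names:

```python
import math


def snapshots_per_day(rpo_minutes: int) -> int:
    """Minimum snapshots per day so consecutive snapshots are at most rpo_minutes apart."""
    if rpo_minutes <= 0:
        raise ValueError("rpo_minutes must be positive")
    return math.ceil(24 * 60 / rpo_minutes)


def meets_rpo(snapshot_interval_minutes: int, rpo_minutes: int) -> bool:
    """A failure just before the next snapshot loses one full interval of writes."""
    return snapshot_interval_minutes <= rpo_minutes


print(snapshots_per_day(240))   # 6
print(snapshots_per_day(15))    # 96
print(meets_rpo(1440, 240))     # False (daily snapshots cannot meet a 4-hour RPO)
```

An RPO of a few minutes implies so many snapshots that log shipping or continuous replication becomes the more practical choice.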

Q: How to choose between Provisioned IOPS and standard SSD?

A: Choose Provisioned IOPS when you need guaranteed IOPS and consistent low latency. Standard SSD is sufficient for many general-purpose workloads and is cheaper.

Mini case study: Migration of on-prem DB to cloud block storage

Scenario: Enterprise running SQL Server on SAN planning move to Azure. Steps taken:

Inventory SAN LUNs and their IOPS/throughput using perfmon.
Map LUNs to managed disks; choose Premium SSD for data, Ultra for high IOPS portions, Standard for backups.
Use Azure Migrate for the lift-and-shift; for the final cutover, restore a full backup and use log shipping to keep the cutover window (and RTO) short.
After cutover, observe disk metrics and right-size disks and VM SKU within 2 weeks.

Keypoints

Measure real I/O before picking disk type; cloud metrics/instrumentation help size accurately.
Test application behavior on same disk types (staging) before production migration.

Operational checklists (copy into runbooks)

Provisioning runbook

Confirm IOPS/throughput requirements (peak vs sustained).
Choose disk tier and attach to VM in required AZ.
Initialize and format with appropriate block size and alignment.
Ensure monitoring/alerts configured.
Enable encryption and set IAM/policies.
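For the first step of the provisioning runbook, peak and sustained demand can be separated from collected IOPS samples with a median and a high percentile. The sketch below uses the nearest-rank percentile method; the sample values are invented, and the choice of the 90th percentile is an assumption that should follow your own sizing policy.

```python
import statistics


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct is an integer from 1 to 100)."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def sizing_summary(iops_samples: list) -> dict:
    """Summarize sustained (median), 90th percentile and absolute peak IOPS."""
    return {
        "sustained": statistics.median(iops_samples),
        "p90": percentile(iops_samples, 90),
        "peak": max(iops_samples),
    }


samples = [800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 4000, 9000]
print(sizing_summary(samples))  # {'sustained': 1250.0, 'p90': 4000, 'peak': 9000}
```

Sizing for the absolute peak here would provision more than seven times the sustained load, so a burst-capable tier sized near the 90th percentile may be the cheaper fit.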

Pre-snapshot checklist for databases

Notify stakeholders and freeze write operations if possible.
Flush DB caches and perform VSS or app-consistent snapshot.
Verify snapshot chain and replicate copy off-site if required.

Closing — where to go next

This guide covered architecture, provider comparisons, performance tuning and operational scripts you can use immediately to discover, troubleshoot and manage block storage for VMs. For more provider-specific tuning and step-by-step examples, see the provider docs linked in each provider section and adapt the PowerShell/AWS CLI scripts to your environment.


Final FAQs — short answers & pointers

Encryption: Use customer-managed keys (CMK, called CMEK on Google Cloud) when compliance requires customer control of keys.
Orphans: Periodically scan for unattached disks & snapshots — they cost money.
Testing: Regularly test restores; snapshots alone are not sufficient unless validation is performed.

— End of guide —

Content contains actionable code examples; adapt credentials, resource names and permissions before running in production. Hyperlinks for topic keywords point to cloudknowledge.in for further reading.
