Networking & Connectivity in AWS: VPCs, Transit Gateway, Peering, and Hybrid (VPN / Direct Connect) — A Practical Guide
Design multi-account networks, fix routing/NAT/peering issues, connect on-prem securely, and control egress/latency—with scripts and step-by-step runbooks.
AWS networking can feel daunting—CIDRs, route tables, NAT gateways, security groups, NACLs, VPC peering, AWS Transit Gateway, AWS Direct Connect, and hybrid patterns all collide. This deep-dive gives you a clear blueprint for multi-account design, a precise troubleshooting runbook for connectivity issues, and a cost/latency checklist you can apply immediately.
What You’ll Learn (No Index, just value)
- How to pick between VPC Peering and Transit Gateway in multi-account topologies.
- Step-by-step fixes for routing, NAT, SG/NACL, and peering failures (with commands and diagrams).
- Hybrid connectivity choices: Site-to-Site VPN vs. Direct Connect, plus SD-WAN tips.
- How to keep egress charges under control and design for low latency without over-engineering.
- PowerShell (AWS Tools) scripts to enumerate routes, TGW attachments, and check security surfaces.
1) Core Building Blocks (Quick Refresh)
VPC & Subnets
A Virtual Private Cloud (VPC) is your private CIDR space (e.g., 10.0.0.0/16) sliced into subnets (public with IGW routes, private via NAT/TGW/DX/VPN).
- Public subnet: route to an Internet Gateway (IGW) for inbound and outbound internet traffic.
- Private subnet: route to NAT Gateway or TGW/DX/VPN for north-south traffic.
- Isolated subnet: no IGW/NAT—east-west only via peering or TGW.
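As a rough illustration of slicing a VPC CIDR into subnets, the sketch below uses Python's standard ipaddress module. The /24 size and the three tier names are assumptions chosen for the example, not a recommended layout.

```python
import ipaddress

def carve_subnets(vpc_cidr, new_prefix, count):
    """Return the first `count` subnets of size /new_prefix inside vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = []
    for i, net in enumerate(vpc.subnets(new_prefix=new_prefix)):
        if i >= count:
            break
        subnets.append(str(net))
    return subnets

# Hypothetical plan: one subnet per tier (a real design repeats this per AZ)
tiers = ["public", "private", "isolated"]
plan = dict(zip(tiers, carve_subnets("10.0.0.0/16", 24, len(tiers))))
print(plan)
```

Extending the tier list (for example one entry per tier and AZ) keeps the allocation sequential and non-overlapping.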
Route Tables
Each subnet associates to one route table. Overlapping CIDRs or missing routes are the most common outage causes. Watch for black-holes when a target (e.g., NAT) is deleted.
Tip: turn on route propagation for TGW/VPN/DX where appropriate, but keep explicit static routes for control.
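To see why a deleted target turns into a black-hole, it can help to model how a route table picks a route. The sketch below is a simplified model, not AWS code: it applies longest-prefix match to a hypothetical route table in which the default route still points at a removed NAT gateway. All target names are placeholders.

```python
import ipaddress

def lookup(routes, dest_ip):
    """Pick the most specific route whose CIDR contains dest_ip (longest-prefix match)."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [r for r in routes if ip in ipaddress.ip_network(r["cidr"])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r["cidr"]).prefixlen)

# Hypothetical route table for a private subnet
routes = [
    {"cidr": "10.0.0.0/16", "target": "local", "state": "active"},
    {"cidr": "10.1.0.0/16", "target": "tgw-example", "state": "active"},
    {"cidr": "0.0.0.0/0", "target": "nat-example", "state": "blackhole"},
]

for dest in ("10.1.20.15", "8.8.8.8"):
    r = lookup(routes, dest)
    print(dest, "->", r["target"], r["state"])
```

The internet-bound lookup still matches the default route, so traffic is dropped at the black-hole rather than falling through to another path.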
Security Groups vs NACLs
- Security Groups: stateful, instance-level, allow rules only.
- NACLs: stateless, subnet-level, ordered allow/deny for ingress/egress.
Start permissive at NACLs; enforce least-privilege in SGs. Double-denies (NACL+SG) cause silent drops.
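The ordered, first-match behaviour of NACLs is easy to get wrong, so here is a minimal model of it. The rule numbers, CIDRs, and ports are made up for illustration, and the evaluator ignores protocol and direction to stay short.

```python
import ipaddress

def evaluate_nacl(rules, src_ip, port):
    """Return 'allow' or 'deny' using the first match in ascending rule-number order."""
    ip = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if ip in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]
    return "deny"  # the implicit final rule denies whatever nothing else matched

# Hypothetical rules: a broad allow at 100 and a narrower deny at 90
rules = [
    {"number": 100, "cidr": "10.0.0.0/8", "ports": (5432, 5432), "action": "allow"},
    {"number": 90, "cidr": "10.1.0.0/16", "ports": (0, 65535), "action": "deny"},
]

print(evaluate_nacl(rules, "10.1.20.15", 5432))  # rule 90 matches first
print(evaluate_nacl(rules, "10.2.0.5", 5432))    # only rule 100 matches
```

A deny with a lower number than the allow wins even though the allow looks more relevant, which is the usual source of silent drops.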
Gateways & Connectors
- IGW: internet access for public subnets.
- NAT Gateway: egress for private subnets (per-AZ for resilience).
- VPC Peering: point-to-point, no transitive routing.
- Transit Gateway (TGW): hub-and-spoke, scalable, transitive.
- Site-to-Site VPN: IPSec tunnels over the internet.
- Direct Connect (DX): private dedicated link; stable latency, predictable egress pricing.
2) Multi-Account, Multi-VPC Architecture: Peering vs. Transit Gateway
Many teams start with VPC peering for a few accounts, then hit a wall as connectivity grows. AWS Transit Gateway then becomes the central hub: it handles hundreds of attachments, separate route domains (TGW route tables, whose routes are either active or blackhole), and inter-Region peering between TGWs. Use these decision cues:
Choose VPC Peering if…
- ≤ 5–10 connections, simple point-to-point flows.
- No need for transitive routing (A cannot reach C via B).
- Primarily intra-Region traffic and consistent CIDR planning.
- Budget is tight—no TGW per-GB data processing fees.
Choose Transit Gateway if…
- Many accounts/VPCs with hub-and-spoke topologies.
- Need transitive routing across VPCs and on-prem.
- Desire centralized policy and segmented route domains.
- Growth to multi-Region with TGW peering.
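One way to make the scaling difference concrete is to count connections. Because peering is not transitive, full any-to-any reachability needs one peering per VPC pair, while a hub needs one attachment per VPC. The sketch below is plain arithmetic and assumes a full mesh is actually required.

```python
def peering_connections(vpcs):
    """Peerings needed for a full mesh: one per unordered pair of VPCs."""
    return vpcs * (vpcs - 1) // 2

def tgw_attachments(vpcs):
    """Attachments needed on a single hub: one per VPC."""
    return vpcs

for n in (3, 5, 10, 50):
    print(f"{n:>3} VPCs: {peering_connections(n):>5} peerings vs {tgw_attachments(n):>3} TGW attachments")
```

At 10 VPCs the mesh already needs 45 peerings, each with routes to maintain on both sides.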
Reference Patterns
- Spoke VPCs per workload, a Shared Services VPC, and an Inspection VPC via GWLB or NVA.
- Segment routes with multiple TGW route tables: prod, non-prod, shared, inspection.
- Inter-Region: TGW-to-TGW peering + selective static routes (peering attachments do not propagate routes).
Gotchas (Save Hours Later)
- No transitivity in VPC peering: add explicit peering pairs or move to TGW.
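- Overlapping CIDRs: peering cannot be created between overlapping VPCs, and a TGW cannot route cleanly between them; the fix is NAT or renumbering.
- Asymmetric routing: with inspection/firewall paths, return traffic often takes a different path and bypasses the stateful appliance, which then drops the flow. Design for symmetric paths.
- NAT placement costs: a NAT Gateway per AZ improves availability and avoids cross-AZ hops, but each gateway adds its own hourly charge.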
3) Step-by-Step Troubleshooting: VPC Connectivity (Routing, NAT, Peering, SG/NACL)
Use this structured runbook to reduce MTTR. Move top-to-bottom without skipping.
Step 0 — Scope & Baseline
- Define source (IP/ENI/Instance/ALB) and destination (IP/Port/Service).
- Confirm CIDRs and that they do not overlap across VPCs/on-prem.
- Collect instance-side evidence: ping, traceroute, curl -v.
Step 1 — Security Surfaces
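- Security Groups: allow the destination port/protocol inbound on the target and outbound on the source; return traffic is allowed automatically because SGs are stateful.
- NACLs: ensure matching ingress and egress allows (including ephemeral return ports, since NACLs are stateless); avoid an accidental DENY whose rule number is lower than the ALLOW, because rules are evaluated in ascending order.
- OS firewalls: iptables and Windows Firewall are often overlooked.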
Rule of thumb: Start permissive to prove path, then tighten to least-privilege.
Step 2 — Route Tables
- On the source subnet, verify a route exists to the destination CIDR via correct target (ENI/Peering/TGW/NAT/IGW/VPN/DX).
- On the return path, verify symmetric route back to source CIDR.
- Look for black-hole routes where targets are deleted.
Step 3 — Middleboxes & Policies
- If using Gateway Load Balancer or firewalls, ensure health checks are passing and routes steer both directions through the same inspection path.
- For VPC Peering, add explicit routes in both VPCs’ route tables.
- For TGW, confirm attachment is available and the right TGW route table is associated/propagated.
Step 4 — Underlay and Name Resolution
- DNS: Private hosted zones need VPC associations; enable enableDnsHostnames and enableDnsSupport on the VPC.
- MTU/Path-MTU: IPSec/GRE overlays reduce MTU (e.g., 1500 → 1436/1399). Symptoms: large payloads stall while small ones succeed.
- Health of VPN tunnels: failover events can drop flows mid-session; retest after the tunnel re-establishes.
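The MTU figures translate directly into the largest TCP segment and the largest unfragmented ping payload. The sketch below works through that arithmetic, assuming IPv4 and TCP headers without options (20 bytes each) and an 8-byte ICMP header; the 1436 value is simply the example tunnel MTU mentioned above.

```python
IPV4_HEADER = 20   # bytes, no IP options
TCP_HEADER = 20    # bytes, no TCP options
ICMP_HEADER = 8    # bytes, echo request header

def tcp_mss(path_mtu):
    """Largest TCP payload that fits in one packet on this path."""
    return path_mtu - IPV4_HEADER - TCP_HEADER

def max_ping_payload(path_mtu):
    """Largest ICMP echo payload that fits without fragmentation."""
    return path_mtu - IPV4_HEADER - ICMP_HEADER

for mtu in (1500, 1436):
    print(f"MTU {mtu}: TCP MSS {tcp_mss(mtu)}, max ping payload {max_ping_payload(mtu)}")
```

Pinging with the don't-fragment flag at the computed payload size, then one byte larger, is a quick way to confirm the real path MTU.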
CLI/Scripted Checks (AWS Tools for PowerShell)
Install module: Install-Module -Name AWSPowerShell.NetCore -Scope CurrentUser
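# List VPCs and CIDRs
Get-EC2Vpc | Select-Object VpcId, CidrBlock, IsDefault
# DNS settings are VPC attributes, so query them per VPC
Get-EC2Vpc | ForEach-Object {
    $id = $_.VpcId
    [PSCustomObject]@{
        VpcId              = $id
        EnableDnsSupport   = (Get-EC2VpcAttribute -VpcId $id -Attribute enableDnsSupport).EnableDnsSupport
        EnableDnsHostnames = (Get-EC2VpcAttribute -VpcId $id -Attribute enableDnsHostnames).EnableDnsHostnames
    }
}
# Show subnets
Get-EC2Subnet | Select-Object SubnetId, VpcId, CidrBlock, AvailabilityZone
# Show subnet-to-route-table associations (these live on the route table, not the subnet)
Get-EC2RouteTable | ForEach-Object { $_.Associations } | Select-Object RouteTableId, SubnetId, Main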
# Enumerate route tables and routes (spot black-holes)
Get-EC2RouteTable | ForEach-Object {
    $rt = $_   # keep the route table; inside the inner loop $_ is the route
    $rt.Routes | ForEach-Object {
        [PSCustomObject]@{
            RouteTableId = $rt.RouteTableId
            Destination  = $_.DestinationCidrBlock
            State        = $_.State
            # Only one target field is set per route; drop the empty ones before joining
            Target       = (@($_.GatewayId,$_.NatGatewayId,$_.TransitGatewayId,$_.VpcPeeringConnectionId,$_.NetworkInterfaceId) | Where-Object { $_ }) -join '/'
        }
    }
} | Sort-Object RouteTableId, Destination | Format-Table -AutoSize
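# Check peering connections and status (Status is an object; Code holds the state)
Get-EC2VpcPeeringConnection | Select-Object VpcPeeringConnectionId, @{n='Status'; e={ $_.Status.Code }}, @{n='RequesterCidr'; e={ $_.RequesterVpcInfo.CidrBlock }}, @{n='AccepterCidr'; e={ $_.AccepterVpcInfo.CidrBlock }}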
# Check Transit Gateway attachments and route tables
Get-EC2TransitGatewayAttachment | Select-Object TransitGatewayAttachmentId, TransitGatewayId, ResourceType, ResourceId, State
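Get-EC2TransitGatewayRouteTable | ForEach-Object {
    $rt = $_
    # TGW routes are read with Search-EC2TransitGatewayRoute, which requires at least one filter;
    # this filter returns both active and blackhole routes
    Search-EC2TransitGatewayRoute -TransitGatewayRouteTableId $rt.TransitGatewayRouteTableId -Filter @{ Name = 'state'; Values = 'active','blackhole' } -Select Routes | ForEach-Object {
        [PSCustomObject]@{
            TgwRouteTableId = $rt.TransitGatewayRouteTableId
            Destination     = $_.DestinationCidrBlock
            Type            = $_.Type
            State           = $_.State
            Attachment      = ($_.TransitGatewayAttachments | Select-Object -ExpandProperty TransitGatewayAttachmentId) -join ','
        }
    }
} | Format-Table -AutoSize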
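# VPN tunnel health (per-tunnel status comes from VgwTelemetry, for VGW- and TGW-attached connections)
Get-EC2VpnConnection | Select-Object VpnConnectionId, State, VpnGatewayId, TransitGatewayId,
    @{n='Tunnels'; e={ ($_.VgwTelemetry | ForEach-Object { "$($_.OutsideIpAddress)=$($_.Status)" }) -join ', ' }}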
Interpretation tips:
- Peering: Status must be active on both sides; ensure routes in both VPCs include each other’s CIDRs via the peering ID.
- TGW: Attachments should be available. The spoke VPC subnets must associate with the correct VPC route tables that point to the TGW.
- Black-hole: Route state shows blackhole when target is missing or down—fix target or delete/replace route.
4) Hybrid Connectivity: On-Prem to AWS (VPN vs. Direct Connect)
For migrations, the common path starts with Site-to-Site VPN (fast, internet-based), then moves to Direct Connect for predictable latency and cost. Many enterprises run both: DX for primary, VPN for backup.
When to Prefer VPN
- Need speed to market (days, not weeks).
- Lower throughput is acceptable (each tunnel tops out at roughly 1.25 Gbps); dynamic routing with BGP is available.
- Budget constraints or temporary migration phases.
When to Prefer Direct Connect
- Stable, low-jitter latency is critical (databases, FIX/market data feeds; true HFT is rarely run in the cloud, but consistent latency still matters).
- Large/steady data transfer where DX egress pricing beats internet egress.
- Compliance or network segmentation needs a private underlay.
Design Notes
- Redundancy: two VPN tunnels per connection; for DX, use Link Aggregation Group (LAG) or dual circuits across different providers/POPs.
- Routing: Use BGP for failover and route summarization; remember that the longest matching prefix wins before any BGP attribute is compared, which can cause unintended path selection.
- MTU: Set consistent MTU across routers and EC2; enable jumbo frames where supported end-to-end.
PowerShell: Verify DX/VPN Advertised Routes via TGW
# List TGW route tables and identify learned (propagated) routes from on-prem
$tgwRts = Get-EC2TransitGatewayRouteTable
foreach($t in $tgwRts){
    Write-Host "== TGW RT:" $t.TransitGatewayRouteTableId
    # The 'type' filter keeps only propagated routes
    Search-EC2TransitGatewayRoute -TransitGatewayRouteTableId $t.TransitGatewayRouteTableId -Filter @{ Name = 'type'; Values = 'propagated' } -Select Routes |
        Select-Object DestinationCidrBlock, State, @{n='Attachments'; e={ ($_.TransitGatewayAttachments | ForEach-Object TransitGatewayAttachmentId) -join ',' }} |
        Format-Table -AutoSize
}
Confirm your on-prem prefixes appear as propagated; if not, debug BGP neighbors on the CPE/DXGW side.
5) Data Transfer, Egress Fees & Latency-Optimized Design
Networking cost is often invisible until the bill arrives. Keep traffic in-Region, in-AZ where possible, and avoid unnecessary hairpins through inspection VPCs. When you must cross Regions or the public internet, pick the right connector and placement.
Cost-Control Checklist
- Prefer VPC endpoints (Gateway/Interface) over public egress to AWS APIs/S3.
- Size NAT Gateway placement deliberately: one per AZ adds hourly charges, while a shared NAT adds cross-AZ data transfer and a single point of failure (balance cost with resilience).
- Minimize cross-AZ and cross-Region chatter; co-locate tightly coupled tiers.
- For heavy steady flows, consider DX pricing vs. internet egress.
- For multi-account traffic, compare TGW data processing vs. peering for high-volume 1:1 flows.
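When comparing a hub path against a direct path for a high-volume flow, a small calculator makes the trade-off visible. The sketch below is only a template: the rates are PLACEHOLDERS invented for illustration and are not AWS prices, so substitute the current figures from the pricing pages for your Region before drawing conclusions.

```python
HOURS_PER_MONTH = 730

def monthly_cost(gb, per_gb_rate, hourly_rate=0.0, hours=HOURS_PER_MONTH):
    """Data charge plus any fixed hourly charge for one month."""
    return gb * per_gb_rate + hourly_rate * hours

# PLACEHOLDER rates for illustration only; they are NOT AWS prices.
HUB = {"per_gb_rate": 0.02, "hourly_rate": 0.05}    # hub path with a per-GB processing fee
DIRECT = {"per_gb_rate": 0.01, "hourly_rate": 0.0}  # direct point-to-point path

for gb in (100, 10_000, 1_000_000):
    hub = monthly_cost(gb, **HUB)
    direct = monthly_cost(gb, **DIRECT)
    print(f"{gb:>9} GB  hub={hub:>10.2f}  direct={direct:>10.2f}  delta={hub - direct:>10.2f}")
```

With real rates plugged in, the delta column shows at what monthly volume a dedicated 1:1 path starts to pay for its extra operational overhead.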
Latency-First Placement
- Keep app ↔ DB in the same AZ where resilience model allows; otherwise use read replicas and asynchronous paths across AZs.
- Place caches (ElastiCache) near producers/consumers; avoid cross-AZ hot paths.
- Use Global Accelerator for TCP acceleration to the nearest AWS edge for global users.
6) End-to-End Runbook: “It works in VPC-A, but not from VPC-B”
- Identify path: VPC-B → TGW → VPC-A (App). Get source/destination IP:port.
- Check SG/NACL: Allow from VPC-B CIDR to target port. NACL allow both directions.
- Verify TGW association: VPC-B attachment associated to correct TGW route table.
- Check TGW routes: Is VPC-A CIDR present (static or propagated)? State = active.
- Spoke route tables: Subnets in VPC-B have a route to VPC-A CIDR via TGW; VPC-A subnets route back to VPC-B via TGW.
- DNS & MTU: Resolve to private IP; test with ping -s / ping -l (size) to detect fragmentation issues.
- Re-test with bypass: Temporarily bypass inspection/firewall to isolate whether the middlebox is the culprit.
PowerShell Helper: Source-Dest Reachability Snapshot
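# Requires the SSM Agent on the source host and an instance profile that allows Systems Manager
# This uses SSM Run Command to run reachability checks without public SSH/RDP
$instanceId = "i-xxxxxxxxxxxx"
$dest = "10.1.20.15"
$port = 5432
# Use ${dest}:${port}; PowerShell would read "$dest:$port" as a scoped variable reference.
# Assumption: everything after the date/echo lines is an illustrative completion
# (the checks mirror the TCP, HTTP-header, routing, and DNS items listed below).
Send-SSMCommand -InstanceId $instanceId -DocumentName "AWS-RunShellScript" -Parameter @{ commands = @(
    "date",
    "echo Testing to ${dest}:${port}",
    "timeout 3 bash -c '</dev/tcp/${dest}/${port}' && echo TCP_OK || echo TCP_FAIL",
    "curl -sI -m 3 http://${dest}:${port}/ || echo NO_HTTP_RESPONSE",
    "ip route get ${dest}",
    "getent hosts ${dest} || echo NO_DNS_ENTRY"
) }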
Validate basic TCP reachability, curl headers, routing, and DNS on the instance—no public exposure needed.
7) Designing the Shared Services & Inspection VPCs
Central services (AD/DNS/PKI, logging, patch, CI/CD) often live in a shared VPC reached via TGW.
- DNS: Route 53 resolver endpoints (inbound/outbound) for hybrid name resolution.
- Inspection: Use GWLB or managed appliances; enforce symmetric routing; consider separate TGW route table.
- Logging: VPC Flow Logs to centralized S3/CloudWatch; tag by account/application.
Policy Guardrails (Org-level)
- Service Control Policies: restrict IGW/NAT creation to network accounts only.
- Mandatory VPC endpoints (S3/DynamoDB) via org-wide IaC.
- Baseline SGs/NACLs via StackSets; enforce tagging standards.
8) Practical IaC Snippets (for Consistency)
Terraform: Minimal TGW Hub
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "Org TGW"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  dns_support                     = "enable"
  vpn_ecmp_support                = "enable"
}
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
}
# Associate VPC attachments to prod table; add static routes if needed
CloudFormation: VPC Endpoint for S3
{
  "Resources": {
    "S3GatewayEndpoint": {
      "Type": "AWS::EC2::VPCEndpoint",
      "Properties": {
        "ServiceName": { "Fn::Sub": "com.amazonaws.${AWS::Region}.s3" },
        "VpcId": { "Ref": "VPC" },
        "RouteTableIds": [{ "Ref": "PrivateRtA" }, { "Ref": "PrivateRtB" }],
        "VpcEndpointType": "Gateway"
      }
    }
  }
}
9) Frequently Searched Scenarios & Fixes
“AWS VPC peering issues”
- Both sides must add routes to each other’s CIDRs via the peering connection.
- SGs/NACLs must allow traffic; default SGs might be too restrictive.
- No transitive routing—if you need A↔C via B, you must add A↔C peering or adopt TGW.
“AWS Transit Gateway best practices”
- Separate TGW route tables per environment (prod/non-prod) and for inspection.
- Enable ECMP for VPN; use BGP for dynamic failover.
- Use TGW peering for inter-Region traffic with selective static routes (peering attachments do not propagate routes), so each side only learns the prefixes it needs.
“AWS Direct Connect vs VPN”
- DX gives stable latency and better egress economics for steady traffic; VPN is fast to deploy and flexible.
- Many run DX primary + VPN backup. Validate BGP MED/local-pref to prefer DX.
- Consider DX Gateway for multi-Region and multi-account reachability.
10) Operational Excellence: Testing, Monitoring, Documentation
- Testing: Synthetic canaries for critical ports; Path MTU discovery tests; packet captures during changes.
- Monitoring: CloudWatch metrics for TGW/VPCE/VPN; alarms on BlackHole routes, VPN state changes.
- Documentation: Diagrams per account with CIDR inventories; change logs for route/table edits.
Flow Logs Quick Filters
# Find drops from a specific source to a destination in VPC Flow Logs (Athena)
SELECT * FROM vpc_flow_logs
WHERE srcaddr = '10.0.12.34'
AND dstaddr = '10.1.20.15'
AND action = 'REJECT'
AND start >= to_unixtime(current_timestamp - interval '1' hour);
If REJECT, identify whether SG/NACL caused it; if ACCEPT but app fails, check app-layer ACLs or health.
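If the Athena table is not set up yet, raw flow log records can be filtered the same way. The sketch below assumes the default (version 2) record format with its space-separated fields; the sample lines, account ID, and ENI IDs are fabricated placeholders.

```python
# Field order of the default flow log format
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

def parse(line):
    """Turn one space-separated flow log record into a dict."""
    return dict(zip(FIELDS, line.split()))

def rejects(lines, src, dst):
    """Records from src to dst that were rejected by an SG or NACL."""
    return [r for r in map(parse, lines)
            if r["action"] == "REJECT" and r["srcaddr"] == src and r["dstaddr"] == dst]

# Placeholder sample records
sample = [
    "2 111111111111 eni-example1 10.0.12.34 10.1.20.15 49152 5432 6 3 180 1700000000 1700000060 REJECT OK",
    "2 111111111111 eni-example1 10.0.12.34 10.1.20.15 49153 443 6 10 4200 1700000000 1700000060 ACCEPT OK",
]

for r in rejects(sample, "10.0.12.34", "10.1.20.15"):
    print(r["dstport"], r["action"])
```

Grouping the rejected records by dstport quickly shows whether a single missing rule or a broader block is responsible.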
11) Quick Decision Guide (Cheat Sheet)
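- Peering: a handful of VPCs, simple point-to-point flows, no transitive routing, no TGW per-GB data processing fees.
- TGW: many accounts/VPCs, transitive routing across VPCs and on-prem, segmented route tables, multi-Region via TGW peering.
- DX: stable low-jitter latency and large, steady data transfer; commonly paired with VPN as backup.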
12) Security Hardening for Connectivity
- Default-deny SGs; allow only the required ports, scoped by CIDR or SG reference.
- NACLs: broad allows with explicit denies only when required (avoid accidental blocks).
- Central egress filtering via Inspection VPC + GWLB; maintain allowlists for AWS APIs via VPCEs.
- Rotate keys/certs for VPN/DX; enable CloudTrail and AWS Config rules for gateway changes.
13) Sample “Day-2” Command Set (Mix-n-Match)
# Identify route tables without a route to a TGW or IGW (their subnets may be isolated)
Get-EC2RouteTable | ForEach-Object {
    $rt = $_
    $hasTgwOrIgw = $rt.Routes | Where-Object { $_.GatewayId -like 'igw-*' -or $_.TransitGatewayId -like 'tgw-*' }
    if(-not $hasTgwOrIgw){ "Isolated? $($rt.RouteTableId)" }
}
# Find overlapping CIDRs between VPCs quickly
$vpcs = Get-EC2Vpc | Select-Object VpcId, CidrBlock
foreach($a in $vpcs){ foreach($b in $vpcs){
    # -lt visits each unordered pair once instead of reporting it twice
    if($a.VpcId -lt $b.VpcId){
        # naive check (identical network address only); for production use a proper CIDR overlap function
        if($a.CidrBlock.Split('/')[0] -eq $b.CidrBlock.Split('/')[0]){
            "Possible overlap pattern: $($a.VpcId) and $($b.VpcId) ($($a.CidrBlock) vs $($b.CidrBlock))"
        }
    }
}}
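The naive comparison above only catches CIDRs that share the same network address, so it misses a /17 nested inside a /16. A proper overlap test is short if you can run Python alongside the PowerShell tooling. The sketch below uses the standard ipaddress module; the VPC IDs and CIDRs are hypothetical inputs you would export from Get-EC2Vpc.

```python
import ipaddress
from itertools import combinations

def find_overlaps(vpcs):
    """vpcs: dict of VPC id -> CIDR string. Returns the overlapping id pairs."""
    nets = {vpc_id: ipaddress.ip_network(cidr) for vpc_id, cidr in vpcs.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]

# Hypothetical inventory: vpc-b sits inside vpc-a's range
vpcs = {
    "vpc-a": "10.0.0.0/16",
    "vpc-b": "10.0.128.0/17",
    "vpc-c": "10.1.0.0/16",
}
print(find_overlaps(vpcs))
```

Adding on-prem prefixes to the same dictionary extends the check to hybrid address plans.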
# Report VPN connections that are not 'available'
Get-EC2VpnConnection | Where-Object { $_.State -ne 'available' } |
Select-Object VpnConnectionId, State, TransitGatewayId, VpnGatewayId
Automation Idea
Create a scheduled job to snapshot route tables/TGW routes to S3 daily. Diff changes to catch accidental edits causing outages.
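The diff step of that job can be very small. The sketch below assumes each snapshot has already been flattened into a dictionary keyed by (route table ID, destination CIDR) with the target as the value; that shape, and the IDs used, are assumptions for the example rather than an AWS export format.

```python
def diff_routes(old, new):
    """Compare two snapshots mapping (route_table_id, destination) -> target."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed

# Hypothetical snapshots from two consecutive days
yesterday = {
    ("rtb-example1", "10.1.0.0/16"): "tgw-example",
    ("rtb-example1", "0.0.0.0/0"): "nat-example",
}
today = {
    ("rtb-example1", "10.1.0.0/16"): "pcx-example",
    ("rtb-example1", "10.2.0.0/16"): "tgw-example",
}

added, removed, changed = diff_routes(yesterday, today)
print("added:  ", added)
print("removed:", removed)
print("changed:", changed)
```

Alerting only when removed or changed is non-empty keeps the daily report quiet on days without edits.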
14) Putting It All Together
Start small with VPC peering if your topology is simple. If you’re growing across teams/accounts, standardize on a Transit Gateway hub with segmented route tables and a dedicated network-services account. For hybrid, deploy VPN then add Direct Connect for stability and economics. Keep egress down with VPC endpoints, minimize cross-AZ chatter, and bake tests/alarms into Day-2.
15) Glossary (Fast Reminders)
- IGW: Internet Gateway.
- VGW/DXGW: Virtual/Direct Connect Gateways.
- GWLB: Gateway Load Balancer for inline appliances.
- TGW: Transit Gateway (hub for transitive routing).
- VPCe: VPC Endpoint (Gateway/Interface) to AWS services privately.