# Terraform AWS VPC Networking Guide: Subnets, NAT Gateway, Transit Gateway, and PrivateLink

## Why VPC Networking Architecture Matters
A Virtual Private Cloud (VPC) is the foundation of every AWS deployment. Every EC2 instance, RDS database, Lambda function with VPC access, EKS cluster, and ECS task runs within a VPC. Getting the networking architecture right from the start prevents painful migrations later: poor CIDR planning leads to IP exhaustion, missing network segmentation creates security gaps, and incorrect routing causes connectivity failures that are difficult to diagnose.
Production VPC networking requires careful consideration of several interconnected concerns: IP address planning that accommodates growth, subnet tiers that enforce network segmentation, routing that balances connectivity with isolation, NAT configuration that controls egress costs, and monitoring that provides visibility into traffic patterns. Each of these decisions has implications for security, cost, and operational complexity.
In this guide, I walk through a complete production VPC architecture using Terraform, covering the full stack from CIDR planning through Transit Gateway connectivity. The patterns here come from real-world implementations at Citadel Cloud Management, where I design network architectures for enterprise clients running hundreds of AWS accounts in multi-region configurations. The complete Terraform modules are available in my terraform-aws-vpc-complete repository.
## VPC Peering vs Transit Gateway vs PrivateLink Comparison
AWS provides three primary mechanisms for connecting VPCs. Each serves different use cases, and understanding their trade-offs is essential for designing a scalable network architecture. The following table provides a detailed comparison based on the AWS VPC documentation and real-world operational experience.
| Feature | VPC Peering | Transit Gateway | PrivateLink |
|---|---|---|---|
| Connectivity Model | Point-to-point between two VPCs | Hub-and-spoke or full mesh | Service consumer to provider |
| Transitive Routing | Not supported | Fully supported | Not applicable |
| Cross-Region | Supported (inter-region peering) | Supported (inter-region peering) | Supported (cross-region endpoints) |
| Cross-Account | Supported | Supported via RAM sharing | Supported |
| CIDR Overlap | Not allowed | Not allowed within the same route table | Allowed (uses ENI private IPs) |
| Bandwidth | No aggregate limit; bounded by instance network performance | Up to 50 Gbps per attachment | Depends on NLB targets |
| Cost | No hourly charge; data transfer billed cross-AZ/region | ~$0.05/hr per attachment + ~$0.02/GB | ~$0.01/hr per endpoint per AZ + ~$0.01/GB |
| Scale Limit | 125 active peering connections per VPC | 5,000 attachments per TGW | No practical endpoint limit |
| Centralized Inspection | Not possible | Supported via inspection VPC | Not applicable |
| Best For | 2-3 VPCs, simple direct connectivity | Multi-VPC networks, centralized routing | Service exposure, AWS service access |
For most enterprise landing zones, Transit Gateway is the right choice. It provides transitive routing, centralized network management, and the ability to inspect traffic through a security VPC. VPC Peering remains useful for high-bandwidth, low-cost connections between a small number of VPCs. PrivateLink serves a different purpose entirely, providing secure service-level connectivity rather than network-level connectivity.
## Multi-AZ Subnet Architecture Design
A production VPC subnet architecture uses three tiers distributed across at least three availability zones. This design provides both high availability and network segmentation.
### Public Subnets for External-Facing Resources
Public subnets have a route to the Internet Gateway and are used exclusively for resources that must be directly reachable from the internet: Application Load Balancers, NAT Gateways, and bastion hosts (if you still use them). No application workloads should run in public subnets. Each public subnet gets a /24 CIDR block, providing 251 usable IP addresses per availability zone.
### Private Subnets for Application Workloads
Private subnets route outbound internet traffic through NAT Gateways in the public tier. They host application workloads: EKS node groups, ECS tasks, EC2 instances, and Lambda functions with VPC access. Private subnets receive larger CIDR allocations, typically /20 or /19, to accommodate pod IP ranges for Kubernetes clusters. The route table for private subnets points the default route (0.0.0.0/0) to the NAT Gateway in the same availability zone.
### Isolated Subnets for Data Stores
Isolated subnets have no route to the internet, neither inbound nor outbound. They host databases (RDS, ElastiCache, DocumentDB), sensitive data processing systems, and resources that should never communicate with the internet. Access to AWS APIs from isolated subnets goes through VPC endpoints, ensuring that all traffic stays within the AWS network. Isolated subnets typically use /24 CIDR blocks since they host fewer resources.
## Terraform VPC with Public, Private, and Isolated Subnets
The following Terraform configuration creates a production-grade multi-AZ VPC with three subnet tiers, NAT Gateways, route tables, and Transit Gateway attachment. This code is derived from my terraform-aws-vpc-complete module and follows AWS networking best practices.
```hcl
# providers.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }

  backend "s3" {
    bucket         = "your-terraform-state-bucket"
    key            = "vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = var.region
}
```
```hcl
# variables.tf
variable "region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

variable "transit_gateway_id" {
  description = "Transit Gateway ID for attachment"
  type        = string
  default     = ""
}

variable "kms_key_arn" {
  description = "KMS key ARN used to encrypt the flow log group"
  type        = string
  default     = null
}
```
```hcl
# locals.tf
locals {
  name_prefix = "${var.environment}-vpc"

  public_subnets = {
    "${var.availability_zones[0]}" = "10.0.1.0/24"
    "${var.availability_zones[1]}" = "10.0.2.0/24"
    "${var.availability_zones[2]}" = "10.0.3.0/24"
  }

  private_subnets = {
    "${var.availability_zones[0]}" = "10.0.16.0/20"
    "${var.availability_zones[1]}" = "10.0.32.0/20"
    "${var.availability_zones[2]}" = "10.0.48.0/20"
  }

  isolated_subnets = {
    "${var.availability_zones[0]}" = "10.0.100.0/24"
    "${var.availability_zones[1]}" = "10.0.101.0/24"
    "${var.availability_zones[2]}" = "10.0.102.0/24"
  }

  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Project     = "network-foundation"
  }
}
```
```hcl
# vpc.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = merge(local.common_tags, {
    Name = local.name_prefix
  })
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-igw"
  })
}
```
```hcl
# ─── Public Subnets ───
resource "aws_subnet" "public" {
  for_each = local.public_subnets

  vpc_id                  = aws_vpc.main.id
  cidr_block              = each.value
  availability_zone       = each.key
  map_public_ip_on_launch = false

  tags = merge(local.common_tags, {
    Name                     = "${local.name_prefix}-public-${each.key}"
    Tier                     = "public"
    "kubernetes.io/role/elb" = "1"
  })
}

# ─── Private Subnets ───
resource "aws_subnet" "private" {
  for_each = local.private_subnets

  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value
  availability_zone = each.key

  tags = merge(local.common_tags, {
    Name                              = "${local.name_prefix}-private-${each.key}"
    Tier                              = "private"
    "kubernetes.io/role/internal-elb" = "1"
  })
}

# ─── Isolated Subnets ───
resource "aws_subnet" "isolated" {
  for_each = local.isolated_subnets

  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value
  availability_zone = each.key

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-isolated-${each.key}"
    Tier = "isolated"
  })
}
```
```hcl
# ─── Elastic IPs for NAT Gateways ───
resource "aws_eip" "nat" {
  for_each = local.public_subnets
  domain   = "vpc"

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-nat-eip-${each.key}"
  })
}

# ─── NAT Gateways (one per AZ for HA) ───
resource "aws_nat_gateway" "main" {
  for_each      = local.public_subnets
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = aws_subnet.public[each.key].id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-nat-${each.key}"
  })

  depends_on = [aws_internet_gateway.main]
}
```
```hcl
# ─── Route Tables ───
# Public route table
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-public-rt"
  })
}

resource "aws_route_table_association" "public" {
  for_each       = local.public_subnets
  subnet_id      = aws_subnet.public[each.key].id
  route_table_id = aws_route_table.public.id
}

# Private route tables (one per AZ for AZ-local NAT)
resource "aws_route_table" "private" {
  for_each = local.private_subnets
  vpc_id   = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[each.key].id
  }

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-private-rt-${each.key}"
  })
}

resource "aws_route_table_association" "private" {
  for_each       = local.private_subnets
  subnet_id      = aws_subnet.private[each.key].id
  route_table_id = aws_route_table.private[each.key].id
}

# Isolated route table (no internet route)
resource "aws_route_table" "isolated" {
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-isolated-rt"
  })
}

resource "aws_route_table_association" "isolated" {
  for_each       = local.isolated_subnets
  subnet_id      = aws_subnet.isolated[each.key].id
  route_table_id = aws_route_table.isolated.id
}
```
```hcl
# ─── VPC Flow Logs ───
resource "aws_flow_log" "main" {
  vpc_id                   = aws_vpc.main.id
  traffic_type             = "ALL"
  log_destination_type     = "cloud-watch-logs"
  log_destination          = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn             = aws_iam_role.flow_logs.arn
  max_aggregation_interval = 60

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-flow-logs"
  })
}

resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/aws/vpc/flow-logs/${local.name_prefix}"
  retention_in_days = 90
  kms_key_id        = var.kms_key_arn
  tags              = local.common_tags
}

resource "aws_iam_role" "flow_logs" {
  name = "${local.name_prefix}-flow-logs-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "vpc-flow-logs.amazonaws.com"
      }
    }]
  })

  tags = local.common_tags
}

resource "aws_iam_role_policy" "flow_logs" {
  name = "flow-logs-policy"
  role = aws_iam_role.flow_logs.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ]
      Effect   = "Allow"
      Resource = "*"
    }]
  })
}
```
```hcl
# ─── VPC Endpoints for AWS Services ───
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = concat(
    [aws_route_table.public.id],
    [for rt in aws_route_table.private : rt.id],
    [aws_route_table.isolated.id]
  )

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-s3-endpoint"
  })
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.dynamodb"
  vpc_endpoint_type = "Gateway"

  route_table_ids = concat(
    [for rt in aws_route_table.private : rt.id],
    [aws_route_table.isolated.id]
  )

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-dynamodb-endpoint"
  })
}
```
```hcl
# ─── Transit Gateway Attachment (conditional) ───
resource "aws_ec2_transit_gateway_vpc_attachment" "main" {
  count = var.transit_gateway_id != "" ? 1 : 0

  subnet_ids         = [for s in aws_subnet.private : s.id]
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = aws_vpc.main.id

  dns_support  = "enable"
  ipv6_support = "disable"

  transit_gateway_default_route_table_association = true
  transit_gateway_default_route_table_propagation = true

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-tgw-attachment"
  })
}
```
```hcl
# ─── Outputs ───
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = [for s in aws_subnet.public : s.id]
}

output "private_subnet_ids" {
  value = [for s in aws_subnet.private : s.id]
}

output "isolated_subnet_ids" {
  value = [for s in aws_subnet.isolated : s.id]
}

output "nat_gateway_ips" {
  value = [for eip in aws_eip.nat : eip.public_ip]
}
```
This configuration creates a VPC with 65,536 IP addresses (/16), distributed across three availability zones. The private subnets use /20 blocks with 4,096 IPs each, providing ample capacity for Kubernetes pod networking. The isolated subnets use smaller /24 blocks since they host fewer resources. Each AZ gets its own NAT Gateway for high availability, ensuring that an AZ failure does not impact outbound connectivity for other AZs.
## NAT Gateway Configuration and Cost Optimization
NAT Gateways are one of the most significant cost drivers in AWS networking. Each NAT Gateway costs $0.045 per hour (roughly $33 per month) plus $0.045 per GB of data processed. For a production environment pushing 1 TB of outbound traffic per month through each AZ's gateway, the data processing cost alone is about $45 per AZ, or $135 across three AZs.
### Cost Reduction Strategies for NAT Gateway
The most effective way to reduce NAT Gateway costs is to eliminate traffic that flows through it unnecessarily. VPC Gateway endpoints for S3 and DynamoDB route traffic directly to these services over the AWS backbone network at no additional cost. Interface endpoints for services like ECR, CloudWatch, STS, and Secrets Manager similarly bypass the NAT Gateway. My terraform-aws-privatelink module provisions these endpoints with proper security group configurations.
For development and staging environments where high availability is less critical, using a single NAT Gateway shared across all private subnets reduces the fixed cost by two-thirds. The trade-off is that a single NAT Gateway creates a cross-AZ dependency and a single point of failure. For cost-sensitive development environments, NAT instances on small EC2 instances can further reduce costs, though they require more operational overhead.
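As a rough sketch of the shared-gateway pattern, the per-AZ NAT resources from the module above could be made conditional. The `single_nat_gateway` variable and `nat_for_az` local here are illustrative additions of mine, not part of the module:

```hcl
# Hypothetical toggle for non-production environments (not in the module above).
variable "single_nat_gateway" {
  description = "Provision one shared NAT Gateway instead of one per AZ"
  type        = bool
  default     = false
}

locals {
  # When the flag is set, only the first AZ hosts a NAT Gateway.
  # Switch aws_eip.nat to for_each = local.nat_subnets as well, so no
  # unused Elastic IPs are allocated.
  nat_subnets = var.single_nat_gateway ? {
    (var.availability_zones[0]) = local.public_subnets[var.availability_zones[0]]
  } : local.public_subnets

  # Which NAT Gateway each AZ's private route table should use.
  nat_for_az = {
    for az in var.availability_zones :
    az => var.single_nat_gateway ? var.availability_zones[0] : az
  }
}

resource "aws_nat_gateway" "main" {
  for_each      = local.nat_subnets
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = aws_subnet.public[each.key].id
}

# In the private route table, the default route then becomes:
#   nat_gateway_id = aws_nat_gateway.main[local.nat_for_az[each.key]].id
```

With the flag off, this behaves exactly like the per-AZ configuration; with it on, the fixed hourly cost drops to one gateway at the price of a cross-AZ dependency.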
## Transit Gateway for Multi-VPC Connectivity
AWS Transit Gateway acts as a regional network hub that interconnects VPCs and on-premises networks through a single gateway. It eliminates the complexity of managing full-mesh VPC peering connections, which grows quadratically as you add VPCs. With Transit Gateway, each VPC needs only one attachment, and routing is managed centrally through Transit Gateway route tables.
### Transit Gateway Architecture Patterns
The most common Transit Gateway architecture uses multiple route tables for network segmentation. A shared services route table connects VPCs that host DNS, monitoring, CI/CD, and other platform services. A production route table connects production workload VPCs. A development route table connects development and testing VPCs. Route table associations and propagations control which VPCs can communicate, enforcing segmentation without complex security group rules.
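A minimal sketch of that segmentation, assuming a Transit Gateway named `aws_ec2_transit_gateway.main` and VPC attachments `prod` and `shared` created with default route table association and propagation disabled (all of these names are placeholders, not resources from this guide's module):

```hcl
# One TGW route table per segmentation domain.
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "prod-rt" }
}

resource "aws_ec2_transit_gateway_route_table" "shared" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "shared-services-rt" }
}

# Production VPCs are associated with the prod route table...
resource "aws_ec2_transit_gateway_route_table_association" "prod_vpc" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# ...and only the shared-services attachment propagates routes into it,
# so production VPCs can reach DNS/monitoring but not each other.
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_to_prod" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```

The key design choice is that associations decide which route table an attachment consults, while propagations decide whose routes appear in it; controlling both independently is what enforces the segmentation.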
For centralized network inspection, route traffic through an inspection VPC hosting AWS Network Firewall or third-party appliances before allowing it to reach its destination. This pattern enables deep packet inspection, intrusion detection, and URL filtering for all inter-VPC and internet-bound traffic. The Transit Gateway routing is configured in my terraform-aws-transit-gateway module.
### Transit Gateway with AWS Network Firewall
Combining Transit Gateway with AWS Network Firewall creates a centralized inspection architecture. All traffic between VPCs and between VPCs and the internet routes through the firewall VPC. Network Firewall evaluates traffic against stateful and stateless rule groups, providing IDS/IPS capabilities, domain filtering, and protocol-level inspection. This architecture is particularly important for regulated industries that require network-level compliance controls.
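The firewall deployment itself can be sketched as follows. The inspection VPC (`aws_vpc.inspection`) and its firewall subnets (`aws_subnet.firewall`) are assumed to exist; the policy here only sets default actions and would carry rule group references in practice:

```hcl
resource "aws_networkfirewall_firewall_policy" "main" {
  name = "central-inspection-policy"

  firewall_policy {
    # Forward all traffic to the stateful engine; stateful rule groups
    # (attached separately) do the real filtering.
    stateless_default_actions          = ["aws:forward_to_sfe"]
    stateless_fragment_default_actions = ["aws:forward_to_sfe"]
  }
}

resource "aws_networkfirewall_firewall" "inspection" {
  name                = "central-inspection"
  firewall_policy_arn = aws_networkfirewall_firewall_policy.main.arn
  vpc_id              = aws_vpc.inspection.id

  # One firewall endpoint per AZ, placed in dedicated firewall subnets.
  dynamic "subnet_mapping" {
    for_each = aws_subnet.firewall
    content {
      subnet_id = subnet_mapping.value.id
    }
  }
}
```

Transit Gateway route tables then point inter-VPC and internet-bound routes at the inspection VPC attachment so traffic transits the firewall endpoints.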
## PrivateLink and VPC Endpoints
AWS PrivateLink provides private connectivity to services without traversing the public internet. It creates elastic network interfaces (ENIs) with private IP addresses in your VPC subnets, making the target service appear as if it is running within your own VPC. This eliminates the need for internet gateways, NAT devices, or firewall rules for service access.
### Gateway Endpoints vs Interface Endpoints
AWS offers two types of VPC endpoints. Gateway endpoints are available for S3 and DynamoDB only. They are free, highly available, and route traffic through prefix lists added to route tables. Interface endpoints use PrivateLink and are available for over 100 AWS services. They cost $0.01 per hour per AZ plus data processing charges but provide private DNS integration and security group control.
For production environments, I recommend deploying interface endpoints for the following critical services: ecr.api, ecr.dkr, logs (CloudWatch Logs), monitoring (CloudWatch), sts, secretsmanager, and ssm. These endpoints eliminate NAT Gateway costs for AWS API traffic and provide more reliable connectivity since they do not depend on NAT Gateway availability.
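Provisioning that set of endpoints can be sketched with a `for_each` over the service names, reusing the VPC and private subnets from the module above. The service list and the endpoint security group are illustrative choices, not part of the earlier configuration:

```hcl
locals {
  # Interface endpoints worth deploying in most production VPCs.
  interface_services = [
    "ecr.api", "ecr.dkr", "logs", "monitoring",
    "sts", "secretsmanager", "ssm",
  ]
}

# Allow HTTPS to the endpoint ENIs from anywhere inside the VPC.
resource "aws_security_group" "endpoints" {
  name   = "${local.name_prefix}-vpce"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}

resource "aws_vpc_endpoint" "interface" {
  for_each            = toset(local.interface_services)
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [for s in aws_subnet.private : s.id]
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

With `private_dns_enabled`, the standard service hostnames (for example `logs.us-east-1.amazonaws.com`) resolve to the endpoint ENIs, so SDK clients need no configuration changes.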
## VPC Flow Logs and Network Monitoring
VPC Flow Logs capture information about the IP traffic going to and from network interfaces in your VPC. They are essential for security analysis, troubleshooting connectivity issues, and understanding traffic patterns for cost optimization. Flow logs can be published to CloudWatch Logs, S3, or Kinesis Data Firehose.
### Flow Log Analysis for Security and Cost Optimization
Flow logs reveal several types of valuable information. Rejected traffic patterns indicate potential scanning or misconfigured security groups. Large outbound data transfers to unexpected destinations may indicate data exfiltration. High-volume traffic to NAT Gateways identifies candidates for VPC endpoint optimization. Cross-AZ traffic patterns highlight opportunities to co-locate services to reduce data transfer costs.
For cost-effective flow log storage, publish to S3 with Parquet format and use Athena for ad-hoc queries. The Parquet format reduces storage costs by 50-70% compared to plain text and significantly improves Athena query performance. Set the aggregation interval to 60 seconds for security monitoring or 600 seconds for cost optimization analysis where real-time granularity is less critical.
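An S3/Parquet variant of the flow log resource might look like this; the destination bucket (`aws_s3_bucket.flow_logs`) is assumed to exist with an appropriate bucket policy:

```hcl
resource "aws_flow_log" "s3" {
  vpc_id               = aws_vpc.main.id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = aws_s3_bucket.flow_logs.arn # assumed bucket

  # 600s aggregation is sufficient for cost analysis; use 60s when
  # near-real-time security monitoring matters more than log volume.
  max_aggregation_interval = 600

  destination_options {
    file_format                = "parquet"
    hive_compatible_partitions = true
    per_hour_partition         = true
  }
}
```

Hive-compatible hourly partitions let Athena prune partitions by date and hour, which keeps ad-hoc queries fast and cheap.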
## AWS Network Firewall Integration
AWS Network Firewall provides managed network traffic filtering for VPCs. It supports stateful inspection, intrusion prevention, and domain-based filtering. When deployed in a centralized inspection architecture with Transit Gateway, it provides consistent security controls across all VPCs without requiring security appliances in each VPC.
### Rule Group Design for Enterprise Networks
Network Firewall rule groups should be organized by function. A baseline stateless rule group handles fast-path decisions like blocking known malicious IP ranges and allowing established connections. Stateful rule groups implement domain filtering (allow only approved domains for outbound HTTPS), protocol inspection (ensure TLS version compliance), and IPS signatures (detect and block known exploitation patterns). My terraform-aws-network-firewall module provides reusable rule group templates for common enterprise requirements.
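As one illustrative example (not taken from that module), a stateful domain-allowlist rule group that restricts outbound HTTPS and HTTP to approved domains could look like this; the capacity value and target domains are placeholders:

```hcl
resource "aws_networkfirewall_rule_group" "domain_allow" {
  name     = "approved-domains"
  type     = "STATEFUL"
  capacity = 100 # placeholder; size to your rule count

  rule_group {
    rules_source {
      rules_source_list {
        # Allow traffic only to listed domains, matched on TLS SNI
        # and HTTP Host headers; everything else is dropped.
        generated_rules_type = "ALLOWLIST"
        target_types         = ["TLS_SNI", "HTTP_HOST"]
        targets              = [".amazonaws.com", ".github.com"]
      }
    }
  }
}
```

The rule group is then referenced from the firewall policy's stateful rule group list alongside IPS and protocol-inspection groups.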
## Best Practices for AWS VPC Networking
These best practices reflect years of production network architecture experience across enterprise AWS environments and align with the guidance in the AWS Well-Architected Framework.
- **Plan CIDR ranges for 3-5 years of growth.** IP exhaustion is one of the most disruptive networking problems to remediate. Use `/16` VPC CIDRs for production environments. Reserve CIDR blocks across your organization using an IPAM solution or a simple spreadsheet; AWS VPC IPAM automates this for large-scale deployments. Never use `/24` or smaller VPCs for any environment that might grow.
- **Deploy NAT Gateways per AZ for production.** A single NAT Gateway creates a cross-AZ dependency: if the AZ hosting it fails, private subnets in every other AZ lose internet connectivity. The additional cost of per-AZ NAT Gateways is justified for production workloads. For non-production, a single NAT Gateway is acceptable.
- **Use VPC endpoints for all high-volume AWS service traffic.** Gateway endpoints for S3 and DynamoDB are free and should be deployed in every VPC without exception. Interface endpoints for ECR, CloudWatch, and STS eliminate NAT Gateway data processing charges and improve reliability. Calculate the break-even point: if endpoint hourly costs are less than NAT data processing costs, deploy the endpoint.
- **Enable VPC flow logs in every VPC.** Flow logs provide the network visibility required for security monitoring, troubleshooting, and cost optimization. Publish to S3 in Parquet format for cost-effective long-term storage. Set up Athena tables for ad-hoc analysis and CloudWatch metric filters for automated alerting on rejected traffic spikes.
- **Implement three subnet tiers: public, private, isolated.** Public subnets for load balancers and NAT Gateways only; private subnets for application workloads with NAT egress; isolated subnets for databases with no internet route. Each tier gets its own route table. This segmentation limits blast radius and simplifies security group management.
- **Use Transit Gateway for environments with more than three VPCs.** The operational complexity of VPC peering grows quadratically. Transit Gateway centralizes routing, enables network segmentation through route tables, and supports centralized inspection through firewall VPCs. The cost is predictable and scales linearly with the number of attachments.
- **Tag subnets for Kubernetes integration.** EKS requires specific subnet tags for automatic load balancer placement. Tag public subnets with `kubernetes.io/role/elb = 1` for internet-facing ALBs and private subnets with `kubernetes.io/role/internal-elb = 1` for internal ALBs. Include the cluster name tag for multi-cluster VPCs.
- **Avoid overlapping CIDRs across your organization.** CIDR conflicts prevent VPC peering, Transit Gateway connectivity, and VPN routing. Maintain a central CIDR registry and enforce allocation through Terraform modules that validate against the registry before creating subnets.
- **Use Network ACLs as a secondary defense layer.** While security groups provide primary instance-level filtering, Network ACLs add subnet-level stateless filtering. Use them to block entire CIDR ranges, enforce deny rules that security groups cannot express, and add a layer of defense for compliance requirements.
- **Implement DNS resolution across VPCs with Route 53 Resolver.** When using Transit Gateway or VPC peering, centralize DNS resolution with Route 53 Resolver endpoints. Create resolver rules for private hosted zones shared across VPCs, on-premises DNS forwarding, and split-horizon DNS for hybrid environments.
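The Route 53 Resolver pattern can be sketched with an outbound endpoint and a forwarding rule. The resolver security group, the `corp.example.com` domain, and the on-premises DNS server IP are all placeholders:

```hcl
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "${local.name_prefix}-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [aws_security_group.resolver.id] # assumed SG allowing DNS

  # Resolver endpoints require at least two IP addresses in different subnets.
  ip_address {
    subnet_id = aws_subnet.private[var.availability_zones[0]].id
  }
  ip_address {
    subnet_id = aws_subnet.private[var.availability_zones[1]].id
  }
}

# Forward queries for the corporate zone to on-premises DNS.
resource "aws_route53_resolver_rule" "onprem" {
  domain_name          = "corp.example.com" # placeholder zone
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip = "10.100.0.2" # placeholder on-premises DNS server
  }
}
```

Sharing the resolver rule via AWS RAM and associating it with each spoke VPC gives every VPC consistent hybrid DNS without per-VPC forwarders.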
## Frequently Asked Questions
### What is the best VPC subnet architecture for production AWS workloads?
The recommended production VPC architecture uses three subnet tiers across three or more availability zones: public subnets for load balancers and NAT Gateways, private subnets for application workloads like ECS/EKS and EC2 instances, and isolated subnets for databases and sensitive data stores with no internet access. Each tier has its own route table with appropriate routing rules. Use /16 VPC CIDRs to allow for growth.
### When should I use Transit Gateway instead of VPC Peering?
Use Transit Gateway when connecting more than 3-4 VPCs, when you need transitive routing between VPCs, or when centralizing network inspection through a firewall VPC. VPC Peering is simpler and free for data transfer within the same AZ, making it ideal for connecting 2-3 VPCs with direct communication requirements. Transit Gateway charges per attachment and per GB processed but provides centralized management.
### How do I reduce NAT Gateway costs in AWS?
Reduce NAT Gateway costs by using VPC endpoints for AWS service traffic (S3, DynamoDB, ECR, CloudWatch), which eliminates NAT Gateway data processing charges for those services. Use a single NAT Gateway per AZ rather than per subnet. Consider NAT instances for development environments. Enable VPC flow logs to identify unexpected outbound traffic patterns driving costs. Interface endpoints for high-volume services often pay for themselves within days.
### What is AWS PrivateLink and when should I use it?
AWS PrivateLink provides private connectivity between VPCs, AWS services, and on-premises networks without exposing traffic to the public internet. Use PrivateLink when consuming AWS services from private subnets, when exposing your services to other AWS accounts securely, or when connecting to SaaS providers that support PrivateLink. It creates elastic network interfaces in your VPC with private IP addresses, avoiding NAT Gateway costs and internet exposure.
### How do I implement network segmentation with Terraform in AWS VPC?
Implement network segmentation using multiple route tables (one per subnet tier), security groups for stateful instance-level filtering, network ACLs for stateless subnet-level filtering, and AWS Network Firewall for centralized inspection. Define each component as a Terraform module with clear input variables for CIDR ranges, allowed ports, and routing destinations. Use flow logs to validate that segmentation rules work as expected.
## Need Help with AWS Network Architecture?
Citadel Cloud Management designs and implements enterprise network architectures across AWS, Azure, and Google Cloud. From VPC design to Transit Gateway deployment, we help organizations build secure, scalable, and cost-optimized network foundations.
Get in Touch