Spinning up a VPC is a few clicks in the console. Building one that survives production traffic, a security audit, and a cost review six months later is a different problem entirely. This article focuses on the three structural decisions that determine everything else: (1) how to split public and private subnets, (2) how to handle outbound internet for private resources, and (3) where the load balancer sits.
All examples use Terraform, but the logic applies equally to CloudFormation, CDK, or Pulumi.
1. Public/Private Subnet Pattern
The single most important architectural decision in a VPC is which resources get a route to the internet gateway and which do not. Getting this wrong is easy to spot on a diagram and surprisingly hard to fix once production workloads depend on the layout.
The Core Rule
A public subnet has a route to an Internet Gateway (IGW) in its route table. A private subnet does not. That is the only meaningful distinction — “public” and “private” are not AWS concepts, they are naming conventions for route table configurations.
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
}
resource "aws_route_table" "private" {
vpc_id = aws_vpc.main.id
# no default route — this is what makes it private
}
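Every block in this article references aws_vpc.main without defining it. For completeness, a minimal sketch (the CIDR matches the layout used throughout; DNS hostnames are enabled because interface VPC endpoints' private DNS depends on it):

```hcl
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true # required for private DNS on interface endpoints

  tags = { Name = "main" }
}
```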
Multi-AZ Layout
A production VPC spans at least two AZs. Three is better: quorum-based services (RDS Multi-AZ DB clusters, Amazon MSK, OpenSearch with dedicated master nodes) are designed around three AZs, and losing one AZ out of three costs you a third of capacity rather than half. Each AZ gets its own public subnet and its own private subnet.
locals {
azs = ["ap-southeast-5a", "ap-southeast-5b", "ap-southeast-5c"]
public_cidrs = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
private_cidrs = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
}
resource "aws_subnet" "public" {
count = length(local.azs)
vpc_id = aws_vpc.main.id
cidr_block = local.public_cidrs[count.index]
availability_zone = local.azs[count.index]
map_public_ip_on_launch = true
tags = { Name = "public-${local.azs[count.index]}" }
}
resource "aws_subnet" "private" {
count = length(local.azs)
vpc_id = aws_vpc.main.id
cidr_block = local.private_cidrs[count.index]
availability_zone = local.azs[count.index]
map_public_ip_on_launch = false
tags = { Name = "private-${local.azs[count.index]}" }
}
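One piece the snippets above leave implicit: a subnet is not tied to a route table until it is explicitly associated. Without an association, a subnet silently falls back to the VPC's main route table. A sketch for the public side:

```hcl
# Attach each public subnet to the public route table.
# Unassociated subnets default to the VPC's main route table,
# which may not have the IGW route.
resource "aws_route_table_association" "public" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
```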
What Goes Where
| Layer | Subnet | Why |
|---|---|---|
| ALB (internet-facing) | Public | Needs a public IP to receive traffic from the internet |
| EC2 app servers / ECS tasks | Private | No direct inbound exposure; ALB forwards to them |
| RDS, ElastiCache, OpenSearch | Private | No inbound internet path, ever |
| NAT Gateway | Public | Needs an Elastic IP and IGW route to forward private traffic out |
| Bastion / EC2 Instance Connect Endpoint | Public for a bastion; an EIC Endpoint sits in a private subnet | Needs reachability for SSH |
The common mistake is placing EC2 app instances in public subnets because “they need outbound internet for package updates.” They do not need a public subnet — they need a NAT Gateway in a public subnet. The instance itself should stay private.
CIDR Planning
Reserve more space than you expect to need. A /24 gives 251 usable IPs (AWS reserves 5 per subnet). For private subnets hosting ASGs or EKS node groups, a /22 or larger is safer: in its default mode, the VPC CNI assigns every pod its own VPC IP address, which can exhaust a /24 with a handful of nodes.
A practical starting layout for a production VPC:
10.0.0.0/16 — VPC supernet
10.0.0.0/24 — public-ap-southeast-5a
10.0.1.0/24 — public-ap-southeast-5b
10.0.2.0/24 — public-ap-southeast-5c
10.0.8.0/22 — private-ap-southeast-5a (1019 usable IPs)
10.0.12.0/22 — private-ap-southeast-5b
10.0.16.0/22 — private-ap-southeast-5c
10.0.100.0/24 — reserved for future database subnet tier
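A layout like this can be derived with Terraform's cidrsubnet() function rather than hard-coded strings, which also guarantees the blocks land on valid boundaries (a /22 must start at a multiple of four in the third octet: 10.0.8.0, 10.0.12.0, and so on):

```hcl
locals {
  vpc_cidr = "10.0.0.0/16"

  # /24 public subnets: 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
  public_cidrs = [for i in range(3) : cidrsubnet(local.vpc_cidr, 8, i)]

  # /22 private subnets on valid /22 boundaries:
  # 10.0.8.0/22, 10.0.12.0/22, 10.0.16.0/22
  private_cidrs = [for i in range(3) : cidrsubnet(local.vpc_cidr, 6, i + 2)]
}
```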
2. NAT Strategy Tradeoffs
Private subnet resources that need outbound internet access (pulling packages, calling external APIs, reaching AWS services without VPC endpoints) require a path out. There are three options, each with a different cost and availability profile.
Option A: NAT Gateway (Managed)
The standard choice for production. AWS manages the NAT device, it scales automatically, and it is highly available within a single AZ.
resource "aws_eip" "nat" {
count = length(local.azs)
domain = "vpc"
}
resource "aws_nat_gateway" "main" {
count = length(local.azs)
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = { Name = "nat-${local.azs[count.index]}" }
}
# One route table per AZ — this supersedes the single
# aws_route_table.private shown in section 1.
resource "aws_route_table" "private" {
count = length(local.azs)
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id
}
}
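Each private subnet must then be associated with the route table of its own AZ, so outbound traffic uses the AZ-local NAT Gateway:

```hcl
# Pair private subnet N with private route table N (same AZ).
# A mismatched association reintroduces cross-AZ data transfer.
resource "aws_route_table_association" "private" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```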
The critical detail: one NAT Gateway per AZ, each private subnet routes to its own AZ’s NAT Gateway. A single shared NAT Gateway is a single point of failure and creates cross-AZ data transfer charges on every outbound byte from the other AZs.
Cost: roughly $0.05/hour per NAT Gateway (~$36/month each; pricing varies by region). Three NAT Gateways for three AZs = ~$108/month before any per-GB data processing charges.
Option B: NAT Instance (Self-Managed)
A regular EC2 instance with Source/destination check disabled and iptables configured for masquerading. Cheaper, but it needs its own patching, HA, and failover setup.
resource "aws_instance" "nat" {
ami = data.aws_ami.nat.id # community NAT AMI or fck-nat
instance_type = "t4g.nano"
subnet_id = aws_subnet.public[0].id
source_dest_check = false
vpc_security_group_ids = [aws_security_group.nat.id]
}
The fck-nat project provides a maintained ARM64 AMI with HA support via an Auto Scaling group of size 1. For low-traffic environments or cost-sensitive workloads, a t4g.nano NAT instance at ~$3/month is hard to argue against.
When to use: Dev/staging environments, workloads with low egress volume, teams with the operational capacity to manage it.
When not to use: High-throughput production workloads where NAT instance bandwidth caps become a bottleneck.
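To actually send traffic through the instance, the private route table points at the instance's primary ENI instead of a nat_gateway_id. A sketch, using the single-table variant of aws_route_table.private from section 1:

```hcl
# Default route through the NAT instance's primary network interface.
# source_dest_check must be false on the instance for this to work.
resource "aws_route" "private_via_nat_instance" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  network_interface_id   = aws_instance.nat.primary_network_interface_id
}
```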
Option C: VPC Endpoints (Avoid NAT Entirely)
For AWS service traffic, a VPC endpoint eliminates the NAT Gateway entirely. Gateway endpoints for S3 and DynamoDB are free. Interface endpoints for other services cost ~$0.01/hour per AZ plus a small per-GB charge, but still come out cheaper than NAT Gateway data processing at volume.
# Free — no per-GB charge, no hourly charge
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
}
# Interface endpoint — hourly charge, but eliminates NAT for SSM traffic
resource "aws_vpc_endpoint" "ssm" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ssm"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
Practical recommendation: Deploy gateway endpoints for S3 and DynamoDB unconditionally — they cost nothing and reduce NAT Gateway data processing charges immediately.
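Two details worth filling in. The aws_security_group.vpc_endpoints referenced above needs to allow HTTPS from inside the VPC, and Session Manager actually requires three interface endpoints (ssm, ssmmessages, ec2messages), not just ssm. A sketch that generalizes the single ssm endpoint above with for_each:

```hcl
# HTTPS from anywhere in the VPC to the endpoint ENIs.
resource "aws_security_group" "vpc_endpoints" {
  name   = "vpc-endpoints"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }
}

# Session Manager needs all three endpoints to function.
resource "aws_vpc_endpoint" "ssm_suite" {
  for_each            = toset(["ssm", "ssmmessages", "ec2messages"])
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```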
Decision Matrix
| Scenario | Recommended NAT Strategy |
|---|---|
| Production, high availability required | NAT Gateway, one per AZ |
| Production, cost is a constraint | NAT Gateway for critical paths + VPC endpoints for AWS services |
| Dev/staging, low traffic | Single NAT Gateway or fck-nat instance |
| Workloads that only call AWS services | VPC endpoints only, no NAT needed |
| Egress-heavy (video, large payloads) | NAT Gateway + audit with Cost Explorer |
3. ALB Placement: Public vs Private
The Application Load Balancer sits at the entry point of the request path. Where it is placed determines who can reach the application and how traffic flows through the VPC.
Internet-Facing ALB (Public Subnets)
The standard pattern for any application serving external users. The ALB receives a public DNS name and public IP addresses, sits in the public subnets, and forwards to targets in private subnets.
resource "aws_lb" "public" {
name = "app-alb-public"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb_public.id]
subnets = aws_subnet.public[*].id
enable_deletion_protection = true
}
resource "aws_security_group" "alb_public" {
name = "alb-public"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
}
The app security group should allow inbound only from the ALB security group, not from 0.0.0.0/0:
resource "aws_security_group_rule" "app_from_alb" {
type = "ingress"
from_port = 8080
to_port = 8080
protocol = "tcp"
source_security_group_id = aws_security_group.alb_public.id
security_group_id = aws_security_group.app.id
}
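The HTTPS listener in the next section forwards to aws_lb_target_group.app, which is not shown elsewhere. A minimal definition, with port 8080 to match the security group rules above (the /healthz path is a placeholder for whatever the application exposes):

```hcl
# Target group the ALB forwards to. target_type = "ip" suits
# ECS/Fargate tasks; use "instance" for EC2 Auto Scaling targets.
resource "aws_lb_target_group" "app" {
  name        = "app-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"

  health_check {
    path    = "/healthz" # placeholder health endpoint
    matcher = "200"
  }
}
```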
Internal ALB (Private Subnets)
An internal ALB (internal = true) receives only a private DNS name and is reachable only within the VPC. Use this for service-to-service communication, internal APIs, or a second-tier load balancer in a layered architecture.
resource "aws_lb" "internal" {
name = "app-alb-internal"
internal = true
load_balancer_type = "application"
security_groups = [aws_security_group.alb_internal.id]
subnets = aws_subnet.private[*].id
}
ALB and ACM Certificates
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.public.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = aws_acm_certificate_validation.main.certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.app.arn
}
}
resource "aws_lb_listener" "http_redirect" {
load_balancer_arn = aws_lb.public.arn
port = 80
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
Use ELBSecurityPolicy-TLS13-1-2-2021-06 — it enforces TLS 1.2 minimum and prefers TLS 1.3.
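The certificate_arn in the HTTPS listener comes from a DNS-validated ACM certificate. A sketch, assuming a Route 53 hosted zone and the placeholder domain app.example.com:

```hcl
resource "aws_acm_certificate" "main" {
  domain_name       = "app.example.com" # placeholder domain
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true
  }
}

# One validation CNAME per domain_validation_options entry.
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.main.domain_validation_options :
    dvo.domain_name => dvo
  }

  zone_id = data.aws_route53_zone.main.zone_id # assumed hosted zone
  name    = each.value.resource_record_name
  type    = each.value.resource_record_type
  records = [each.value.resource_record_value]
  ttl     = 60
}

# Blocks until validation completes; its certificate_arn feeds the listener.
resource "aws_acm_certificate_validation" "main" {
  certificate_arn         = aws_acm_certificate.main.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}
```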
Putting It Together
A complete multi-AZ production VPC has:
- A /16 CIDR with room for future subnet tiers
- Three public subnets (one per AZ) for the ALB and NAT Gateways
- Three private subnets (one per AZ, /22 or larger) for application workloads
- Three NAT Gateways (one per AZ), each with its own Elastic IP
- Private route tables that direct 0.0.0.0/0 to the AZ-local NAT Gateway
- Gateway VPC endpoints for S3 and DynamoDB
- An internet-facing ALB in public subnets, forwarding to targets in private subnets
- Security groups that chain: ALB → app → database, with no 0.0.0.0/0 on app or database tiers