E2E IaC for Multi-AZ Production VPC

· 8 min read ·
AWS · IaC · Networking · DevOps

Spinning up a VPC is a few clicks in the console. Building one that survives production traffic, a security audit, and a cost review six months later is a different problem entirely. This article focuses on the three structural decisions that determine everything else: (1) how to split public and private subnets, (2) how to handle outbound internet for private resources, and (3) where the load balancer sits.

All examples use Terraform, but the logic applies equally to CloudFormation, CDK, or Pulumi.


1. Public/Private Subnet Pattern

The single most important architectural decision in a VPC is which resources get a route to the internet gateway and which do not. Getting this wrong is easy to spot on an architecture diagram and surprisingly hard to fix in production.

The Core Rule

A public subnet has a route to an Internet Gateway (IGW) in its route table. A private subnet does not. That is the only meaningful distinction — “public” and “private” are not AWS concepts, they are naming conventions for route table configurations.
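Every snippet in this article references aws_vpc.main without defining it. A minimal definition is assumed throughout; the 10.0.0.0/16 supernet matches the CIDR plan later in the article:

```hcl
# Assumed base VPC; 10.0.0.0/16 matches the CIDR plan in this article
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true # required for interface VPC endpoints' private DNS

  tags = { Name = "production" }
}
```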

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  # no default route — this is what makes it private
}

Multi-AZ Layout

A production VPC spans at least two AZs. Three is better for quorum-based systems (Amazon MSK, self-managed etcd or ZooKeeper, RDS Multi-AZ DB clusters), which need an odd number of members so a majority survives the loss of one AZ. Each AZ gets its own public subnet and its own private subnet.

locals {
  azs             = ["ap-southeast-5a", "ap-southeast-5b", "ap-southeast-5c"]
  public_cidrs    = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
  private_cidrs   = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
}

resource "aws_subnet" "public" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = local.public_cidrs[count.index]
  availability_zone = local.azs[count.index]

  map_public_ip_on_launch = true

  tags = { Name = "public-${local.azs[count.index]}" }
}

resource "aws_subnet" "private" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = local.private_cidrs[count.index]
  availability_zone = local.azs[count.index]

  map_public_ip_on_launch = false

  tags = { Name = "private-${local.azs[count.index]}" }
}
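Defining a route table is not enough: a subnet uses the VPC's main route table until it is explicitly associated with another one. Associating the public subnets with the public table (the private subnets get the analogous association once their per-AZ route tables exist in Section 2):

```hcl
# Attach each public subnet to the public route table (IGW default route)
resource "aws_route_table_association" "public" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
```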

What Goes Where

| Layer | Subnet | Why |
| --- | --- | --- |
| ALB (internet-facing) | Public | Needs a public IP to receive traffic from the internet |
| EC2 app servers / ECS tasks | Private | No direct inbound exposure; ALB forwards to them |
| RDS, ElastiCache, OpenSearch | Private | No inbound internet path, ever |
| NAT Gateway | Public | Needs an Elastic IP and IGW route to forward private traffic out |
| Bastion / EC2 Instance Connect Endpoint | Public (or no bastion at all with an EIC Endpoint) | Needs reachability for SSH |

The common mistake is placing EC2 app instances in public subnets because “they need outbound internet for package updates.” They do not need a public subnet — they need a NAT Gateway in a public subnet. The instance itself should stay private.

CIDR Planning

Reserve more space than you think you need. A /24 yields 251 usable IPs (AWS reserves 5 addresses per subnet). For private subnets hosting Auto Scaling groups or EKS node groups, a /22 or larger is safer: in its default mode the VPC CNI assigns every pod an IP address from the subnet, which can exhaust a /24 with a handful of nodes.

A practical starting layout for a production VPC:

10.0.0.0/16   — VPC supernet
  10.0.0.0/24   — public-ap-southeast-5a
  10.0.1.0/24   — public-ap-southeast-5b
  10.0.2.0/24   — public-ap-southeast-5c
  10.0.8.0/22   — private-ap-southeast-5a  (1019 usable IPs)
  10.0.12.0/22  — private-ap-southeast-5b
  10.0.16.0/22  — private-ap-southeast-5c
  10.0.100.0/24 — reserved for future database subnet tier
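Rather than hard-coding each block, the layout can be derived from the supernet with Terraform's cidrsubnet(). Note that /22 blocks must start on third-octet multiples of 4 (10.0.8.0, 10.0.12.0, and so on); a sketch:

```hcl
locals {
  vpc_cidr = "10.0.0.0/16"

  # /24 public subnets: newbits = 24 - 16 = 8
  public_cidrs = [for i in range(3) : cidrsubnet(local.vpc_cidr, 8, i)]
  # → ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]

  # /22 private subnets: newbits = 6; netnums start at 2 to clear the public /24s
  private_cidrs = [for i in range(3) : cidrsubnet(local.vpc_cidr, 6, i + 2)]
  # → ["10.0.8.0/22", "10.0.12.0/22", "10.0.16.0/22"]
}
```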

2. NAT Strategy Tradeoffs

Private subnet resources that need outbound internet access (pulling packages, calling external APIs, reaching AWS services without VPC endpoints) require a path out. There are three options, each with a different cost and availability profile.

Option A: NAT Gateway (Managed)

The standard choice for production. AWS manages the NAT device, it scales automatically, and it is redundant within its AZ (though not across AZs, which is why one is deployed per AZ below).

resource "aws_eip" "nat" {
  count  = length(local.azs)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(local.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = { Name = "nat-${local.azs[count.index]}" }
}

resource "aws_route_table" "private" {
  # one table per AZ; supersedes the single shared table from Section 1
  count  = length(local.azs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}
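Each private subnet must then be associated with the route table for its own AZ, or it will fall back to the VPC's main route table and lose its NAT path:

```hcl
# Tie each private subnet to the route table pointing at its AZ's NAT Gateway
resource "aws_route_table_association" "private" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```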

The critical detail: one NAT Gateway per AZ, each private subnet routes to its own AZ’s NAT Gateway. A single shared NAT Gateway is a single point of failure and creates cross-AZ data transfer charges on every outbound byte from the other AZs.

Cost: $0.05/hour per NAT Gateway (~$36/month each). Three NAT Gateways for three AZs = ~$108/month before any traffic and data processing.

Option B: NAT Instance (Self-Managed)

A regular EC2 instance with its source/destination check disabled and iptables configured for masquerading. Cheaper, but it needs its own patching, high availability, and failover setup.

resource "aws_instance" "nat" {
  ami                    = data.aws_ami.nat.id  # community NAT AMI or fck-nat
  instance_type          = "t4g.nano"
  subnet_id              = aws_subnet.public[0].id
  source_dest_check      = false
  vpc_security_group_ids = [aws_security_group.nat.id]
}
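Routing differs from the managed gateway: the default route targets the instance's network interface rather than a nat_gateway_id. A sketch, assuming the single shared private route table from Section 1:

```hcl
# Default route through the NAT instance's primary ENI (not nat_gateway_id)
resource "aws_route" "private_via_nat_instance" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  network_interface_id   = aws_instance.nat.primary_network_interface_id
}
```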

The fck-nat project provides a maintained ARM64 AMI with HA support via an Auto Scaling group of size 1. For low-traffic environments or cost-sensitive workloads, a t4g.nano NAT instance at ~$3/month is hard to argue against.

When to use: Dev/staging environments, workloads with low egress volume, teams with the operational capacity to manage it.

When not to use: High-throughput production workloads where NAT instance bandwidth caps become a bottleneck.

Option C: VPC Endpoints (Avoid NAT Entirely)

For AWS service traffic, a VPC endpoint eliminates the NAT Gateway entirely. Gateway endpoints for S3 and DynamoDB are free. Interface endpoints for other services cost ~$0.01/hour per AZ but are cheaper than NAT Gateway data charges at volume.

# Free — no per-GB charge, no hourly charge
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

# Interface endpoint — hourly charge, but eliminates NAT for SSM traffic
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
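The interface endpoint above references a security group the article has not defined. It only needs to allow HTTPS from inside the VPC; a minimal sketch:

```hcl
# Interface endpoints receive traffic on 443 from workloads in the VPC
resource "aws_security_group" "vpc_endpoints" {
  name   = "vpc-endpoints"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }
}
```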

Practical recommendation: Deploy gateway endpoints for S3 and DynamoDB unconditionally — they cost nothing and reduce NAT Gateway data processing charges immediately.

Decision Matrix

| Scenario | Recommended NAT Strategy |
| --- | --- |
| Production, high availability required | NAT Gateway, one per AZ |
| Production, cost is a constraint | NAT Gateway for critical paths + VPC endpoints for AWS services |
| Dev/staging, low traffic | Single NAT Gateway or fck-nat instance |
| Workloads that only call AWS services | VPC endpoints only, no NAT needed |
| Egress-heavy (video, large payloads) | NAT Gateway + audit with Cost Explorer |

3. ALB Placement: Public vs Private

The Application Load Balancer sits at the entry point of the request path. Where it is placed determines who can reach the application and how traffic flows through the VPC.

Internet-Facing ALB (Public Subnets)

The standard pattern for any application serving external users. The ALB receives a public DNS name and public IP addresses, sits in the public subnets, and forwards to targets in private subnets.

resource "aws_lb" "public" {
  name               = "app-alb-public"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb_public.id]
  subnets            = aws_subnet.public[*].id

  enable_deletion_protection = true
}

resource "aws_security_group" "alb_public" {
  name   = "alb-public"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }
}

The app security group should allow inbound only from the ALB security group, not from 0.0.0.0/0:

resource "aws_security_group_rule" "app_from_alb" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.alb_public.id
  security_group_id        = aws_security_group.app.id
}

Internal ALB (Private Subnets)

An internal ALB (internal = true) receives only a private DNS name and is reachable only within the VPC. Use this for service-to-service communication, internal APIs, or a second-tier load balancer in a layered architecture.

resource "aws_lb" "internal" {
  name               = "app-alb-internal"
  internal           = true
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb_internal.id]
  subnets            = aws_subnet.private[*].id
}

ALB and ACM Certificates

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.public.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate_validation.main.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}
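The listener forwards to a target group the article has not defined. A minimal sketch, assuming the app listens on 8080 and exposes a /healthz endpoint (both hypothetical; adjust to your service):

```hcl
# Target group for app instances/tasks in the private subnets
resource "aws_lb_target_group" "app" {
  name     = "app"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/healthz" # hypothetical health endpoint
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
  }
}
```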

resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.public.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

Use ELBSecurityPolicy-TLS13-1-2-2021-06 — it enforces TLS 1.2 minimum and prefers TLS 1.3.
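The HTTPS listener references aws_acm_certificate_validation.main. A sketch of the certificate with DNS validation, assuming a Route 53 hosted zone resource named aws_route53_zone.main and a var.domain_name variable (both assumptions, not defined in this article):

```hcl
resource "aws_acm_certificate" "main" {
  domain_name       = var.domain_name
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true # avoid downtime when the certificate is replaced
  }
}

# Create the DNS validation records in Route 53
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.main.domain_validation_options :
    dvo.domain_name => dvo
  }

  zone_id = aws_route53_zone.main.zone_id
  name    = each.value.resource_record_name
  type    = each.value.resource_record_type
  records = [each.value.resource_record_value]
  ttl     = 60
}

# Blocks until ACM confirms validation, so the listener gets a usable cert ARN
resource "aws_acm_certificate_validation" "main" {
  certificate_arn         = aws_acm_certificate.main.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}
```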


Putting It Together

A complete multi-AZ production VPC has:

- Two or three AZs, each with a public and a private subnet carved from a planned CIDR supernet
- An Internet Gateway routed only from the public subnets
- One NAT Gateway per AZ (or a fck-nat instance in cost-sensitive environments), with each private subnet routing to its own AZ's gateway
- Gateway endpoints for S3 and DynamoDB, plus interface endpoints for chatty AWS services
- An internet-facing ALB in the public subnets forwarding to targets in the private subnets
- Security groups chained so the app tier accepts traffic only from the ALB, never from 0.0.0.0/0

Further Reading