@shinyaz

Aurora Blue/Green in Practice — Measuring Actual Downtime During Maintenance

Introduction

If you run Aurora PostgreSQL in production, maintenance downtime is an unavoidable concern. Minor version updates, major version upgrades, parameter changes — without knowing how long things will be down, you can't plan your maintenance windows.

AWS added Blue/Green deployment support for Aurora in 2023, and in February 2026 announced the AWS JDBC Driver Blue/Green plugin. The official blog claims "near-zero downtime," but what does that look like in practice?

In this article, I measure the following using an Aurora PostgreSQL 16.9 cluster:

  1. How long a standard failover takes
  2. How long a Blue/Green Switchover takes (PG 16.9 → 17.6 major upgrade)
  3. How DNS behavior affects downtime

This is Part 1 of the "Aurora Blue/Green in Practice" series. Part 2 will test the AWS JDBC Driver Blue/Green plugin to see how much it improves things.

Test Environment

Item             Value
Region           ap-northeast-1 (Tokyo)
Engine           Aurora PostgreSQL 16.9 (Blue) → 17.6 (Green)
Instance class   db.r6g.large
Topology         Writer × 1 + Reader × 1
VPC              Default VPC (3 AZs)
Test interval    1 second (psql running SELECT inet_server_addr())

Prerequisites:

  • AWS CLI configured (rds:*, ec2:* permissions)
  • psql (PostgreSQL client)
  • dig (DNS lookup)

The full setup-to-switchover process is documented for reproducibility. Skip to Summary if you only want the findings.

Step 1: Building the Aurora Cluster

Subnet Group and Cluster Creation

Cluster setup commands (subnet group → cluster → instances)
Terminal
# Create subnet group (default VPC, 3 AZs)
aws rds create-db-subnet-group \
  --db-subnet-group-name bg-test-subnet-group \
  --db-subnet-group-description "Subnet group for Blue/Green deployment test" \
  --subnet-ids '["subnet-xxxxx","subnet-yyyyy","subnet-zzzzz"]' \
  --region ap-northeast-1
 
# Create Aurora PostgreSQL 16.9 cluster
aws rds create-db-cluster \
  --db-cluster-identifier bg-test-apg \
  --engine aurora-postgresql \
  --engine-version 16.9 \
  --master-username postgres \
  --master-user-password '<your-password>' \
  --db-subnet-group-name bg-test-subnet-group \
  --storage-encrypted \
  --no-deletion-protection \
  --region ap-northeast-1
 
# Writer instance
aws rds create-db-instance \
  --db-instance-identifier bg-test-apg-writer \
  --db-cluster-identifier bg-test-apg \
  --db-instance-class db.r6g.large \
  --engine aurora-postgresql \
  --no-auto-minor-version-upgrade \
  --region ap-northeast-1
 
# Reader instance
aws rds create-db-instance \
  --db-instance-identifier bg-test-apg-reader \
  --db-cluster-identifier bg-test-apg \
  --db-instance-class db.r6g.large \
  --engine aurora-postgresql \
  --no-auto-minor-version-upgrade \
  --region ap-northeast-1

Cluster and instance creation took about 10–15 minutes. Check the status with:

Terminal (status check)
aws rds describe-db-instances \
  --filters Name=db-cluster-id,Values=bg-test-apg \
  --query 'DBInstances[].{Id:DBInstanceIdentifier,Status:DBInstanceStatus}' \
  --region ap-northeast-1

Once both show available, configure network access for local connectivity.

Local connectivity setup (SG inbound rule + public access)
Terminal
# Get the security group ID used by the cluster
SG_ID=$(aws rds describe-db-clusters \
  --db-cluster-identifier bg-test-apg \
  --query 'DBClusters[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
  --output text --region ap-northeast-1)
 
# Get your public IP
MY_IP=$(curl -s https://checkip.amazonaws.com)
 
# Allow PostgreSQL port from your IP only
aws ec2 authorize-security-group-ingress \
  --group-id "${SG_ID}" \
  --protocol tcp --port 5432 \
  --cidr "${MY_IP}/32" \
  --region ap-northeast-1
 
# Make both instances publicly accessible
aws rds modify-db-instance \
  --db-instance-identifier bg-test-apg-writer \
  --publicly-accessible --apply-immediately \
  --region ap-northeast-1
 
aws rds modify-db-instance \
  --db-instance-identifier bg-test-apg-reader \
  --publicly-accessible --apply-immediately \
  --region ap-northeast-1

Public access takes 1–2 minutes to propagate. After that, DNS will resolve to public IPs.

Verify the connection.

Terminal
psql -h bg-test-apg.cluster-xxxxx.ap-northeast-1.rds.amazonaws.com \
  -U postgres -d postgres -c "SELECT version();"
Output
PostgreSQL 16.9 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 10.5.0, 64-bit

Checking Initial DNS State

Aurora's cluster endpoint is a CNAME record pointing to the writer instance endpoint.

Terminal
dig +noall +answer bg-test-apg.cluster-xxxxx.ap-northeast-1.rds.amazonaws.com A
Output
bg-test-apg.cluster-xxxxx...rds.amazonaws.com. 60 IN CNAME bg-test-apg-writer.xxxxx...rds.amazonaws.com.
bg-test-apg-writer.xxxxx...rds.amazonaws.com.  60 IN A     172.31.37.206

The key observation: TTL is 60 seconds. After a failover or switchover updates DNS, clients may continue resolving to the old IP for up to 60 seconds. This is the primary cause of downtime.
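
That 60-second TTL can be watched directly during a switchover by polling dig and pulling out the CNAME target. A minimal sketch, assuming dig's default +noall +answer field layout (name, TTL, class, type, target); parse_cname and watch_endpoint are illustrative helper names, not part of any tool:

```shell
#!/usr/bin/env bash
# Extract the CNAME target and its TTL from `dig +noall +answer` output.
# Answer fields: name, TTL, class, type, target.
parse_cname() {
  awk '$4 == "CNAME" { print $5, $2 }'
}

# Watch which writer the cluster endpoint points at, once per second.
# Pass your cluster endpoint hostname as the first argument.
watch_endpoint() {
  local endpoint="$1"
  while true; do
    printf '%s  %s\n' "$(date '+%H:%M:%S')" \
      "$(dig +noall +answer "${endpoint}" A | parse_cname)"
    sleep 1
  done
}
```

During the switchover, the printed CNAME target flips from the Blue writer endpoint to the Green one, and the remaining TTL shows how long a cached answer can keep pointing at the old instance.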

Step 2: Measuring Failover Downtime

Connectivity Test Script

I prepared a script that connects to the cluster endpoint every second, recording success/failure, latency, and the server IP.

connectivity-test.sh (full script)
connectivity-test.sh
#!/usr/bin/env bash
set -euo pipefail
 
ENDPOINT="${1:?Usage: $0 <cluster-endpoint> <password> [interval_ms]}"
PASSWORD="${2:?Password required}"
INTERVAL_MS="${3:-500}"
INTERVAL_SEC=$(echo "scale=3; ${INTERVAL_MS}/1000" | bc)
 
TOTAL=0; SUCCESS=0; FAIL=0
FIRST_FAIL_TS=""; LAST_FAIL_TS=""
OUTFILE="/tmp/bg-test/connectivity-$(date '+%Y%m%d-%H%M%S').csv"
mkdir -p "$(dirname "${OUTFILE}")"  # tee below fails if the directory is missing
 
echo "timestamp,query_num,status,latency_ms,server_ip,error" | tee "${OUTFILE}"
 
cleanup() {
  echo ""
  echo "=== Summary ==="
  echo "Total: ${TOTAL} | Success: ${SUCCESS} | Failed: ${FAIL}"
  if [ -n "${FIRST_FAIL_TS}" ]; then
    echo "First failure: ${FIRST_FAIL_TS}"
    echo "Last failure: ${LAST_FAIL_TS}"
  fi
}
trap cleanup EXIT
 
export PGCONNECT_TIMEOUT=3
 
while true; do
  TOTAL=$((TOTAL + 1))
  TS=$(date '+%Y-%m-%dT%H:%M:%S.%3N')  # %3N/%N require GNU date (coreutils)
  START_NS=$(date '+%s%N')
 
  RESULT=$(PGPASSWORD="${PASSWORD}" psql -h "${ENDPOINT}" -p 5432 -U postgres -d postgres \
    -t -A -c "SELECT inet_server_addr()::text" 2>&1) && STATUS="OK" || STATUS="FAIL"
 
  END_NS=$(date '+%s%N')
  LATENCY_MS=$(( (END_NS - START_NS) / 1000000 ))
 
  if [ "${STATUS}" = "OK" ]; then
    SUCCESS=$((SUCCESS + 1))
    echo "${TS},${TOTAL},OK,${LATENCY_MS},${RESULT}," | tee -a "${OUTFILE}"
  else
    FAIL=$((FAIL + 1))
    ERROR=$(echo "${RESULT}" | tr '\n' ' ' | cut -c1-100)
    [ -z "${FIRST_FAIL_TS}" ] && FIRST_FAIL_TS="${TS}"
    LAST_FAIL_TS="${TS}"
    echo "${TS},${TOTAL},FAIL,${LATENCY_MS},,${ERROR}" | tee -a "${OUTFILE}"
  fi
 
  sleep "${INTERVAL_SEC}"
done
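
The CSV this script writes can be summarized afterwards instead of eyeballing the log. A short awk sketch (column positions follow the CSV header above; outage_windows is an illustrative name) that groups consecutive FAIL rows into outage windows:

```shell
#!/usr/bin/env bash
# Group consecutive FAIL rows in the connectivity CSV into outage windows.
# Expected columns: timestamp,query_num,status,latency_ms,server_ip,error
outage_windows() {
  awk -F, '
    NR == 1 { next }                # skip the header row
    $3 == "FAIL" {
      if (!open) { start = $1; open = 1 }
      end = $1; fails++
    }
    $3 == "OK" && open {
      printf "window: %s -> %s (%d failures)\n", start, end, fails
      open = 0; fails = 0
    }
    END {
      if (open)
        printf "window: %s -> %s (%d failures, not recovered)\n", start, end, fails
    }'
}

# Usage: outage_windows < /tmp/bg-test/connectivity-<timestamp>.csv
```

Run against the failover CSV below, this prints two windows; against the switchover CSV, one.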

Failover Execution and Results

With the connectivity test running in the background, I triggered a failover targeting the reader instance.

Terminal
# Start connectivity test (background)
./connectivity-test.sh bg-test-apg.cluster-xxxxx.rds.amazonaws.com '<password>' 1000 &
 
# Trigger failover
aws rds failover-db-cluster \
  --db-cluster-identifier bg-test-apg \
  --target-db-instance-identifier bg-test-apg-reader \
  --region ap-northeast-1

Results:

Output (connectivity test during failover)
10:01:29.278  #18  OK    124ms  172.31.37.206  ← Blue Writer
10:01:30.406  #19  FAIL  634ms                 ← Failures begin
10:01:32.050  #20  FAIL   54ms
10:01:33.113  #21  FAIL   55ms
10:01:34.177  #22  FAIL   52ms
10:01:35.238  #23  FAIL   58ms
10:01:36.304  #24  FAIL   53ms
10:01:37.364  #25  FAIL   54ms
10:01:38.427  #26  FAIL  742ms
10:01:40.178  #27  OK    141ms  172.31.37.206  ← Recovery (same IP)
  ... ~36 seconds normal ...
10:02:16.658  #59  FAIL 3113ms                 ← Second outage (timeouts)
10:02:20.780  #60  FAIL 3052ms
10:02:24.841  #61  FAIL 3099ms
10:02:28.949  #62  FAIL 3043ms                 ← Test stopped (recovery not captured)

Failover Analysis

Metric              Value
Failed connections  12
First outage        ~10 seconds (10:01:30–10:01:40, 8 connection refusals)
Second outage       ~12+ seconds (10:02:16–10:02:28, 4 timeouts)
Failure pattern     Two separate outage windows

Two things stand out:

  1. Two distinct outage windows — The first ~10 seconds saw immediate connection refusals as the instance stopped accepting connections during failover processing. After recovery, four consecutive 3-second timeouts occurred ~36 seconds later. Since the client was still connecting to the old writer's IP (172.31.37.206) after the first recovery, the second outage likely occurred when the DNS cache expired and re-resolution hit an unstable state during endpoint switchover.
  2. Same IP after recovery — After failover, the cluster endpoint CNAME updates to point to the new writer, but DNS cache keeps resolving to the old IP. Since the old writer was still running, read queries succeeded — but a write workload would have hit read-only errors. This is a fundamental limitation of DNS-based failover.
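
The "wrong instance after recovery" case in point 2 can be detected from the client side: Aurora readers (including a just-demoted writer) return true from pg_is_in_recovery(). A sketch, with the psql call wrapped in a run_query helper (an illustrative name; ENDPOINT and PASSWORD are placeholders for your connection details) so the check itself is easy to stub out:

```shell
#!/usr/bin/env bash
# Check whether the host we actually reached is the writer.
# ENDPOINT and PASSWORD are placeholders for your connection details.
run_query() {
  PGPASSWORD="${PASSWORD}" psql -h "${ENDPOINT}" -p 5432 -U postgres -d postgres \
    -t -A -c "$1"
}

connected_to_writer() {
  # Aurora readers (including a just-demoted writer) report 't' here,
  # so anything other than 'f' means we landed on a stale DNS entry.
  [ "$(run_query 'SELECT pg_is_in_recovery()::text')" = "f" ]
}
```

Logging this alongside inet_server_addr() in the connectivity test would distinguish "recovered to the new writer" from "still talking to the demoted one".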

Failover shows that DNS cache causes split outage windows and risks connecting to the wrong instance after recovery. How does Blue/Green deployment compare?

Step 3: Blue/Green Deployment and Switchover

Enabling Logical Replication

Blue/Green deployments require logical replication. Create a custom parameter group, apply it to the cluster, and reboot all instances.

Logical replication setup (parameter group → apply → reboot)
Terminal
# Create custom parameter group for Blue environment
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name bg-test-apg16-params \
  --db-parameter-group-family aurora-postgresql16 \
  --description "Custom params for Blue/Green deployment test"
 
# Enable logical replication
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name bg-test-apg16-params \
  --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"
 
# Apply to cluster
aws rds modify-db-cluster \
  --db-cluster-identifier bg-test-apg \
  --db-cluster-parameter-group-name bg-test-apg16-params \
  --apply-immediately
 
# Reboot all instances (required to move from pending-reboot to in-sync)
aws rds reboot-db-instance --db-instance-identifier bg-test-apg-writer
aws rds reboot-db-instance --db-instance-identifier bg-test-apg-reader

Wait for the parameter group status to show in-sync:

Terminal (status check)
aws rds describe-db-clusters \
  --db-cluster-identifier bg-test-apg \
  --query 'DBClusters[0].DBClusterMembers[].{Id:DBInstanceIdentifier,ParamStatus:DBClusterParameterGroupStatus}' \
  --region ap-northeast-1

Once in-sync, verify logical replication is enabled.

Terminal
psql -h <cluster-endpoint> -U postgres -c "SHOW rds.logical_replication;"
Output
 rds.logical_replication
-------------------------
 on

One gotcha: if you use a custom parameter group, you must also create one for the Green environment. Since we're upgrading PG 16 → 17, we need an aurora-postgresql17 family parameter group.

Green environment parameter group creation
Terminal
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name bg-test-apg17-params \
  --db-parameter-group-family aurora-postgresql17 \
  --description "Custom params for PG17 green environment"
 
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name bg-test-apg17-params \
  --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"

Creating the Blue/Green Deployment

Terminal
aws rds create-blue-green-deployment \
  --blue-green-deployment-name bg-test-upgrade \
  --source arn:aws:rds:ap-northeast-1:<account-id>:cluster:bg-test-apg \
  --target-engine-version 17.6 \
  --target-db-cluster-parameter-group-name bg-test-apg17-params \
  --region ap-northeast-1

Green environment provisioning progresses through three phases. Check progress with:

Terminal (progress check)
aws rds describe-blue-green-deployments \
  --blue-green-deployment-identifier bgd-xxxxx \
  --query 'BlueGreenDeployments[0].{Status:Status,Tasks:Tasks[].{Name:Name,Status:Status}}' \
  --region ap-northeast-1
Phase                             Description                     Duration (measured)
CREATING_READ_REPLICA_OF_SOURCE   Create replica of Blue cluster  ~18 min
DB_ENGINE_VERSION_UPGRADE         Upgrade PG 16.9 → 17.6          ~8 min
CREATE_DB_INSTANCES_FOR_CLUSTER   Create Green instances          ~4 min
Total                                                             ~30 min

During these 30 minutes, the Blue environment remains fully operational. Zero impact on the application.
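
Rather than re-running the describe command by hand, the wait for the Green environment can be scripted. A sketch that polls until the deployment reports AVAILABLE; the CLI call is isolated in a bg_status helper (an illustrative name) so the loop logic stays testable, and bgd-xxxxx remains a placeholder:

```shell
#!/usr/bin/env bash
# Poll a Blue/Green deployment until its status reaches AVAILABLE.
bg_status() {
  aws rds describe-blue-green-deployments \
    --blue-green-deployment-identifier "$1" \
    --query 'BlueGreenDeployments[0].Status' \
    --output text --region ap-northeast-1
}

wait_for_available() {
  local id="$1" interval="${2:-30}" status
  while true; do
    status=$(bg_status "${id}")
    echo "$(date '+%H:%M:%S') status=${status}"
    [ "${status}" = "AVAILABLE" ] && return 0
    sleep "${interval}"
  done
}

# Usage: wait_for_available bgd-xxxxx 30
```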

Switchover Execution and Results

Once the Green environment reaches AVAILABLE, start the connectivity test and execute the switchover.

Terminal
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier bgd-xxxxx \
  --switchover-timeout 300 \
  --region ap-northeast-1

Results:

Output (connectivity test during switchover)
10:55:55.518  #24  OK    134ms  172.31.26.53   ← Blue Writer (PG 16.9)
10:55:56.658  #25  FAIL 3046ms                 ← Timeouts begin
10:56:00.713  #26  FAIL 3055ms
10:56:04.777  #27  FAIL 3050ms
10:56:08.837  #28  FAIL 3049ms
10:56:12.895  #29  FAIL 3045ms
10:56:16.953  #30  FAIL 3045ms
10:56:21.008  #31  OK    249ms  172.31.26.53   ← Brief Blue connection
10:56:22.263  #32  OK    204ms  172.31.21.178  ← Green Writer (PG 17.6)

Switchover Analysis

Metric              Value
Failed connections  6
Downtime            ~26 seconds (10:55:56–10:56:22)
Failure pattern     All timeouts (~3s each)
IP transition       172.31.26.53 → 172.31.21.178

The differences from failover are clear:

  1. All failures are timeouts — During failover, some connections were immediately refused. During switchover, all failures were 3-second timeouts. The timeout (rather than immediate refusal) occurs because the Blue instance stops during switchover and no longer responds at the TCP level.
  2. Clean IP transition — Unlike failover where the same IP persisted, switchover shows a clear transition to the Green environment's new IP. Notably, #31 briefly connected to the Blue IP (172.31.26.53) — likely because DNS hadn't fully updated yet and the Blue instance momentarily accepted the connection.
  3. Version upgraded — Post-switchover connections return PG 17.6. The major version upgrade completed successfully.
Terminal (post-switchover verification)
psql -h bg-test-apg.cluster-xxxxx.rds.amazonaws.com -U postgres \
  -c "SELECT version();"
Output
PostgreSQL 17.6 on aarch64-unknown-linux-gnu

Summary

Results Comparison

Metric              Failover                             Blue/Green Switchover
Downtime            ~10s + ~12s (two separate windows)   ~26s (continuous)
Failed connections  12                                   6
Failure pattern     Refusals + timeouts mixed            Timeouts only
Version change      None                                 PG 16.9 → 17.6
Preparation time    None                                 ~30 min (Green provisioning)
App changes         None                                 None

Key Takeaways

  • DNS TTL of 60s is the biggest bottleneck — In both failover and switchover, DNS cache returning stale IPs is the primary cause of downtime. Reducing DNS cache TTL on the application side or switching to IP-based routing could help.
  • Blue/Green Switchover downtime is "longer but predictable" — Failover had connection refusals and timeouts mixed across two separate windows, while switchover showed a continuous 26-second window of uniform timeouts, making retry logic easier to design.
  • Green environment takes 30 minutes to build — With a major version upgrade, Green provisioning takes about 30 minutes. The Blue environment stays fully operational during this time, but it needs to be factored into maintenance window planning.
  • Logical replication must be enabled beforehand — Blue/Green deployments require rds.logical_replication = 1, which needs a custom parameter group and a full instance reboot. Enable this in production environments ahead of time.
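
Because the switchover outage above is one continuous window of uniform timeouts, a bounded client-side retry is enough to ride it out. A sketch (with_retry is an illustrative helper; size attempts × delay to comfortably exceed the measured ~26-second window):

```shell
#!/usr/bin/env bash
# Retry a command with a fixed delay between attempts.
# e.g. 10 attempts x 5s delay rides out well over the ~26s window measured above.
with_retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
    sleep "${delay}"
  done
  return 1
}

# Usage (placeholders):
# with_retry 10 5 psql -h "${ENDPOINT}" -U postgres -d postgres -c "SELECT 1"
```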

Next Up

In this test, a simple psql connectivity test showed 26 seconds of downtime during switchover. The AWS official blog claims the AWS JDBC Driver Blue/Green plugin can achieve "near-zero downtime."

In Part 2, I'll test this plugin and compare it against HikariCP retry-only configurations to measure the actual improvement.

Cleanup

Resource deletion commands
Terminal
# Delete Blue/Green deployment
aws rds delete-blue-green-deployment \
  --blue-green-deployment-identifier bgd-xxxxx \
  --delete-target \
  --region ap-northeast-1
 
# Delete old Blue instances (-old1 suffix)
aws rds delete-db-instance \
  --db-instance-identifier bg-test-apg-reader-old1 \
  --skip-final-snapshot \
  --region ap-northeast-1
aws rds delete-db-instance \
  --db-instance-identifier bg-test-apg-writer-old1 \
  --skip-final-snapshot \
  --region ap-northeast-1
 
# Delete old Blue cluster
aws rds delete-db-cluster \
  --db-cluster-identifier bg-test-apg-old1 \
  --skip-final-snapshot \
  --region ap-northeast-1
 
# Delete new cluster instances (promoted Green)
aws rds delete-db-instance \
  --db-instance-identifier bg-test-apg-reader \
  --skip-final-snapshot \
  --region ap-northeast-1
aws rds delete-db-instance \
  --db-instance-identifier bg-test-apg-writer \
  --skip-final-snapshot \
  --region ap-northeast-1
 
# Delete new cluster
aws rds delete-db-cluster \
  --db-cluster-identifier bg-test-apg \
  --skip-final-snapshot \
  --region ap-northeast-1
 
# Delete parameter groups (after cluster deletion)
aws rds delete-db-cluster-parameter-group \
  --db-cluster-parameter-group-name bg-test-apg16-params
aws rds delete-db-cluster-parameter-group \
  --db-cluster-parameter-group-name bg-test-apg17-params
 
# Delete subnet group
aws rds delete-db-subnet-group \
  --db-subnet-group-name bg-test-subnet-group
 
# Remove security group inbound rule
aws ec2 revoke-security-group-ingress \
  --group-id sg-xxxxx \
  --protocol tcp --port 5432 \
  --cidr <your-ip>/32

Shinya Tahara

Solutions Architect @ AWS

I'm a Solutions Architect at AWS, providing technical guidance primarily to financial industry customers. I share learnings about cloud architecture and AI/ML on this site. The views and opinions expressed on this site are my own and do not represent the official positions of my employer.
