When people talk about “high availability Postgres”, they usually wave their hands at a diagram with a primary, a couple of replicas, and a magic box in the middle that routes traffic and handles failover. That magic box is the hard part. Over the last couple of years I’ve built and maintained exactly that box — a two-image, container-based Postgres cluster fronted by Pgpool-II, with automatic failover, a floating virtual IP, and self-healing replicas. This post walks through how it actually works, the decisions behind it, and the sharp edges I hit along the way.
The whole thing lives in two repositories: one that builds the PostgreSQL container and one that builds the Pgpool-II container. They’re deliberately separate because they have very different jobs and very different runtime requirements, and keeping them apart made each one simpler to reason about.
The big picture
The cluster is two PostgreSQL nodes in streaming replication — one primary, one standby — with Pgpool-II sitting in front of them. Clients never talk to Postgres directly. They connect to a virtual IP (VIP) on port 9999, which Pgpool owns. Pgpool decides which backend is the primary, routes writes there, can load-balance reads, and — most importantly — promotes the standby and moves the VIP if the primary dies.
Pgpool runs in streaming replication mode. This is the mode where Postgres itself handles the actual data replication (via WAL streaming), and Pgpool just observes, routes, and orchestrates. Pgpool is not replicating anything itself; it’s the conductor, not the orchestra.
What base images did I use?
This is where the two-repo split gets interesting, because the two images start from completely different places.
The Postgres image is built straight on top of the official Debian-based postgres image — currently postgres:15.17. That gives me a battle-tested Postgres install, the standard docker-entrypoint.sh machinery, and a familiar Debian userland. On top of that I layer the things a clustered, self-recovering node needs: openssh-server (nodes SSH into each other during recovery), supervisor (to run both sshd and Postgres in one container), plus sudo, iproute2, arping, and expect. I also compile the pgpool-recovery SQL extension and the Pgpool client tools from source against the matching Pgpool version, because the recovery flow needs them on the database node, not just on the Pgpool node.
The Pgpool image is leaner and uses a multi-stage trick. The final image is built on Alpine (alpine:3.23.3), which keeps it small, but I still need the Postgres client binaries (psql, pg_basebackup, etc.) available for health checks and scripting. So the Dockerfile pulls them out of the official Alpine Postgres image:
FROM postgres:15.17-alpine AS postgresSrc
FROM alpine:3.23.3
...
# Add postgres15 binaries
COPY --from=postgresSrc /usr/local/bin/* /usr/local/bin/
Pgpool itself (4.7.1) is compiled from the official source tarball inside the image, with --with-openssl so TLS works.
If you dig through the git history, you can watch these versions march forward in lockstep. The cluster started life on alpine:3.16.3 / postgres:15.1 / pgpool 4.4.0 back in December 2023, and has been bumped roughly quarterly ever since — Alpine 3.19, 3.20, 3.21, 3.22, 3.23; Pgpool 4.5, 4.6, and now 4.7.1. There’s one entertaining detour in the log: in December 2024 I tried jumping to postgres:16.6, and then reverted to 15.10 the same day. Major-version Postgres upgrades are not a “bump the tag and rebuild” affair when you have replication and on-disk data directories involved, so 15.x it stayed.
One subtle but important detail: both images are pinned to the same Pgpool version. Pgpool’s client tools and the pgpool-recovery extension talk a protocol that expects matching versions, so the two repos are upgraded together, every time.
How does Pgpool discover the primary vs. the replicas?
This trips a lot of people up, because Pgpool’s answer is delightfully boring: it just asks Postgres.
In streaming replication mode, every backend is listed in pgpool.conf as backend_hostname0, backend_hostname1, and so on. Pgpool doesn’t assume any of them is the primary. Instead, on a configurable interval (the streaming replication check, sr_check_period), it connects to each backend as the sr_check_user and effectively runs SELECT pg_is_in_recovery(). A node that answers false is the primary; nodes that answer true are standbys still replaying WAL. Pgpool also reads pg_stat_replication to understand replication lag and confirm the topology.
That’s the entire discovery mechanism — there is no config file declaring “node 0 is the boss.” The roles are discovered at runtime from the database’s own view of reality, which is exactly what you want, because after a failover the boss changes and Pgpool needs to notice on its own.
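If you want to ask the same question by hand, a rough sketch looks like the following. This is not Pgpool’s own code (Pgpool does this internally); role_from_recovery_flag and probe_backend are hypothetical helper names, and the host, port, database, and pgpool user are placeholders. The -A, -t, and -c flags are standard psql options that reduce the output to a bare t or f.

```shell
#!/bin/sh
# Sketch: map the result of SELECT pg_is_in_recovery() to a role name.
# role_from_recovery_flag is a hypothetical helper, not part of Pgpool.
role_from_recovery_flag() {
    case "$1" in
        f) echo "primary" ;;            # not in recovery: accepts writes
        t) echo "standby" ;;            # in recovery: replaying WAL
        *) echo "unknown"; return 1 ;;  # anything else is treated as an error
    esac
}

# Ask one backend directly. Host, port, database, and user are placeholders;
# psql picks the password up from PGPASSWORD or ~/.pgpass as usual.
probe_backend() {
    flag=$(psql -h "$1" -p "${2:-5432}" -U pgpool -d postgres -Atc \
        'SELECT pg_is_in_recovery()') || return 1
    role_from_recovery_flag "$flag"
}
```

Calling probe_backend with a backend’s hostname should print primary or standby, assuming that backend is reachable and the role can log in.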
The credentials for these checks are wired up at container start. My entrypoint takes the database password from the environment and injects it into the config:
sed -i \
  -e "s|#SR_PASS_HERE|sr_check_password = '$POSTGRES_PASSWORD'|g" \
  -e "s|#H_PASS_HERE|health_check_password = '$POSTGRES_PASSWORD'|g" \
  .../pgpool.conf
So sr_check (which discovers roles) and health_check (which detects when a node has gone away) both authenticate as a real Postgres user. On the database side, the init script creates a dedicated pgpool role and grants it pg_monitor, which is precisely the privilege set those system queries need.
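One thing worth guarding against in a substitution like this: a password containing |, &, or a backslash gets misread by sed, because those characters are special in the replacement half of an s command that uses | as its delimiter. Here is a minimal sketch of an escaping step. escape_sed_replacement and render_sr_check_line are hypothetical helpers, not something from my entrypoint, and the sketch does not handle single quotes or newlines in the password.

```shell
#!/bin/sh
# Hypothetical helper: backslash-escape the characters sed treats specially
# in the replacement part of s|...|...| (backslash, ampersand, and the | delimiter).
escape_sed_replacement() {
    printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'
}

# Render one config line from its placeholder, using the escaped password.
render_sr_check_line() {
    esc=$(escape_sed_replacement "$1")
    printf '%s\n' '#SR_PASS_HERE' |
        sed -e "s|#SR_PASS_HERE|sr_check_password = '$esc'|g"
}
```

The same escaped value could be reused for the health_check_password line, so both settings stay derived from one source.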
Alongside Pgpool’s runtime discovery, each Postgres node also keeps a little breadcrumb file at /var/lib/postgresql/.node_role containing MASTER or BACKUP. That file isn’t how Pgpool decides anything — it’s how the container remembers what it is across restarts, and the failover scripts rewrite it when roles change. It’s a belt-and-suspenders thing, and it matters for the bootstrap story below.
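Reading that breadcrumb safely is a few lines of shell. This is a sketch under assumptions: read_node_role is a hypothetical name, and because I haven’t described here what the entrypoint does when the file is missing or malformed, the sketch simply reports failure in both cases.

```shell
#!/bin/sh
# Sketch: read the role breadcrumb and refuse anything unexpected.
# The default path is the one named in the text; read_node_role is hypothetical.
read_node_role() {
    role_file="${1:-/var/lib/postgresql/.node_role}"
    [ -r "$role_file" ] || return 1
    # Strip whitespace so a trailing newline does not affect the comparison.
    role=$(tr -d '[:space:]' < "$role_file")
    case "$role" in
        MASTER|BACKUP) echo "$role" ;;
        *) return 1 ;;
    esac
}
```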
How does failover actually work?
Here’s the part everyone actually cares about. Say the primary dies — the process crashes, the node is fenced, whatever. The sequence that follows is a small choreography between Pgpool’s watchdog, a pile of shell scripts, and SSH.
1. Detection. Pgpool’s health check stops getting answers from the primary. It declares the node down and triggers failover_command, which runs my failover.sh.
2. Promotion. failover.sh first sanity-checks that there’s still a viable node to promote (if every node is down, it bails rather than thrashing). Then it SSHes into the new main node and promotes it:
ssh -p 2222 ... postgres@${NEW_MAIN_NODE_HOST} \
  -i ~/.ssh/id_rsa_pgpool ${PGHOME}/bin/pg_ctl -D ${NEW_MAIN_NODE_PGDATA} -w promote
Notice the SSH on port 2222 and the dedicated key — more on that in the gotchas. After promotion, it rewrites the .node_role breadcrumb files: MASTER on the newly promoted node, BACKUP on the old primary, so the cluster’s idea of itself stays consistent.
3. The VIP moves. This is what makes failover invisible to clients. Pgpool runs a watchdog that owns a delegated virtual IP. When a Pgpool node escalates to active, it runs escalation.sh, which SSHes to the other Pgpool nodes and tears the VIP off their interfaces first (so two nodes never claim the same IP), then brings the VIP up locally via ifupdown.sh:
sudo /sbin/ip addr add $POSTGRES_VIP dev $VIP_INTERFACE_NAME label $VIP_INTERFACE_NAME:0
It then fires off a gratuitous ARP (arping -U) so the local network updates its ARP tables and traffic to the VIP starts landing on the new active node. From a client’s perspective, the connection blips and the same IP keeps working.
4. Re-syncing the survivors. A promoted standby is now the primary, but any other standbys are still following the old, dead primary’s timeline. Pgpool runs follow_primary.sh for them. This script tries to be gentle first: it runs pg_rewind to reconcile the standby with the new primary’s timeline, rewriting the recovery config to point primary_conninfo at the new primary and creating a fresh physical replication slot. If pg_rewind fails (timelines too divergent, missing WAL, etc.), it falls back to a full pg_basebackup. Once the standby is caught up, pcp_attach_node brings it back into Pgpool’s rotation.
5. Rebuilding a dead node from scratch. When a node that was completely lost comes back, it bootstraps as a BACKUP. The entrypoint writes a .do_not_start marker so Postgres doesn’t try to come up with stale data, waits until it can see a healthy primary through the VIP (show pool_nodes reporting up ... primary), wipes its data directory, and triggers an online recovery via pcp_recovery_node. That recovery runs recovery_1st_stage on the primary, which pg_basebackups a fresh copy over to the recovering node, writes its recovery config, and then pgpool_remote_start starts it back up as a standby.
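The “gentle first, heavy second” logic in step 4 can be sketched roughly like this. It outlines the control flow only and is not the contents of follow_primary.sh: try_rewind and take_basebackup are hypothetical wrappers that you would write around pg_rewind and pg_basebackup with your own hosts, ports, and data directory.

```shell
#!/bin/sh
# Sketch of rewind-then-rebuild. try_rewind and take_basebackup are
# hypothetical functions the caller must define before calling resync_standby.
resync_standby() {
    if try_rewind; then
        echo "resynced with pg_rewind"
    elif take_basebackup; then
        echo "pg_rewind failed; rebuilt with pg_basebackup"
    else
        echo "resync failed; leaving node detached" >&2
        return 1
    fi
}
```

Only a zero exit status from resync_standby should lead on to pcp_attach_node, so a half-synced standby never rejoins the rotation.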
The thing I like about this design is that it’s all driven by Postgres’ own tooling — pg_ctl promote, pg_rewind, pg_basebackup, replication slots. Pgpool is the brain that decides when and where; Postgres does the actual data movement.
What was the trickiest part to get working?
Hands down: the watchdog ping in a container.
Pgpool’s watchdog uses ICMP ping to decide whether the upstream network (and its peer Pgpool nodes) are reachable. That reachability check is what gates VIP escalation — if the watchdog thinks the network is down, it won’t bring up the VIP. The watchdog parses the output of ping to extract the average round-trip time. The relevant code in src/watchdog/wd_ping.c walks the summary line looking for the average value after the fourth /, because on a normal Linux box ping prints:
rtt min/avg/max/mdev = 0.045/0.045/0.046/0.006 ms
But inside the container — on Alpine, with BusyBox’s ping — the summary line looks like this instead:
round-trip min/avg/max = 0.103/0.112/0.127 ms
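Three values, no mdev, and one slash fewer: the average now sits after the third /, not the fourth. Pgpool’s parser skipped to the fourth /, which here sits in front of the last value rather than the average, ran off the end of the string looking for the next delimiter, failed to read an RTT, and concluded the network was unreachable. The watchdog would then refuse to manage the VIP correctly. Everything looked configured right, and nothing worked.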
The fix was a small source patch (fix_rtt_issue.patch) applied at build time, changing the loop to stop at the third slash:
- for (i = 0; i < 4; i++)
+ for (i = 0; i < 3; i++)
That one-character change was the difference between a cluster that failed over cleanly and one that silently refused to. It’s a perfect example of the kind of bug you only hit when you take software written with one environment in mind (a full Linux distro) and run it in another (a minimal container). There’s a second, more mundane patch in the build too: fix_compile_error.patch adds a missing #include <sys/time.h> so Pgpool’s tooling compiles cleanly against the newer toolchain. The RTT one is the bug that ate the most days, though.
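To make the format difference concrete, here is a small sketch that pulls the average out of either summary line by splitting on the = first and then taking the second /-separated value. This is not how Pgpool does it (its parser is C and counts slashes from the start of the line); avg_rtt is a hypothetical helper for illustration only.

```shell
#!/bin/sh
# Sketch: extract the average RTT from a ping summary line, whether it is
# the four-value form (min/avg/max/mdev) or the BusyBox form (min/avg/max).
avg_rtt() {
    printf '%s\n' "$1" | awk -F' = ' '
        /min\/avg\/max/ {
            split($2, values, "/")   # values[1] is min, values[2] is avg
            print values[2]
            found = 1
        }
        END { if (!found) exit 1 }'
}
```

Because the split starts from the values rather than the labels, the number of label slashes stops mattering: both sample lines above yield their average (0.045 and 0.112 respectively).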
Gotchas someone trying this would hit
If you’re going to build something like this yourself, here are the landmines I stepped on so you don’t have to.
SSH runs on port 2222, not 22. The recovery and failover scripts SSH between nodes constantly — to promote, to pg_rewind, to pg_basebackup, to start and stop Postgres. I run sshd on 2222 inside the Postgres container, and every script has to know that. Early versions of these scripts assumed port 22 and just hung. If you see failover stall, check the SSH port first.
Passwordless SSH has to actually work, in both directions. Every script begins with an ssh ... ls /tmp probe and bails loudly if it fails, which is a hint at how often this is the root cause. The cluster ships a dedicated key (id_rsa_pgpool) baked into the images. The nodes have to trust each other before you ever need a failover, because that’s the worst possible time to discover the key isn’t set up.
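A probe along those lines might look like the sketch below. The port and key name come from this setup; the rest is assumption: probe_ssh and the SSH_BIN override are hypothetical, and BatchMode=yes and ConnectTimeout are standard OpenSSH options I’d suggest so a broken key or an unreachable node fails quickly instead of hanging on a prompt.

```shell
#!/bin/sh
# Sketch of a fail-fast SSH probe. SSH_BIN is a hypothetical override so the
# function can be exercised without a real cluster; it defaults to ssh.
probe_ssh() {
    host="$1"
    "${SSH_BIN:-ssh}" -p 2222 -i "$HOME/.ssh/id_rsa_pgpool" \
        -o BatchMode=yes -o ConnectTimeout=5 \
        "postgres@${host}" ls /tmp >/dev/null 2>&1 || {
        echo "passwordless SSH to ${host} failed" >&2
        return 1
    }
}
```

Running this from every node to every other node, in both directions, at deploy time is a cheap way to find a missing key before a failover does.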
The VIP needs sudo and a flat L2 network. Bringing the VIP up and down means ip addr add/del, and the gratuitous ARP means arping. Both need root, so the images grant the postgres user passwordless sudo for exactly those two binaries:
postgres ALL=NOPASSWD: /sbin/ip
postgres ALL=NOPASSWD: /usr/sbin/arping
The bigger gotcha is the VIP itself. A floating IP that migrates via ARP only works when all the nodes are on the same layer-2 segment. On a flat LAN or a bridged network this is great. In most public clouds, where you can’t just claim an arbitrary IP and ARP for it, this approach won’t work and you’d need a cloud load balancer or a different VIP mechanism instead.
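For reference, the up and down halves of the VIP dance can be sketched as below. POSTGRES_VIP and VIP_INTERFACE_NAME are the variables used earlier; everything else is assumption. RUN is a hypothetical prefix for dry runs, the example address and interface are placeholders, and the -c and -I flags follow the iputils flavour of arping, so check them against whichever arping your image ships.

```shell
#!/bin/sh
# Sketch of VIP up/down. RUN is a hypothetical prefix: set RUN=echo for a dry
# run, or leave it unset to execute through sudo.
vip_up() {
    ${RUN:-sudo} /sbin/ip addr add "$POSTGRES_VIP" dev "$VIP_INTERFACE_NAME" \
        label "${VIP_INTERFACE_NAME}:0" || return 1
    # Gratuitous ARP so neighbours learn the new owner. Any /prefix suffix is
    # stripped from the address here. -c and -I are assumed iputils-style flags.
    ${RUN:-sudo} /usr/sbin/arping -U -c 3 -I "$VIP_INTERFACE_NAME" "${POSTGRES_VIP%/*}"
}

vip_down() {
    ${RUN:-sudo} /sbin/ip addr del "$POSTGRES_VIP" dev "$VIP_INTERFACE_NAME"
}
```

With RUN=echo, POSTGRES_VIP=192.0.2.10/24, and VIP_INTERFACE_NAME=eth0, calling vip_up just prints the two commands it would run, which is a handy way to check the sudoers entries cover exactly those binaries.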
pcp_recovery_node needs the -W flag. There’s a tiny one-flag commit in the history titled “fixed bug pcp_recovery_node” that adds -W to the recovery command. Without it, the PCP client doesn’t prompt for the password the expect script is waiting to send, and node recovery silently fails. Tiny flag, big consequences.
Password encryption is fiddly. The cluster uses scram-sha-256 end to end. Pgpool needs a pool_passwd file and a .pgpoolkey to decrypt credentials, and these are generated at container start with pg_enc. If the key and the encrypted passwords ever get out of sync — say you regenerate one but not the other — authentication fails in confusing ways. Keep them generated together, from the same source of truth.
The two images are not symmetric, and that’s on purpose. It’s tempting to think “it’s all Postgres, use one image.” But the Postgres node needs a full Debian userland, supervisor, and sshd; the Pgpool node wants to be a lean Alpine box. Trying to merge them just gives you the worst of both. Keeping them separate — but version-locked — has been the right call.
Closing thoughts
None of the individual pieces here are exotic: streaming replication, a connection pooler, a virtual IP, some shell glue. What makes a setup like this work is getting the seams right — the SSH between nodes, the credentials, the exact format a parser expects, the order in which a VIP comes up and down. Most of my time wasn’t spent on the happy path; it was spent on the failure paths, because those are the only paths that matter when a primary goes down at 3 a.m.
If you take one thing away, let it be this: when you run mature software in an environment it wasn’t originally written for, the assumptions it makes — about ping output, about SSH ports, about who’s root — are exactly where it’ll break. Test your failover for real, repeatedly, before you trust it.