Custom Db2 Image Running as a Cluster in Docker

IBM Db2 is not the first database you think of when someone says “let’s run it in containers.” It’s big, it’s enterprise, it ships as a CentOS-based image that expects to own its host, and its high-availability story (HADR — High Availability Disaster Recovery) was designed for long-lived servers, not ephemeral containers. So naturally, that’s exactly what I built: a custom Db2 image that runs as a two-node, automatically-failing-over cluster in Docker, with a floating virtual IP in front of it. This post is the tour — how the image is put together, how the two nodes figure out who’s in charge, how failover actually fires, and the sharp edges that took the most iterations to file down.

TL;DR

One Docker image (built on IBM’s official Db2 community image) runs in two modes — standalone or a two-node cluster. The cluster uses Db2 HADR for replication, a small shell “supervisor” loop on each node to watch the other and trigger takeover, a plain-text state file plus SSH to agree on who is Primary, and a script-managed virtual IP so clients always hit the active node. No external arbiter, no orchestrator — just two containers talking to each other.

The shape of the thing

Two containers, each on its own host, each running the same image. One is Primary, one is Secondary. Db2 HADR ships the transaction log from Primary to Secondary so the standby is a warm, continuously-updated copy. Clients never connect to a node’s real IP — they connect to a virtual IP (VIP) that lives on whichever node is currently Primary. If the Primary dies, the Secondary forces a takeover, grabs the VIP, and the world keeps turning.

The interesting part is that there is no Pacemaker, no etcd, no Kubernetes operator, no Db2 cluster manager doing the orchestration. The coordination is a few hundred lines of bash running as a supervised process inside each container — either charmingly simple or slightly terrifying depending on your temperament. For a two-node setup, it’s been solid.

What base image did I use?

The foundation is IBM’s official community image, icr.io/db2_community/db2:11.5.9.0, pulled straight from IBM’s container registry. That gets you a real, licensed-for-use Db2 11.5.9.0 install with all the instance-creation machinery IBM bakes in. You accept the license with an environment variable (LICENSE=accept) and away you go.

On top of that base I layer only what the clustering needs:

FROM icr.io/db2_community/db2:11.5.9.0

RUN yum install -y netcat iproute

netcat is the health probe, and iproute gives me ip for managing the VIP.

There’s one unavoidable wrinkle with this base: it’s built on CentOS, and CentOS has reached end of life, which means the default mirrorlist URLs its package manager points at are dead. If you try to yum install anything without fixing that first, the build just fails on a network error. So the very first thing the Dockerfile does, ahead of the yum install line shown above, is repoint the repos at the CentOS vault archive:

RUN sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/centos*; \
    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/centos*

This fix is easy to miss and can cost you an afternoon the first time you hit it.

The same image serves both cluster nodes and a standalone, non-clustered deployment. The DB2_CLUSTER environment variable flips between the two, so there’s only one artifact to build and ship.

The clustering didn’t exist when I started — the first versions were just a tidied-up standalone Db2 image. I built the cluster support up one piece at a time: the virtual IP, the takeover logic, the “wait for the other node” handshakes. Fully automatic, configuration-driven failover came after that, and it took a lot of iterating before I trusted it. The image is on its 3.16 release now, still running Db2 11.5.9.0.

How does a node know if it’s Primary or Secondary?

This is where a containerized Db2 cluster differs sharply from, say, a Postgres-with-a-pooler setup where the proxy interrogates the database to discover roles. Here, role is tracked in two layers, and it’s worth keeping them straight.

The source of truth for cluster role is a small plain-text state file on shared-nothing local storage (persisted in each node’s data volume). It looks essentially like this:

PRIMARY_STATUS=started_ok
SECONDARY_STATUS=started_ok
PRIMARY=172.16.1.1
SECONDARY=172.16.1.2

When a container boots, it figures out its own IP and checks which line it matches. If its IP is on the PRIMARY= line, it’s the Primary; if it’s on the SECONDARY= line, it’s the Secondary. The NODE_TYPE environment variable you pass at deploy time (Primary or Secondary) is only used for the very first bootstrap — after that, the file wins, because roles change at runtime during a failover and the env var would be stale.
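
As a minimal sketch of that lookup, assuming the four-line file layout shown above: the function name role_from_state_file and its two arguments are hypothetical, not the names used in the real scripts.

```shell
# Sketch: decide this node's role by matching its IP against the state file.
# role_from_state_file is a hypothetical name used only for illustration.
role_from_state_file() {
    local state_file="$1" my_ip="$2" primary secondary
    # Read only the two address lines; the *_STATUS lines do not match
    # because "PRIMARY_" is not "PRIMARY=".
    primary=$(sed -n 's/^PRIMARY=//p' "$state_file")
    secondary=$(sed -n 's/^SECONDARY=//p' "$state_file")
    if [ "$my_ip" = "$primary" ]; then
        echo Primary
    elif [ "$my_ip" = "$secondary" ]; then
        echo Secondary
    else
        # No match: report it instead of silently defaulting to a role.
        echo unknown
        return 1
    fi
}
```

Returning a non-zero status for the no-match case lets the caller stop early instead of carrying on with a guessed role.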

The source of truth for database state is Db2 itself. At any moment you can ask a database what HADR role it’s playing:

# HADR_ROLE = PRIMARY   (or STANDBY)
db2pd -db <DBNAME> -hadr | grep HADR_ROLE

and whether replication is healthy:

# HADR_CONNECT_STATUS = CONNECTED
db2pd -db <DBNAME> -hadr | grep HADR_CONNECT_STATUS

The cluster scripts lean on both. The state file tells a node what it’s supposed to be; db2pd tells it what Db2 actually thinks it is. A lot of the trickier logic exists to reconcile those two views — for example, a node that the file says is Primary will SSH to its peer and check the peer’s real HADR role before doing anything drastic, precisely so two nodes don’t both decide they’re the boss.
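
Here is a rough sketch of that reconciliation, assuming db2pd prints each field as NAME = VALUE, as in the two snippets above. The helper names hadr_field and roles_agree are mine, chosen for illustration.

```shell
# hadr_field: read `db2pd -db <DBNAME> -hadr` output on stdin and print the
# value of one field, e.g. HADR_ROLE or HADR_CONNECT_STATUS.
hadr_field() {
    local field="$1"
    awk -v f="$field" '$1 == f && $2 == "=" { print $3; exit }'
}

# roles_agree: succeed only when the role from the state file
# (Primary/Secondary) matches what Db2 reports (PRIMARY/STANDBY).
roles_agree() {
    local file_role="$1" hadr_role="$2"
    case "$file_role:$hadr_role" in
        Primary:PRIMARY|Secondary:STANDBY) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended use (needs a real Db2 instance):
#   hadr_role=$(db2pd -db "$DBNAME" -hadr | hadr_field HADR_ROLE)
#   roles_agree "$FILE_ROLE" "$hadr_role" || echo "state file and Db2 disagree" >&2
```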

How does failover actually work?

Each node runs a long-lived monitoring loop (started and kept alive by supervisor, so if it ever crashes it comes right back). Stripped to its essence, the loop does this, forever:

1. Probe the peer. It uses a plain TCP knock with netcat against a port on the other node that you nominate via OTHER_NODE_MONITOR_PORT, retrying a few times before declaring anything wrong:

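# Probe the peer's monitor port up to 5 times, 2 seconds apart.
retry=5
NODE_DOWN=true
while [ "$retry" -gt 0 ]; do
    # nc -z only checks that the port accepts a TCP connection.
    if nc -z "$OTHER_NODE_IP" "$OTHER_NODE_MONITOR_PORT" >/dev/null 2>&1; then
        NODE_DOWN=false
        break
    fi
    retry=$((retry - 1))
    sleep 2
done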

2. React based on who I am. If the peer is down and I am the Secondary, that’s a Primary failure, and I take over by force:

db2 takeover hadr on database <DBNAME> by force

takeover ... by force is the HADR command that promotes a standby to primary even when it can’t reach the old primary to do a graceful handoff. The script runs it for every database, then verifies each one actually came up as PRIMARY via db2pd.

3. Grab the virtual IP. Once the databases are promoted, the new Primary brings the VIP up on its own network interface:

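# Interface that carries the default route (first match only).
NET_IF=$(ip route show default | awk '{ print $5; exit }')
# Attach the VIP as a labelled extra address on that interface.
ip addr add "${CLUSTER_VIRTUAL_IP}" dev "${NET_IF}" label "${NET_IF}:1"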

From the client’s perspective the IP they’ve always been talking to simply keeps answering — now from a different machine.

4. Rewrite the world. The new Primary updates the state file to put its own IP on the PRIMARY= line and the peer’s on SECONDARY=, then tries to SSH into the (possibly still-flapping) old Primary to flip its role too. It also starts a log-cleaning maintenance service that should only ever run on the active node.
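
The promote-then-verify part of step 2 can be sketched roughly like this. It assumes the database names are passed as arguments and that db2pd prints HADR_ROLE = PRIMARY after a successful promotion. The function name force_takeover_all is hypothetical.

```shell
# force_takeover_all: promote every database named on the command line,
# then confirm each one through db2pd. Returns non-zero if any database
# did not come up as PRIMARY.
force_takeover_all() {
    local db role failed=0
    for db in "$@"; do
        # The db2 exit status is ignored on purpose; the db2pd check below
        # is what decides whether the promotion worked.
        db2 takeover hadr on database "$db" by force >/dev/null || true
        role=$(db2pd -db "$db" -hadr | awk '$1 == "HADR_ROLE" { print $3; exit }')
        if [ "$role" != "PRIMARY" ]; then
            echo "takeover not confirmed for $db (HADR_ROLE=${role:-unknown})" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Checking every database before returning, rather than bailing on the first failure, means one stuck database does not stop the others from being promoted.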

There’s a second, gentler failover path that handles the “I came back” case. Say the original Primary crashed, the Secondary took over, and now the original node restarts. It boots, reads the state file, and discovers it’s now listed as Secondary even though it was configured to be Primary. Rather than fighting, it waits, confirms both nodes are healthy and replication is CONNECTED, and then does a graceful takeover (not “by force”) to reclaim the Primary role — and signals the peer, via a small flag file dropped over SSH, to restart its databases cleanly as the standby. The VIP is always stripped from any node that finds itself in the Secondary role, so it can never live in two places at once.
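
One way to keep the VIP rule idempotent is to separate “is the address there?” from “what should I do about it?”. The sketch below assumes the one-line-per-address output of ip -o -4 addr show. Both function names are hypothetical.

```shell
# vip_is_present: read `ip -o -4 addr show` output on stdin and succeed if
# the given address (with or without a /prefix) is assigned to any interface.
vip_is_present() {
    local vip="${1%%/*}"
    awk -v vip="$vip" '$3 == "inet" { split($4, a, "/"); if (a[1] == vip) found = 1 }
                       END { exit found ? 0 : 1 }'
}

# vip_action: print what a node should do with the VIP, given its role
# (Primary/Secondary) and whether the VIP is currently present (yes/no).
vip_action() {
    local role="$1" present="$2"
    case "$role:$present" in
        Primary:no)    echo add ;;
        Secondary:yes) echo del ;;
        *)             echo none ;;
    esac
}

# Intended use on a real node:
#   if ip -o -4 addr show | vip_is_present "$CLUSTER_VIRTUAL_IP"; then p=yes; else p=no; fi
#   vip_action "$ROLE" "$p"    # prints add, del, or none
```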

Bootstrapping a brand-new cluster uses the same machinery in reverse. The two nodes handshake through the state file’s status fields (init_wait_for_secondary, started_empty, backup_copied, started_ok), each waiting for the other to reach the expected status before proceeding. The Primary backs up each database and scps the backups to the Secondary, which restores from the latest one (picked by parsing the timestamp out of the filenames) and then starts HADR as the standby.
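
Picking the latest backup can look something like the sketch below. It assumes backup images are named alias.type.instance.DBPARTnnn.timestamp.sequence, with a 14-digit timestamp as the fifth dot-separated field. Treat that layout, and the function name latest_backup, as assumptions to check against your own backup directory.

```shell
# latest_backup: print the path of the newest backup image for one database,
# judged by the timestamp embedded in the file name (not by file mtime).
latest_backup() {
    local dir="$1" db="$2" f ts
    for f in "$dir"/"$db".*; do
        [ -e "$f" ] || continue
        # Fifth dot-separated field of the file name is the timestamp.
        ts=$(basename "$f" | awk -F. '{ print $5 }')
        case "$ts" in
            ''|*[!0-9]*) continue ;;   # skip names that do not fit the layout
        esac
        printf '%s %s\n' "$ts" "$f"
    done | sort -k1,1 | tail -n 1 | cut -d' ' -f2-
}
```

Because the timestamps are fixed-width digits, a plain lexical sort orders them chronologically. The function prints nothing when no image matches.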

That log-cleaning service mentioned above is the one piece of maintenance that only ever runs on the active node, which is why it gets added and removed as roles change. On a loop, it takes a fresh backup of each database and prunes the old transaction logs on a schedule you control (LOG_CLEAN_FREQUENCY, default every 24 hours), and it sweeps away backup files older than BACKUP_RETENTION_PERIOD (default 30 days). The important detail is that it refuses to prune logs unless HADR reports the standby as CONNECTED and in sync, because the Primary has to retain any logs the Secondary hasn’t replayed yet. Trimming logs while the standby is behind would strand it and break replication, so the in-sync check is a hard gate, not a nicety.
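
A minimal version of that gate might look like this. Treating “in sync” as HADR_STATE = PEER is my assumption for the sketch, and safe_to_prune is a hypothetical name.

```shell
# safe_to_prune: read `db2pd -db <DBNAME> -hadr` output on stdin and succeed
# only when the standby is connected AND in peer state.
# Assumption: "in sync" is represented by HADR_STATE = PEER.
safe_to_prune() {
    awk '$1 == "HADR_CONNECT_STATUS" { conn = $3 }
         $1 == "HADR_STATE"          { state = $3 }
         END { exit (conn == "CONNECTED" && state == "PEER") ? 0 : 1 }'
}

# Intended use (needs a real Db2 instance):
#   if db2pd -db "$DBNAME" -hadr | safe_to_prune; then
#       : # prune logs here
#   else
#       echo "standby not in sync, skipping log prune for $DBNAME" >&2
#   fi
```

If either field is missing from the output, the check fails closed and nothing gets pruned.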

What was the trickiest part to get working?

Getting two completely independent containers — no shared brain, no quorum service — to reliably agree on a single, consistent answer to “who is the Primary?” through crashes, restarts, and network blips. That’s the whole ballgame, and the git history is basically an archaeological record of me getting it wrong and fixing it: bug after bug in the “wait for the other node” logic, the takeover logic, the role reconciliation.

The hard cases aren’t the happy path. They’re things like: both nodes boot at once and race to write the state file; the old Primary comes back to life mid-failover and briefly thinks it’s still in charge; the peer is reachable by SSH but its databases are in a weird half-state. The defenses that emerged are the bidirectional state file (each node writes the role change to both its own copy and the peer’s, over SSH), the “check the peer’s real HADR role before acting” guard, and the explicit status-field handshake during bootstrap so a node never proceeds on a stale assumption. None of those were in the first version. They’re all scar tissue.
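
To make the state-file side of those defenses concrete, here is a small sketch of the helpers such a scheme needs: read a key, rewrite a key without exposing a half-written file, and wait for a status. The names state_get, state_set and wait_for_status are hypothetical, and the one-second poll interval is an arbitrary choice for the example.

```shell
# state_get: print the value stored for KEY in the state file.
state_get() {
    local file="$1" key="$2"
    sed -n "s/^${key}=//p" "$file" | head -n 1
}

# state_set: replace (or append) KEY=VALUE. The new content is written to a
# temp file next to the original and renamed into place, so a reader never
# sees a partially written file.
state_set() {
    local file="$1" key="$2" value="$3" tmp
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    awk -F= -v k="$key" -v v="$value" '
        $1 == k { print k "=" v; seen = 1; next }
        { print }
        END { if (!seen) print k "=" v }' "$file" > "$tmp" && mv "$tmp" "$file"
}

# wait_for_status: poll once per second until KEY equals EXPECTED,
# giving up after TRIES attempts.
wait_for_status() {
    local file="$1" key="$2" expected="$3" tries="$4"
    while [ "$tries" -gt 0 ]; do
        [ "$(state_get "$file" "$key")" = "$expected" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

The bidirectional write then amounts to calling state_set locally and running the same call on the peer over SSH. Note that mktemp creates the temp file with restrictive permissions, so the renamed state file inherits them.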

A close runner-up — and a great example of how a container’s environment quietly breaks assumptions — was simply getting each node’s own IP address right. The original code used hostname -i. Inside a container running with host networking, that can hand you the wrong address (or several space-separated addresses), which then fails to match either line in the state file, and suddenly a node has no idea what role it is. The fix was to derive the real outbound IP from the routing table instead:

MY_IP=$(ip route get 1 | awk '{print $7; exit}')

That one-liner — “what source IP would I use to reach the internet?” — is far more reliable than hostname -i in a container.
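
One caveat with picking field 7: it lines up with the source address only when the route goes through a gateway. A slightly more defensive sketch, which is my variation rather than what the image ships, looks for the src keyword instead of counting fields. The function name src_ip_from_route is hypothetical.

```shell
# src_ip_from_route: read `ip route get <addr>` output on stdin and print the
# address that follows the "src" keyword, wherever it appears on the line.
src_ip_from_route() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Intended use on a real node:
#   MY_IP=$(ip route get 1 | src_ip_from_route)
```

This gives the same answer as the one-liner on a gateway route and still works when the destination is directly connected and the via field is absent.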

Gotchas someone trying this would hit

The container needs privileged: true and host networking. Bringing a virtual IP up and down with ip addr add requires capabilities a normal container doesn’t have, and HADR wants to bind its replication ports on the host’s network. The cluster compose file runs with network_mode: "host" and privileged: true. If you forget either, the VIP silently won’t come up and you’ll spend an hour wondering why failover “worked” but nobody can connect.

SSH can’t use port 22. Because the containers run with host networking, the host’s own SSH daemon already owns port 22. The container’s sshd has to listen somewhere else — I expose it on a configurable SERVER_SSH_PORT (e.g. 5022). Every cross-node command (status checks, state-file edits, backup copies, role flips) goes over that port using a baked-in key, and root login is restricted to key auth only. Passwordless SSH between the two nodes has to be working before you ever need a failover — that’s the worst possible moment to discover the key isn’t trusted.
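
A cheap way to catch that early is a preflight check that runs long before any failover. The sketch below assumes cross-node commands run as root on the peer, and it reuses SERVER_SSH_PORT and OTHER_NODE_IP from the configuration. The function names peer_ssh and ssh_preflight are hypothetical.

```shell
# peer_ssh: run a command on the other node over the non-default SSH port.
# BatchMode=yes makes ssh fail instead of prompting for a password, and
# ConnectTimeout keeps a dead peer from hanging the caller.
# Assumption: the remote user is root.
peer_ssh() {
    ssh -p "$SERVER_SSH_PORT" -o BatchMode=yes -o ConnectTimeout=5 \
        "root@$OTHER_NODE_IP" "$@"
}

# ssh_preflight: verify that key-based login to the peer works right now.
ssh_preflight() {
    if peer_ssh true; then
        echo "peer ssh ok"
    else
        echo "peer ssh FAILED on port $SERVER_SSH_PORT" >&2
        return 1
    fi
}
```

Running ssh_preflight at container start, and again on a timer, turns an untrusted key into a visible warning rather than a surprise during takeover.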

The CentOS base is EOL. As mentioned, the stock package mirrors are gone. If you rebuild from this base and skip the vault repo rewrite, every yum install fails. Pin it once in the Dockerfile and forget about it.

“Takeover by force” is exactly as blunt as it sounds. The hard-failover path promotes the standby even though it can’t confirm the old primary is truly dead — and the only health signal it has is a single TCP port being unreachable. If the old primary is alive but merely unreachable on that one port, you’ve got the ingredients for split-brain. The reconciliation path and the “check the peer’s HADR role first” guard minimize this, but the honest takeaway is that a two-node, no-quorum cluster trades away the split-brain safety a third arbiter node would give you. After a forced takeover, you may still need to manually reconcile the recovered node back into the Secondary role before restarting it.

Closing thoughts

What I like about this build is that it demystifies “database HA.” There’s no magic — just log shipping (Db2 HADR), a health check (nc), a promotion command (takeover ... by force), a floating address (ip addr add), and a lot of careful bookkeeping to make sure two machines never disagree about reality. Putting it in a container forced every one of those assumptions into the open, because a container is a hostile environment for software that expects to be a pet server: the IP isn’t what you think, port 22 is taken, and you don’t get a cluster manager handed to you.

If you build something like this, spend your time on the failure paths and test them for real — pull the plug on the Primary, restart it, pull the plug again — because those are the only paths that matter at 3 a.m. The happy path writes itself; the cluster is only as good as how it behaves when a node disappears.

Constantin Manea

Senior DevOps Engineer · Bucharest

I help EU teams modernize legacy infrastructure and run stateful workloads on Kubernetes — including the gnarly ones with clustered databases. If you’re dealing with Db2 in containers, database HA, or legacy modernization headaches, I take on small consulting engagements.