OOMKILLED EXIT 137

Dispatch No. 003

initdb ate two of our three Postgres replicas. The one we scaled away saved us.

SEV-1 · POSTGRES / KUBERNETES / STATEFULSET · 6 min on the line

The pods came back healthy. That was the worst part.

We’d done a routine rolling restart of a Postgres StatefulSet — config change, nothing exotic. postgres-0 and postgres-1 terminated, rescheduled, went green. Readiness probes passed. And then the application started throwing errors that made no sense: tables didn’t exist. Not “permission denied”, not “connection refused”. Did not exist.

The database was up. It was just empty. A brand new, freshly initialized cluster, sitting on top of the volumes that thirty minutes ago held production data.

A database can be perfectly healthy and perfectly empty at the same time. Those are not mutually exclusive — and your readiness probe will happily tell you everything is fine.
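A probe that only checks whether Postgres accepts connections can’t tell those two states apart. Here is a minimal sketch of one that can, assuming a sentinel table the application always needs (public.accounts is a made-up name) and assuming psql picks up its connection settings from the usual PG* environment variables. It is an illustration, not what we were running.

```shell
# Readiness sketch: succeed only if a table the application depends on exists.
# SENTINEL_TABLE is a hypothetical name; substitute one of your own tables.
SENTINEL_TABLE="${SENTINEL_TABLE:-public.accounts}"

schema_present() {
  # to_regclass() returns NULL for a missing relation, so this prints t or f.
  # -t: rows only, -A: unaligned output, -c: run a single command.
  result=$(psql -tAc "SELECT to_regclass('${SENTINEL_TABLE}') IS NOT NULL") || return 1
  [ "$result" = "t" ]
}
```

Called from an exec readiness probe, a check like schema_present should keep a healthy-but-empty pod out of rotation. It only fits clusters that are supposed to hold data already; a brand new install would fail it until the schema exists.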

What we were running

A three-node Postgres StatefulSet — postgres-0, -1, -2 — one streaming primary and two replicas, each with its own PersistentVolumeClaim (data-postgres-0, -1, -2) backed by node-local disks. Standard pattern: stable network identity, stable storage, ordered rollout.

A few weeks earlier we’d scaled it from 3 to 2. The third node was overprovisioned headroom we weren’t using, and someone — me — wanted the capacity back. kubectl scale statefulset postgres --replicas=2. postgres-2 drained, the pod disappeared, the replica count dropped. Clean. I didn’t think about it again.

That decision is the only reason this story has a happy ending.

How a restart becomes a reinitialize

We were running the Bitnami postgresql-ha chart — Postgres plus repmgr for replication and failover, fronted by Pgpool. The part that matters lives in the repmgr container’s entrypoint, in a function called postgresql_repmgr_initialize. It runs every time a pod starts. Every time.

On startup the container doesn’t just start Postgres; it works out who it is. It queries the repmgr cluster to find the current primary, then branches. If it decides it’s the primary and the data directory looks uninitialized, it runs a fresh initdb. If it decides it’s a standby, it runs repmgr_clone_primary, which tries pg_rewind and, when that fails, falls back to repmgr standby clone --force, which wraps pg_basebackup. That --force is not subtle about what it does:

NOTICE: -F/--force provided - deleting existing data directory "/bitnami/postgresql/data"

It deletes the data directory before it clones. By design.

Now stack the failure. We restarted the StatefulSet, so the two remaining pods went down close together. When they came back, repmgr couldn’t reach a primary inside its connection timeout — and the fallback path is the one that ends careers:

Can not find new primary ... There are no nodes with primary role. Assuming the primary role

A node that can’t find a primary assumes it is one. So postgres-0 bootstrapped itself as a fresh, empty primary. postgres-1 came up as a standby and tried pg_rewind, which fails out of the box on this image: the config lives in /bitnami/postgresql/conf, not in the data directory, so rewind can’t find postgresql.conf. It fell straight through to the forced clone (repmgr standby clone --force), deleted its own intact data directory, and cloned the empty primary.

Two of our three copies of the database were destroyed by the database’s own startup logic. Replication didn’t save us — it did its job perfectly, and replicated empty.
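The decision we wish the entrypoint had made can be sketched in a few lines. This is not the chart’s code: data_dir_initialized and startup_action are hypothetical names, and the explicit bootstrap opt-in is an assumption about how a first install would be allowed through. The one hard fact it leans on is that an initialized Postgres data directory always contains a PG_VERSION file.

```shell
# Guard sketch (not the chart's code): decide what a starting node may do
# without ever wiping an existing cluster on a guess.

# An initialized Postgres data directory always contains a PG_VERSION file.
data_dir_initialized() {
  [ -f "$1/PG_VERSION" ]
}

# Prints one of: rejoin, clone, initdb, refuse.
#   $1 data directory
#   $2 "yes" if a primary was reachable, anything else otherwise
#   $3 "yes" only when an operator explicitly allows a first-time bootstrap
startup_action() {
  dir=$1
  primary_reachable=$2
  allow_bootstrap=${3:-no}
  if data_dir_initialized "$dir"; then
    if [ "$primary_reachable" = "yes" ]; then
      echo rejoin    # data present, primary known: follow it, keep the data
    else
      echo refuse    # data present, no primary: stop and page a human
    fi
  elif [ "$primary_reachable" = "yes" ]; then
    echo clone       # nothing to lose: a fresh base backup is safe
  elif [ "$allow_bootstrap" = "yes" ]; then
    echo initdb      # empty directory and an explicit opt-in
  else
    echo refuse      # empty directory, no primary, no opt-in: do not guess
  fi
}
```

With a guard shaped like this, both of our pods would have printed refuse: each had data on disk and neither could see a primary. The pods would have stayed down, and the data would have stayed put.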

The copy nobody touched

While I was working out how thoroughly we were finished, someone ran kubectl get pvc and we all stared at the output:

NAME              STATUS   VOLUME      CAPACITY
data-postgres-0   Bound    pvc-a1b2…   200Gi
data-postgres-1   Bound    pvc-c3d4…   200Gi
data-postgres-2   Bound    pvc-e5f6…   200Gi   # ← still here

data-postgres-2 was still there. Still Bound. Untouched since the day we scaled down.

This is the thing a lot of people don’t know until it saves or burns them: scaling a StatefulSet down does not delete the PVCs. Kubernetes removes the pod and leaves the volume exactly where it is. (Since 1.27 there’s persistentVolumeClaimRetentionPolicy to change that, and even then whenScaled defaults to Retain.) We had told Kubernetes to remove the third replica. It quietly kept the third replica’s data, on a disk, on a node, for weeks, and that murderous entrypoint never ran against it, because no pod ever mounted it again. Neither initdb nor the forced clone (repmgr standby clone --force) ever got the chance to touch it.
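Both halves of that can be made deliberate. The sketch below lists claims whose ordinal is at or past the current replica count, then pins the retention policy explicitly. It assumes the default start ordinal of 0 and a cluster version that serves the persistentVolumeClaimRetentionPolicy field. The function names are hypothetical, and if a Helm release manages the StatefulSet you would more likely set the policy through the chart than with a patch.

```shell
# Print the PVC names (one per line on stdin) whose ordinal is at or beyond
# the StatefulSet's current replica count, i.e. claims no pod mounts any more.
# StatefulSet claims are named <claim-template>-<statefulset>-<ordinal>.
orphaned_pvcs() {
  prefix=$1     # e.g. data-postgres
  replicas=$2   # e.g. 2
  while IFS= read -r name; do
    ordinal=${name#"$prefix"-}
    [ "$ordinal" != "$name" ] || continue    # different prefix
    case "$ordinal" in
      ""|*[!0-9]*) continue ;;               # remainder is not an ordinal
    esac
    if [ "$ordinal" -ge "$replicas" ]; then
      echo "$name"
    fi
  done
}

# Pin the retention behaviour on purpose instead of inheriting the default.
pin_pvc_retention() {
  kubectl patch statefulset "$1" --type merge -p \
    '{"spec":{"persistentVolumeClaimRetentionPolicy":{"whenScaled":"Retain","whenDeleted":"Retain"}}}'
}

# Usage against a live cluster (not run here):
#   kubectl get pvc -o name | sed 's|.*/||' | orphaned_pvcs data-postgres 2
#   pin_pvc_retention postgres
```

Run against our cluster on the day, the first helper would have printed data-postgres-2, which is the line we were all staring at.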

The replica I deleted held the only intact copy of the database.

Getting it back

Because the volumes were node-local, the data wasn’t in some abstracted block-storage API — it was a directory on a specific machine. That made recovery manual, and it made recovery possible.

The path:

  1. Find the volume. kubectl get pv pvc-e5f6… gave us the node affinity and the on-disk path — node-local volumes pin to a node and live at a known location on its filesystem.
  2. Get onto that node and confirm the PGDATA was intact and the right major version. It was. The files were exactly as postgres-2 left them.
  3. Bring it up in isolation. We mounted that PVC into a throwaway recovery pod — not part of the StatefulSet, no init script, just a plain Postgres pointed at the recovered data directory. It started clean and the data was all there.
  4. Restore forward. pg_dumpall from the recovered instance, restore into a rebuilt postgres-0 as the new primary, then let postgres-1 re-clone from it via a fresh base backup. Replication healthy, application errors gone.
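Steps 1 and 2 can be sketched as two small helpers. The assumptions are mine: the PV uses the local volume type with a single node-affinity term (some node-local provisioners use hostPath instead, in which case the path sits under .spec.hostPath.path), and the helper names are hypothetical.

```shell
# Print "<node> <path>" for a node-local PersistentVolume.
# Assumes the `local` volume type and one node-affinity match expression.
pv_location() {
  pv=$1
  node=$(kubectl get pv "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}') || return 1
  path=$(kubectl get pv "$pv" -o jsonpath='{.spec.local.path}') || return 1
  # Fail loudly rather than print half an answer.
  [ -n "$node" ] && [ -n "$path" ] || return 1
  echo "$node $path"
}

# Print the major version recorded in a data directory, so it can be
# compared with the Postgres version you are about to start against it.
pgdata_major_version() {
  cat "$1/PG_VERSION"
}
```

On the node itself, pgdata_major_version pointed at the recovered directory is the cheap sanity check before any Postgres binary touches the files.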

Hours, not minutes. But the data came back whole.

What I actually took away

The clean version of this story is “Kubernetes saved us by retaining a PVC.” The honest version is “we survived on luck, and luck is not an architecture.”

  • A restart can be a reinitialize. Any startup logic that picks a node role and then runs initdb or a forced clone (repmgr standby clone --force) based on what it can reach in a few seconds is one network blip away from erasing you. The failure mode you want is “refuse to start and page a human”, never “assume primary and start fresh”.
  • Never restart all the pods at once on repmgr-style HA. Simultaneous restarts are exactly what triggers the “can’t find a primary, I’ll be the primary” fallback. Roll one pod at a time, wait for it to rejoin and catch up, then touch the next. A kubectl rollout restart across the whole StatefulSet is how you get here: it moves on as soon as each pod reports Ready, and Ready says nothing about whether the node has rejoined replication.
  • Replication is not a backup. It replicates your mistakes at the speed of light. An empty primary gives you empty replicas.
  • Backups you have never restored are not backups. We had backups. We didn’t reach for them first because we hadn’t rehearsed a restore and didn’t trust the timing. The thing that saved us was an accident, not the recovery plan we were supposed to have.
  • Know where your data physically lives. Node-local volumes made this recovery hands-on but tractable. With some abstracted storage we might have had a faster path — or no path at all. Either way: knowing exactly where the bytes sit is what let us move fast under pressure.
  • persistentVolumeClaimRetentionPolicy is a real decision now. The retention that saved us is the default today, but it’s configurable. Set it on purpose. Don’t let “what happens to the data when I scale down” be something you discover during an incident.
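The one-pod-at-a-time rule from the list above can be sketched as a loop that gates each step on a replication check as well as on readiness. The check itself is left to the caller, because what “caught up” means depends on your setup; retry and restart_one_by_one are hypothetical names, and the retry counts and timeouts are placeholders.

```shell
# Run a command until it succeeds, up to $1 attempts, pausing between tries.
retry() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    "$@" && return 0
    attempts=$((attempts - 1))
    sleep 5
  done
  return 1
}

# Restart pods one at a time, highest ordinal first.
#   $1 StatefulSet name   $2 replica count
#   $3 caller-supplied check, invoked as: <check> <pod-name>
restart_one_by_one() {
  sts=$1
  replicas=$2
  check=$3
  i=$((replicas - 1))
  while [ "$i" -ge 0 ]; do
    pod="$sts-$i"
    kubectl delete pod "$pod" || return 1
    # The controller recreates the pod; wait until it exists and is Ready.
    retry 60 kubectl wait --for=condition=Ready "pod/$pod" --timeout=30s || return 1
    # Readiness is not enough: gate on replication before the next pod.
    retry 60 "$check" "$pod" || return 1
    i=$((i - 1))
  done
}
```

If the check never passes, the loop stops with the remaining pods untouched, which is the point: a stalled restart is an annoyance, a second pod going down early is an incident. Which ordinal holds the primary at any moment depends on repmgr, so a real script would also want to find it and restart it last.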

We kept the third PVC around by accident and it brought the database back from the dead. I’d rather you keep yours around on purpose — and test the restore before you need it.


If you’re running this chart, the force-clone-on-restart behavior isn’t a secret — it’s all over the issue tracker, and it’s worth reading before it reads you: containers#52213 (standby always full-resyncs on restart), charts#20998 (restarting node ignores the new primary and assumes the role), charts#14044 (data loss on Postgres HA).