# The diagram said the app talked to three things. The wire said eleven.

Source: https://oomkilled.com/blog/map-the-wire-before-you-move/
Published: 2026-06-23
Tags: migration, observability, ebpf, networking, platform-engineering

The architecture diagram is a story the team tells itself about the network. The network does not read the diagram.

We learned this the slow way, on a different migration than the [GitLab one](/blog/700-repos-off-gitlab/) — same year, same instinct to move something heavy over a weekend. A batch of apps was leaving one cluster for another, and the plan, on paper, looked clean. Each service had a little box with two or three arrows coming out of it: a database, a cache, maybe a message queue. Move the boxes, redraw the arrows, cut over. Done by Sunday.

Except the arrows were a guess. Somebody drew them eighteen months ago, and the apps had been quietly making friends ever since.

> Every silent dependency you don't find before a cutover, you find *during* one — usually at 2am, usually with a customer on the phone.

So before we moved anything, we did the unglamorous thing. We watched.

## You can't migrate what you can't see leaving

The question we needed answered wasn't "what should this app talk to." It was "what is this app, right now, actually opening connections to." Those are different questions and the distance between them is where outages live.

The docs couldn't answer it. The diagram couldn't answer it. The developers — bless them — answered with confidence and were wrong about a third of it, not because they were careless but because the people who wrote those integrations had left, and the code that made the calls was three dependencies deep in a library nobody reads.

The only honest source of truth is the egress path itself. So we instrumented it.

We didn't reach for a service mesh. Standing up Istio to answer a question is like buying a restaurant to find out if you like the soup. We didn't want sidecars on everything and we didn't have the weekend to babysit a mesh rollout on top of the migration we were already doing.

We went lower. **eBPF**, watching `connect()` and `accept()` at the kernel, on every node, with zero changes to the apps:

```bash
# Cilium was already on the cluster, so Hubble was a flag away:
# every L3/L4 flow, by pod, by destination, no app changes.
# jsonpb emits one JSON flow per line; jq picks out the fields.
hubble observe --since 24h --type trace \
  --namespace payments \
  -o jsonpb \
  | jq -r '.flow | select(.l4.TCP != null) | "\(.destination.identity) \(.IP.destination) \(.l4.TCP.destination_port)"' \
  | sort | uniq -c | sort -rn
```

On the nodes that weren't running Cilium, we got the same picture from the kernel directly, using a small BPF program on the connect tracepoint, or just `conntrack` and flow logs when we needed something dumber and faster:

```bash
# The lazy cross-check: who is this pod actually connected to, right now.
# For TCP entries, field 6 is dst= and field 8 is dport= (original direction).
conntrack -L -p tcp | awk '{print $6, $8}' | sort | uniq -c | sort -rn
```
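
Column positions in `conntrack -L` output shift with protocol and connection state, so pulling values by key name tends to be sturdier than counting fields. A minimal sketch, assuming the usual `key=value` layout where the first `dst=` and `dport=` on a line describe the original direction; `conntrack_destinations` is a hypothetical helper name:

```bash
# Hypothetical helper: tally "dst:dport" pairs from `conntrack -L` text on stdin.
# Assumes key=value fields; the first dst=/dport= pair on a line is the
# original (outbound) direction, the second pair is the reply direction.
conntrack_destinations() {
  awk '
    {
      dst = ""; dport = ""
      for (i = 1; i <= NF; i++) {
        if (dst == "" && $i ~ /^dst=/) dst = substr($i, 5)
        if (dport == "" && $i ~ /^dport=/) dport = substr($i, 7)
      }
      if (dst != "" && dport != "") count[dst ":" dport]++
    }
    END { for (key in count) printf "%d %s\n", count[key], key }
  ' | sort -k1,1nr -k2,2
}
```

Something like `conntrack -L -p tcp 2>/dev/null | conntrack_destinations` would then print one line per destination, busiest first.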

Then we let it run. Not for an hour, but for a **full week**, so we'd catch the things that only happen on a schedule. The nightly batch. The Sunday reconciliation job. And the monthly invoice run that talks to a payment gateway exactly once every thirty days: a week only catches that one if the window is timed to include the run, and anything shorter has almost no chance.
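
One way to make the schedule visible is to count, per destination, how many distinct days it showed up on across the capture. A minimal sketch, assuming the flows have been exported as lines of `epoch_seconds dst_ip dst_port` (a format chosen here for illustration); `rare_destinations` is a hypothetical helper:

```bash
# Hypothetical helper: read "epoch_seconds dst_ip dst_port" lines on stdin and
# print the destinations seen on at most MAX_DAYS distinct UTC days (default 1).
# Usage: rare_destinations [MAX_DAYS] < flows.txt
rare_destinations() {
  awk -v max_days="${1:-1}" '
    {
      key = $2 ":" $3
      day = int($1 / 86400)            # UTC day number since the epoch
      hits[key]++
      if (!((key, day) in seen)) { seen[key, day] = 1; days[key]++ }
    }
    END {
      for (key in hits)
        if (days[key] <= max_days)
          printf "%s days=%d hits=%d\n", key, days[key], hits[key]
    }
  ' | sort
}
```

A destination that appears on one day out of seven is a candidate for "runs on a schedule", which is the kind of dependency a short capture misses.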

## What the wire said that the diagram didn't

The diagram, for a representative service, claimed three dependencies: its database, Redis, and an internal auth service.

The wire said eleven. Here's the gap, and the gap is the whole article:

- **A hardcoded IP to a host that no longer existed.** The app retried it on every startup, swallowed the timeout, and moved on. Nobody noticed because nobody was looking — but that IP was about to belong to *something else* in the new network, and a silent connection to a dead host becomes a noisy connection to the wrong host the moment the address gets recycled.
- **A second database nobody mentioned.** A reporting read-replica, wired in years ago for one dashboard, still being hit a few hundred times an hour. Not on the diagram. Not in the runbook. Would have 500'd a director's morning dashboard on Monday and we'd have spent the cutover chasing a symptom three hops from the cause.
- **A call straight out to the public internet.** A vendor SDK phoning a SaaS API over the open internet, from a service everyone *swore* only talked to internal systems. The new cluster's egress policy was going to be stricter. That call would have been dropped on the floor, silently, and surfaced as a vague "feature X is slow sometimes" ticket weeks later.
- **The Data team. Again.** Of course it was the Data team. A job reaching into **GCP** — the same external gravity well that bit us in the GitLab move. Different pipeline, same lesson: the data factory always has one more rope tied to something outside the building than anyone remembers.
- **A cache that was really two caches.** The app talked to the Redis on the diagram *and* a second Redis it had failed over to during an incident months ago and never failed back from. It was running on the fallback. The "primary" on the diagram was decorative.

None of these were exotic. None required a genius to find. They required *looking at the actual traffic instead of the drawing of the traffic* — and being patient enough to watch for a week instead of an afternoon.
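
The comparison between the drawing and the traffic is a set difference in both directions. A minimal sketch, assuming the diagram's claims and the observed flows have each been written out as one `host:port` per line; the function and file names are illustrative:

```bash
# undocumented_deps DOCUMENTED OBSERVED
# Prints destinations that appear in OBSERVED but not in DOCUMENTED.
undocumented_deps() {
  grep -Fxv -f "$1" "$2" | sort -u
}

# stale_deps DOCUMENTED OBSERVED
# Prints destinations the diagram claims but the wire never showed.
stale_deps() {
  grep -Fxv -f "$2" "$1" | sort -u
}
```

Run as, say, `undocumented_deps diagram.txt observed.txt`. The first list is what has to exist in the new network; the second is a candidate list for deletion rather than migration.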

## The dependency map, but earned

By the end of the week we didn't have the diagram's version of the dependency map. We had the real one, built from flows, not from memory. Every destination an app opened a connection to, how often, on what port, and critically, *on what schedule*.

That artifact did three things:

1. **It made the cutover boring.** We pre-created every egress rule, every firewall hole, every DNS entry the apps actually needed, including the undocumented ones (eight of them for that representative service), before we moved a single pod. Nothing discovered a missing path at runtime, because we'd already found them all at watch-time.
2. **It killed dead weight.** The connection to the host that didn't exist? Deleted, not migrated. The cache failover nobody reverted? Reverted. You don't carry a corpse to the new house.
3. **It became an early-warning signal.** We kept the egress watch running *after* the move. A new destination showing up in the flows that wasn't in the map meant either a legitimate new integration nobody told the platform team about, or a problem. Either way we wanted to know on the day it started, not the day it broke.
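
Pre-creating the egress rules can be generated from the map rather than typed by hand. A minimal sketch, assuming the map is available as `ip port` lines, that destinations are IPv4, and that plain Kubernetes `NetworkPolicy` objects are the enforcement mechanism (an assumption for illustration; firewall rules or Cilium-specific policies would need their own generator). `egress_policy` is a hypothetical helper:

```bash
# Hypothetical helper: turn "ip port" lines on stdin into one Kubernetes
# NetworkPolicy allowing TCP egress to exactly those destinations.
# Usage: egress_policy NAME NAMESPACE < map.txt
egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: $1
  namespace: $2
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
EOF
  # One rule per unique destination; blank or partial lines are skipped.
  sort -u | while read -r ip port; do
    [ -n "$ip" ] && [ -n "$port" ] || continue
    cat <<EOF
    - to:
        - ipBlock:
            cidr: ${ip}/32
      ports:
        - protocol: TCP
          port: ${port}
EOF
  done
}
```

The empty `podSelector` applies the policy to every pod in the namespace, so the output is worth reviewing before it is applied. This sketch only covers TCP; UDP traffic such as DNS would need its own rule.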

> A dependency map you generate once is documentation. A dependency map you keep generating is monitoring.
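
Turning the map into monitoring mostly means comparing fresh flows against it on a schedule. A minimal sketch, assuming the map is a file with one `ip:port` per line and current flows arrive on stdin in the same shape; both function names are hypothetical:

```bash
# Hypothetical helper: print each destination on stdin that is absent from
# the baseline map file, once, in first-seen order.
new_destinations() {
  awk -v baseline="$1" '
    BEGIN { while ((getline line < baseline) > 0) known[line] = 1 }
    !($0 in known) && !reported[$0]++
  '
}

# Exit non-zero (and say why) when anything unmapped shows up,
# so a cron job or CI step can alert on it.
check_drift() {
  new_destinations "$1" | awk '
    { print "unmapped destination: " $0; found = 1 }
    END { exit (found ? 1 : 0) }
  '
}
```

Whether an unmapped destination is a new integration or a problem is still a human call; the script only makes sure someone is asked on the day it first appears.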

## What I actually took away

- **Instrument egress *before* the migration, not during the incident.** The cheapest time to discover that an app talks to a dead host is a week before you move it. The most expensive time is while it's down.
- **Watch for a full cycle, not a moment.** An hour of capture finds the chatty dependencies. A week finds the dangerous ones: the nightly batches, the weekly jobs, and the monthly runs if the window is timed to include them. The rare call is the one that takes the cutover down, precisely because it's rare enough that nobody remembers it.
- **Reach for the lightest tool that sees the wire.** If you already run Cilium, Hubble is a flag. If you don't, the kernel will still tell you everything via `conntrack`, flow logs, or a few lines of eBPF. You do not need a service mesh to answer "what is this thing connected to." Don't buy the restaurant to taste the soup.
- **The diagram is a hypothesis. The flow log is the evidence.** Trust the second one. When they disagree — and they will — the network is right and the drawing is old.
- **Keep the watch running.** The map's real value isn't the migration. It's that an unexpected new arrow in the flows, the week after, is the next thing to break, announcing itself early.

We left that migration with fewer surprises than the GitLab one, and the difference was entirely the week we spent watching before we touched anything. The diagram said three. The wire said eleven. We moved the eleven, and Monday was quiet.

Quiet is the whole job. Nobody gives you a medal for a migration nobody noticed — which is exactly how you know you did it right.
