OOMKILLED EXIT 137

Dispatch No. 001

Why this blog is called OOMKilled

· NOTE · SRE / CULTURE / MANIFESTO · 2 min on the line

If you have run anything on Kubernetes for more than a week, you have met OOMKilled.

It is the status your pod gets when a container uses more memory than its limit allows, or more than the node has left to give, and the kernel’s OOM killer steps in and ends the process. No graceful shutdown. No goodbye. Exit code 137. A dashboard turning red while you are mid-sentence in a meeting.
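
The 137 is not arbitrary. By the usual shell convention, a process killed by a signal reports 128 plus the signal number, and SIGKILL is signal 9. Here is a minimal sketch of that arithmetic in Python. The function name describe_exit_code is mine, purely for illustration, and it assumes a Unix-like system where SIGKILL is defined.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention.

    Hypothetical helper for illustration; not part of any Kubernetes tooling.
    """
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name for this signal number on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # No named signal with that number here; report the raw number.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```

The same subtraction explains 143, which is what you see when a process is ended by SIGTERM (signal 15) instead. The difference matters: SIGTERM can be caught and handled, SIGKILL cannot, which is why an OOM kill never gets a graceful shutdown.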

I named this blog after it because that moment — the abrupt, slightly humiliating failure that you should have seen coming — is the most honest thing about this job. We spend our days drawing clean architecture diagrams, and production spends its days finding the one box we drew too small.

What this blog is

War stories with the numbers left in. Real incidents, real misconfigurations, real “we set the memory limit to a round number nobody justified and paid for it three months later.” Less vendor brochure, more postmortem.

I freelance across DevOps, cloud, and SRE work, which means I get to see the same mistakes in a lot of different shapes. The companies change; the failure modes rhyme. Those rhymes are what I want to write down.

What it is not

It is not a tutorial farm. There are enough “10 kubectl commands you must know” posts. If a topic is already well covered by the docs, I will link the docs.

It is also not going to pretend systems are tidy. The interesting part is always the gap between the design and what actually shipped.

The deal

Every post here should leave you with one of two things: a failure you can now avoid, or a way of thinking about your systems that survives contact with production. If a post does neither, it was a waste of your scroll, and I would rather not have written it.

Welcome. Mind the memory limits.