Every Graceful Shutdown Path You Never Tested
Your web service has a shutdown handler. It flushes buffers, closes connections, writes checkpoints. You tested it once, maybe. In production, it probably runs once a year during a planned deploy. The rest of the time, your service dies from an OOM kill, a node eviction, a power loss, or a deploy that times out and gets SIGKILL.
Crash-only software flips this. There is no graceful shutdown. There is only crash and recover. The same path runs after a SIGKILL, a segfault, or a kernel panic. If your recovery path works, your service is safe. If it doesn’t, you find out immediately, not during a 3 AM outage.
What Crash-Only Actually Means
The term comes from Candea and Fox’s 2003 paper “Crash-Only Software.” Their argument was simple: if components are going to crash anyway, design them so that crashing is the only way to stop and recovery is the only way to start. No warm shutdown. No clean exit. No “please finish your requests first.”
For a web service, this means three things:
- All durable state lives outside the process. Memory is ephemeral by definition. Anything you need after a restart must be in a database, a write-ahead log, or a durable message queue before you acknowledge the work.
- Recovery is the only startup path. The same code runs whether you’re starting fresh or restarting after a crash. There is no special “restore from checkpoint” mode. There is only “read the log and catch up.”
- Requests are atomic or idempotent. A client retries. A partial request leaves no corrupted state. The service doesn’t care whether the previous attempt finished, crashed, or was killed mid-flight.
Why Graceful Shutdown Hides Bugs
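Graceful shutdown gives you a false sense of safety. You believe your service shuts down cleanly because your handler runs. But the handler is a fantasy. Your orchestrator sends SIGKILL when the grace period expires (30 seconds by default on Kubernetes), whether you’re done or not. The Linux OOM killer sends SIGKILL with no warning at all. Kubernetes can evict pods under node pressure with little or no notice. Your data center loses power.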
When you have a shutdown path, you end up with two code paths: the happy one you test, and the crash one that actually runs. They diverge. Bugs hide in the gap. I’ve seen services that “gracefully” flushed a buffer to disk but never fsync’d, so the file was empty after a power loss. The shutdown handler looked correct. It just wasn’t the path that mattered.
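The missing-fsync bug above has a well-known shape of fix. Here is a minimal sketch, assuming a POSIX filesystem; durable_write is a hypothetical helper name, not something from a library. It writes to a temp file, fsyncs it, renames it over the target, then fsyncs the directory so the rename itself survives a power loss.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push it to stable storage
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # The rename did not happen; remove the leftover temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename is durable too (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that nothing here runs at shutdown. The write is durable at the moment it returns, which is the only moment you control.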
Crash-only removes the fantasy. There is only one path. If it’s wrong, you know immediately because your service doesn’t start.
What This Looks Like in Practice
Here’s a minimal crash-only HTTP worker in Python. It accepts jobs over HTTP, processes them, and stores results in SQLite. There is no shutdown handler.
import json
import sqlite3
from http.server import HTTPServer, BaseHTTPRequestHandler

DB_PATH = "/data/results.db"


# Recovery: the ONLY startup path.
def recover():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS results (
            job_id TEXT PRIMARY KEY,
            result TEXT,
            processed_at INTEGER
        )
    """)
    conn.commit()
    return conn


# Every job is identified by a client-generated UUID.
# If we crash mid-processing, the client retries with the same ID.
# INSERT OR REPLACE makes the store idempotent.
def process_job(conn, job_id, payload):
    result = f"processed-{payload['value'] * 2}"
    conn.execute(
        "INSERT OR REPLACE INTO results (job_id, result, processed_at) "
        "VALUES (?, ?, strftime('%s','now'))",
        (job_id, result),
    )
    conn.commit()  # durable before we acknowledge
    return result


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        job = json.loads(body)
        process_job(db_conn, job["id"], job["payload"])
        # Only reply once the result is committed.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")


# Crash-only: we do not register signal handlers for graceful shutdown.
# We connect, we recover, we serve. When we die, we die.
db_conn = recover()
server = HTTPServer(("0.0.0.0", 8080), Handler)
server.serve_forever()
The key details:
- recover() runs on every start. It is not a special case.
- Jobs use client-generated IDs and INSERT OR REPLACE. Retries are safe.
- There is no atexit, no SIGTERM handler, no connection draining. The process can die at any point and restart safely.
Compare this to a typical service with graceful shutdown:
import signal


# The trap: this code makes you feel safe while hiding the real bug.
def graceful_shutdown(signum, frame):
    # What if we're killed here, before the commit?
    db_conn.commit()
    db_conn.close()
    # Note: shutdown() deadlocks if serve_forever() runs in this same thread.
    server.shutdown()


signal.signal(signal.SIGTERM, graceful_shutdown)
If the process receives SIGKILL between db_conn.commit() and db_conn.close(), nothing terrible happens. But if it dies between two writes that should be atomic, you’ve corrupted state. The shutdown handler gives you confidence you haven’t earned.
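The fix for “two writes that should be atomic” lives in the write path, not the shutdown path. A minimal sketch, assuming SQLite and a hypothetical accounts table: both updates sit inside one transaction, so a failure between them rolls back instead of leaving half the work behind. The raised exception here only stands in for a crash. With a real SIGKILL, SQLite discards the uncommitted transaction the next time the database is opened.

```python
import sqlite3


def transfer(conn, src, dst, amount, crash_midway=False):
    # `with conn:` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        if crash_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    transfer(conn, "a", "b", 30, crash_midway=True)
except RuntimeError:
    pass

after_failed_transfer = balances(conn)  # the debit was rolled back
```

No handler is involved. The invariant holds because the database never saw a committed half-transfer.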
The Trade-Offs Nobody Talks About
Crash-only is not free. The cost shows up in three places.
Storage amplification. Every mutation must be durable before you acknowledge it. That means fsyncs, write-ahead logs, or replicated writes. Your latency goes up. A memory-only update that took microseconds now takes milliseconds.
Idempotency is mandatory, not optional. Every operation that changes state must handle retries. This is extra code and extra thinking. A naïve INSERT becomes INSERT ... ON CONFLICT. A file write becomes a temp-file-and-rename dance.
Recovery time is unbounded. If your durable log grows, startup slows down. You need log compaction, snapshots, or chunked replay. Those mechanisms are code you have to write and test. They are also, ironically, paths that must be crash-only themselves.
Crash-only simplifies your failure modes but does not eliminate them. It converts unpredictable shutdown bugs into predictable recovery latency. That is usually a good trade. But it is a trade.
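To make the recovery-latency trade concrete, here is a rough sketch of snapshot-plus-replay, assuming a JSON-lines event log and a JSON snapshot file. The function names and file layout are made up for illustration. Recovery loads the latest snapshot, then replays only the events with a higher sequence number, and treats a torn final line as the point where the last crash happened.

```python
import json
import os


def append_event(log_path, seq, key, value):
    """Append one event and fsync before returning, i.e. before acknowledging."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"seq": seq, "key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())


def write_snapshot(snap_path, state, last_seq):
    """Write the snapshot to a temp file, fsync it, then rename it into place."""
    tmp = snap_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"last_seq": last_seq, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, snap_path)


def recover_state(snap_path, log_path):
    """The only startup path: works for a fresh start and for a restart."""
    state, last_seq = {}, 0
    if os.path.exists(snap_path):
        with open(snap_path, encoding="utf-8") as f:
            snap = json.load(f)
        state, last_seq = snap["state"], snap["last_seq"]
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write from a crash: stop replaying here
                if event["seq"] > last_seq:
                    state[event["key"]] = event["value"]
                    last_seq = event["seq"]
    return state, last_seq
```

This sketch still reads the whole log and merely skips old events. Truncating the log after a snapshot is what actually bounds startup time, and that step is left out here because it needs the same crash-safety care as everything else.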
How to Move a Service Toward Crash-Only
You do not have to rebuild everything. You can shift incrementally.
Audit your shutdown handlers. If you have a SIGTERM handler, ask: what happens if it doesn’t run? If the answer is data loss or corruption, that is your real bug. Fix the recovery path, not the shutdown handler.
Make your state machine explicit. Write down every in-memory structure that would be lost on crash. For each one, decide: reconstruct from log, reload from database, or accept the loss. “Accept the loss” is a valid answer for caches and metrics.
Use idempotency keys. Every mutating endpoint should accept a client-generated idempotency key. The server stores (key, result) and returns the stored result on retry. Stripe’s API popularized this pattern with its Idempotency-Key header, and middleware for it exists for several web frameworks.
Test the crash path, not the shutdown path. In your integration tests, send SIGKILL to your service mid-request. Restart it. Assert that the system is consistent. If you only test graceful shutdown, you are testing fiction.
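A kill test can be small. The sketch below is one way to do it, assuming SQLite and a throwaway worker script embedded as a string. The debits and credits tables are hypothetical stand-ins for any pair of writes that must stay in step. The parent starts the worker, waits until it reports ready, kills it mid-write, and then checks the invariant on the file the worker left behind.

```python
import sqlite3
import subprocess
import sys
import textwrap
import time

# Worker: commits one debit and one credit per transaction, forever.
WORKER = textwrap.dedent("""
    import sqlite3, sys
    conn = sqlite3.connect(sys.argv[1])
    conn.execute("CREATE TABLE IF NOT EXISTS debits (n INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS credits (n INTEGER PRIMARY KEY)")
    conn.commit()
    print("ready", flush=True)
    n = conn.execute("SELECT COALESCE(MAX(n), 0) FROM debits").fetchone()[0]
    while True:
        n += 1
        with conn:  # both inserts commit together or not at all
            conn.execute("INSERT INTO debits VALUES (?)", (n,))
            conn.execute("INSERT INTO credits VALUES (?)", (n,))
""")


def kill_mid_write(db_path, run_seconds=0.2):
    """Run the worker, kill it without warning, return (debits, credits) counts."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER, db_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the worker has created its tables
    time.sleep(run_seconds)
    proc.kill()             # SIGKILL on POSIX: no handler gets a chance to run
    proc.wait()
    proc.stdout.close()
    conn = sqlite3.connect(db_path)
    debits = conn.execute("SELECT COUNT(*) FROM debits").fetchone()[0]
    credits = conn.execute("SELECT COUNT(*) FROM credits").fetchone()[0]
    conn.close()
    return debits, credits
```

Calling kill_mid_write twice on the same path also exercises the restart: the second worker runs the same startup code against whatever the first one left behind. The assertion you care about is that the two counts are equal after every kill.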
When Crash-Only Is Overkill
Not every process needs this. A static file server can die and restart with no recovery logic at all. A one-off CLI tool doesn’t need idempotency keys. If your service is stateless and every request is self-contained, you already are crash-only. Don’t add complexity for the aesthetic.
The goal is not purity. The goal is having one tested path instead of one tested path and one imaginary path.
FAQ
Does crash-only mean I ignore SIGTERM?
No. You can still exit on SIGTERM. Just don’t do non-trivial work in the handler. Close a socket if you want, but don’t flush state you haven’t already made durable.
What about connection draining?
Load balancers need to stop sending traffic before a pod dies. That happens at the infrastructure layer, not in your process. Keep the drain short. The Kubernetes termination grace period defaults to 30 seconds. After that, you get SIGKILL anyway.
Does this apply to databases?
Databases are the original crash-only systems. PostgreSQL’s WAL, SQLite’s rollback journal, and MySQL’s redo log all assume the process can die at any point. The recovery code is the startup code. Database engineers have known this for decades.
What about in-progress HTTP requests?
Clients should retry on failure. If your endpoint is idempotent, retries are safe. If it is not idempotent, crash-only won’t save you. The fix is idempotency, not a longer shutdown timeout.
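For reference, a minimal sketch of what “the fix is idempotency” can look like on the server, assuming SQLite. open_store, handle_once, and the idempotency table are hypothetical names for illustration. The stored result and the work’s own writes commit in one transaction, so a crash leaves either both or neither, and a retry with the same key replays the stored result instead of redoing the work.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idempotency ("
        "idem_key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def handle_once(conn, idem_key, payload, work):
    """Run work(conn, payload) at most once per idempotency key."""
    row = conn.execute(
        "SELECT result FROM idempotency WHERE idem_key = ?", (idem_key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # retry: replay the stored result
    # The work's writes and the key record commit together or not at all.
    with conn:
        result = work(conn, payload)
        conn.execute(
            "INSERT INTO idempotency (idem_key, result) VALUES (?, ?)",
            (idem_key, json.dumps(result)),
        )
    return result
```

This only covers side effects that live in the same database. Anything that reaches outside it, such as sending an email or calling another service, needs its own idempotency story.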
Start by Deleting Your Shutdown Handler
The fastest way to find your hidden crash bugs is to remove the fiction. Comment out your SIGTERM handler. Run your integration tests. Send SIGKILL mid-request. See what breaks. Fix those things. That is your real system. Everything else is a comfort blanket that hides the holes.
Crash-only does not make failures disappear. It makes them boring. And boring failures are the ones you can sleep through.