Your Unit Tests Pass. Your Production Code Is Still Broken.

You have 90% code coverage and still got paged at 2 a.m.

The unit tests passed. CI was green. The bug made it to production anyway. Coverage didn’t lie, but it didn’t tell the whole truth either. It measured which lines executed, not which behaviors were actually verified.

Most teams figure this out the hard way. They write hundreds of unit tests, watch the coverage badge turn green, and assume the fortress is secure. The fortress has walls. It just doesn’t have a roof.

Unit Tests Only Test What You Imagine Can Go Wrong

Unit tests validate your assumptions about your code. The problem is that bugs don’t care about your assumptions.

Consider a simple pricing function:

def calculate_total(items, tax_rate):
    subtotal = sum(item["price"] * item["quantity"] for item in items)
    tax = subtotal * tax_rate
    return round(subtotal + tax, 2)

A typical unit test suite looks solid:

def test_calculate_total_with_tax():
    items = [{"price": 10.00, "quantity": 2}]
    assert calculate_total(items, 0.08) == 21.60

def test_calculate_total_empty_cart():
    assert calculate_total([], 0.08) == 0.00

Both pass. Coverage is 100%. The function ships.

Then a customer in Japan checks out with three items priced at ¥100, ¥100, and ¥100. The tax rate is 0.10. The expected total is ¥330. The function returns ¥330.00. Fine.

A customer in Switzerland buys one item at CHF 12.35 with 7.7% VAT. Expected: CHF 13.30. Actual: CHF 13.30. Still fine.

Then a customer buys two items at $0.01 each in Oregon, where the tax rate is 0.0. Expected: $0.02. Actual: $0.02. Pass.

The bug shows up when the tax service returns null for an unrecognized zip code and the caller passes that through as a tax rate of None. The function evaluates subtotal * None and raises a TypeError. Your unit tests never passed None as a tax rate because you assumed it would always be a float.

This is the fundamental limitation. Unit tests exercise the paths you thought to test. Bugs live in the paths you didn’t.
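The failure takes a few lines to reproduce. The sketch below repeats the pricing function and adds one possible hardening, calculate_total_strict. That wrapper and its error message are assumptions made for illustration, not part of the original code. It rejects a missing rate up front instead of failing inside the arithmetic.

```python
def calculate_total(items, tax_rate):
    subtotal = sum(item["price"] * item["quantity"] for item in items)
    tax = subtotal * tax_rate
    return round(subtotal + tax, 2)


def calculate_total_strict(items, tax_rate):
    # Hypothetical guard: name the real problem instead of leaking a TypeError.
    if tax_rate is None:
        raise ValueError("tax_rate is missing; the tax lookup returned no rate")
    return calculate_total(items, tax_rate)


items = [{"price": 10.00, "quantity": 2}]

# The original function fails inside the multiplication.
try:
    calculate_total(items, None)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# The guarded version fails at the boundary with an actionable message.
try:
    calculate_total_strict(items, None)
    strict_message = None
except ValueError as exc:
    strict_message = str(exc)
```

Whether a missing rate should raise, default to zero, or block checkout is a product decision. The point is that someone has to make it on purpose.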

The Four Places Unit Tests Can’t Reach

Integration Boundaries

Unit tests replace external dependencies with mocks. Mocks are polite. They do exactly what you tell them. Real APIs are not polite.

Your mock database returns rows in milliseconds. Production returns them in seconds, or times out, or returns duplicate rows because of a read replica lag you didn’t know existed.

Your mock HTTP client returns clean JSON. The real service returns a 200 with an empty body on Tuesdays.

Mocks test your code against your assumptions about other systems. Production tests your code against reality. These are different test suites with different pass rates.
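Here is a minimal sketch of that gap, using unittest.mock from the standard library. The fetch_tax_rate function, the /tax-rates/ path, and the response shape are all hypothetical. The same code passes against the polite mock and breaks against a stand-in for the 200-with-empty-body response.

```python
import json
from unittest.mock import Mock


def fetch_tax_rate(http_client, zip_code):
    # Hypothetical consumer code: assumes a 200 always carries valid JSON.
    response = http_client.get(f"/tax-rates/{zip_code}")
    return json.loads(response.text)["rate"]


# The polite mock: returns exactly what the test author imagined.
polite = Mock()
polite.get.return_value = Mock(status_code=200, text='{"rate": 0.08}')
rate = fetch_tax_rate(polite, "97201")

# A stand-in for reality: status 200, empty body.
rude = Mock()
rude.get.return_value = Mock(status_code=200, text="")
try:
    fetch_tax_rate(rude, "97201")
    failure = None
except json.JSONDecodeError as exc:
    failure = type(exc).__name__
```

The second case is only in the suite if someone already knew the service could do that, which is the whole problem.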

Stateful and Temporal Bugs

Unit tests run in isolation. Each test gets a fresh state. Production is a long-running process where state accumulates, leaks, and interacts with itself.

A memory cache that evicts entries under load. A connection pool that exhausts itself after 10,000 requests. A timestamp comparison that fails when a job runs across a daylight saving boundary. These bugs require time, volume, or sequence to manifest. Unit tests have none of these.
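A toy version of the pool exhaustion case shows why a single isolated call can’t see it. TinyPool and handle_request are hypothetical, written only for this sketch. The error path leaks one connection per call, which is invisible on a fresh pool and fatal after enough requests.

```python
class TinyPool:
    """Hypothetical fixed-size connection pool, for illustration only."""

    def __init__(self, size):
        self.available = size

    def acquire(self):
        if self.available == 0:
            raise RuntimeError("pool exhausted")
        self.available -= 1

    def release(self):
        self.available += 1


def handle_request(pool, payload):
    pool.acquire()
    if payload is None:
        # Bug: the early return skips release(), leaking one connection.
        return "bad request"
    pool.release()
    return "ok"


# In isolation, on a fresh pool, the error path looks correct.
single_call = handle_request(TinyPool(size=3), None)

# In sequence, on a shared pool, the leak accumulates until the pool is empty.
shared_pool = TinyPool(size=3)
results = []
for _ in range(5):
    try:
        results.append(handle_request(shared_pool, None))
    except RuntimeError as exc:
        results.append(str(exc))
```

The fourth and fifth calls fail even though every individual call was "tested". A test that drives many requests through one long-lived pool is what exposes it.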

Concurrency and Race Conditions

Two users update the same record simultaneously. One request reads a balance, a second debits it, and the first then writes back a value computed from its stale read. The debit is overwritten. Money disappears. Your unit tests run sequentially in a single thread, so they never produce this interleaving.

You can write unit tests for individual locking primitives. You cannot write a unit test that proves your entire system is race-free. The state space is too large and the timing too non-deterministic.
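You can at least reproduce one unlucky interleaving on purpose. This sketch uses a threading.Barrier to force both threads to read before either writes. The Account class and both debit functions are hypothetical. It demonstrates the lost update and one fix; it does not prove anything about a real system.

```python
import threading


class Account:
    """Hypothetical in-memory account, for illustration only."""

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()


def unsafe_debit(account, amount, both_have_read):
    current = account.balance            # read
    both_have_read.wait()                # force the unlucky interleaving
    account.balance = current - amount   # write based on a stale read


def safe_debit(account, amount):
    # The lock makes the read-modify-write a single critical section.
    with account.lock:
        account.balance -= amount


def run_twice_concurrently(target, *args):
    threads = [threading.Thread(target=target, args=args) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


# Two debits of 30 from 100 should leave 40.
racy = Account(100)
run_twice_concurrently(unsafe_debit, racy, 30, threading.Barrier(2))

locked = Account(100)
run_twice_concurrently(safe_debit, locked, 30)
```

The racy account ends at 70 because one debit overwrote the other. In production there is no barrier, so the same interleaving happens rarely and unpredictably, which is why it survives ordinary test runs.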

The Environment Itself

Your tests run on Ubuntu 22.04 with Python 3.11, 4 GB of RAM, and no firewall rules. Production runs on Alpine Linux with Python 3.11, 512 MB of RAM, and a security group that drops idle TCP connections after 60 seconds.

Alpine ships musl instead of glibc, so name resolution underneath the socket module can behave differently. The mmap limits are lower. The locale settings cause strftime to format dates in ways your parser doesn’t expect. These aren’t code bugs. They’re context bugs. Unit tests have no context.

Why Coverage Percentage Misleads

Coverage tools measure line execution, not assertion quality. A test can execute every line of a function and verify nothing meaningful.

def test_poor_coverage_quality():
    result = calculate_total([{"price": 1.0, "quantity": 1}], 0.0)
    # Executed 100% of lines. Verified almost nothing.
    assert result is not None

This test gives you 100% line coverage and zero confidence. Many teams optimize for the metric because it’s easy to measure. Confidence is hard to measure. So they measure coverage instead and hope the two correlate.

They don’t.
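One way to see the gap is to break the function on purpose and check which tests notice. It is the same idea that mutation testing tools automate. In this sketch buggy_calculate_total is a deliberately wrong variant invented for the example, and the two check functions stand in for a weak test and a specific one.

```python
def calculate_total(items, tax_rate):
    subtotal = sum(item["price"] * item["quantity"] for item in items)
    tax = subtotal * tax_rate
    return round(subtotal + tax, 2)


def buggy_calculate_total(items, tax_rate):
    subtotal = sum(item["price"] * item["quantity"] for item in items)
    tax = subtotal * tax_rate
    return round(subtotal - tax, 2)  # deliberate bug: subtracts the tax


def weak_check(fn):
    # Same shape as the low-value test: runs every line, asserts almost nothing.
    return fn([{"price": 1.0, "quantity": 1}], 0.0) is not None


def strong_check(fn):
    # Asserts the exact total for an input where tax actually matters.
    return fn([{"price": 10.00, "quantity": 2}], 0.08) == 21.60


weak_passes_on_bug = weak_check(buggy_calculate_total)
strong_passes_on_bug = strong_check(buggy_calculate_total)
strong_passes_on_correct = strong_check(calculate_total)
```

Both checks produce identical line coverage. Only one of them fails when the code is wrong.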

What to Test Instead (Or In Addition)

This isn’t an argument against unit tests. Unit tests are fast, deterministic, and excellent for verifying algorithmic logic. They’re just incomplete.

Here’s what fills the gaps without turning your CI pipeline into a 45-minute liability.

Test at System Boundaries, Not Just Internals

Instead of mocking the database, write tests that hit a real test database. These are slower, so run them selectively. But they catch the mismatch between your ORM queries and the query planner’s actual behavior.

Instead of mocking the HTTP client, spin up the downstream service in a container. This catches schema drift, timeout behavior, and retry logic that only triggers on actual connection failures.
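As a rough sketch of the database case, the example below uses an in-memory SQLite database as a stand-in for a real test database. The orders table and save_order are hypothetical, and in practice the test database should be the same engine you run in production. A real engine enforces the constraint a mock would have happily ignored.

```python
import sqlite3


def save_order(conn, order_id, total):
    # Hypothetical data-access function under test.
    conn.execute(
        "INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total)
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL NOT NULL)")

save_order(conn, "A-1", 21.60)

# A client retry sends the same order again. A mocked connection would
# accept it silently; the real engine enforces the primary key.
try:
    save_order(conn, "A-1", 21.60)
    duplicate_error = None
except sqlite3.IntegrityError as exc:
    duplicate_error = type(exc).__name__

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The test now forces a decision about retries: ignore the duplicate, upsert, or surface the error.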

Add Contract Tests for External Services

If you can’t run the real dependency in CI, use contract tests. These verify that your consumer expectations match the provider’s actual API schema.

Tools like Pact record the interactions between your service and its dependencies. If the provider changes a field type or drops an endpoint, the contract test fails before the code deploys. It’s not as good as integration testing, but it’s much better than hoping your mocks are accurate.
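The core idea fits in a few lines. To be clear, this is not Pact’s API. It is a hand-rolled sketch with a hypothetical contract format and field names, showing what it means to check a provider response against the fields and types a consumer depends on.

```python
# What the consumer relies on: field name -> expected Python type.
CONSUMER_CONTRACT = {"rate": float, "jurisdiction": str}


def contract_violations(contract, provider_response):
    """Return a list of human-readable mismatches (empty means compatible)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in provider_response:
            problems.append(f"missing field: {field}")
        elif not isinstance(provider_response[field], expected_type):
            actual = type(provider_response[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


# Extra fields are fine: the consumer only cares about what it reads.
compatible = contract_violations(
    CONSUMER_CONTRACT, {"rate": 0.08, "jurisdiction": "CA", "source": "cache"}
)

# The provider changed a type and dropped a field.
drifted = contract_violations(CONSUMER_CONTRACT, {"rate": "0.08"})
```

A real contract testing tool adds the parts that matter at scale: sharing contracts between repositories and verifying them against the provider’s actual build.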

Use Property-Based Testing for Edge Cases

Property-based testing tools like Hypothesis (Python) or fast-check (JavaScript) generate hundreds or thousands of randomized inputs (the count is configurable) and verify that your invariants hold for each one.

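import pytest
from hypothesis import given, strategies as st

# calculate_total is the pricing function defined earlier in this article.

@given(
    # Carts: any number of line items with two-decimal prices.
    st.lists(st.fixed_dictionaries({
        "price": st.decimals(min_value=0, max_value=10000, places=2),
        "quantity": st.integers(min_value=0, max_value=1000)
    })),
    # Tax rate: a missing rate (None) is part of the declared input space.
    st.one_of(st.none(), st.decimals(min_value=0, max_value=1, places=4))
)
def test_calculate_total_invariants(items, tax_rate):
    if tax_rate is None:
        # Pins down the current behavior for a missing rate.
        with pytest.raises(TypeError):
            calculate_total(items, tax_rate)
        return

    result = calculate_total(items, tax_rate)
    assert result >= 0                  # totals are never negative
    assert result == round(result, 2)   # totals have at most two decimal places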

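Declaring the input space up front is what surfaces the None tax rate: once the strategy admits that a rate can be missing, you have to decide what the function should do about it. As written, the test pins down the TypeError as the current behavior. If a crash at checkout is not acceptable, change the assertion and the function together. Beyond that, Hypothesis generates inputs you’d never consider: empty lists, giant lists, zero prices, maximum precision decimals. It finds the edges of your logic without requiring you to imagine them first.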

Monitor Production Like It’s a Test Environment

The most honest test suite is production traffic. If you can’t catch a bug before it ships, catch it before it hurts.

Use feature flags to roll out changes to 1% of users first. Watch error rates, latency percentiles, and business metrics. A unit test tells you if the code behaves as expected in isolation. A production monitor tells you if the code behaves as expected in reality.
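A percentage rollout can be as simple as hashing the user into a stable bucket. The sketch below is one common approach, not any particular feature flag product’s implementation. The function name, the flag name, and the 10,000-bucket granularity are assumptions.

```python
import hashlib


def in_rollout(user_id, flag_name, percent):
    """Return True if this user falls inside the first `percent` of buckets.

    Hashing flag_name together with user_id keeps a user's bucket stable
    for one flag while staying independent across different flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < percent * 100


# Roll the hypothetical "new-checkout" flag out to 1% of users.
enabled = sum(
    in_rollout(f"user-{i}", "new-checkout", 1) for i in range(10000)
)
```

Because the bucket depends only on the user and the flag, the same user sees the same behavior on every request, and raising the percentage only ever adds users.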

Set up alerts on anomalies, not just hard failures. A 5% increase in 500 errors after a deploy is often the only signal that a race condition or resource leak has started. Unit tests will never show you this.
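An anomaly check of that kind compares the error rate after a deploy with the rate before it. This sketch reads "5% increase" as a relative increase in error rate. That reading, the function names, and the minimum sample size are assumptions; a real alerting system would also account for normal variance.

```python
def error_rate(errors, requests):
    return errors / requests if requests else 0.0


def should_alert(baseline, current, max_relative_increase=0.05, min_requests=1000):
    """baseline and current are (error_count, request_count) tuples."""
    base_errors, base_requests = baseline
    cur_errors, cur_requests = current

    # Too little traffic since the deploy to say anything meaningful.
    if cur_requests < min_requests:
        return False

    base_rate = error_rate(base_errors, base_requests)
    cur_rate = error_rate(cur_errors, cur_requests)

    # No errors before the deploy: any error afterwards is a change.
    if base_rate == 0.0:
        return cur_rate > 0.0

    return (cur_rate - base_rate) / base_rate > max_relative_increase


before_deploy = (200, 100_000)   # 0.20% of requests failed
after_deploy = (230, 100_000)    # 0.23% of requests failed

alert = should_alert(before_deploy, after_deploy)
```

Nothing here is a hard failure. Both windows are overwhelmingly successful, and only the comparison reveals that something changed.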

The Honest Trade-off

Unit tests are cheap, fast, and good for developer feedback loops. Integration tests are expensive, slow, and good for catching the bugs that matter.

You need both. The trap is thinking that 100% unit test coverage means you can skip the rest. It means you’ve tested the easy parts thoroughly. The hard parts, the ones that wake you up at night, live where your tests aren’t looking.

Start with unit tests for logic and algorithms. Add integration tests at every system boundary. Use property-based testing to find the inputs you didn’t think of. Monitor production to catch what every test missed.

Coverage is a vanity metric. The only metric that matters is whether you sleep through the night.

If you’re trying to catch the bugs your tests miss, start by looking at your error data. Sentry shows you what breaks in production, with the stack traces and context your unit tests never had.