Auth code needs 90% mutation coverage. Your string utils don't.

Enforcing a single mutation score across your entire codebase is a great way to make your team hate testing.

Run PIT or Stryker against a typical repo and you’ll see the same pattern: authentication modules score 40%, string utilities hit 95%, and your ORM layer sits somewhere in the 60s. The knee-jerk response is to set a global gate at, say, 70% and block every PR that dips below it. Two sprints later, someone disables the check in CI and blames it on “flaky mutators.”

The real problem isn’t the tools. It’s pretending all code carries the same blast radius when a mutant survives.

What mutation testing actually measures

Code coverage tells you which lines ran. Mutation testing tells you whether your tests would notice if those lines changed.

A mutation framework introduces small faults (mutants) into your source. It might flip a > to a <, remove a method call, or change a return value. If your test suite catches the change, the mutant is killed. If it passes anyway, the mutant survives. Your mutation score is the percentage of mutants killed.
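To make killed versus survived concrete, here is a hand-rolled sketch with no mutation framework involved. The function is_adult, its mutant, and both test suites are hypothetical; a real tool generates the mutants for you.

```python
from typing import Callable, List

Predicate = Callable[[int], bool]


def is_adult(age: int) -> bool:
    """Original implementation."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-written mutant: >= replaced with >."""
    return age > 18


def weak_suite(fn: Predicate) -> bool:
    # Only checks values far from the boundary.
    return fn(30) is True and fn(5) is False


def strong_suite(fn: Predicate) -> bool:
    # Same checks, plus the boundary value itself.
    return weak_suite(fn) and fn(18) is True


def mutation_score(suite: Callable[[Predicate], bool], mutants: List[Predicate]) -> float:
    # A mutant is killed when the suite fails against it.
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return 100.0 * killed / len(mutants)


mutants = [is_adult_mutant]
print(mutation_score(weak_suite, mutants))    # 0.0: the mutant survives
print(mutation_score(strong_suite, mutants))  # 100.0: the boundary test kills it
```

Both suites pass against the original code and both execute every line of it, so line coverage alone cannot tell them apart.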

A surviving mutant in a password hash comparison is a security bug waiting for production. A surviving mutant in a capitalizeFirstLetter helper is, at worst, a slightly weird UI label.

Treating them the same is where teams go wrong.

Why auth deserves a 90%+ gate

Authentication and authorization code has two properties that make it ideal for aggressive mutation testing.

First, the logic is usually discrete and state-machiney. Did the token expire? Is the role in the allowed set? Did the signature verify? Each branch has a clear security implication, and each one should be tested.

Second, the cost of a surviving mutant is catastrophic. A single flipped boolean in a role check can expose admin endpoints. A missed not in a token validation routine can accept forged JWTs. These aren’t theoretical. CVE databases are full of auth bypasses caused by simple logic errors, which is exactly the class of fault mutation testing is designed to surface.

At Sentry, we enforce a 90% mutation score on anything in the authn/ and authz/ modules. Anything below that fails CI. No overrides, no “we’ll fix it in the next sprint.” The modules are small enough that this is achievable without writing 40 lines of test for every line of production code.

Here’s what that looks like in practice. This is a simplified JWT validation routine:

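import time

def verify_token(token: dict, expected_aud: str, leeway: int = 30) -> bool:
    """Return True if the decoded claims match the audience and are unexpired."""
    now = time.time()

    # Reject tokens issued for a different audience.
    if token.get("aud") != expected_aud:
        return False

    # Allow `leeway` seconds of clock skew past expiry.
    # Simplification: a token with no "exp" claim is accepted.
    exp = token.get("exp")
    if exp is not None and now > exp + leeway:
        return False

    return True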

A mutation framework might flip > to >= in the expiry check. Without a test that freezes the clock and uses a token whose exp is exactly now - leeway (so that now == exp + leeway), that mutant survives. That means your tests don’t actually verify the boundary. At a 90% gate, there’s little room to leave that test unwritten.
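A boundary test for this routine might look like the sketch below. The boundary is the instant where now equals exp + leeway. The frozen timestamp and the "api" audience are arbitrary values chosen for illustration, and the clock is pinned with unittest.mock so the comparison is exact.

```python
import time
from unittest import mock


def verify_token(token: dict, expected_aud: str, leeway: int = 30) -> bool:
    now = time.time()
    if token.get("aud") != expected_aud:
        return False
    exp = token.get("exp")
    if exp is not None and now > exp + leeway:
        return False
    return True


FROZEN_NOW = 1_700_000_000  # arbitrary fixed timestamp (assumption)


def check_expiry_boundary() -> None:
    with mock.patch("time.time", return_value=FROZEN_NOW):
        # now == exp + leeway: accepted under `>`, rejected under a `>=` mutant.
        at_boundary = {"aud": "api", "exp": FROZEN_NOW - 30}
        assert verify_token(at_boundary, "api") is True

        # One second further past the leeway window: must be rejected.
        just_expired = {"aud": "api", "exp": FROZEN_NOW - 31}
        assert verify_token(just_expired, "api") is False


check_expiry_boundary()
```

The first assertion is the one that kills the >= mutant. The second guards the other side of the boundary, so a mutant that drops the expiry check entirely is caught as well.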

Utility code can live with 60%

Your StringUtils, DateHelpers, and MathExtensions are the opposite end of the spectrum.

These modules tend to be pure, heavily reused, and easy to reason about. A surviving mutant in truncate(str, maxLen) that changes > to >= might clip one extra character. That’s a UI quirk, not a security incident.

The risk-reward math shifts. These modules often have dozens of small functions. Chasing 90% mutation coverage means writing tests for every off-by-one variant in padLeft. The tests become longer than the code they protect, and the maintenance burden starts to outweigh the value.

We set a 60% floor for utility modules. That catches the obvious gaps (missing null checks, wrong return values) without forcing the team to exhaustively test every string slicing permutation.

The key is being honest about what 60% means. It means “we’ve tested the common cases and the obvious failures.” It does not mean “this code doesn’t matter.” If a utility function is used in a security-sensitive path, it inherits the higher threshold from its consumer.
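The inheritance rule can be sketched as "a module’s effective threshold is the highest of its own and its direct consumers’." The module names and the consumer map below are hypothetical; in practice you would derive the map from your import graph, and this sketch does not follow transitive consumers.

```python
from typing import Dict, List

# Hypothetical base thresholds per module.
BASE: Dict[str, int] = {
    "authn": 90,
    "utils.strings": 60,
    "utils.json": 60,
}

# Hypothetical map: module -> modules that import it directly.
CONSUMERS: Dict[str, List[str]] = {
    "utils.json": ["authn"],
    "utils.strings": [],
}


def effective_threshold(
    module: str, base: Dict[str, int], consumers: Dict[str, List[str]]
) -> int:
    """Highest threshold among the module itself and its direct consumers."""
    inherited = [base[consumer] for consumer in consumers.get(module, [])]
    return max([base[module]] + inherited)


print(effective_threshold("utils.json", BASE, CONSUMERS))     # 90
print(effective_threshold("utils.strings", BASE, CONSUMERS))  # 60
```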

The middle ground: business logic

Most of your code sits between these poles. Payment processing, data validation, workflow orchestration. These modules affect correctness and user trust, but a single surviving mutant won’t typically hand an attacker your database.

We use a tiered system:

Module type	Mutation threshold	Rationale
AuthN / AuthZ	90%	High blast radius, discrete logic
Business logic	75%	Correctness-critical, moderate complexity
Utilities / helpers	60%	Low blast radius, high reuse, simple functions
Generated / boilerplate	Excluded	Don’t test code you didn’t write

This isn’t a rigid rule. A payment calculation module might bump to 85%. A widely-used JSON helper might graduate to 75% if it’s consumed by auth code. The tiers are a starting point, not a cage.

How to implement tiered mutation gates

Stryker and PIT both let you scope a mutation run to specific paths or classes, which is the building block for per-module gates. Here’s how we wire it into a Python project using mutmut with a custom config:

# mutation_config.py
THRESHOLDS = {
    "src/authn/": 90,
    "src/authz/": 90,
    "src/billing/": 85,
    "src/workflows/": 75,
    "src/utils/": 60,
}

EXCLUDE_PATHS = [
    "src/generated/",
    "src/migrations/",
]

In CI, a small script runs the mutation tester per module, then hands off to a verifier that reads this config:

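#!/usr/bin/env bash
# ci/check-mutation.sh
set -e

# mutmut 2.x exits non-zero when mutants survive, so don't let `set -e`
# abort here; the threshold check below decides pass/fail.
python -m mutmut run --paths-to-mutate=src/authn/ || true
python -m mutmut results || true
python -m mutmut run --paths-to-mutate=src/utils/ || true
python -m mutmut results || true

# Fails the build if any module is under its threshold.
python ci/verify_thresholds.py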

The verification script checks each module’s score against its threshold. If src/authn/ scores 87%, the build fails with a clear message: authn/ scored 87%, threshold is 90%.
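The core of such a verifier can be sketched as follows. This is not our actual script: it assumes the killed and total counts per module have already been collected (how you extract them depends on your mutmut version), and the scores dict here is made-up sample data.

```python
from typing import Dict, List, Optional, Tuple

THRESHOLDS: Dict[str, int] = {
    "src/authn/": 90,
    "src/authz/": 90,
    "src/billing/": 85,
    "src/workflows/": 75,
    "src/utils/": 60,
}


def threshold_for(path: str, thresholds: Dict[str, int]) -> Optional[int]:
    """Pick the threshold of the longest configured prefix matching `path`."""
    matches = [prefix for prefix in thresholds if path.startswith(prefix)]
    if not matches:
        return None
    return thresholds[max(matches, key=len)]


def check(scores: Dict[str, Tuple[int, int]], thresholds: Dict[str, int]) -> List[str]:
    """Return one failure message per module scoring under its threshold."""
    failures = []
    for module, (killed, total) in scores.items():
        limit = threshold_for(module, thresholds)
        if limit is None or total == 0:
            continue  # ungated module, or nothing was mutated
        score = 100.0 * killed / total
        if score < limit:
            failures.append(f"{module} scored {score:.0f}%, threshold is {limit}%")
    return failures


# Sample data (assumption): (killed, total) mutants per module.
scores = {"src/authn/": (87, 100), "src/utils/": (70, 100)}
for message in check(scores, THRESHOLDS):
    print(message)  # src/authn/ scored 87%, threshold is 90%
```

A real CI script would exit non-zero whenever the returned list is non-empty.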

For Stryker (JavaScript/TypeScript), list the directories you care about in the mutate globs of stryker.conf.js and leave thresholds.break unset:

// stryker.conf.js
module.exports = {
  thresholds: {
    high: 90,
    low: 75,
    break: null, // we handle this per-module
  },
  mutate: [
    "src/auth/**/*.ts",
    "src/billing/**/*.ts",
    "src/utils/**/*.ts",
  ],
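  // src/generated/** is not listed in `mutate`, so it is never mutated.
  // (ignorePatterns would also keep it out of the sandbox, which breaks
  // tests that import generated code.)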
};

We wrap Stryker in a script that runs it three times with different path globs and enforces the per-directory threshold after each run. It’s a little clunky, but it works.
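A rough sketch of that wrapper’s logic, written in Python for consistency with the rest of this post. It assumes Stryker’s JSON reporter is enabled and that you load its report file into a dict after each run; the GATES values are illustrative, and the score counts killed and timed-out mutants as detected, which is how Stryker defines its mutation score.

```python
from typing import Dict, List

# Illustrative per-directory gates (assumption), keyed by mutate glob.
GATES: Dict[str, int] = {
    "src/auth/**/*.ts": 90,
    "src/billing/**/*.ts": 85,
    "src/utils/**/*.ts": 60,
}

DETECTED = {"Killed", "Timeout"}
UNDETECTED = {"Survived", "NoCoverage"}


def stryker_command(glob: str) -> List[str]:
    """Command line for one scoped run; --mutate overrides the config's globs."""
    return ["npx", "stryker", "run", "--mutate", glob]


def score_from_report(report: dict) -> float:
    """Mutation score from a parsed JSON report: detected / (detected + undetected)."""
    statuses = [
        mutant["status"]
        for file_result in report["files"].values()
        for mutant in file_result["mutants"]
    ]
    detected = sum(status in DETECTED for status in statuses)
    valid = detected + sum(status in UNDETECTED for status in statuses)
    return 100.0 * detected / valid if valid else 100.0


def passes(glob: str, report: dict) -> bool:
    return score_from_report(report) >= GATES[glob]


# Made-up report: 9 killed mutants, 1 survivor.
sample = {"files": {"src/auth/token.ts": {
    "mutants": [{"status": "Killed"}] * 9 + [{"status": "Survived"}]}}}
print(score_from_report(sample))           # 90.0
print(passes("src/auth/**/*.ts", sample))  # True
```

The real wrapper would loop over GATES, run each command with subprocess, load the report that run produced, and fail the build on the first directory where passes returns False.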

The trap of chasing 100%

Some teams see mutation testing as a game to win. They write tests that exist only to kill mutants, not to verify behavior.

The worst example is testing that a specific exception message contains a substring, just so a mutant that changes the message text gets killed. That test adds no value. It doesn’t verify the exception is thrown at the right time, or that the right type is raised. It only verifies the string.
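The difference is easier to see side by side. In this sketch, Forbidden, require_role, and the deliberately broken never_raises are all hypothetical, and the checks are plain functions rather than tests from any particular framework.

```python
from typing import Callable, Set

RoleCheck = Callable[[Set[str], str], None]


class Forbidden(Exception):
    """Hypothetical exception type for a failed role check."""


def require_role(user_roles: Set[str], needed: str) -> None:
    if needed not in user_roles:
        raise Forbidden(f"missing role: {needed}")


def never_raises(user_roles: Set[str], needed: str) -> None:
    """A broken implementation: the check has been removed entirely."""


def message_only_check(fn: RoleCheck) -> bool:
    # Score-chasing test: pins the message text and nothing else.
    try:
        fn(set(), "admin")
    except Exception as err:
        return "missing role" in str(err)
    return True  # nothing raised, so nothing was asserted


def behavior_check(fn: RoleCheck) -> bool:
    # Behavior test: right exception type, raised only when it should be.
    try:
        fn({"viewer"}, "admin")
        return False  # should have been rejected
    except Forbidden:
        pass
    try:
        fn({"admin"}, "admin")
    except Forbidden:
        return False  # should have been allowed
    return True


print(message_only_check(require_role), message_only_check(never_raises))  # True True
print(behavior_check(require_role), behavior_check(never_raises))          # True False
```

The message-only check kills the string mutant but waves through an implementation with no role check at all. The behavior check catches the broken version and does not care what the message says.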

If you find yourself writing tests purely to nudge a percentage, you’ve inverted the goal. Mutation testing is a diagnostic tool, not a leaderboard. The score tells you where to look. It doesn’t tell you when you’re done.

What we learned the hard way

We started with a global 80% gate. Within a month, three teams had disabled it in feature branches “temporarily.” Two of those temporary disables became permanent.

The problem wasn’t the number. It was that 80% was too low for auth code (we missed a role-check bug that made it to staging) and too high for a 4,000-line utility module (the team spent two weeks writing tests for isValidEmail variants).

After we split into tiers, adoption stuck. Auth teams accepted the 90% bar because the scope was bounded. Platform teams accepted 60% for utilities because it was achievable without madness. The tiered approach turned mutation testing from a punishment into a conversation about risk.

Where to start

If you’re introducing mutation testing to an existing codebase, don’t set any gates in week one. Run the tool, look at the scores, and ask: where would a surviving mutant hurt the most?

Start with auth. Set 90% there, get it green, and prove the value. Expand to business logic once the team trusts the signal. Keep utilities at a lower bar or exclude them entirely until you’ve built the habit.

And remember: a 60% score with honest tests beats a 95% score with tests written to game the mutator. The goal is catching real bugs, not impressing your metrics dashboard.

If you want to try this yourself, mutmut for Python and Stryker for JavaScript both support the per-directory patterns described above. Start small. One auth module. One week. See what survives.