100 Test Runs Is a Lie: How to Actually Size Your Property-Based Tests

If you’re running property-based tests with the default 100 examples, you’re getting the worst of both worlds. Your CI is slower than it needs to be, and you’re still not catching the bugs that matter.

The number isn’t magic. Most libraries, Hypothesis included, default to 100 because it’s a round number that feels safe. But “feels safe” is not a testing strategy.

What property-based testing actually promises

Property-based testing flips the script on unit tests. Instead of hand-writing inputs and expected outputs, you define a property. A rule that should always hold. The framework generates inputs to break it.

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_reversing_a_list_twice_gives_the_original(lst):
    assert lst == list(reversed(list(reversed(lst))))

The framework runs this function many times with random lists of integers. If it finds a counterexample, it shrinks the input to the smallest version that still fails. A 47-element list that triggers a bug is useless for debugging. A 3-element list is gold.
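To make the shrinking step concrete, here is a toy shrinker. It is a minimal sketch, not Hypothesis's actual algorithm (which is considerably more sophisticated); the names `shrink`, `smaller_ints`, and `still_fails` are made up for illustration, and the "bug" is a hypothetical one that fires on any element of 10 or more.

```python
def smaller_ints(x):
    """Candidate replacements for x, all strictly closer to zero."""
    if x == 0:
        return []
    sign = 1 if x > 0 else -1
    size = abs(x)
    return [0, sign * (size // 2), sign * (size - 1)]


def shrink(failing, still_fails):
    """Greedily simplify a failing list of ints while it keeps failing."""
    current = list(failing)
    progress = True
    while progress:
        progress = False
        # First try deleting a single element.
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current, progress = candidate, True
                break
        if progress:
            continue
        # Deletion is stuck, so try moving one element toward zero.
        for i, x in enumerate(current):
            for y in smaller_ints(x):
                candidate = current[:i] + [y] + current[i + 1:]
                if still_fails(candidate):
                    current, progress = candidate, True
                    break
            if progress:
                break
    return current


# Hypothetical bug: the code under test breaks on any element >= 10.
def still_fails(xs):
    return any(x >= 10 for x in xs)


original = list(range(47))             # a 47-element counterexample
print(shrink(original, still_fails))   # [10]
```

Every candidate the shrinker tries is one more execution of the test, which is why shrinking has a real cost once a failure is found.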

This is powerful. It is also probabilistic. Property-based testing cannot prove correctness. It can only raise your confidence that a bug doesn’t exist, or find a bug if one does. That probabilistic nature is what makes the run count so important.

Why 100 is arbitrary

Let’s be honest about where 100 comes from. In Hypothesis, it’s a pragmatic default: a nice round number that catches most shallow bugs without making tests unbearably slow. It is a social compromise, not a statistical one.

The probability of finding a bug depends on two things: how common the bug is among the inputs your generator actually produces, and how many samples you take. If a bug only triggers when the input is a palindrome of length greater than 20, and such palindromes make up 0.01% of generated lists, 100 runs gives you roughly a 1% chance of catching it. That’s not a test. That’s a lottery ticket.

Most bugs are not that rare. Many properties break on empty lists, single elements, or simple duplicates. A well-tuned generator catches those quickly. But the default of 100 assumes your generators are perfect and your bugs are shallow. Both assumptions are wrong.

What run count actually buys you, statistically

If we model bug discovery as sampling with replacement from an input space where the bug has probability p of appearing, the probability of missing the bug after n runs is (1 - p)^n.

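For p = 0.01, 100 runs gives you a 37% chance of missing the bug. For p = 0.001, 100 runs gives you a 90% chance of missing it. To get 99% confidence of catching a 0.1% bug, you need about 4,600 runs. That figure comes from solving (1 - p)^n ≤ 1 - c for n at a target confidence c, which gives n ≥ ln(1 - c) / ln(1 - p).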

import math

def runs_for_confidence(p, confidence=0.99):
    """Returns the runs needed to catch a bug with probability `p`
    at the given confidence level."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

print(runs_for_confidence(0.01))    # 459
print(runs_for_confidence(0.001))   # 4603
print(runs_for_confidence(0.0001))  # 46050

This is the part that makes people uncomfortable. If you want high confidence of catching rare bugs, you need tens of thousands of runs. Nobody wants to wait that long in CI.

Shrinking changes the cost equation

The 100-run default was set before shrinking was as good as it is today. Modern property-based testing frameworks don’t just find bugs. They find minimal bugs.

That means you can think in terms of budget, not just count. If you allow 1,000 examples and the bug shows up on the 847th, generation stops there, and shrinking might take another 200 to 300 executions to minimize the counterexample. The total cost is roughly 1,050 to 1,150 executions for one bug. But if you run 10,000 examples and find nothing, you spent 10,000 runs for peace of mind.
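One way to estimate that budget is to reuse the independent-sampling model from the previous section. The sketch below is a rough estimate under that model, not a description of how any framework accounts for its runs; `expected_executions` is a made-up helper, and the default `shrink_cost` of 250 is simply the midpoint of the 200 to 300 range mentioned above.

```python
def expected_executions(p, max_examples, shrink_cost=250):
    """Expected test executions for one run, under an independent-sampling model.

    p            -- chance that a single example triggers the bug
    max_examples -- cap on generated examples
    shrink_cost  -- assumed extra executions spent shrinking after a failure
    """
    p_found = 1 - (1 - p) ** max_examples
    # Expected number of examples generated before stopping: the first
    # failure ends generation early, otherwise the cap is reached.
    discovery = p_found / p if p > 0 else max_examples
    return discovery + p_found * shrink_cost


for cap in (1_000, 10_000):
    with_bug = expected_executions(0.001, cap)
    without_bug = expected_executions(0, cap)
    print(f"cap={cap}: ~{with_bug:.0f} executions if a 0.1% bug exists, "
          f"{without_bug} if the code is correct")
```

Under these assumptions, raising the cap is cheap when a bug exists, because the run stops early. The full price is only paid on passing runs.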

The trick is to separate discovery from validation. Run a small, fast suite in CI for immediate feedback. Run a larger, slower suite overnight or on release branches for deeper confidence.

from hypothesis import given, settings, strategies as st
import json

# Fast feedback in CI
@given(st.dictionaries(st.text(), st.integers()))
@settings(max_examples=100)
def test_json_roundtrip_fast(d):
    assert json.loads(json.dumps(d)) == d

# Deeper confidence on main
@given(st.dictionaries(st.text(), st.integers()))
@settings(max_examples=5000, deadline=None)
def test_json_roundtrip_thorough(d):
    assert json.loads(json.dumps(d)) == d

This is not just about speed. It’s about information density. Under the sampling model above, a 100-run test that passes only rules out bugs that hit more than about 4.5% of inputs (at 99% confidence). A 5,000-run test that passes pushes that bound down to about 0.09%. A 100-run test that fails tells you exactly where to look.
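To put a number on what a passing run rules out, the miss probability (1 - p)^n from earlier can be solved for p instead of n. This sketch assumes the same independent-sampling model, and `largest_hidden_rate` is an illustrative name.

```python
def largest_hidden_rate(n, confidence=0.99):
    """Largest per-example failure probability that n passing runs
    cannot rule out at the given confidence level."""
    return 1 - (1 - confidence) ** (1 / n)


for n in (100, 5_000, 50_000):
    print(f"{n:>6} passing runs: bugs rarer than "
          f"{largest_hidden_rate(n):.3%} of inputs may still be hiding")
```

For 100 runs the bound is about 4.5% of inputs. For 5,000 runs it is about 0.09%.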

How we split property-based tests into tiers

In our experience, the best approach is to stop treating all properties as equal. We split them into three tiers.

Fast properties run on every pull request. These are the mechanical ones. Round-trip serialization, idempotency of deduplication, basic invariants on data structures. We run 100 to 200 examples. They complete in under a second.

Deep properties run on every merge to main. These target complex state machines, event processing pipelines, and anything with combinatorial explosion. We run 2,000 to 10,000 examples. They take minutes, not hours.

Exploratory properties run manually before releases. These are the ones where we crank max_examples to 50,000 or more and let the machine grind while we review the changelog. We’ve found race conditions and integer overflow edge cases this way that no amount of unit testing would have caught.
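The three tiers can be written down as plain data. The sketch below is one possible layout, not a record of our actual configuration: the `PBT_TIER` environment variable and the `tier_settings` helper are assumed names, and the numbers are taken from the upper end of the ranges above.

```python
import os

# Hypothetical tier table; values mirror the ranges described above.
TIERS = {
    "fast": {"max_examples": 200},
    "deep": {"max_examples": 10_000, "deadline": None},
    "exploratory": {"max_examples": 50_000, "deadline": None},
}


def tier_settings(environ=os.environ):
    """Return (tier name, settings kwargs) chosen by the PBT_TIER variable.

    PBT_TIER is an assumed variable name; unset means the fast tier.
    """
    name = environ.get("PBT_TIER", "fast")
    if name not in TIERS:
        raise ValueError(f"unknown tier {name!r}; expected one of {sorted(TIERS)}")
    return name, TIERS[name]


print(tier_settings({"PBT_TIER": "deep"}))
```

Each entry is shaped so it could be passed as keyword arguments to Hypothesis's settings.register_profile, in the same way as the profile example later in this article.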

What to do instead of guessing

Stop treating max_examples as a dial you set once and forget. Treat it as a configuration that belongs to the property, not the framework.

Ask three questions for every property you write.

How expensive is this test to run? If each example takes 50ms, 10,000 runs is a little over 8 minutes. If it takes 5ms, it’s under a minute.

How bad is the bug if we miss it? A formatting bug in a log message is not the same as a data corruption bug in a payment pipeline.

How rare is the triggering condition? If the bug only appears on leap years, or when two UUIDs collide, or at exactly INT_MAX, you need more runs or a smarter generator.
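The three questions can be combined into a quick back-of-the-envelope check. This is a sketch under the independent-sampling model from earlier, with made-up helper names. Severity enters through the `confidence` argument: you would ask for a higher confidence on the payment pipeline than on the log formatter.

```python
import math


def runs_needed(p, confidence=0.99):
    """Runs required to catch a bug that hits a fraction p of inputs."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


def runs_affordable(budget_seconds, ms_per_example):
    """How many examples fit in a wall-clock budget."""
    return int(budget_seconds * 1000 // ms_per_example)


def size_property(p, ms_per_example, budget_seconds, confidence=0.99):
    """Compare what the property needs against what the budget allows."""
    needed = runs_needed(p, confidence)
    affordable = runs_affordable(budget_seconds, ms_per_example)
    return {"needed": needed, "affordable": affordable, "fits": needed <= affordable}


# A cheap property hunting a 0.1% bug fits in a one-minute budget.
print(size_property(p=0.001, ms_per_example=5, budget_seconds=60))
# A slow property hunting a 0.01% bug does not.
print(size_property(p=0.0001, ms_per_example=50, budget_seconds=60))
```

When `fits` comes back false, the options are a slower tier, a bigger budget, or a generator that makes the triggering condition less rare.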

Smarter generators almost always beat more runs. If you’re testing a JSON parser, don’t generate random strings and hope they parse. Generate valid objects and then mutate them.

from hypothesis import given, settings, strategies as st
import json

# Bad: most random strings aren't valid JSON
@settings(max_examples=10000)
@given(st.text())
def test_parse_json_bad(s):
    try:
        json.loads(s)
    except json.JSONDecodeError:
        pass  # Most inputs hit this immediately

# Good: generate valid objects, so every example exercises a full round trip
@settings(max_examples=500)
@given(st.dictionaries(st.text(), st.integers()))
def test_parse_json_good(d):
    assert json.loads(json.dumps(d)) == d

500 runs with a good generator will usually beat 10,000 runs with a bad one.
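The gap is easy to measure without any framework. The sketch below uses the standard library's random module as a stand-in for a real generator, so it shows the idea rather than Hypothesis's behavior. `random_text`, `random_document`, and `parse_rate` are illustrative names.

```python
import json
import random
import string


def random_text(rng, max_len=20):
    """Unstructured input: printable characters of random length."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(length))


def random_key(rng):
    """A short random key made of ASCII letters."""
    return "".join(rng.choice(string.ascii_letters) for _ in range(rng.randint(0, 8)))


def random_document(rng, max_keys=5):
    """Structured input: serialize a random dict, so the text is valid by construction."""
    obj = {random_key(rng): rng.randint(-1000, 1000) for _ in range(rng.randint(0, max_keys))}
    return json.dumps(obj)


def parse_rate(make_input, runs=2000, seed=0):
    """Fraction of generated inputs that json.loads accepts."""
    rng = random.Random(seed)
    parsed = 0
    for _ in range(runs):
        try:
            json.loads(make_input(rng))
            parsed += 1
        except json.JSONDecodeError:
            pass
    return parsed / runs


print(f"random text parsed:     {parse_rate(random_text):.1%}")
print(f"structured docs parsed: {parse_rate(random_document):.1%}")
```

Only a small fraction of the random strings get past the parser's first check, so most of those runs test nothing beyond error handling.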

Common questions about sizing property-based tests

Doesn’t more runs always mean better coverage?

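Not exactly. Property-based testing doesn’t have a coverage metric in the traditional sense. More runs increase the probability of finding a bug, but returns diminish once the run count is well past 1/p for the bugs you care about. For bugs that hit around 1% of inputs, doubling from 100 to 200 runs is meaningful and doubling from 10,000 to 20,000 is not. The larger doubling only pays off for bugs closer to 1 in 10,000.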
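Whether a doubling is worth it depends on how rare the bug is, and a few lines of arithmetic show it. This again assumes the independent-sampling model from earlier; `catch_probability` and `gain_from_doubling` are illustrative helpers.

```python
def catch_probability(p, n):
    """Chance that at least one of n independent examples triggers the bug."""
    return 1 - (1 - p) ** n


def gain_from_doubling(p, n):
    """Extra catch probability bought by going from n runs to 2n runs."""
    return catch_probability(p, 2 * n) - catch_probability(p, n)


for p in (0.01, 0.0001):
    for n in (100, 10_000):
        print(f"p={p:<7} n={n:<6} gain from doubling: {gain_from_doubling(p, n):.1%}")
```

For a 1% bug, doubling from 100 runs adds roughly 23 percentage points of catch probability, while doubling from 10,000 adds essentially nothing. For a 0.01% bug the picture flips: the small doubling adds about one point and the large one adds roughly 23.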

What about fuzzing? Isn’t that just property-based testing with millions of runs?

Fuzzing is adjacent but different. Fuzzers typically run millions of inputs with no semantic understanding of the domain. Property-based testing uses structured generators and shrinking. You can think of PBT as smart fuzzing, or fuzzing as brute-force PBT. The run count calculus is different because the cost per run and the information per run are not the same.

Should I set max_examples higher for CI or lower?

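Higher for CI, lower for local development. Your laptop is for speed. Your CI is for confidence. Within CI, the same split applies: pull-request jobs can stay small while main-branch or nightly jobs run the large counts. Use a settings profile or environment variable to switch between them.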

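import os
from hypothesis import settings

# Typically lives in conftest.py so the profile is loaded before any test runs.
# Many CI providers set CI=true; adjust the check if yours does not.
CI = os.environ.get("CI", "false").lower() == "true"

# Thorough profile for CI: many examples, no per-example deadline.
settings.register_profile("ci", max_examples=5000, deadline=None)
# Quick profile for local development.
settings.register_profile("dev", max_examples=100)

settings.load_profile("ci" if CI else "dev")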

How do I know if my generator is good enough?

Run your test with max_examples set very high, say 50,000, and watch the coverage report. If branches are missing, your generator is not exercising them. Fix the generator before you lower the run count.
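A complementary check is to tally what kinds of inputs the generator produces. Hypothesis supports this natively through its event() function and the --hypothesis-show-statistics pytest flag. The sketch below shows the same idea with the standard library only, using a deliberately naive stand-in generator (`naive_lists`) and made-up bucket names.

```python
import random
from collections import Counter


def classify(xs):
    """Bucket a generated list by the shape a list property usually cares about."""
    if not xs:
        return "empty"
    if len(xs) == 1:
        return "single"
    if len(set(xs)) < len(xs):
        return "has duplicates"
    return "all distinct"


def naive_lists(rng):
    """Stand-in generator: lists of up to 50 integers drawn from a very wide range."""
    return [rng.randint(-10**6, 10**6) for _ in range(rng.randint(0, 50))]


def tally(generator, runs=1000, seed=0):
    """Count how many generated inputs land in each bucket."""
    rng = random.Random(seed)
    return Counter(classify(generator(rng)) for _ in range(runs))


for bucket, count in tally(naive_lists).most_common():
    print(f"{bucket:<15} {count}")
```

With integers drawn from such a wide range, the "has duplicates" bucket stays almost empty, so any bug that depends on repeated elements is effectively untested no matter how high the run count goes.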

Stop searching for the perfect run count and start measuring

There is no universal right number of test runs for property-based testing. There is only the right number for your property, your generators, your CI budget, and the cost of the bug you’re trying to prevent.

Start with 100 if you must. But size it up for properties that guard critical paths, and size it down for properties that are just sanity checks. Measure how long your tests take. Profile your generators. And remember: a property-based test that passes 100 times is not proof. It’s just evidence.

If you want to go deeper, Hypothesis’s documentation on test statistics and targeted property-based testing is worth reading. Running pytest with the --hypothesis-show-statistics flag reports, per test, how many examples ran, how long they typically took, and what share of that time went to data generation. That’s the first place to look when you’re deciding whether to turn the dial up or down.