Every production circuit breaker I’ve reviewed eventually spawns a background thread. It might be a Go goroutine, a Java ScheduledExecutorService, or a Rust tokio task. The job is always the same: wake up every few seconds, check if the downstream service has recovered, and transition from OPEN back to CLOSED.

That design is wrong. It leaks resources at scale, complicates shutdown, and creates race conditions that are genuinely hard to test. Worse, the background work is completely unnecessary. You can build a circuit breaker that never wakes up on its own, never allocates a timer, and still correctly detects recovery.

The hidden cost of health-check goroutines

A circuit breaker tracks failures. After enough consecutive errors, it trips OPEN and starts rejecting requests immediately. The goal is to give the failing service a break instead of drowning it in retry traffic.

The tricky part is deciding when to try again. Many implementations schedule a timer for this, such as setTimeout in JavaScript or time.AfterFunc in Go. In Go, a typical implementation looks like this:

func (cb *CircuitBreaker) Trip() {
    cb.state.Store(StateOpen)
    time.AfterFunc(cb.timeout, func() {
        cb.state.Store(StateHalfOpen)
    })
}

This works for a single breaker. It does not work for ten thousand.

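If you create one circuit breaker per downstream host (a common pattern in microservices), a bad outage leaves you with up to ten thousand pending timers, each holding a closure that keeps its breaker alive. time.AfterFunc does not park a goroutine while it waits, but every callback runs in its own goroutine when it fires, and implementations built on a per-breaker goroutine with a ticker or sleep loop pay about 2 KB of initial stack plus scheduling overhead for each one, all the time. On container restarts, those callbacks race against shutdown deadlines. And because the callback stores HALF-OPEN unconditionally, a timer left over from an earlier trip can fire at exactly the wrong moment and cause flapping.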

The background thread is solving a problem that does not exist. Recovery does not need to be detected proactively. It can be detected lazily, on the request path.

How lazy recovery works

Instead of a timer that transitions the breaker, store a single timestamp: the moment the breaker tripped OPEN. On every incoming request, compare now against that timestamp plus the configured timeout. If enough time has elapsed, allow a single probe through. If the probe succeeds, close the breaker. If it fails, update the timestamp and stay OPEN.

The state machine stays identical. Only the transition trigger changes.

  • CLOSED: requests pass through. Failures increment a counter. When the counter hits the threshold, atomically swap to OPEN and record trippedAt.
  • OPEN: every incoming request checks whether time.Now() has reached trippedAt + timeout. If not, fail fast. If so, atomically swap to HALF-OPEN and let this one request through.
  • HALF-OPEN: exactly one request is in flight. If it succeeds, swap to CLOSED and reset the failure counter. If it fails, swap back to OPEN and update trippedAt.

No goroutine ever wakes up. No timer is allocated. The breaker is entirely passive until a request arrives.

A working implementation in Go

Here is a complete, zero-background circuit breaker. It uses only sync/atomic for state transitions and stores the tripped timestamp as a nanosecond counter.

package breaker

import (
	"errors"
	"sync/atomic"
	"time"
)

type State int32

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	// state, failures, and trippedAt are accessed only with atomic operations.
	state     int32
	failures  int32
	threshold int32
	timeout   time.Duration
	trippedAt int64 // nanoseconds since Unix epoch
}

func New(threshold int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		threshold: int32(threshold),
		timeout:   timeout,
	}
}

func (cb *CircuitBreaker) State() State {
	return State(atomic.LoadInt32(&cb.state))
}

// Allow reports whether the current request may proceed.
// It returns a done function that must be called with the outcome.
func (cb *CircuitBreaker) Allow() (done func(success bool), err error) {
	switch State(atomic.LoadInt32(&cb.state)) {
	case StateClosed:
		return cb.trackClosed, nil

	case StateOpen:
		// Lazy recovery check: has the timeout elapsed?
		if time.Now().UnixNano()-atomic.LoadInt64(&cb.trippedAt) < int64(cb.timeout) {
			return nil, errors.New("circuit breaker is open")
		}
		// Race: multiple goroutines may see this simultaneously.
		// Only one wins the CAS to HALF-OPEN.
		if atomic.CompareAndSwapInt32(&cb.state, int32(StateOpen), int32(StateHalfOpen)) {
			return cb.trackHalfOpen, nil
		}
		// Another goroutine won the race; fail fast this request.
		return nil, errors.New("circuit breaker is open")

	case StateHalfOpen:
		// Only one probe at a time. Every other request fails fast.
		return nil, errors.New("circuit breaker is half-open")
	}

	return nil, errors.New("unknown circuit breaker state")
}

func (cb *CircuitBreaker) trackClosed(success bool) {
	if success {
		atomic.StoreInt32(&cb.failures, 0)
		return
	}

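	// Increment failures and trip if threshold reached. Results can arrive
	// late, from requests admitted before an earlier trip, so only act
	// while the breaker is still CLOSED.
	if atomic.AddInt32(&cb.failures, 1) >= cb.threshold &&
		atomic.LoadInt32(&cb.state) == int32(StateClosed) {
		// Record the trip time before switching state so readers
		// never see OPEN with a stale trippedAt.
		atomic.StoreInt64(&cb.trippedAt, time.Now().UnixNano())
		// CAS rather than Store, so a concurrent probe's state is not clobbered.
		atomic.CompareAndSwapInt32(&cb.state, int32(StateClosed), int32(StateOpen))
	}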
}

func (cb *CircuitBreaker) trackHalfOpen(success bool) {
	if success {
		atomic.StoreInt32(&cb.failures, 0)
		atomic.StoreInt32(&cb.state, int32(StateClosed))
		return
	}

	atomic.StoreInt64(&cb.trippedAt, time.Now().UnixNano())
	atomic.StoreInt32(&cb.state, int32(StateOpen))
}

The key invariant: trippedAt is always written before the state transitions to OPEN. Readers in Allow() can then safely read trippedAt after seeing OPEN, knowing it is fresh. On the return path from HALF-OPEN, we update trippedAt before dropping back to OPEN so that the cooldown restarts from zero.
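Callers have one obligation of their own: the done function must be called exactly once for every admitted request. If a probe never reports back, the breaker stays in HALF-OPEN. Here is a sketch of a wrapper that takes care of this. Do, Allower, and recorder are hypothetical names and not part of the breaker above; recorder is only a stand-in that makes the example self-contained.

```go
package main

import "errors"

var errOpen = errors.New("circuit breaker is open")

// Allower is the slice of the breaker's API this sketch relies on.
type Allower interface {
	Allow() (done func(success bool), err error)
}

// Do is a hypothetical helper. It runs fn only if the breaker admits the
// request, and it always reports the outcome back to the breaker.
func Do(b Allower, fn func() error) error {
	done, err := b.Allow()
	if err != nil {
		return err // rejected: fn never runs
	}
	ok := false
	// The deferred call also runs if fn panics, so a panicking probe is
	// recorded as a failure instead of leaving the breaker in HALF-OPEN.
	defer func() { done(ok) }()
	if err := fn(); err != nil {
		return err
	}
	ok = true
	return nil
}

// recorder is a stand-in breaker that records what Do reports.
type recorder struct {
	reject  bool
	results []bool
}

func (r *recorder) Allow() (func(bool), error) {
	if r.reject {
		return nil, errOpen
	}
	return func(ok bool) { r.results = append(r.results, ok) }, nil
}
```

Treating every non-nil error as a failure is an assumption. Real callers usually want to exclude errors that say nothing about the downstream's health, such as validation failures or context cancellation.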

Why most libraries don’t do this

The lazy design has one apparent downside: recovery is only detected when a request arrives. If your service receives no traffic for an hour, the breaker stays OPEN for an hour.

This sounds bad. It is not.

If there are no requests, there is nothing to protect. The breaker exists to prevent cascading failures during traffic, not to maintain a real-time health dashboard. When the next request does arrive, the timeout check runs in nanoseconds and the probe fires immediately. The probe goes out with the first request at or after trippedAt + timeout, so the effective recovery latency is at most the timeout plus the gap between requests.

For high-traffic services, the gap between requests is negligible. For low-traffic services, a timer would only flip the breaker to HALF-OPEN earlier; the probe itself still has to wait for the next request. The background timer almost never improves real recovery time in practice.

The other reason timers persist is historical. The circuit breaker pattern was popularized in environments (Java with Hystrix, .NET with Polly) where a single breaker instance guarded a whole service dependency, not a per-host connection. At that scale, one background timer or thread per breaker is an acceptable cost. In modern distributed systems, where you might have a breaker per upstream endpoint, that assumption breaks down.

Testing the race conditions

The CAS on the OPEN to HALF-OPEN transition is the only state transition where goroutines race, and it is a single compare-and-swap, not a retry loop. If two requests arrive simultaneously after the timeout, only one proceeds as a probe. The other fails fast, and its caller can retry later. This is correct behavior. You never want multiple probes in flight during recovery, because a single failure among several successes could still flip you back to OPEN.

Testing is straightforward because there are no asynchronous timers. You can write a unit test that manipulates trippedAt directly (or uses a time wrapper) without sleeping:

package breaker

import (
	"sync/atomic"
	"testing"
	"time"
)

func TestLazyRecovery(t *testing.T) {
	cb := New(1, time.Minute)

	// Trip the breaker.
	done, _ := cb.Allow()
	done(false)

	if cb.State() != StateOpen {
		t.Fatal("expected OPEN")
	}

	// Simulate timeout by winding back trippedAt.
	atomic.StoreInt64(&cb.trippedAt, time.Now().Add(-2*time.Minute).UnixNano())

	done, err := cb.Allow()
	if err != nil {
		t.Fatalf("expected probe to be allowed: %v", err)
	}

	// Success closes the breaker.
	done(true)
	if cb.State() != StateClosed {
		t.Fatal("expected CLOSED after successful probe")
	}
}

No time.Sleep. No sync.WaitGroup for goroutines. The test is deterministic because the implementation is synchronous.
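The contended path can be checked too. This one does need goroutines and a sync.WaitGroup, but still no sleeping. It is a minimal sketch: countProbes is a hypothetical harness that models only the OPEN to HALF-OPEN CAS on a bare state word, not the full breaker.

```go
package main

import (
	"sync"
	"sync/atomic"
)

const (
	stateOpen int32 = iota + 1
	stateHalfOpen
)

// countProbes is a hypothetical harness. It releases n goroutines against a
// state word that is OPEN with its cooldown already expired, and returns
// how many of them won the CAS to HALF-OPEN.
func countProbes(n int) int32 {
	state := stateOpen
	var winners int32
	var wg sync.WaitGroup
	start := make(chan struct{})

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // line everyone up so they contend on the CAS
			if atomic.CompareAndSwapInt32(&state, stateOpen, stateHalfOpen) {
				atomic.AddInt32(&winners, 1)
			}
		}()
	}

	close(start) // release all goroutines at once
	wg.Wait()
	return atomic.LoadInt32(&winners)
}
```

However the scheduler interleaves the goroutines, exactly one CAS can observe OPEN, so the count does not depend on timing. Running the same check under go test -race is a reasonable extra safeguard.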

What we gave up

There is one real loss: you cannot eagerly pre-warm a breaker before sending traffic. If you need to probe a dependency on a fixed schedule (say, every 5 seconds) to keep a connection pool warm, you still need a timer. But that timer belongs to your connection pool or health checker, not to the circuit breaker. The breaker should protect the pool. It should not manage it.

Keep the concerns separate. Health checking warms connections. Circuit breaking prevents cascading overload. When you merge them, you get complexity in both places.

A pattern for other languages

The same structure works anywhere you have atomic compare-and-swap and a monotonic clock. That covers Rust with std::sync::atomic and std::time::Instant, Java with AtomicIntegerFieldUpdater and System.nanoTime(), and C++ with std::atomic and std::chrono::steady_clock. The implementation is under a hundred lines in every case.

If your language does not expose CAS, a mutex (sync.Mutex in Go, or the equivalent elsewhere) is still cheaper than a background thread. The lock is held only long enough to read or update a few fields in memory on each request. It never blocks for I/O or sleeps.

Try it

The full implementation above is a solid starting point for production use. Add metrics, logging, and adaptive thresholds on top. But leave out the goroutine. Your runtime scheduler will thank you, your tests will run faster, and you will stop wondering why that one container refuses to shut down cleanly.