A Circuit Breaker with No Goroutines, No Timers, and No Background Overhead

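Every production circuit breaker I have reviewed ends up spawning a background thread. It might be a Go goroutine, a Java ScheduledExecutorService, or a Rust tokio task. The job is always the same: wake up every few seconds, check whether the downstream service has recovered, and flip the state from OPEN back to CLOSED.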

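This design is wrong. It leaks resources at scale, complicates graceful shutdown, and introduces race conditions that are genuinely hard to test. Worse, the background work is unnecessary. You can build a circuit breaker that never wakes itself up, never allocates a timer, and still detects recovery correctly.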

The Hidden Cost of the Health-Check Goroutine

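A circuit breaker tracks failures. Once consecutive errors reach a threshold, it trips to the OPEN state and immediately starts rejecting requests. The point is to give the failing service room to breathe instead of drowning it in retry traffic.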

The tricky part is deciding when to try again. Most libraries solve this with setTimeout or time.AfterFunc. In Go, a typical implementation looks like this:

func (cb *CircuitBreaker) Trip() {
    cb.state.Store(StateOpen)
    time.AfterFunc(cb.timeout, func() {
        cb.state.Store(StateHalfOpen)
    })
}

That is fine for one breaker. It is not fine for ten thousand.

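If you create one breaker per downstream host (a common pattern in microservices), you can end up with ten thousand pending timers behind your back, or ten thousand sleeping goroutines if the library polls in a loop instead. Each goroutine starts with roughly 2 KB of stack and adds scheduling overhead; a runtime timer is lighter, but time.AfterFunc still runs its callback on a new goroutine when it fires. When a container restarts, that background work races the shutdown deadline. And when the timeouts do fire, they fire regardless of what the breaker has done in the meantime, which causes state flapping.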

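The background thread solves a problem that does not exist. Recovery does not need to be detected actively. It can be detected lazily, on the request path.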

How Lazy Recovery Works

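Instead of using a timer to drive the state transition, store a timestamp: the moment the breaker tripped to OPEN. On every new request, compare now against that timestamp plus the configured timeout. If the time is up, let one probe request through. If the probe succeeds, close the breaker. If it fails, update the timestamp and stay OPEN.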

The state machine itself is unchanged. Only the trigger for each transition changes.

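CLOSED: requests pass straight through. Failures increment a counter. When the counter reaches the threshold, atomically switch to OPEN and record trippedAt.
OPEN: every incoming request checks whether time.Now() >= trippedAt + timeout. If not, fail fast. If so, atomically switch to HALF-OPEN and let the current request through as the probe.
HALF-OPEN: exactly one request is in flight. If it succeeds, switch to CLOSED and reset the failure counter. If it fails, switch back to OPEN and update trippedAt.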

No goroutine ever wakes up. No timer is ever allocated. The breaker is completely passive until a request arrives.

A Complete Go Implementation

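Below is a complete circuit breaker with zero background overhead. It uses only sync/atomic for state transitions and stores the trip timestamp as a nanosecond counter.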

package breaker

import (
	"errors"
	"sync/atomic"
	"time"
)

type State int32

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	// state is accessed with atomic operations.
	state     int32
	failures  int32
	threshold int32
	timeout   time.Duration
	trippedAt int64 // nanoseconds since Unix epoch
}

func New(threshold int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		threshold: int32(threshold),
		timeout:   timeout,
	}
}

func (cb *CircuitBreaker) State() State {
	return State(atomic.LoadInt32(&cb.state))
}

// Allow reports whether the current request may proceed.
// It returns a done function that must be called with the outcome.
func (cb *CircuitBreaker) Allow() (done func(success bool), err error) {
	switch State(atomic.LoadInt32(&cb.state)) {
	case StateClosed:
		return cb.trackClosed, nil

	case StateOpen:
		// Lazy recovery check: has the timeout elapsed?
		if time.Now().UnixNano()-atomic.LoadInt64(&cb.trippedAt) < int64(cb.timeout) {
			return nil, errors.New("circuit breaker is open")
		}
		// Race: multiple goroutines may see this simultaneously.
		// Only one wins the CAS to HALF-OPEN.
		if atomic.CompareAndSwapInt32(&cb.state, int32(StateOpen), int32(StateHalfOpen)) {
			return cb.trackHalfOpen, nil
		}
		// Another goroutine won the race; fail fast this request.
		return nil, errors.New("circuit breaker is open")

	case StateHalfOpen:
		// Only one probe at a time. Every other request fails fast.
		return nil, errors.New("circuit breaker is half-open")
	}

	return nil, errors.New("unknown circuit breaker state")
}

func (cb *CircuitBreaker) trackClosed(success bool) {
	if success {
		atomic.StoreInt32(&cb.failures, 0)
		return
	}

	// Increment failures and trip if threshold reached.
	if atomic.AddInt32(&cb.failures, 1) >= cb.threshold {
		// Record the trip time before switching state so readers
		// never see OPEN with a stale trippedAt.
		atomic.StoreInt64(&cb.trippedAt, time.Now().UnixNano())
		// CAS instead of Store: a late failure from a request admitted
		// while CLOSED must not overwrite HALF-OPEN during a probe.
		atomic.CompareAndSwapInt32(&cb.state, int32(StateClosed), int32(StateOpen))
	}
}

func (cb *CircuitBreaker) trackHalfOpen(success bool) {
	if success {
		atomic.StoreInt32(&cb.failures, 0)
		atomic.StoreInt32(&cb.state, int32(StateClosed))
		return
	}

	atomic.StoreInt64(&cb.trippedAt, time.Now().UnixNano())
	atomic.StoreInt32(&cb.state, int32(StateOpen))
}

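The key invariant: trippedAt is always written before the state switches to OPEN. A reader in Allow() that observes OPEN and then reads trippedAt can therefore rely on the timestamp being fresh. On the way back from HALF-OPEN, we update trippedAt before falling back to OPEN, so the cooldown restarts from zero.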
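On the calling side, the pattern is to ask for permission, run the downstream call, and always report the outcome. The sketch below is one way to wrap that, not part of the listing above: Do and ErrRejected are hypothetical names, and the helper takes the breaker's Allow method as a plain function value so it works with any breaker of the same shape.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRejected is a hypothetical sentinel. It lets callers tell "the breaker
// refused the call" apart from "the downstream call itself failed".
var ErrRejected = errors.New("breaker rejected call")

// Do runs fn only if allow admits the request, and always reports the
// outcome through the done callback that allow returned.
func Do(allow func() (func(bool), error), fn func() error) error {
	done, err := allow()
	if err != nil {
		// Rejected: fn is never called.
		return fmt.Errorf("%w: %v", ErrRejected, err)
	}

	// Report from a defer so that a panic in fn still counts as a
	// failure and a probe cannot leave the breaker in HALF-OPEN.
	ok := false
	defer func() { done(ok) }()

	err = fn()
	ok = err == nil
	return err
}
```

With the breaker above, a call site would look like Do(cb.Allow, func() error { return callDownstream() }), where callDownstream stands in for whatever request you are protecting.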

Why Most Libraries Don't Do This

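The lazy design has one seemingly obvious drawback: recovery is only detected when a request arrives. If your service sees no traffic for an hour, the breaker stays OPEN for an hour.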

That sounds bad. It isn't.

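If there are no requests, there is nothing to protect. A circuit breaker exists to prevent cascading failures while traffic is flowing, not to maintain a real-time health dashboard. When the next request does arrive, the timeout check takes nanoseconds and the probe goes out immediately. The probe is the first request to arrive once the timeout has elapsed, so the real recovery delay is at most timeout + time-between-requests.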

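For a high-traffic service the gap between requests is negligible, so the timeout dominates. For a low-traffic service the gap can exceed the timeout, but no caller is waiting during that gap. In practice a background timer almost never improves real recovery time.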
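To make the timing concrete, here is a small sketch of when the first probe goes out under evenly spaced traffic. firstProbeAfter is a hypothetical helper, and it assumes the request that tripped the breaker arrived at t = 0 with later requests following every interval.

```go
package main

import "time"

// firstProbeAfter returns how long after the trip the first probe is sent.
// Assumption: the tripping request arrives at t = 0 and later requests
// arrive at interval, 2*interval, and so on.
func firstProbeAfter(timeout, interval time.Duration) time.Duration {
	if interval <= 0 {
		// Treat this as continuous traffic: the probe goes out
		// the moment the cooldown ends.
		return timeout
	}
	// The probe is the first arrival at or after the end of the cooldown,
	// i.e. arrival number ceil(timeout / interval).
	n := (timeout + interval - 1) / interval
	if n < 1 {
		n = 1 // a zero timeout still has to wait for the next request
	}
	return n * interval
}
```

With a 30-second timeout, requests every millisecond put the probe at 30 seconds. One request per hour puts it at the one-hour mark, which is also the first moment anyone needed an answer.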

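The other reason libraries use timers is historical baggage. When the circuit breaker pattern became popular in environments such as Java's Hystrix and .NET's Polly, one breaker instance guarded an entire service dependency, not a connection to each host. One background thread was acceptable. In modern distributed systems, where you may have one breaker per upstream endpoint, that assumption collapses.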

Testing the Race Condition

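The CAS from OPEN to HALF-OPEN is the one place where goroutines compete to become the probe. It is a single compare-and-swap, not a retry loop. If two requests arrive together after the timeout, only one becomes the probe. The other fails fast, and a later request gets to try again. That is the correct behavior. You never want several probes in flight during recovery, because a single failure mixed in among several successes could still knock you back to OPEN.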

Testing is straightforward because there is no asynchronous timer. A unit test can manipulate trippedAt directly (or go through a time wrapper) and never needs to sleep:

package breaker

import (
	"sync/atomic"
	"testing"
	"time"
)

func TestLazyRecovery(t *testing.T) {
	cb := New(1, time.Minute)

	// Trip the breaker.
	done, _ := cb.Allow()
	done(false)

	if cb.State() != StateOpen {
		t.Fatal("expected OPEN")
	}

	// Simulate timeout by winding back trippedAt.
	atomic.StoreInt64(&cb.trippedAt, time.Now().Add(-2*time.Minute).UnixNano())

	done, err := cb.Allow()
	if err != nil {
		t.Fatalf("expected probe to be allowed: %v", err)
	}

	// Success closes the breaker.
	done(true)
	if cb.State() != StateClosed {
		t.Fatal("expected CLOSED after successful probe")
	}
}

No time.Sleep. No sync.WaitGroup for goroutines. The test is deterministic because the implementation itself is synchronous.
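If you would rather not reach into trippedAt from tests, the time wrapper route might look like the sketch below. Everything here is an assumption layered on the article's design: cooldown is a stripped-down stand-in for the breaker's OPEN-state check, its now field is a hypothetical injection point (the listing above calls time.Now directly), and fakeClock is a test-only helper.

```go
package main

import (
	"sync/atomic"
	"time"
)

// cooldown isolates the timestamp comparison the breaker performs in OPEN.
// The now field is where a test can swap in a fake clock.
type cooldown struct {
	now       func() time.Time
	timeout   time.Duration
	trippedAt int64 // nanoseconds since Unix epoch, accessed atomically
}

// trip records the current instant, as the breaker does when it opens.
func (c *cooldown) trip() {
	atomic.StoreInt64(&c.trippedAt, c.now().UnixNano())
}

// elapsed reports whether the timeout has passed since the last trip.
func (c *cooldown) elapsed() bool {
	return c.now().UnixNano()-atomic.LoadInt64(&c.trippedAt) >= int64(c.timeout)
}

// fakeClock is a manually advanced clock. It is not safe for concurrent
// use, so it suits single-goroutine tests only.
type fakeClock struct{ t time.Time }

// newFakeClock starts at an arbitrary fixed instant.
func newFakeClock() *fakeClock {
	return &fakeClock{t: time.Unix(1_700_000_000, 0)}
}

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }
```

A test then builds the value with now: clk.Now and calls clk.Advance(time.Minute) where it would otherwise sleep. Production code passes time.Now.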

What We Give Up

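There is one real loss: you cannot pre-warm the breaker before sending traffic. If you need to probe a dependency on a fixed schedule (say, every 5 seconds) to keep a connection pool warm, you still need a timer. But that timer belongs to the connection pool or the health checker, not to the circuit breaker. The breaker should protect the pool, not manage it.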

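Keep the responsibilities separate. The health check keeps connections warm. The circuit breaker prevents cascading overload. Mix the two and both become more complicated.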

The Same Pattern Works in Other Languages

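The structure carries over to any language with an atomic compare-and-swap and a monotonic clock. Rust has std::sync::atomic, Java has AtomicIntegerFieldUpdater and System.nanoTime(), and C++ has std::atomic with a custom enum. Each implementation fits in under a hundred lines. Note that the Go listing above reads wall-clock time through UnixNano(), so a clock adjustment can shorten or lengthen a cooldown.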
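In Go, one way to get a monotonic timestamp that still fits in an int64 is to store an offset from a fixed base time. time.Now attaches a monotonic reading to the value it returns, and time.Since uses that reading when it is present. The sketch below is an assumption about how trippedAt could be adapted, not a change to the listing above, and monoStamp is a hypothetical name.

```go
package main

import (
	"sync/atomic"
	"time"
)

// monoStamp records instants as nanoseconds elapsed since a fixed base.
// base keeps the monotonic reading that time.Now attached to it, so
// time.Since(base) is unaffected by wall-clock adjustments.
type monoStamp struct {
	base  time.Time
	nanos int64 // offset from base, accessed atomically
}

func newMonoStamp() *monoStamp {
	return &monoStamp{base: time.Now()}
}

// mark stores the current instant, the equivalent of writing trippedAt.
func (m *monoStamp) mark() {
	atomic.StoreInt64(&m.nanos, int64(time.Since(m.base)))
}

// olderThan reports whether at least d has passed since the last mark,
// the equivalent of the timeout check in Allow.
func (m *monoStamp) olderThan(d time.Duration) bool {
	return int64(time.Since(m.base))-atomic.LoadInt64(&m.nanos) >= int64(d)
}
```

The trade-off is that tests can no longer wind the timestamp back with a wall-clock value. They would shift nanos directly or inject a clock instead.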

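If your language does not expose CAS, a mutex (sync.Mutex in Go, or its equivalent) is still cheaper than a background thread. The lock is held for nanoseconds per request, just long enough to read or update the state. It is never held across I/O, and it never sleeps.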
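For comparison, a lock-based version of the same state machine might look like the following. It is written in Go only to stay in one language, and mutexBreaker, allow, and report are hypothetical names. It is also deliberately simplified: report does not know which request it belongs to, so a fuller version would return a per-request callback the way Allow does above.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

var errOpen = errors.New("circuit breaker is open")

type breakerState int

const (
	closed breakerState = iota
	open
	halfOpen
)

// mutexBreaker is a lock-based sketch of the lazy breaker. It has no
// timers and no goroutines.
type mutexBreaker struct {
	mu        sync.Mutex
	state     breakerState
	failures  int
	threshold int
	timeout   time.Duration
	trippedAt time.Time
}

// allow decides under the lock and releases it before the caller does
// any I/O, so the lock is never held across the downstream call.
func (b *mutexBreaker) allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case closed:
		return nil
	case open:
		if time.Since(b.trippedAt) < b.timeout {
			return errOpen
		}
		b.state = halfOpen // this caller becomes the single probe
		return nil
	default: // halfOpen: a probe is already in flight
		return errOpen
	}
}

// report records the outcome of a call that allow admitted.
func (b *mutexBreaker) report(success bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if success {
		b.failures = 0
		if b.state == halfOpen {
			b.state = closed
		}
		return
	}
	b.failures++
	if b.state == halfOpen || (b.state == closed && b.failures >= b.threshold) {
		b.state, b.trippedAt = open, time.Now()
	}
}
```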

Try It

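The complete implementation above is usable in production as a starting point. Make sure every caller invokes done exactly once, because a probe that never reports back leaves the breaker in HALF-OPEN. Add metrics, logging, and adaptive thresholds on top. But leave the goroutines out. Your runtime scheduler will thank you, your tests will run faster, and you will stop wondering why some container refuses to shut down cleanly.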