A circuit breaker with no goroutines, no timers, and no background overhead

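Every production circuit breaker I have reviewed ends up spawning background work. It might be a goroutine in Go, a ScheduledExecutorService in Java, or a tokio task in Rust. The job is always the same: wake up after a few seconds and move the breaker from OPEN to HALF-OPEN so that it can find out whether the downstream service has recovered.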

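That design is wrong. It leaks resources at scale, complicates shutdown, and creates race conditions that are genuinely hard to test. Worse, the background work is not needed at all. You can build a circuit breaker that never wakes itself up, never allocates a timer, and still detects recovery correctly.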

The hidden cost of the health-check goroutine

A circuit breaker tracks failures. After enough consecutive errors it trips to OPEN and starts rejecting requests immediately. The goal is to give the failing service room to breathe instead of smothering it with retry traffic.

The tricky part is deciding when to try again. Most libraries solve this with setTimeout or time.AfterFunc. In Go, a typical implementation looks like this:

func (cb *CircuitBreaker) Trip() {
    cb.state.Store(StateOpen)
    time.AfterFunc(cb.timeout, func() {
        cb.state.Store(StateHalfOpen)
    })
}

This works fine for a single breaker. It does not work for ten thousand.

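If you create one circuit breaker per downstream host (a common pattern in microservices), you can end up with ten thousand pieces of pending background work. With time.AfterFunc, each tripped breaker holds a runtime timer plus its closure, and the callback runs in its own goroutine when the timer fires. Implementations that run a per-breaker health-check loop instead park one goroutine each, starting at about 2 KB of stack, and add scheduling overhead. When the container restarts, that pending work races the shutdown deadline. And because the callback stores the new state unconditionally, a timer left over from an earlier trip can fire at exactly the wrong moment and cause flapping.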

The background thread solves a problem that does not exist. Recovery does not need to be detected proactively. It can be detected lazily, on the request path.

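How lazy recovery works

Instead of a timer that flips the breaker, store a single timestamp: the moment the breaker tripped to OPEN. On every incoming request, compare now against that timestamp plus the configured timeout. If enough time has passed, let exactly one probe through. If the probe succeeds, close the breaker. If it fails, go back to OPEN with a refreshed timestamp.

The state machine itself is unchanged. Only the transition trigger is different.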

CLOSED: requests pass through. Each failure increments a counter. When the counter reaches the threshold, atomically switch to OPEN and record trippedAt.
OPEN: every incoming request checks whether time.Now() - trippedAt >= timeout. If not, fail fast. If so, atomically switch to HALF-OPEN and let this one request through.
HALF-OPEN: exactly one request is in flight. If it succeeds, switch to CLOSED and reset the failure counter. If it fails, go back to OPEN and refresh trippedAt.

No goroutine wakes up. No timer is allocated. The breaker is completely passive until a request arrives.

A real implementation in Go

Here is a complete zero-background circuit breaker. It uses only sync/atomic for state transitions and stores the trip time as Unix nanoseconds in an int64.

package breaker

import (
	"errors"
	"sync/atomic"
	"time"
)

type State int32

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	// state, failures and trippedAt are accessed only with atomic operations.
	state     int32
	failures  int32
	threshold int32
	timeout   time.Duration
	trippedAt int64 // wall-clock nanoseconds since the Unix epoch
}

func New(threshold int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		threshold: int32(threshold),
		timeout:   timeout,
	}
}

func (cb *CircuitBreaker) State() State {
	return State(atomic.LoadInt32(&cb.state))
}

// Allow reports whether the current request may proceed.
// It returns a done function that must be called exactly once with the
// outcome. If a probe's done is never called, the breaker stays HALF-OPEN.
func (cb *CircuitBreaker) Allow() (done func(success bool), err error) {
	switch State(atomic.LoadInt32(&cb.state)) {
	case StateClosed:
		return cb.trackClosed, nil

	case StateOpen:
		// Lazy recovery check: has the timeout elapsed?
		if time.Now().UnixNano()-atomic.LoadInt64(&cb.trippedAt) < int64(cb.timeout) {
			return nil, errors.New("circuit breaker is open")
		}
		// Race: multiple goroutines may see this simultaneously.
		// Only one wins the CAS to HALF-OPEN.
		if atomic.CompareAndSwapInt32(&cb.state, int32(StateOpen), int32(StateHalfOpen)) {
			return cb.trackHalfOpen, nil
		}
		// Another goroutine won the race; fail fast this request.
		return nil, errors.New("circuit breaker is open")

	case StateHalfOpen:
		// Only one probe at a time. Every other request fails fast.
		return nil, errors.New("circuit breaker is half-open")
	}

	return nil, errors.New("unknown circuit breaker state")
}

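func (cb *CircuitBreaker) trackClosed(success bool) {
	// A result can arrive after another request has already tripped the
	// breaker. Ignore it so it neither extends the cooldown nor
	// overwrites HALF-OPEN.
	if State(atomic.LoadInt32(&cb.state)) != StateClosed {
		return
	}
	if success {
		atomic.StoreInt32(&cb.failures, 0)
		return
	}

	// Increment failures and trip if the threshold is reached.
	if atomic.AddInt32(&cb.failures, 1) >= cb.threshold {
		// Record the trip time before switching state so readers
		// never see OPEN with a stale trippedAt.
		atomic.StoreInt64(&cb.trippedAt, time.Now().UnixNano())
		// CAS so that only a CLOSED breaker can be tripped from here.
		atomic.CompareAndSwapInt32(&cb.state, int32(StateClosed), int32(StateOpen))
	}
}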

func (cb *CircuitBreaker) trackHalfOpen(success bool) {
	if success {
		atomic.StoreInt32(&cb.failures, 0)
		atomic.StoreInt32(&cb.state, int32(StateClosed))
		return
	}

	atomic.StoreInt64(&cb.trippedAt, time.Now().UnixNano())
	atomic.StoreInt32(&cb.state, int32(StateOpen))
}

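The key invariant: trippedAt is always written before the state switches to OPEN. A reader in Allow() that observes OPEN can then read trippedAt safely, knowing the value is fresh. The same holds when falling back from HALF-OPEN: trippedAt is refreshed before the state drops to OPEN, so the cooldown starts over from the beginning.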

Why most libraries don't do this

The lazy design has one obvious downside: recovery is detected only when a request arrives. If a service receives no traffic for an hour, the breaker stays OPEN for an hour.

That sounds bad. It isn't.

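If there are no requests, there is nothing to protect. A breaker exists to prevent cascading failure under traffic, not to maintain a real-time health dashboard. When the next request arrives, the timeout check takes nanoseconds and the probe goes out immediately. The probe is simply the first request to arrive once the cooldown has elapsed, so the effective recovery delay is bounded by the timeout plus the gap between requests.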

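On a high-traffic service the gap between requests is negligible. On a low-traffic service the gap can exceed the timeout, but a timer would not help there either: it would flip the state to HALF-OPEN earlier, and the probe would still be that same first request. In practice a background timer almost never improves actual recovery time.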

The other reason libraries use timers is historical. The circuit breaker pattern was popularized by libraries such as Hystrix on the JVM and Polly on .NET, at a time when a single breaker instance typically guarded an entire service dependency rather than each per-host connection. In that setting one background thread was acceptable. In a modern distributed system, where you may have a breaker per upstream endpoint, that assumption collapses.

Testing the race condition

The CAS from OPEN to HALF-OPEN is the only point where goroutines contend. If two requests arrive at the same moment after the timeout, only one proceeds as the probe. The other fails fast and can retry on the next request. This is the correct behavior. You do not want several probes in flight during recovery, because a single failure among several successes would be enough to send the breaker back to OPEN.
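The single-winner property is easy to check in isolation. The sketch below does not use the breaker type at all: it lines up many goroutines against a bare int32 that stands in for a breaker whose cooldown has already elapsed, and counts how many win the CAS. countWinners is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	open     int32 = 1
	halfOpen int32 = 2
)

// countWinners releases n goroutines at once against a state that is
// OPEN and counts how many of them win the CAS to HALF-OPEN.
func countWinners(n int) int32 {
	state := open
	var winners int32
	var wg sync.WaitGroup
	start := make(chan struct{})

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // line everyone up to maximize contention
			if atomic.CompareAndSwapInt32(&state, open, halfOpen) {
				atomic.AddInt32(&winners, 1) // this goroutine is the probe
			}
		}()
	}

	close(start) // release all goroutines together
	wg.Wait()
	return atomic.LoadInt32(&winners)
}

func main() {
	fmt.Println(countWinners(64)) // 1
}
```

However the scheduler interleaves the goroutines, exactly one CAS can observe OPEN, so the count is always 1.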

Because there is no asynchronous timer, testing is straightforward. You can manipulate trippedAt directly (or use a time wrapper) and write unit tests with no sleeps:

package breaker

import (
	"sync/atomic"
	"testing"
	"time"
)

func TestLazyRecovery(t *testing.T) {
	cb := New(1, time.Minute)

	// Trip the breaker.
	done, _ := cb.Allow()
	done(false)

	if cb.State() != StateOpen {
		t.Fatal("expected OPEN")
	}

	// Simulate timeout by winding back trippedAt.
	atomic.StoreInt64(&cb.trippedAt, time.Now().Add(-2*time.Minute).UnixNano())

	done, err := cb.Allow()
	if err != nil {
		t.Fatalf("expected probe to be allowed: %v", err)
	}

	// Success closes the breaker.
	done(true)
	if cb.State() != StateClosed {
		t.Fatal("expected CLOSED after successful probe")
	}
}

No time.Sleep, and no sync.WaitGroup to wait on goroutines. Because the implementation is synchronous, the test is deterministic.

What we gave up

There is one real loss: you cannot eagerly pre-warm the breaker before sending traffic. If you need to probe a dependency on a fixed schedule (say, every 5 seconds) to keep a connection pool warm, you still need a timer. But that timer belongs to the connection pool or the health checker, not to the circuit breaker. The breaker should protect the pool. It should not manage it.

Separate the concerns. Health checking keeps connections warm; circuit breaking prevents cascading overload. Merging the two adds complexity to both.

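The pattern in other languages

The same structure works anywhere you have an atomic compare-and-swap and a clock, ideally a monotonic one. (The Go version above stores wall-clock UnixNano values for simplicity, so a clock adjustment can shorten or lengthen a cooldown.) That covers Rust's std::sync::atomic, Java's AtomicIntegerFieldUpdater with System.nanoTime(), and C++'s std::atomic with a custom enum. In every case the implementation comes in under 100 lines.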

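If your language does not expose CAS, even a sync.Mutex (or its equivalent) is cheaper than a background thread. The mutex is held for a matter of nanoseconds per request, just long enough to read or update the state. It is never held across I/O and it never sleeps.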

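Try it yourself

The full implementation above is production-ready as a starting point. Layer metrics, logging, and adaptive thresholds on top of it. But leave the goroutines out. The runtime scheduler will thank you, your tests will run faster, and you will no longer have to wonder why that container refuses to shut down cleanly.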