AI Safety Stack: types, contracts, property tests, mutation gates

AI speed without a safety stack turns into fragility

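The danger of AI-generated code is not that it is always wrong.

The danger is that it looks correct enough to get merged.

That is the real risk. Code that is obviously broken gets stopped. The code that reaches production is the code that looks plausible, passes a few happy-path tests, and quietly weakens an important boundary.

If your workflow is only prompt, paste, review, merge, then the faster generation gets, the wider the gap grows between how fast you can change the system and how much you can trust it.

What you need is not heroic review effort. What you need is a layered safety stack that catches different failure modes at different points.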

Layer 1: prevention with types, schemas, and contracts

The cheapest defect is the one the program cannot express in the first place.

That is why the first layer is prevention.

This layer narrows the surface area before any runtime behavior is involved.

Branded types and phantom types prevent mix-ups between values that are structurally similar but mean different things.
Runtime schemas, such as Zod, guard the boundaries where untyped data comes in.
Contracts give critical code paths preconditions, postconditions, and invariants.
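
As a rough illustration of the first two items, the sketch below brands two string IDs so the compiler refuses to swap them, and validates an untyped payload before it enters typed code. The names `UserId`, `OrderId`, `OrderRequest`, `parseOrderRequest`, and `cancelOrder`, as well as the `u_`/`o_` prefixes, are assumptions made up for this example. The validator is hand-rolled to keep the sketch dependency-free; in practice a schema library such as Zod would play that role.

```typescript
// Branded types: both are strings at runtime, but distinct types at compile time.
type Brand<T, B extends string> = T & { readonly __brand: B };
type UserId = Brand<string, "UserId">;
type OrderId = Brand<string, "OrderId">;

// Hypothetical shape of a request after it has crossed the boundary.
interface OrderRequest {
  userId: UserId;
  orderId: OrderId;
  quantity: number;
}

// Hand-rolled runtime check for untyped input (a stand-in for a schema library).
function parseOrderRequest(input: unknown): OrderRequest {
  if (typeof input !== "object" || input === null) {
    throw new Error("expected an object");
  }
  const raw = input as Record<string, unknown>;
  const userId = raw["userId"];
  const orderId = raw["orderId"];
  const quantity = raw["quantity"];
  if (typeof userId !== "string" || !userId.startsWith("u_")) {
    throw new Error("userId must be a string starting with u_");
  }
  if (typeof orderId !== "string" || !orderId.startsWith("o_")) {
    throw new Error("orderId must be a string starting with o_");
  }
  if (typeof quantity !== "number" || !Number.isInteger(quantity) || quantity <= 0) {
    throw new Error("quantity must be a positive integer");
  }
  // The casts are the only place where brands are minted, and only after validation.
  return { userId: userId as UserId, orderId: orderId as OrderId, quantity };
}

function cancelOrder(orderId: OrderId): string {
  return `cancelled ${orderId}`;
}

const orderRequest = parseOrderRequest(
  JSON.parse('{"userId":"u_1","orderId":"o_9","quantity":2}'),
);
console.log(cancelOrder(orderRequest.orderId));
// cancelOrder(orderRequest.userId); // would not compile: UserId is not an OrderId
```

The design choice here is that brands can only be obtained through the parser, so any value typed as `OrderId` has already passed the boundary check.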

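This layer matters especially for AI-generated code. Models quite naturally produce implementations that look reasonable but contain category mistakes. When types and schemas are weak, the model's output gets through while it is still only "roughly right".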
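
The contracts mentioned in this layer can be sketched with two small assertion helpers. This is a minimal sketch, not a contract framework: `requires`, `ensures`, and `applyDiscount` are hypothetical names, and the discount rule is an assumption chosen only to have something to constrain.

```typescript
// Hypothetical helpers: fail loudly when a contract is violated.
function requires(condition: boolean, message: string): void {
  if (!condition) throw new Error(`precondition failed: ${message}`);
}

function ensures(condition: boolean, message: string): void {
  if (!condition) throw new Error(`postcondition failed: ${message}`);
}

// Hypothetical critical path: apply a percentage discount to a price in cents.
function applyDiscount(priceCents: number, percent: number): number {
  // Preconditions: reject inputs the function was never meant to handle.
  requires(Number.isInteger(priceCents) && priceCents >= 0, "price is a non-negative integer");
  requires(percent >= 0 && percent <= 100, "percent is within 0..100");

  const discounted = priceCents - Math.round((priceCents * percent) / 100);

  // Postcondition: a discount never raises the price or makes it negative.
  ensures(discounted >= 0 && discounted <= priceCents, "result stays within 0..price");
  return discounted;
}

console.log(applyDiscount(1000, 25)); // 750
```

If a generated change later flips a sign or drops the rounding, the postcondition fails at the call site instead of surfacing as a wrong invoice downstream.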

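Layer 2: verification with property-based tests

Example tests are useful, but both humans and models overfit to them easily.

A model can produce a function and a happy-path test that fits it exactly, both at the same time. It looks productive in the pull request, but the semantics may be barely pinned down.

Property-based testing changes the question. Instead of asking whether the code works for three examples, it asks what must always be true across a whole class of inputs.

The highest-ROI starting points are usually these.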

round trips
idempotence
ordering invariants
monotonicity
correct error behavior for invalid input

AI is surprisingly useful here. It is relatively good at proposing usable first-draft properties from a signature or doc comment. Human review is still needed, but the blank page problem shrinks considerably.
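
To make two of the properties listed above concrete (idempotence and an ordering invariant), here is a minimal hand-rolled sketch. It assumes a hypothetical `normalizeTags` function and uses a small seeded random generator instead of a property-testing library, so it runs without dependencies. A dedicated library would add input shrinking and richer generators.

```typescript
// Small seeded PRNG (mulberry32-style) so a failing run can be reproduced.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical function under test: trim, lowercase, drop empties, dedupe, sort.
function normalizeTags(tags: string[]): string[] {
  const cleaned = tags.map((t) => t.trim().toLowerCase()).filter((t) => t.length > 0);
  return Array.from(new Set(cleaned)).sort();
}

// Generator: short arrays of short strings drawn from a deliberately awkward alphabet.
function randomTags(rng: () => number): string[] {
  const alphabet = "aB cAb";
  const length = Math.floor(rng() * 8);
  const tags: string[] = [];
  for (let i = 0; i < length; i++) {
    const size = Math.floor(rng() * 4);
    let tag = "";
    for (let j = 0; j < size; j++) {
      tag += alphabet.charAt(Math.floor(rng() * alphabet.length));
    }
    tags.push(tag);
  }
  return tags;
}

// Runs the property against many generated inputs and reports the first counterexample.
function checkProperty(name: string, runs: number, property: (input: string[]) => boolean): void {
  const rng = makeRng(42);
  for (let i = 0; i < runs; i++) {
    const input = randomTags(rng);
    if (!property(input)) {
      throw new Error(`${name} failed for input ${JSON.stringify(input)}`);
    }
  }
}

// Idempotence: normalizing twice gives the same result as normalizing once.
checkProperty("idempotence", 500, (tags) => {
  const once = normalizeTags(tags);
  return JSON.stringify(normalizeTags(once)) === JSON.stringify(once);
});

// Ordering invariant: output is strictly ascending, which also rules out duplicates.
checkProperty("ordering invariant", 500, (tags) => {
  const out = normalizeTags(tags);
  return out.every((tag, i) => i === 0 || out[i - 1] < tag);
});
```

Neither property mentions a specific expected output, which is what makes them hard to overfit: an implementation has to be right for every generated input, not for three chosen ones.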

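Layer 3: assessment with mutation testing

Coverage is not a quality metric. It is an execution metric: it shows which lines ran, not whether any test checked what they produced.

Mutation testing asks the question that really matters: if the code changed into a plausibly broken form, would your tests notice?

That is why mutation testing sits on top of contracts and property tests. It does not replace them; it measures whether they are actually working.

This matters especially for AI-generated test suites. Models can easily produce tests that look impressive and execute many lines, yet verify almost nothing. Mutation testing exposes that false confidence.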
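
The idea can be shown by hand with a single mutant. The sketch below is an illustration of the concept, not of any particular tool: `isAdult`, the mutant, and both suites are hypothetical, and a real mutation testing tool would generate and run the mutants automatically.

```typescript
type Predicate = (age: number) => boolean;

// Original implementation (hypothetical rule: adults are 18 or older).
const isAdult: Predicate = (age) => age >= 18;

// A plausible mutant: the boundary operator changed from >= to >.
const isAdultMutant: Predicate = (age) => age > 18;

// A suite is modeled as a function that returns true when all its checks pass.
type Suite = (impl: Predicate) => boolean;

// Weak suite: executes every line of the function but never probes the boundary.
const weakSuite: Suite = (impl) => impl(30) === true && impl(5) === false;

// Stronger suite: adds the values on both sides of the boundary.
const strongSuite: Suite = (impl) =>
  weakSuite(impl) && impl(18) === true && impl(17) === false;

// A mutant is killed when the suite passes on the original but fails on the mutant.
function kills(suite: Suite, original: Predicate, mutant: Predicate): boolean {
  return suite(original) && !suite(mutant);
}

console.log(kills(weakSuite, isAdult, isAdultMutant));   // false: the mutant survives
console.log(kills(strongSuite, isAdult, isAdultMutant)); // true: the mutant is killed
```

Both suites give the same line coverage for `isAdult`; only the second one would notice the off-by-one change.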

The practical approach is not to run full mutation analysis on all code all the time. The practical approach is this.

start with critical modules
use incremental mutation testing on changed code
triage survivors carefully
raise thresholds as the suite matures

In the AI era, mutation testing is the antidote to false confidence.

Layer 4: runtime containment and recovery

Even a strong verification stack cannot catch everything.

That is why the outer layer is runtime containment.

This is where crash-only design, deadlines, circuit breakers, leases, and capability-based boundaries pay off. When something does leak through, they keep a single bad path from becoming a cascading incident.

For many teams, this layer starts small.

explicit timeouts on external calls
idempotency keys on state-changing endpoints
circuit breakers in front of unstable dependencies
narrow capability surfaces for sensitive operations
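
Two of these starting points, explicit timeouts and a circuit breaker, fit in a short sketch. Everything here is an assumption made for illustration: the names `withTimeout` and `CircuitBreaker`, the consecutive-failure threshold, and the fixed cooldown. A production breaker would typically add metrics and limit how many trial calls run at once.

```typescript
// Rejects if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Minimal circuit breaker: opens after `maxFailures` consecutive failures and
// rejects calls without touching the dependency until `cooldownMs` has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  // `now` is injectable so the cooldown can be tested without real waiting.
  constructor(maxFailures: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when a call may be attempted: closed, or open with the cooldown elapsed.
  allowRequest(): boolean {
    return this.openedAt === null || this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    // Also re-opens the circuit when a trial call after the cooldown fails.
    if (this.failures >= this.maxFailures) this.openedAt = this.now();
  }

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (!this.allowRequest()) {
      throw new Error("circuit open: dependency not called");
    }
    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }
}

// Usage sketch (hypothetical `fetchInventory` returning a Promise):
// const breaker = new CircuitBreaker(3, 30_000);
// const stock = await breaker.call(() => withTimeout(fetchInventory(), 2_000));
```

Wrapping the timeout inside the breaker means a slow dependency counts as a failing one, so repeated timeouts open the circuit instead of tying up callers.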

The goal is not perfection. The goal is a bounded blast radius.

Why the layers are stronger together

Each layer catches a different failure class.

Types and contracts prevent obviously invalid states. Property-based tests verify semantics. Mutation testing measures whether the tests really have teeth. Runtime containment handles whatever slips through.

This is the key idea. You do not need one perfect technique. You need several imperfect techniques that fail independently.

What a practical rollout looks like

Most teams are better off not enabling everything at once.

A realistic order looks roughly like this.

strengthen boundaries in TypeScript or Rust
add runtime schemas to external inputs
add contracts to critical functions
write property tests for serializers, reducers, and validators
introduce incremental mutation testing on high-risk modules
add runtime containment around dependencies that fail often

This order works because each layer strengthens the next. Better schemas produce better properties. Better properties raise mutation scores. Mutation feedback shows where contracts or test depth are weak.

What really changes for engineering in the AI era

The winners will not be the teams that generate the most code. They will be the teams that can absorb more generated code without lowering trust.

The safety stack is what makes that possible.

With it, AI goes from a speed amplifier to a reliability amplifier. The model helps generate implementations, tests, contracts, and rules. The stack ensures those artifacts are continuously verified by deterministic systems, not accepted because they look good.

If you are serious about using AI in production engineering, this is the bar to aim for. It is not one more review checklist, and it is not one more "be careful" prompt.

It is a layered system that every generated change must pass through: prevention, verification, assessment, and containment.

That is how you turn fast code into code you can trust.