The Review Problem

The standard advice for AI-generated code is “review it carefully.”

That advice is correct and useless at scale.

A developer reviewing AI output catches problems when they are alert, familiar with the domain, and not under time pressure. In every other condition — which is most conditions — things slip through.

An AI reviewer catching AI-generated problems is even less reliable. You are asking a probabilistic system to verify the output of another probabilistic system. The failure modes correlate.

The only pattern that scales is deterministic enforcement: rules that are checked on every commit, that cannot be overridden by fatigue or optimism, and that fail the build when violated.

What Deterministic Means

Deterministic means the check produces the same result every time, regardless of who runs it or when.

  • A linter rule is deterministic.
  • A type check is deterministic.
  • A boundary import restriction is deterministic.
  • A contract test is deterministic, provided it avoids live network calls and shared state.
  • A human reviewer’s attention span is not.
  • An LLM’s assessment is not.

This distinction matters more in AI-generated codebases because the volume of generated code exceeds what any team can review manually. You cannot scale review linearly with generation speed. You can scale deterministic checks trivially.

The Guardrail Stack

For AI-generated codebases, the guardrail stack has four layers:

Layer 1: Type System

The type system is the first guardrail. Strict types catch a class of errors that no review process — human or AI — consistently catches:

  • Null violations
  • Interface mismatches
  • Missing case handling
  • Wrong argument types

If the project uses TypeScript, "strict": true in tsconfig.json is non-negotiable. If it uses a language without a strong static type system, add one via tooling (for example, mypy for Python) or choose a different language.
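As a minimal sketch of two of these error classes, the snippet below assumes strict mode is enabled. The PaymentState type and both function names are hypothetical, chosen only for illustration. The point is that the compiler, not a reviewer, rejects a missing case or an unchecked undefined value.

```typescript
// Hypothetical domain type for illustration; not taken from any real codebase.
type PaymentState =
  | { kind: "pending" }
  | { kind: "settled"; settledAt: string }
  | { kind: "failed"; reason: string };

// Missing case handling: if a new variant is added to PaymentState and not
// handled below, the assignment to `never` becomes a compile-time error.
function describePayment(state: PaymentState): string {
  switch (state.kind) {
    case "pending":
      return "Payment is pending";
    case "settled":
      return `Payment settled at ${state.settledAt}`;
    case "failed":
      return `Payment failed: ${state.reason}`;
    default: {
      const unreachable: never = state;
      return unreachable;
    }
  }
}

// Null violations: the return type admits `undefined`, so strict null
// checking forces every caller to handle the "nothing found" case.
function findFailureReason(states: PaymentState[]): string | undefined {
  for (const s of states) {
    if (s.kind === "failed") return s.reason;
  }
  return undefined;
}

const reason = findFailureReason([
  { kind: "pending" },
  { kind: "failed", reason: "card declined" },
]);

// Writing `reason.toUpperCase()` directly would not compile under strict mode,
// because `reason` may be undefined. The explicit check is required.
const headline = reason !== undefined ? reason.toUpperCase() : "NO FAILURE";
```

Neither check depends on anyone noticing the problem. Deleting one of the case branches, or dropping the undefined check, turns into a build error.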

Layer 2: Architecture Rules

Architecture rules enforce boundary discipline:

  • No screen imports from another screen’s internals.
  • No domain module imports infrastructure directly.
  • No module bypasses the composition root for dependency access.
  • No vendor SDK appears outside its adapter module.

Tools like Semgrep, ArchUnit (for Java), dependency-cruiser (for JavaScript and TypeScript), or custom ESLint rules can enforce these statically. The key is that they run in CI and fail the build. A warning that developers can ignore is not a guardrail.
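To make the idea concrete, here is a deliberately simplified sketch of what such a boundary check computes. It is not how any of the tools above are implemented. It assumes import paths have already been extracted and resolved to project-relative paths, and the rule shape, file paths, and function name are all hypothetical.

```typescript
// Hypothetical rule shape: files under `from` may not import anything under `forbidden`.
interface BoundaryRule {
  name: string;
  from: string; // path prefix of the importing file
  forbidden: string; // path prefix that must not be imported
}

interface Violation {
  rule: string;
  file: string;
  importedPath: string;
}

// Pure function: the same import map and rules always yield the same violations.
// `imports` maps each file to the (already resolved) paths it imports.
function findViolations(
  imports: Record<string, string[]>,
  rules: BoundaryRule[],
): Violation[] {
  const found: Violation[] = [];
  for (const file of Object.keys(imports)) {
    for (const rule of rules) {
      if (!file.startsWith(rule.from)) continue;
      for (const importedPath of imports[file] ?? []) {
        if (importedPath.startsWith(rule.forbidden)) {
          found.push({ rule: rule.name, file, importedPath });
        }
      }
    }
  }
  return found;
}

const boundaryRules: BoundaryRule[] = [
  {
    name: "domain-must-not-import-infrastructure",
    from: "src/domain/",
    forbidden: "src/infrastructure/",
  },
];

// Illustrative import map; a real check would derive this from the source tree.
const importMap: Record<string, string[]> = {
  "src/domain/order.ts": ["src/domain/money.ts", "src/infrastructure/db.ts"],
  "src/infrastructure/db.ts": ["src/domain/order.ts"],
};

const boundaryViolations = findViolations(importMap, boundaryRules);

// In CI, a non-empty result should produce a non-zero exit code so the
// build fails instead of printing a warning.
const buildShouldFail = boundaryViolations.length > 0;
```

In practice an off-the-shelf tool is the better choice, because it handles path resolution, aliases, and re-exports. The sketch only shows why the result does not depend on who runs it or when.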

Layer 3: Contract Tests

Contract tests verify that modules satisfy their interfaces without running the full system:

  • The auth adapter satisfies the auth interface.
  • The analytics adapter satisfies the analytics interface.
  • The storage adapter satisfies the storage interface.

These run fast, test integration boundaries, and catch the specific failure mode where AI-generated code satisfies the type signature but violates the behavioral contract.
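The sketch below shows that failure mode for a storage adapter. The StorageAdapter interface, its method names, and both adapter classes are assumptions made for illustration, and the interface is synchronous only to keep the example short. Both adapters type-check against the interface, but only one satisfies the behavioral contract.

```typescript
// Hypothetical storage interface; method names are assumptions for illustration.
interface StorageAdapter {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
  delete(key: string): void;
}

// An adapter that satisfies both the types and the behavior.
class InMemoryStorage implements StorageAdapter {
  private readonly data = new Map<string, string>();
  get(key: string): string | undefined {
    return this.data.get(key);
  }
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
  delete(key: string): void {
    this.data.delete(key);
  }
}

// Type-correct but behaviorally wrong: `delete` silently does nothing.
class LeakyStorage implements StorageAdapter {
  private readonly data = new Map<string, string>();
  get(key: string): string | undefined {
    return this.data.get(key);
  }
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
  delete(_key: string): void {
    // intentionally broken for the example
  }
}

// The contract: behavioral rules every adapter must satisfy.
// Returns the names of broken rules; an empty list means the contract holds.
function checkStorageContract(makeAdapter: () => StorageAdapter): string[] {
  const failures: string[] = [];
  const adapter = makeAdapter();

  if (adapter.get("missing") !== undefined) {
    failures.push("get returns undefined for unknown keys");
  }
  adapter.set("k", "v1");
  if (adapter.get("k") !== "v1") {
    failures.push("get returns the value that was set");
  }
  adapter.set("k", "v2");
  if (adapter.get("k") !== "v2") {
    failures.push("set overwrites an existing value");
  }
  adapter.delete("k");
  if (adapter.get("k") !== undefined) {
    failures.push("delete removes the value");
  }
  return failures;
}

const goodAdapterFailures = checkStorageContract(() => new InMemoryStorage());
const leakyAdapterFailures = checkStorageContract(() => new LeakyStorage());
```

The same contract function can run against every adapter that claims to implement the interface, so a newly generated adapter is held to the same rules as the existing ones.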

Layer 4: Deterministic Integration Checks

At the integration level, checks verify system-wide properties:

  • The composition root resolves all dependencies without runtime errors.
  • The dependency graph contains no cycles.
  • All required environment variables are declared.
  • Configuration is valid at build time, not just at runtime.
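
Two of these checks are small enough to sketch directly. The code below is one possible approach, not a prescribed implementation: a depth-first search that reports a dependency cycle, and a check for missing environment variables. The module names, variable names, and the example connection string are hypothetical.

```typescript
// Module dependency graph: each key depends on the modules listed in its array.
type DependencyGraph = Record<string, string[]>;

// Depth-first search. Returns one cycle as a path whose first and last
// entries are the same module, or null if the graph is acyclic.
function findCycle(graph: DependencyGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  function visit(node: string): string[] | null {
    const current = state.get(node);
    if (current === "done") return null;
    if (current === "visiting") {
      // `node` is already on the stack, so the stack from there back to
      // `node` forms a cycle.
      return [...stack.slice(stack.indexOf(node)), node];
    }
    state.set(node, "visiting");
    stack.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

// Returns the required variable names that are missing or empty.
// `env` is passed in explicitly so the check does not read ambient state.
function missingEnvVars(
  required: string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => !env[name]);
}

// Hypothetical graphs for illustration.
const acyclicGraph: DependencyGraph = {
  app: ["auth", "storage"],
  auth: ["storage"],
  storage: [],
};
const cyclicGraph: DependencyGraph = {
  app: ["auth"],
  auth: ["storage"],
  storage: ["auth"],
};

const detectedCycle = findCycle(cyclicGraph);
const noCycle = findCycle(acyclicGraph);
const missingVars = missingEnvVars(["DATABASE_URL", "API_KEY"], {
  DATABASE_URL: "postgres://localhost/dev", // example value only
});
```

Run as a build step, a non-null cycle or a non-empty list of missing variables would end the process with a non-zero exit code.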

Semgrep for AI-Generated Code

Semgrep deserves specific mention because it excels at expressing architectural rules as code:

  • Pattern-matching on AST structure, not string matching.
  • Custom rules per project, not just generic linting.
  • Fast enough to run on every commit.
  • Expressive enough to encode boundary violations.

A team using AI generation extensively should maintain a Semgrep ruleset that encodes their architectural boundaries. When the AI generates code that violates a boundary, the build fails before merge. No human attention required.

This is not about catching bugs in AI output. It is about making structural violations impossible to merge regardless of how they were produced.

What AI Code Review Actually Catches

AI code review tools are useful for:

  • Style consistency
  • Documentation gaps
  • Obvious logic errors
  • Suggesting alternative approaches

AI code review tools are unreliable for:

  • Architectural boundary violations
  • Subtle coupling introduced across modules
  • Behavioral contract violations
  • Security properties that require whole-system reasoning

The failure mode is not that AI review misses things occasionally. The failure mode is that it misses things unpredictably, and you cannot know when it has missed something. That is why it cannot replace deterministic enforcement — only supplement it.

The Economics

Deterministic guardrails are cheap to run. Most of their cost is paid up front.

Writing the architecture rules takes a few days. Maintaining the Semgrep configuration takes ongoing attention. Setting up contract tests requires defining interfaces first.

But once they exist, they run on every commit at near-zero marginal cost. They do not get tired. They do not skip checks on Friday afternoons. They do not defer to seniority or social pressure.

For AI-generated codebases where code volume is high and generation speed is fast, this economic profile is decisive. The alternative — scaling human review to match AI generation speed — is not viable.

The Connection to AI-Native Architecture

I wrote about this broader pattern in “Stanford CS146S Is Right About AI Coding — The Missing Subject Is Architecture.” Deterministic guardrails are the enforcement mechanism that makes replaceable architecture real.

Without guardrails, architectural boundaries are aspirational. With them, boundaries are structural. The difference between “we try to keep modules isolated” and “the build fails if module isolation is violated” is the difference between architecture that survives AI-speed iteration and architecture that collapses under it.

The modern software developer does not just need AI tool fluency. They need a guardrail stack that makes AI-generated code structurally safe to ship at speed.

FAQ

What is the best tool for enforcing architecture rules on AI-generated code?

Semgrep is the most flexible option for custom architectural rules. It supports pattern-matching against AST structure, runs fast in CI, and lets teams encode project-specific boundaries. For JavaScript/TypeScript projects, dependency-cruiser and custom ESLint rules are also effective.

Can AI code review replace human code review?

No. AI code review supplements human review for style and documentation but is unreliable for architectural boundary enforcement and security properties. Deterministic checks (type system, linters, architecture rules, contract tests) are the only scalable replacement for attention-dependent review.

How do you set up guardrails without slowing down the team?

Start with the type system (strict mode, no exceptions). Add architecture rules for the three highest-risk boundaries. Add contract tests for external integrations. A first version of each layer takes roughly a day to set up and runs in seconds. The slowdown comes from violations, not from the checks themselves.

What is the difference between a guardrail and a linter?

A linter that only reports warnings suggests improvements. A guardrail fails the build. The distinction is enforcement, not tooling: a lint rule configured as an error and run in CI is a guardrail. In AI-generated codebases, suggestions get ignored at scale because the volume is too high for consistent manual attention. Only build failures guarantee compliance.
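
In ESLint terms, the difference can come down to one word of configuration. The sketch below writes flat-config entries as plain objects so it can be read without ESLint installed. The no-restricted-imports rule and its paths option are part of ESLint core, while the "vendor-sdk" package name, the file globs, and the helper functions are hypothetical. It also assumes the project already has a TypeScript-aware parser configured and that CI does not pass --max-warnings, which would make warnings fail the run as well.

```typescript
type Severity = "off" | "warn" | "error";
type RuleEntry = Severity | [Severity, ...unknown[]];

interface ConfigEntry {
  files: string[];
  rules: Record<string, RuleEntry>;
}

// Options for ESLint's core `no-restricted-imports` rule.
// "vendor-sdk" is a hypothetical package name.
const restriction = {
  paths: [{ name: "vendor-sdk", message: "Import the adapter module instead." }],
};

const warnEntry: RuleEntry = ["warn", restriction];
const errorEntry: RuleEntry = ["error", restriction];

// Suggestion: violations are reported, but the lint run still succeeds.
const asSuggestion: ConfigEntry[] = [
  { files: ["src/**/*.ts"], rules: { "no-restricted-imports": warnEntry } },
];

// Guardrail: violations fail the lint run, and therefore the build.
const asGuardrail: ConfigEntry[] = [
  { files: ["src/**/*.ts"], rules: { "no-restricted-imports": errorEntry } },
  // The adapter is the one place allowed to import the SDK directly.
  { files: ["src/adapters/vendor/**/*.ts"], rules: { "no-restricted-imports": "off" } },
];

// Hypothetical helpers: extract a rule's severity and decide whether a
// violation of it can fail the build.
function severityOf(entry: RuleEntry): Severity {
  return typeof entry === "string" ? entry : entry[0];
}

function canFailBuild(entry: RuleEntry): boolean {
  return severityOf(entry) === "error";
}
```

In a real project, one of these arrays would be the default export of eslint.config.js. The rule and its options are identical in both. Only the severity decides whether the check is advice or enforcement.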