Your Gherkin specs are lying to you

Your Gherkin specs are lying to you.

Not intentionally. They started out faithful. But six sprints later, someone refactored the checkout flow and forgot to update the "When the user submits payment" step. The .feature file still passes, because the step definition still exists. It just calls code that no longer matches what the scenario describes. You have green tests and false confidence. This is the default trajectory of BDD unless you actively fight it.

The problem is not that developers are lazy. It is that the relationship between .feature files and step definitions is fundamentally loose. Gherkin scenarios are strings. Step definitions are regexes or annotations that match those strings. There is no compiler enforcing that a scenario change requires a corresponding code change, or vice versa. The toolchain assumes you will manually keep them aligned. You will not.
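To make that looseness concrete, here is a minimal sketch that does not use any Cucumber API. The pattern and both step texts are hypothetical; the point is only that a generalized regex keeps binding after the scenario wording has changed meaning.

```typescript
// Hypothetical step definition pattern, generalized during a refactor.
const submitsPattern = /^the user submits (.+)$/;

// The original scenario wording and a later rewording with a different meaning.
const originalStep = 'the user submits payment';
const rewordedStep = 'the user submits a refund request';

// Both texts bind to the same definition, so nothing fails at match time.
const originalMatches = submitsPattern.test(originalStep);
const rewordedMatches = submitsPattern.test(rewordedStep);

console.log({ originalMatches, rewordedMatches });
```

Nothing in the matching step tells you that the code behind the definition was written for payments and not refunds.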

Why manual discipline fails at scale

Every team starts with the same plan: write the spec, implement the steps, update both together. This works in week one.

It breaks down during refactoring. You rename a domain concept in code, but the Gherkin still uses the old terminology because changing it means updating twelve feature files and re-reviewing them with product. Or you extract a new validation rule, but the existing scenario implicitly relied on the old behavior, and nobody noticed because the step definition was quietly generalized to keep the test passing. The specs become a parallel, increasingly inaccurate universe.

The cost is not just outdated documentation. It is trust. Once developers stop believing the feature files describe reality, they stop reading them. Then they stop writing them. Then you are back to unit tests with opaque names and no shared language with stakeholders.

What “in sync” actually means

Keeping specs in sync is not about making the tests pass. Passing is easy. In sync means three things:

Every Gherkin step has a corresponding step definition that does what the spec says.
Every step definition is actually reached by at least one scenario.
The language in the spec matches the language in the codebase.

Most teams only verify the first point, and they do it at runtime. You need to verify all three, and you need to do it in CI before the code merges.
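The third point is the hardest to check mechanically. One rough approach, sketched below, assumes the team keeps a small glossary that maps retired domain terms to their replacements. The glossary contents and the findRetiredTerms helper are hypothetical and only illustrate the idea.

```typescript
// Hypothetical glossary: retired domain terms mapped to their current names.
// Terms are assumed to be plain words with no regex metacharacters.
const retiredTerms: Record<string, string> = {
  basket: 'cart',
  customer: 'user',
};

interface TermFinding {
  step: string;
  retired: string;
  replacement: string;
}

// Return one finding per retired term that appears as a whole word in a step.
function findRetiredTerms(steps: string[], glossary: Record<string, string>): TermFinding[] {
  const termFindings: TermFinding[] = [];
  for (const step of steps) {
    for (const retired of Object.keys(glossary)) {
      const wholeWord = new RegExp(`\\b${retired}\\b`, 'i');
      if (wholeWord.test(step)) {
        termFindings.push({ step, retired, replacement: glossary[retired] });
      }
    }
  }
  return termFindings;
}

const sampleSteps = [
  'the user adds an item to the cart',
  'the customer empties the basket',
];
const termFindings = findRetiredTerms(sampleSteps, retiredTerms);
for (const f of termFindings) {
  console.warn(`"${f.step}": use "${f.replacement}" instead of "${f.retired}"`);
}
```

A check like this only catches vocabulary drift. It cannot tell whether a step definition still does what the step says.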

Automated step validation with strict binding

The loose string matching in tools like Cucumber is the root cause. You can tighten it by making step definitions first-class references that the build can validate.

In TypeScript or JavaScript projects, you can replace regex-based step definitions with an explicit step registry that maps Gherkin step text to actual function references. The key is that the mapping is checked at build time, not discovered at runtime, so the build fails if a scenario references a step that does not exist.

Here is a minimal setup using the official Gherkin parser and a step registry. First, parse your .feature files at build time:

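// scripts/validate-steps.ts
import { readFileSync, readdirSync } from 'fs';
import { AstBuilder, GherkinClassicTokenMatcher, Parser, compile } from '@cucumber/gherkin';
import { IdGenerator } from '@cucumber/messages';
// Assumption: @cucumber/cucumber-expressions is installed, so registry keys with
// placeholders such as {int} can be matched against concrete step text.
import { CucumberExpression, ParameterTypeRegistry } from '@cucumber/cucumber-expressions';
// The actual step registry from your test code
import { stepRegistry } from '../steps/registry';

const newId = IdGenerator.uuid();
const parser = new Parser(new AstBuilder(newId), new GherkinClassicTokenMatcher());

// Only the top level of ./features is scanned, not subdirectories.
const featureFiles = readdirSync('./features').filter(f => f.endsWith('.feature'));
const allSteps = new Set<string>();

for (const file of featureFiles) {
  const uri = `./features/${file}`;
  const gherkinDocument = parser.parse(readFileSync(uri, 'utf-8'));
  // Compiling to pickles expands backgrounds, rules, and scenario outlines
  // into the concrete steps that would actually run.
  for (const pickle of compile(gherkinDocument, uri, newId)) {
    for (const step of pickle.steps) {
      allSteps.add(step.text);
    }
  }
}

// Turn each registry key into an expression that can match step text.
const parameterTypes = new ParameterTypeRegistry();
const definitions = Object.keys(stepRegistry).map(key => ({
  key,
  expression: new CucumberExpression(key, parameterTypes),
}));
const stepTexts = [...allSteps];

// Steps that no registry entry matches, and entries that no step uses.
const undefinedSteps = stepTexts.filter(
  text => !definitions.some(d => d.expression.match(text) !== null),
);
const orphanedSteps = definitions
  .filter(d => !stepTexts.some(text => d.expression.match(text) !== null))
  .map(d => d.key);

if (undefinedSteps.length > 0) {
  console.error('Undefined steps:', undefinedSteps);
  process.exit(1);
}

if (orphanedSteps.length > 0) {
  console.error('Orphaned steps:', orphanedSteps);
  process.exit(1);
}

console.log(`Validated ${allSteps.size} steps against ${definitions.length} definitions.`);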

Your step registry exposes functions keyed by their Gherkin step text, with placeholders such as {int} standing in for parameters:

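// steps/registry.ts
import { given, when, then } from './step-helpers';

// A step function receives the values captured by placeholders such as {int}.
type StepFn = (...args: any[]) => unknown;

// Keys are Gherkin step texts; values are references to real functions.
export const stepRegistry: Record<string, StepFn> = {
  'the user is logged in': given.theUserIsLoggedIn,
  'the user adds an item to the cart': when.theUserAddsAnItemToTheCart,
  'the total should be {int}': then.theTotalShouldBe,
};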

The given, when, and then objects are plain modules with functions. There is no regex magic. If a developer changes the Gherkin text, they must add a corresponding entry to the registry, or the build fails. If they delete a scenario, the orphaned step detection catches the leftover definition.
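For illustration, here is one way a runner might resolve step text against such a registry at execution time. This is a sketch under stated assumptions: demoRegistry is a stand-in so the example runs on its own, resolveStep and keyToRegex are hypothetical helpers, and only the {int} placeholder is handled.

```typescript
type DemoStepFn = (...args: number[]) => string;

// Stand-in registry for illustration; a real one would reference step helpers.
const demoRegistry: Record<string, DemoStepFn> = {
  'the user is logged in': () => 'logged in',
  'the total should be {int}': (total) => `total checked: ${total}`,
};

// Turn a registry key into an anchored regex. Only {int} is supported here.
function keyToRegex(key: string): RegExp {
  const escaped = key.replace(/[.*+?^$()|[\]\\]/g, '\\$&');
  return new RegExp(`^${escaped.replace(/\{int\}/g, '(-?\\d+)')}$`);
}

// Find the first entry whose key matches the step text and call its function
// with the captured integers. Unknown steps are an error, never a silent skip.
function resolveStep(text: string, registry: Record<string, DemoStepFn>): string {
  for (const key of Object.keys(registry)) {
    const match = keyToRegex(key).exec(text);
    if (match) {
      return registry[key](...match.slice(1).map(Number));
    }
  }
  throw new Error(`Undefined step: ${text}`);
}

console.log(resolveStep('the total should be 42', demoRegistry));
```

The lookup is a plain walk over the registry keys, so the same keys that the build validates are the ones that run.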

Tie it into CI before merge

A script that developers run locally is a script that developers forget to run. You need to make the validation fail the build.

Add it to your test pipeline:

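# .github/workflows/ci.yml
name: CI
on: pull_request
jobs:
  validate-specs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 'lts/*'
      - run: npm ci
      # Fail fast on spec drift before running the full suite
      - run: npx ts-node scripts/validate-steps.ts
      - run: npm test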

The important detail is that validate-steps.ts runs before the actual test suite. If there is a mismatch between feature files and step definitions, you want to fail fast with a clear error, not run a hundred Cucumber scenarios that might silently pass on stale logic.

Living documentation requires generated reports

Validation keeps the syntax aligned, but it does not guarantee the specs are readable or useful. For that, you need a living documentation pipeline that generates HTML reports from your feature files and publishes them on every merge to main.

Tools like Cucumber Reports or Pickles can turn your .feature files into browsable docs. The key is that the docs are generated from the same files that CI validates. If a scenario is removed, it disappears from the docs. If the language changes, the docs update automatically. There is no second source of truth to maintain.

Publish the report as an artifact in CI, or deploy it to a static site:

# .github/workflows/docs.yml
# Pickles is a .NET tool distributed through Chocolatey and NuGet, not npm,
# so this sketch assumes a Windows runner and the Chocolatey package.
name: Docs
on:
  push:
    branches: [main]
jobs:
  publish-docs:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4
      - run: choco install pickles -y
      - run: pickles --feature-directory=./features --output-directory=./docs
      - uses: actions/upload-pages-artifact@v3
        with:
          path: ./docs

Stakeholders do not need to read raw Gherkin. They need a readable page that they trust is current. Automation builds that trust.

The trade-off: strictness versus expressiveness

The registry approach has a cost. You lose the flexibility of regex patterns like /^the user adds (\d+) items? to the cart$/. Every variant becomes an explicit entry, or a parameterized step with typed placeholders. This is verbose.

The alternative is keeping regexes but adding a stricter linter that warns when a pattern is too broad or when a step text does not match any known pattern. You can get 80% of the safety with 20% of the verbosity by using Cucumber’s built-in dry-run and publish flags, combined with a custom linter that checks for unused step definitions.

# Dry-run parses all features without executing them, surfacing undefined steps
npx cucumber-js --dry-run

This is less strict than the registry approach. It catches undefined steps, but not orphaned ones, and it does not enforce semantic alignment. For teams with large existing suites, it is a pragmatic starting point. For new projects, the registry approach pays off within a month.
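The custom linter mentioned above could look roughly like the following sketch. It assumes you have already collected the regex patterns from your step definitions and the step texts from your feature files; that collection is not shown, and lintPatterns is a hypothetical helper, not part of Cucumber.

```typescript
interface LintReport {
  unused: string[];    // patterns that no step text matched
  tooBroad: string[];  // patterns containing an unbounded wildcard (.* or .+)
  unmatched: string[]; // step texts that no pattern matched
}

// Compare regex step patterns against step texts taken from feature files.
// Patterns are assumed not to use the g flag, which makes test() stateful.
function lintPatterns(patterns: RegExp[], stepTexts: string[]): LintReport {
  const unused = patterns
    .filter((p) => !stepTexts.some((t) => p.test(t)))
    .map((p) => p.source);
  const tooBroad = patterns
    .filter((p) => /\.[*+]/.test(p.source))
    .map((p) => p.source);
  const unmatched = stepTexts.filter((t) => !patterns.some((p) => p.test(t)));
  return { unused, tooBroad, unmatched };
}

const lintReport = lintPatterns(
  [
    /^the user adds (\d+) items? to the cart$/,
    /^the user submits (.*)$/,
    /^the order is archived$/,
  ],
  ['the user adds 2 items to the cart', 'the user submits payment', 'the total should be 10'],
);
console.log(lintReport);
```

In CI you would exit non-zero when unused or unmatched is non-empty, and treat tooBroad as a warning until the team agrees on which wildcards are acceptable.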

What we tried that did not work

We experimented with generating Gherkin from code comments. The idea was that developers would annotate their test methods, and a tool would produce the .feature files. It failed because Gherkin is supposed to be readable by non-developers. Generated prose from method names is not readable. It is not even prose.

We also tried enforcing pair programming for every spec change. It helped, but it did not scale. The problem is mechanical, and the fix should be mechanical too.

Start with undefined step detection today

If you have an existing Cucumber suite, the smallest useful change is adding --dry-run to your CI pipeline. It takes five minutes and it will catch the most common drift: a refactored scenario that no longer matches any step definition.

If you are starting fresh, consider a registry-based approach. The upfront cost of explicit mappings is repaid by build-time guarantees and the confidence to refactor freely without worrying that your specs are silently going stale.

Your Gherkin specs should describe what the system does. If you cannot trust them to do that, they are just expensive comments. Automate the checks that keep them honest, or accept that they will lie to you.