Running the Same Prompt Five Times Produces Five Copies of the Same Mistake

N-version programming assumes diversity comes from different authors. With LLMs, the obvious translation is different models, different providers, maybe different training runs. That translation is too narrow. You can get meaningful diversity from the exact same model by changing how you ask, not what you ask.

The catch: turning the temperature up to 1.0 and running your prompt five times is not a strategy. You’ll get surface-level variation. The variable names change. The comments shuffle around. The structure stays identical, and the bugs stay identical too.

If you want implementations that fail independently, you need to prompt for different thinking patterns, not different outputs.

What N-Version Programming Actually Needs from LLMs

N-version programming is a fault tolerance technique where multiple independent implementations of the same specification are executed in parallel. The outputs are compared, and a majority vote determines the correct result. The idea is that different developers, working independently, will introduce different bugs. The bugs won’t correlate, so a majority vote suppresses them.

It is an old idea. It is also expensive. You are paying N teams to build the same thing.

LLMs make it cheap enough to try. Instead of N teams, you have N API calls. The problem is that N API calls to the same model with the same prompt produce N nearly identical implementations. The bugs correlate almost perfectly, so the majority vote just confirms the shared mistake.
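A quick calculation shows why the correlation matters. The sketch below assumes every version fails on a given input with the same probability p, and it counts the vote as lost whenever more than half of the versions fail. Both are simplifying assumptions, and the function name is only illustrative.

```python
from math import comb

def majority_failure_probability(n: int, p: float) -> float:
    """Probability that more than half of n independent versions fail."""
    smallest_majority = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(smallest_majority, n + 1)
    )

p = 0.1  # assumed per-version failure rate
independent = majority_failure_probability(3, p)  # failures are uncorrelated
correlated = p  # perfectly correlated: all versions fail together

print(f"independent: {independent:.3f}")
print(f"correlated:  {correlated:.3f}")
```

Under these assumptions, three independent versions lose the vote about 2.8% of the time, while three perfectly correlated versions lose it 10% of the time, exactly as often as a single version would.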

The fix is to treat the prompt as the developer, not the model. Different prompts produce different developers.

Why Temperature Alone Produces Cosmetic Diversity

Temperature rescales the model’s probability distribution over next tokens. At high temperature the distribution flattens, so the model samples less likely tokens more often. This creates variation in phrasing, variable naming, and superficial structure.

It rarely creates variation in algorithmic approach. If you ask for a function to find the longest palindromic substring, temperature changes whether the code uses left and right or l and r. It seldom changes whether the model reaches for expand-around-center or dynamic programming.

For N-version programming, that is useless. You need implementations that solve the problem differently, not implementations that look different while solving it the same way.
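To make the temperature mechanism concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of logits. The logit values are invented for illustration, and real inference stacks may apply further steps such as top-p filtering, so treat this as a picture of the idea rather than of any provider’s sampler.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens
logits = [4.0, 2.0, 1.0]

for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Raising the temperature moves probability from the top token to the alternatives, but every alternative is still a token-level choice. Nothing in this step selects between whole algorithms.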

Four Prompting Strategies That Force Algorithmic Diversity

Here are four approaches that change how the model thinks about the problem.

Vary the Problem Framing

The same task framed as “write a parser” versus “write a state machine that recognizes this grammar” will produce different code. One might use recursive descent. The other might use a table-driven approach.

You can automate this by asking the model to adopt a specific framing before solving:

import os
from openai import OpenAI

# Requires the openai package (v1 client) and OPENAI_API_KEY in the environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_with_framing(task: str, framing: str) -> str:
    """Ask the model to solve `task` under one specific problem framing."""
    # The framing comes first so it shapes how the task is read
    prompt = f"""{framing}

Task: {task}

Write a complete, correct implementation. Do not explain your approach."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # message.content is optional in the SDK; fall back to "" to match the return type
    return response.choices[0].message.content or ""

task = "Parse a CSV string into a list of dictionaries, handling quoted fields and newlines within quotes."

# One framing per intended algorithmic approach
framings = [
    "Approach this as a finite state machine with explicit state transitions.",
    "Approach this using recursive descent parsing with a lexer and parser.",
    "Approach this by splitting on delimiters and post-processing edge cases.",
]

for framing in framings:
    print(f"=== {framing} ===")
    print(generate_with_framing(task, framing))

In runs against GPT-4o, the state machine framing consistently produces a character-by-character parser with an explicit state enum. The recursive descent framing produces a lexer and separate parser functions. The split-and-fix framing produces a more compact but brittle solution.
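Model replies often wrap code in a markdown fence even when told not to explain, so the text usually needs a cleanup step before it can be tested. The helper below is a hypothetical sketch, not part of any SDK: it returns the first fenced block if one exists and the whole reply otherwise.

```python
import re

FENCE = "`" * 3  # built this way so the example contains no literal fence

def extract_code(reply: str) -> str:
    """Return the first fenced code block in a reply, or the whole reply if none exists."""
    # Optional language tag after the opening fence, then the shortest body up to the next fence
    pattern = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    return match.group(1).strip("\n") if match else reply.strip()

# Hypothetical model reply for illustration
reply = f"Here is the parser:\n{FENCE}python\ndef parse(text):\n    return text.split(',')\n{FENCE}\nDone."
print(extract_code(reply))
```

Taking only the first block is a design choice that suits single-function answers. A reply that spreads the implementation over several blocks would need a different rule.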

Switch Personas

Different personas prime different knowledge. A systems programmer writes different code than a data scientist or a competitive programmer.

personas = [
    "You are a systems programmer who prioritizes memory efficiency and avoids unnecessary allocations.",
    "You are a Pythonic developer who prefers concise, idiomatic code using standard library features.",
    "You are an algorithms researcher who reaches for theoretically optimal solutions even if the code is longer.",
]

Persona prompting is surprisingly effective for structural diversity. The systems programmer reaches for arrays and indices. The Pythonic developer reaches for itertools and comprehensions. The algorithms researcher might pull in a library or write a more formal solution.
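One plausible way to wire a persona into a request is to put it in the system message and keep the task in the user message. That placement is an assumption of this sketch, as is the helper name. The returned list has the shape that the chat completions `messages` parameter expects.

```python
def build_persona_messages(task: str, persona: str) -> list[dict[str, str]]:
    """Put the persona in the system message and the task in the user message."""
    return [
        {"role": "system", "content": persona},
        {
            "role": "user",
            "content": f"Task: {task}\n\nWrite a complete, correct implementation.",
        },
    ]

# Two of the personas from the list above, plus an example task
personas = [
    "You are a systems programmer who prioritizes memory efficiency and avoids unnecessary allocations.",
    "You are a Pythonic developer who prefers concise, idiomatic code using standard library features.",
]
task = "Parse a CSV string into a list of dictionaries."

for persona in personas:
    messages = build_persona_messages(task, persona)
    print(messages[0]["content"])
```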

Constrain the Available Tools

Restricting or expanding the available toolkit forces different approaches.

constraints = [
    "You may only use the Python standard library. No external dependencies.",
    "You may use numpy and pandas. Optimize for vectorized operations.",
    "You must implement this without using regular expressions.",
]

This is particularly useful when you know one approach has a blind spot. If your regex-based parsers keep mishandling nested quotes, force a version that does not use regex.
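A constraint in the prompt is a request, not a guarantee, so it is worth checking that the generated code actually honors it. The sketch below uses the standard `ast` module to list imported modules. The helper names are hypothetical, and the check only catches direct imports of `re`, not other ways of reaching a regex engine.

```python
import ast

def imported_modules(source: str) -> set[str]:
    """Collect the top-level module names that a piece of Python source imports."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # node.module is None for relative imports such as "from . import x"
            modules.add(node.module.split(".")[0])
    return modules

def violates_no_regex(source: str) -> bool:
    """True if the source imports the standard regular expression module."""
    return "re" in imported_modules(source)

# Hypothetical generated snippets
with_regex = "import re\n\ndef split(line):\n    return re.split(',', line)\n"
without_regex = "import csv, io\n\ndef split(line):\n    return next(csv.reader(io.StringIO(line)))\n"

print(violates_no_regex(with_regex))
print(violates_no_regex(without_regex))
```

A version that fails the check can be regenerated or dropped before it reaches the test suite.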

Chain-of-Thought with Divergent Reasoning

Instead of asking for code directly, ask the model to generate multiple solution strategies and pick the least obvious one.

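# Reuses `task` and `client` from the framing example
cot_prompt = f"""Task: {task}

First, list three different algorithms or approaches to solve this problem.
Then, pick the one that is most different from the others and implement it.
Do not pick the most obvious approach."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0.7,
)
# The reply holds the list of approaches followed by the code,
# so the implementation has to be extracted from it before testing
reply = response.choices[0].message.content or ""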

The chain-of-thought forces the model to surface its reasoning. The “most different” constraint pushes it away from the default solution. In practice, this produces the highest structural diversity of any single technique.

Where This Breaks Down: The Diversity Ceiling

Same-model diversity has limits, and you will hit them.

Fundamental knowledge gaps are shared. If the model has absorbed a systematic misunderstanding about floating-point comparison from its training data, every framing and every persona will tend to reproduce it. The model has one set of weights, and prompts do not change them.

There are also diminishing returns. The first three framings might give you a state machine, a recursive parser, and a split-based approach. The fourth framing might give you a state machine with different variable names. After three to five genuinely different approaches, you are scraping the bottom of the barrel.
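One rough way to notice when a new version is only a renamed copy is to compare the shapes of the syntax trees while ignoring identifiers. The sketch below compares the sequence of AST node types with `difflib`. It is a crude proxy rather than a real measure of algorithmic difference, and the function names are illustrative.

```python
import ast
from difflib import SequenceMatcher

def node_types(source: str) -> list[str]:
    """List the AST node type names of a program, ignoring identifiers and constants."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def structural_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two programs have the same node-type sequence."""
    return SequenceMatcher(None, node_types(a), node_types(b), autojunk=False).ratio()

version_a = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Same loop as version_a with every name changed
version_b = """
def add_all(numbers):
    acc = 0
    for n in numbers:
        acc += n
    return acc
"""

# Different structure: delegates to a built-in
version_c = """
def total(values):
    return sum(values)
"""

print(structural_similarity(version_a, version_b))
print(structural_similarity(version_a, version_c))
```

The renamed pair scores 1.0 because only identifiers differ, while the pair with a different structure scores lower. Where to set a cutoff for “cosmetic” is a judgment call that depends on the task.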

Some techniques degrade quality. The “most different” constraint occasionally produces solutions that are different because they are wrong. Divergence for its own sake is not useful. You need a voting or testing mechanism to filter out the bad ideas.

A Practical Setup You Can Deploy Today

If you are building this into a system, do not randomize. Design your diversity.

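Pick three to five techniques from the list above. Generate one implementation per technique. Run your test suite or property-based tests against all of them. Keep the ones that pass. At runtime, run the survivors on each input and take a simple majority vote over their results.

from collections import Counter
from typing import Any, Callable

def majority_vote(impls: list[Callable], test_fn: Callable[[Callable], bool], value: Any) -> Any:
    # Drop every implementation that fails the test suite
    passing = [impl for impl in impls if test_fn(impl)]
    if not passing:
        raise RuntimeError("No implementation passed tests")

    # Vote on the results, not the source text: diverse implementations
    # never match as strings. repr() makes unhashable results countable.
    results = [impl(value) for impl in passing]
    # The most common result wins (a plurality if there is no strict majority)
    winner = Counter(repr(r) for r in results).most_common(1)[0][0]
    return next(r for r in results if repr(r) == winner)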

The test filtering step is non-negotiable. Diversity without correctness is just noise.
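One way to implement the filtering step is to load each generated source into a callable and run it against a table of cases. The sketch below does that with `exec`, which is only acceptable for illustration: real model output should run in a sandbox. The helper names and the `split_fields` task are hypothetical.

```python
from typing import Any, Callable

def load_function(source: str, name: str) -> Callable[..., Any]:
    """Run generated source in a fresh namespace and return one function from it.

    exec() runs arbitrary code, so real model output belongs in a sandbox
    (separate process, container, or similar), not in your main process.
    """
    namespace: dict[str, Any] = {}
    exec(source, namespace)
    fn = namespace.get(name)
    if not callable(fn):
        raise ValueError(f"source does not define a callable named {name!r}")
    return fn

def passes_tests(fn: Callable[..., Any], cases: list[tuple[Any, Any]]) -> bool:
    """True if fn returns the expected value for every (argument, expected) pair."""
    for argument, expected in cases:
        try:
            if fn(argument) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True

# Hypothetical generated sources for a function named split_fields
candidates = {
    "strip": "def split_fields(line):\n    return [f.strip() for f in line.split(',')]",
    "raw": "def split_fields(line):\n    return line.split(',')",
}
cases = [("a, b", ["a", "b"]), ("x", ["x"])]

survivors = [
    label
    for label, source in candidates.items()
    if passes_tests(load_function(source, "split_fields"), cases)
]
print(survivors)
```

Here only the "strip" candidate survives, because the "raw" candidate leaves the space in front of the second field.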

FAQ

Does this work with smaller models?

Yes, but the diversity ceiling is lower. Smaller models have typically learned fewer distinct solution strategies, so you might get two genuinely different approaches instead of four. The techniques still work; they just produce less variation.

How many implementations do I actually need?

Three is the practical minimum for majority voting. Five gives you better coverage at a cost that grows linearly with N. After five, same-model diversity degrades into cosmetic variation. If you need more than five, switch to cross-model diversity.

Is same-model diversity as good as cross-model diversity?

No. Different models have different training data, architectures, and fine-tuning. They fail in genuinely different ways. Same-model diversity trades some independence for lower cost and simpler operations. Use it when you need good fault tolerance fast, not when you need perfect fault tolerance.

Can I combine these techniques?

Yes. A persona prompt combined with a tool constraint and a chain-of-thought step will usually produce more diversity than any single technique alone. The cost is a longer prompt and more tokens per generation. For critical code paths, the extra tokens are worth it.
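A combined setup can be sketched as a grid: every persona paired with every constraint, with the divergent-reasoning step appended to each prompt. The function name and the message layout are assumptions of this sketch. In practice you would likely pick a handful of cells from the grid rather than run all of them.

```python
from itertools import product

DIVERGENT_STEP = (
    "First, list three different algorithms or approaches to solve this problem.\n"
    "Then, pick the one that is most different from the others and implement it."
)

def combined_prompts(
    task: str, personas: list[str], constraints: list[str]
) -> list[list[dict[str, str]]]:
    """Build one chat message list per (persona, constraint) pair."""
    prompts = []
    for persona, constraint in product(personas, constraints):
        user_content = f"{constraint}\n\nTask: {task}\n\n{DIVERGENT_STEP}"
        prompts.append(
            [
                {"role": "system", "content": persona},
                {"role": "user", "content": user_content},
            ]
        )
    return prompts

personas = [
    "You are a systems programmer who prioritizes memory efficiency and avoids unnecessary allocations.",
    "You are an algorithms researcher who reaches for theoretically optimal solutions even if the code is longer.",
]
constraints = [
    "You may only use the Python standard library. No external dependencies.",
    "You must implement this without using regular expressions.",
]

grid = combined_prompts("Parse a CSV string into a list of dictionaries.", personas, constraints)
print(len(grid))
```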

What to Try First

Start with framing variation. It is the easiest to implement and produces the most consistent structural diversity. Add persona switching if you need more. Save cross-model diversity for the cases where same-model diversity hits its ceiling.

Run your implementations through the same test suite before you let them vote. An untested diverse implementation is just a buggy implementation you have not met yet.