Same Prompt, Four Agents, Four Results

I gave four AI agents the same instruction. I got four different results. Not because the instruction was bad — because the system did exactly what stochastic systems do.

The Setup That Looked Simple

I was optimizing the PR workflow across four repositories, each with its own Claude Code agent running in parallel. I updated the workflow rules in all repos, then asked Claude Cowork how to notify the running agents about the change without restarting all sessions.

The answer was a one-liner to paste into each CLI: "Re-read CLAUDE.md — the PR workflow has changed. Create PRs as drafts first, then mark ready after CI passes."

Copy. Paste. Paste. Paste. Paste. Four agents, same prompt, same context — four different interpretations. One acknowledged and moved on. Another started "fixing" its current PR. A third decided its in-progress work needed restructuring. The fourth did something I still can't fully reconstruct.

Why the Agents Diagnosed Themselves

When I asked Claude what went wrong, it diagnosed its own kind: "'Re-read CLAUDE.md' is ambiguous. Some agents will just acknowledge. Others will start 'fixing' things. The safer instruction would have been explicit about what to do and what NOT to do."

An AI system explaining why its own kind can’t reliably follow its own instructions. That’s not a bug report — that’s a design constraint surfacing.

The Galton Board in Every Token

LLMs are stochastic systems. Same input, different output — by design. Every token an LLM generates is a probabilistic choice, shaped by context but never fully determined by it. Think of a Galton board: you drop a ball from the top, and it bounces left or right at each peg. Drop it again from the exact same position — it takes a different path.

[Figure: How LLMs make token-by-token decisions — a Galton board analogy]
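The Galton board's behavior is easy to see in a toy simulation (a sketch for illustration, not taken from the original post): identical starting conditions, different paths, yet a stable distribution over many drops.

```python
import random

def drop_ball(pegs=10):
    """Simulate one ball falling through a Galton board.

    At each peg the ball bounces left (0) or right (1) with equal
    probability; the final bin index is the number of right bounces.
    """
    return sum(random.random() < 0.5 for _ in range(pegs))

# Same starting position every time — the paths still differ:
paths = [drop_ball() for _ in range(5)]
print(paths)  # five (usually different) bin indices between 0 and 10
```

Any single drop is unpredictable, but the histogram over thousands of drops converges on the familiar binomial bell curve — variance at the individual level, structure at the aggregate level.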

Token generation works the same way. At each step, the model has a probability distribution across possible next tokens. Temperature, top-p sampling, and context all shape those probabilities — but they don’t eliminate the variance. The output is one path through a possibility space, not the deterministic result of a computation.

This is why the same prompt produces different results. Not because the model is broken, but because interpretation is what it does. LLMs don’t execute instructions — they interpret them. And interpretation, by its nature, varies.

The Distinction That Actually Matters

The lesson from my four-agent incident isn't "write better prompts." Better prompts reduce variance, they don't eliminate it. The real lesson is architectural: when you automate a business process, you need to decide which parts require deterministic behavior and which parts can tolerate — or even benefit from — stochastic reasoning.

Creative reasoning, problem decomposition, code generation — these benefit from the model’s ability to explore a possibility space. That’s where stochasticity is a feature.

Process execution, state transitions, workflow orchestration — these need to produce the same result every time. That’s where stochasticity is a liability.

The mistake isn’t using AI in automation. The mistake is trusting a probabilistic system to be repeatable in places where your process demands it.

Design for the Boundary

The practical consequence: use AI to design the automation, then use deterministic rules to run it. Let the model reason about what the workflow should look like, which edge cases to handle, what the error states are. Then encode the result in something that executes — a state machine, a rule engine, a CI pipeline — not in another prompt.
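As an illustration, the draft-PR rule from my incident could be encoded as a tiny state machine (the states, events, and function names here are hypothetical — a sketch of the pattern, not the author's actual setup):

```python
from enum import Enum, auto

class PRState(Enum):
    DRAFT = auto()
    READY = auto()
    MERGED = auto()

# Every legal transition is written down once. There is nothing
# left to interpret: same state + same event = same result, always.
TRANSITIONS = {
    (PRState.DRAFT, "ci_passed"): PRState.READY,
    (PRState.READY, "approved"): PRState.MERGED,
}

def advance(state, event):
    """Apply an event to a state; illegal transitions fail loudly
    instead of being creatively 'fixed'."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} + {event!r}")
```

The AI's job is to help you write that table — enumerate states, spot missing edges, name the error cases. Once the table exists, execution is a dictionary lookup, not an interpretation.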

Reliability and intelligence are different system properties. Your architecture should treat them that way.

Boris Heuer
AI Engineer & Consultant
