Debugging · June 7, 2026

Deterministic Replay for Debugging AI Agents

An agent did something wrong in production. You re-run it and it behaves fine. The logs show the steps but not the state behind them. This is the worst kind of bug — non-reproducible — and it's the normal case for AI agents. Deterministic replay turns "I can't reproduce it" into "let me step through exactly what happened."

Why agent bugs don't reproduce

A traditional program is mostly deterministic: same input, same output. An AI agent is not. A single run weaves together several sources of nondeterminism:

Model sampling — the same prompt yields different completions across calls.
External state — the tools, APIs, files, and databases it touched have since changed.
Time and randomness — timestamps, retries, ordering, and seeds.
Context drift — memory or retrieved documents that differ run to run.

So "just run it again" almost never recreates the failure. The information you need — the exact inputs, tool results, and intermediate decisions as they occurred — is gone unless something captured it at the time.

What deterministic replay actually means

Be precise, because the phrase is easy to overclaim. SteelSpine does not make a language model deterministic, and it does not re-call the model live during replay. What it does is capture every event of a run — inputs, tool invocations and their returned results, decisions, outputs, timestamps — into a hash-chained log, and then replay that recorded execution exactly as it happened from the stored events.

Replay is deterministic over the captured run, not over the model. You are re-walking the exact sequence of events that occurred, with the exact tool outputs that came back — not rolling the dice on the model again and hoping for the same result.

That distinction is the whole value. Because the failing run was recorded once, you can re-examine it as many times as you like, step by step, and it will be identical every time — which is exactly what a debugger needs.

The workflow

Capture in production (or wherever the bug lives):

steelspine capture -- python my_agent.py --ticket 8841

When that run misbehaves, pull it up and replay it:

# Re-walk the exact recorded execution
steelspine replay-run <run-id>

# Branch from any captured step to test a fix against the same real state
steelspine branch-create <run-id> --at <step>
steelspine replay-branch <branch-id>

Branching lets you ask "what if the prompt had been different here?" while every tool result and input upstream of that point stays exactly as it was in the failing run. You change one thing and hold the rest of reality fixed — the scientific-method version of agent debugging.

Finding the divergence

When a behavior regressed between two versions, compare the good run to the bad one:

steelspine compare <good-run> <bad-run>

The comparison surfaces where the two executions diverged — the first step where the agent took a different path. Instead of reading two long transcripts side by side, you jump straight to the fork. (This same compare --strict is what teams wire into CI to fail a build when an agent's behavior drifts — covered in the CI/CD guide.)

Honest boundaries

Replay re-walks captured events; it is not a live re-execution of the model, and it cannot recover state that was never captured. Capture coverage is what determines replay fidelity.
If a tool you called has irreversible side effects, replay re-walks the recorded result — it does not re-trigger the side effect, which is what you want for debugging but worth understanding.
The captured run is also tamper-evident (signed hash chain), so the trace you debug is the same trace you can later prove wasn't altered. The cryptography is classical (HMAC-SHA256 + Ed25519), not quantum-resistant.

Reproduce your next un-reproducible bug

The free tier captures and replays runs locally on your own machine. Wrap one agent today, and the next time it misbehaves you'll have the exact run to step through instead of a guess.

Start free Getting-started guide →