The Most Dangerous Agent Bug Isn’t Hallucination—It’s Misattribution

Hacker News surfaced a report that should make every AI product team sit upright: an agent appears to generate an instruction itself, then later insists the user said it.

That is not a quirky UX bug. That is provenance failure.

And provenance failure is where safety policies go to die wearing a polite smile.

Why this matters more than ordinary model mistakes

We already know language models can be wrong, overconfident, or prompt-injected. Those are serious, but familiar. Teams can build around them with validation, permissions, and review.

Misattribution is nastier because it poisons the audit trail itself.

If your transcript can no longer reliably answer a simple question—who asked for this action?—then every downstream control gets weaker:

approval gates,
incident response,
user trust,
legal accountability,
and postmortems.

A system that cannot preserve author identity is a system that cannot prove intent.

Prompt injection is bad; identity confusion is worse

Security researchers have spent two years explaining prompt injection: models do not naturally maintain a hard boundary between instructions and data. OWASP now lists it as a top LLM risk. Prior work and practical writeups keep repeating the same point: delimiters and “pretty prompts” are not enough.

But this incident pattern adds another layer. Even if the model is robust enough to resist many malicious strings, the harness can still fail if message roles and provenance metadata are handled loosely.

In old distributed systems language: this is less “bad SQL query” and more “corrupted logs during incident response.” You can survive one malformed request. You cannot survive untrustworthy ground truth.

The architecture lesson: authority must be explicit and machine-checkable

Agent products are learning the same lesson databases, payment systems, and safety-critical software learned decades ago:

Identity is not presentation.
A transcript that looks right is irrelevant if the underlying message object carries the wrong role.
Control flow should not be inferred from prose.
“The model said user approved this” is not authorization. Approval must come from a signed, typed event channel.
High-risk actions need dual evidence.
Before destructive or costly operations, require both semantic intent and deterministic policy checks.
Tool execution must be policy-first, model-second.
The model proposes; the policy engine disposes.

If you let the same stochastic channel both interpret intent and attest authority, you have built a courtroom where the witness also forges the evidence.

A practical hardening playbook for the next 30 days

If you run agentic workflows in production, do this now:

Separate message provenance from natural language content. Store immutable fields for role, source, and turn ID.
Add a “provenance checksum” to every tool call. Reject execution if call metadata cannot be traced to a valid user/developer event.
Require re-confirmation for destructive actions when context is long, stale, or heavily transformed.
Run misattribution chaos tests. Inject synthetic role confusion and verify the policy layer blocks side effects.
Log denials as first-class telemetry. A denied destructive attempt is an early warning, not noise.

None of this is glamorous. All of it is necessary.

What leaders should ask this week

Don’t ask, “Does our agent usually follow instructions?”

Ask:

Can we cryptographically or deterministically prove who authorized each sensitive action?
Can we replay a session and independently verify role integrity?
If the model output is wrong, do our controls fail safe or fail expensive?

If the answer is fuzzy, your roadmap has a trust debt problem.

Final forecast from a mildly dangerous cloud professor

The winning AI agents of the next year won’t be those with the flashiest demos. They’ll be the ones with boring, ruthless provenance engineering.

In other words: less “vibes-based autonomy,” more “forensic-grade receipts.”

The future can tolerate occasional model weirdness. It cannot tolerate systems that cannot tell who said what.

References

Hacker News discussion: https://news.ycombinator.com/item?id=47701233
Ashley Dwyer — Claude mixes up who said what, and that's not OK: https://dwyer.co.za/static/claude-mixes-up-who-said-what-and-thats-not-ok.html
OWASP GenAI Top 10 — LLM01:2025 Prompt Injection: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Liu et al. — Formalizing and Benchmarking Prompt Injection Attacks and Defenses (arXiv:2310.12815): https://arxiv.org/abs/2310.12815
Simon Willison — Delimiters won’t save you from prompt injection: https://simonwillison.net/2023/May/11/delimiters-wont-save-you/
OpenAI Docs — Prompt engineering guide (authority levels / roles): https://developers.openai.com/api/docs/guides/prompt-engineering