Every AI Company Says Their Agents Are Safe. Now You Can Verify It.
March 13, 2026

AI agents are starting to manage money, execute code, access sensitive data, and make decisions on your behalf. The industry is moving fast toward giving these systems real autonomy over real things. And the more control you hand an agent, the more you're trusting that the developer behind it actually built in the safety measures they claim.
That trust is completely unverifiable right now.
Users can't verify a single guardrail claim today
When you interact with an AI agent, you're trusting that the developer actually implemented the safety measures they claim. Content moderation, hallucination detection, restrictions on dangerous actions. All of it runs on the developer's servers, behind closed doors.
That means guardrails could be misconfigured. They could be turned off in production to save costs. They could be advertised but never actually deployed. And you'd never know.
This isn't a hypothetical. As AI agents get more autonomous and start handling real decisions with real consequences, "just trust us" stops being an acceptable answer.
AI safety needs proof, not promises.
Now you can verify it.
Proof-of-Guardrail: cryptographic verification that a safety check actually ran
Our research team at Sahara AI, in collaboration with the University of Southern California (USC), just published:
"Proof-of-Guardrail in AI Agents and What (Not) to Trust from It"
The core concept: a system that lets AI agent developers produce cryptographic proof that a specific guardrail actually ran before a response was generated. Not a claim. Not a checkbox on a compliance form. A verifiable, tamper-proof attestation that users can check independently.
Here's how it works at a high level:
The guardrail code runs inside a Trusted Execution Environment (TEE), a hardware-secured enclave that isolates the computation.
When the guardrail executes, the TEE produces a signed attestation that captures exactly what code ran and what the inputs and outputs were.
Users can verify that attestation against the known open-source guardrail code, without ever seeing the developer's proprietary agent.
The developer's IP stays private. The user gets proof. Both sides win.
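The flow above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: a real TEE such as AWS Nitro Enclaves signs attestations with a hardware-rooted asymmetric key whose public half is certified by the vendor, which we stand in for here with a shared HMAC key. All names (`TEE_KEY`, `attest`, `verify`) are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical stand-in for the TEE's hardware-rooted signing key.
# A real enclave uses an asymmetric key pair; users verify against
# the vendor-certified public key instead.
TEE_KEY = b"enclave-signing-key"

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def attest(guardrail_code: bytes, user_input: bytes, output: bytes) -> dict:
    """Runs inside the enclave: bind the guardrail code, input, and
    output together, then sign the bundle so no field can be altered
    after the fact."""
    claims = {
        "code_hash": sha256(guardrail_code),
        "input_hash": sha256(user_input),
        "output_hash": sha256(output),
    }
    payload = json.dumps(claims, sort_keys=True).encode()
    claims["signature"] = hmac.new(TEE_KEY, payload, hashlib.sha256).hexdigest()
    return claims

def verify(attestation: dict, known_code: bytes,
           user_input: bytes, output: bytes) -> bool:
    """Runs on the user's side: check the signature, then check that
    the attested code hash matches the published open-source guardrail
    and that the input/output pair is the one actually attested."""
    claims = {k: v for k, v in attestation.items() if k != "signature"}
    payload = json.dumps(claims, sort_keys=True).encode()
    expected = hmac.new(TEE_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, attestation["signature"]):
        return False  # attestation bytes were altered
    return (
        claims["code_hash"] == sha256(known_code)     # guardrail not swapped
        and claims["input_hash"] == sha256(user_input)
        and claims["output_hash"] == sha256(output)   # response not changed
    )
```

Note that the verifier never needs the developer's proprietary agent code, only the open-source guardrail and the attested input/output, which is what keeps the developer's IP private.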
Every simulated attack was caught, minimal latency cost
We implemented Proof-of-Guardrail on OpenClaw agents and deployed it on AWS Nitro Enclaves. We tested content safety guardrails (using Llama Guard 3) and factuality guardrails (using Loki, an open-source fact verification tool).
The results:
Tampering detection worked across the board. Modified guardrail code, altered attestation bytes, changed responses. Every attack was caught during verification.
Latency overhead averaged about 34%. For a chatbot-style interaction, that's a manageable tradeoff for verifiable safety. Attestation generation itself takes roughly 100ms.
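The across-the-board detection result follows directly from how the attestation binds its fields together: the signature covers cryptographic hashes of the code, input, and output, and changing even one byte of any of them produces a completely different digest, so verification fails. A minimal stand-alone illustration (the strings are hypothetical):

```python
import hashlib

original = b"The transaction looks safe to approve."
tampered = b"The transaction looks safe to approve!"  # one byte changed

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(tampered).hexdigest()

# A single-byte edit yields an entirely different digest, so the
# hash recorded in the signed attestation no longer matches and
# the signature check fails.
print(h1 == h2)  # False
```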
We also deployed a live demo of an OpenClaw agent running on Telegram where users could request proof-of-guardrail in real time through the chat.
Proving the guardrail ran is not the same as proving the output is safe
A very important thing to note here is that Proof-of-Guardrail is not proof of safety. It proves the guardrail ran. It does not guarantee the guardrail worked perfectly. Guardrails can still make classification errors. They can be jailbroken, especially since the system requires guardrails to be open-source (which means adversarial developers can study them for weaknesses).
A financial news agent could present valid proof-of-guardrail while still serving misleading advice if the developer has figured out how to get around the guardrail itself.
We're explicit about this in the paper because the distinction matters. Conflating "the guardrail ran" with "the output is safe" would create exactly the kind of false confidence this research is trying to prevent.
So what does Proof-of-Guardrail actually close off?
Without this system, a developer can skip the guardrail entirely, swap it for a weaker version, or claim safety measures exist when they don't. Those are the easiest and most common ways safety breaks down in production, and they're completely undetectable today. Proof-of-Guardrail eliminates all of them.
What's left is a much narrower problem: adversarial jailbreaking of a guardrail that's actually running. That's harder to pull off and, critically, it's the kind of problem the research community can actively benchmark, red-team, and patch. Open-source guardrails mean open scrutiny, which means faster iteration on defenses.
The gap between "guardrail ran" and "output is safe" doesn't close overnight. But the path forward is clear. We need stronger guardrails, better benchmarks, and community-driven standards for what counts as best practice. Proof-of-Guardrail gives that ecosystem something to build on by making guardrail execution a verifiable fact instead of a claim.
Verification is the foundation for autonomous agents at scale
Proof-of-Guardrail is one piece of a larger shift. As agents get more autonomous, every layer of the stack needs cryptographic accountability. Verifiable inference. Auditable decision-making. Proof that safety measures actually executed, not just a developer's word that they did. This infrastructure can't be an afterthought. It has to be built in from the start.
That's what we're building toward at Sahara AI.
Read the full paper:


