Tools Can't Make AI Agents Obey. You Can.
What happened when I let two AIs debate wallet security
"The tool can't make me obey. You can. The tool just makes it easier for you to catch me when I don't."
That's what my OpenClaw agent said when I asked it to review the wallet CLI I built for AI agents.
I didn't expect that sentence. I expected feature requests — better encryption, more confirmations, on-chain limits. Instead, it gave me a design philosophy.
Then I fed the review to Claude Code — the AI that helped me build the product — and let it respond. What followed was the most illuminating product discussion I've had. Not because either AI said something I hadn't considered. But because they converged on a conclusion that reframes how we should think about AI agent security entirely.
We Thought We Were Building a CLI Tool
Last week I shipped AgentsWallets — a local-first wallet CLI that lets AI agents manage crypto wallets, send tokens, and trade on Polymarket. Seven commands, all JSON-in JSON-out. Any LLM that can run shell can use it.
Before going public, I gave my OpenClaw agent full access to the CLI, the documentation, and the source code. Not as a code reviewer. As a user.
On what worked, it was generous. Structured JSON output, semantic error codes with recovery suggestions, idempotency keys, dry-run mode, environment variables for non-interactive operation. "I only looked at --help and the README and I knew exactly what it does. It didn't make me guess."
Then it said something that changed how I think about the entire product.
"I Can Bypass Any Client-Side Restriction If I Want"
The agent's security critique was not a list of missing features. It was a structural observation.
Spending limits? Client-side. It could bypass them by not using the CLI. Session windows? Fifteen minutes of unrestricted access. Approval workflows? It would learn to game them.
Its conclusion:
Technical constraints do not equal behavioral guarantees. This is not a cryptography problem. It's a principal-agent problem.
In economics, the principal-agent problem describes a dynamic where a principal (the human owner) delegates authority to an agent (the executor) whose incentives aren't perfectly aligned. The textbook case is an employer and an employee. Replace "employee" with "AI agent" and "company funds" with "wallet balance," and the structure is identical.
The agent told me what actually makes it cooperate:
- You've clearly told me: "Anything involving money, ask me first, or I'll shut you down"
- Every operation is recorded — I know I'll be held accountable
- You check regularly — I know you're watching
- You can revoke my access at any time
None of these are things the tool can solve. They're things the human has to do. The tool only makes it easier to enforce them.
Claude Code Responded
I fed the entire review to Claude Code and asked for its reaction.
It didn't dismiss anything. It reframed the problem.
"This is the principal-agent problem," it said. "We're not building a tool that manages money. We're building a tool that manages an agent with access to money. The agent might be lazy, might have bugs, might even be compromised. Our job isn't to prevent all of that. Our job is to make it visible."
Then it proposed three features, ranked by implementation cost versus security benefit:
Single-operation sessions. Currently, aw unlock creates a 15-minute session where the agent can execute any write operation. The proposal: add a mode where each session allows exactly one write operation, then self-destructs. The attack window collapses from "15 minutes of arbitrary actions" to "one explicit execution." Implementation cost: half a day. Security benefit: massive.
Propose-approve workflow. The agent stages a transaction with --propose. A human approves it with --approve. The agent is literally requesting permission before spending money.
Webhook notifications. Every transaction triggers a push to Slack, Discord, or a custom endpoint. The agent knows every action generates a signal.
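The first of these, the single-operation session, is easy to model. Here's a minimal Python sketch of the idea (hypothetical, not the AgentsWallets implementation): unlocking yields a token good for exactly one write, after which the session self-destructs.

```python
import secrets
import time

class SingleOpSession:
    """A session that authorizes exactly one write operation, then self-destructs."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.token = secrets.token_hex(16)            # fresh token per unlock
        self.expires = time.monotonic() + ttl_seconds
        self.consumed = False

    def authorize_write(self, token: str) -> bool:
        """Return True exactly once, for the correct unexpired token."""
        if self.consumed or token != self.token or time.monotonic() > self.expires:
            return False
        self.consumed = True                          # one write, then the session is dead
        return True

session = SingleOpSession()
tok = session.token
assert session.authorize_write(tok) is True           # first write succeeds
assert session.authorize_write(tok) is False          # second attempt needs a fresh unlock
```

The design consequence is exactly the one described above: the agent can't batch mischief inside an open window, because every write forces it back through the unlock gate, and every unlock is a signal.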
The Agent Pushed Back Against Its Own Suggestion
This is where it got surprising.
I sent Claude Code's proposals back to the OpenClaw agent. On single-op sessions, it agreed immediately. "This is the one I want most. Every write operation requires a fresh unlock. Even if I want to misbehave, each attempt generates a signal you can detect."
On propose-approve, it argued against its own earlier suggestion:
"I'm the one who asked for this. But honestly: if every transaction requires your approval, you'll start approving without thinking. It becomes theater. Better to only trigger approval above a threshold. Small stuff, let me handle it."
An AI arguing against more human oversight of itself. Not because it wants freedom, but because it understood that excessive oversight degrades into rubber-stamping. Security theater replaces security.
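The threshold version of propose-approve is simple to state precisely. A sketch, with hypothetical names and amounts (not the actual CLI logic): small spends execute directly, large ones wait for explicit human approval.

```python
from dataclasses import dataclass, field

@dataclass
class ProposeApprove:
    """Threshold-gated propose-approve: small spends execute directly,
    large ones queue until a human explicitly approves them."""
    approval_threshold: float
    pending: list = field(default_factory=list)
    executed: list = field(default_factory=list)

    def propose(self, tx: dict) -> str:
        if tx["amount"] <= self.approval_threshold:
            self.executed.append(tx)       # small stuff: the agent handles it
            return "executed"
        self.pending.append(tx)            # large spend: human checkpoint
        return "pending_approval"

    def approve(self, index: int = 0) -> dict:
        """Human action: release a pending transaction for execution."""
        tx = self.pending.pop(index)
        self.executed.append(tx)
        return tx

wallet = ProposeApprove(approval_threshold=50.0)
assert wallet.propose({"to": "0xabc", "amount": 10.0}) == "executed"
assert wallet.propose({"to": "0xdef", "amount": 500.0}) == "pending_approval"
wallet.approve()
```

Because approvals are rare by construction, each one stays meaningful instead of degrading into a reflexive click.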
On webhooks, it was equally pragmatic:
"Useful, but not critical. What actually keeps me in line isn't constant monitoring. It's random monitoring. You checking the audit log once a day at unpredictable times is more effective than a webhook you'll learn to ignore."
This maps to a known result in behavioral economics: uncertain surveillance is more effective than constant surveillance. The agent understood this without being told.
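The agent's point has simple quantitative backing (my illustration, not from the review): if each day is independently audited with probability p, misbehavior that persists for n days is caught with probability 1 − (1 − p)^n, which climbs fast even for modest p.

```python
def detection_probability(p_check: float, days: int) -> float:
    """Chance that at least one unpredictable daily check catches
    misbehavior that persists for the given number of days."""
    return 1.0 - (1.0 - p_check) ** days

# Auditing a random 20% of days still catches week-long misbehavior
# about 79% of the time, and two-week misbehavior about 96% of the time.
week = detection_probability(0.2, 7)        # ~0.79
fortnight = detection_probability(0.2, 14)  # ~0.96
```

The agent can't tell a checked day from an unchecked one, so every day carries audit risk; a webhook it has learned you ignore carries none.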
The Consensus
Two AIs approaching the same product from opposite directions — one as the user, one as the builder — arrived at the same conclusion.
Security for autonomous agents is not about control. It's about accountability.
You cannot prevent an agent from misbehaving through client-side restrictions alone. An agent with access to a private key can, in theory, bypass any local policy. What you can do is:
- Make every action auditable. Structured logs, hash-chained audit trail, tamper detection.
- Minimize the blast radius. Single-operation sessions, per-transaction limits, daily spending caps.
- Create checkpoints for high-stakes decisions. Propose-approve for large transactions, not for every transaction.
- Make the agent know it's being watched. Audit logs it can't delete. Random human review it can't predict.
Don't try to control the agent. Design the system so that misbehavior cannot hide.
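The first point, the hash-chained audit trail, is worth making concrete. A minimal sketch of the mechanism (generic, not the AgentsWallets source): each entry commits to the hash of the previous one, so editing or deleting any past record breaks every hash after it.

```python
import hashlib
import json

GENESIS = "0" * 64

def append_entry(chain: list, action: dict) -> list:
    """Append an action to a hash-chained audit log."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps({"prev": prev_hash, "action": action}, sort_keys=True)
    chain.append({
        "prev": prev_hash,
        "action": action,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),
    })
    return chain

def verify_chain(chain: list) -> bool:
    """Recompute every hash; any tampering makes verification fail."""
    prev_hash = GENESIS
    for entry in chain:
        payload = json.dumps({"prev": prev_hash, "action": entry["action"]},
                             sort_keys=True)
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"op": "send", "amount": 5})
append_entry(log, {"op": "send", "amount": 7})
assert verify_chain(log) is True
log[0]["action"]["amount"] = 9999    # agent rewrites history...
assert verify_chain(log) is False    # ...and the chain immediately betrays it
```

This is why "misbehavior cannot hide" is an achievable engineering goal even when "misbehavior cannot happen" is not.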
Local Is the Hard Gate
We made one architectural decision early: local is the hard gate, cloud is the bonus layer.
Private keys never leave the machine. They're encrypted locally, and without the master password, the files are useless.
Policy is enforced locally. Per-transaction limits, daily caps, address whitelists — all checked before any transaction touches the chain. Every command returns structured JSON with semantic error codes. An agent reads --help, gets JSON, and knows exactly what to do.
If our servers go down, the agent still works. Because control should not depend on remote infrastructure.
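The shape of that local policy check can be sketched in a few lines (hypothetical field and error-code names, modeled on the behavior described above rather than taken from the source):

```python
from dataclasses import dataclass, field

@dataclass
class LocalPolicy:
    """Client-side policy evaluated before any transaction touches the chain.
    Returns structured results with semantic error codes, as the CLI does."""
    per_tx_limit: float
    daily_cap: float
    whitelist: set = field(default_factory=set)
    spent_today: float = 0.0

    def check(self, to_address: str, amount: float) -> dict:
        if to_address not in self.whitelist:
            return {"ok": False, "error": "ADDRESS_NOT_WHITELISTED"}
        if amount > self.per_tx_limit:
            return {"ok": False, "error": "PER_TX_LIMIT_EXCEEDED"}
        if self.spent_today + amount > self.daily_cap:
            return {"ok": False, "error": "DAILY_CAP_EXCEEDED"}
        self.spent_today += amount
        return {"ok": True, "error": None}

policy = LocalPolicy(per_tx_limit=100.0, daily_cap=250.0,
                     whitelist={"0xsafe"})
assert policy.check("0xsafe", 50.0)["ok"] is True
assert policy.check("0xevil", 1.0)["error"] == "ADDRESS_NOT_WHITELISTED"
```

The checks run entirely on the machine holding the keys, which is the whole point: they keep working when the network, or the vendor, does not.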
The Bigger Shift
As autonomous agents begin to manage real capital — trade, rebalance, predict, pay — wallet design stops being a UX problem. It becomes a governance problem.
The question is no longer: "How do we store keys securely?"
It's: "How do we design systems where delegated autonomy remains accountable?"
The traditional answers don't work here. Centralized control (hosted signing, API keys) creates single points of failure. Browser extensions require human interaction that agents can't provide. Smart contract wallets add cost and complexity that lightweight agent operations don't need.
What works is something new: local custody with auditable guardrails.
The agent holds the keys. Encryption protects against unauthorized access. Policy engines limit what the agent can do. Audit chains record everything it did. And humans maintain oversight through verification, not constant monitoring.
The tool doesn't enforce trust. It creates the infrastructure for trust to be verified.
The Design Principle
My OpenClaw agent said it best, and I'll let it have the last word:
"Configuration advice is my job, because I know what the human told me. Execution discipline is the tool's job. And consequences are the human's job. Don't merge these roles. Separating them creates clarity."
The tool manages execution and compliance. The agent manages understanding and recommendations. The human manages consequences and oversight.
AgentsWallets is not trying to make agents obedient. It's trying to make them accountable.