What We Learned Building Autonomous AI That Actually Works
What We Set Out to Build
Six months ago we gave three AI agents broad tool access and called it an autonomous SDR. The first night it processed 5 leads without incident. The fiftieth lead broke it. Not because the agents stopped working — because they kept working, on the wrong things, with no mechanism to stop.
McKinsey's 2024 State of AI report, which surveyed 1,491 participants, says 72% of organizations are deploying AI in at least one business function. Most haven't hit 50 leads yet.
The research agent got web scraping capabilities, LinkedIn API access, and company database queries. The scorer got CRM write permissions and lead enrichment tools. The writer controlled email templates, personalization engines, and send queues.
What Happened—Including What Went Wrong
Our first autonomous SDR used a flat 3-agent architecture—research, scoring, and writing all reported to a single orchestrator that serialized every step. It worked on 5 leads. At 50, the scorer sat idle waiting on research that had nothing to do with scoring. Decoupling the agents into a pipeline with explicit handoff contracts between stages cut end-to-end processing time by 55% and made each agent independently testable.
But the real problems emerged when we scaled beyond controlled test cases:
The Research Agent Went Rogue: Given broad web access, it started following tangential links, scraping competitor sites for hours, and burning through API quotas on irrelevant data. One morning we found it had spent the entire night researching a single company's supply chain because it decided that was "relevant context."
The Scoring Agent Created Chaos: With CRM write access, it began updating lead records with confidence scores that made sense to the AI but were meaningless to sales teams. Worse, it started creating new contact records for people it "discovered" during research—many of whom were fictional or misidentified.
The Writing Agent Nearly Sent Disasters: The most dangerous moment came when the writing agent, tasked with "personalizing" outreach, decided to reference a prospect's recent divorce (found in a news article) as a conversation starter. Our morning review of the overnight draft queue—a checkpoint we'd added specifically because the writing agent had unsupervised send access—caught this before it went out.
The failures that made headlines — Air Canada's chatbot hallucinating a refund policy that cost the airline a court-ordered payout, Devin committing untested code to production and triggering emergency rollbacks — were the same problem at enterprise scale.
The supply chain night, the phantom CRM contacts, the divorce email — each one pointed to the same underlying problem. The agents weren't broken. They were doing exactly what we'd designed them to do, with exactly the access we'd given them. The failures were ours. Here's what we rebuilt.
Lessons Learned: Practical Safeguards That Work
So we stopped asking what the agents needed to do better and started asking what we'd given them permission to do wrong. That audit produced five changes — each one a direct response to a specific failure.
Graduated Permission Levels
Graduated permissions cut our most dangerous class of failures by withholding broad access until an agent had proved its judgment.
We assumed the critical human checkpoint was the final send step. It's actually at the data boundary — when a research agent decides what's "relevant" and a scorer decides what's "qualified." Those judgment calls compound downstream. A bad relevance decision at research produces a confident but wrong score, which produces a personalized email built on fiction. By the time it reaches the send queue, the error is invisible.
So instead of binary access controls, we built permission tiers that escalate based on action impact. Research agents can read from approved data sources but need approval for new domains. Scoring agents can calculate and store scores but require human confirmation before updating CRM records. Writing agents can draft content but cannot send emails without explicit approval.
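A minimal sketch of what those tiers can look like in code. The names here (`Tier`, `UNATTENDED_CEILING`, `requires_human_approval`) are illustrative, not our production API; the point is that each agent gets a ceiling for unattended actions, and anything above it routes to a human.

```python
from enum import Enum

class Tier(Enum):
    READ = 1         # read from approved data sources
    WRITE_DRAFT = 2  # stage changes: scores, drafts, no external effect
    WRITE_LIVE = 3   # externally visible: CRM updates, email sends

# Hypothetical policy table: highest tier each agent may use unattended.
UNATTENDED_CEILING = {
    "research": Tier.READ,
    "scoring": Tier.WRITE_DRAFT,
    "writing": Tier.WRITE_DRAFT,
}

def requires_human_approval(agent: str, action_tier: Tier) -> bool:
    """Any action above the agent's unattended ceiling needs a human."""
    return action_tier.value > UNATTENDED_CEILING[agent].value
```

The useful property is that the policy lives in one table rather than scattered through agent prompts, so tightening or loosening an agent's reach is a one-line change.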
Bounded Action Spaces
Explicit boundaries reduced our research agent's per-company API calls from 47 to 6 — and the briefs got better.
We define explicit boundaries for each agent's decision-making scope. Research agents work within predefined company databases and approved web sources. Scoring agents operate on established criteria with clear thresholds. Writing agents follow templates with approved personalization variables.
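For the research agent, the boundary reduces to two checks before any fetch: is the domain on the allowlist, and is the per-company call budget exhausted? A sketch, with an assumed allowlist and the 6-call budget from our post-fix baseline:

```python
from urllib.parse import urlparse

# Illustrative allowlist; ours is maintained per campaign.
APPROVED_DOMAINS = {"linkedin.com", "crunchbase.com"}
MAX_CALLS_PER_COMPANY = 6  # budget that replaced the open-ended crawl

def may_fetch(url: str, calls_so_far: int) -> bool:
    """Allow a fetch only for approved domains, within the call budget."""
    host = (urlparse(url).hostname or "").lower()
    in_scope = any(host == d or host.endswith("." + d) for d in APPROVED_DOMAINS)
    return in_scope and calls_so_far < MAX_CALLS_PER_COMPANY
```

An off-list domain isn't an error; the agent records it as a proposed source for human review, which is how the allowlist grows.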
Inter-Agent Contracts
We almost didn't build these. They sounded like bureaucracy — structured handoffs, confidence indicators, source attribution on every data pass. Then we watched a scoring agent confidently process garbage data because the research agent passed it without flagging uncertainty. After rebuilding every handoff with structured contracts and progressive JSON extraction, our dead letter rate — the percentage of records routed to an error queue — dropped from 41% to 11%. The remaining 11% are genuinely ambiguous inputs. The 41% was just preventable sloppiness in how agents talked to each other.
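The contracts themselves are small. A hedged sketch of a research-to-scoring handoff (field names are illustrative): every record carries a confidence value and source attribution, and the consumer validates before processing instead of trusting the producer.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchHandoff:
    company: str
    summary: str
    confidence: float                 # 0.0-1.0, set by the producing agent
    sources: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return problems; a non-empty list routes to the dead letter queue."""
        problems = []
        if not 0.0 <= self.confidence <= 1.0:
            problems.append("confidence out of range")
        if self.confidence >= 0.5 and not self.sources:
            problems.append("confident claim with no source attribution")
        if not self.summary.strip():
            problems.append("empty summary")
        return problems
```

The scoring agent never sees a record that fails validation, which is exactly the failure mode in the garbage-data incident: the old handoff had nowhere to put "I'm not sure about this."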
Human-in-the-Loop Gates
We didn't set out to build a three-tier approval system. We built it because the writing agent nearly sent a cold email referencing a prospect's divorce. After that, we stopped asking "what needs approval?" and started asking "what's the cost of getting this wrong?" Data collection and draft generation have low reversal costs — mistakes are fixable. CRM updates and external sends are not. That asymmetry drove the categories, not a framework.
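That asymmetry fits in a few lines. A sketch of the gate (action names and the two-level cost scale are illustrative): reversible actions run automatically, irreversible ones stop for a human, and anything unclassified defaults to the expensive path.

```python
# Hypothetical reversal-cost table for pipeline actions.
REVERSAL_COST = {
    "fetch_page": "low",    # re-fetchable, no external effect
    "draft_email": "low",   # sits in a review queue
    "update_crm": "high",   # sales teams act on bad data immediately
    "send_email": "high",   # cannot be unsent
}

def gate(action: str) -> str:
    """Route by cost of getting it wrong; unknown actions default closed."""
    cost = REVERSAL_COST.get(action, "high")
    return "human_approval" if cost == "high" else "auto"
```

The default-closed behavior for unknown actions matters more than the table: new capabilities start gated and earn automation, rather than the reverse.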
Monitor and Circuit-Break
The instinct is to monitor outputs. We learned to monitor confidence. An agent that returns a wrong answer with low confidence is working correctly — it's telling you it's uncertain. An agent that returns a wrong answer with high confidence is the dangerous one.
We built monitoring into every agent with automatic circuit breakers. If an agent exceeds API quotas, processes the same record repeatedly, or generates outputs that fail validation checks, it stops and requests human intervention. The circuit breaker doesn't just prevent runaway costs—it's become our primary feedback mechanism for identifying where the agents' decision boundaries need adjustment.
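A stripped-down sketch of the breaker logic, assuming made-up thresholds; the real implementation also watches validation-failure rates, but the two triggers below (quota and repeated-record processing) would have caught both the supply chain night and the usual retry loops.

```python
class CircuitBreaker:
    """Trips when an agent burns its API budget or loops on one record."""

    def __init__(self, max_api_calls: int = 500, max_repeats: int = 3):
        self.max_api_calls = max_api_calls
        self.max_repeats = max_repeats
        self.api_calls = 0
        self.last_record = None
        self.repeat_count = 0
        self.tripped = False

    def record_call(self, record_id: str) -> bool:
        """Log one API call; return True if the agent may proceed."""
        self.api_calls += 1
        if record_id == self.last_record:
            self.repeat_count += 1
        else:
            self.last_record, self.repeat_count = record_id, 1
        if self.api_calls > self.max_api_calls or self.repeat_count > self.max_repeats:
            self.tripped = True  # stays tripped until a human resets it
        return not self.tripped
```

Once tripped, the breaker stays open until a human resets it; the reset is deliberate friction, because every trip is a data point about where the agent's decision boundary is wrong.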
Six months in, our SDR system processes 200 leads nightly with two human checkpoints—one after scoring, one before send. We still catch roughly 3-4 edge cases per week that the agents flag themselves via circuit breaker. That number hasn't dropped much, but the nature of the edge cases has changed: last week one flagged a lead whose LinkedIn listed VP of Engineering but whose company website showed them as an external consultant — the kind of title ambiguity that six months ago would have produced a confidently wrong lead score.
What We'd Do Differently
Wire circuit breakers from day one. We retrofitted ours after the supply chain incident burned an entire night of API quota. Every agent should ship with a breaker before it ships with a capability.
Start with graduated permissions. We built the full pipeline open and constrained it after the divorce email near-miss. Start closed, open selectively, and make agents earn access to higher-impact actions.
Instrument decision logging on the first test run. When we finally added it, the patterns in those logs told us more about where our agents actually struggled than any architecture review had. You can't debug judgment calls from outputs alone.