Methodology · May 6, 2026 · 7 min read

Building a Cold Email System With an LLM API

By Jonathan Stocco, Founder

The Problem With Cold Email at Volume

Seventy-two percent of organizations now use AI in at least one business function, up from roughly 50% in previous years, according to McKinsey's 2024 State of AI report. Cold outreach is one of the first places that adoption shows up - and also one of the first places it breaks. The failure mode isn't sending too few emails. It's sending 500 generic ones that read like they were written by a mail merge from 2009.

The real problem is the tradeoff between volume and specificity. Manual personalization doesn't scale past a few dozen contacts per day without a dedicated SDR. Traditional sequencing tools like Outreach or Salesloft handle delivery mechanics well, but they don't write the emails - you still need a human to craft copy for each segment, and that copy goes stale fast. What I wanted was a system that could research a contact, write a contextually relevant opening line, sequence follow-ups based on reply behavior, and do all of it without me touching a keyboard after setup.

That's a multi-step reasoning problem, not a template problem. And it's exactly the kind of task a well-structured LLM pipeline handles better than any static sequence builder.

Architecture: How the Pipeline Actually Works

The system I settled on has four discrete stages: list enrichment, message generation, send scheduling, and reply classification. Each stage is its own process with a defined input schema and output contract. Nothing shares state implicitly.

I learned this the hard way. Our first Autonomous SDR used a flat three-component setup - research, scoring, and writing all reported to a single orchestrator. It worked on 5 leads. At 50, the scorer sat idle waiting on research that had nothing to do with scoring. Splitting into discrete components with explicit handoff contracts between them cut end-to-end processing time and made each piece independently testable. That's why every pipeline I build now uses explicit inter-component schemas - implicit data passing doesn't hold up past toy-scale.
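For concreteness, here's a minimal sketch of what those handoff contracts can look like in Python. The field names are illustrative rather than my exact schema; the point is that each stage's input and output is a typed record, not implicitly shared state.

```python
from dataclasses import dataclass
from typing import Optional

# Each stage consumes one of these records and emits the next one.
# Field names are illustrative; the discipline is that every handoff
# between stages is an explicit, typed contract.

@dataclass
class Contact:                      # raw input to enrichment
    email: str
    first_name: str
    company: str

@dataclass
class EnrichedContact:              # enrichment -> generation
    contact: Contact
    role: str
    company_size: Optional[int]
    recent_funding: Optional[str]
    tech_stack: list[str]
    linkedin_headline: Optional[str]

@dataclass
class DraftEmail:                   # generation -> scheduling
    contact: EnrichedContact
    subject: str
    opening_line: str
    call_to_action: str

@dataclass
class ReplyTag:                     # classification output
    contact_email: str
    label: str  # "interested" | "not_now" | "wrong_person" | "unsubscribe"
```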

The enrichment stage pulls from a contact list (CSV or CRM export), then hits a data provider API to fill in company size, recent funding, tech stack signals, and LinkedIn headline. That enriched record becomes the input to the generation stage. The LLM receives a structured prompt containing the contact's role, company context, and a target value proposition. It returns a subject line, an opening line, and a call-to-action - nothing else. Constraining the output format matters here: if you ask a reasoning model to "write a cold email," you get a cold email. If you ask it to return a JSON object with three fields, you get something you can actually slot into a template and audit.
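A sketch of that generation call, assuming the OpenAI Python SDK. The model name, prompt wording, and field names are placeholders; any API that can be forced into JSON output works the same way.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM API with JSON output works

client = OpenAI()

SYSTEM = (
    "You write cold email components. Return only a JSON object with the keys "
    "'subject', 'opening_line', and 'call_to_action'. No other text."
)

def generate_email(enriched: dict, value_prop: str) -> dict:
    """Turn an enriched contact record into three constrained, auditable fields."""
    user_prompt = (
        f"Role: {enriched['role']}\n"
        f"Company context: {enriched['company_context']}\n"
        f"Value proposition: {value_prop}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever reasoning model you've standardized on
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},  # forces parseable output you can slot into a template
    )
    return json.loads(resp.choices[0].message.content)
```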

The scheduling stage handles send timing and follow-up logic. I use n8n for this layer - it's well-suited to time-based branching and webhook handling, and the visual graph makes it easy to trace what fires when. A contact who opens but doesn't reply gets a different follow-up than one who never opens. A contact who replies gets routed to the classification stage, where a smaller, faster model reads the reply and tags it: interested, not now, wrong person, or unsubscribe. That tag determines what happens next in the sequence.
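The classification step can be as small as this sketch, again assuming the OpenAI SDK. The model name is a stand-in for whatever cheap model you use; the tag set mirrors the four labels above.

```python
from openai import OpenAI

client = OpenAI()

TAGS = {"interested", "not_now", "wrong_person", "unsubscribe"}

def classify_reply(reply_text: str) -> str:
    """Tag an inbound reply so the scheduler knows which branch to fire."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for a small, cheap classification model
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the reply into exactly one of: interested, not_now, "
                    "wrong_person, unsubscribe. Respond with the label only."
                ),
            },
            {"role": "user", "content": reply_text},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    # Conservative fallback: a garbled label should never trigger the "interested" branch.
    return label if label in TAGS else "not_now"
```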

Implementation Considerations

The generation stage is where most people over-engineer things. You don't need a fine-tuned model or a complex chain-of-thought prompt to write a decent cold email opening. A well-scoped system prompt with three or four concrete examples of good and bad openings outperforms elaborate prompting in my testing. Keep the context window focused: the model doesn't need the contact's full LinkedIn history; it needs their current title, their company's recent activity, and your value prop. More context doesn't always mean better output - it often means slower, noisier output.
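For illustration, a scoped system prompt along those lines. The example openings here are invented placeholders; in practice you'd pull good and bad examples from your own sent history.

```python
# A sketch of the kind of scoped system prompt I mean. The examples are
# placeholders; swap in real openings from emails that actually got replies.
GENERATION_SYSTEM_PROMPT = """\
You write the opening line of a cold email. One sentence, specific to the
contact, no flattery, no "I hope this finds you well."

Good: "Saw that Acme just opened a Berlin office, so guessing onboarding
SDRs across time zones is on your plate now."
Good: "Your post on consolidating the data stack mentioned the exact
migration problem we keep seeing."

Bad: "I'm reaching out because I think you'd be interested in our product."
Bad: "As a leader in your space, you know how important efficiency is."

Return only a JSON object with keys: subject, opening_line, call_to_action.
"""
```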

Rate limits and deliverability are the two constraints that will actually bite you. Most LLM APIs throttle at the request level, so if you're processing 500 contacts in a single batch, you need a queue with retry logic, not a for-loop. On the deliverability side, sending volume matters less than domain reputation. I'd rather send 50 well-timed emails from a warmed domain than 500 from a cold one. If you're building this for the first time, start with a subdomain, warm it for 30 days, and cap daily sends at 100 until you have reply data to calibrate against. This is where the approach breaks down for teams that want immediate volume - the infrastructure warmup period is real and can't be skipped.
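The queue point is worth a sketch. This is a minimal backoff wrapper rather than a full queue implementation, and it reuses the generate_email sketch from earlier; in practice you'd catch your SDK's specific rate-limit exception rather than a bare Exception.

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=2.0):
    """Retry an API call with exponential backoff and jitter instead of
    hammering the endpoint from a bare for-loop."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # narrow this to the SDK's rate-limit error in real code
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            time.sleep(delay)

# Usage: drain contacts from a queue and let the wrapper absorb throttling.
# for contact in contact_queue:
#     draft = with_backoff(lambda: generate_email(contact, VALUE_PROP))
```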

There's also a cost consideration that doesn't get discussed enough. Running a reasoning model on 500 contacts per day adds up. For the generation stage alone, you're looking at non-trivial API spend if you're passing rich context per contact. A classification model for reply tagging is far cheaper per call - use the smallest model that gets the job done for that task, and reserve the heavier reasoning for the generation step where quality actually matters. If your budget is tight, this system may cost more than a junior SDR at high volume. That's an honest tradeoff worth calculating before you build.
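A back-of-the-envelope calculator makes that tradeoff concrete. The token counts and per-token prices below are placeholder assumptions; plug in your provider's current rates and your actual context size before trusting the number.

```python
def daily_generation_cost(contacts_per_day=500,
                          input_tokens=1_500,     # enriched context per contact (assumption)
                          output_tokens=150,      # three short output fields (assumption)
                          price_in_per_1k=0.01,   # placeholder $/1K input tokens
                          price_out_per_1k=0.03   # placeholder $/1K output tokens
                          ) -> float:
    """Rough daily spend for the generation stage only."""
    per_contact = (input_tokens / 1000) * price_in_per_1k \
                + (output_tokens / 1000) * price_out_per_1k
    return contacts_per_day * per_contact

# With the placeholder numbers above: 500 * (0.015 + 0.0045) = $9.75/day,
# or roughly $300/month before enrichment and classification costs.
```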

For teams already running n8n workflows, this pipeline slots in naturally. The enrichment and scheduling layers map directly to HTTP Request nodes and Wait/Switch branching. If you're earlier in your automation journey, our breakdown of building AI pipelines from scratch versus using a service toolkit covers the infrastructure decision in more depth.

What We'd Do Differently

Build the reply classifier before the sender. I built the send logic first and the classification logic last. That was backwards. The reply classifier determines whether your follow-up sequences make sense - if it's miscategorizing "not now" replies as "interested," your sequence logic fires wrong and you burn contacts. Build and validate classification on a sample of real replies before you wire up the full send pipeline.
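A sketch of that validation pass, assuming a hand-labeled CSV of real replies and the classify_reply sketch from earlier. The column names are assumptions.

```python
import csv
from collections import Counter

def validate_classifier(labeled_replies_path: str) -> None:
    """Compare model tags against hand labels before wiring up send logic.
    Expects a CSV with 'reply_text' and 'human_label' columns (assumed format)."""
    confusion = Counter()
    total = correct = 0
    with open(labeled_replies_path, newline="") as f:
        for row in csv.DictReader(f):
            predicted = classify_reply(row["reply_text"])
            confusion[(row["human_label"], predicted)] += 1
            total += 1
            correct += predicted == row["human_label"]
    print(f"accuracy: {correct / total:.0%} on {total} replies")
    # The misfire that burns contacts: "not_now" read as "interested".
    print("not_now misread as interested:", confusion[("not_now", "interested")])
```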

Version your prompt templates like code. The generation prompt will change. Your value prop shifts, your ICP tightens, a competitor releases something that changes how you position. If your prompts live in a database field or a hardcoded string, you have no history and no rollback. Store them in version control with the same discipline you'd apply to any other configuration file. When output quality drops - and it will - you want to know exactly what changed.
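One way to do that, with illustrative paths: keep templates as plain text files in the same repository as the pipeline code and load them by name, so every change gets a commit, a diff, and a rollback.

```python
from pathlib import Path

# prompts/ lives in the same git repo as the pipeline code, so prompt edits
# go through review and history like any other configuration change.
PROMPT_DIR = Path(__file__).parent / "prompts"

def load_prompt(name: str) -> str:
    """Read a prompt template from a version-controlled file, e.g. prompts/generation_system.txt."""
    return (PROMPT_DIR / f"{name}.txt").read_text()

SYSTEM_PROMPT = load_prompt("generation_system")  # hypothetical file name
```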

Don't skip the human review gate on the first 50 sends. Even a well-designed pipeline produces occasional outputs that are technically correct but contextually wrong - a company name that's been acquired, a role that no longer exists, a value prop that lands badly for a specific vertical. The first 50 sends should go through a spot-check queue before full automation kicks in. After that, you'll have enough signal to know what edge cases your prompts don't handle, and you can add guardrails rather than reviewing everything manually. The goal is supervised automation, not blind automation.
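A sketch of the gate itself, with the threshold and queue storage left as assumptions; the mechanics matter less than the fact that early drafts get held for a human.

```python
REVIEW_THRESHOLD = 50  # first N drafts go to a human before anything is sent

def route_draft(draft, approved_count: int, review_queue: list, send_queue: list) -> None:
    """Hold early drafts for spot-checking; let later ones flow once enough
    have been approved and the prompt's edge cases are understood."""
    if approved_count < REVIEW_THRESHOLD:
        review_queue.append(draft)   # human reviews, then promotes to send_queue
    else:
        send_queue.append(draft)
```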
