Methodology · Apr 20, 2026 · 7 min read

Building AI Agents: Scratch vs. Agent-Service-Toolkit

The gap between AI experimentation and production is still wide. McKinsey's State of AI 2024 report found that while 72% of companies have adopted AI in at least one function, far fewer have systems that reliably deliver business value in production.

That gap isn't about models anymore. It's about infrastructure - how agents manage state, communicate across steps, and actually run as services.

The question I keep getting from engineers in Python and FastAPI communities is not whether to build with LLMs - that decision is already made. The question is how to structure the surrounding infrastructure so the thing doesn't collapse the first time a real user hits it. That's the comparison I want to work through here: building your own orchestration layer from scratch versus adopting an opinionated toolkit like Agent-Service-Toolkit, which bundles LangGraph, FastAPI, and Streamlit into a single deployable scaffold.

Why This Decision Matters More Than Model Choice

Most tutorials stop at the LLM call. They show you how to send a message and parse a response, then leave you to figure out how to wrap that in an API, persist conversation state, handle retries, and give a non-engineer a way to test it. That gap is where projects stall.

We learned this the hard way building an autonomous SDR.

Our first version looked clean on paper: research → enrichment → scoring → outreach. Each step passed context implicitly to the next. It worked fine at small scale.

At 50 leads, it broke.

Scoring was blocked waiting on research completions it didn't actually depend on. We had effectively serialized a system that could have run in parallel - and we didn't notice until latency made it obvious.

Fixing it meant introducing explicit inter-node contracts. Each step declared exactly what it needed and what it produced.

That change reduced end-to-end processing time by ~38% at 50 leads - not because the model got faster, but because we removed unnecessary dependencies between steps.

Every pipeline we've built since uses explicit schemas for exactly that reason.
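To make the idea concrete, here is a minimal sketch of explicit step contracts driving the schedule. It is not our production code: the `Step` class, the `run_pipeline` helper, the key names, and the dependency graph (scoring reads only the enrichment output) are all hypothetical, chosen for illustration. Each step declares the keys it needs and the key it produces, and starts as soon as its own inputs exist rather than when the previous stage finishes.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass(frozen=True)
class Step:
    name: str
    needs: frozenset[str]   # keys this step reads
    produces: str           # key this step writes
    run: Callable[[dict], Awaitable[Any]]


async def run_pipeline(steps: list[Step], initial: dict) -> dict:
    """Start every step at once; each waits only for the keys it declared."""
    loop = asyncio.get_running_loop()
    slots: dict[str, asyncio.Future] = {}
    for key, value in initial.items():
        slots[key] = loop.create_future()
        slots[key].set_result(value)
    for step in steps:
        slots[step.produces] = loop.create_future()

    # Fail before running anything if a declared input has no producer.
    # (Cycle detection is omitted in this sketch.)
    for step in steps:
        missing = sorted(step.needs.difference(slots))
        if missing:
            raise ValueError(f"{step.name} needs undeclared keys: {missing}")

    async def execute(step: Step) -> None:
        inputs = {}
        for key in step.needs:
            inputs[key] = await slots[key]
        slots[step.produces].set_result(await step.run(inputs))

    await asyncio.gather(*(execute(step) for step in steps))
    return {key: fut.result() for key, fut in slots.items()}


async def demo() -> list[str]:
    """Return the order in which the hypothetical steps finish."""
    finished: list[str] = []

    def fake(name: str, seconds: float):
        async def run(inputs: dict) -> str:
            await asyncio.sleep(seconds)  # stand-in for an LLM or API call
            finished.append(name)
            return f"{name}-result"
        return run

    steps = [
        Step("research", frozenset({"lead"}), "research", fake("research", 0.3)),
        Step("enrichment", frozenset({"lead"}), "profile", fake("enrichment", 0.05)),
        Step("scoring", frozenset({"profile"}), "score", fake("scoring", 0.05)),
        Step("outreach", frozenset({"research", "score"}), "email", fake("outreach", 0.01)),
    ]
    await run_pipeline(steps, {"lead": "acme"})
    return finished
```

Calling `asyncio.run(demo())` shows scoring completing while the slow research call is still in flight, because scoring never declared research as an input. Only outreach, which needs both, waits for it.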

If you're curious how we think about that kind of quality standard across our builds, that's documented in our BQS methodology.

Approach A: Building the Stack Yourself

Rolling your own infrastructure gives you complete control over every dependency. You choose the orchestration library, the API framework, the state store, and the frontend. For teams with strong opinions about their existing stack - say, a FastAPI service already running in Kubernetes with a Redis cache - this is often the right call.

The cost is real, though. You write the boilerplate for session management, tool registration, streaming responses, and error handling. You wire up the OpenAPI docs manually or accept that they won't exist. You build a test UI or tell your product manager to wait. Across the last six custom agent systems we've scoped, the median time to first stakeholder interaction was 16 days. That time isn't wasted - it's where most of the real decisions get made. Custom infrastructure is more adaptable long-term but slower to validate.

There's also a maintenance surface to consider. Every custom abstraction you write is one more thing your team owns. When LangGraph ships a breaking change, you absorb it directly. When FastAPI updates its dependency injection model, you update your wrappers. This is manageable for a dedicated platform team. For a two-person ML team at a startup, it's a meaningful ongoing tax.

Approach B: Agent-Service-Toolkit as Opinionated Scaffold

Agent-Service-Toolkit takes a different position: make the infrastructure decisions for you so you can focus on the logic that's actually specific to your use case. The toolkit ships with LangGraph handling orchestration and memory, FastAPI providing the HTTP layer with automatic OpenAPI documentation and async support, and Streamlit giving you a working test interface without writing a single line of React.

The practical effect is that you scaffold a working service - one where a non-technical stakeholder can open a browser tab and interact with your system - in hours rather than days. That speed matters most during the validation phase, when you're still figuring out whether the core behavior is correct before investing in custom UI.

The limitation is the flip side of the benefit: the toolkit is opinionated. If your organization already runs a different API framework, or if you need a state persistence layer that isn't what LangGraph provides out of the box, you're either adapting the toolkit or fighting it. I've watched teams spend more time customizing an opinionated scaffold than they would have spent building the relevant pieces themselves. The toolkit is most useful when you don't yet have strong infrastructure instincts of your own - or don't want to invest in them upfront. Teams using it can usually move from concept to a testable service in under a week, because key decisions - state structure, tool registration, execution model - are already made.

The Infrastructure Decisions That Actually Differ

Three specific decisions separate these approaches in practice.

State management. LangGraph's built-in checkpointing handles conversation memory and node state persistence. Building this yourself means choosing between Redis, a relational database, or an in-memory store - and writing the serialization logic for each. The toolkit makes this decision for you. If that decision fits your constraints, you save the work. If it doesn't, you're overriding a core assumption.
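For a sense of what "writing the serialization logic" involves on the custom path, here is a small sketch using only the standard library. It is not LangGraph's checkpointer API; the `CheckpointStore` protocol, both classes, and the table layout are assumptions made up for illustration, with SQLite standing in for a relational database.

```python
import json
import sqlite3
from typing import Optional, Protocol


class CheckpointStore(Protocol):
    """Hypothetical interface the rest of the service codes against."""

    def save(self, thread_id: str, state: dict) -> None: ...
    def load(self, thread_id: str) -> Optional[dict]: ...


class MemoryStore:
    """In-memory backend: fast, but state is lost when the process exits."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def save(self, thread_id: str, state: dict) -> None:
        # Store JSON text so later mutations by the caller can't leak in.
        self._data[thread_id] = json.dumps(state)

    def load(self, thread_id: str) -> Optional[dict]:
        raw = self._data.get(thread_id)
        return json.loads(raw) if raw is not None else None


class SqliteStore:
    """Relational backend: one row per conversation thread."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def save(self, thread_id: str, state: dict) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
                (thread_id, json.dumps(state)),
            )

    def load(self, thread_id: str) -> Optional[dict]:
        row = self._db.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

Even this toy version forces decisions the toolkit would otherwise make for you: what the key is, how state is encoded, and what happens when a thread has no checkpoint yet.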

Inter-node contracts. This is where I see the most failures in custom builds. When nodes pass data implicitly - through shared mutable state or loosely typed dictionaries - debugging becomes archaeology. Agent-Service-Toolkit's LangGraph integration encourages typed state schemas between nodes. You can ignore this and pass raw dicts, but the scaffold nudges you toward explicit contracts. That nudge has real value, especially for teams newer to multi-node orchestration.
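A rough illustration of the difference, using plain dataclasses rather than any toolkit or LangGraph API: the class names, fields, and the toy scoring rule below are hypothetical. The point is that a loose dict is converted into a typed contract at the node boundary, so a missing or misspelled key fails there instead of three nodes later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrichedLead:
    company: str
    employee_count: int


@dataclass(frozen=True)
class ScoredLead:
    company: str
    score: float


def parse_enriched(raw: dict) -> EnrichedLead:
    """Convert a loose dict into the typed contract, failing at the boundary.

    Dataclasses reject missing or unexpected keys here, but they do not
    check value types; that would need extra validation.
    """
    try:
        return EnrichedLead(**raw)
    except TypeError as exc:
        raise ValueError(f"enrichment output violates contract: {exc}") from exc


def score_node(lead: EnrichedLead) -> ScoredLead:
    # Toy rule for illustration: larger companies score higher, capped at 1.0.
    return ScoredLead(company=lead.company, score=min(lead.employee_count / 1000, 1.0))
```

With this in place, `parse_enriched({"company": "Acme"})` raises immediately with a message naming the missing field, while the raw-dict version would only fail when some downstream node finally reads `employee_count`.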

Deployment surface. A FastAPI service with auto-generated OpenAPI docs is immediately consumable by other services, by Postman, by your QA team. A custom Flask wrapper with no docs is not. The toolkit gives you the former by default. Building it yourself means you either invest in documentation tooling or accept that integration will be harder than it needs to be. For teams building automation pipelines that other systems will call - the kind of work we cover in our analysis of manual vs. automated operations - this matters more than it might seem upfront.

The Case for Custom Infrastructure

Choose the custom path when your organization has an existing service architecture that the toolkit would conflict with. If you're running a Go-based microservices platform and Python is a second-class citizen, forcing a Python-first scaffold into that environment creates more problems than it solves. Similarly, if your state management requirements are unusual - multi-tenant isolation, compliance-driven audit logging, or a specific vector store integration - you'll spend more time bending the toolkit than building the relevant pieces directly.

Custom builds also make sense when the team has the bandwidth to own the infrastructure long-term. A dedicated platform engineering team that will maintain the orchestration layer across multiple products gets more value from a custom foundation they fully understand than from a third-party scaffold they partially understand.

What the Toolkit Actually Solves

Use Agent-Service-Toolkit when speed to a testable, shareable prototype is the primary constraint. If you need a product manager or a client to interact with the system within a week, the Streamlit frontend alone justifies the toolkit. You're not building a demo - you're building a real service with a real API - but you get the demo surface for free.

It also wins for teams building their first production LLM service. The opinionated structure teaches good patterns: typed state, explicit tool registration, async request handling. Engineers who build with the toolkit first tend to make better infrastructure decisions when they eventually do build custom systems, because they've seen what a well-structured service looks like. The adoption gap in many organizations isn't a model problem - it's an infrastructure literacy problem, and the toolkit's opinionated structure is one of the few things that addresses it directly.

What We'd Do Differently

Start with explicit node schemas before writing any logic. Whether you use the toolkit or build from scratch, define the data contract between each processing step before you write the first function. We skipped this on an early build and spent two days debugging a failure that turned out to be a missing key in a dict passed between nodes. Typed schemas would have caught it at definition time.
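One way to get that definition-time check without any framework is to compare type hints when the pipeline is wired together. The sketch below assumes a simple linear chain where each node takes exactly one typed argument; `check_chain` and the example node and schema names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, get_type_hints


@dataclass(frozen=True)
class Lead:
    company: str


@dataclass(frozen=True)
class Profile:
    company: str
    employee_count: int


@dataclass(frozen=True)
class Score:
    company: str
    value: float


def enrich(lead: Lead) -> Profile:
    return Profile(company=lead.company, employee_count=250)  # placeholder data


def score(profile: Profile) -> Score:
    return Score(company=profile.company, value=min(profile.employee_count / 1000, 1.0))


def check_chain(nodes: list[Callable]) -> None:
    """Raise at wiring time if a node's output type is not the next node's input type."""
    for upstream, downstream in zip(nodes, nodes[1:]):
        produced = get_type_hints(upstream).get("return")
        hints = get_type_hints(downstream)
        accepted = [t for name, t in hints.items() if name != "return"]
        if accepted != [produced]:
            raise TypeError(
                f"{upstream.__name__} produces {produced!r}, "
                f"but {downstream.__name__} accepts {accepted!r}"
            )
```

`check_chain([enrich, score])` passes silently, while a miswired `check_chain([score, enrich])` raises before a single lead is processed. A static type checker gives similar protection; this version just makes the check explicit at startup.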

Don't use the Streamlit frontend as your long-term UI. It's excellent for validation and internal testing. It is not a customer-facing interface. Left in place too long, the demo surface becomes a dependency - and rewriting it later is more expensive than building a proper interface upfront. Use it for what it's designed for - rapid internal feedback - then build the real interface once the behavior is validated.

Plan for the toolkit's update cycle before you commit. Agent-Service-Toolkit is a third-party dependency. LangGraph itself is under active development as of 2026. Before adopting any opinionated scaffold, check the project's release cadence and breaking change history. A scaffold that hasn't been updated in six months creates subtle breakage: dependency drift, incompatible abstractions, and unclear upgrade paths.

The choice isn't really between "scratch" and "toolkit." It's about when you want to pay the infrastructure cost - and how much control you need when you do. Most teams benefit from seeing a system run before they design one from first principles. The mistake is staying in that abstraction layer longer than necessary.
