methodologyMay 18, 2026·8 min read

Build a Local AI Assistant That Works Offline

What We Set Out to Build

In early 2026, we decided to run an experiment: build a fully local AI assistant, no cloud API keys, no data leaving the machine, no monthly bill from Anthropic or OpenAI. The goal was a system that could answer questions against a private knowledge base, execute simple research tasks, and slot into automation pipelines, all running on a mid-range developer laptop. According to McKinsey's The State of AI in 2024 (source), organizations are increasingly exploring on-premises and open-source AI solutions to reduce dependency on cloud APIs and improve data privacy and control. We wanted to see whether that shift was actually practical for a small team, or whether it was aspirational conference-talk.

The stack we chose: Ollama for local LLM serving, ChromaDB as the vector store, a Python-based retrieval-augmented generation pipeline, and n8n as the orchestration layer to wire the agent into real workflows. Nothing exotic. All open-source. All free to run.

Here is what actually happened.

What Happened, Including What Went Wrong

Getting Ollama running took under ten minutes. You pull a reasoning LLM the same way you pull a Docker image:

ollama pull llama3
ollama serve

That gives you a local inference server at http://localhost:11434 with an OpenAI-compatible API surface. Any tool that speaks to OpenAI's chat completions endpoint can point at this instead, with one URL change and no key required.

The RAG pipeline took longer. We chunked a 200-page internal knowledge base into 512-token segments, embedded them using nomic-embed-text (also served via Ollama), and stored the vectors in ChromaDB. A retrieval query looks like this in practice:

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embedding_functions.OllamaEmbeddingFunction(
        url="http://localhost:11434/api/embeddings",
        model_name="nomic-embed-text"
    )
)

results = collection.query(
    query_texts=["What is our refund policy?"],
    n_results=5
)

The retrieved chunks feed directly into the LLM's context window. No external call. No data leaving the machine.

Where things broke: memory. A 7-billion-parameter reasoning LLM running on CPU is slow. On a MacBook Pro M2 with 16GB unified memory, response latency for a complex query sat around 8-12 seconds. Acceptable for a developer tool. Not acceptable for anything customer-facing. We tried quantized 4-bit versions of the same LLM, which cut latency roughly in half, but introduced noticeable quality degradation on multi-step reasoning tasks. There is no free lunch here. You are trading inference speed and quality against hardware cost and privacy.

The agent layer exposed a second problem. We wired the local LLM into an n8n workflow using the HTTP Request node pointed at localhost:11434. Tool-calling, the mechanism that lets an LLM decide to run a function rather than just generate text, is inconsistently supported across open-source LLMs as of mid-2026. Some handle it well. Others hallucinate tool invocations in formats the parser cannot read. We spent two days debugging a pipeline that was silently failing because the LLM was generating tool calls in a slightly wrong JSON structure, and our error handling was not strict enough to catch it.

That debugging experience connects directly to something I learned building multi-provider pipelines earlier. When we originally designed an SDR automation, we used three separate API providers: one for research, one for scoring, one for writing. The per-lead cost worked out to $0.016 cheaper than a single-provider setup. We scrapped it anyway. Three API keys, three billing accounts, three status pages, three sets of rate limits. The operational friction was not worth sixteen-tenths of a cent. Every pipeline we build now runs on a single provider's LLM lineup. Simpler setup, one credential to manage, one bill to track. The local stack taught us the same lesson from the opposite direction: when you eliminate cloud providers entirely, you inherit a different kind of complexity. You own the hardware, the model updates, the inference reliability, and the debugging. That is not free. It is just a different cost structure.

If you want to go deeper on the security surface area that opens up when you run agents with tool access, our post on AI agent permission blind spots covers the specific failure modes we have seen.

Lessons Learned

Three things we would tell anyone starting this build today.

Start with a quantized model and a GPU, or accept the latency tradeoff explicitly. The difference between CPU inference and Apple Silicon GPU inference on the same LLM is not marginal. It changes whether the tool is usable in practice. If you are on x86 hardware without a discrete GPU, a 3-billion-parameter LLM will outperform a 7-billion-parameter one on user experience, even if the larger LLM produces better answers, because the user will not wait 15 seconds for a response. Size the LLM to your hardware, not to your ambitions.

Validate tool-call output schemas before you trust them. Open-source LLMs vary significantly in how reliably they emit structured tool invocations. Build a validation step into every agent node. If the LLM output does not match the expected schema, route it to a retry with an explicit correction prompt rather than letting the pipeline fail silently. We added a JSON schema validation step in n8n between the LLM response node and the tool execution node. That single addition eliminated the class of silent failures we spent two days chasing.

The RAG pipeline itself is where local AI earns its keep. Querying a private knowledge base, a legal document archive, an internal wiki, a customer support history, without sending any of that data to an external API, is a genuinely different capability from what cloud AI offers. The latency is higher. The quality ceiling is lower than the best cloud LLMs. But the data never leaves your infrastructure, and that matters for regulated industries, for competitive data, and for any organization that has learned to read the terms of service on cloud AI providers.

One tradeoff worth naming directly: this approach does not work well for tasks that require up-to-date world knowledge. A local LLM's training data has a cutoff. It cannot browse the web unless you build that tool explicitly. For research tasks that need current information, you either build a local web-scraping tool into the agent, accept stale answers, or route those specific queries to a cloud API. A hybrid setup, local LLM for private data queries, cloud LLM for current-events queries, is architecturally sound but reintroduces the multi-provider complexity we described above. Know which problem you are actually solving before you commit to a fully offline setup.

The n8n integration piece is where this becomes more than a developer toy. Once the local inference server is running, it looks identical to a cloud API from the orchestration layer's perspective. Any automation pipeline that calls an LLM can be rerouted to the local endpoint. That means existing workflow blueprints, document processors, classification pipelines, summarization chains, can run entirely on-premises with a single configuration change. For teams already running n8n self-hosted, this is a natural extension. For teams evaluating whether to self-host their automation infrastructure, it is worth reading about what we learned building AI agents quickly in 2026 before committing to the architecture.

What We'd Do Differently

We would instrument the inference server from day one. Latency per query, token throughput, memory pressure: none of this is visible by default with Ollama. We added logging after the fact and discovered that two specific query patterns were consistently hitting the context window limit and silently truncating. Earlier instrumentation would have caught this in the first week instead of the fourth.

We would separate the embedding LLM from the reasoning LLM on different memory budgets. Running both on the same machine with a shared memory pool caused contention during peak load. Embedding is cheap and fast; it should run on a small, always-loaded process. The reasoning LLM should load on demand. We treated them as equivalent processes initially, which was wrong.

We would build the hybrid routing layer before we needed it. The fully offline setup works until it does not. A query that requires current information, a task that exceeds local hardware capacity, a use case where quality matters more than privacy: these will appear. Building a routing layer that can send specific query types to a cloud endpoint, while keeping sensitive queries local, is not complex to build, but it is much easier to design before the system is in use than to retrofit afterward. We built ours under pressure. It shows.

Build a Local AI Assistant That Works Offline

What We Set Out to Build

What Happened, Including What Went Wrong

Lessons Learned

What We'd Do Differently

Related

Related Articles