
RAG Explained for Founders: How to Give an LLM Your Business Context

By Marc Molas·November 16, 2024·10 min read

You've tried ChatGPT for your business. It's impressive — until you ask it about YOUR customers, YOUR products, YOUR internal processes. Then it starts making things up. Fake names, nonexistent data, policies you never had. It hallucinates because it doesn't have your data.

That's not a bug. It's a fundamental limitation. GPT-4o, Claude 3.5, Llama 3.1 — every LLM has massive general knowledge but zero knowledge of your company. They don't know who your customers are, what your internal documentation says, or what your return policies look like.

The good news: there's a solution that doesn't require retraining any model. It's called RAG. I've watched teams build it well and build it badly, and I think it's the most relevant AI technique for your startup right now.

The problem: a general-purpose LLM doesn't know your business

Think of an LLM as an extremely smart employee who knows a bit of everything but just started at your company. They can code, write, and analyze data — but they have no access to your internal wiki, your CRM, your contracts, or your knowledge base.

If you ask "what's our refund policy for enterprise clients?", they'll make up an answer that sounds plausible. Not because they're dishonest, but because they have no other option. They fill the gaps with probabilities, not facts.

Fine-tuning (retraining the model on your data) sounds like the obvious fix, but it has serious drawbacks: it's expensive, it requires specialized expertise, and whatever the model learns goes stale quickly, so you have to repeat the process every time your information changes.

This is where RAG comes in.

What RAG is and why it matters

RAG (Retrieval-Augmented Generation) is a technique that gives the LLM access to your documents at the moment it answers a question. You don't retrain the model. Instead, you feed it the relevant information as context alongside the user's query.

It's like giving that new employee access to the company archives before they answer each question. The model stays the same, but now it has real data to work with.

The concept is simple. The implementation has nuances, but it's not rocket science. Any senior engineering team can build a working RAG system in weeks.

How RAG works: the 3 steps

Step 1: Index your data

Take the documents you want the LLM to be able to query: PDFs, wikis, databases, emails, technical documentation, support tickets. Whatever's relevant.

Each document is split into chunks (fragments). A chunk can be a paragraph, a section, a page — it depends on your use case. Then each chunk is converted into an embedding: a numerical representation (a vector) that captures the semantic meaning of the text.
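
The splitting step can be sketched in a few lines. This is a minimal illustration, assuming plain-text input with paragraphs separated by blank lines; the function name `chunk_by_paragraph` and the `max_words` limit are made up for the example, and word count is used as a rough stand-in for token count.

```python
def chunk_by_paragraph(text: str, max_words: int = 300) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    Word count is a rough stand-in for token count (an assumption of this
    sketch). A single paragraph longer than max_words is kept whole.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps each chunk readable on its own, which tends to matter more than hitting an exact size.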

You don't need to understand the math behind embeddings. What matters is that two texts with similar meaning will have similar embeddings. "Return policy for premium clients" and "refunds for enterprise accounts" will have similar vectors, even though they use different words.
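
If you're curious what "similar vectors" means in practice, a common measure is cosine similarity. The sketch below uses invented 3-dimensional vectors purely for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Made-up vectors, not the output of any real model.
return_policy = [0.9, 0.1, 0.2]   # "Return policy for premium clients"
refunds = [0.8, 0.2, 0.3]         # "refunds for enterprise accounts"
office_hours = [0.1, 0.9, 0.1]    # an unrelated text

# The two refund-related texts score closer to each other than to the third.
print(cosine_similarity(return_policy, refunds) > cosine_similarity(return_policy, office_hours))  # True
```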

Step 2: Store the embeddings

Those vectors are stored in a vector database. The most common options:

Pinecone: managed, easy to get started, scales well.
Weaviate: open source, highly flexible.
Qdrant: open source, excellent performance.
pgvector: a PostgreSQL extension. If you already use Postgres, you can get started without adding new infrastructure.

For most startups, pgvector is the most pragmatic option. If you already run PostgreSQL, adding an extension is much simpler than operating a new database.
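
To give a feel for how little pgvector asks of you, here is a rough sketch of the SQL involved, held in Python strings. The table name `chunks`, the column names, and the 1536 dimension are assumptions (the dimension has to match whatever embedding model you pick). The `%(name)s` placeholders follow the parameter style used by the psycopg driver; the database connection itself is not shown.

```python
# SQL for a PostgreSQL database with the pgvector extension available.
# Table name, column names and the 1536 dimension are assumptions.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536) NOT NULL
);
"""

# <=> is pgvector's cosine-distance operator: smaller means more similar.
SEARCH_SQL = """
SELECT content, embedding <=> %(query)s::vector AS distance
FROM chunks
ORDER BY distance
LIMIT %(k)s;
"""

def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"
```

With a driver connection in hand, the query embedding would be passed as `{"query": to_pgvector_literal(vec), "k": 5}`.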

Step 3: Query in real time

When a user asks a question:

The question is converted into an embedding (same process as with the documents).
The most similar chunks are retrieved from the vector database — the fragments of your documents whose meaning is closest to the question.
Those chunks are injected into the LLM's prompt as context: "Based on the following information, answer the user's question: [relevant chunks]".
The LLM generates a response using YOUR data, not its general knowledge.
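
The query-time steps above can be sketched end to end in memory. Everything here is illustrative: `embed` is a toy word counter standing in for a real embedding model so the example runs without any API, the three documents are invented, and the final call to the LLM is left out (the code stops at the prompt that would be sent).

```python
import math

VOCAB = ["refund", "enterprise", "days", "office", "hours", "shipping"]

def embed(text: str) -> list[float]:
    """Toy embedding: counts of a few known words. A stand-in for a real model."""
    words = text.lower().replace("?", " ").replace(".", " ").split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(question: str, indexed_chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the question's."""
    query_vec = embed(question)
    ranked = sorted(indexed_chunks, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n---\n".join(chunks)
    return (
        "Based on the following information, answer the user's question.\n\n"
        f"Information:\n{context}\n\nQuestion: {question}"
    )

documents = [
    "Enterprise refund requests are accepted within 30 days.",
    "Office hours are 9 to 5.",
    "Shipping takes 3 days.",
]
index = [(doc, embed(doc)) for doc in documents]  # done once, at indexing time

question = "What is the refund window for enterprise clients?"
prompt = build_prompt(question, retrieve(question, index, k=1))
```

In a real system the list comprehension that builds `index` is replaced by the vector database, and `prompt` goes to the LLM of your choice.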

The result: answers grounded in real information from your company instead of plausible-sounding guesses. Hallucinations drop dramatically — not to zero, as we'll see, but enough to change what's possible.

When it makes sense to implement RAG

RAG isn't a silver bullet, but it's the right solution for many common startup use cases:

Internal knowledge bases: employees querying documentation, processes, and policies.
Customer support chatbots: answers based on your actual documentation, not the model's general knowledge.
Document search: finding relevant information across thousands of legal, technical, or financial documents.
Compliance queries: answering regulatory questions using your compliance documentation.
Sales enablement: giving your sales team precise answers about products, pricing, and competitor comparisons.

The common thread: you have a corpus of documents that your users (internal or external) need to query, and you want a conversational interface that delivers accurate answers.

When RAG is NOT enough

There are limits, and I'd rather you hit them on paper than in production:

When you need the model to learn new patterns. RAG provides context, not learning. If you need the model to classify support tickets according to your company's specific categories, you may need fine-tuning.
When data changes in real time. RAG works well with documents that are updated periodically (daily or weekly), because the index is only as fresh as its last re-indexing run. If you need data by the second, such as stock quotes or real-time inventory, you need a streaming pipeline, not just RAG.
When accuracy must be 100%. RAG dramatically improves accuracy, but it doesn't guarantee 100%. For medical decisions, legally binding rulings, or high-stakes financial calls, you need a human in the loop.

The mistakes almost every first RAG build makes

I've seen these mistakes repeat across nearly every team implementing RAG for the first time:

Chunks that are too large. If each chunk is an entire 20-page document, you're injecting a lot of noise into the prompt. The LLM receives too much irrelevant information and the response quality suffers. Chunks of 200-500 tokens are usually a good starting point.

Chunks that are too small. The opposite extreme: single-sentence fragments lose context. If a chunk says "the limit is 30 days" but doesn't say "for enterprise client refunds," the answer will be incomplete. You need a balance between specificity and context.
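
One common way to strike that balance (a general technique, not the only option) is to use fixed-size windows that overlap, so a sentence near a boundary shows up next to its neighbors in at least one chunk. In this sketch words approximate tokens, and the `chunk_size` and `overlap` defaults are illustrative starting values, not recommendations.

```python
def chunk_with_overlap(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Each window repeats the last word of the previous one.
print(chunk_with_overlap("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```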

No evaluation pipeline. You implement RAG, it "works," and you push it to production. But how do you know it's actually finding the right documents? How do you measure whether the answers are accurate? You need metrics: retrieval relevance (did the right chunks come back?), faithfulness (is the answer supported by the retrieved chunks?), and answer correctness (does it match a known-good answer?). Without evaluation, you're flying blind.
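
The smallest useful evaluation is a hand-labelled list of questions paired with the chunk that should answer each one. The sketch below measures how often retrieval finds that chunk. The chunk ids, the test cases, and `fake_retrieve` are invented for illustration; in practice you would pass in your system's own retrieval function.

```python
from typing import Callable

def retrieval_hit_rate(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
) -> float:
    """Fraction of questions whose expected chunk id appears in the results.

    test_cases pairs each question with the id of the chunk that should
    answer it; retrieve maps a question to the retrieved chunk ids.
    """
    if not test_cases:
        raise ValueError("need at least one test case")
    hits = sum(1 for question, expected_id in test_cases if expected_id in retrieve(question))
    return hits / len(test_cases)

# Hand-labelled examples (invented for illustration).
cases = [
    ("What is the enterprise refund window?", "refund-policy#2"),
    ("How do I reset my password?", "account-help#1"),
]

# Stand-in retriever that only ever finds the refund chunk.
def fake_retrieve(question: str) -> list[str]:
    return ["refund-policy#2", "pricing#4"]

print(retrieval_hit_rate(cases, fake_retrieve))  # 0.5
```

This only covers the retrieval side. Faithfulness and answer correctness need a comparison against the generated answer, which usually means human review or a second model acting as a grader.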

Not filtering by relevance. A vector search always returns the "most similar" chunks, even when none of them is actually relevant. If the similarity score is low, it's better for the system to say "I don't have information on that" than to force an answer out of tangentially related data.
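
A relevance filter can be as simple as a score cutoff. In this sketch the `0.75` threshold is an arbitrary placeholder: score ranges differ between embedding models, so the right value has to be tuned on your own data.

```python
def filter_by_score(scored_chunks: list[tuple[str, float]], min_score: float = 0.75) -> list[str]:
    """Keep only chunks whose similarity score reaches min_score."""
    return [text for text, score in scored_chunks if score >= min_score]

# Scores here are invented; a real system gets them from the vector search.
results = [("Enterprise refunds: 30 days.", 0.91), ("Office hours are 9 to 5.", 0.42)]
relevant = filter_by_score(results)
# An empty list is the signal to skip generation and tell the user
# that the system has no information on the topic.
```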

Trusting that the LLM will say "I don't know." LLMs are notoriously bad at saying "I don't know." Even with RAG context, if the answer isn't in the documents, many models will try to answer anyway. You need explicit guardrails in your system to handle these cases.
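
One possible shape for such guardrails, as a sketch: refuse before calling the model when there is no context, and ask the model to emit a fixed sentinel when the context doesn't contain the answer. The `call_llm` parameter is a hypothetical stand-in for whichever LLM client you use, and the `NO_ANSWER` sentinel is an arbitrary name chosen for this example.

```python
from typing import Callable

NO_ANSWER = "NO_ANSWER"  # sentinel the model is told to emit; the name is arbitrary
FALLBACK = "I don't have information on that."

def guarded_answer(question: str, chunks: list[str], call_llm: Callable[[str], str]) -> str:
    """Answer from chunks only, with two guardrails around the model call."""
    # Guardrail 1: with no context, do not call the model at all.
    if not chunks:
        return FALLBACK
    context = "\n---\n".join(chunks)
    prompt = (
        "Answer using ONLY the information below. "
        f"If the answer is not there, reply with exactly {NO_ANSWER}.\n\n"
        f"Information:\n{context}\n\nQuestion: {question}"
    )
    reply = call_llm(prompt).strip()
    # Guardrail 2: translate the sentinel into a user-facing message.
    return FALLBACK if reply == NO_ANSWER else reply
```

The prompt instruction alone is not a guarantee, since a model can still ignore it. That is why the first check happens in code, and why refusal behavior belongs in your evaluation set.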

Build vs. buy: the practical decision

You have two paths:

Existing frameworks: LangChain and LlamaIndex are the most popular. They give you pre-built components for chunking, embedding, retrieval, and generation. They speed up initial development but add dependencies and sometimes unnecessary abstraction.

Custom pipeline: you build each component separately. More upfront work, but total control. For simple use cases (a chatbot over your documentation), a custom pipeline with pgvector and the OpenAI or Anthropic API can be simpler than a framework.

My recommendation: start with a simple, custom pipeline. If complexity grows (multiple data sources, advanced re-ranking, hybrid search), then evaluate a framework. Don't start with the most complex solution.

Who should build your RAG system

RAG is conceptually simple, but the implementation details matter a great deal. Chunking strategy, embedding model selection, retrieval tuning, prompt engineering, evaluation — every decision affects the quality of the final result.

You don't need a team of AI researchers. You need senior engineers who have built RAG systems in production and know where the gotchas are. Who have iterated on chunking strategies, tested different embedding models, and built real evaluation pipelines.

At Conectia, the engineers we vet for AI projects have worked on real RAG implementations — not demos or tutorials. They know the difference between a prototype that works in a notebook and a system that works in production with real data, at scale, with users asking questions that weren't in the original plan.

If you're evaluating RAG for your product, the difference between a great result and a poor user experience comes down to the experience of the team building it. A senior engineer with RAG experience can save you months of iteration and mistakes that someone else has already made.

Want to implement RAG in your product but don't have a team with production experience? Talk to a CTO — we connect you with senior engineers who have already built real RAG systems, and we respond within 72 hours.