Unit 7 · Generative AI & Practical Applications

RAG, Fine-Tuning and Enterprise AI Strategy

11 min read · Lesson 3 of 3 in Unit 7 · Published 5 April 2026

A language model knows about the world only up to its training cutoff. Ask it about anything after that and it will make something up. Ask it about private company data or a document it never saw, and it simply doesn't have the information.

Enterprises need their AI systems to work with current, private data. That means feeding context to the model beyond what it was trained on. You have two main strategies: retrieval-augmented generation and fine-tuning. They solve different problems.

The problem RAG is solving

A language model is like someone who read a lot in school and then never read anything new. Ask it about something after its training cutoff and it'll confidently make something up. Ask it about an internal document it never saw, and it will either admit it doesn't know or confabulate details.

That's hallucination. It's not malice or stupidity - the model does what it was trained to do, predict plausible text. But plausible isn't the same as true.

RAG solves this by giving the model access to relevant context before it answers. You ask a question, RAG finds relevant documents or database entries, and feeds those to the model along with your question. Now the model has actual facts to work with.

How RAG works: retrieve then generate

The process has two steps. First, retrieve. You take the user's query and search a database of documents to find relevant ones. Usually by embedding the query and finding documents with similar embeddings - semantic similarity, not keyword matching.

Then generate. You take the retrieved documents and the original query and feed them both to a language model. The model generates an answer with those documents as context. If the documents contain the answer, the model can usually find it.

This works well. The model doesn't have to know the answer - it has to be able to read and synthesise information from documents. That's something language models do well.
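The retrieve-then-generate loop fits in a few lines. The sketch below is a toy illustration, not a production setup: `embed` here is a simple term-frequency stand-in for a real embedding model, and in practice the assembled prompt would be sent to an LLM API rather than used directly.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a term-frequency vector.
    # A production system would call an embedding model or API here.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Step one: rank documents by similarity to the query, keep the top k.
    qv = embed(query)
    return sorted(docs, key=lambda d: cosine(qv, embed(d)), reverse=True)[:k]

def build_prompt(query, context_docs):
    # Step two: hand the retrieved documents to the model as context.
    context = "\n---\n".join(context_docs)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "Refund requests must be filed within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Expense reports are due on the first Monday of each month.",
]
top = retrieve("When are refund requests due?", docs, k=1)
prompt = build_prompt("When are refund requests due?", top)
```

The model never needs to have memorised the refund policy - the policy document travels in the prompt, and the model only has to read it.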

The limitations are real. If retrieval gets the wrong documents, generation works from bad context. If your documents are poorly structured or low quality, retrieval won't help. If the answer requires combining information across many documents in complex ways, even a good model might struggle.

What fine-tuning is and when it's actually needed

Fine-tuning means taking a pre-trained model and training it further on your specific data. You're adapting the model's weights to your domain.

It's expensive. You need lots of labelled examples - usually hundreds or thousands of good training pairs. You need compute. It takes time. And you end up with a model that's different from the base model, which your tools and infrastructure need to handle.

Fine-tuning makes sense when you have a very specific style or format the model needs to match, specialised terminology the base model handles poorly, consistent patterns your data follows that a model can learn, and enough labelled examples to train on.

What doesn't make sense: fine-tuning to make a model "know" your private data. A model can't memorise its way into reliable recall of facts it saw a handful of times during training. Fine-tuning helps with style, format, and specialised language. It doesn't turn a model into a domain expert.

That's where the confusion happens. People think: "I'll fine-tune on our documents so it knows our data." But models don't work that way. Fine-tuning with documents mostly teaches the model the style of your documents, not their content. For content, you need retrieval.

Why most enterprises should try RAG first

RAG should be your first move. Every time.

RAG is cheaper. It's faster to set up. You don't need to train anything. You don't need labelled data. You can iterate and improve retrieval without retraining. If your documents are bad, you fix the documents instead of retraining a model.

RAG scales. Add new documents and the system immediately has access to them. No retraining, no waiting.

The only real downside is latency. Retrieval takes time. If you need sub-100ms response times, RAG is harder. For most enterprise applications, that's not a constraint.

Should you eventually fine-tune on top of RAG? Maybe. If you have specific style requirements or a lot of domain-specific terminology, fine-tuning a model to use within your RAG system could help. But that's a second step. Start with RAG. Get it working. Then instrument to see where it's failing. Then optimise based on real data about where it breaks.

The build, buy, or use-an-API decision

You can build your own RAG system using open-source components. You can buy a commercial RAG platform. You can use API-based RAG from an LLM provider.

Building means you own everything and can customise everything. It also means you're responsible for everything - security, availability, monitoring. You need ML and engineering expertise in-house.

Buying a platform means someone else handles the infrastructure. But you're paying per query or per month, and you're locked into their choices about retrieval algorithms and embedding models.

Using an API (like file upload features from OpenAI or Anthropic) is easy to start with but can get expensive and gives you less control.

For most businesses: buy or use an API first. Once you understand your requirements better, then consider building something custom. Don't build the infrastructure before you understand the problem.

Where most businesses get it wrong

Mistake one: trying to do everything with fine-tuning instead of retrieval. People have 200 documents and think the solution is to fine-tune a model on them. That won't work well. Fine-tuning teaches the model the aggregate style of those documents, not what's in them.

Mistake two: building before understanding the problem. Teams build elaborate RAG systems with custom retrieval logic and complex architectures before they've validated that RAG is even the right solution. Start simple. Retrieve with vector search, generate with an API, iterate from there.

Mistake three: assuming your data is good. Your RAG system is only as good as your documents. If they're outdated, poorly structured, or low quality, RAG won't fix that. Audit your data first.

Most businesses build too much, too early. Get a RAG system working with open-source tools and APIs. See where it breaks. Then optimise. Then maybe build custom pieces. The default should be "use what exists" not "build something custom."

Check your understanding

What problem does RAG solve that fine-tuning doesn't?

Why is fine-tuning on company documents usually the wrong first move?

Frequently Asked Questions

What is Retrieval-Augmented Generation (RAG)?

RAG is a technique that connects a language model to a document database. When a user asks a question, relevant documents are retrieved first (usually via semantic similarity search), then passed to the model along with the question. The model generates an answer using those documents as context. This solves the training cutoff problem and allows models to reason over private data without retraining.

When should you use fine-tuning instead of RAG?

Fine-tuning makes sense when you have a specific style or format the model needs to match consistently, specialised terminology the base model handles poorly, or enough labelled training examples. It doesn't make sense for making a model "know" your documents - fine-tuning teaches style, not content. For content, use RAG.

What are the main ways to deploy RAG?

You can build your own RAG pipeline using open-source components (full control, full responsibility), buy a commercial RAG platform (managed infrastructure, less control), or use an API with file upload features from an LLM provider (easiest to start, can get expensive). For most businesses: use an API or buy a platform first. Build custom only when you understand your specific requirements.

What is the most common mistake businesses make with RAG?

Trying to use fine-tuning instead of retrieval to make a model "know" company data. Fine-tuning 200 documents doesn't make a model remember what's in them - it teaches the model the aggregate style of those documents. The second most common mistake is building a complex custom system before validating that a simpler approach works.

How It Works

RAG pipeline: 1) Index your documents by embedding them and storing the embeddings in a vector database. 2) At query time, embed the user's question. 3) Search the vector database for documents with similar embeddings. 4) Take the top k results and concatenate them with the question in the model's context. 5) The model generates an answer with those documents visible.
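Those five steps map onto a small amount of code. In the sketch below, `embed` is a deterministic pseudo-random stand-in for a real embedding model, and `VectorIndex` is a minimal in-memory stand-in for a vector database - both names are illustrative, not a real library.

```python
import hashlib
import numpy as np

def embed(text, dim=16):
    # Stand-in for a real embedding model: a deterministic unit vector
    # seeded from a stable hash of the text. Real systems call an
    # embedding model or API here.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

class VectorIndex:
    """Minimal in-memory stand-in for a vector database."""
    def __init__(self):
        self.docs = []
        self.vectors = []

    def add(self, docs):                   # step 1: embed and store documents
        self.docs.extend(docs)
        self.vectors.extend(embed(d) for d in docs)

    def search(self, query, k=3):          # steps 2-3: embed query, search
        q = embed(query)
        sims = np.stack(self.vectors) @ q  # cosine: vectors are unit-norm
        top = np.argsort(sims)[::-1][:k]   # step 4: keep top-k results
        return [self.docs[i] for i in top]

index = VectorIndex()
index.add([
    "Refund policy: refunds within 30 days.",
    "Holiday schedule for the year.",
    "Expense report deadlines.",
])
hits = index.search("Refund policy: refunds within 30 days.", k=2)
```

Step 5 is then just concatenating `hits` with the question in the model's prompt, as in the pipeline above.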

Fine-tuning: Start with a pre-trained model. Assemble a dataset of input-output pairs that demonstrate the behaviour you want. Train for additional epochs on that dataset, updating the model's weights. The result is a new model checkpoint adapted to your domain style and format.
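Fine-tuning starts with that dataset of input-output pairs. The snippet below sketches the common JSONL shape - one pair per line - for a hypothetical support-reply style; exact field names and any chat-message wrapping vary by provider, so treat `"input"`/`"output"` as illustrative.

```python
import json

# Hypothetical training pairs demonstrating a target answer style
# (terse support replies), not real company data.
pairs = [
    {"input": "Customer asks when refunds arrive.",
     "output": "Refunds post within 5-7 business days."},
    {"input": "Customer asks how to track an order.",
     "output": "Go to Orders > Track; updates appear within 24 hours."},
]

# One JSON object per line (JSONL) - the shape most fine-tuning
# services expect, though field names differ by provider.
jsonl = "\n".join(json.dumps(p) for p in pairs)
```

Note what the pairs teach: the clipped, consistent reply style - not the underlying refund policy. That's the style-versus-content distinction in miniature.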

When to combine them: RAG + fine-tuning can work together. Fine-tune a model to adopt your domain's tone and terminology, then use it as the generator in a RAG system. The retrieval handles factual accuracy; the fine-tuning handles style consistency.

Key Points
  • Language models hallucinate about post-cutoff events and private data they've never seen
  • RAG retrieves relevant documents at query time and gives them to the model as context
  • RAG uses semantic similarity (embeddings) rather than keyword matching for retrieval
  • Fine-tuning adapts a model's style, format, and terminology - not its factual knowledge
  • Fine-tuning on documents teaches document style, not document content
  • RAG is cheaper, faster to deploy, and doesn't require labelled data
  • New documents become immediately available in a RAG system - no retraining needed
  • Most businesses should try RAG first, fine-tune second if needed
  • The most common mistake: fine-tuning to make a model "know" your data
  • Start simple: vector search + API. Validate. Then optimise.

Sources
  • Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
  • Hu, E. et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
  • Gao, L. et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv.
  • Anthropic. (2024). Building Effective Agents. anthropic.com.