RAG (Retrieval-Augmented Generation)
RAG improves an LLM's output by retrieving relevant information from an authoritative external source before generation, rather than relying solely on the model's frozen training data. It is the primary defense against hallucination and a cost-effective alternative to fine-tuning for domain or organizational specificity.
The standard pipeline
- Create external data. Convert documents into embeddings via an embedding model and store them in a vector database.
- Retrieve relevant info. Embed the user query and run a similarity search (cosine similarity) against the store.
- Augment the prompt. Insert the retrieved passages alongside the user query.
- Keep it fresh. Re-embed source documents as they change (async or batch).
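The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `VectorStore` is a hypothetical in-memory class that does brute-force search, and the sample documents are invented.

```python
import math
import re
from collections import Counter

# Toy stand-in for an embedding model: a bag-of-words count vector.
# A real pipeline would call a learned embedding model here instead.
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class VectorStore:
    """Hypothetical in-memory stand-in for a vector database (brute-force search)."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[str, Counter]] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # "Create external data" and "Keep it fresh": embed on insert,
        # and re-embed whenever the same document id is written again.
        self.rows[doc_id] = (text, embed(text))

    def search(self, query: str, k: int = 2) -> list[tuple[str, str, float]]:
        # "Retrieve relevant info": embed the query, rank by cosine similarity.
        q = embed(query)
        scored = [(doc_id, text, cosine(q, vec)) for doc_id, (text, vec) in self.rows.items()]
        scored.sort(key=lambda row: row[2], reverse=True)
        return scored[:k]

def build_prompt(query: str, hits: list[tuple[str, str, float]]) -> str:
    # "Augment the prompt": place retrieved passages alongside the user query.
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text, _ in hits)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

store = VectorStore()
store.upsert("refunds", "Refunds are issued within 14 days of a return request.")
store.upsert("shipping", "Standard shipping takes 3 to 5 business days.")
store.upsert("hours", "Support is open Monday to Friday, 9am to 5pm.")

question = "How many days do refunds take?"
hits = store.search(question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the document ids in brackets are what makes source attribution possible.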
Why it matters
RAG mitigates four well-known LLM failure modes: presenting false information when it does not know, returning out-of-date or generic answers, drawing on non-authoritative sources, and confusing terminology across domains. It also adds source attribution, which builds trust and lets developers swap sources, restrict by permissions, and troubleshoot retrievals.
RAG vs fine-tuning
These are the two canonical ways to make a foundation model work for a specific domain. They differ in where the domain knowledge lives.
- RAG keeps the model untouched. Knowledge lives in an external vector database; relevant chunks are retrieved at inference time and injected into the prompt. Updates are cheap. Best for facts that change.
- Fine-tuning modifies the model itself by continued training on domain data. Knowledge and behavior get baked into the weights (full fine-tuning, or parameter-efficient variants like LoRA / QLoRA). Updates are expensive. Best for style, tone, format, and behavior.
| | RAG | Fine-tuning |
|---|---|---|
| Model weights | unchanged | modified |
| Knowledge location | external store | inside the model |
| Update cost | cheap (re-embed) | expensive (retrain) |
| Best for | facts, current data, citations | style, tone, format, behavior |
| Cost profile | inference-heavy | training-heavy |
| Risk modes | retrieval misses, context overflow | catastrophic forgetting, overfitting |
In practice they are combined more often than chosen between: fine-tune for style, RAG for facts.
Embeddings: the layer RAG depends on
An embedding is a dense numeric vector (typically 384--4096 dimensions) representing a piece of text, image, or audio, learned so that semantically similar inputs end up close together in vector space. The output quality of a whole RAG system is bounded by the embedding model's ability to place related texts near one another.
- Dense and semantic. Unlike sparse keyword vectors, embeddings capture meaning -- the classic example is `king - man + woman ~= queen`.
- Context-dependent. Modern transformer embeddings give "bank" a different vector in "river bank" versus "financial bank".
- A working default for English RAG (2026): OpenAI `text-embedding-3-small` (cheap, well-behaved) or `bge-small-en-v1.5` (self-hostable). Use the MTEB leaderboard as a starting filter, not as truth; build a small eval set from real queries and measure retrieval recall@k yourself.
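Measuring recall@k on your own eval set takes very little code. The sketch below assumes each eval item pairs a query with the set of document ids a human judged relevant; `retrieve` is a hypothetical callable standing in for whatever retriever is being tested, and the `canned` results are invented for illustration.

```python
from typing import Callable

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_at_k(
    eval_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall@k over all (query, relevant ids) pairs in the eval set."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)

# Invented ranked results standing in for a real retriever's output.
canned = {
    "reset password": ["doc7", "doc2", "doc9"],
    "refund policy": ["doc4", "doc1", "doc3"],
}
eval_set = [
    ("reset password", {"doc2"}),
    ("refund policy", {"doc1", "doc8"}),
]

score = mean_recall_at_k(eval_set, lambda q: canned[q], k=2)
print(score)  # 0.75: the first query finds its one relevant doc, the second finds 1 of 2
```

Running the same eval set against two candidate embedding models gives a direct comparison on your own queries, which is the measurement the leaderboard cannot provide.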
Vector databases
A vector database stores, indexes, and efficiently searches high-dimensional embeddings. Where traditional databases excel at exact matches, vector DBs excel at similarity searches -- "give me the rows whose vector is closest to this one". The key efficiency primitive is approximate nearest neighbor (ANN) search, which checks a carefully selected subset of candidates instead of all vectors, trading a small amount of accuracy for a large speedup. Common options: Pinecone (hosted), pgvector (Postgres extension), OpenSearch, Weaviate, Milvus, and Chroma (lightweight, good for prototypes). See Tooling for how these fit the broader stack.
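To make the ANN trade-off concrete, here is a minimal sketch of one simple ANN technique, random-hyperplane locality-sensitive hashing. It is an illustration only: the class name `HyperplaneLSH` and all parameters are made up for this example, and production vector databases generally rely on more sophisticated index structures (for instance graph-based or cluster-based indexes).

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class HyperplaneLSH:
    """Buckets vectors by which side of each random hyperplane they fall on."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
        self.vectors = []
        self.buckets = {}

    def _key(self, v):
        # One sign bit per hyperplane; vectors pointing in similar directions
        # tend to share the same bit pattern.
        return tuple(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 for p in self.planes)

    def add(self, v):
        self.vectors.append(v)
        self.buckets.setdefault(self._key(v), []).append(len(self.vectors) - 1)

    def candidates(self, q):
        # Only the query's own bucket is scanned, not the whole collection.
        return self.buckets.get(self._key(q), [])

    def query(self, q, k=1):
        ranked = sorted(self.candidates(q), key=lambda i: cosine(q, self.vectors[i]), reverse=True)
        return ranked[:k]

rng = random.Random(1)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(1000)]
index = HyperplaneLSH(dim=16)
for v in vectors:
    index.add(v)

query = vectors[42]
print(len(index.candidates(query)), "of", len(vectors), "vectors compared")
print(index.query(query, k=3))
```

The accuracy cost shows up when a true neighbor lands just across a hyperplane from the query and so sits in a different bucket; real indexes reduce that risk by probing several buckets or using multiple hash tables.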
Production levers (in order of ROI)
- Add hybrid retrieval first. Dense embeddings miss exact strings ("error code ABC-1234"); keyword search misses paraphrases. Combine dense + sparse (BM25) with rank fusion. This is the single highest-ROI fix for weak RAG.
- Rerank the top results. Run a cross-encoder reranker (e.g. Cohere `rerank-3`, `bge-reranker-large`) over the top ~50 candidates. Often a bigger win than swapping the embedding model.
- Mind chunking. Match chunk size to the embedding model's natural window; wildly larger or smaller chunks degrade quality.
- Respect query/document asymmetry. Many models need different prefixes or input types for queries versus documents. Omitting them can cut recall sharply.
- Quantize once it works. Store vectors as `int8` (about 4x smaller than 32-bit floats) or `halfvec` (16-bit floats, about 2x smaller) with a small recall hit. See quantization.
- Plan for re-embedding. Embeddings drift as content grows, and switching embedding models requires re-embedding the whole corpus. Treat the embedding model as a versioned artifact.
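The rank-fusion step in the hybrid retrieval lever can be sketched with reciprocal rank fusion (RRF), one common fusion method that needs only the rank positions from each retriever, not their raw scores. The document ids below are invented, and `k=60` is a commonly used default for the smoothing constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Invented results for the query "error code ABC-1234".
dense_hits = ["faq-reset", "kb-abc-1234", "faq-login"]      # from embedding search
sparse_hits = ["kb-abc-1234", "changelog-7", "faq-reset"]   # from BM25 keyword search

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['kb-abc-1234', 'faq-reset', 'changelog-7', 'faq-login']
```

Documents that both retrievers rank highly rise to the top, and the fused list is a natural input for the reranking step.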
The trend: just-in-time retrieval in agents
Pure pre-inference embedding retrieval is giving way to hybrid approaches in agent design: agents keep lightweight references and load data on demand via tools. RAG is not going away, but "some data up front, exploration at runtime" is becoming the default.
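One way to picture "some data up front, exploration at runtime" is a context object that puts only a cheap catalog of references into the prompt and exposes a tool for fetching full content on demand. Everything here is hypothetical and for illustration: the `Reference` and `JustInTimeContext` names, the `load_tool` method, and the sample documents are not taken from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reference:
    ref_id: str
    summary: str
    load: Callable[[], str]  # fetches the full content only when called

class JustInTimeContext:
    def __init__(self, references: list[Reference]) -> None:
        self.references = {ref.ref_id: ref for ref in references}
        self.loaded: dict[str, str] = {}

    def catalog(self) -> str:
        # Lightweight listing that goes into the prompt up front.
        return "\n".join(f"{ref.ref_id}: {ref.summary}" for ref in self.references.values())

    def load_tool(self, ref_id: str) -> str:
        # Tool the agent calls at runtime; content is fetched once, then cached.
        if ref_id not in self.loaded:
            self.loaded[ref_id] = self.references[ref_id].load()
        return self.loaded[ref_id]

ctx = JustInTimeContext([
    Reference("runbook", "On-call runbook for the billing service",
              lambda: "Step 1: check the queue depth. Step 2: restart the worker."),
    Reference("schema", "Database schema for the orders table",
              lambda: "orders(id, customer_id, total_cents, created_at)"),
])

print(ctx.catalog())              # only ids and summaries, no document bodies
print(ctx.load_tool("schema"))    # full content, fetched when the agent asks for it
```

The catalog stays small no matter how large the underlying documents are, which is the point of keeping references rather than content in the context.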
See also
- Large Language Models -- the model RAG grounds
- AI Agents -- retrieval as a tool; MCP vs RAG
- Tooling and Frameworks -- LangChain / LlamaIndex, vector DBs, evaluation
- Cloud vs Local Models -- managed RAG (Bedrock Knowledge Bases) vs local RAG
- AI Glossary -- embedding, vector database, reranking, semantic search, and more