GLOSSARY

Retrieval-Augmented Generation (RAG)

RAG grounds LLM responses in proprietary documents via vector retrieval — the default architecture for enterprise knowledge assistants (Lewis et al., 2020).

Quick answer
Retrieval-augmented generation (RAG) is an architecture in which a language model's answer is grounded in documents retrieved from an external knowledge base at query time, rather than relying solely on what the model memorized during training. RAG is the dominant pattern for enterprise LLM applications because it keeps knowledge updateable, grounds answers in auditable sources, and serves many domains by swapping the index.

WHAT IT IS

A RAG pipeline has five stages: document ingestion (chunking, cleaning), embedding (converting chunks to vectors with OpenAI, Cohere, Voyage, or open models), storage in a vector database (Pinecone, Weaviate, Qdrant, pgvector, Elastic), retrieval (semantic search, hybrid with BM25, reranking with Cohere Rerank or BGE), and generation (LLM answers conditioned on retrieved context with explicit citations).
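
A minimal sketch of those five stages, assuming the OpenAI Python SDK for embeddings and generation and a plain in-memory list standing in for the vector database. The model names, chunk size, and documents are illustrative, not recommendations.

```python
# Five-stage RAG sketch: ingest -> embed -> store -> retrieve -> generate.
# Assumes OPENAI_API_KEY is set; swap in your own embedder, vector store, and LLM.
from openai import OpenAI
import numpy as np

client = OpenAI()

# 1. Ingestion: naive fixed-size chunking (production systems use smarter splitters).
def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

# 2. Embedding: convert chunks to vectors.
def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 3. Storage: an in-memory matrix standing in for Pinecone/Weaviate/Qdrant/pgvector.
docs = ["...your proprietary document text...", "...another document..."]
chunks = [c for d in docs for c in chunk(d)]
index = embed(chunks)

# 4. Retrieval: cosine similarity against the index, top-k chunks.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed([query])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# 5. Generation: answer conditioned on retrieved context, with explicit citations.
def answer(query: str) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query)))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context. Cite sources as [n]."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```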

HOW IT WORKS

Production-grade RAG layers in metadata filtering, access control (user-level permissions on retrieved docs), evaluation (RAGAS, hit rate, faithfulness), and prompt design that forces citation. The pattern was formalized by Lewis et al. at Facebook AI Research (now Meta AI) in 2020 and has become the default architecture for enterprise knowledge assistants.
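
As a rough illustration of two of those layers, the sketch below applies per-user access control as a metadata filter before ranking and uses a prompt template that forces the model to cite or refuse. The field names ("acl", "source") and the rank_by_similarity helper are hypothetical placeholders, not any particular vector store's API.

```python
# Sketch: access-controlled retrieval plus a citation-forcing prompt.
def retrieve_with_acl(query: str, user_groups: set[str], store: list[dict], k: int = 5) -> list[dict]:
    # Keep only chunks the user is allowed to see, then rank within that subset.
    allowed = [c for c in store if c["acl"] & user_groups]
    return rank_by_similarity(query, allowed)[:k]  # rank_by_similarity: your retriever of choice

CITATION_PROMPT = """Answer the question using ONLY the sources below.
Cite every claim as [source_id]. If the sources do not contain the answer,
reply: "Not found in the provided documents."

Sources:
{sources}

Question: {question}
"""

def build_prompt(question: str, retrieved: list[dict]) -> str:
    sources = "\n".join(f"[{c['source']}] {c['text']}" for c in retrieved)
    return CITATION_PROMPT.format(sources=sources, question=question)
```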

WHEN TO USE

Use RAG when answers must reflect proprietary, recent, or verifiable content — internal knowledge assistants, customer support, policy Q&A, research synthesis. Don't use it when the model's training data is already sufficient and current.

SOURCES

Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Related questions

What is retrieval-augmented generation?
Retrieval-augmented generation (RAG) is an architecture in which a language model's answer is grounded in documents retrieved from an external knowledge base at query time, rather than relying solely on what the model memorized during training. RAG is the dominant pattern for enterprise LLM applications.
Why use RAG instead of fine-tuning?
RAG keeps the knowledge base updateable without retraining, grounds answers in verifiable sources for auditability, and lets the same model serve many domains by swapping the retrieval index. Fine-tuning still matters for tone, format, or specialized skills — but grounding is almost always the first problem to solve.
What does a production RAG system require?
Document ingestion and chunking, embedding generation, a vector store (Pinecone, Weaviate, Qdrant, pgvector), a retrieval layer with semantic and often keyword search, a reranker, prompt templating, an evaluation set, and monitoring for retrieval quality and hallucination rate.
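
For the retrieval layer specifically, here is a sketch of hybrid search: keyword (BM25) and semantic rankings fused with reciprocal rank fusion. It assumes the rank_bm25 package and reuses the embed helper and dense index from the pipeline sketch above; a cross-encoder reranker would typically re-score the fused top candidates before they reach the prompt.

```python
# Sketch: hybrid retrieval = BM25 keyword ranking + dense semantic ranking,
# combined with reciprocal rank fusion (RRF).
from rank_bm25 import BM25Okapi
import numpy as np

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Each document scores sum(1 / (k + rank)) across the input rankings.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_idx in enumerate(ranking):
            scores[doc_idx] = scores.get(doc_idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query: str, chunks: list[str], index: np.ndarray, top_k: int = 5) -> list[str]:
    # Keyword ranking over whitespace-tokenized chunks.
    bm25 = BM25Okapi([c.split() for c in chunks])
    kw_rank = list(np.argsort(bm25.get_scores(query.split()))[::-1])
    # Semantic ranking via cosine similarity against the dense index.
    q = embed([query])[0]
    sem_scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    sem_rank = list(np.argsort(sem_scores)[::-1])
    # Fuse both rankings and return the top-k chunks.
    return [chunks[i] for i in rrf([kw_rank, sem_rank])[:top_k]]
```
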
What are the common failure modes?
Retrieval misses relevant documents (bad chunking, poor embeddings), retrieval pulls in irrelevant chunks that pollute the context, the model ignores the retrieved context and hallucinates anyway, or the model answers faithfully from context that is itself outdated. Evaluation must cover retrieval quality separately from generation quality.
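
A minimal sketch of evaluating retrieval on its own: hit rate (recall@k) over a hand-labeled set of query-to-document pairs. The eval-set format, document IDs, and the shape of the retrieve function are illustrative assumptions.

```python
# Sketch: retrieval hit rate (recall@k) over a labeled eval set.
# `retrieve(query, k)` is assumed to return chunk dicts carrying a "doc_id" field.
def hit_rate(eval_set: list[dict], retrieve, k: int = 5) -> float:
    hits = 0
    for case in eval_set:
        retrieved_ids = {c["doc_id"] for c in retrieve(case["query"], k=k)}
        if case["expected_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)

eval_set = [
    {"query": "What is our parental leave policy?", "expected_doc_id": "hr-042"},
    {"query": "When does the Q3 pricing change take effect?", "expected_doc_id": "sales-107"},
]
# A low hit rate points to chunking or embedding problems; a high hit rate with
# wrong final answers points to generation (faithfulness) problems instead.
```
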
How does NUUN AI build RAG?
We choose embeddings, chunking, and reranking per workload, build retrieval eval sets from real user queries, monitor hallucination rate in production, and design the architecture so models and vector stores can be swapped as the market shifts. Vendor lock-in is avoided by default.

Need this term in action?