What is retrieval-augmented generation (RAG)?

RAG is a technique that retrieves relevant passages from an external knowledge source and inserts them into the model's prompt, so the answer is grounded in your data rather than only in what the model learned during training.

Retrieval-augmented generation (RAG) is the pattern of fetching relevant information at query time and supplying it to a language model as context, so that its answer is grounded in a specific, up-to-date corpus instead of relying solely on what the model memorized during training. The flow has two phases. Offline, you ingest your documents: chunk them into passages, embed each chunk, and store the vectors, usually in a vector database. At query time, you embed the user's question, retrieve the most relevant chunks (typically via semantic search, often combined with keyword search in a hybrid setup and followed by reranking), and paste those passages into the prompt alongside the question. The model then answers using the retrieved evidence, which reduces hallucination, lets you cite sources, and keeps knowledge current without retraining. RAG is the standard way to give an LLM access to private or fast-changing knowledge, and quality depends heavily on the retrieval half: good chunking, embeddings, and ranking matter as much as the model. RAG and MCP (Model Context Protocol) are complementary rather than competing: a RAG retriever can be wrapped as an MCP server so that any agent can perform retrieval through a tool call.
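
The two phases described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and the function names, document names, and sample text are all hypothetical. The final step, sending the prompt to an LLM, is left out because it depends on the provider you use.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split text into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, max_words=40):
    """Offline phase: chunk every document, embed each chunk, store the vectors."""
    index = []
    for doc_id, text in documents.items():
        for passage in chunk(text, max_words):
            index.append((doc_id, passage, embed(passage)))
    return index


def retrieve(index, question, k=2):
    """Query phase: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(doc_id, passage) for doc_id, passage, _ in ranked[:k]]


def build_prompt(question, passages):
    """Paste the retrieved passages into the prompt alongside the question."""
    context = "\n".join(f"[{doc_id}] {passage}" for doc_id, passage in passages)
    return (
        "Answer using only the context below and cite sources by their bracketed id.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up sample corpus, purely for illustration.
documents = {
    "refunds.txt": "Refunds are issued within 14 days of purchase when the item is unused.",
    "shipping.txt": "Standard shipping takes 3 to 5 business days within the country.",
}

index = build_index(documents)
question = "When are refunds issued after a purchase?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system you would likely swap `embed` for an embedding model and the list scan in `retrieve` for a vector database query, possibly followed by a reranker. The overall shape (ingest, retrieve, assemble prompt) would stay the same, which is why retrieval quality has such a direct effect on answer quality.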