RAG — IT definition
Retrieval-Augmented Generation: an architecture that combines information retrieval with LLM generation to produce answers grounded in verified sources.
RAG (Retrieval-Augmented Generation) is the dominant architectural pattern for using an LLM on business data. It retrieves the relevant passages from a document base (retrieval) and injects them into the model's prompt (generation), so the model produces an answer grounded in those sources rather than relying on its training alone.
RAG has become the standard approach to reduce hallucinations, serve data that is recent (past the knowledge cutoff) or proprietary, and provide sourced, auditable answers. Per Gartner, over 80% of enterprise GenAI projects deployed in 2025 rely on RAG, up from less than 20% in 2023.
Why RAG exists
Without RAG, an LLM answers from its internal parameters only — which has three problems:
- Knowledge cutoff: the model knows nothing past its training date.
- Private data: internal company documents are not in its training corpus.
- Hallucinations: with no source to ground it, the model invents plausible-sounding answers.
RAG solves all three: answers are grounded in up-to-date external sources, the company keeps control of its data, and sources can be cited.
How a RAG pipeline works
A typical RAG pipeline has five stages (a minimal end-to-end sketch follows the list):
- Indexing: source documents (PDFs, web pages, SQL rows, tickets) are split into chunks, embedded (turned into vectors), and stored in a vector database (Pinecone, Weaviate, Qdrant, pgvector, OpenSearch).
- User question: the user asks a question in natural language.
- Retrieval: the question is embedded and compared by vector similarity against the indexed chunks. The top-k most relevant chunks are returned.
- Augmentation: retrieved chunks are injected into the LLM's prompt alongside the question.
- Generation: the LLM produces an answer grounded in those chunks, ideally citing sources.
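A toy sketch of these five stages. A real pipeline would call an embedding model and a vector database; here a bag-of-words vector stands in for both, and all chunk texts and names are illustrative:

```python
# Toy sketch of the five RAG stages. A real pipeline would call an
# embedding model and a vector database; a bag-of-words vector stands
# in for both here, and all chunk texts are illustrative.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model (OpenAI, Cohere, BGE, ...).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing: chunk, embed, store (normally in a vector database).
chunks = [
    "Slack costs 120k EUR per year across 800 seats.",
    "The CRM is owned by the sales operations team.",
    "Two HR applications reach end-of-life in 2026.",
]
index = [(c, embed(c)) for c in chunks]

# 2-3. User question + retrieval: embed the query, keep the top-k chunks.
question = "How much do we spend on Slack?"
q_vec = embed(question)
top_k = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]

# 4. Augmentation: inject the retrieved chunks into the prompt.
context = "\n".join(chunk for chunk, _ in top_k)
prompt = f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

# 5. Generation: `prompt` would now be sent to the LLM.
print(prompt)
```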
Retrieval can also combine keyword search (BM25), metadata filters, and a re-ranker (a model that reorders results by relevance); a common way to fuse keyword and vector results is sketched below.
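One standard fusion technique is reciprocal rank fusion (RRF). A hedged sketch, assuming the two ranked ID lists have already been produced by BM25 and vector search (the IDs are illustrative):

```python
# Reciprocal rank fusion (RRF): merge keyword (BM25) and vector
# rankings into one list before an optional re-ranker.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking lists chunk IDs, best first. k=60 is the smoothing
    # constant conventionally used in the RRF literature.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc-7", "doc-2", "doc-9"]   # assumed keyword results
vector_hits = ["doc-2", "doc-4", "doc-7"]   # assumed similarity results
print(rrf([bm25_hits, vector_hits]))        # doc-2 and doc-7 rise to the top
```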
RAG variants
- Naive RAG: the basic pipeline above.
- Advanced RAG: adds pre-processing (query rewriting), post-processing (re-ranking, compression), and metadata enrichment.
- GraphRAG: combines a vector store with a knowledge graph for multi-hop questions.
- Hybrid RAG: combines vector and lexical search (BM25).
- Agentic RAG: an AI agent decides when and how to retrieve, and can issue multiple successive queries.
- Self-RAG: the model critiques its own answer and triggers further retrieval as needed (the loop is sketched after this list).
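To make the agentic/Self-RAG idea concrete, here is a hedged sketch of the retrieve-critique loop; `llm` and `retrieve` are hypothetical stubs, not any specific framework's API:

```python
# Sketch of a Self-RAG-style control loop: answer, self-critique,
# retrieve more evidence if the critique is not satisfied.
def retrieve(query: str) -> list[str]:
    return [f"chunk about: {query}"]      # stand-in for vector search

def llm(prompt: str) -> str:
    # Stub model: drafts an answer, and "approves" any critique prompt.
    return "DRAFT answer" if "critique" not in prompt else "SUPPORTED"

def self_rag(question: str, max_rounds: int = 3) -> str:
    context = retrieve(question)
    answer = ""
    for _ in range(max_rounds):
        answer = llm(f"Context: {context}\nQuestion: {question}")
        verdict = llm(f"critique: is '{answer}' supported by {context}?")
        if verdict == "SUPPORTED":
            return answer                  # grounded enough, stop here
        context += retrieve(answer)        # otherwise fetch more evidence
    return answer

print(self_rag("Which apps are end-of-life?"))
```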
RAG vs fine-tuning vs long context
Three main approaches bring business knowledge to an LLM:
- RAG: dynamically retrieve the relevant context per question. Flexible, easy to update, sourced. The industry standard.
- Fine-tuning: adapt the model weights to a domain. Training cost and slow updates, but useful for tone, format, or very specific tasks.
- Long context: push the whole corpus into the context window (1M+ tokens with Gemini 2 or Claude). Simple but expensive per call, and quality drops on very long contexts (the "lost in the middle" effect).
The three are complementary rather than competing. Most serious deployments combine RAG and light fine-tuning.
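A back-of-the-envelope comparison shows why per-call cost favors RAG over long context; the price and token counts below are illustrative assumptions, not vendor quotes:

```python
# Illustrative per-question input cost, long context vs RAG.
# The price and sizes are assumptions for the sake of the arithmetic.
PRICE_PER_MTOK = 3.00      # assumed input price, $ per million tokens
corpus_tokens  = 800_000   # whole document base pushed into context
rag_tokens     = 4_000     # only the top-k retrieved chunks

long_context_cost = corpus_tokens / 1e6 * PRICE_PER_MTOK
rag_cost          = rag_tokens / 1e6 * PRICE_PER_MTOK
print(f"long context: ${long_context_cost:.2f} per question")  # $2.40
print(f"RAG:          ${rag_cost:.4f} per question")           # $0.0120
```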
RAG challenges in production
Many RAG projects work as demos but stall in production. Watch out for:
- Chunking quality: chunks that are too small lose context; chunks that are too large add noise.
- Embedding fit: a generalist embedding model (e.g. OpenAI text-embedding-3) may underperform a domain-tuned one.
- Metadata: without filters (date, team, classification), retrieval gets noisy at scale.
- Permissions and confidentiality: RAG must respect access rights, so a user only receives chunks they are allowed to see. Often forgotten (a sketch follows this list).
- Evaluation: build a company-specific eval set with known-good answers, then measure relevance, faithfulness, and hallucination rate.
- Inference cost: every question costs one LLM call plus embeddings; at scale the budget can explode.
- Updates: re-index sources regularly and handle deletions.
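A hedged sketch of the permissions point above: chunks carry an access-control list in their metadata and are filtered before anything reaches the prompt. All names and data here are illustrative, and real systems push this filter down into the vector store query:

```python
# Permission-aware retrieval: filter chunks by ACL metadata server-side,
# before ranking and before anything is injected into the prompt.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    allowed_groups: frozenset[str]   # ACL captured at indexing time

INDEX = [
    Chunk("Slack contract: 120k EUR/year", frozenset({"procurement"})),
    Chunk("CRM owner: sales ops team",     frozenset({"all-staff"})),
]

def retrieve_for_user(query: str, user_groups: set[str]) -> list[Chunk]:
    # The key point: filtering happens before the LLM ever sees a chunk.
    candidates = [c for c in INDEX if c.allowed_groups & user_groups]
    return candidates  # similarity ranking omitted for brevity

# A regular employee never receives the procurement-only contract chunk.
print(retrieve_for_user("Slack spend?", {"all-staff"}))
```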
RAG and IT-estate context
To answer internal questions ("who owns this application?", "how much do we spend on Slack?", "which apps are end-of-life?"), RAG has to draw on live, accurate data about the IT estate. That is precisely what Kabeen exposes (through a REST API, MCP, and a dedicated RAG endpoint) to feed enterprise AI copilots with up-to-date context.
Common RAG tools
- Vector databases: Pinecone, Weaviate, Qdrant, Milvus, pgvector, Chroma.
- Frameworks: LangChain, LlamaIndex, Haystack, DSPy.
- Embeddings: OpenAI, Cohere, Voyage, BGE, Mistral Embed.
- Managed platforms: Azure AI Search, Amazon Kendra, Vertex AI Search.
- Evaluation: Ragas, TruLens, DeepEval.
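As a toy stand-in for what tools like Ragas automate, even a minimal harness can measure retrieval hit rate against an eval set with known-good chunks; `retrieve` and all IDs below are hypothetical:

```python
# Minimal eval harness: for each question with a known-good source
# chunk, check whether retrieval actually returns that chunk.
def retrieve(question: str, k: int = 3) -> list[str]:
    return ["chunk-1", "chunk-4", "chunk-9"]   # stand-in for real search

EVAL_SET = [
    {"question": "Who owns the CRM?",  "gold_chunk": "chunk-4"},
    {"question": "How much is Slack?", "gold_chunk": "chunk-2"},
]

hits = sum(case["gold_chunk"] in retrieve(case["question"]) for case in EVAL_SET)
print(f"retrieval hit rate: {hits / len(EVAL_SET):.0%}")   # 50% here
```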
Frequently asked questions
What is RAG?
RAG (Retrieval-Augmented Generation) is an architecture that combines information retrieval and LLM generation. Instead of answering from its parameters alone, the LLM receives in its prompt the relevant passages from an internal document base — grounding the answer in verified sources and drastically reducing hallucinations.
Why use RAG over a raw LLM?
Three reasons: (1) a raw LLM knows nothing past its training date (knowledge cutoff), (2) a raw LLM does not have your company's internal data, (3) without a source to cite, an LLM invents plausibly (hallucinations). RAG solves all three: up-to-date sources, controlled private data, sourced and auditable answers.
What is the difference between RAG and fine-tuning?
RAG dynamically retrieves relevant context per question — flexible, easy to update, sourced. Fine-tuning adapts the model weights to a domain — better for tone, format, or very specific tasks, but expensive to train and update. The two are complementary: most serious deployments combine RAG over business data with light fine-tuning on response format.
What are the production pitfalls of RAG?
Seven recurring watchpoints: chunking quality, choice of embeddings tuned to the domain, richness of metadata for filtering, respect for user-level access rights (often forgotten), a proper eval set, inference cost at scale, and a clear re-indexing strategy for updates. A RAG that works in demo and breaks in production almost always fails on one of these.