To understand the concept of retrieval-augmented generation (RAG) in generative AI, imagine a newsroom. An experienced journalist can write articles on various topics using their general knowledge and understanding of the subject matter.
However, when covering a complex story, such as an investigative piece or a technical subject, they rely on researchers to dig through archives, reports, and interviews for accurate, relevant information.
Similarly, large language models (LLMs) can generate responses to a wide range of questions, but to provide precise, well-sourced answers, they need a helper to gather the right data. In this analogy, RAG plays the role of the newsroom researcher: a process that combines data retrieval with AI's generative power.
Retrieval-augmented generation is an AI approach that combines the functionality of conventional information retrieval systems, like search engines and databases, with the advanced text-generation abilities of LLMs. Large language models capture patterns of how humans use language to form sentences, which allows them to respond quickly to a wide range of questions. However, they may struggle when users need detailed or up-to-date information on a specific topic. RAG bridges this gap by combining two processes, data retrieval and text generation, to deliver accurate and relevant AI-generated responses.
To understand how information is processed and retrieved in a RAG AI system, let's explore vector databases and how RAG retrieves data from different sources.
Vector databases are specifically designed to store embeddings: numerical vectors that represent real-world objects, such as words, images, or videos, in a form machine learning models can work with. These embeddings are created from source content such as code and documentation. Because embeddings are vectors, a RAG system can compare them for similarity and retrieve the most relevant information from the vector database.
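To make "similarity between vectors" concrete, here is a minimal Python sketch that scores how close two embeddings are using cosine similarity. The four-dimensional vectors and file names are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score how similar two embedding vectors are (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings; real models use far higher dimensions.
query_vec = [0.9, 0.1, 0.0, 0.3]
documents = {
    "reset_password.md": [0.8, 0.2, 0.1, 0.4],  # hypothetical help article
    "billing_faq.md":    [0.1, 0.9, 0.7, 0.0],  # hypothetical FAQ page
}

for name, vec in documents.items():
    print(name, round(cosine_similarity(query_vec, vec), 3))
```

The document whose vector points in nearly the same direction as the query vector scores close to 1.0 and would be retrieved first.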
When a user asks an LLM a question, the model sends the query to a system that translates it into an embedding or vector, which is then compared to other vectors stored in a machine-readable index of a knowledge base. If there are matches, the system retrieves the related information, converts it back into human-readable text, and sends it to the LLM. The LLM then combines this retrieved information with its own response to create a final answer for the user, sometimes including references to the sources found.
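As a rough, end-to-end illustration of that round trip, here is a toy Python sketch. The keyword-counting embed() function, the in-memory index, and the formatted "answer" are stand-ins for a real embedding model, vector database, and LLM call.

```python
# Toy stand-ins for a real embedding model, vector database, and LLM call.
def embed(text: str) -> list[float]:
    # Hypothetical: count a few keywords in place of a learned embedding.
    keywords = ["password", "reset", "billing", "refund"]
    return [float(text.lower().count(k)) for k in keywords]

snippets = [
    "To reset your password, open Settings > Security and choose 'Reset password'.",
    "Refunds for billing errors are issued within five business days.",
]
# Indexing: store an embedding alongside every snippet.
index = [(text, embed(text)) for text in snippets]

def similarity(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def answer(question: str) -> str:
    query_vec = embed(question)  # 1. translate the query into a vector
    best_text, _ = max(index, key=lambda pair: similarity(query_vec, pair[1]))  # 2. closest match
    # 3. hand the retrieved text plus the question to the LLM (stubbed here as a formatted prompt)
    return f"Question: {question}\nRetrieved context: {best_text}"

print(answer("How do I reset my password?"))
```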
For instance, let's explore a scenario where a company wants to implement a customer support assistant powered by RAG. The assistant is designed to answer questions about the company's products, pulling information from a vector database populated with product manuals, FAQs, and troubleshooting guides.
Here's how it would work:
Behind the scenes, the embedding model regularly updates machine-readable indexes, also known as vector databases, as new or updated knowledge bases become available.
In RAG, data is organized like books in a library, which makes information easier to locate through indexing. Indexing categorizes data based on exact word matches, themes, or metadata such as topic, author, date, or keywords. The index is usually refreshed as new data becomes available, and there are several ways to build it:
Typically, RAG pipelines store documents and break them into smaller chunks. Each chunk is then represented as a vector (embedding) that captures its core meaning, enabling more effective retrieval and response generation.
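A minimal sketch of that indexing step might look like the following, assuming a toy word-counting embed() helper in place of a real embedding model and a made-up product manual as the source document:

```python
# A minimal indexing sketch: split a document into chunks, embed each chunk,
# and keep (chunk, vector) pairs that the retrieval step can search later.
# The embed() function is a toy stand-in for a real embedding model.

def embed(text: str) -> list[float]:
    vocabulary = ["warranty", "battery", "reset", "shipping"]
    return [float(text.lower().count(word)) for word in vocabulary]

def chunk(document: str, max_words: int = 40) -> list[str]:
    """Split a document into fixed-size word windows; real pipelines often
    split on paragraphs or sentences and add overlap between chunks."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Hypothetical product manual text.
manual = (
    "The battery lasts up to 12 hours under normal use. "
    "To reset the device, hold the power button for 10 seconds. "
    "The warranty covers manufacturing defects for two years."
)

vector_index = []
for piece in chunk(manual, max_words=15):
    vector_index.append({"text": piece, "vector": embed(piece)})

for entry in vector_index:
    print(entry["vector"], entry["text"][:50])
```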
This step focuses on refining the user's question to align it more closely with the indexed data. The query is simplified and optimized to improve the search process. Effective query processing helps RAG identify the most relevant information.
When working with vector indexes, the refined query is converted into an embedding, which is then used to perform the search.
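For example, a very simple form of query processing could normalize the question, drop filler words, and then embed the refined query, as in this sketch. The filler-word list and toy embedding are illustrative assumptions; production systems often expand synonyms or rewrite the query with an LLM instead.

```python
# A sketch of simple query processing: normalize the question, strip filler
# words, and embed the refined query so it can be compared against the index.

FILLER = {"please", "can", "you", "tell", "me", "the", "a", "an", "how", "do", "i"}

def refine(query: str) -> str:
    words = [w.strip("?.,!").lower() for w in query.split()]
    return " ".join(w for w in words if w and w not in FILLER)

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model.
    vocabulary = ["warranty", "battery", "reset", "shipping"]
    return [float(text.count(word)) for word in vocabulary]

raw_query = "Can you tell me how do I reset the battery?"
refined = refine(raw_query)       # "reset battery"
query_vector = embed(refined)     # [0.0, 1.0, 1.0, 0.0]
print(refined, query_vector)
```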
Once the question is clear, RAG searches its indexed data to gather the most relevant context for the answer. The search method depends on how the data is stored. For vector-based searches, RAG calculates the distance between the query's vector and the vectors of the document chunks to find the closest matches.
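Sticking with toy numbers, the search step can be sketched as computing a distance between the query vector and each stored chunk vector, then keeping the closest chunks. The vectors below are made up for illustration.

```python
import math

# Vector search sketch: measure the distance between the query embedding and
# each stored chunk embedding, then sort so the closest chunks come first.

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query_vector = [0.0, 1.0, 1.0, 0.0]
chunks = [
    ("Hold the power button for 10 seconds to reset the device.", [0.0, 0.2, 1.0, 0.0]),
    ("Shipping takes 3-5 business days within the EU.",           [0.0, 0.0, 0.0, 1.0]),
    ("The battery lasts up to 12 hours under normal use.",        [0.0, 1.0, 0.1, 0.0]),
]

ranked = sorted(chunks, key=lambda c: euclidean_distance(query_vector, c[1]))
for text, vector in ranked:
    print(round(euclidean_distance(query_vector, vector), 2), text)
```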
This search generates many potential results. However, not all of these are suitable for use by the LLM, so they need to be filtered and prioritized. This process is similar to how search engines like Google work. When you search for something, you get multiple pages of results, but the most relevant ones are organized and displayed on the first page.
RAG applies a ranking system to its search results, filtering out less relevant data. Each result is scored based on how closely it matches the query, and only the highest-scoring results are passed on to the generation stage, thus making sure that the AI uses the most relevant information to create a response.
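A simple version of that ranking step might score each candidate, discard anything below a relevance threshold, and keep only the top few results. The scores and cut-offs below are illustrative; in practice they come from the similarity search and, optionally, a dedicated re-ranking model.

```python
# Ranking sketch: filter out weakly related chunks, then keep the top few.

candidates = [
    {"text": "Hold the power button for 10 seconds to reset the device.", "score": 0.92},
    {"text": "The battery lasts up to 12 hours under normal use.",        "score": 0.81},
    {"text": "Shipping takes 3-5 business days within the EU.",           "score": 0.12},
    {"text": "The warranty covers manufacturing defects for two years.",  "score": 0.35},
]

MIN_SCORE = 0.5   # drop results that barely match the query
TOP_K = 2         # keep only the best few so they fit in the prompt

relevant = [c for c in candidates if c["score"] >= MIN_SCORE]
top_results = sorted(relevant, key=lambda c: c["score"], reverse=True)[:TOP_K]

for result in top_results:
    print(result["score"], result["text"])
```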
After selecting the most relevant information, RAG combines it with the original question to improve the prompt. This added context helps the LLM interpret and respond to the query more effectively. The LLM incorporates current, specific details so that its response goes beyond its general knowledge, making it more accurate.
The final step is where the AI truly shines. Equipped with the enriched prompt, the LLM uses its advanced language capabilities to craft an answer. The generated response is not just generic—it’s informed by the precise and up-to-date information gathered earlier, making it highly relevant and reliable.
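Putting the last two steps together, a sketch of prompt augmentation and generation could look like this, with call_llm() as a placeholder for whichever model API is actually in use:

```python
# Prompt augmentation sketch: fold the retrieved chunks into the prompt
# alongside the original question, then hand the enriched prompt to the model.

def build_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would send the prompt to an LLM API.
    return f"[model response based on a {len(prompt)}-character prompt]"

question = "How do I reset the device?"
retrieved = [
    "Hold the power button for 10 seconds to reset the device.",
    "The battery lasts up to 12 hours under normal use.",
]

print(call_llm(build_prompt(question, retrieved)))
```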
Most organizations don't build their own AI models from the ground up. Instead, they adapt pre-trained models to fit their needs using methods like RAG or fine-tuning. Fine-tuning modifies the model's internal weights, producing a highly specialized version adapted to a specific task, which works well for organizations dealing with heavily specialized scenarios. It's worth mentioning that fine-tuning is a meticulous process that should be approached with care: collecting a dataset for additional training is not for the faint of heart, and there is always a risk of degrading the model so that it performs worse than it did before the additional training.
RAG, on the other hand, skips weight adjustments. Instead, it retrieves information from various data sources to enhance a prompt, letting the model produce more context-aware responses for the user.
Some organizations use RAG as a starting point and then fine-tune the model for more specialized tasks. Others find that RAG alone is sufficient for customizing AI to meet their needs.
An AI tool needs proper context to provide useful responses, just like humans need relevant information to make decisions or solve problems. Without the appropriate context, it's difficult for either to act effectively.
Modern generative AI applications rely on large language models built using transformer architectures. These models operate within a "context window," which is the maximum amount of data they can process in a single prompt. Although these windows are limited in size, advancements in AI are steadily increasing their capacity. The type of input data an AI tool uses depends on its specific capabilities.
Since the size of the context window is limited, machine learning engineers face the challenge of deciding which data to include in the prompt and in what order. This process is called prompt engineering, and it helps the AI model generate the most relevant and helpful output.
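One common way to handle the limited window is to add supporting data in order of relevance until a rough context budget is used up. The sketch below counts words for simplicity and uses made-up chunks; real systems count tokens against the model's actual context window.

```python
# Context-budget sketch: add the most relevant chunks first and stop once the
# budget is exhausted, so the prompt never exceeds the context window.

CONTEXT_BUDGET_WORDS = 25

ranked_chunks = [
    "Hold the power button for 10 seconds to reset the device.",
    "The warranty covers manufacturing defects for two years.",
    "The battery lasts up to 12 hours under normal use.",
]

selected, used = [], 0
for chunk in ranked_chunks:          # most relevant first
    cost = len(chunk.split())
    if used + cost > CONTEXT_BUDGET_WORDS:
        break
    selected.append(chunk)
    used += cost

print(f"{used} words of context selected:")
for chunk in selected:
    print("-", chunk)
```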
RAG enhances an AI model's contextual understanding by allowing an LLM to access information beyond its training data and retrieve details from various data sources, including customized ones. By pulling in additional information from these sources, RAG enhances the initial prompt, enabling the AI to generate more accurate and relevant responses.
Unlike traditional keyword searches, a machine learning-based semantic search system uses its training data to recognize the relationships between terms. For instance, in a keyword search, "coffee" and "espresso" might be treated as unrelated terms. However, a semantic search system understands that these words are closely linked through their association with beverages and cafés. As a result, a search for "coffee and espresso" might prioritize results for popular cafés or coffee-making techniques at the top.
When a RAG system uses a customized database or search engine, semantic search helps improve the relevance of the context added to prompts, resulting in more accurate AI-generated outputs.
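The difference can be sketched with a toy comparison: a literal keyword match misses the connection between "coffee" and "espresso", while hand-assigned stand-in embeddings (imitating how a trained model places related terms close together) still score them as similar.

```python
import math

# Keyword matching versus semantic matching. The hand-assigned vectors stand
# in for learned embeddings, where related terms end up close together.

def keyword_match(query: str, document: str) -> bool:
    return any(word in document.lower() for word in query.lower().split())

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embeddings: "coffee" and "espresso" point in similar directions.
embeddings = {
    "coffee":   [0.90, 0.80, 0.10],
    "espresso": [0.85, 0.90, 0.15],
    "laptop":   [0.10, 0.00, 0.95],
}

query, document = "coffee", "best espresso bars in town"
print("keyword match:", keyword_match(query, document))                          # False
print("similarity (coffee, espresso):",
      round(cosine(embeddings["coffee"], embeddings["espresso"]), 2))            # close to 1.0
print("similarity (coffee, laptop):",
      round(cosine(embeddings["coffee"], embeddings["laptop"]), 2))              # much lower
```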
As we discussed before, RAG uses vector databases and embeddings to retrieve relevant information. But what we haven't discussed yet is that RAG doesn't rely solely on embeddings or vector databases.
A RAG system can use semantic search to pull relevant documents from various sources, whether it's an embedding-based retrieval system, a traditional database, or a search engine. The system then formats snippets from those documents and integrates them into the model's prompt.
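For instance, a sketch of that multi-source retrieval might merge results from an embedding-based retriever and a plain keyword lookup, then format them as numbered snippets for the prompt. Both retrievers here are hypothetical stubs.

```python
# Multi-source retrieval sketch: merge snippets from two different kinds of
# retrievers and format them into one context block for the prompt.

def vector_retrieve(query: str) -> list[str]:
    # Stand-in for an embedding similarity search over a vector database.
    return ["Hold the power button for 10 seconds to reset the device."]

def keyword_retrieve(query: str) -> list[str]:
    # Stand-in for a traditional keyword lookup over a FAQ table.
    faq = {
        "reset": "FAQ: A factory reset erases all locally stored settings.",
        "warranty": "FAQ: The warranty covers manufacturing defects for two years.",
    }
    return [text for term, text in faq.items() if term in query.lower()]

def format_context(snippets: list[str]) -> str:
    return "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))

query = "How do I reset the device?"
snippets = vector_retrieve(query) + keyword_retrieve(query)
print(format_context(snippets))
```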
RAG techniques are especially valuable for AI-powered semantic search engines. Integrating these methods into NLP search tools creates advanced systems that not only answer user queries directly but also uncover related information using generative AI.
What makes RAG-powered search engines unique is their ability to process unstructured data, such as legal case files. For example, rather than depending only on exact keyword matches, a semantic search engine can analyze legal documents and answer complex questions, like identifying cases where a particular law was applied under specific circumstances.
In essence, incorporating RAG techniques allows a semantic search engine to deliver precise answers while also detecting patterns and relationships within the data.
RAG systems can also pull information from both external and internal search engines. When paired with an external search engine, RAG can retrieve data from across the internet. Meanwhile, integrating with an internal search engine allows access to organizational resources, such as internal websites or platforms. Combining both types of search engines enhances RAG's capability to deliver highly relevant and comprehensive responses.
For example, imagine a customer service chatbot for an e-commerce company that integrates both an external search engine like Google and an internal search engine designed to access the company's knowledge base.
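One simple way to combine the two is to route each question to the more appropriate source, as in this sketch; the topic list and both search functions are hypothetical placeholders for real integrations.

```python
# Source-selection sketch for the chatbot: company-specific questions go to an
# internal knowledge-base search, everything else to an external web search.

INTERNAL_TOPICS = {"order", "return", "refund", "warranty", "account"}

def internal_search(query: str) -> str:
    return "Internal KB: Returns are accepted within 30 days of delivery."

def external_search(query: str) -> str:
    return "Web: General guidance on comparing shipping carriers."

def retrieve(query: str) -> str:
    words = set(query.lower().split())
    if words & INTERNAL_TOPICS:
        return internal_search(query)   # prefer company data when the topic matches
    return external_search(query)

print(retrieve("What is your return policy?"))
print(retrieve("Which shipping carrier is fastest internationally?"))
```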
RAG's origins date back to the 1970s, when early information-retrieval applications used natural language processing to retrieve information on niche topics. While the main ideas behind text mining have stayed consistent, the technology behind these systems has advanced significantly, making them more effective. By the mid-1990s, services like Ask Jeeves (now Ask.com) popularized question answering with user-friendly interfaces. IBM's Watson brought further attention to the field in 2011 when it beat human champions on the TV game show Jeopardy!
RAG took a major step forward in 2020, due to research led by Patrick Lewis during his doctoral studies in NLP at University College London and his work at Meta's AI lab. Patrick's team aimed to enhance LLMs by integrating a retrieval index into the model, which would allow it to access and incorporate external data dynamically. Inspired by earlier methods and a paper from Google researchers, they envisioned a system capable of generating accurate, knowledge-based text outputs.
When Lewis integrated a promising retrieval system developed by another Meta team, the results exceeded expectations on the first try—an uncommon feat in AI development.
The research, supported by major contributions from Ethan Perez and Douwe Kiela, ran on a cluster of NVIDIA GPUs and demonstrated how retrieval systems could make AI models more accurate, reliable, and trustworthy. The resulting paper has since been cited by hundreds of others, influencing ongoing advancements in the field.
Modern LLMs, powered by ideas like those in RAG, are redefining what's possible in question answering and generative AI. By connecting models to external data sources, RAG helps provide more informed and authoritative responses, blazing a trail for innovation in many other industries.