What is RAG?

To understand the concept of retrieval-augmented generation (RAG) in generative AI, imagine a newsroom. An experienced journalist can write articles on various topics using their general knowledge and understanding of the subject matter.

However, when covering a complex story, such as an investigative piece or a technical subject, they rely on researchers to dig through archives, reports, and interviews for accurate, relevant information.

Similarly, large language models (LLMs) generate responses to a wide range of questions, but to provide precise and well-sourced answers, they need a helper to gather the right data. In this analogy, RAG plays the role of the newsroom researcher: a process that combines data retrieval with the model's generative power.

More specifically:

Retrieval-augmented generation is an AI approach that combines the functionality of conventional information retrieval systems, such as search engines and databases, with the advanced text-generation abilities of LLMs. Large language models capture patterns of how humans use language to form sentences, which allows them to respond quickly to a wide range of questions. However, they may struggle when users need detailed or up-to-date information on a specific topic. RAG bridges this gap by combining two processes, data retrieval and text generation, to deliver accurate and relevant AI-generated responses.

How does RAG work?

To understand how information is processed and retrieved in a RAG AI system, let's explore vector databases and how RAG retrieves data from different sources.

Vector databases are specifically designed to store embeddings: numerical vectors that represent real-world objects, such as words, images, or videos, in a form machine learning models can work with. Embeddings are created from source material such as documentation and code, and a RAG system can search them efficiently, measuring the similarity between vectors to retrieve the most relevant information from the database.

When a user asks an LLM a question, the model sends the query to a system that translates it into an embedding, which is then compared to the vectors stored in a machine-readable index of a knowledge base. If there are matches, the system retrieves the human-readable text associated with those vectors and sends it to the LLM. The LLM then combines this retrieved information with its own knowledge to create a final answer for the user, sometimes including references to the sources it drew on.
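To make the similarity comparison concrete, here is a minimal Python sketch of the matching step. The embeddings below are random placeholders standing in for the output of a real embedding model, and a production system would use a vector database rather than a plain list:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Score how close two embeddings point in the same direction (1.0 = most similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "knowledge base": text chunks paired with placeholder embeddings.
# In a real system these vectors come from an embedding model, not a random generator.
rng = np.random.default_rng(0)
knowledge_base = [
    ("Hold the power button for 10 seconds to factory-reset the device.", rng.normal(size=384)),
    ("The warranty covers manufacturing defects for two years.", rng.normal(size=384)),
]

def retrieve(query_embedding: np.ndarray, top_k: int = 1) -> list[tuple[float, str]]:
    """Return the chunks whose embeddings are most similar to the query embedding."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in knowledge_base]
    return sorted(scored, reverse=True)[:top_k]
```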

For instance, let's explore a scenario where a company wants to implement a customer support assistant powered by RAG. The assistant is designed to answer questions about the company's products, pulling information from a vector database populated with product manuals, FAQs, and troubleshooting guides.

Here's how it would work (a code sketch of the full flow follows the steps):

  1. It all starts with a user query. For example, a customer might ask, "How can I reset my device to factory settings?"
  2. Following the user's query, the system converts it into a vector representation (a numerical format) using an embedding model.
  3. The query vector is then compared to the vectors in the database, which usually represent chunks of text from the product documents.
  4. The system retrieves relevant passages from the vector database and adds them to the input prompt for the large language model.
  5. Finally, the response is generated. The LLM combines the retrieved data with its own knowledge to prepare a complete, accurate response for the customer, such as detailed steps for resetting the device.
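Put together, the five steps might look roughly like this in Python. Every component here (the embedding function, the vector search, and the LLM call) is a placeholder passed in as an argument rather than a real service:

```python
from typing import Callable, List, Sequence

def answer_customer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],              # step 2: query -> embedding
    search: Callable[[Sequence[float], int], List[str]],  # step 3: embedding -> top-k passages
    generate: Callable[[str], str],                       # step 5: prompt -> answer
) -> str:
    """Wire the five steps together; the three callables stand in for real components."""
    query_vector = embed(question)          # 2. convert the query into a vector
    passages = search(query_vector, 3)      # 3. similarity search against the vector database
    context = "\n\n".join(passages)         # 4. gather the retrieved passages
    prompt = (
        "Answer the customer's question using the documentation below.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)                 # 5. the LLM produces the grounded answer
```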

Behind the scenes, the pipeline keeps these machine-readable indexes, also known as vector databases, up to date: as knowledge bases are added or updated, the new content is run through the embedding model and the index is refreshed.

The RAG process

1. Organizing data

In RAG, data is organized like books in a library, making information easier to locate. This is achieved through indexing, which categorizes data by exact word matches, themes, or metadata such as topic, author, date, or keywords. Indexes are usually refreshed as new data becomes available, and there are several ways to build them:

  • Lexical indexing is a method that organizes data by exact word or phrase matches. It’s fast and precise but it can miss related information that doesn’t exactly match the query.
  • Vector indexing is an approach that uses numerical vectors, or embeddings, to represent the meaning of words or phrases. While it’s slower and less precise, it can uncover related data even without an exact match.
  • Hybrid indexing combines exact matches and numerical vectors, benefiting from the strengths of both lexical and vector indexing. It improves the accuracy and variety of retrieved data; a simple scoring sketch follows this list.
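To illustrate the hybrid idea, here is a toy Python sketch that blends a lexical score (exact word overlap) with a vector score (cosine similarity between embeddings). Real systems typically use more robust methods such as BM25 and approximate nearest-neighbor search, so treat the weighting and scoring here as illustrative only:

```python
import numpy as np

def lexical_score(query: str, document: str) -> float:
    """Fraction of query words that appear verbatim in the document (toy lexical match)."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)

def vector_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Cosine similarity between query and document embeddings (toy semantic match)."""
    return float(np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))

def hybrid_score(query: str, document: str, query_vec: np.ndarray, doc_vec: np.ndarray,
                 alpha: float = 0.5) -> float:
    """Blend both signals; alpha controls how much weight the exact-match score gets."""
    return alpha * lexical_score(query, document) + (1 - alpha) * vector_score(query_vec, doc_vec)
```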

Typically, RAG pipelines store documents and break them into smaller chunks. Each chunk is then represented as a vector (embedding) that captures its core meaning, enabling more effective retrieval and response generation.
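A minimal chunking sketch is shown below. Production pipelines usually split on sentences, paragraphs, or tokens and tune the sizes carefully; fixed-size character windows with a small overlap are simply the easiest way to show the idea:

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character windows so each chunk keeps some context."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```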

2. Input query processing

This step focuses on refining the user's question to align it more closely with the indexed data. The query is simplified and optimized to improve the search process. Effective query processing helps RAG identify the most relevant information.

When working with vector indexes, the refined query is converted into an embedding, which is then used to perform the search.
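As a toy illustration, query refinement can be as simple as normalizing the text and dropping filler words before the embedding step. Real systems often go further, for example using an LLM to rewrite or expand the query; the stop-word list below is purely hypothetical:

```python
# Hypothetical filler words to drop; a real system would use a proper stop-word list
# or let an LLM rewrite the query.
STOP_WORDS = {"how", "can", "i", "the", "a", "my", "to", "please"}

def refine_query(raw_query: str) -> str:
    """Toy query cleanup: lowercase the text and drop filler words before embedding it."""
    words = [w for w in raw_query.lower().strip("?!. ").split() if w not in STOP_WORDS]
    return " ".join(words)

print(refine_query("How can I reset my device to factory settings?"))
# -> "reset device factory settings"
```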

3. Searching and ranking 

Once the question is clear, RAG searches its indexed data to gather the most relevant context for the answer. The search method depends on how the data is stored. For vector-based searches, RAG calculates the distance between the query's vector and the vectors of the document chunks to find the closest matches.

This search generates many potential results. However, not all of these are suitable for use by the LLM, so they need to be filtered and prioritized. This process is similar to how search engines like Google work. When you search for something, you get multiple pages of results, but the most relevant ones are organized and displayed on the first page.

RAG applies a ranking system to its search results, filtering out less relevant data. Each result is scored based on how closely it matches the query, and only the highest-scoring results are passed on to the generation stage, thus making sure that the AI uses the most relevant information to create a response.
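A simple way to picture the ranking stage: drop results below a similarity threshold, then keep only the top few. The threshold and cut-off below are arbitrary example values:

```python
def rank_results(scored_chunks: list[tuple[float, str]],
                 top_k: int = 3, min_score: float = 0.75) -> list[str]:
    """Drop weak matches, then keep only the top-k highest-scoring chunks."""
    strong = [(score, text) for score, text in scored_chunks if score >= min_score]
    strong.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in strong[:top_k]]

results = [
    (0.91, "Hold the power button for 10 seconds."),
    (0.42, "The warranty covers manufacturing defects."),
    (0.88, "Open Settings and choose Reset."),
]
print(rank_results(results))
# -> ['Hold the power button for 10 seconds.', 'Open Settings and choose Reset.']
```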

4. Prompt augmentation

After selecting the most relevant information, RAG combines it with the original question to improve the prompt. This added context helps the LLM better interpret and respond to the query. The LLM incorporates current and specific details so that its response goes beyond general knowledge, making it more accurate.
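One possible shape for the augmented prompt is sketched below. The exact wording and formatting are a design choice, and numbering the passages is just one way to let the model cite its sources:

```python
def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Prepend the retrieved passages, numbered so the model can cite its sources."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using the numbered context passages below, "
        "and cite the passage numbers you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```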

5. Response generation

The final step is where the AI truly shines. Equipped with the enriched prompt, the LLM uses its advanced language capabilities to craft an answer. The generated response is not just generic—it’s informed by the precise and up-to-date information gathered earlier, making it highly relevant and reliable.

RAG and fine-tuning

Most organizations don't build their own AI models from the ground up. Instead, they adapt pre-trained models to fit their needs using methods like RAG or fine-tuning. Fine-tuning involves modifying the model's internal weights to create a highly specialized version adapted to a specific task. This method works well for organizations dealing with heavily specialized scenarios. It's worth mentioning that fine-tuning is a meticulous process that should be approached with care: assembling a dataset for additional training is demanding, and there is always a risk of degrading the model so that it performs worse than it did before.

RAG, on the other hand, skips weight adjustments. Instead, it retrieves information from various data sources to enhance a prompt, letting the model produce more context-aware responses for the user.

Some organizations use RAG as a starting point and then fine-tune the model for more specialized tasks. Others find that RAG alone is sufficient for customizing AI to meet their needs.

How AI models use context

An AI tool needs proper context to provide useful responses, just like humans need relevant information to make decisions or solve problems. Without the appropriate context, neither can act effectively.

Modern generative AI applications rely on large language models built using transformer architectures. These models operate within a "context window," which is the maximum amount of data they can process in a single prompt. Although these windows are limited in size, advancements in AI are steadily increasing their capacity. The type of input data an AI tool uses depends on its specific capabilities.

Since the size of the context window is limited, machine learning engineers face the challenge of deciding which data to include in the prompt and in what order. This process is called prompt engineering, and it helps the AI model generate the most relevant and helpful output.
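A small sketch of one such prompt-engineering decision: greedily keeping the highest-ranked chunks until a rough token budget is used up. Word count is only a crude stand-in for real tokenization, and the budget below is an arbitrary example:

```python
def fit_to_context_window(ranked_chunks: list[str], max_tokens: int = 3000) -> list[str]:
    """Greedily keep the highest-ranked chunks until the rough token budget is used up."""
    selected, used = [], 0
    for chunk in ranked_chunks:            # assumes chunks are already sorted by relevance
        cost = len(chunk.split())          # word count as a crude stand-in for tokens
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected
```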

RAG enhances an AI model's contextual understanding by allowing an LLM to access information beyond its training data and retrieve details from various data sources, including customized ones. By pulling in additional information from these sources, RAG enhances the initial prompt, enabling the AI to generate more accurate and relevant responses.

RAG and semantic search

Unlike traditional keyword searches, a machine learning-based semantic search system uses its training data to recognize the relationships between terms. For instance, in a keyword search, "coffee" and "espresso" might be treated as unrelated terms. However, a semantic search system understands that these words are closely linked through their association with beverages and cafés. As a result, a search for "coffee and espresso" might prioritize showing results for popular cafés or coffee-making techniques at the top.

When a RAG system uses a customized database or search engine, semantic search helps improve the relevance of the context added to prompts, resulting in more accurate AI-generated outputs.

As we discussed before, RAG uses vector databases and embeddings to retrieve relevant information. But what we haven't discussed yet is that RAG doesn't rely solely on embeddings or vector databases.

A RAG system can use semantic search to pull relevant documents from various sources, whether it's an embedding-based retrieval system, a traditional database, or a search engine. The system then formats snippets from those documents and integrates them into the model's prompt.

What is a RAG search engine?

RAG techniques are especially valuable for AI-powered semantic search engines. Integrating these methods into NLP search tools creates advanced systems that not only answer user queries directly but also uncover related information using generative AI.

What makes RAG-powered search engines unique is their ability to process unstructured data, such as legal case files. For example, rather than depending only on exact keyword matches, a semantic search engine can analyze legal documents and answer complex questions, like identifying cases where a particular law was applied under specific circumstances.

In essence, incorporating RAG techniques allows a semantic search engine to deliver precise answers while also detecting patterns and relationships within the data.

RAG systems can also pull information from both external and internal search engines. When paired with an external search engine, RAG can retrieve data from across the internet. Meanwhile, integrating with an internal search engine allows access to organizational resources, such as internal websites or platforms. Combining both types of search engines enhances RAG's capability to deliver highly relevant and comprehensive responses.

For example, imagine a customer service chatbot for an e-commerce company that integrates both an external search engine like Google and an internal search engine designed to access the company's knowledge base.

  • The external search engine allows the chatbot to retrieve up-to-date information from the web, such as the latest shipping regulations or a competitor's holiday sale policies. For example, a user might ask, "What are the current rules for international shipping to Europe?" The chatbot leverages Google to provide accurate and current details.
  • The internal search engine, however, enables the chatbot to access private data, such as the company's specific shipping policies or order tracking details. Without this internal search engine, the chatbot wouldn't be able to answer questions like, "What are my options for expedited shipping on my current order?" unless the customer explicitly referenced their order number or details.
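Below is a minimal sketch of how such a chatbot might route a question to one backend or the other. The hint phrases and the two search functions are hypothetical placeholders; real systems often query both sources and merge the results:

```python
from typing import Callable, List

# Hypothetical phrases suggesting the question is about the company's own data.
INTERNAL_HINTS = ("my order", "my account", "expedited shipping", "return policy")

def route_search(
    question: str,
    search_internal: Callable[[str], List[str]],  # company knowledge base / order system
    search_external: Callable[[str], List[str]],  # public web search
) -> List[str]:
    """Pick a retrieval backend based on the question; many systems query both and merge."""
    if any(hint in question.lower() for hint in INTERNAL_HINTS):
        return search_internal(question)
    return search_external(question)
```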

Some RAG history

RAG's origins date back to the 1970s, when early natural language processing applications were used to retrieve information on niche topics. While the main ideas behind text mining have stayed consistent, the technology behind these systems has advanced significantly, making them more effective. By the mid-1990s, services like Ask Jeeves (now Ask.com) popularized question-answering with user-friendly interfaces. IBM's Watson brought further attention to the field in 2011 when it beat human champions on the TV game show Jeopardy!

RAG took a major step forward in 2020, thanks to research led by Patrick Lewis during his doctoral studies in NLP at University College London and his work at Meta's AI lab. Lewis's team aimed to enhance LLMs by integrating a retrieval index into the model, allowing it to access and incorporate external data dynamically. Inspired by earlier methods and a paper from Google researchers, they envisioned a system capable of generating accurate, knowledge-based text outputs.

When Lewis integrated a promising retrieval system developed by another Meta team, the results exceeded expectations on the first try—an uncommon feat in AI development.

The research, supported by major contributions from Ethan Perez and Douwe Kiela, ran on a cluster of NVIDIA GPUs and demonstrated how retrieval systems could make AI models more accurate, reliable, and trustworthy. The resulting paper has since been cited by hundreds of others, influencing ongoing advancements in the field.

Modern LLMs, powered by ideas like those in RAG, are redefining what's possible in question-answering and generative AI. By connecting models to external data sources, RAG helps provide more informed and authoritative responses, blazing a trail for innovation across many other industries.
