In the first video on RAG, David O’Dell and I cover a quick intro to RAG, aka Retrieval Augmented Generation, why RAG was needed and what RAG is. In this second video, we do a deep dive into the various components of RAG and then David walks me through a Jupyter notebook showing RAG in action on a subset of whitepapers and blogs from the Dell Technologies Infohub repository.
Since ChatGPT was released in November 2022, GenAI has totally turned the world upside down and is now on everybody’s mind, particularly Conversational AI. Conversational AI is a subset of GenAI and is powered by Large Language Models (LLMs), with ChatGPT being one of those models.
The reason why GenAI is so popular is that it is capable of generating unique content, which, in the case of Conversational AI, means conversations. LLMs are able to understand context and meaning, allowing them to carry conversations and generate new content. Who hasn’t turned to ChatGPT when faced with blank-page syndrome?
LLMs are facing 2 big issues. The first one stems from the very ability that makes them so popular in the first place: their ability to generate content. Why is that a problem? Because there is one sentence that LLMs are not capable of saying, and that sentence is: “I don’t know”. Faced with a question it doesn’t know the answer to, an LLM will do what it does best: it will generate content. That content might or might not be relevant, which is the root of the issue. This is called hallucination.
The second issue is the need for LLMs to be fine-tuned on specific data to be able to offer relevant answers. For instance, if a company wants to use an LLM to power its chatbot, then that LLM will need to be fine-tuned on company-specific data. Unfortunately, this fine-tuning isn’t a one-time activity but an ongoing process, as it is the only way for the model to ingest up-to-date information. While fine-tuning isn’t as resource-intensive as training, it still requires significant resources. The other challenge is that fine-tuning needs to be repeated every time the LLM itself is updated, because the new version won’t contain the parameter and weight values that were adjusted during the previous fine-tuning.
So how do we address those issues?
This is why Retrieval Augmented Generation (RAG) was developed. RAG was introduced in 2020 in a paper by Lewis et al. and addresses the deficiencies of LLMs while keeping all the goodness they offer.
RAG is based on the premise of division of labour. When I ask a question to ChatGPT, at a very high level, 3 things need to happen:
- ChatGPT needs to understand my question,
- ChatGPT needs to retrieve the information to answer the question,
- ChatGPT needs to formulate the answer.
Steps #1 and #3 are right in the wheelhouse of an LLM, but step #2 is where hallucination creeps in: if the LLM can’t find the answer, it will “make something up” that is grammatically correct but most likely flat-out false.
Step #2 is where RAG differs from the traditional LLM architecture. In a RAG architecture, the information requested by the original query isn’t “stored” in the LLM, but in a database. That database isn’t a traditional RDBMS; it is a purpose-built database capable of recording relationships between data points. Those relationships are what allow the database to take a semantic query such as “what is the maximum number of cores in an R960 PowerEdge server?” and return the proper answer. If the database doesn’t have the right information, it simply returns nothing, instead of “making it up” like an LLM would.
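To make this behavior concrete, here is a toy sketch of semantic retrieval, assuming documents have already been embedded as vectors. The 3-dimensional vectors, document texts, and similarity threshold below are hand-made for illustration, not real embeddings or real specs:

```python
# Toy sketch of semantic retrieval over pre-embedded documents.
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical document store: text paired with its embedding vector.
store = [
    ("The PowerEdge R960 supports up to 4 CPUs.", [0.9, 0.1, 0.0]),
    ("Install the bezel before racking the server.", [0.1, 0.8, 0.2]),
]

def retrieve(query_vec, threshold=0.8):
    # Return the best-matching document, or nothing if no document is
    # similar enough -- unlike an LLM, the store never "makes something up".
    best = max(store, key=lambda d: cosine(query_vec, d[1]))
    return best[0] if cosine(query_vec, best[1]) >= threshold else None

print(retrieve([0.88, 0.15, 0.05]))  # close to the first document's vector
print(retrieve([0.0, 0.0, 1.0]))     # nothing similar enough: returns None
```

The key point is the threshold check: where an LLM would generate *something*, the retriever is allowed to come back empty-handed.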
The type of database used in RAG can be divided into 2 categories:
- Vector database
- Graph database
Both are able to take a document, like a PDF spec sheet, and create relationships between the words of the document, extracting meaning through those relationships. In a vector database, the data is stored as high-dimensional vectors, which are numerical representations of the features or attributes of that data. In a graph database, the data is represented and stored as nodes, edges and properties, with the graphs representing the relationships between the various data points. Because of the difference in how they store and represent data, vector and graph databases have different strengths. A vector database is typically better at finding patterns and answering simpler queries, whereas a graph database is typically better at finding relationships and answering more complex queries. Which type you use in your RAG architecture depends on your data and the type of queries being made.
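The contrast can be sketched with plain Python structures standing in for real databases. All names and values below are illustrative, not actual product specs:

```python
# Vector store: each item is a numerical feature vector, and queries work
# by similarity (as in the earlier retrieval sketch).
vector_store = {
    "R960 spec sheet": [0.9, 0.1, 0.3],
    "R760 spec sheet": [0.8, 0.2, 0.4],
}

# Graph store: nodes connected by labeled edges (relationships).
# Represented here as (node, relation) -> target; values are illustrative.
graph_store = {
    ("R960", "is_a"): "PowerEdge server",
    ("R960", "max_cores"): "240",
    ("R960", "cpu"): "Intel Xeon",
}

def graph_query(node, relation):
    # A graph query walks relationships directly instead of
    # comparing vectors, which is why it handles multi-hop,
    # relationship-heavy questions well.
    return graph_store.get((node, relation))

print(graph_query("R960", "max_cores"))  # follows the edge, prints 240
```

In a real graph database you would express this with a query language such as Cypher, but the shape of the lookup, following named relationships from a node, is the same.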
One of the issues with LLMs is the cost to keep them up-to-date on content. In RAG, this is done by simply ingesting the newer data into the database. Once the data is ingested, it is readily available for querying.
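A minimal sketch of that ingestion loop, using a trivial word-set in place of a real embedding model (the chunk size and search logic are illustrative assumptions):

```python
# Keeping a RAG document store current: ingest new text by chunking and
# indexing it; it is immediately available for retrieval, with no
# retraining or fine-tuning of the model.

store = []  # list of (chunk_text, word_set) pairs

def ingest(document, chunk_size=8):
    # Split the document into fixed-size word chunks and index each one.
    words = document.split()
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        store.append((chunk, set(w.lower() for w in chunk.split())))

def query(text):
    # Crude keyword-overlap search over the indexed chunks; a real
    # system would use embedding similarity instead.
    q = set(w.lower() for w in text.split())
    best = max(store, key=lambda c: len(q & c[1]), default=(None, set()))
    return best[0]

ingest("The new PowerEdge server was announced today with higher core counts.")
print(query("new server announcement"))  # the fresh data is queryable at once
```

Contrast this with fine-tuning: nothing about the model changed, yet the system can now answer questions about the new document.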
What are the components of RAG?
RAG can be split into 3 major components:
- A Retriever
- A Document Store
- A Generator
Here is how it was illustrated in the original paper:

A more simplistic view looks like this:

In a nutshell, the retriever takes the initial query or input, i.e. what the user wants to know, and, instead of passing it straight to the generator as ChatGPT does, sends it to the database or document store, which returns a number of documents matching the input. That document store can be either a vector database or a graph database. Those documents, along with the initial query, are then sent to the generator, typically a seq2seq model, to produce the response. By combining the initial query with the returned documents, the retriever effectively provides the generator with the appropriate context to avoid hallucinations. The generator draws the right information from the returned documents instead of relying on what is stored within itself, and only needs to generate the response based on both inputs.
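The flow above can be sketched end to end. Here `retrieve` and `generate` are stand-ins for a real document store and a real LLM, and the document snippets are illustrative, not actual specs; the interesting part is how the retrieved documents are folded into the prompt:

```python
def retrieve(query):
    # Stand-in for a vector/graph database lookup (illustrative snippets).
    return ["The R960 supports up to 4 processors.",
            "Each processor offers up to 60 cores."]

def generate(prompt):
    # Stand-in for a seq2seq model; a real LLM would answer from the
    # context blocks embedded in the prompt.
    return f"(answer grounded in {prompt.count('Context:')} retrieved snippets)"

def rag_answer(query):
    docs = retrieve(query)
    # The retrieved documents become the context that grounds the answer.
    context = "\n".join(f"Context: {d}" for d in docs)
    prompt = f"{context}\nQuestion: {query}\nAnswer:"
    return generate(prompt)

print(rag_answer("How many cores can the R960 have?"))
```

The generator never needs to "know" the answer itself; it only needs to read it out of the context it was handed.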
RAG’s downside, which is also one of its benefits, is that there is no prescribed stack to build it. The benefit is that each component can be tailored to the use case: it is very easy to, for instance, swap a vector database for a graph database and vice versa. The challenge with this flexibility is that it makes implementing RAG fairly complex, as there is a plethora of options for each component and it can be difficult to choose the right one.
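One common way to get that swappability, shown here as a minimal sketch with illustrative class names, is to put each component behind a small interface so that replacing the document store is a one-line change:

```python
from typing import Protocol

class Retriever(Protocol):
    # Any object with this method can serve as the retrieval component.
    def retrieve(self, query: str) -> list[str]: ...

class VectorRetriever:
    def retrieve(self, query: str) -> list[str]:
        return [f"vector match for: {query}"]  # stand-in for a vector DB

class GraphRetriever:
    def retrieve(self, query: str) -> list[str]:
        return [f"graph match for: {query}"]   # stand-in for a graph DB

def build_pipeline(retriever: Retriever):
    # The rest of the pipeline never knows which back end it is using.
    def answer(query: str) -> str:
        docs = retriever.retrieve(query)
        return " | ".join(docs)  # generator stand-in
    return answer

# Swapping back ends is just a different constructor argument:
print(build_pipeline(VectorRetriever())("R960 cores"))
print(build_pipeline(GraphRetriever())("R960 cores"))
```

Frameworks in this space tend to follow the same pattern: the pipeline depends on an interface, not on a particular database.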
How Can Dell Help

Dell Services experts help you realize the value of GenAI (be it a RAG framework or some other) with a portfolio of services to assist you at every stage of your GenAI journey.
If you’re considering a RAG framework, bear in mind that inferencing will still need to be carried out, as the LLM still needs to generate a response. Dell Professional Services can offer a framework and work with you to achieve your goals.
“STRATEGIZE: Build your roadmap”
- Dell Advisory Services for Generative AI build a strategy and roadmap, gaining consensus on your high-priority use cases
“IMPLEMENT: Deploy a full-stack inferencing solution”
- Dell Implementation Services for Generative AI deploy an inferencing solution using the Dell Validated Design for Generative AI with NVIDIA, as reflected in the dark blue layers in the middle of this page
“ADOPT: Apply solution to your use cases”
- Dell Adoption Services for Generative AI apply the deployed inferencing solution to your priority use cases so that your business can see tangible Generative AI benefits
“SCALE: Manage operations efficiently”
- Dell Scaling Services for Generative AI provide operational expertise with Education Services for AI, resident experts and managed services
- More information on Dell’s Generative AI services can be found here
- Services overview can be found here
To Close
By now, the question of whether you need RAG should be an easy one to answer. While LLMs are extremely powerful tools for creating content, it turns out that, without the proper context, they can struggle and become “creative”. RAG is a great way to leverage existing technology, such as vector or graph databases, to provide the LLM with the right context and information to prevent it from hallucinating, while still letting it do what it does best: offering answers that sound like they are coming from a human.
Resources
- Code example for this notebook and others at: https://github.com/dell-examples
- Dell Technologies Infohub: https://infohub.delltechnologies.com/
- Original inspiration repo: https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain

