RAG with LLM

Sebastien Sime
15 min read · Dec 10, 2023

A technique to combine Large Language Models with your own data.

Generated with DALL·E (ULearn-training is a fictional company)

With the ChatGPT storm, we must anticipate that large language models will change our habits just as the internet did before them. We can deny that our learning and exploration experience will never be the same, or we can learn how to harness the potential of such tools.

Have you ever wondered how you can use large language models (LLMs) to generate texts that are relevant to your specific domain or data?

LLMs are amazing models that can produce natural language text on almost any topic, but they may not always have the facts or details that you need. For instance, if you want to write an internal company article about the latest technology you engineered, you may not find your product specifications in the LLM’s memory (and for good and obvious reasons).

This is where Retrieval Augmented Generation (RAG) comes in handy. RAG is a technique that allows you to use your own data sources to supplement the LLM’s knowledge and generate more accurate and informative texts. In this article, we will briefly explain what RAG is, how it works, and how you can use it with LLMs to create amazing content.

However, before diving deep into what RAG is, we need a clear mental picture of the different components we will use to implement it efficiently.

The Generative AI project life cycle

Generally speaking, RAG comes into play when generative AI models like ChatGPT or LLaMA are used to address the specific needs of companies. More specifically, companies with lots of reports or historical data that must be processed and used in day-to-day operations are among the first to need the capabilities of such tools.

So using RAG is not only a developer task but also a business decision and a project that needs to be carried out within a tailored framework in order to succeed.

Below, you can see a project framework that can be used in any generative AI project :

Gen AI life cycle (simplified)

We have to understand that RAG is just a technique that will result in the creation of an LLM-powered application. And before having the application up and running, we will need to :

  • Define the use case : for our use case, we will take a fictional start-up named ULearn-training, specialized in training in Data Science and Data Engineering, that wants to build a chatbot to interactively answer clients’ frequently asked questions.
  • Select a model or decide to train our own : here we will use ChatGPT as our base model, since it is easy to use. However, at this step, you could benchmark several LLMs and compare their performance. On the Hugging Face Hub you can find a great diversity of models that you can freely use for this purpose.
  • Validate that the model suits our needs at the adapt-and-align step, using prompt engineering or fine-tuning : in the context of RAG we won’t need fine-tuning since we are only interested in building a prototype, but keep in mind that in some cases you will need to fine-tune a model.

So now that we know what the purpose is, we will use the RAG technique to augment our model.

The RAG methodology and benefits

Retrieval-Augmented Generation (RAG) combines the strengths of two approaches: retrieval-based and generation-based methods. It aims to enhance the accuracy and reliability of generative AI models by supplementing them with information obtained from external sources (e.g. the documentation of a company’s catalog).

Retrieval-based methods primarily use a vast corpus of text to retrieve relevant snippets to create responses. This approach excels in swiftly providing factual and accurate information. However, it might occasionally lack creativity and fluency.

In contrast, generation-based methods, also known as generative AI models or large language models (LLMs), generate text directly. This approach can produce more creative and fluent responses, but it can also be susceptible to inaccuracies and inconsistencies.

RAG seamlessly merges these two approaches by first retrieving relevant text snippets and then employing an LLM to generate a new response informed by the retrieved information. This process can significantly enhance the accuracy, reliability, and fluency of generative AI models.

The RAG workflow consists of three main steps :

  1. Retrieval : This phase involves identifying and retrieving relevant text snippets from an external knowledge source. This can be done using various techniques, such as keyword-based search, semantic similarity search, or neural network-based retrieval (a minimal similarity-search sketch follows this list).
  2. Augmentation : Once relevant snippets are retrieved, they are used to augment the generation process. This can be done in various ways, such as providing additional context, improving the accuracy of the response, or enhancing its fluency.
  3. Generation : This phase involves using the augmented information to generate a new response. This can be done using a variety of techniques, such as template-based generation, statistical language models, or neural network-based generation.
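To make the retrieval step concrete, here is a minimal, self-contained sketch of a semantic similarity search: documents and the query are represented as vectors and ranked by cosine similarity. The tiny 3-dimensional vectors are purely illustrative; a real system would obtain much higher-dimensional embeddings from an embedding model.

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, docs: list[str], k: int = 2):
    """Rank documents by cosine similarity to the query embedding."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:k]          # indices of the k best matches
    return [docs[i] for i in top]

# Toy 3-dimensional "embeddings" just to show the mechanics.
docs = ["refund policy", "course duration", "certification"]
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]])
query_vec = np.array([0.1, 0.9, 0.1])         # a question about course length
print(retrieve(query_vec, doc_vecs, docs))    # -> ['course duration', 'certification']
```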

RAG offers several benefits over traditional approaches :

  • Reduced Risk of Hallucinations : Hallucinations are common in generative AI models, where they produce text that is not grounded in reality. RAG mitigates this issue by providing models with access to external knowledge sources to ensure their responses are based on accurate information.
  • Improved Factual Accuracy : RAG enhances the factual accuracy of responses by ensuring that models base their responses on reliable information retrieved from external sources.
  • Enhanced Fluency : RAG improves the fluency of responses by providing models with access to a wider range of language structures and expressions found in the retrieved text snippets.
  • Reduced Reliance on Retraining : RAG reduces the need for frequent retraining of generative AI models, as it can access and incorporate the latest information from external sources without requiring complete model updates.
  • Question Answering : RAG can enhance the accuracy of question answering systems by providing them with access to knowledge bases, external documents, and other relevant information.
  • Summarization : RAG can improve the fluency and accuracy of summarization systems by providing them with access to summaries of other relevant documents and sources.

The RAG flow diagram

The diagram below shows the main components of a simple RAG architecture.

Simple RAG architecture

The general workflow of the Retrieval-Augmented Generation (RAG) technique involves a series of steps that incorporate external information to enhance the responses of a Large Language Model (LLM). Here’s how the workflow unfolds (a short code sketch of the prompting and generation steps follows the list):

  1. Input (Question) : The process starts when a user inputs a question or query that they want the model to answer.
  2. Prompt Templating : The question is then reformatted using a prompt template. This template is designed to convert the user’s query into a structured prompt that will include some guidelines and the business logic we want the model to follow.
  3. Prompt : This structured version of the user’s question, now a prompt, is ready to be sent to the LLM.
  4. Embedding Model : The question and the company documents are passed through an embedding model, which converts the textual data into a high-dimensional vector space. This numerical representation captures the semantic meaning of the prompt, allowing it to be compared with other similarly embedded pieces of text.
  5. Similarity Search : Next, the embeddings of the question and of the documents are used to perform a similarity search within a Document Store or Vector Database. This database contains pre-stored embeddings of a large corpus of documents.
  6. Document Retrieval : The similarity search retrieves a set number (K) of the nearest neighbor documents that are most semantically similar to the question. These documents are expected to contain information that is relevant to the query.
  7. Integration with LLM : The retrieved documents, along with the original prompt, are then fed into the Large Language Model. The LLM uses both the prompt and the content of these documents to generate a response. The integration of the retrieved data allows the LLM to produce more informed and contextually relevant answers, as it can draw upon a wider base of knowledge.
  8. Sampling and Generation : The LLM then generates potential answers based on the combined input. Through a sampling process, it selects the most appropriate response. This step might involve techniques like beam search or top-k sampling to find the answer that best fits the context given by the prompt and the retrieved documents.
  9. Output (Answer) : The final step is the output of the answer. This is the LLM’s response to the initial query, enhanced by the information from the retrieved documents, aiming to be more accurate, comprehensive, and contextually relevant than a response generated without the augmentation process.
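To illustrate the prompt templating and generation steps in code, here is a hedged sketch using v1-style calls of the OpenAI Python SDK. The prompt template, the helper function and the temperature are illustrative assumptions, and the retrieved chunks are supposed to come from a similarity search like the one sketched earlier.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Steps 2-3: a hypothetical prompt template carrying the business logic.
PROMPT_TEMPLATE = """You are ULearn-training's FAQ assistant.
Answer the client's question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def answer(question: str, retrieved_chunks: list[str]) -> str:
    # Step 7: merge the retrieved documents with the templated prompt.
    prompt = PROMPT_TEMPLATE.format(
        context="\n\n".join(retrieved_chunks), question=question
    )
    # Steps 8-9: let the LLM sample a completion and return the answer.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content
```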

Technology stack

With the above process in mind, we now know that we need to have the following components :

  • An LLM and an embedding model : in our case this will be GPT-3.5-Turbo. Note that here we could decide to use two different models.
  • A vector store (or a database to store textual embedding representations) : We will use the free tier of Pinecone.
  • A processing environment : we could have used a local computer and Python, but nowadays you can build such a system very easily, without having to code everything, with a SaaS solution like Langflux, which offers a limited version that you can use after simply creating an account. However, in this case I combined Render (a SaaS service that helps you deploy applications very easily) and Flowise (the open-source version of Langflux that provides a no-code graphical interface to create LLM-powered applications).

If you decide to use Langflux, this video will describe all the necessary steps to build the application from scratch. On the other hand, if you want a more technical challenge using Render and Flowise, here is the documentation that will help you deploy Flowise on Render.
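If you instead prefer the "local computer and Python" option mentioned above, a minimal setup could look like the sketch below. The index name is hypothetical, and the Pinecone calls follow the pre-2024 `pinecone-client` API, which has since changed.

```python
import os

import pinecone                       # pre-2024 pinecone-client API
from openai import OpenAI

# Both API keys are assumed to be stored as environment variables.
# One OpenAI client serves both the chat model and the embedding model.
llm_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="gcp-starter",        # the free-tier environment name may differ
)
index = pinecone.Index("ulearn-faq")  # hypothetical index name
```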

Building the application

Let’s remember our main goal : build a chatbot for ULearn-training, a fictional data science and data engineering start-up, to interact with clients about their FAQ (frequently asked questions). ULearn-training hopes to reduce the time spent answering questions on the phone in order to focus more on the learning experience of its clients.

Step 1 : creating the company documentation

To build the company documentation, I used ChatGPT to generate the 10 most frequently asked questions with the following (far from perfect but effective for a prototype) prompt :

### PROMPT USED :
--------
Create for a fictional training company a FAQ with potential and answers that client could have. The training company propose trainaing in Data Science and Data Engineering. The founder, John, after struggling to find a suitable training for himself built a company to address specific needs that client could have. The company is in Paris, France.
before writing the questions, give a brief introduction about John, his company called ULearn-training, what is different about this company and give 10 Q&A as the FAQ

### MUCH BETTER PROMPT (NOT USED BUT THAT YOU SHOULD PREFER):
--------
Write a FAQ page for a fictional training company that offers courses in Data Science and Data Engineering.
The company is called ULearn and it is based in Paris, France.
The founder, John, started the company after he couldn’t find a suitable training for himself.
He wanted to create a company that could address the specific needs of clients who want to learn or improve their skills in data-related fields.
Before writing the questions and answers, give a brief introduction about John, his company, and what makes it different from other training companies.
Write 10 questions and answers that cover the most common topics that clients might ask about ULearn-training’s courses, such as the content, the duration, the price, the instructors, the certification, and the support.
Use bullet points to organize the questions and answers, and use clear and concise language.

Below is the result I got from ChatGPT :

### ULearn-training: Revolutionizing Data Science and Data Engineering Training

Founded in the heart of Paris, France, ULearn-training is the brainchild of John, a seasoned Machine Learning Engineer. Frustrated with the lack of tailored training options in Data Science and Data Engineering, John envisioned a training hub that goes beyond the traditional one-size-fits-all approach. With his rich experience in complex architectures, John has infused ULearn with a unique blend of practicality and innovation.

ULearn-training stands out for its:

1. **Customized Learning Paths:** Courses are tailored to meet specific career goals and skill levels.
2. **Industry-Relevant Curriculum:** Designed by experts, it stays updated with the latest industry trends.
3. **Hands-On Approach:** Emphasizes practical projects and real-world applications.
4. **Small Class Sizes:** Ensures personalized attention and better learning outcomes.
5. **Networking Opportunities:** Connects students with industry professionals in Paris and beyond.

### ULearn-training's FAQ

1. **Q: What courses does ULearn offer?**
A: ULearn specializes in Data Science and Data Engineering courses, covering fundamentals to advanced topics like machine learning algorithms, big data processing, and cloud computing.

2. **Q: Are the courses suitable for beginners?**
A: Yes, we offer beginner-level courses and gradually progress to more advanced topics, ensuring a solid foundation for all learners.

3. **Q: How long are the courses?**
A: Course durations vary. Short courses last a few weeks, while comprehensive programs can span several months.

4. **Q: Do you offer online or in-person classes?**
A: Both. Students can choose from in-person classes in Paris or online courses for remote learning.

5. **Q: What makes ULearn's teaching approach unique?**
A: Our courses are project-driven, focusing on real-world problems. This hands-on approach, combined with expert guidance, prepares students for industry challenges.

6. **Q: Are there networking opportunities with industry professionals?**
A: Yes, we regularly host meetups, guest lectures, and workshops with industry leaders.

7. **Q: Is there a certification upon course completion?**
A: Yes, students receive a ULearn certification, which is highly regarded in the industry.

8. **Q: Can I get help with job placements after completing a course?**
A: While we don’t offer direct placement services, our courses are designed to enhance employability, and we provide guidance for job hunting.

9. **Q: What is the student-to-teacher ratio in classes?**
A: We maintain small class sizes, typically not exceeding 15 students per instructor, to ensure personalized attention.

10. **Q: Are the course materials accessible after course completion?**
A: Yes, students have lifelong access to the course materials, including updates.

### About John and His Vision

**John: The Architect Behind ULearn**

John, a distinguished Machine Learning Engineer, is the founder of ULearn. His journey into the world of Data Science and Data Engineering was marked by challenges and a lack of tailored training opportunities. This experience fueled his desire to create ULearn, aiming to bridge the gap between theoretical knowledge and practical industry skills.

His vision for ULearn is not just about imparting education; it's about nurturing a community of data enthusiasts who are well-equipped to tackle real-world problems. John believes in a personalized learning experience, emphasizing that every student's journey is unique. His approach combines cutting-edge technology, industry-relevant curriculum, and hands-on learning, all tailored to individual needs and aspirations.

### Testimonies from ULearn Students

1. **Alice, Data Analyst Turned Data Scientist:**
"Transitioning from data analysis to data science seemed daunting until I met John at ULearn. His guidance and the practical nature of the courses made complex concepts accessible. I now confidently tackle machine learning projects that seemed out of reach before."

2. **Marc, IT Professional to Data Engineer:**
"I always had an interest in big data but didn't know where to start. John's personal experience in the field and his ability to demystify data engineering concepts were game-changers for me. His encouragement led me to a successful career shift."

3. **Sophie, Marketing Specialist to Data Strategist:**
"As someone from a non-technical background, the idea of diving into data science was intimidating. John's mentorship at ULearn, along with the supportive community he's built, helped me understand the power of data in marketing. His approach to teaching is truly inspiring."

In your case, you might need to compile documents from different departments, clean them (remove unnecessary parts), and store them in a centralized way.

Step 2 : Select a vector store and upload the documents

Now with the sample documentation, we need to :

  1. Create an index (or a database) on Pinecone : creating an index on Pinecone is as simple as a walk in the park and is a self-explanatory process for which you can very easily find tutorials on YouTube.
  2. Split the documentation into chunks (a minimal chunking function is sketched after the list below)

Chunking in the context of Retrieval-Augmented Generation (RAG) is the process of breaking down large pieces of information into smaller, more manageable parts, or “chunks.” This technique can be especially helpful in RAG for several reasons:

  • Improved Retrieval Efficiency : Large documents may contain a vast array of information, only parts of which are relevant to any given query. By chunking documents, the RAG system can perform a more focused retrieval, pinpointing the exact segments of text that are most pertinent to the user’s prompt.
  • Enhanced Relevance of Results : When a RAG system retrieves whole documents, there’s a risk that the overall relevance score of a document might not accurately reflect the relevance of specific parts within it. Chunking ensures that the system assesses the relevance of each piece of content independently, which can improve the precision of the information retrieved.
  • Better Contextual Understanding : Chunking can help a model better understand the context of a document by allowing it to process and integrate smaller sections that are more directly related to the query. This is particularly useful for complex queries that may span multiple contexts or topics within a larger document.
  • Resource Optimization : Processing entire documents can be resource-intensive, especially for large language models that have to consider a wide range of potential sources. Chunking can reduce the computational load by limiting the amount of text the model needs to process at any one time.
  • More Effective Use of the Model’s Attention (maybe one of the most important reasons) : Language models can only consider a certain number of tokens (words or characters) at once. Chunking ensures that each piece of text is small enough for the model to process in its entirety.
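As a rough illustration of what Flowise (or any chunking utility) does for us, here is a minimal character-based chunking function with overlap. The chunk size, the overlap, and the file name are arbitrary assumptions for this sketch.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character chunks.

    The overlap keeps a little shared context between neighbouring chunks,
    so a sentence cut in half can still be matched by the retriever.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

faq_text = open("ulearn_faq.txt", encoding="utf-8").read()  # hypothetical file
print(len(chunk_text(faq_text)), "chunks ready to embed and upsert")
```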

Step 3 : gather everything in Flowise

The best part about Flowise is that you can build a very advanced app with simple drag and drop.

Application logic on the Flowise canvas

Once the logic of the app has been built, you can export and version it using GitHub very easily (even if a direct connection to GitHub would be appreciated 😉).

The Text File box allows you to manually upload documents to Pinecone. The documents are automatically chunked and embedded for similarity search.
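For readers who prefer code to a canvas, the chain assembled in Flowise corresponds roughly to the following LangChain sketch. The module paths and class names are those of late-2023 LangChain releases (they have moved in newer versions), and the index name and retrieval parameters are assumptions.

```python
import os

import pinecone
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

# Connect to the vector store populated in step 2 (hypothetical index name).
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="gcp-starter")
vectorstore = Pinecone.from_existing_index(
    index_name="ulearn-faq",
    embedding=OpenAIEmbeddings(),
)

# Retrieval-augmented QA: fetch the 4 closest chunks, stuff them into the
# prompt, and let GPT-3.5 Turbo generate the answer.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
print(qa_chain.run("Do you offer online or in-person classes?"))
```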

Step 4 : Test the app

The Flowise UI allows you to test the app directly and make sure everything works fine.

Interaction with the model 1
Interactions with the model

As you can see, we retrieve the information from the documentation, even if the model sometimes seems to hallucinate a little. To address this, we can start by reducing the model temperature and including a prompt template to customize the interaction further.
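As an example of those two levers, here is a hypothetical system prompt template and a lower temperature value that could be plugged into the chat model node; both the wording and the number are illustrative, not a prescription.

```python
# A hypothetical system prompt template for the ULearn-training chatbot.
SYSTEM_PROMPT = """You are the FAQ assistant of ULearn-training.
Answer ONLY with information found in the context below.
If the answer is not in the context, say you don't know and invite
the client to contact ULearn-training directly.

Context:
{context}
"""

# Lower temperature = more deterministic, less "creative" answers.
LLM_TEMPERATURE = 0.1
```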

Embedding the chatbot into a web page is a matter of seconds with Flowise, which produces the code to insert into an HTML page.

To embed the chatbot in a web page, we simply need to click on the </> icon (upper right).

When precision really matters and we would like a model customized to our needs, fine-tuning will be much more helpful.

Conclusion

The Retrieval-Augmented Generation (RAG) technique stands as a transformative advancement in the field of Natural Language Processing (NLP). By incorporating external sources of information into the response generation process of Large Language Models (LLMs), RAG significantly enhances the capacity of these models to provide accurate, relevant, and context-rich answers.

Throughout our discussion, we’ve explored the intricate workflow of RAG, highlighting how it leverages the power of embedding models and similarity searches to fetch pertinent information from a vast corpus of data. The methodical chunking of data not only improves the efficiency and relevancy of the retrieval process but also ensures that the cognitive load on the LLM is manageable, allowing for more precise attention to the details most relevant to the user’s query.

However, there are also some challenges and limitations that need to be addressed, such as how to select the best knowledge sources, how to integrate them with the generation model, and how to evaluate the generated texts.

In this blog post, we have only scratched the surface, and I hope it will inspire you to conduct more research and start building applications using retrieval-augmented generation.
