RAG Image
16 June, 2023

RAG - Retrieval-Augmented Generation To Deploy Real-World Applications Of LLMs

The release of ChatGPT by OpenAI in November 2022 generated a huge amount of interest in Large Language Models (LLMs), which had previously garnered little mainstream attention. In particular, businesses around the world have been scrambling to articulate a strategy for using LLM technology to either drive efficiencies or generate incremental revenue. The lack of knowledge within these organisations, especially within the senior ranks, is hampering progress.

Generative AI has exploded in popularity since OpenAI released ChatGPT, as the chat interface made the technology accessible to non-technical users. Since then, a number of alternative GPT-based products have been released, including Microsoft Copilot (originally Bing AI), Anthropic's Claude and Google Bard, while both Character AI and Perplexity AI launched before those. The technology is becoming broader based, more capable and more accessible to both consumers and businesses.

One of the primary risks associated with the use of third-party LLMs is the potential for leakage of sensitive or proprietary business data. One way to benefit from LLMs whilst eliminating the risk of data leakage is to use techniques such as Retrieval-Augmented Generation (RAG) in combination with running an open source LLM, such as Meta's LLaMA, locally.

Retrieval-augmented generation is a technique that combines LLMs with an external knowledge source, such as a database or a collection of documents, to generate more accurate and informative responses. Businesses have vast amounts of data that is useful on a day-to-day basis but is often difficult to access, sitting in silos with gatekeepers acting as bottlenecks. Making this data more generally available, via a medium with no technical barriers to access, has the potential to unleash huge productivity gains and power new business opportunities.

As both an early adopter of LLM technology, via the Perxeive app I released to the Apple App Store in November 2021, and as someone who has operated in senior management roles in large businesses, I believe RAG techniques offer the optimal initial LLM implementation strategy for most businesses. Deploying RAG solutions offers businesses the best way to both demonstrate progress and minimise the risks associated with a revolutionary new technology.

RAG: Real-World Deployment Opportunities

The RAG approach can be used to deploy real-world applications of LLMs in the following ways:

  1. Improving accuracy and relevance: By retrieving relevant information from an external knowledge source, RAG can help improve the accuracy and relevance of the generated text. For example, a chatbot that uses RAG can provide more detailed and accurate answers to user queries by first searching for relevant information in a database before generating a response.
  2. Reducing hallucinations: LLMs are known to sometimes generate text that is factually incorrect or misleading, a phenomenon known as "hallucination". By using RAG, it is possible to reduce the likelihood of hallucinations by grounding the generated text in content retrieved from the external knowledge source.
  3. Personalization: RAG can be used to personalize the generated text based on the user's profile or preferences. For example, a recommendation system that uses RAG can retrieve information about the user's past purchases and browsing history from a database before generating product recommendations.
  4. Real-time updates: By using an external knowledge source that is regularly updated, RAG can ensure that the generated text reflects the latest information. For example, a news summary system that uses RAG can retrieve the latest news articles from a database before generating a summary.
  5. Scalability: RAG can be used to scale LLMs to handle large amounts of data. By retrieving only the relevant information from an external knowledge source, RAG can reduce the amount of data that needs to be processed by the LLM, making it more efficient and scalable.

RAG: Potential Applications In Businesses

There are many potential applications, both for internal use and for client-facing services. For businesses wishing to make a start with LLMs, an internal deployment of a RAG application is a low-risk strategy. This could be as simple as deploying a chat application that enables employees to query internal policies and procedures. As a good example, for businesses operating in highly regulated sectors such as financial services, tools that enable regulations to be queried for a specific context, with the relevant sections referenced and summarised, could be hugely valuable.

External customer applications could include enabling customers to self-serve when learning how to use a product via a RAG-enabled chatbot.

RAG: Architecture

The architecture of a RAG application consists of two main components: a retriever and a generator. The retriever is responsible for searching a large corpus of documents to find relevant information related to the user's query, while the generator uses this information to generate a response.

During inference, the retriever first identifies a set of potentially relevant documents based on the user's input. It then extracts salient snippets from these documents and passes them to the generator, which uses them as context to generate a response. In practice, this is typically implemented as a query against a vector store in which a business has represented its text data as a set of semantic embeddings.

Pinecone is an example of a fully managed vector database that is well suited to RAG applications. Having implemented RAG using Pinecone for my own educational purposes I found Pinecone very straightforward to set up and use. Vector stores and embeddings are covered in a later section below.
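
To make the retriever concrete, the sketch below shows roughly how a query against a Pinecone index might look. It is a minimal sketch assuming the pinecone-client and sentence-transformers Python packages; the index name, embedding model and metadata layout are illustrative assumptions rather than a prescribed setup.

```python
# A minimal retriever sketch against a Pinecone index.
# Assumes the pinecone-client and sentence-transformers packages; the index name,
# embedding model and metadata field are illustrative assumptions.
import pinecone
from sentence_transformers import SentenceTransformer

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("company-docs")  # hypothetical index name

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, top_k: int = 5) -> list[str]:
    # Embed the user's query and fetch the most similar document chunks.
    query_vector = embedder.encode(query).tolist()
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    # By convention here, each stored vector carries its source text in metadata.
    return [match["metadata"]["text"] for match in results["matches"]]
```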

The generator is typically a pre-trained transformer model that has been fine-tuned for a specific task, such as question answering or dialogue generation. Proprietary LLMs, such as OpenAI's various GPT models, and open source models, such as Meta's LLaMA, are examples of potential generators.
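
To illustrate the generation step, the following sketch assembles the retrieved snippets into a prompt and passes them to a hosted model. It assumes the openai Python package and reuses the hypothetical retrieve helper from the sketch above; the prompt wording and model name are illustrative only.

```python
# A minimal generator sketch: ground the model's answer in the retrieved snippets.
# Assumes the openai Python package; the prompt wording and model name are illustrative.
import openai

def generate_answer(query: str) -> str:
    snippets = retrieve(query)  # hypothetical retriever from the sketch above
    context = "\n\n".join(snippets)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```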

The following figure is taken from the Pinecone blog on RAG and provides an example of a RAG architecture:

Pinecone RAG Architecture Example Image

The RAG architecture has several advantages over traditional generative models. By incorporating retrieved information into the generation process, it can provide more accurate and relevant responses to user queries. Additionally, because the retriever only needs to search a fixed corpus of documents, it can be more efficient than generating responses from scratch using a large language model.

One potential limitation of the RAG architecture is that it requires access to a large and diverse corpus of documents to be effective. If the corpus is too small or not representative of the domain being queried, the retriever may not be able to find relevant information, which could negatively impact the quality of the generated response.

RAG: Vector Stores And Embeddings

Vector stores and embeddings are important concepts in retrieval-augmented generation.

A vector store is a data structure that is used to efficiently retrieve vectors (i.e., numerical representations of data) based on their similarity to a query vector. In the context of retrieval-augmented generation, a vector store might be used to index and retrieve embeddings of text documents, which can then be used to generate responses to user queries.

Embeddings are compact vector representations of data that capture its semantic meaning or other relevant properties. In the case of text data, embeddings can be generated using techniques such as word2vec, GloVe, or BERT. These embeddings capture the meanings of words and phrases in a way that allows them to be compared and manipulated mathematically, which is essential in RAG applications.
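
As a concrete illustration, the snippet below produces a sentence-level embedding with a BERT-based model. It is a minimal sketch assuming the sentence-transformers library; the model name is an illustrative choice.

```python
# Generate a semantic embedding for a short piece of text.
# Assumes the sentence-transformers library; the model name is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Our refund policy allows returns within 30 days.")
print(vector.shape)  # (384,) for this model: a dense vector capturing the sentence's meaning
```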

There are a number of ways in which the text can be broken down into semantically relevant chunks to be embedded into the vector store. We cover text splitting and chunking in more detail in a later section.

In retrieval-augmented generation, embeddings are used to represent the meaning or context of user queries and text documents. By comparing the embeddings of a query and a set of relevant documents, the model can identify the most relevant documents to use in generating a response. The vector store is then used to efficiently retrieve these documents based on their similarity to the query embedding.
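
A simple way to picture this matching step is cosine similarity between the query embedding and each document embedding. The sketch below assumes numpy and reuses the hypothetical model from the previous snippet; the document chunks are invented purely for illustration.

```python
import numpy as np

# Hypothetical document chunks, embedded with the same model as the query.
docs = [
    "Employees may work remotely up to three days per week.",
    "Expense claims must be submitted within 30 days of purchase.",
    "The office cafeteria opens at 8am on weekdays.",
]
doc_vectors = model.encode(docs)
query_vector = model.encode("What is the remote working policy?")

# Cosine similarity: higher scores mean the chunk is semantically closer to the query.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(docs[int(np.argmax(scores))])  # expected: the remote working chunk
```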

Vector stores and embeddings are important tools for retrieval-augmented generation because they allow models to efficiently compare and retrieve relevant information based on its semantic meaning or other properties. This can lead to more accurate and contextually relevant responses, which is particularly important in applications such as chatbots, virtual assistants, and other natural language processing tasks.

RAG: Text Splitting And Chunking

Text splitting and chunking are techniques used to divide large text documents into smaller pieces or chunks, which can then be processed and analyzed more efficiently. These techniques are often used in natural language processing (NLP) and machine learning applications that involve working with text data, such as text classification, sentiment analysis, and information retrieval.

Text splitting and chunking are essential in the development and deployment of RAG applications as they enable large text documents to be represented in vector stores.

There are several methods for splitting and chunking text, including:

  1. Tokenization: This is the process of dividing a stream of text into individual words or tokens. Tokenization can be performed at the word, character, or sentence level, depending on the specific application. For example, in word-level tokenization, a sentence such as "I love to play soccer" would be divided into the following tokens: ["I", "love", "to", "play", "soccer"].
  2. Sentence splitting: This involves dividing a text document into individual sentences. Sentences can be identified using punctuation marks, such as periods and question marks, or by using more sophisticated techniques such as part-of-speech tagging and parsing.
  3. Chunking: This is the process of dividing a text document into chunks or segments of a fixed size. For example, a document might be divided into chunks of 100 words each. Chunking can be useful for applications that require processing text in batches, such as when training a machine learning model on large datasets.
  4. Sliding window: This is a technique for chunking text where a fixed-size window is slid over the text, and the contents of the window are used as the input for processing. For example, a window of size 3 might be slid over the text "I love to play soccer", resulting in the following chunks: ["I", "love", "to"], ["love", "to", "play"], ["to", "play", "soccer"] (see the sketch after this list).
  5. Document embedding: This is a technique for representing a document as a dense vector in a high-dimensional space, often using techniques such as word2vec or doc2vec. Document embedding can be useful for applications that require comparing or clustering documents based on their content.
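
As a concrete illustration of fixed-size chunking with a sliding window, the sketch below splits a document into overlapping word-level chunks ready for embedding. The chunk size and overlap are illustrative assumptions and would normally be tuned to the input limits of the embedding model in use.

```python
# Split a document into overlapping word-level chunks (sliding window).
# Chunk size and overlap are illustrative; tune them to the embedding model in use.
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already reached the end of the document
    return chunks

# Hypothetical usage: chunk a policy document before embedding each chunk.
document = "..."  # in practice, the full text of an internal document
chunks = chunk_text(document)
```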

When it comes to vector stores and embeddings, text splitting and chunking techniques are important for several reasons:

  1. Efficiency: Splitting and chunking text can help reduce the computational resources required for processing large text datasets. By dividing text into smaller chunks, it is possible to process each chunk independently, allowing for parallelization and distributed computing.
  2. Memory usage: Large text documents can consume significant amounts of memory when loaded into memory all at once. Splitting and chunking text can help reduce memory usage by loading only the necessary chunks into memory at any given time.
  3. Quality of embeddings: The quality of document embeddings can be improved by using text splitting and chunking techniques. By dividing text into smaller chunks, it is possible to capture more fine-grained patterns in the text, leading to more accurate and meaningful embeddings.
  4. Handling long documents: Some NLP tasks, such as document classification or summarization, require working with very long documents. Splitting and chunking these documents can help make them more manageable and allow for more efficient processing.

Text splitting and chunking techniques are important tools for working with text data in natural language processing and machine learning applications. By dividing text into smaller chunks, it is possible to process and analyze text more efficiently, while also improving the quality of embeddings and handling long documents more effectively.

RAG: Key Takeaways

Retrieval-augmented generation applications represent a low-risk opportunity for businesses to demonstrate progress in their adoption of LLM technology. Internally deployed RAG applications enable a business to learn how to develop and deploy LLM technology whilst maintaining data security. There are potential quick wins to improve productivity by making internal data available to employees via a medium that is intuitive and has minimal barriers to adoption.

Successful internal deployments then set a business up for success when deploying RAG applications to their customers.

Deploying RAG solutions offers businesses the best way to both demonstrate progress and minimise the risks associated with a revolutionary new technology.

About The Author

I build businesses, both as independent startups and as new initiatives within large global companies. Having been an early adopter of generative AI as a member of the beta program for OpenAI's GPT-3 model, I released my first LLM-based product, Perxeive, to the Apple App Store in November 2021. Since then, I have continued to develop my experience and skills in applying generative AI to real-world challenges and opportunities. If you are looking to build a business and require leadership, please contact me via the About section of this website.