Retrieval-Augmented Text Generation

The moment that we’ve all been waiting for has finally arrived! The Retrieval-Augmented Text Generation (RAG) Framework is here! 🎉

Throughout this notebook we will be exploring RAG, what it is, how it works, and why it’s so exciting.

RAG Framework#

RAG addresses a core limitation of LLMs: their knowledge is limited to the data they were trained on. It supplements the prompt sent to the LLM with information from external sources, retrieved by a retrieval model via vector embeddings (more on this later), thereby giving the LLM more relevant input for generating responses. It allows you to use pre-trained LLMs on your own data without fine-tuning them or training your own LLM.

Figure: RAG Workflow (Image Source: Medium Blog)

Three concepts make up the RAG pipeline:

Retrieval
Augmentation
Generation
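To make the three stages concrete before looking at each one, here is a minimal, library-free sketch. Everything in it is an assumption made for illustration: the tiny `KNOWLEDGE_BASE`, the word-overlap scoring (a stand-in for embedding similarity), and the `generate` stub that takes the place of a real LLM call.

```python
import re

# Hypothetical mini knowledge base; a real pipeline indexes a document collection.
KNOWLEDGE_BASE = [
    "Karatsuba multiplication splits large numbers into halves.",
    "A hash table offers constant expected lookup time.",
    "Dijkstra's algorithm finds shortest paths in weighted graphs.",
]


def words(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, k=1):
    """Retrieval: rank documents by word overlap with the question."""
    q = words(question)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def augment(question, docs):
    """Augmentation: place the retrieved text into the prompt."""
    context = "\n".join(docs)
    return f"Context: {context}\nQuestion: {question}"


def generate(prompt):
    """Generation: placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"


toy_question = "How do I multiply large numbers?"
toy_docs = retrieve(toy_question)
toy_prompt = augment(toy_question, toy_docs)
print(toy_prompt)
print(generate(toy_prompt))
```

The rest of this notebook replaces each stub with a real component: an embedding model and vector store for retrieval, a prompt template for augmentation, and OLMo for generation.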

Retrieval#

The retrieval phase can also be considered the data and query/prompt preparation phase, focusing on efficient information retrieval or data access. To improve your RAG pipeline, this phase includes tasks such as (1) indexing, (2) query manipulation, (3) data modification, (4) search, and (5) ranking. In this tutorial, we primarily focus on indexing and search.

Indexing enables fast and accurate information retrieval that sets up the context for any LLM to improve its response to a given user prompt or query.

We will be indexing Professor Jeff Erickson’s Algorithms textbook (previously used in module 1).

Embeddings#

Embeddings, also called “vector embeddings,” help LLMs develop a semantic understanding of the textual data they are trained on. In simpler terms, embedding models lay the groundwork for LLMs to perform tasks like sentence completion, similarity search, question answering, etc.

Embedding vs Fine-tuning#

	Embedding	Fine-tuning
Definition	Use pre-trained LLM as feature extractor	Update parameters of pre-trained LLM during task-specific training
Process	Input Encoding > tokenized > Embedding Extraction > Downstream Task	Initialization > Task-specific Training > Fine-tuning Layers (optional)
Advantages	Efficient use of pre-trained knowledge, Faster inference	Adaptability to task-specific nuances, May require less labeled data than from scratch
Considerations	N/A	Risk of overfitting, Computational cost can be high
When to use	Limited computational resources, Limited labeled data	Significant computational resources, Large corpus of labeled data
Performance	Performs well, especially with limited data	Can achieve state-of-the-art results on a wide range of tasks

In a nutshell#

Embedding models are typically small in size and less computationally intensive.
Regular updates of embedding vectors are faster, cheaper, and simpler compared to fine-tuning a model.

Vector#

At the lowest level, machines only understand numeric values. For LLMs to work, natural language is converted into arrays of numeric values before it is fed into the models. These arrays of numeric values are called “vectors.”

An example of a vector: [2.5, 1.0, 3.3, 7.8]

The above is an example of a vector of size 4.

import numpy as np

vector = np.array([2.5, 1.0, 3.3, 7.8])  # same values as the example above
print(f"Vector: {vector}")

Tokens#

We stated above that “texts are converted into an array of numeric values called vectors”.

But depending on your use case, each word, sentence, paragraph, or entire document can be represented as a vector.

Tokens are the smallest natural language units converted into a vector. It could be at the character level, sub-word level, word level, sentence level, paragraph level, or document level.

Example: Consider the text below.

Earth is a planet of the solar system. There are 9 planets in the solar system. All planets revolve around the sun. Sun is a star.

Case 1.) Tokenizing the entire paragraph into a vector.
Tokenization: The entire paragraph is a single token.
Vectorization: A single vector.
Sample Vector Representation: [3.1, 6.8, 5.4, 8.0, 7.1]

Case 2.) Tokenizing each sentence into vectors.
Tokenization: One token for each sentence (total 4 tokens)
Vectorization: One vector for each sentence (total 4 vectors).
Sample Vector Representation: [[1.2, 2.3, 3.8, 7.9, 0.8], [2.5, 3.0, 8.2, 6.6, 4.1], [3.2, 6.5, 8.1, 9.3, 1.4], [1.1, 0.7, 7.2, 3.5, 8.5]]

Case 3.) Tokenizing each word in the paragraph into a vector. There are 26 words in the paragraph, ignoring punctuation. Each word gets converted into a vector.
Tokenization: One token for each word in the paragraph (26 tokens)
Vectorization: One vector for each token (total 26 vectors).
Sample Vector Representation: [[2.1, 3.2, 4.1, 9.8, 7.0], [8.2, 4.2, 7.1, 3.8, 2.0], ... (26 vectors in total)]
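The token counts in the three cases above can be checked with a short sketch. The splitting rules used here (a sentence ends at a period followed by whitespace, a word is any run of letters or digits) are simplifying assumptions for this paragraph, not how a production tokenizer works.

```python
import re

text = (
    "Earth is a planet of the solar system. "
    "There are 9 planets in the solar system. "
    "All planets revolve around the sun. "
    "Sun is a star."
)

# Case 1: the whole paragraph is one token.
paragraph_tokens = [text]

# Case 2: one token per sentence (split after a period followed by whitespace).
sentence_tokens = [s for s in re.split(r"(?<=\.)\s+", text) if s]

# Case 3: one token per word, ignoring punctuation.
word_tokens = re.findall(r"\w+", text)

print(len(paragraph_tokens), len(sentence_tokens), len(word_tokens))  # prints: 1 4 26
```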

Tokenizers#

Tokenizers are components responsible for converting large texts into tokens (tokenization). Different types of pre-trained tokenizers are available. You can even train your own tokenizers. But for the scope of this tutorial, we will use a pre-trained one.

Generally, each tokenizer performs the following steps:

Break down the original text into tokens. These tokens could again be at the character, sub-word, word, sentence, paragraph, or document levels.
Assign a unique identifier to each of the tokens created.

# For example, here is how you can split a short sentence into chunks of text
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=10,
    chunk_overlap=0,
)
text_splitter.split_text(text="Earth is a planet in the solar system.")

Learn more about how to split text into tokens in LangChain here.
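The second tokenizer step described above, assigning a unique identifier to each token, can be sketched in plain Python. This is a toy word-level vocabulary; the `build_vocab` helper and the sample sentence are illustrative assumptions, and real pre-trained tokenizers typically use learned sub-word vocabularies instead.

```python
def build_vocab(tokens):
    """Assign each distinct token a unique integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


sample = "the sun is a star and the earth is a planet"
tokens = sample.split()        # step 1: break the text into word-level tokens
vocab = build_vocab(tokens)    # step 2: assign a unique id to each distinct token
token_ids = [vocab[token] for token in tokens]

print(vocab)
print(token_ids)  # prints: [0, 1, 2, 3, 4, 5, 0, 6, 2, 3, 7]
```

Repeated words such as "the" and "is" map to the same id, which is what lets a model treat them as the same token wherever they appear.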

Embedding Models#

A language model needs to understand how tokens are related to each other in the context of human language. To understand this semantic relationship, these tokens are converted into numerical vectors.

Embedding models are trained on these tokens to develop an “embedding space.”

Before the training, the embedding model initializes an N-dimensional ‘vector’ corresponding to each ‘token’ with random values. (Value of N depends on the embedding model)
During the embedding model training, the values for these vectors are updated across iterations. In this process, similar or related tokens are updated to have similarly valued vectors.
After the training, the collection of all the ‘vectors’ corresponding to all the tokens is called the “embedding space.”
“Embedding Space” is an encoded representation of meanings of tokens and inter-token relationships.
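The training steps above can be illustrated with a deliberately simplified sketch. It assumes a four-token vocabulary and N = 4, and it replaces the real training objective with a hypothetical `pull_together` update that nudges two related tokens toward each other. Actual embedding models learn from large corpora with far more elaborate objectives; this only shows the idea that related tokens end up with similar vectors.

```python
import math
import random

N = 4  # embedding dimension (illustrative; real models use hundreds of dimensions)
rng = random.Random(0)
toy_vocab = ["earth", "mars", "sun", "hello"]

# Before training: one randomly initialised N-dimensional vector per token.
space = {token: [rng.uniform(-1, 1) for _ in range(N)] for token in toy_vocab}


def pull_together(space, t1, t2, lr=0.1):
    """Toy update: move the vectors of two related tokens slightly toward each other."""
    a, b = space[t1], space[t2]
    space[t1] = [x + lr * (y - x) for x, y in zip(a, b)]
    space[t2] = [y + lr * (x - y) for x, y in zip(a, b)]


before = math.dist(space["earth"], space["mars"])
for _ in range(10):  # "training iterations"
    pull_together(space, "earth", "mars")
after = math.dist(space["earth"], space["mars"])

print(f"distance before: {before:.3f}, after: {after:.3f}")
```

After the updates, the collection of vectors in `space` plays the role of the embedding space: "earth" and "mars" now sit closer together than they did at random initialisation.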

See Word Embeddings Resource for more conceptual details on embeddings.

To understand this further, let’s take a look at how it all works using a pre-trained embedding model.

For simplicity, this tutorial uses the LangChain Hugging Face integration, which is available in the langchain-huggingface package. To use an embedding model available on Hugging Face, we will simply use the HuggingFaceEmbeddings class.

from langchain_huggingface import HuggingFaceEmbeddings

We are using the all-MiniLM-L12-v2 sentence-transformers embedding model for this tutorial. After some evaluation that we did, we found that this model works well for our use case as it is lightweight and provides good performance.

This model “maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search”.

However, you can use any other embedding model available on Hugging Face, and we recommend visiting the MTEB Leaderboard to find embedding models and see how they compare to each other.

# Setup the embedding, we are using the MiniLM model here
embeddings_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L12-v2"
)

query_result = embeddings_model.embed_query("Earth is a planet in the solar system.")

# Dimension of vector
len(query_result)

query_result[-3:]

In an embedding space, you can find how similar two vectors are using dot product or using cosine similarity.

from scipy import spatial

print(
    "Similarity:",
    1
    - spatial.distance.cosine(
        query_result,
        embeddings_model.embed_query("Mars is a planet in the solar system."),
    ),
)

print(
    "Similarity:",
    1
    - spatial.distance.cosine(
        query_result, embeddings_model.embed_query("Hello Tacoma.")
    ),
)

What we have demonstrated above in finding similarity between vectors is essentially what’s happening in the retrieval phase of the RAG pipeline within a Vector Database.
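As a rough picture of that retrieval step, the sketch below runs a brute-force similarity search over a handful of stored vectors. The three-dimensional embeddings and the `search` helper are made up for illustration; a vector database such as Qdrant stores real embeddings and uses indexing structures to avoid comparing the query against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings; real ones would come from the embedding model.
stored = {
    "Mars is a planet.": [0.9, 0.1, 0.0],
    "Hash tables map keys to values.": [0.0, 0.2, 0.9],
    "Jupiter is a gas giant.": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]


def search(query_vector, stored, k=2):
    """Score every stored vector against the query and return the k best matches."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored.items()]
    scored.sort(reverse=True)
    return scored[:k]


results = search(query_vector, stored)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two planet sentences rank above the unrelated one because their vectors point in nearly the same direction as the query vector.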

Algorithms Textbook#

We will now bring in the textbook mentioned above: first we download it, and then we load it into LangChain Document objects.

# Load the textbook PDF for the retrieval step.
# See the documentation on PyMuPDF for more information:
# https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/#using-pymupdf

# Download the textbook if it is not already present locally
import os
from urllib.request import urlretrieve

url = "http://jeffe.cs.illinois.edu/teaching/algorithms/book/Algorithms-JeffE.pdf"
filename = os.path.basename(url)

if not os.path.exists(filename):
    # Download if file doesn't exist
    pdf_path, headers = urlretrieve(url, filename)

import os
from langchain_community.document_loaders import PyMuPDFLoader

pdf_folder_path = "."  # update path to point to the relevant directory
documents = []
for file in os.listdir(pdf_folder_path):
    if file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder_path, file)
        loader = PyMuPDFLoader(pdf_path)
        documents.extend(loader.load())

for each in documents:
    # print(each.page_content) # Uncomment this line to see the individual page_content
    print(each.metadata)

Vector Stores#

Once the embeddings are created for our relevant documents or knowledge base, we need to store these embeddings in the database for fast retrieval.

The type of databases that store these vector embeddings are called “Vector Stores.” We will use a vector store called “Qdrant,” as shown below.

In the code below:

Vector store works along with the embedding model to create vector embeddings.
Vector embeddings are stored in the Qdrant Vector database collection.

We will now create a Qdrant collection (database) from the Algorithms textbook loaded above.

We start by defining the Qdrant path and collection name. We can then use the LangChain Qdrant integration package, langchain-qdrant, to interact with the Qdrant database through the Qdrant class.

from langchain_qdrant import Qdrant
from ssec_tutorials import TUTORIAL_CACHE
import shutil

qdrant_collection = "algorithms_book"
qdrant_path = TUTORIAL_CACHE / "algorithms_book"

# Start from a clean slate: remove any previously cached collection
# before a client opens (and locks) the storage folder.
if qdrant_path.exists():
    print("Removing cached data")
    shutil.rmtree(qdrant_path)

print(
    f"Creating new Qdrant collection '{qdrant_collection}' from {len(documents)} documents"
)

# Load the documents into a Qdrant Vector Database Collection
# this will save locally in the qdrant_path as sqlite
qdrant = Qdrant.from_documents(
    documents=documents,
    embedding=embeddings_model,
    path=str(qdrant_path),
    collection_name=qdrant_collection,
)

Search#

Now that we have the Qdrant database instance, we are ready to search for the relevant documents based on the user query. However, before we can simply search, we will need a VectorStoreRetriever object.

To get the VectorStoreRetriever object, we can simply call the .as_retriever() method on the Qdrant object.

In this example, we will be setting the search_type to "mmr" and search_kwargs to {"k": 2}.

“mmr” stands for Maximal Marginal Relevance.

MMR selects examples based on a combination of which examples are most similar to the inputs, while also optimizing for diversity. It does this by finding the examples with the embeddings that have the greatest cosine similarity with the inputs, and then iteratively adding them while penalizing them for closeness to already selected examples.

The k parameter in search_kwargs specifies the number of chunks to retrieve.
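The selection rule described above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the code LangChain or Qdrant runs: the two-dimensional candidate vectors are made up, and `lambda_mult` is this sketch's name for the weight that trades relevance against diversity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return the indices of k candidates chosen by Maximal Marginal Relevance."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        # Reward similarity to the query ...
        relevance = cosine_similarity(query, candidates[i])
        # ... and penalise closeness to anything already selected.
        redundancy = max(
            (cosine_similarity(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.10],   # index 0: very relevant
    [1.0, 0.12],   # index 1: almost a duplicate of index 0
    [0.7, -0.70],  # index 2: less relevant, but different
]
print(mmr(query, candidates, k=2))  # prints: [0, 2]
```

A plain top-2 similarity search would return indices 0 and 1, which are near duplicates. MMR keeps index 0 and then prefers index 2, because index 1 adds almost nothing new.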

# Setup the retriever for later step
retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})

Let’s invoke this retriever object with one of the questions from the previous section and see what we get.

documents = retriever.invoke("What is the best method for multiplying large numbers?")

We got the relevant documents from the Qdrant database for the given question. Let’s see what these documents look like.

document = documents[0]

type(document)

dict(document)

We see that this is a core LangChain Document object that contains the document’s metadata and content.

Later we will see how to use this document to generate the response. For now, let’s create a utility formatting function that extracts just the content of the documents, so that we can pass it as part of our prompt template input. This step is also known as “Augmentation”.

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

print(format_docs(documents))

Augmentation & Generation#

Now that we can retrieve the most relevant document based on a question, we can use the retrieved document and send it along with the prompt to increase the context for the LLM.

This can also be referred to as the retrieval-augmented prompt.

from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate
from ssec_tutorials import download_olmo_model

OLMO_MODEL = download_olmo_model()

olmo = LlamaCpp(
    model_path=str(OLMO_MODEL),
    temperature=0.8,
    verbose=False,
    n_ctx=2048,
    max_tokens=512,
)

# Create a prompt template using OLMo's tokenizer chat template we saw in module 1.
prompt_template = PromptTemplate.from_template(
    template=olmo.client.metadata["tokenizer.chat_template"],
    template_format="jinja2",
    partial_variables={"add_generation_prompt": True, "eos_token": "<|endoftext|>"},
)

# Test the prompt you want to send to OLMo.
question = "What is the best method for multiplying large numbers?"
context = format_docs(retriever.invoke(question))

final_prompt_content = prompt_template.format(
    messages=[
        {
            "role": "user",
            "content": f"""\
                You are an algorithms expert. Please answer the question on algorithms based on the following context:

                Context: {context}

                Question: {question}
            """,
        }
    ]
)

print(final_prompt_content)

You can see above that we now have a context input within the prompt. This context is the content of the document(s) that we retrieved from the Qdrant database. With this context, the LLM can generate more relevant responses. So let’s see how it does!

from langchain_core.callbacks import StreamingStdOutCallbackHandler

OLMo without context#

olmo.invoke(question, config={"callbacks": [StreamingStdOutCallbackHandler()]})

OLMo with context#

olmo.invoke(
    final_prompt_content, config={"callbacks": [StreamingStdOutCallbackHandler()]}
)

From the responses above, we can see that the response with context is more relevant and informative than the response without context. This shows the power of the RAG framework, even with just a few documents.

One way to generate the response with OLMo is to build the context from the question beforehand, as shown above, and then invoke the model with the resulting prompt.

However, we can further use LangChain’s convenience functions to streamline our pipeline using create_stuff_documents_chain and create_retrieval_chain from the main langchain package.

The main langchain package contains chains, agents, and retrieval strategies that make up an application’s cognitive architecture.

create_stuff_documents_chain specifies how retrieved context is fed into a prompt and LLM.

Looking at its signature, notice that it accepts a prompt argument of type BasePromptTemplate, and that the prompt needs context and input as its input keys.

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

To use the helper functions, we’ll need to set up our template string to use the context and input keys as variables.

# Create a new prompt_template
# so that it accepts `context` and `input` as input_variables
input_string_template = """\
You are an algorithms expert. Please answer the question on algorithms based on the following context.
Context: {context}
Question: {input}
"""
transformed_prompt_template = PromptTemplate.from_template(
    prompt_template.partial(
        messages=[{"role": "user", "content": input_string_template}]
    ).format()
)
transformed_prompt_template

document_chain = create_stuff_documents_chain(
    llm=olmo, prompt=transformed_prompt_template
)

We can run this by passing in the context directly:

question = "Which data structures have the most efficient lookup time?"
document_chain.invoke(
    {
        "input": question,
        "context": retriever.invoke(question),
    },
    config={"callbacks": [StreamingStdOutCallbackHandler()]},
)

However, we want the context to be dynamically generated using the passed input or question.

From LangChain’s documentation: create_retrieval_chain adds the retrieval step and propagates the retrieved context through the chain, providing it alongside the final answer. It has input key input, and includes input, context, and answer in its output.

retrieval_chain = create_retrieval_chain(retriever, document_chain)

response = retrieval_chain.invoke(
    {"input": "Which data structures have the most efficient lookup time"},
    config={"callbacks": [StreamingStdOutCallbackHandler()]},
)

response

One of the nice things about the LangChain helper function is that the result is a dictionary containing the input, context, and answer keys, so you can easily see what you asked and the context that was used to generate the answer.
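To see why the output has this shape, here is a rough plain-Python sketch of what such a chain does conceptually. It is not LangChain's implementation: `make_retrieval_chain`, `fake_retriever`, and `fake_document_chain` are hypothetical stand-ins for create_retrieval_chain, the retriever, and the document chain.

```python
def make_retrieval_chain(retrieve, answer):
    """Combine a retrieval function and an answering function into one callable."""

    def invoke(inputs):
        # Retrieve context for the question, then pass both on to the answering step.
        context = retrieve(inputs["input"])
        return {
            "input": inputs["input"],
            "context": context,
            "answer": answer({"input": inputs["input"], "context": context}),
        }

    return invoke


def fake_retriever(question):
    """Stand-in for the vector store retriever: always returns one document."""
    return ["Hash tables offer constant expected lookup time."]


def fake_document_chain(inputs):
    """Stand-in for the prompt + LLM step: summarises what it was given."""
    return f"Answer based on {len(inputs['context'])} retrieved document(s)."


toy_chain = make_retrieval_chain(fake_retriever, fake_document_chain)
toy_response = toy_chain(
    {"input": "Which data structures have the most efficient lookup time?"}
)
print(sorted(toy_response))  # prints: ['answer', 'context', 'input']
```

Because the retrieved context travels through to the output, you can inspect exactly which passages the answer was based on.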

This way of creating the RAG pipeline is quick, but not as customizable. If you need more control over the input variables, you’ll need to create your own chain.

In the next module, we’ll explore how to do this to create a simple Panel application that uses the RAG pipeline to generate responses to user questions.

For now let’s clean up the qdrant client by closing it before the next module, otherwise we’ll run into errors!

qdrant.client.close()