OLMO RAG DEMO#

Now it’s your turn to apply your data and specific domain knowledge.

You can use this notebook as a starting point and adapt it to your needs. You will need to develop the pre-processing stage for a RAG system. This includes document retrieval, cleaning, chunking, and ingestion into the vector database using an embedding model.

To help you, we’ve provided a few example code snippets in Jupyter notebooks found in the appendix.

from testcontainers.qdrant import QdrantContainer

qdrant = QdrantContainer()

qdrant.start()

client = qdrant.get_client()

Utility Functions#

A section for whatever utility functions you need. We have packaged up our utility functions in a Python package called ssec_tutorials. You can find the source code in this GitHub repository.

# Write your code here for whatever utility functions you need. This can be anything such as
# cleaning up document format, setting up prompt templates, etc.


# A simple document formatting function
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

Retrieve documents#

A section for document retrieval. This just means getting your documents from whatever source you have, whether on your local computer or on the internet. See the Document Loaders integration list from Langchain for an extensive list of what’s possible.

For the purpose of this tutorial, we recommend a simple example: loading text from a file such as a PDF. Also, if you have a large piece of text, you can split it into smaller chunks using Langchain’s RecursiveCharacterTextSplitter.
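To make the idea of chunking concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is not Langchain’s splitter, which also tries to break on paragraph and sentence boundaries; the function name `chunk_text` and the sizes used are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so a short phrase that
    straddles a boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 10  # 100 characters of made-up text
for chunk in chunk_text(sample, chunk_size=40, overlap=10):
    print(len(chunk), chunk)
```

Larger chunks keep more context together but leave less room in the model’s context window, so the sizes are worth experimenting with.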

If you don’t have any data with you, you can try this Algorithms textbook by Jeff Erickson. It has been generously made available by Jeff Erickson under the Creative Commons Attribution 4.0 International license. You can find more information about the textbook at https://jeffe.cs.illinois.edu/teaching/algorithms/.

Note

If you’re running things in a Codespace, refer to this link and upload your data to the resources/ folder.

# Write your code here for your retrieval step,
# see the documentation on PyMuPDF for more information:
# https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/#using-pymupdf

# Download the textbook if it is not already present
import os
from urllib.request import urlretrieve

url = "http://jeffe.cs.illinois.edu/teaching/algorithms/book/Algorithms-JeffE.pdf"
filename = os.path.basename(url)

if not os.path.exists(filename):
    # Download if file doesn't exist
    pdf_path, headers = urlretrieve(url, filename)

import os
from langchain_community.document_loaders import PyMuPDFLoader

pdf_folder_path = "."  # update path to point to the relevant directory
documents = []
for file in os.listdir(pdf_folder_path):
    if file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder_path, file)
        loader = PyMuPDFLoader(pdf_path)
        documents.extend(loader.load())

for each in documents:
    # print(each.page_content)  # Uncomment this line to see the individual page_content
    print(each.metadata)

# Write your code here to load the PDF documents as Langchain Document objects
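Text extracted from PDFs often contains words hyphenated across line breaks and ragged whitespace, which is where the cleaning step mentioned at the start comes in. Below is a minimal sketch using only the standard library; `clean_page_text` is a hypothetical helper, not part of Langchain or ssec_tutorials, and its rules are assumptions you may need to adjust for your documents.

```python
import re


def clean_page_text(text: str) -> str:
    """Tidy up text extracted from one PDF page."""
    # Rejoin words hyphenated across a line break: "multi-\nplication" -> "multiplication"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces, keeping blank lines as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Allow at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Lattice multi-\nplication is  taught\nin school.\n\n\nNext   paragraph."
print(clean_page_text(raw))
```

Note that the first rule also merges genuine hyphenated compounds that happen to break at a line end. One possible use is to apply the function to the `page_content` of each loaded document before ingestion.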

Document Embeddings to Qdrant Vector Database#

Once you’ve figured out how to retrieve your documents and load them as Langchain Document objects, you can proceed to loading them into a Qdrant vector database collection.

See the following documentation for some guidance on Langchain Qdrant integration.

from langchain_huggingface import HuggingFaceEmbeddings

# Setup the embedding, we are using the MiniLM model here
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")

Setup Vector DB#

from qdrant_client import models
from langchain_qdrant import Qdrant

# Write your code here to load your data into the database

# Uncomment below to set the Qdrant path
# for "local mode" on-disk storage
# See https://python.langchain.com/v0.2/docs/integrations/vectorstores/qdrant/#on-disk-storage
# qdrant_path = "./my_qdrant_database"
qdrant_collection = "algorithms_book"

if not client.collection_exists(qdrant_collection):
    print("Creating collection:", qdrant_collection)
    client.create_collection(
        qdrant_collection,
        vectors_config=models.VectorParams(
            size=embedding.client.get_sentence_embedding_dimension(),
            distance=models.Distance.COSINE,
        ),
    )
    lcqdrant = Qdrant(
        client=client, collection_name=qdrant_collection, embeddings=embedding
    )
    uuids = lcqdrant.add_documents(documents=documents)
else:
    lcqdrant = Qdrant(
        client=client, collection_name=qdrant_collection, embeddings=embedding
    )
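The collection above is created with `Distance.COSINE`, which ranks vectors by the angle between them rather than by their length. The following plain-Python sketch shows the quantity being compared; the tiny vectors are made up for illustration (real embeddings from the model have hundreds of dimensions), and this is not how Qdrant computes it internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up vectors: same direction, perpendicular, and opposite direction
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```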

Test out the Qdrant collection#

At this step, you should have a Qdrant object (langchain_qdrant.vectorstores.Qdrant) with your documents loaded into a collection. You can test the collection by querying for documents and checking whether the results are as expected.

To do this, you’ll need to create a VectorStoreRetriever.

Note

A sample question to ask about the document is "What is the most familiar method for multiplying large numbers?". The answer can be found on page 3, section 0.2 Multiplication, Lattice Multiplication.

Tip

You’ll probably need to tweak the arguments used to create the VectorStoreRetriever object, such as the search type and the number of documents returned. This part is a bit of trial and error, so don’t be afraid to experiment. Retrieving the right documents for the question is a critical part of a RAG system, because those documents are what the LLM uses to generate the answer.

# Write your code here to try out the vector database retrieval with a question query
retriever = lcqdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})

retriever.invoke("What is the most familiar method for multiplying large numbers?")
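The retriever above uses the "mmr" search type. To show what maximal marginal relevance trades off, here is a minimal plain-Python sketch: each pick balances similarity to the query against similarity to documents already chosen. The vectors are made up, `mmr` and `lambda_mult` here are names in this sketch only, and this is an illustration of the idea rather than the code Langchain or Qdrant runs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        # Penalise documents that look like something already selected
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2, lambda_mult=0.5))  # favours diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking, which is one way to see why the two search types can return different documents for the same question.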

Setup OLMo Model#

At this stage, we have the Retrieval-Augmented (RA) part of the RAG system. Let’s now set up the Generation (G) part with the OLMo model.

from ssec_tutorials import download_olmo_model

# This will download the OLMO model to the cache directory
OLMO_MODEL = download_olmo_model()

# Uncomment this line to understand your available options for LlamaCpp Class
# LlamaCpp?

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler

# Here we've setup the LlamaCpp model,
# but you'll need to add additional arguments to `LlamaCpp`
# to make it work for your specific use case
olmo = LlamaCpp(
    model_path=str(OLMO_MODEL),
    callbacks=[StreamingStdOutCallbackHandler()],
    verbose=False,
    n_ctx=2048,
)

Tip

Try asking OLMo some questions about the content of the documents you’ve loaded into the Qdrant collection. You will find that the OLMo model is not trained on your specific domain, so on its own it might not give you the best results.

_ = olmo.invoke(input="What is the most familiar method for multiplying large numbers?")

Prompt Engineering#

Rather than just a simple question, we’ll need to refine the prompt to include an instruction and context for the model to use when generating the answer. To do this, we’ll need to set up a proper string PromptTemplate.

from langchain_core.prompts import PromptTemplate

# Create the initial prompt template using OLMo's tokenizer chat template we saw in module 1.
prompt_template = PromptTemplate.from_template(
    template=olmo.client.metadata["tokenizer.chat_template"],
    template_format="jinja2",
    partial_variables={"add_generation_prompt": True, "eos_token": "<|endoftext|>"},
)

Set the question for the prompt.

question = "What is the most familiar method for multiplying large numbers?"

Set the context for the prompt. This is where you’ll need to use the VectorStoreRetriever and format the document object with format_docs or simply add your own text to the variable.

# Uncomment variable below to set the context
# context = "Enter code or string here"

Set the instruction for the prompt.

instruction = """You are a computer science professor.
Please answer the following question based on the given context."""

The original OLMo chat template takes in multiple messages with a role and content key. You can use this template to ask questions to the model. For simplicity, we’ll just use a single message.

# Uncomment below to set the input text template
# input_text_template = f"""\
# {instruction}

# Context: {context}

# Question: {question}
# """

# Uncomment below to set the message dictionary
# message = {
#     "role": "user",
#     "content": input_text_template,
# }

# Uncomment below to try out the prompt template
# print(prompt_template.format(
#     messages=[message]
# ))

You can see above what the final prompt looks like. There are tags like <|user|> that signal to the model that this is user input, and so on. This final string is sent to the model to generate the answer.
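If you have not run the cells yet, the sketch below gives a rough idea of the shape of such a string for a single user message. The layout is an assumption: the `<|assistant|>` tag, the position of the end-of-sequence token, and the exact whitespace are guesses, and `render_single_turn` is a hypothetical helper. The real layout is whatever the template stored in the model metadata produces.

```python
def render_single_turn(content: str, eos_token: str = "<|endoftext|>") -> str:
    """Approximate a single-message chat prompt (assumed layout, see note above)."""
    # Assumed order: EOS token, user tag, the message itself, then an
    # assistant tag that cues the model to start writing its answer.
    return f"{eos_token}<|user|>\n{content}\n<|assistant|>\n"


print(render_single_turn("What is the most familiar method for multiplying large numbers?"))
```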

RAG#

At this point, you have all the parts of a RAG system set up. Now let’s chain the prompt engineering, the OLMo model, and the Qdrant collection to get a more accurate answer.

# 1. Set the question
question = "What is the most familiar method for multiplying large numbers?"

# 2. Set the context
context = format_docs(retriever.invoke(question))

# 3. Set the instruction
instruction = """You are a computer science professor.
Please answer the following question based on the given context."""

# 4. Set the input text template
input_text_template = f"""\
{instruction}

Context: {context}

Question: {question}
"""

# 5. Set the message dictionary
message = {
    "role": "user",
    "content": input_text_template,
}

# 6. Chain the prompt template and olmo model
llm_chain = prompt_template | olmo

# 7. Invoke the chain
llm_chain.invoke(input={"messages": [message]})
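If you want to ask several questions, steps 1 to 5 can be wrapped in a small function. The sketch below is one possible way to do that; `build_message` is a hypothetical helper, and `Doc` is a stand-in for a Langchain Document so the example runs without the retriever.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a Langchain Document; only page_content is used here."""
    page_content: str


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def build_message(question: str, docs, instruction: str) -> dict:
    """Assemble the single user message from instruction, context, and question."""
    context = format_docs(docs)
    content = f"{instruction}\n\nContext: {context}\n\nQuestion: {question}\n"
    return {"role": "user", "content": content}


message = build_message(
    question="What is the most familiar method for multiplying large numbers?",
    docs=[Doc("Made-up passage one."), Doc("Made-up passage two.")],
    instruction="You are a computer science professor.",
)
print(message["content"])
```

In the notebook, the `docs` argument would be the result of `retriever.invoke(question)`, and the returned dictionary would be passed to the chain as `{"messages": [message]}`.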

import panel as pn

pn.extension()

from langchain_core.callbacks import CallbackManager, BaseCallbackHandler
from langchain_core.runnables import RunnablePassthrough
from uuid import uuid4
import textwrap

def get_chain(callback_handlers: list[BaseCallbackHandler], input_prompt_template: str):
    # 1. Set up the vector database retriever.
    # This line of code will create a retriever object that
    # will be used to retrieve documents from the vector database.
    retriever = lcqdrant.as_retriever(
        callbacks=callback_handlers,  # pass the result of the retrieval to the callback handler
        search_type="mmr",  # the mmr (maximal marginal relevance, a typical information retrieval tactic) search
        search_kwargs={"k": 2},  # return top 2 results
    )

    # 2. Set up the Langchain callback manager, which passes events and results
    # from the Langchain LLM object to the callback handlers.
    callback_manager = CallbackManager(callback_handlers)

    # 3. Setup the Langchain llama.cpp model object.
    # In our case, we are using the `OLMo-7B-Instruct` model.
    # llama-cpp-python is a Python binding for llama.cpp C++ library as mentioned in previous modules.
    olmo = LlamaCpp(
        model_path=str(OLMO_MODEL),  # the path to the OLMo model in GGUF file format
        callback_manager=callback_manager,  # set the callback manager to handle callbacks
        temperature=0.8,  # set the randomness of the model's output
        n_ctx=4096,  # set limit for the length of the input context
        max_tokens=512,  # set limit for the length of the generated text
        verbose=False,  # determines whether the model should print out debug information
        echo=False,  # determines whether the input prompt should be included in the output
    )

    # 4. Set up the initial Langchain Prompt Template using text based jinja2 format
    prompt_template = PromptTemplate.from_template(
        template=olmo.client.metadata[
            "tokenizer.chat_template"
        ],  # get the chat template from the model metadata
        template_format="jinja2",  # set the template format to jinja2
        partial_variables={
            "add_generation_prompt": True,  # add generation prompt to the template, this option is from the model metadata
            "eos_token": "<|endoftext|>",  # set the end-of-sequence token
        },
    )

    # 5. Transform the Prompt Template to include the user role and the context
    # This will allow the model to generate text based on the context provided.
    # However, after setting this new template, the model will be limited to
    # generating text based on the created prompt template with input of
    # `context` and `question` keys.
    transformed_prompt_template = PromptTemplate.from_template(
        prompt_template.partial(
            # The default chat template takes a list of messages with a role and content
            # to setup this particular app, we will only pass a single message with the user role
            # and the input prompt content
            messages=[
                {
                    "role": "user",  # set the role to user, this allows for user input to be passed to the model
                    "content": input_prompt_template,  # the input prompt template, must have `context` and `question` keys to work
                }
            ]
        ).format()
    )

    # 6. Define the `format_docs` function to format the retrieved Langchain documents object to simple string
    def format_docs(docs):
        text = "\n\n".join([d.page_content for d in docs])
        return text

    # 7. Define the `show_docs` function to display the retrieved documents to app panel
    # this is currently a small hack to display the retrieved documents to the app panel
    # as mentioned in https://github.com/langchain-ai/langchain/issues/7290
    def show_docs(docs):
        for callback_handler in callback_handlers:
            callback_handler.on_retriever_end(
                docs,  # pass the retrieved documents to the callback handler
                run_id=uuid4(),  # generate a random run id
            )
        return docs

    # 8. Return the Langchain chain object
    # The way the chain reads is as follows:
    return (
        {
            # The Vector Database retriever documents,
            # which is then passed to the `show_docs` function,
            # which is then passed to the `format_docs` function for formatting
            "context": retriever | show_docs | format_docs,
            # The Question asked by the user from the Chat Text Input Interface is passed in as well
            "question": RunnablePassthrough(),
        }
        # The dictionary above that contains text values for `context` and `question` is now passed
        # to the transformed prompt template so that the final prompt text can be generated
        | transformed_prompt_template
        # The full final prompt text with both context and question is passed to the OLMo model
        # for generation of the final output. Note that this final prompt text cannot exceed the maximum
        # `n_ctx` input context value set in the OLMo model above.
        | olmo
    )
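The chain returned by `get_chain` reads left to right: a dictionary fans the input out into `context` and `question`, and each `|` feeds one step’s output into the next. The toy sketch below mimics that data flow in plain Python. The `Step` class, `fan_out`, and the fake retriever and model are all hypothetical stand-ins, not Langchain’s implementation.

```python
class Step:
    """A callable wrapper that supports `a | b` composition."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b: run a first, then feed its output into b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(
        lambda value: {name: step.invoke(value) for name, step in branches.items()}
    )


# Stand-ins for the retriever, formatter, prompt template, and model
retrieve = Step(lambda question: ["doc about " + question])
join_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
prompt = Step(lambda d: f"Context: {d['context']}\n\nQuestion: {d['question']}")
fake_llm = Step(lambda text: f"[answer based on {len(text)} characters of prompt]")

chain = fan_out(context=retrieve | join_docs, question=passthrough) | prompt | fake_llm
print(chain.invoke("sorting"))
```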

input_prompt_template = textwrap.dedent(
    """\
You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

{context}

Question: {question}
"""
)
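Because `get_chain` expects the input prompt template to contain the `context` and `question` keys, it can help to check a template before building the chain. The sketch below does this with the standard library only; `placeholder_names` and `check_template` are hypothetical helpers, and it assumes the template uses `{name}`-style placeholders as in the example above.

```python
from string import Formatter


def placeholder_names(template: str) -> set[str]:
    """Return the names of all {placeholders} found in the template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_template(template: str, required=("context", "question")) -> None:
    """Raise ValueError if any required placeholder is missing."""
    missing = set(required) - placeholder_names(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")


example_template = "Answer using this context:\n\n{context}\n\nQuestion: {question}\n"
check_template(example_template)  # passes silently
print(sorted(placeholder_names(example_template)))
```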

Now we will use that chain workflow to create a simple chat application using Panel. In the diagram above, this would be our “Enterprise App”, but obviously much simpler and not ready for production at this stage.

To begin, we will set up the asynchronous callback function for the pn.chat.ChatInterface layout component. This will allow us to interact with the chat interface and ask questions.

The ChatInterface is a high-level layout that provides a front-end interface for inputting different kinds of messages: text, images, PDFs, etc.

This layout provides front-end methods to:

Input (append) messages to the chat log.
Re-run (resend) the most recent user input ChatMessage.
Remove messages until the previous user input ChatMessage.
Clear the chat log, erasing all ChatMessage objects.

async def callback(contents, user, instance):
    # 1. Create a panel callback handler
    # The Langchain PanelCallbackHandler is useful for rendering and streaming the chain of thought
    # from Langchain objects like Tools, Agents, and Chains.
    # It inherits from Langchain’s BaseCallbackHandler.
    # Here we set the user to be the model name "OLMo" with an avatar of a tree emoji "🌳"
    # for the tree of knowledge.
    callback_handler = pn.chat.langchain.PanelCallbackHandler(
        instance, user="OLMo", avatar="🌳"
    )

    # 2. Set to not return the full generated result at the end of the generation;
    # this prevents the model from repeating the result in the interface
    callback_handler.on_llm_end = lambda response, *args, **kwargs: None

    # 3. Create and setup the Langchain chain object with the callback handler and input prompt template
    chain = get_chain(
        callback_handlers=[callback_handler],
        input_prompt_template=input_prompt_template,
    )

    # 4. Run the chain with the input contents
    _ = await chain.ainvoke(contents)
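Step 2 of the callback works by replacing a method on one handler instance with a function that does nothing. The sketch below shows that Python technique on a made-up `DemoHandler` class; it is not Panel’s handler, and the method signatures are simplified assumptions.

```python
class DemoHandler:
    """Made-up handler that records everything it would display."""

    def __init__(self):
        self.shown = []

    def on_llm_new_token(self, token):
        self.shown.append(token)  # streamed piece of the answer

    def on_llm_end(self, response):
        self.shown.append(response)  # full answer, repeated at the end


handler = DemoHandler()
# Instance-level override: only this handler ignores the final result
handler.on_llm_end = lambda response, *args, **kwargs: None

for token in ["Lattice", " multiplication"]:
    handler.on_llm_new_token(token)
handler.on_llm_end("Lattice multiplication")

print(handler.shown)
```

Because the override is set on the instance rather than the class, other handler objects keep their normal behaviour.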

Now that we have a callback function, we’re ready to pass it to the ChatInterface layout component.

The code below takes in the asynchronous callback function from above and serves the chat interface to the user. This callback function will run every time the user sends a message in the chat interface. The callback function callback will receive the input text as part of the contents. The contents will be passed to the Langchain chain where:

The retriever fetches documents based on the input text.
A prompt is generated from the instruction, the document results, and the question input text.
An answer is generated based on the prompt.
The generated answer and the retrieved document text are returned to the user.

pn.chat.ChatInterface(callback=callback).servable()

If you’re running this notebook in JupyterLab, there should be a Panel logo in the menu bar of your notebook. You can clear the output and restart the kernel, then enable the preview for this app by clicking Panel’s logo. A new tab should open next to your notebook tab, and after a few moments your app will be rendered in it.

qdrant.stop()

Bonus: Try to create a simple chat app, by modifying the 1-olmo-chat-rag.ipynb notebook with your use case.

Please fill out the survey feedback form to help us improve the tutorial.