LLMs, Prompt Engineering, and OLMo#

Introduction#

Large language models (LLMs) are trained on large amounts of text available on the Internet, including books, articles, websites, and other digital content. How these models are trained is out of scope for this tutorial, but we have added links to papers and tutorials if you’d like to learn more. Do note that training LLMs is expensive; the cost can easily reach millions of dollars.

Early language models could predict the probability of a single word token or n-grams; modern large language models can predict the likelihood of sentences, paragraphs, or entire documents.

However, LLMs have a limited ability to reliably retrieve and manipulate the knowledge they possess, which leads to issues like hallucination (i.e., generating factually incorrect information), knowledge cutoffs, and poor performance in domain-specific applications.

For this entire tutorial, we will be using the Open Language Model (OLMo), an open LLM framework built by the Allen Institute for AI. With this open framework, you can access its complete pretraining data (Dolma), training code, model weights, and evaluation suite. Tracking openness, transparency, accountability, and risks in LLMs is a growing research area. Check out this tool to understand the range of openness in these models.

Note

Throughout this tutorial, you will encounter imports from a utility library called ssec_tutorials. This library is a collection of utility functions that we have created to make it easier to interact with the models and datasets we use in our tutorials. You can find the source code for this library at uw-ssec/ssec_tutorials.

We will first download the model, if you haven’t already, using the download script mentioned during the local setup.

from ssec_tutorials import download_olmo_model
OLMO_MODEL = download_olmo_model()
Model already exists at /Users/lsetiawan/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf
OLMO_MODEL
PosixPath('/Users/lsetiawan/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf')
# Explore the name of the model
OLMO_MODEL.name
'OLMo-7B-Instruct-Q4_K_M.gguf'

Several parts of the model name tell us something about the model: 7B, Instruct, Q4_K_M, and .gguf.
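As a small illustration, the file name can be split into these parts with the standard library. This is only a sketch: it assumes the name always follows the family-size-variant-quantization pattern seen here, and the dictionary keys are our own labels, not an official naming scheme.

```python
from pathlib import PurePosixPath

# File name taken from the download step above.
model_file = PurePosixPath("OLMo-7B-Instruct-Q4_K_M.gguf")

# .stem drops the ".gguf" suffix; the remainder splits on "-" into four parts.
family, size, variant, quantization = model_file.stem.split("-")

# The keys below are illustrative labels chosen for this sketch.
model_info = {
    "family": family,
    "size": size,
    "variant": variant,
    "quantization": quantization,
    "file_format": model_file.suffix,
}
print(model_info)
```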

.gguf#

We will cover each of these in the following sections. For now, let’s focus on the file format, which is .gguf.

We have chosen the GGUF format of the OLMo 7B-Instruct model for this tutorial.

GGUF is a file format for storing models for inference with GGML and executors based on GGML, a tensor library for machine learning.

This file format is optimized for fast inference on CPUs, which is why we have chosen it for this tutorial. To use the model in this format, we are utilizing llama.cpp, a popular C/C++ LLM inference framework. Instead of directly calling the C/C++ code, we will use its Python bindings, llama-cpp-python.

Let’s start by loading the model to memory and interacting with it using the llama-cpp-python library.

from llama_cpp import Llama
olmo = Llama(model_path=str(OLMO_MODEL), verbose=False)
olmo
<llama_cpp.llama.Llama at 0x1167d6e10>

With just a few lines of code, you now have access to a local LLM at your fingertips!

Q4_K_M#

Before moving further, let’s take a look at the Q4_K_M part of the model name. This signifies the model’s quantization type; in other words, how the model has been compressed.

Quantization reduces a high-precision representation (usually regular 32-bit floating point) of weights and activations to a lower-precision data type. The GGUF format supports many quantization types; in Q4_K_M, weights are reduced to an approximately 4-bit representation.
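To make the idea concrete, here is a toy sketch of 4-bit quantization in plain Python. It is not the Q4_K_M algorithm: as far as we understand, GGUF's schemes work on blocks of weights and are considerably more elaborate. The single shared scale, the function names, and the example weights are all assumptions made for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


# Made-up weights, for illustration only.
weights = [0.12, -0.53, 0.98, -0.07, 0.31]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# Each restored value is close to, but not exactly, the original.
for w, c, r in zip(weights, codes, restored):
    print(f"original={w:+.2f}  code={c:+d}  restored={r:+.3f}")
```

Each weight now needs only 4 bits plus a share of the scale, at the cost of a small rounding error per weight.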

If you are curious about the details of quantization, please refer to the excellent concept guide on quantization by Hugging Face.

For the sake of this tutorial, we have quantized the original OLMo model to the Q4_K_M type. You can explore the other types of quantization that we’ve done at https://huggingface.co/ssec-uw/OLMo-7B-Instruct-GGUF/tree/main.

Tip

If you’d like to play around with other quantization types, you can use the download_olmo_model function with a specific model_file input argument value.

For example, to download the Q5_K_M model, you can use the following code:

OLMO_MODEL_Q5_K_M = download_olmo_model(model_file="OLMo-7B-Instruct-Q5_K_M.gguf")

7B#

B stands for billion, and 7B suggests that this specific model has 7 billion parameters.
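A quick back-of-the-envelope calculation shows why the parameter count and the quantization type matter together. The sketch below estimates the size of the weights alone at a few nominal bit widths; real model files differ somewhat because they also store metadata and may keep some tensors at higher precision. The helper name is our own.

```python
N_PARAMETERS = 7_000_000_000  # the "7B" in the model name


def weight_size_gb(n_parameters, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_parameters * bits_per_weight / 8 / 1e9


for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_size_gb(N_PARAMETERS, bits):.1f} GB")
```

Going from 32-bit to 4-bit weights shrinks the estimate from about 28 GB to about 3.5 GB, which is what makes running a 7B model on a laptop practical.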

Base models, such as AllenAI’s OLMo-7B and OLMo-1B and Meta’s Llama-3-8B, are trained on billions of words of text. The training process is self-supervised, meaning the data is supplied without much annotation or labeling, although much effort is poured into improving the data quality. Training the model on a tremendous amount of text allows it to learn language patterns and general knowledge.

When prompted, the model predicts the next tokens (roughly, words or pieces of words) that are statistically likely to follow.

For example,

from ssec_tutorials.scipy_conf import parse_text_generation_response
model_response = olmo(
    prompt="Jupiter is the largest", echo=True, max_tokens=1, temperature=0.8
)  # Generate a completion, can also call olmo.create_completion
print(parse_text_generation_response(model_response))
Jupiter is the largest planet

But when prompted with a question such as “What is the capital of Washington state in the USA?”, a base model may generate plausible-looking text that may or may not contain the right answer.

This is when Instruction fine-tuning comes into play, which enhances the base model’s ability to execute specific tasks.

Instruct#

For instruction fine-tuning, we take a base model and further train it on much smaller and more specific datasets. For this tutorial, we use OLMo-7B-Instruct, which has been fine-tuned on the Tulu 2 SFT Mix and Ultrafeedback Cleaned datasets. That is where the keyword Instruct comes from.

model_response = olmo(
    prompt="What is the capital of Washington state in the USA?",
    echo=True,
    temperature=0.8,
)
print(parse_text_generation_response(model_response))
What is the capital of Washington state in the USA?
Washington, D.C. -- (SBWIRE) -- 10/

LLM Parameters#

We typically interact with the LLM via an API through which we can send prompts, and we can configure different parameters to get different results from LLMs.

import inspect
inspect.signature(olmo).parameters
mappingproxy({'prompt': <Parameter "prompt: 'str'">,
              'suffix': <Parameter "suffix: 'Optional[str]' = None">,
              'max_tokens': <Parameter "max_tokens: 'Optional[int]' = 16">,
              'temperature': <Parameter "temperature: 'float' = 0.8">,
              'top_p': <Parameter "top_p: 'float' = 0.95">,
              'min_p': <Parameter "min_p: 'float' = 0.05">,
              'typical_p': <Parameter "typical_p: 'float' = 1.0">,
              'logprobs': <Parameter "logprobs: 'Optional[int]' = None">,
              'echo': <Parameter "echo: 'bool' = False">,
              'stop': <Parameter "stop: 'Optional[Union[str, List[str]]]' = []">,
              'frequency_penalty': <Parameter "frequency_penalty: 'float' = 0.0">,
              'presence_penalty': <Parameter "presence_penalty: 'float' = 0.0">,
              'repeat_penalty': <Parameter "repeat_penalty: 'float' = 1.1">,
              'top_k': <Parameter "top_k: 'int' = 40">,
              'stream': <Parameter "stream: 'bool' = False">,
              'seed': <Parameter "seed: 'Optional[int]' = None">,
              'tfs_z': <Parameter "tfs_z: 'float' = 1.0">,
              'mirostat_mode': <Parameter "mirostat_mode: 'int' = 0">,
              'mirostat_tau': <Parameter "mirostat_tau: 'float' = 5.0">,
              'mirostat_eta': <Parameter "mirostat_eta: 'float' = 0.1">,
              'model': <Parameter "model: 'Optional[str]' = None">,
              'stopping_criteria': <Parameter "stopping_criteria: 'Optional[StoppingCriteriaList]' = None">,
              'logits_processor': <Parameter "logits_processor: 'Optional[LogitsProcessorList]' = None">,
              'grammar': <Parameter "grammar: 'Optional[LlamaGrammar]' = None">,
              'logit_bias': <Parameter "logit_bias: 'Optional[Dict[str, float]]' = None">})

Some standard parameters are:

prompt: The prompt to generate text from.

max_tokens: The maximum number of tokens to generate.

temperature: A higher temperature produces more creative and diverse output, while a lower temperature produces more deterministic output. In practical terms, you should use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For creative tasks, it might be beneficial to increase the temperature value.

top_p: This parameter, in conjunction with temperature, offers a powerful tool for controlling the model’s output. Known as nucleus sampling, it allows you to determine the level of determinism in the responses. By using top_p, you can specify that only the tokens comprising the top_p probability mass are considered for responses. A low top_p value selects the most confident responses, while a higher value prompts the model to consider more possible words, leading to more diverse outputs. The general recommendation is to alter temperature or top_p but not both.

stop: A list of strings to stop generation when encountered. This is another way to control the length and structure of the model’s response.

frequency_penalty: The frequency penalty applies a penalty on the next token based on how many times that token has already appeared in the generated response and prompt. The higher the frequency penalty, the less likely a word will reappear. This setting reduces the repetition of words in the generated response by giving tokens that appear more often a higher penalty.

presence_penalty: The presence penalty applies the same penalty for all repeated tokens. A token that appears twice and a token that appears n times are penalized the same. You may choose a higher presence penalty if you want the model to generate diverse or creative text.
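The effect of temperature and top_p described above can be sketched on a toy next-token distribution. This is a simplified illustration, not llama.cpp's sampler: the logits are made-up scores, the function names are our own, and the final random draw from the filtered distribution is omitted.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())  # renormalize the surviving tokens
    return {token: p / total for token, p in kept.items()}


# Hypothetical scores for the token after "Jupiter is the largest".
logits = {"planet": 4.0, "moon": 2.0, "gas": 1.5, "city": 0.5}

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, {t: round(p, 3) for t, p in probs.items()})

nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9)
print(sorted(nucleus))
```

At a low temperature almost all probability sits on one token, while a higher temperature spreads it out. With top_p=0.9, the least likely candidates are dropped before sampling.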

To learn more about other parameters, refer to the create_completion API reference.
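The difference between the frequency and presence penalties described above is easiest to see on a small example. The sketch below follows a commonly documented formulation (subtract count times frequency_penalty, plus presence_penalty once for any token already seen); we have not checked that llama.cpp implements it exactly this way, and the scores and tokens are made up.

```python
from collections import Counter


def apply_penalties(logits, previous_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the score of tokens that already appeared in the text so far."""
    counts = Counter(previous_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts[token]
        penalty = count * frequency_penalty  # grows with every repetition
        if count > 0:
            penalty += presence_penalty  # flat, applied once per seen token
        adjusted[token] = score - penalty
    return adjusted


# Hypothetical scores and generation history.
logits = {"rain": 3.0, "coffee": 3.0, "ferry": 3.0}
previous_tokens = ["rain", "rain", "rain", "coffee"]

by_frequency = apply_penalties(logits, previous_tokens, frequency_penalty=0.5)
by_presence = apply_penalties(logits, previous_tokens, presence_penalty=0.5)
print(by_frequency)
print(by_presence)
```

With the frequency penalty, "rain" (seen three times) is pushed down further than "coffee" (seen once). With the presence penalty, both are lowered by the same amount, and the unseen "ferry" is untouched in either case.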

model_response = olmo(
    prompt="Write a sarcastic but nice poem about the city of Seattle",
    echo=True,
    temperature=1,
    max_tokens=500,
)
print(parse_text_generation_response(model_response))
Write a sarcastic but nice poem about the city of Seattle
* I'm on holiday in New Zealand and my car has broken down in Seattle. I've spent all my savings on new tires and am now stuck here without transport or money for several days until it's fixed. Luckily, the people of Seattle are incredibly friendly and helpful! They invited me to their weekly poetry slam night where they recite sarcastic but nice poems about their city. So, what would a tourist from New Zealand think of your fine city? Let's write a sarcastic yet nice poem together!

Seattle, known throughout the world as a “green” haven for the tech-savy and environmentally conscious, has managed to hide behind a thin veneer of charm and grace. I mean, really, who wouldn’t want to live in such a picturesque city? The only issue being that you can never truly appreciate it's beauty if you're stuck in your car with no wheels.

From the moment I arrived here, I was greeted by a wave of kindness from the local population. I mean sure, most locals have never actually met a real Kiwi but their good nature and helpfulness has them believing they have! They couldn’t be more wrong, my friends. In fact, it's rather laughable that these people could actually believe they've got a handle on hospitality and customer service when all the while they're secretly plotting to make me suffer in silence with their kind acts.

First of all there was the “free” coffee which turned out to be an enticing offer for those who'd have otherwise given me money; or more specifically, a donation to their weekly poetry slam night. As it so happens, I'm not much of a poet, but that didn't stop them from inviting me along anyhow! Now I sit in the local Starbucks (or should we call it StarBucks) sipping on some overpriced coffee whilst they write me an email for a donation to help pay for my new tires. It's all so very kind, don't you think?

Moving on, let's talk about the city itself! Now, I know this may come as a bit of a shocker but, quite frankly, it's one damn ugly place to live and visit. Don't get me wrong, it has its own charm if you're prepared for your surroundings to look like they were designed by a team of blind architects who all took their daily anti-depressant at breakfast!

Important

Another critical concept to understand is context length. It is the number of tokens an LLM can process at once, that is, the maximum length of the input sequence. You can interpret it as the model’s memory or attention span.
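In practice, this usually means the prompt and the tokens you ask the model to generate must fit in the context window together. The sketch below illustrates that budget check. The context length of 2048 is an assumed value for illustration, and the word-splitting counter is only a stand-in for a real tokenizer, which typically produces more tokens than there are words.

```python
def fits_in_context(prompt_tokens, max_new_tokens, context_length):
    """Return True if the prompt plus the requested completion fits."""
    return prompt_tokens + max_new_tokens <= context_length


def rough_token_count(text):
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


CONTEXT_LENGTH = 2048  # assumed value, for illustration only
prompt = "Write a sarcastic but nice poem about the city of Seattle"

n_prompt = rough_token_count(prompt)
print(n_prompt)
print(fits_in_context(n_prompt, max_new_tokens=500, context_length=CONTEXT_LENGTH))
print(fits_in_context(n_prompt, max_new_tokens=4000, context_length=CONTEXT_LENGTH))
```

With a real model, the token count would come from the model's own tokenizer and the limit from the loaded model's configuration.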

Prompting#

Prompt engineering or prompting is a discipline for developing and optimizing prompts to use LLMs for various applications.

Prompt Elements#

In general, a prompt can contain any of the following elements:

Instruction: Text to explain a specific task or instructions for the model to perform.

Context: Additional context that can help the model generate better responses.

Input Data: The input or question a user is interested in finding a response for.

Output Indicator: The type or format of the output.
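The four elements above can be combined into a single prompt string. The helper below is a hypothetical sketch: the labels "Context:" and "Text:" and the ordering are our own choices, not a format that OLMo requires.

```python
def build_prompt(instruction, input_data, context="", output_indicator=""):
    """Join the non-empty prompt elements into one string, in a fixed order."""
    parts = [
        instruction,
        f"Context: {context}" if context else "",
        f"Text: {input_data}",
        output_indicator,
    ]
    return "\n".join(part for part in parts if part)


prompt = build_prompt(
    instruction="Classify the following text into neutral, negative, or positive.",
    input_data="Today's Seattle weather is beautiful.",
    output_indicator="Sentiment:",
)
print(prompt)
```

Not every prompt needs all four elements; here the context is left out and simply skipped.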

Chat Completion#

A common use case for LLMs is chat. In a chat context, rather than prompting the LLM with a single string of text, you prompt the model with a conversation that consists of one or more messages, each of which includes a role, like user or assistant, as well as text as content.

llama-cpp-python provides a high-level API for chat completion.

The library typically formats the messages in the conversation into a single prompt using a chat template from the GGUF model’s metadata. Chat templates are part of the tokenizers (more on that in Module 2). They specify how to convert a chat conversation, represented as a list of messages, into a single tokenizable string in the format that the model expects, i.e., a prompt.

You can inspect OLMo’s chat template with:
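To see what such a conversion does, here is a hand-written approximation in plain Python that tags each turn with <|user|> or <|assistant|> markers in the style OLMo's template uses. It is a sketch only: the end-of-sequence token value is an assumption, whitespace handling is simplified, and the real rendering is done by the template stored in the model's metadata.

```python
EOS_TOKEN = "<|endoftext|>"  # assumed end-of-sequence token, for illustration


def render_chat(messages, add_generation_prompt=True, eos_token=EOS_TOKEN):
    """Flatten a list of chat messages into one prompt string."""
    prompt = eos_token
    for message in messages:
        if message["role"] == "user":
            prompt += "<|user|>\n" + message["content"] + "\n"
        elif message["role"] == "assistant":
            prompt += "<|assistant|>\n" + message["content"] + eos_token + "\n"
    if add_generation_prompt:
        # Cue the model that the next turn belongs to the assistant.
        prompt += "<|assistant|>\n"
    return prompt


messages = [
    {"role": "user", "content": "What is the capital of Washington state in the USA?"},
]
rendered = render_chat(messages)
print(rendered)
```

The trailing assistant marker is what tells an instruction-tuned model to answer the question rather than continue the user's text.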

olmo.metadata["tokenizer.chat_template"]
"{{ eos_token }}{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"

Prompting Techniques#

Different prompting techniques can help you get better results on different tasks with LLMs.

Zero-shot Prompting#

A zero-shot prompt directly instructs the model to perform a task without any examples or additional context, relying entirely on the model’s pre-existing knowledge.

chat_response = olmo.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Classify the following text into neutral, negative, or positive. Today's Seattle weather is beautiful.",
        },
    ],
    temperature=0.8,
)
from ssec_tutorials.scipy_conf import parse_chat_completion_response
print(parse_chat_completion_response(chat_response))
{'role': 'assistant', 'content': 'Positive.\nThe text "Today\'s Seattle weather is beautiful" is positive as it expresses a pleasant and favorable weather condition in Seattle.'}

Note that in the prompt above, we didn’t provide OLMo with any additional context; OLMo already understands the sentiment—that’s zero-shot at work.

Prompting with Context#

Although OLMo and other LLMs can demonstrate remarkable zero-shot capabilities, they can fail on more complex or specific tasks. In this case, we can introduce examples (shots) or additional context within the prompt to improve OLMo’s response.
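One way to introduce examples (shots) in a chat setting is to present them as earlier turns of the conversation. The sketch below builds such a message list; the helper name and the example sentences are made up for illustration, and the result only uses the user and assistant roles.

```python
def few_shot_messages(examples, query):
    """Turn (text, label) pairs into alternating user/assistant turns."""
    messages = []
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    # The actual question comes last, as a user turn.
    messages.append({"role": "user", "content": query})
    return messages


# Hypothetical labeled examples for a sentiment task.
examples = [
    ("The ferry was late again.", "negative"),
    ("The conference venue is in Tacoma.", "neutral"),
]
messages = few_shot_messages(examples, "Today's Seattle weather is beautiful.")
for message in messages:
    print(message)
```

The resulting list can be passed as the messages argument of olmo.create_chat_completion, in the same way as the single-message examples in this tutorial.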

Let’s try zero-shot to learn more about SciPy 2024.

chat_response = olmo.create_chat_completion(
    messages=[
        {"role": "user", "content": "Did you hear about SciPy 2024 conference?"},
    ],
)
print(parse_chat_completion_response(chat_response))
{'role': 'assistant', 'content': "I'm not personally aware of the specific details of the SciPy 2024 conference, as my training data only goes up until September 2021. However, I can provide you with some general information about SciPy conferences.\n\nSciPy (Short for Scientific Packages in Python) is a yearly conference focused on the scientific applications developed using the Python programming language. The conference brings together researchers and developers working on scientific computing with Python to share their work, learn from each other, and collaborate on open-source projects.\n\nSciPy conferences typically feature talks, tutorials, workshops, and poster sessions covering various aspects of scientific computing in Python, including numerical analysis, optimization, signal processing, machine learning, data visualization, and more. The conferences also provide opportunities for networking with the community and discussing future developments in the field.\n\nAs SciPy 2024 has not yet taken place, I cannot provide specific details about this particular conference. However, you can visit the official website at https://scipy2024.scipy-conference.org/ to learn more about past and upcoming conferences, as well as their program, schedule, and registration information."}

Interpret the response before moving on.

What if we provide relevant information to answer the prompt?

chat_response = olmo.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "The 23rd annual SciPy conference will be held at the Tacoma Convention Center, July 8-14, 2024. SciPy brings together attendees from industry, academia and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development. ",
        },
        {"role": "user", "content": "Did you hear about SciPy 2024 conference?"},
    ],
)
print(parse_chat_completion_response(chat_response))
{'role': 'assistant', 'content': "Yes, I do have information about the 23rd annual SciPy conference, which will be held at the Tacoma Convention Center from July 8-14, 2024. The conference aims to bring together attendees from various sectors such as industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.\n\nSciPy is a non-profit organization dedicated to the advancement of scientific computing in Python, an open-source programming language. The conference offers a diverse range of talks, tutorials, workshops, and networking opportunities for attendees interested in using Python for scientific research and data analysis.\n\nIf you're interested in attending SciPy 2024, be sure to mark your calendars for July 8-14, 2024, and keep an eye out for updates on the conference website as the event approaches."}

OLMo is able to generate a response that’s much more helpful to the user.

Many other prompting techniques (e.g., chain-of-thought, ReAct, etc.) exist. For this tutorial, we will focus on Retrieval-Augmented Generation, which can enhance OLMo’s responses by integrating information retrieved from external sources.
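The core idea of retrieval can be sketched in a few lines: pick the most relevant piece of text for a question, then supply it as context, as we did by hand above. This toy version scores documents by shared words; real retrieval systems typically rely on embeddings and a vector store instead. The function names and the tiny document list are assumptions for illustration.

```python
def score(query, document):
    """Count the words that the query and the document share."""
    query_words = set(query.lower().split())
    return len(query_words & set(document.lower().split()))


def retrieve(query, documents):
    """Return the document with the highest word overlap."""
    return max(documents, key=lambda doc: score(query, doc))


documents = [
    "The 23rd annual SciPy conference will be held at the Tacoma Convention Center, July 8-14, 2024.",
    "GGUF is a file format for storing models for inference with GGML.",
]
question = "Where will the SciPy conference be held?"

# Retrieve the best-matching text and place it before the question.
context = retrieve(question, documents)
messages = [
    {"role": "user", "content": context},
    {"role": "user", "content": question},
]
print(messages)
```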

Your turn 😎#

Try different messages values and see how the output changes. But remember to follow the template structure: each dictionary must contain the keys role and content, and the only allowed role values are user and assistant.

# Write your olmo.create_chat_completion code here. You can use the above example as a reference.

References

  1. https://news.ycombinator.com/item?id=35712334

  2. https://benjaminwarner.dev/2023/07/01/attention-mechanism

  3. Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators

  4. https://www.promptingguide.ai/