Ollama and Llama3 Revolution
Local Inference with Meta’s Latest Llama 3.2 LLMs Using Ollama, LangChain, and Streamlit
Meta’s latest Llama 3.2 1B and 3B models are available from Ollama. Learn how to install and interact with these models locally using Streamlit and LangChain.
Introduction
Meta just announced the release of Llama 3.2, a revolutionary set of open, customizable edge AI and vision models, including “small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions.” According to Meta, the lightweight, text-only Llama 3.2 1B and 3B models “support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge.”
On the same day, Ollama announced that the lightweight, text-only Llama 3.2 1B and 3B models were available on their platform. For those unfamiliar with it, Ollama allows you to “Get up and running with large language models” quickly and easily in your local environment. Ollama is available for macOS, Linux, and Windows (preview). Ollama works with a Modelfile, a blueprint for creating and sharing models. To learn more about Ollama, I suggest reviewing the Ollama FAQ document on GitHub.
Running Llama 3.2 3B with Ollama
Using Ollama, you can easily install either of the two lightweight models and start using them immediately. I had previously installed Ollama on my 2020-era Apple M1 MacBook Pro and newer 2023-era Apple M2 MacBook Air. Downloading and installing Ollama is quick and straightforward.
Once Ollama is installed, use the following command to pull the Llama 3.2 3B model: ollama pull llama3.2. This command will install a 4-bit quantized version of the 3B model, which requires 2.0 GB of disk space and has an identical hash to the 3b-instruct-q4_K_M model.
For the smaller 1B model, use the ollama pull llama3.2:1b command. This command will pull an 8-bit quantized version of the 1B model, requiring 1.3 GB of disk space, and has an identical hash to the 1b-instruct-q8_0 model.
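Both pull commands are shown together below; the optional ollama list command confirms which models are installed locally:

```sh
# Pull the 4-bit quantized 3B model (~2.0 GB)
ollama pull llama3.2

# Pull the 8-bit quantized 1B model (~1.3 GB)
ollama pull llama3.2:1b

# Optional: verify the models are available locally
ollama list
```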
Interacting with Llama 3.2
You can use the Ollama terminal interface to interact with Llama 3.2 1B or 3B. Run the ollama run llama3.2 command and enter your prompt, such as “When was Meta founded?” Next, try some follow-up questions that require the previous session context to answer correctly, like “How old is its founder?” and “What is their estimated net worth?”
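A terminal session will look roughly like the following; the >>> prompt is Ollama's interactive REPL, the model's responses are omitted here, and /bye exits the session:

```sh
ollama run llama3.2
>>> When was Meta founded?
>>> How old is its founder?
>>> What is their estimated net worth?
>>> /bye
```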
Llama 3.2 with Streamlit and LangChain
Although interacting with Llama 3.2 using the terminal interface is straightforward, it is not visually appealing. Streamlit and Gradio are very popular tools for quickly building sophisticated user interfaces (UIs) for Generative AI POCs and MVPs. We will use Streamlit and LangChain to interact with the Llama 3.2 1B and 3B models using a chat application I have written in Python, app.py. All code is available on GitHub.
```python
# Ollama-Streamlit-LangChain-Chat-App
# Streamlit app for chatting with Meta Llama 3.2 using Ollama and LangChain
# Author: Gary A. Stafford
# Date: 2024-09-26

import logging
from typing import Dict, Any

import streamlit as st
from langchain_community.chat_message_histories import StreamlitChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_ollama import ChatOllama

# Constants
PAGE_TITLE = "Llama 3.2 Chat"
PAGE_ICON = "🦙"
SYSTEM_PROMPT = "You are a friendly AI chatbot having a conversation with a human."
DEFAULT_MODEL = "llama3.2:latest"

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def initialize_session_state() -> None:
    defaults: Dict[str, Any] = {
        "model": DEFAULT_MODEL,
        "input_tokens": 0,
        "output_tokens": 0,
        "total_tokens": 0,
        "total_duration": 0,
        "num_predict": 2048,
        "seed": 1,
        "temperature": 0.5,
        "top_p": 0.9,
    }
    for key, value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = value


def create_sidebar() -> None:
    with st.sidebar:
        st.header("Inference Settings")
        st.session_state.system_prompt = st.text_area(
            label="System",
            value=SYSTEM_PROMPT,
            help="Sets the context in which to interact with the AI model. It typically includes rules, guidelines, or necessary information that help the model respond effectively.",
        )
        st.session_state.model = st.selectbox(
            "Model",
            ["llama3.2:1b", "llama3.2:latest"],
            index=1,
            help="Select the model to use.",
        )
        st.session_state.seed = st.slider(
            "Seed",
            min_value=1,
            max_value=9007199254740991,
            value=round(9007199254740991 / 2),
            step=1,
            help="Controls the randomness of how the model selects the next tokens during text generation.",
        )
        st.session_state.temperature = st.slider(
            "Temperature",
            min_value=0.0,
            max_value=1.0,
            value=0.5,
            step=0.01,
            help="Sets an LLM's entropy. Low temperatures render outputs that are predictable and repetitive. Conversely, high temperatures encourage LLMs to produce more random, creative responses.",
        )
        st.session_state.top_p = st.slider(
            "Top P",
            min_value=0.0,
            max_value=1.0,
            value=0.90,
            step=0.01,
            help="Sets the probability threshold for the nucleus sampling algorithm. It controls the diversity of the model's responses.",
        )
        st.session_state.num_predict = st.slider(
            "Response Tokens",
            min_value=0,
            max_value=8192,
            value=2048,
            step=16,
            help="Sets the maximum number of tokens the model can generate in response to a prompt.",
        )
        st.markdown("---")
        st.text(
            f"""Stats:
- model: {st.session_state.model}
- seed: {st.session_state.seed}
- temperature: {st.session_state.temperature}
- top_p: {st.session_state.top_p}
- num_predict: {st.session_state.num_predict}
"""
        )


def create_chat_model() -> ChatOllama:
    return ChatOllama(
        model=st.session_state.model,
        seed=st.session_state.seed,
        temperature=st.session_state.temperature,
        top_p=st.session_state.top_p,
        num_predict=st.session_state.num_predict,
    )


def create_chat_chain(chat_model: ChatOllama):
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", st.session_state.system_prompt),
            MessagesPlaceholder(variable_name="chat_history"),
            ("human", "{input}"),
        ]
    )
    return prompt | chat_model


def update_sidebar_stats(response: Any) -> None:
    total_duration = response.response_metadata["total_duration"] / 1e9
    st.session_state.total_duration = f"{total_duration:.2f} s"
    st.session_state.input_tokens = response.usage_metadata["input_tokens"]
    st.session_state.output_tokens = response.usage_metadata["output_tokens"]
    st.session_state.total_tokens = response.usage_metadata["total_tokens"]

    token_per_second = (
        response.response_metadata["eval_count"]
        / response.response_metadata["eval_duration"]
    ) * 1e9
    st.session_state.token_per_second = f"{token_per_second:.2f} tokens/s"

    with st.sidebar:
        st.text(
            f"""
- input_tokens: {st.session_state.input_tokens}
- output_tokens: {st.session_state.output_tokens}
- total_tokens: {st.session_state.total_tokens}
- total_duration: {st.session_state.total_duration}
- token_per_second: {st.session_state.token_per_second}
"""
        )


def main() -> None:
    st.set_page_config(page_title=PAGE_TITLE, page_icon=PAGE_ICON, layout="wide")

    st.markdown(
        """
        <style>
        #MainMenu {visibility: hidden;}
        footer {visibility: hidden;}
        header {visibility: hidden;}
        </style>
        """,
        unsafe_allow_html=True,
    )

    st.title(f"{PAGE_TITLE} {PAGE_ICON}")
    st.markdown("##### Chat")

    initialize_session_state()
    create_sidebar()

    chat_model = create_chat_model()
    chain = create_chat_chain(chat_model)

    msgs = StreamlitChatMessageHistory(key="special_app_key")
    if not msgs.messages:
        msgs.add_ai_message("How can I help you?")

    chain_with_history = RunnableWithMessageHistory(
        chain,
        lambda session_id: msgs,
        input_messages_key="input",
        history_messages_key="chat_history",
    )

    for msg in msgs.messages:
        st.chat_message(msg.type).write(msg.content)

    if prompt := st.chat_input("Type your message here..."):
        st.chat_message("human").write(prompt)

        with st.spinner("Thinking..."):
            config = {"configurable": {"session_id": "any"}}
            response = chain_with_history.invoke({"input": prompt}, config)
            logger.info({"input": prompt, "config": config})
            st.chat_message("ai").write(response.content)
            logger.info(response)
            update_sidebar_stats(response)

    if st.button("Clear Chat History"):
        msgs.clear()
        st.rerun()


if __name__ == "__main__":
    main()
```
The Streamlit application leverages an instance of LangChain’s ChatOllama class, which integrates directly with the Ollama chat model. The application also uses the RunnableWithMessageHistory class, part of langchain_core. It wraps another Runnable and manages its chat message history, reading from and updating the history on each call. A chat message history is a sequence of messages that represents a conversation, and a Runnable is a unit of work that can be invoked, batched, streamed, transformed, and composed. Finally, the application uses the StreamlitChatMessageHistory class, from the langchain_community.chat_message_histories module, which stores the chat message history in the Streamlit session state.
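Stripped of the Streamlit UI, the underlying LangChain pattern is small. The sketch below shows the same prompt-plus-history composition using an in-memory history instead of Streamlit session state; it is a minimal illustration rather than the application’s code, and it assumes Ollama is running locally with the llama3.2 model already pulled:

```python
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_ollama import ChatOllama

# Prompt template with a placeholder where prior turns are injected
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a friendly AI chatbot having a conversation with a human."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
])

chain = prompt | ChatOllama(model="llama3.2:latest", temperature=0.5)

# One history object per session id; here, a simple in-memory store
store: dict[str, InMemoryChatMessageHistory] = {}

def get_history(session_id: str) -> InMemoryChatMessageHistory:
    return store.setdefault(session_id, InMemoryChatMessageHistory())

chat = RunnableWithMessageHistory(
    chain,
    get_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

config = {"configurable": {"session_id": "demo"}}
print(chat.invoke({"input": "When was Meta founded?"}, config).content)
print(chat.invoke({"input": "How old is its founder?"}, config).content)  # uses prior context
```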
Required Python Packages
We also need a requirements.txt file to install the required Python packages:
```text
langchain
langchain-community
langchain-core
langchain-ollama
streamlit
watchdog # optional
```
Lastly, best practices dictate that we use a Python virtual environment to run the Streamlit application. Here are the commands necessary to create the environment and install the required packages into it:
```sh
python3 --version # I am running Python 3.12.2
python3 -m venv ollama_ui
source ollama_ui/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt --upgrade
```
Starting the Application
We are ready to run our Streamlit application, which will access Ollama using LangChain and perform inference on the Llama 3.2 1B or 3B models:
```sh
streamlit run app.py
```
Alternatively, if you prefer something slightly more visually appealing, we can configure Streamlit to minimize its additional UI features, apply the light or dark mode theme, and use Meta’s official blue logo color for the UI’s primary color:
```sh
streamlit run app.py \
    --server.runOnSave true \
    --theme.base "dark" \
    --theme.primaryColor "#0081FB" \
    --ui.hideTopBar "true" \
    --client.toolbarMode "minimal"
```
The Streamlit application should start locally on http://localhost:8501 and automatically open in your default web browser.
Using the Application
In addition to the prompt, the application accepts inference parameters in the sidebar, including the system role prompt, model, seed, temperature, top_p, and maximum response tokens (aka num_predict). Play around with different parameters and compare the results. The application also calculates metrics, including input tokens, output tokens, total tokens, total inference duration in seconds, and response tokens/second.
Prompt Example 1: Meta Conversation
Test the application using the previous prompts, starting with “When was Meta founded?”. Since this is a chat interface with conversational memory, try some follow-up questions that require the previous chat history as context to answer correctly, like “How old is its founder?” and then “What is their estimated net worth?”
Switching back to the terminal window where you ran the command to start the application, you should now see log output showing the user prompt and the response from the model.
Prompt Example 2: Speech Excerpt
Next, let’s try improving the grammar of an excerpt of a speech by Barack Obama (all prompts are in the README file):
Improve the grammar of the following speech excerpt. Explain what has changed and why: There’s not a liberal America and a conservative America; there’s the United States of America. There’s not a Black America and white America and Latino America and Asian America; there’s the United States of America. We are one people, all of us pledging allegiance to the stars and stripes, all of us defending the United States of America. In the end, that’s what this election is about. Do we participate in a politics of cynicism, or do we participate in a politics of hope?
We could then follow up with the question, “Describe the speech excerpt’s sentiment.” The 3B model does an excellent job of analyzing sentiment.
Prompt Example 3: The Three Little Pigs
For a third example, let’s analyze the story of The Three Little Pigs using the 3B model. We will extract the main characters and character types from the story and return them in JSON format. We will also ask the model to explain its choices. Llama 3.2 3B can output a formatted response. We will use this prompt (all prompts are in the README file):
```
Analyze the following children’s story. Identify all the characters and their corresponding character types from the list below. Explain why you have chosen a particular character type. Output the characters and their corresponding character types in JSON format, which adheres to the following structure:

### FORMAT ###
{
  "characters": [
    {"character": "character A", "character_type": "type 1"},
    {"character": "character B", "character_type": "type 2"},
    {"character": "character C", "character_type": "type 3"}
  ]
}

### CHARACTER TYPES ###
- Antagonist
- Antihero
- Confidant
- Contagonist
- Deuteragonist
- Foil
- Guide
- Henchmen
- Love Interest
- Protagonist
- Temptress

### STORY ###
Once upon a time, an old mother pig had three piglets. Unfortunately, she didn’t have enough food to keep them, so she sent them out to seek their own luck.
...
```
Often, the first response contains the JSON and the explanation of character-type choices, but the three men in the story who supplied the materials to the pigs are missing, and the JSON is not pretty-printed. We can follow up with two additional prompts, “What about the three men in the story?” and then “Format the JSON with markdown tags for code.”
Prompt Example 4: Multilingual Geography
According to the Llama 3.2 1B/3B model card, eight languages are officially supported: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Llama 3.2 has been trained on a broader collection of languages than these eight. Developers may fine-tune Llama 3.2 models for additional languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy.
For this example, let’s ask the model to respond to the following geography questions: “What is the tallest peak in Austria?”, “What are three famous landmarks in Paris?”, and “What is the largest temple in Thailand?” We would like the model to respond in the dominant native language of the country or region based on the geographic context of the user’s prompt, assuming the model was trained in those languages. To achieve this, we will use the system role to provide context to the model. According to Meta, the system role “sets the context in which to interact with the AI model. It typically includes rules, guidelines, or necessary information that help the model respond effectively.” Here is the system role prompt for this example (all prompts are in the README file):
You are an expert in geography and linguistics. Based on the geographic context of the user’s prompt, you respond in the dominant native language of the country or region.
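To sanity-check this behavior outside the chat application, you can invoke ChatOllama directly with the same system prompt. The following is a minimal sketch; the model name and temperature are assumed from the application above, and it expects Ollama to be running locally:

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2:latest", temperature=0.5)

# Same system role prompt used in the Streamlit app's sidebar for this example
system = SystemMessage(
    "You are an expert in geography and linguistics. Based on the geographic "
    "context of the user's prompt, you respond in the dominant native language "
    "of the country or region."
)

for question in [
    "What is the tallest peak in Austria?",
    "What are three famous landmarks in Paris?",
    "What is the largest temple in Thailand?",
]:
    response = llm.invoke([system, HumanMessage(question)])
    print(f"\n{question}\n{response.content}")
```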
Prompt Example 5: Code Generation / Data Extraction
For the final example, we will use Llama 3.2 3B to write Python scripts to extract data from a CSV file. We will then ask the same model to refactor and improve its own generated code. Using this two-step method typically results in better-quality code. Once again, we will use the system role to provide context to the model (all prompts are in the README file):
You are an expert programmer who writes Python 3 code in a Pythonic style. Pythonic refers to an approach to Python programming that embraces the idioms and practices considered natural or idiomatic in the Python programming language. It embodies the philosophy and best practices that lead to clear, concise, and readable code. Pythonic code is also performant, resilient, efficiently catches specific exceptions, and uses the latest Python 3 features. Important: You should always optimize code for performance over the use of convenience libraries and use Python functions to separate functional concerns, including a main() function.
The user prompt describes the task we want to solve by generating the Python script (all prompts are in the README file). According to Meta, the user role “represents the human interacting with the model. It includes the inputs, commands, and questions to the model.” Note that we are also asking the model to explain its decisions. This allows us to learn from the model, as well as confirm the model’s understanding of the instructions we provided.
```
Write a Python 3 script to extract all values from the 'First Name' column of a CSV file as a Python list of dictionary objects containing the values as strings ('names'), sorted in ascending order, along with the count of each unique value ('count'). Do not repeat any values. Require a command-line argument for the 'path' to CSV file. Output the results as Name: {name}, Count: {count}, sorted in descending order by counts and secondarily, in ascending order by name. Explain your decisions.

Below is a sample of that CSV file's header row:

Index,Customer Id,First Name,Last Name,Company,City,Country,Phone 1,Phone 2,Email,Subscription Date,Website
```
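For comparison, here is a minimal hand-written sketch that satisfies the prompt’s requirements. It is illustrative only and is not the model’s actual output; the column name and output format come straight from the prompt:

```python
import argparse
import csv
from collections import Counter


def extract_first_names(path: str) -> list[dict[str, object]]:
    """Return unique 'First Name' values with counts, sorted ascending by name."""
    with open(path, newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        counts = Counter(row["First Name"] for row in reader)
    return [{"names": name, "count": counts[name]} for name in sorted(counts)]


def main() -> None:
    parser = argparse.ArgumentParser(description="Count unique first names in a CSV file.")
    parser.add_argument("path", help="Path to the CSV file")
    args = parser.parse_args()

    names = extract_first_names(args.path)
    # Sort descending by count, then ascending by name
    for item in sorted(names, key=lambda x: (-x["count"], x["names"])):
        print(f"Name: {item['names']}, Count: {item['count']}")


if __name__ == "__main__":
    main()
```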
Let’s see if we can improve and optimize the model’s generated Python code. We will use a new user role prompt while keeping the same system role prompt (all prompts are in the README file):
Refactor the code to adhere to PEP 8 guidelines and optimize it for performance, taking into account any existing constraints or requirements.
The model’s resulting script correctly extracted the list of unique first names from 100K records in the CSV file, in ascending order and without duplicates, and output the results in the requested format and order, all in 0.261 seconds.
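The timing line at the end of the output below is produced by the shell’s time keyword (shown here in zsh’s format); a hypothetical invocation, with the script and CSV file names assumed purely for illustration, would be:

```sh
time python3 extract_first_names.py customers-100k.csv
```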
```text
Name: Joan, Count: 183
Name: Audrey, Count: 182
Name: Bridget, Count: 182
Name: Anne, Count: 180
Name: Melinda, Count: 177
...
Name: Jay, Count: 115
Name: George, Count: 114
Name: Jessica, Count: 114
Name: Tanner, Count: 114

0.23s user 0.01s system 94% cpu 0.261 total
```
GPU Performance
You can monitor GPU performance on your Mac using several tools, including asitop.
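asitop is a Python-based, terminal UI performance monitor for Apple Silicon Macs. A typical install-and-run sequence looks like the following; it assumes pip is available, and asitop relies on macOS’s built-in powermetrics utility, which requires administrator privileges:

```sh
pip install asitop
sudo asitop
```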
Another option for monitoring your Mac’s GPU performance is Mx Power Gadget.
Conclusion
In this brief post, we saw how easy it is to get started locally with Meta’s latest Llama 3.2 1B and 3B LLMs using a combination of Ollama, LangChain, and Streamlit. Install Ollama and download this post’s sample application from GitHub to start today.