Multimodal Input

And how you can use it with function calling

What it is

The multimodal input capabilities of a model enable it to understand more than just text: you can use image, video, and audio files in the conversation as well.

The Gemma 3 (27B parameter) model we used in the previous tutorials supports adding image files to the conversation. This tutorial walks through using this capability together with function calling, building on the simple expense tracker from the last tutorial.

Here's a peek at the conversation you'll be having by the end of this tutorial:

you

image of bill
Add this to my expenses, and let me know if I'm going over my food budget of 1000 this month.

model

You've spent a total of 1278.5 on food this month, which exceeds your budget of 1000 by 278.5. It looks like that dinner put you over the limit!

How to do it

Reading and encoding images

Making use of a model's multimodal capabilities is easy with Ollama. Recall the function we wrote to make requests to the Ollama REST API in the first tutorial:

from aiohttp import ClientSession, ClientTimeout

model = "gemma3:27b"

# Allow up to five minutes per request; large models can be slow to respond.
timeout = ClientTimeout(total=300)
session = ClientSession(timeout=timeout)

async def chat(session, messages, model, server="http://localhost:11434"):
    endpoint = f"{server}/api/chat"
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {
            "num_ctx": 8192,  # context window size, in tokens
            "top_p": 0.95,    # nucleus sampling threshold
        }
    }

    async with session.post(endpoint, json=payload) as response:
        response.raise_for_status()

        data = await response.json()
        content = data["message"]["content"]

        # Append the reply so the conversation history stays complete.
        messages.append({
            "role": "assistant", "content": content
        })

        return content

Each message in the messages array passed to the above function looks like this:

{
    "role": "user" or "assisstant",
    "content": "What is the answer to life, the universe and everything?"
}
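Putting this together, a minimal text-only exchange looks something like this (run from an async context, such as a notebook cell):

# Assumes an Ollama server is running locally and gemma3:27b has been pulled.
messages = [{
    "role": "user",
    "content": "What is the answer to life, the universe and everything?",
}]
print(await chat(session, messages, model))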

Ollama supports passing multiple images as an array of base64-encoded strings along with the text content. So our new message object should look like this:

{
    "role": "user",
    "content": "Tell me what is in this image.",
    "images": ["iVBORw0KGgoAAAANSUhEUgAAAe0AAAKACAY..."]
}

In this message, the string "iVBORw0KGgoAAAANSUhEUgAAAe0AAAKACAY..." contains the bytes of an image, encoded using base64.
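That leading "iVBORw0KGgo" is not random: it is the base64 encoding of the eight signature bytes that begin every PNG file. A one-liner to confirm:

from base64 import b64encode

# The PNG file signature, base64-encoded, yields the prefix seen above.
print(b64encode(b"\x89PNG\r\n\x1a\n").decode())  # iVBORw0KGgo=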

Let us write a function that reads an image from a path on disk and encodes it using base64, ready to pass to the API:

from base64 import b64encode

def encode_image(path: str) -> str:
    """Read an image from disk and return its base64-encoded contents."""
    try:
        with open(path, "rb") as file:
            return b64encode(file.read()).decode("utf-8")
    except FileNotFoundError:
        raise FileNotFoundError(f"Could not find file: {path}")
    except Exception as error:
        raise Exception(f"Failed to encode image {path}: {error}") from error

Now, we can add this to the existing chat function. Because the same messages list is sent again on every turn of the conversation, we encode only the entries that are still file paths on disk and leave already-encoded strings untouched:

+import os

 session = ClientSession(timeout=timeout)

 async def chat(session, messages, model, server="http://localhost:11434"):
+    # Encode any images that are still file paths on disk; leave strings
+    # that were already base64-encoded on a previous turn untouched.
+    for message in messages:
+        if "images" in message:
+            message["images"] = [
+                encode_image(image) if os.path.exists(image) else image
+                for image in message["images"]
+            ]

     endpoint = f"{server}/api/chat"
     payload = {
         "model": model,
Using the images with function calling

First, let us start a fresh conversation and provide the model with the prompt and the function specifications from the previous tutorial.

prompt = "You are a helpful assistant to me, the user..."
functions = "```function_spec\n{ \"name\": \"record_expense\"..."

messages = []
messages.append({ "role": "user", "content": prompt })
messages.append({ "role": "user", "content": functions })

Now that we can use images in the chat, let's take this picture of a restaurant bill and ask Gemma to add it to our expenses.

image = "media/bill.jpg"
task = "Add this bill to my expenses."

messages.append({ "role": "user", "content": task, "images": [image] })
response = await chat(session, messages, model)

print(response)

output

```function_call
{
    "id": "2",
    "function": "record_expense",
    "parameters": {
        "date": "2025-05-04",
        "category": "food",
        "amount": 588.0,
        "description": "dinner"
    }
}
```

Nice! The model extracted the relevant details (date, category, amount, description) from the image and used them to call the appropriate function. To parse and execute this function call, use the code from the previous tutorial.
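If you don't have that code handy, here is a minimal sketch of such a parser; the exact version from the previous tutorial may differ, and the helper name parse_function_call is just for illustration:

import json
import re

def parse_function_call(text: str):
    # Look for a ```function_call fenced block and parse its JSON body.
    # Returns None if the model replied with plain text instead.
    match = re.search(r"```function_call\s*(\{.*?\})\s*```", text, re.DOTALL)
    return json.loads(match.group(1)) if match else None

call = parse_function_call(response)
if call:
    print(call["function"], call["parameters"])

After parsing, you would execute record_expense with the extracted parameters, feed the result back into the conversation, and ask the budget question from the teaser at the top.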

Congratulations! You've completed your first conversation that uses multimodal input and function calling to accomplish a task, and all of it completely offline :)