Ollama — Local Model Inference

OllamaDeveloperKit lets you run any open-source LLM on your own hardware with zero API key and zero data leaving your machine. It connects to a locally-running Ollama server and exposes the same six-method interface as every other RactoGateway kit.

Installation

# 1. Install Ollama  →  https://ollama.com/download
# 2. Pull any model
ollama pull llama3.2          # 2 GB — great for everyday tasks
ollama pull mistral           # 4 GB — excellent instruction following
ollama pull qwen2.5:7b        # 4.5 GB — strong multilingual model
ollama pull nomic-embed-text  # lightweight embeddings model

# 3. Install the Python extra
pip install ractogateway[ollama]

Ollama starts automatically on most platforms. If not, run:

ollama serve

Quick Start

from ractogateway import ollama_developer_kit as local, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly and concisely.",
    constraints=["Stay on topic.", "Do not hallucinate."],
    tone="Friendly",
    output_format="text",
)

# No API key — Ollama listens at http://localhost:11434 by default
kit = local.Chat(model="llama3.2", default_prompt=prompt)

response = kit.chat(local.ChatConfig(user_message="What is a transformer model?"))
print(response.content)

Constructor Parameters

class ractogateway.ollama_developer_kit.kit.OllamaDeveloperKit(model='llama3.2', *, base_url='http://localhost:11434', embedding_model='nomic-embed-text', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)[source]

Bases: object

Complete Ollama local-model developer kit — chat, stream, embeddings, and optional performance/cost optimisation middleware.

Connects to a locally-running Ollama server. No API key required.

Parameters:

model (str) – Model name as reported by ollama list (e.g. "llama3.2", "mistral", "qwen2.5"). Use "auto" when a CostAwareRouter is provided — the router will select the model per-request.
base_url (str) – Ollama server base URL. Defaults to http://localhost:11434.
embedding_model (str) – Default model for embedding calls. Defaults to "nomic-embed-text".
default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.
exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.
semantic_cache (SemanticCache | None) – Optional SemanticCache.
router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".
truncator (TokenTruncator | None) – Optional TokenTruncator.
tracer (RactoTracer | None) – Optional RactoTracer.
metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.

provider: str = 'ollama'

chat(config)[source]

Synchronous chat completion with optional middleware pipeline.

Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.

Return type:: LLMResponse

async achat(config)[source]

Async chat completion with optional middleware pipeline.

Return type:: LLMResponse

stream(config)[source]

Synchronous streaming — yields StreamChunk objects.

Example:

for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")

Return type:: Iterator[StreamChunk]

async astream(config)[source]

Async streaming — yields StreamChunk objects.

Return type:: AsyncIterator[StreamChunk]

embed(config)[source]

Synchronous embedding via Ollama’s embed API.

Example:

resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])

Return type:: EmbeddingResponse

async aembed(config)[source]

Async embedding via Ollama’s embed API.

Return type:: EmbeddingResponse

Parameter	Type	Default	Description
`model`	`str`	`"llama3.2"`	Model name from `ollama list`
`base_url`	`str`	`"http://localhost:11434"`	Ollama server base URL
`embedding_model`	`str`	`"nomic-embed-text"`	Default embedding model
`default_prompt`	`RactoPrompt \| None`	`None`	Kit-level default prompt
`exact_cache`	`ExactMatchCache \| None`	`None`	In-process exact-match cache
`semantic_cache`	`SemanticCache \| None`	`None`	Cosine-similarity cache
`router`	`CostAwareRouter \| None`	`None`	Required when `model="auto"`
`truncator`	`TokenTruncator \| None`	`None`	Auto-trim long histories
`tracer`	`RactoTracer \| None`	`None`	OpenTelemetry spans
`metrics`	`GatewayMetricsMiddleware \| None`	`None`	Prometheus metrics

Streaming

for chunk in kit.stream(local.ChatConfig(user_message="Write a haiku about Python.")):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")

Async

import asyncio

async def main() -> None:
    response = await kit.achat(local.ChatConfig(user_message="Explain async/await."))
    print(response.content)

    async for chunk in kit.astream(local.ChatConfig(user_message="Count to five.")):
        print(chunk.delta.text, end="", flush=True)

asyncio.run(main())

Embeddings

Ollama requires a dedicated embedding model. Pull it first:

ollama pull nomic-embed-text

embed_kit = local.Chat(
    model="llama3.2",
    embedding_model="nomic-embed-text",
    default_prompt=prompt,
)

resp = embed_kit.embed(local.EmbeddingConfig(texts=["hello world", "goodbye world"]))
for vec in resp.vectors:
    print(f"[{vec.index}] '{vec.text}' — dim={len(vec.embedding)}")

Vision Models (Image Input)

Ollama supports multimodal / vision models such as llava, llava-llama3, and minicpm-v. Pass image files via ChatConfig.attachments:

from ractogateway.prompts.engine import RactoFile

# Load an image
img = RactoFile.from_path("/tmp/photo.jpg")

# Or from raw bytes
img = RactoFile.from_bytes(open("photo.jpg", "rb").read(), "image/jpeg")

kit = local.Chat(model="llava", default_prompt=prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Describe what you see in this image.",
        attachments=[img],
    )
)
print(response.content)

Pull a vision model first:

ollama pull llava          # 4.5 GB — general vision model
ollama pull llava-llama3   # 5 GB — Llama 3 backbone
ollama pull minicpm-v      # 5.5 GB — strong at charts / documents

Tool Calling

Ollama supports function calling on models that were trained with tool support (e.g. llama3.1, llama3.2, mistral-nemo).

from ractogateway import ToolRegistry, tool

@tool
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"

registry = ToolRegistry([get_weather])

response = kit.chat(
    local.ChatConfig(
        user_message="What's the weather in Paris?",
        tools=registry,
        auto_execute_tools=True,
    )
)
print(response.content)

Embedded Server Management

RactoGateway can start and stop Ollama for you so you don’t need to run ollama serve separately. This is especially useful when:

You need a custom port (e.g. to avoid conflicts with an existing server).
You want programmatic lifecycle control inside tests or long-running services.

How It Works

OllamaServerManager launches an ollama serve subprocess and configures it to listen on the port you choose via the OLLAMA_HOST environment variable. It registers an atexit handler so the process is always cleaned up — even if your program crashes.

Context Manager (Recommended)

from ractogateway import ollama_developer_kit as local

with local.OllamaServerManager(port=11500) as srv:
    # srv.base_url == "http://127.0.0.1:11500"
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
    response = kit.chat(local.ChatConfig(user_message="Hello!"))
    print(response.content)
# Server is automatically stopped here

Manual Start / Stop

srv = local.OllamaServerManager(port=11500)
srv.start()   # blocks until the server is ready (default timeout: 30 s)

kit = local.Chat(model="llama3.2", base_url=srv.base_url)
print(kit.chat(local.ChatConfig(user_message="What is 2+2?")).content)

srv.stop()    # graceful SIGTERM → SIGKILL if needed

Pull Models Programmatically

with local.OllamaServerManager(port=11500) as srv:
    srv.pull("llama3.2")                # equivalent to: ollama pull llama3.2
    print(srv.list_models())            # ['llama3.2:latest', ...]
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)

Constructor Parameters

Parameter	Type	Default	Description
`host`	`str`	`"127.0.0.1"`	Bind address
`port`	`int`	`11434`	TCP port for the REST API
`startup_timeout`	`float`	`30.0`	Seconds to wait for readiness
`ollama_bin`	`str`	`"ollama"`	Path to the Ollama binary

Pointing at a Remote Ollama Server

If Ollama runs on another machine (e.g. a GPU box), pass its address:

kit = local.Chat(
    model="llama3.2",
    base_url="http://192.168.1.42:11434",
    default_prompt=prompt,
)

Validated JSON Output

from pydantic import BaseModel

class Summary(BaseModel):
    key_points: list[str]
    sentiment: str

typed_prompt = RactoPrompt(
    role="You are a text analyser.",
    aim="Summarise the text.",
    constraints=["Return only the JSON."],
    tone="Neutral",
    output_format=Summary,
)

kit = local.Chat(model="llama3.2", default_prompt=typed_prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Python is great for AI and scripting.",
        response_model=Summary,
    )
)
print(response.parsed)   # validated Summary instance dict

Recommended Models

Use case	Model	Size
General chat	`llama3.2`	2 GB
Code generation	`deepseek-coder-v2`	9 GB
Long context	`llama3.1:8b`	4.7 GB
Multilingual	`qwen2.5:7b`	4.5 GB
Instruction following	`mistral`	4 GB
Embeddings	`nomic-embed-text`	274 MB
Small + fast	`phi3.5`	2.2 GB

Browse all models at https://ollama.com/library.