Ollama — Local Model Inference

OllamaDeveloperKit lets you run any open-source LLM on your own hardware with zero API key and zero data leaving your machine. It connects to a locally-running Ollama server and exposes the same six-method interface as every other RactoGateway kit.

Installation

# 1. Install Ollama  →  https://ollama.com/download
# 2. Pull any model
ollama pull llama3.2          # 2 GB — great for everyday tasks
ollama pull mistral           # 4 GB — excellent instruction following
ollama pull qwen2.5:7b        # 4.5 GB — strong multilingual model
ollama pull nomic-embed-text  # lightweight embeddings model

# 3. Install the Python extra
pip install ractogateway[ollama]

Ollama starts automatically on most platforms. If not, run:

ollama serve

Quick Start

from ractogateway import ollama_developer_kit as local, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly and concisely.",
    constraints=["Stay on topic.", "Do not hallucinate."],
    tone="Friendly",
    output_format="text",
)

# No API key — Ollama listens at http://localhost:11434 by default
kit = local.Chat(model="llama3.2", default_prompt=prompt)

response = kit.chat(local.ChatConfig(user_message="What is a transformer model?"))
print(response.content)

Constructor Parameters

class ractogateway.ollama_developer_kit.kit.OllamaDeveloperKit(model='llama3.2', *, base_url='http://localhost:11434', embedding_model='nomic-embed-text', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)[source]

Bases: object

Complete Ollama local-model developer kit — chat, stream, embeddings, and optional performance/cost optimisation middleware.

Connects to a locally-running Ollama server. No API key required.

Parameters:
  • model (str) – Model name as reported by ollama list (e.g. "llama3.2", "mistral", "qwen2.5"). Use "auto" when a CostAwareRouter is provided — the router will select the model per-request.

  • base_url (str) – Ollama server base URL. Defaults to http://localhost:11434.

  • embedding_model (str) – Default model for embedding calls. Defaults to "nomic-embed-text".

  • default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.

  • exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.

  • semantic_cache (SemanticCache | None) – Optional SemanticCache.

  • router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".

  • truncator (TokenTruncator | None) – Optional TokenTruncator.

  • tracer (RactoTracer | None) – Optional RactoTracer.

  • metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.

provider: str = 'ollama'
chat(config)[source]

Synchronous chat completion with optional middleware pipeline.

Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.

Return type:

LLMResponse

async achat(config)[source]

Async chat completion with optional middleware pipeline.

Return type:

LLMResponse

stream(config)[source]

Synchronous streaming — yields StreamChunk objects.

Example:

for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")
Return type:

Iterator[StreamChunk]

async astream(config)[source]

Async streaming — yields StreamChunk objects.

Return type:

AsyncIterator[StreamChunk]

embed(config)[source]

Synchronous embedding via Ollama’s embed API.

Example:

resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])
Return type:

EmbeddingResponse

async aembed(config)[source]

Async embedding via Ollama’s embed API.

Return type:

EmbeddingResponse

Parameter

Type

Default

Description

model

str

"llama3.2"

Model name from ollama list

base_url

str

"http://localhost:11434"

Ollama server base URL

embedding_model

str

"nomic-embed-text"

Default embedding model

default_prompt

RactoPrompt | None

None

Kit-level default prompt

exact_cache

ExactMatchCache | None

None

In-process exact-match cache

semantic_cache

SemanticCache | None

None

Cosine-similarity cache

router

CostAwareRouter | None

None

Required when model="auto"

truncator

TokenTruncator | None

None

Auto-trim long histories

tracer

RactoTracer | None

None

OpenTelemetry spans

metrics

GatewayMetricsMiddleware | None

None

Prometheus metrics

Streaming

for chunk in kit.stream(local.ChatConfig(user_message="Write a haiku about Python.")):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")

Async

import asyncio

async def main() -> None:
    response = await kit.achat(local.ChatConfig(user_message="Explain async/await."))
    print(response.content)

    async for chunk in kit.astream(local.ChatConfig(user_message="Count to five.")):
        print(chunk.delta.text, end="", flush=True)

asyncio.run(main())

Embeddings

Ollama requires a dedicated embedding model. Pull it first:

ollama pull nomic-embed-text
embed_kit = local.Chat(
    model="llama3.2",
    embedding_model="nomic-embed-text",
    default_prompt=prompt,
)

resp = embed_kit.embed(local.EmbeddingConfig(texts=["hello world", "goodbye world"]))
for vec in resp.vectors:
    print(f"[{vec.index}] '{vec.text}' — dim={len(vec.embedding)}")

Vision Models (Image Input)

Ollama supports multimodal / vision models such as llava, llava-llama3, and minicpm-v. Pass image files via ChatConfig.attachments:

from ractogateway.prompts.engine import RactoFile

# Load an image
img = RactoFile.from_path("/tmp/photo.jpg")

# Or from raw bytes
img = RactoFile.from_bytes(open("photo.jpg", "rb").read(), "image/jpeg")

kit = local.Chat(model="llava", default_prompt=prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Describe what you see in this image.",
        attachments=[img],
    )
)
print(response.content)

Pull a vision model first:

ollama pull llava          # 4.5 GB — general vision model
ollama pull llava-llama3   # 5 GB — Llama 3 backbone
ollama pull minicpm-v      # 5.5 GB — strong at charts / documents

Tool Calling

Ollama supports function calling on models that were trained with tool support (e.g. llama3.1, llama3.2, mistral-nemo).

from ractogateway import ToolRegistry, tool

@tool
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"

registry = ToolRegistry([get_weather])

response = kit.chat(
    local.ChatConfig(
        user_message="What's the weather in Paris?",
        tools=registry,
        auto_execute_tools=True,
    )
)
print(response.content)

Embedded Server Management

RactoGateway can start and stop Ollama for you so you don’t need to run ollama serve separately. This is especially useful when:

  • You need a custom port (e.g. to avoid conflicts with an existing server).

  • You want programmatic lifecycle control inside tests or long-running services.

How It Works

OllamaServerManager launches an ollama serve subprocess and configures it to listen on the port you choose via the OLLAMA_HOST environment variable. It registers an atexit handler so the process is always cleaned up — even if your program crashes.

Manual Start / Stop

srv = local.OllamaServerManager(port=11500)
srv.start()   # blocks until the server is ready (default timeout: 30 s)

kit = local.Chat(model="llama3.2", base_url=srv.base_url)
print(kit.chat(local.ChatConfig(user_message="What is 2+2?")).content)

srv.stop()    # graceful SIGTERM → SIGKILL if needed

Pull Models Programmatically

with local.OllamaServerManager(port=11500) as srv:
    srv.pull("llama3.2")                # equivalent to: ollama pull llama3.2
    print(srv.list_models())            # ['llama3.2:latest', ...]
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)

Constructor Parameters

Parameter

Type

Default

Description

host

str

"127.0.0.1"

Bind address

port

int

11434

TCP port for the REST API

startup_timeout

float

30.0

Seconds to wait for readiness

ollama_bin

str

"ollama"

Path to the Ollama binary

Pointing at a Remote Ollama Server

If Ollama runs on another machine (e.g. a GPU box), pass its address:

kit = local.Chat(
    model="llama3.2",
    base_url="http://192.168.1.42:11434",
    default_prompt=prompt,
)

Validated JSON Output

from pydantic import BaseModel

class Summary(BaseModel):
    key_points: list[str]
    sentiment: str

typed_prompt = RactoPrompt(
    role="You are a text analyser.",
    aim="Summarise the text.",
    constraints=["Return only the JSON."],
    tone="Neutral",
    output_format=Summary,
)

kit = local.Chat(model="llama3.2", default_prompt=typed_prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Python is great for AI and scripting.",
        response_model=Summary,
    )
)
print(response.parsed)   # validated Summary instance dict