Ollama — Local Model Inference
OllamaDeveloperKit lets you run open-source LLMs supported by Ollama on your own
hardware, with no API key and no data leaving your machine. It connects to a
locally running Ollama server and exposes the same six-method interface (chat,
achat, stream, astream, embed, aembed) as every other RactoGateway kit.
Installation
# 1. Install Ollama → https://ollama.com/download
# 2. Pull any model
ollama pull llama3.2 # 2 GB — great for everyday tasks
ollama pull mistral # 4 GB — excellent instruction following
ollama pull qwen2.5:7b # 4.5 GB — strong multilingual model
ollama pull nomic-embed-text # lightweight embeddings model
# 3. Install the Python extra
pip install ractogateway[ollama]
Ollama starts automatically on most platforms. If not, run:
ollama serve
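Before constructing a kit, it can help to confirm that the server is actually reachable. The sketch below is not part of RactoGateway: it uses only the standard library and assumes Ollama's REST endpoint GET /api/tags (which lists locally pulled models) at the default base URL. The helper name ollama_is_up is hypothetical.

```python
import json
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers GET /api/tags at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # the endpoint is expected to return a JSON document
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply: treat as "not up".
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```

If this prints False, start the server with ollama serve and try again.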
Quick Start
from ractogateway import ollama_developer_kit as local, RactoPrompt
prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly and concisely.",
    constraints=["Stay on topic.", "Do not hallucinate."],
    tone="Friendly",
    output_format="text",
)
# No API key — Ollama listens at http://localhost:11434 by default
kit = local.Chat(model="llama3.2", default_prompt=prompt)
response = kit.chat(local.ChatConfig(user_message="What is a transformer model?"))
print(response.content)
Constructor Parameters
- class ractogateway.ollama_developer_kit.kit.OllamaDeveloperKit(model='llama3.2', *, base_url='http://localhost:11434', embedding_model='nomic-embed-text', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)
Bases: object
Complete Ollama local-model developer kit: chat, stream, embeddings, and optional performance/cost optimisation middleware.
Connects to a locally running Ollama server. No API key required.
- Parameters:
  - model (str) – Model name as reported by ollama list (e.g. "llama3.2", "mistral", "qwen2.5"). Use "auto" when a CostAwareRouter is provided; the router will then select the model per request.
  - base_url (str) – Ollama server base URL. Defaults to http://localhost:11434.
  - embedding_model (str) – Default model for embedding calls. Defaults to "nomic-embed-text".
  - default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.
  - exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.
  - semantic_cache (SemanticCache | None) – Optional SemanticCache.
  - router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".
  - truncator (TokenTruncator | None) – Optional TokenTruncator.
  - tracer (RactoTracer | None) – Optional RactoTracer.
  - metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.
- provider: str = 'ollama'
- chat(config)
  Synchronous chat completion with optional middleware pipeline.
  Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.
  - Return type:
- async achat(config)
  Async chat completion with optional middleware pipeline.
  - Return type:
- stream(config)
  Synchronous streaming; yields StreamChunk objects.
  Example:
    for chunk in kit.stream(config):
        print(chunk.delta.text, end="", flush=True)
        if chunk.is_final:
            print(f"\nTokens: {chunk.usage}")
  - Return type: Iterator[StreamChunk]
- async astream(config)
  Async streaming; yields StreamChunk objects.
  - Return type: AsyncIterator[StreamChunk]
- embed(config)
  Synchronous embedding via Ollama’s embed API.
  Example:
    resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
    print(resp.vectors[0].embedding[:5])
  - Return type: EmbeddingResponse
- async aembed(config)
  Async embedding via Ollama’s embed API.
  - Return type: EmbeddingResponse
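| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "llama3.2" | Model name from ollama list |
| base_url | str | "http://localhost:11434" | Ollama server base URL |
| embedding_model | str | "nomic-embed-text" | Default embedding model |
| default_prompt | RactoPrompt or None | None | Kit-level default prompt |
| exact_cache | ExactMatchCache or None | None | In-process exact-match cache |
| semantic_cache | SemanticCache or None | None | Cosine-similarity cache |
| router | CostAwareRouter or None | None | Required when model="auto" |
| truncator | TokenTruncator or None | None | Auto-trim long histories |
| tracer | RactoTracer or None | None | OpenTelemetry spans |
| metrics | GatewayMetricsMiddleware or None | None | Prometheus metrics |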
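The middleware order documented for chat() (truncate, exact cache, semantic cache, route model, API call, write caches, record telemetry) can be pictured with a small stand-in pipeline. This is an illustrative sketch only, not RactoGateway's implementation: TinyPipeline, its dict-based exact cache, the length-based routing rule, and the call_model callable are all hypothetical, and the semantic cache step is reduced to a comment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TinyPipeline:
    """Hypothetical stand-in that mirrors the documented step order."""

    call_model: Callable[[str, str], str]  # (model, message) -> reply
    max_chars: int = 200                   # crude stand-in for token truncation
    cache: dict[str, str] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def chat(self, message: str) -> str:
        message = message[-self.max_chars:]                # 1. truncate
        self.log.append("truncate")
        if message in self.cache:                          # 2. exact cache lookup
            self.log.append("exact-cache hit")
            return self.cache[message]
        # 3. a semantic cache lookup would go here (omitted in this sketch)
        model = "small" if len(message) < 40 else "large"  # 4. route model
        self.log.append(f"route:{model}")
        reply = self.call_model(model, message)            # 5. API call
        self.log.append("api call")
        self.cache[message] = reply                        # 6. write caches
        self.log.append("telemetry")                       # 7. record telemetry
        return reply


pipeline = TinyPipeline(call_model=lambda model, msg: f"[{model}] echo: {msg}")
first = pipeline.chat("hello")   # goes through routing and the fake API call
second = pipeline.chat("hello")  # served from the exact-match cache
print(first, second == first, pipeline.log)
```

The point of the ordering is that the cheap checks (truncation, cache lookups) run before any model is selected or called, so a cache hit never reaches the server.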
Streaming
for chunk in kit.stream(local.ChatConfig(user_message="Write a haiku about Python.")):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")
Async
import asyncio
async def main() -> None:
    response = await kit.achat(local.ChatConfig(user_message="Explain async/await."))
    print(response.content)
    async for chunk in kit.astream(local.ChatConfig(user_message="Count to five.")):
        print(chunk.delta.text, end="", flush=True)

asyncio.run(main())
Embeddings
Ollama requires a dedicated embedding model. Pull it first:
ollama pull nomic-embed-text
embed_kit = local.Chat(
    model="llama3.2",
    embedding_model="nomic-embed-text",
    default_prompt=prompt,
)
resp = embed_kit.embed(local.EmbeddingConfig(texts=["hello world", "goodbye world"]))
for vec in resp.vectors:
    print(f"[{vec.index}] '{vec.text}' — dim={len(vec.embedding)}")
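Embedding vectors are usually compared with cosine similarity. The sketch below assumes each vec.embedding is a plain sequence of floats, as the len() call above suggests. cosine_similarity is a hypothetical helper written with the standard library only, and the short vectors are made-up stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for resp.vectors[i].embedding
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # unrelated directions
print(round(same, 6), orthogonal)
```

Values near 1 indicate texts the embedding model considers similar; values near 0 indicate unrelated texts.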
Vision Models (Image Input)
Ollama supports multimodal / vision models such as llava, llava-llama3,
and minicpm-v. Pass image files via ChatConfig.attachments:
from ractogateway.prompts.engine import RactoFile
# Load an image
img = RactoFile.from_path("/tmp/photo.jpg")
# Or from raw bytes (an alternative to from_path; use one of the two)
with open("photo.jpg", "rb") as fh:
    img = RactoFile.from_bytes(fh.read(), "image/jpeg")
kit = local.Chat(model="llava", default_prompt=prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Describe what you see in this image.",
        attachments=[img],
    )
)
print(response.content)
Pull a vision model first:
ollama pull llava # 4.5 GB — general vision model
ollama pull llava-llama3 # 5 GB — Llama 3 backbone
ollama pull minicpm-v # 5.5 GB — strong at charts / documents
Tool Calling
Ollama supports function calling on models that were trained with tool support
(e.g. llama3.1, llama3.2, mistral-nemo).
from ractogateway import ToolRegistry, tool
@tool
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"

registry = ToolRegistry([get_weather])
response = kit.chat(
    local.ChatConfig(
        user_message="What's the weather in Paris?",
        tools=registry,
        auto_execute_tools=True,
    )
)
print(response.content)
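Conceptually, automatic tool execution means the model's reply names a function and supplies arguments, and the caller looks the function up and invokes it. The following is a minimal sketch of that dispatch step under assumed shapes: MiniRegistry and the tool-call dict layout are hypothetical and do not describe how ToolRegistry works internally.

```python
import json
from typing import Any, Callable


class MiniRegistry:
    """Hypothetical name -> function lookup table."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        self._tools[fn.__name__] = fn
        return fn  # returning fn lets this method be used as a decorator

    def dispatch(self, tool_call: dict[str, Any]) -> Any:
        name = tool_call["name"]
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        args = tool_call["arguments"]
        if isinstance(args, str):  # accept arguments as a JSON string or as a dict
            args = json.loads(args)
        return self._tools[name](**args)


mini = MiniRegistry()


@mini.register
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"


# Assumed shape of a tool call emitted by the model
result = mini.dispatch({"name": "get_weather", "arguments": '{"city": "Paris"}'})
print(result)
```

In a real round trip the tool result would be sent back to the model so it can phrase the final answer.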
Embedded Server Management
RactoGateway can start and stop Ollama for you so you don’t need to run
ollama serve separately. This is especially useful when:
- You need a custom port (e.g. to avoid conflicts with an existing server).
- You want programmatic lifecycle control inside tests or long-running services.
How It Works
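OllamaServerManager launches an ollama serve subprocess and configures
it to listen on the port you choose via the OLLAMA_HOST environment
variable. It registers an atexit handler so the process is cleaned up when
the interpreter exits, including after an unhandled exception. Note that
atexit handlers do not run if the Python process itself is killed by an
unhandled signal or exits through os._exit().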
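The mechanism described above can be approximated with the standard library. This is a sketch under assumptions, not OllamaServerManager's source: ollama_env, launch_with_host, and _cleanup are hypothetical names, and the demo launches a harmless Python sleep process instead of ollama serve so that it runs without Ollama installed. OLLAMA_HOST takes a host:port value.

```python
import atexit
import os
import subprocess
import sys


def ollama_env(host: str, port: int) -> dict[str, str]:
    """Copy the current environment and point OLLAMA_HOST at host:port."""
    env = dict(os.environ)
    env["OLLAMA_HOST"] = f"{host}:{port}"
    return env


def _cleanup(proc: subprocess.Popen) -> None:
    """Stop the child if it is still running: terminate first, kill as a fallback."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def launch_with_host(cmd: list[str], host: str, port: int) -> subprocess.Popen:
    """Start cmd with OLLAMA_HOST set and register a cleanup hook for interpreter exit."""
    proc = subprocess.Popen(cmd, env=ollama_env(host, port))
    atexit.register(_cleanup, proc)
    return proc


# Stand-in for ["ollama", "serve"]: a child process that just sleeps.
demo_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
proc = launch_with_host(demo_cmd, "127.0.0.1", 11500)
print("child running:", proc.poll() is None)
_cleanup(proc)  # what the atexit hook would do at interpreter exit
print("child stopped:", proc.poll() is not None)
```

Calling the cleanup twice is harmless because it first checks whether the child is still alive.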
Context Manager (Recommended)
from ractogateway import ollama_developer_kit as local
with local.OllamaServerManager(port=11500) as srv:
    # srv.base_url == "http://127.0.0.1:11500"
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
    response = kit.chat(local.ChatConfig(user_message="Hello!"))
    print(response.content)
# Server is automatically stopped here
Manual Start / Stop
srv = local.OllamaServerManager(port=11500)
srv.start() # blocks until the server is ready (default timeout: 30 s)
kit = local.Chat(model="llama3.2", base_url=srv.base_url)
print(kit.chat(local.ChatConfig(user_message="What is 2+2?")).content)
srv.stop() # graceful SIGTERM → SIGKILL if needed
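Blocking until the server is ready usually comes down to polling the port until a deadline passes. The following sketch shows one way to do that with plain sockets. wait_until_listening is a hypothetical helper, not necessarily what start() does internally, and the demo opens a throwaway local listener in place of a real Ollama server.

```python
import socket
import time


def wait_until_listening(host: str, port: int, timeout: float = 30.0, interval: float = 0.1) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not accepting connections yet; retry
    return False


# Demo: a throwaway listener on an OS-assigned port stands in for the server.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

ready = wait_until_listening("127.0.0.1", port, timeout=2.0)
listener.close()
gone = wait_until_listening("127.0.0.1", port, timeout=0.5)
print(ready, gone)
```

Using time.monotonic() for the deadline keeps the wait correct even if the system clock is adjusted while polling.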
Pull Models Programmatically
with local.OllamaServerManager(port=11500) as srv:
    srv.pull("llama3.2")      # equivalent to: ollama pull llama3.2
    print(srv.list_models())  # ['llama3.2:latest', ...]
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | 127.0.0.1 | Bind address |
| port | | | TCP port for the REST API |
| | | 30 | Seconds to wait for readiness |
| | | | Path to the Ollama binary |
Pointing at a Remote Ollama Server
If Ollama runs on another machine (e.g. a GPU box), pass its address:
kit = local.Chat(
    model="llama3.2",
    base_url="http://192.168.1.42:11434",
    default_prompt=prompt,
)
Validated JSON Output
from pydantic import BaseModel
class Summary(BaseModel):
    key_points: list[str]
    sentiment: str

typed_prompt = RactoPrompt(
    role="You are a text analyser.",
    aim="Summarise the text.",
    constraints=["Return only the JSON."],
    tone="Neutral",
    output_format=Summary,
)
kit = local.Chat(model="llama3.2", default_prompt=typed_prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Python is great for AI and scripting.",
        response_model=Summary,
    )
)
print(response.parsed) # validated Summary instance dict
Recommended Models
| Use case | Model | Size |
|---|---|---|
| General chat | llama3.2 | 2 GB |
| Code generation | | 9 GB |
| Long context | | 4.7 GB |
| Multilingual | qwen2.5:7b | 4.5 GB |
| Instruction following | mistral | 4 GB |
| Embeddings | nomic-embed-text | 274 MB |
| Small + fast | | 2.2 GB |
Browse all models at https://ollama.com/library.