# Ollama — Local Model Inference

`OllamaDeveloperKit` lets you run any open-source LLM that Ollama supports on your own hardware, with no API key and no data leaving your machine. It connects to a locally-running [Ollama](https://ollama.com/) server and exposes the same six-method interface as every other RactoGateway kit.

## Installation

```bash
# 1. Install Ollama → https://ollama.com/download

# 2. Pull any model
ollama pull llama3.2          # 2 GB — great for everyday tasks
ollama pull mistral           # 4 GB — excellent instruction following
ollama pull qwen2.5:7b        # 4.5 GB — strong multilingual model
ollama pull nomic-embed-text  # lightweight embeddings model

# 3. Install the Python extra (quoted so shells such as zsh do not expand the brackets)
pip install "ractogateway[ollama]"
```

Ollama starts automatically on most platforms. If not, run:

```bash
ollama serve
```

## Quick Start

```python
from ractogateway import ollama_developer_kit as local, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly and concisely.",
    constraints=["Stay on topic.", "Do not hallucinate."],
    tone="Friendly",
    output_format="text",
)

# No API key — Ollama listens at http://localhost:11434 by default
kit = local.Chat(model="llama3.2", default_prompt=prompt)

response = kit.chat(local.ChatConfig(user_message="What is a transformer model?"))
print(response.content)
```

## Constructor Parameters

```{eval-rst}
.. autoclass:: ractogateway.ollama_developer_kit.kit.OllamaDeveloperKit
   :members:
   :no-index:
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `"llama3.2"` | Model name from `ollama list` |
| `base_url` | `str` | `"http://localhost:11434"` | Ollama server base URL |
| `embedding_model` | `str` | `"nomic-embed-text"` | Default embedding model |
| `default_prompt` | `RactoPrompt \| None` | `None` | Kit-level default prompt |
| `exact_cache` | `ExactMatchCache \| None` | `None` | In-process exact-match cache |
| `semantic_cache` | `SemanticCache \| None` | `None` | Cosine-similarity cache |
| `router` | `CostAwareRouter \| None` | `None` | Required when `model="auto"` |
| `truncator` | `TokenTruncator \| None` | `None` | Auto-trim long histories |
| `tracer` | `RactoTracer \| None` | `None` | OpenTelemetry spans |
| `metrics` | `GatewayMetricsMiddleware \| None` | `None` | Prometheus metrics |

## Streaming

```python
for chunk in kit.stream(local.ChatConfig(user_message="Write a haiku about Python.")):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")
```

## Async

```python
import asyncio

async def main() -> None:
    response = await kit.achat(local.ChatConfig(user_message="Explain async/await."))
    print(response.content)

    async for chunk in kit.astream(local.ChatConfig(user_message="Count to five.")):
        print(chunk.delta.text, end="", flush=True)

asyncio.run(main())
```

## Embeddings

Ollama requires a dedicated embedding model.
Pull it first:

```bash
ollama pull nomic-embed-text
```

```python
embed_kit = local.Chat(
    model="llama3.2",
    embedding_model="nomic-embed-text",
    default_prompt=prompt,
)

resp = embed_kit.embed(local.EmbeddingConfig(texts=["hello world", "goodbye world"]))
for vec in resp.vectors:
    print(f"[{vec.index}] '{vec.text}' — dim={len(vec.embedding)}")
```

## Vision Models (Image Input)

Ollama supports multimodal / vision models such as ``llava``, ``llava-llama3``, and ``minicpm-v``. Pass image files via ``ChatConfig.attachments``:

```python
from pathlib import Path

from ractogateway.prompts.engine import RactoFile

# Load an image
img = RactoFile.from_path("/tmp/photo.jpg")

# Or from raw bytes (read_bytes() closes the file for you)
img = RactoFile.from_bytes(Path("photo.jpg").read_bytes(), "image/jpeg")

kit = local.Chat(model="llava", default_prompt=prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Describe what you see in this image.",
        attachments=[img],
    )
)
print(response.content)
```

Pull a vision model first:

```bash
ollama pull llava          # 4.5 GB — general vision model
ollama pull llava-llama3   # 5 GB — Llama 3 backbone
ollama pull minicpm-v      # 5.5 GB — strong at charts / documents
```

## Tool Calling

Ollama supports function calling on models that were trained with tool support (e.g. `llama3.1`, `llama3.2`, `mistral-nemo`).

```python
from ractogateway import ToolRegistry, tool

@tool
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"

registry = ToolRegistry([get_weather])

response = kit.chat(
    local.ChatConfig(
        user_message="What's the weather in Paris?",
        tools=registry,
        auto_execute_tools=True,
    )
)
print(response.content)
```

## Embedded Server Management

RactoGateway can start and stop Ollama for you so you don't need to run ``ollama serve`` separately. This is especially useful when:

* You need a **custom port** (e.g. to avoid conflicts with an existing server).
* You want **programmatic lifecycle control** inside tests or long-running services.
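
For the custom-port case, it can help to check which ports are available before choosing one. The following is a minimal standard-library sketch; `find_free_port` and `is_port_in_use` are hypothetical helpers written for illustration and are not part of RactoGateway.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is currently unused on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS pick any free ephemeral port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    port = find_free_port()
    print(f"free port: {port}, default port busy: {is_port_in_use(11434)}")
```

A port found this way could then be passed as the `port` argument of the server manager. There is a small window between finding the port and starting the server in which another process could claim it, so treat the result as a hint rather than a guarantee.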
### How It Works

``OllamaServerManager`` launches an ``ollama serve`` subprocess and configures it to listen on the port you choose via the ``OLLAMA_HOST`` environment variable. It registers an ``atexit`` handler so the subprocess is stopped when the interpreter exits normally, including after an unhandled exception. ``atexit`` handlers do not run if the interpreter itself is killed by a signal or exits via ``os._exit()``, so a hard crash can still leave the server running.

### Context Manager (Recommended)

```python
from ractogateway import ollama_developer_kit as local

with local.OllamaServerManager(port=11500) as srv:
    # srv.base_url == "http://127.0.0.1:11500"
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
    response = kit.chat(local.ChatConfig(user_message="Hello!"))
    print(response.content)
# Server is automatically stopped here
```

### Manual Start / Stop

```python
srv = local.OllamaServerManager(port=11500)
srv.start()  # blocks until the server is ready (default timeout: 30 s)

kit = local.Chat(model="llama3.2", base_url=srv.base_url)
print(kit.chat(local.ChatConfig(user_message="What is 2+2?")).content)

srv.stop()  # graceful SIGTERM → SIGKILL if needed
```

### Pull Models Programmatically

```python
with local.OllamaServerManager(port=11500) as srv:
    srv.pull("llama3.2")       # equivalent to: ollama pull llama3.2
    print(srv.list_models())   # ['llama3.2:latest', ...]
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
```

### Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `host` | `str` | `"127.0.0.1"` | Bind address |
| `port` | `int` | `11434` | TCP port for the REST API |
| `startup_timeout` | `float` | `30.0` | Seconds to wait for readiness |
| `ollama_bin` | `str` | `"ollama"` | Path to the Ollama binary |

## Pointing at a Remote Ollama Server

If Ollama runs on another machine (e.g.
a GPU box), pass its address:

```python
kit = local.Chat(
    model="llama3.2",
    base_url="http://192.168.1.42:11434",
    default_prompt=prompt,
)
```

## Validated JSON Output

```python
from pydantic import BaseModel

class Summary(BaseModel):
    key_points: list[str]
    sentiment: str

typed_prompt = RactoPrompt(
    role="You are a text analyser.",
    aim="Summarise the text.",
    constraints=["Return only the JSON."],
    tone="Neutral",
    output_format=Summary,
)

kit = local.Chat(model="llama3.2", default_prompt=typed_prompt)
response = kit.chat(
    local.ChatConfig(
        user_message="Python is great for AI and scripting.",
        response_model=Summary,
    )
)
print(response.parsed)  # validated Summary instance dict
```

## Recommended Models

| Use case | Model | Size |
| --- | --- | --- |
| General chat | `llama3.2` | 2 GB |
| Code generation | `deepseek-coder-v2` | 9 GB |
| Long context | `llama3.1:8b` | 4.7 GB |
| Multilingual | `qwen2.5:7b` | 4.5 GB |
| Instruction following | `mistral` | 4 GB |
| Embeddings | `nomic-embed-text` | 274 MB |
| Small + fast | `phi3.5` | 2.2 GB |

Browse all models at .
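
If you want to choose a model in code, the recommendations in the table can be encoded as a small lookup. This is only an illustrative sketch: `RECOMMENDED_MODELS`, `model_for`, and `models_within` are hypothetical names, not part of RactoGateway, and the sizes are simply copied from the table (with 274 MB written as 0.274 GB).

```python
# Use case -> (model name, approximate download size in GB), copied from the table.
RECOMMENDED_MODELS: dict[str, tuple[str, float]] = {
    "general chat": ("llama3.2", 2.0),
    "code generation": ("deepseek-coder-v2", 9.0),
    "long context": ("llama3.1:8b", 4.7),
    "multilingual": ("qwen2.5:7b", 4.5),
    "instruction following": ("mistral", 4.0),
    "embeddings": ("nomic-embed-text", 0.274),
    "small + fast": ("phi3.5", 2.2),
}


def model_for(use_case: str) -> str:
    """Return the recommended model name for a use case (case-insensitive)."""
    return RECOMMENDED_MODELS[use_case.lower()][0]


def models_within(max_gb: float) -> list[str]:
    """Return the recommended models no larger than `max_gb`, smallest first."""
    fitting = [(size, name) for name, size in RECOMMENDED_MODELS.values() if size <= max_gb]
    return [name for _, name in sorted(fitting)]


if __name__ == "__main__":
    print(model_for("Multilingual"))
    print(models_within(2.5))
```

The returned name can be passed as the `model` argument when constructing a kit, after pulling it with `ollama pull`.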