ractogateway.ollama_developer_kit.kit

Ollama Developer Kit — production-grade local model interface.

Usage:

from ractogateway import ollama_developer_kit as local

kit = local.OllamaDeveloperKit(model="llama3.2", default_prompt=my_prompt)
response = kit.chat(local.ChatConfig(user_message="Hello"))

for chunk in kit.stream(local.ChatConfig(user_message="Hello")):
    print(chunk.delta.text, end="", flush=True)

No API key is needed. Start the Ollama server and pull a model first:

ollama serve          # starts server at http://localhost:11434
ollama pull llama3.2  # download the model

class ractogateway.ollama_developer_kit.kit.OllamaDeveloperKit(model='llama3.2', *, base_url='http://localhost:11434', embedding_model='nomic-embed-text', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)

Bases: object

Complete Ollama local-model developer kit — chat, stream, embeddings, and optional performance/cost optimisation middleware.

Connects to a locally-running Ollama server. No API key required.
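
Because the kit depends on a running server, it can help to confirm that the server is reachable before constructing it. The following is a minimal sketch, not part of ractogateway: the helper names tags_url and list_local_models are hypothetical, and the only external assumption is Ollama's GET /api/tags endpoint, which lists the models that have been pulled locally.

```python
import http.client
import json
import urllib.request
from typing import List, Optional


def tags_url(base_url: str) -> str:
    """Build the URL of Ollama's model-listing endpoint from a base URL."""
    return base_url.rstrip("/") + "/api/tags"


def list_local_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[List[str]]:
    """Return the names of pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(tags_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a non-JSON body.
        return None
    return [entry.get("name") for entry in payload.get("models", [])]
```

A return value of None suggests that ollama serve is not running at that address; an empty list suggests the server is up but no model has been pulled yet.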

Parameters:
  • model (str) – Model name as reported by ollama list (e.g. "llama3.2", "mistral", "qwen2.5"). Use "auto" when a CostAwareRouter is provided; the router then selects the model for each request.

  • base_url (str) – Ollama server base URL. Defaults to http://localhost:11434.

  • embedding_model (str) – Default model for embedding calls. Defaults to "nomic-embed-text".

  • default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.

  • exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.

  • semantic_cache (SemanticCache | None) – Optional SemanticCache.

  • router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".

  • truncator (TokenTruncator | None) – Optional TokenTruncator.

  • tracer (RactoTracer | None) – Optional RactoTracer.

  • metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.

provider: str = 'ollama'
chat(config)

Synchronous chat completion with optional middleware pipeline.

Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.

Return type:

LLMResponse
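
The middleware order listed for chat can be illustrated with a small stand-alone sketch. This is not the kit's implementation: run_pipeline and all of its parameters are hypothetical, a plain dict stands in for the exact cache, a character limit stands in for token truncation, the semantic cache is left out, and a list of step names stands in for telemetry.

```python
from typing import Callable, Dict, List, Optional


def run_pipeline(
    message: str,
    call_model: Callable[[str, str], str],
    exact_cache: Dict[str, str],
    max_chars: int = 2000,
    pick_model: Optional[Callable[[str], str]] = None,
    default_model: str = "llama3.2",
    trace: Optional[List[str]] = None,
) -> str:
    """Run one request through a simplified version of the documented order."""
    trace = trace if trace is not None else []

    # 1. Truncate the input (characters here; the real kit works with tokens).
    prompt = message[:max_chars]
    trace.append("truncate")

    # 2. Exact cache lookup: an identical prompt returns the stored reply.
    trace.append("exact_cache_lookup")
    if prompt in exact_cache:
        trace.append("cache_hit")
        return exact_cache[prompt]

    # 3. Semantic cache lookup is omitted in this sketch.

    # 4. Route: let the hypothetical router pick a model, else use the default.
    model = pick_model(prompt) if pick_model else default_model
    trace.append(f"route:{model}")

    # 5. Call the model.
    reply = call_model(model, prompt)
    trace.append("model_call")

    # 6. Write the reply back to the cache for later identical prompts.
    exact_cache[prompt] = reply
    trace.append("cache_write")
    return reply
```

Calling run_pipeline twice with the same message and the same cache dict invokes call_model only once; the second call returns at the cache lookup, which is the saving the cache steps are meant to provide.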

async achat(config)

Async chat completion with optional middleware pipeline.

Return type:

LLMResponse
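
One reason to prefer the async variant is issuing several requests concurrently. The sketch below shows one way to do that with asyncio.gather and a concurrency limit. The names chat_many and fake_achat are hypothetical; fake_achat is a stand-in coroutine so the example runs without a server.

```python
import asyncio
from typing import Awaitable, Callable, List


async def chat_many(
    achat: Callable[[str], Awaitable[str]],
    messages: List[str],
    max_concurrency: int = 4,
) -> List[str]:
    """Send all messages concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(message: str) -> str:
        async with semaphore:
            return await achat(message)

    # gather returns results in the same order as the input messages.
    return await asyncio.gather(*(one(m) for m in messages))


async def fake_achat(message: str) -> str:
    """Stand-in for a real async chat call; it upper-cases the message."""
    await asyncio.sleep(0.01)
    return message.upper()
```

With a real kit, the achat argument would presumably be a small wrapper such as lambda m: kit.achat(local.ChatConfig(user_message=m)), returning LLMResponse objects rather than strings. A local server usually processes only a few requests in parallel, so a low max_concurrency is a reasonable starting point.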

stream(config)

Synchronous streaming — yields StreamChunk objects.

Example:

for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")

Return type:

Iterator[StreamChunk]
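
When the full text is needed as well as the incremental output, the chunks can be accumulated while iterating. The sketch below uses stand-in classes so it runs without a server: Delta, Chunk, collect_stream and fake_stream are hypothetical, and only the attribute names delta.text, is_final and usage come from the example above. Representing usage as a dict is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Delta:
    text: str


@dataclass
class Chunk:
    """Stand-in with the attributes used in the stream example."""
    delta: Delta
    is_final: bool = False
    usage: Optional[dict] = None  # assumption: the real usage type may differ


def collect_stream(chunks: Iterable[Chunk]) -> Tuple[str, Optional[dict]]:
    """Join all text deltas and return them with the final chunk's usage."""
    parts = []
    usage = None
    for chunk in chunks:
        parts.append(chunk.delta.text)
        if chunk.is_final:
            usage = chunk.usage
    return "".join(parts), usage


def fake_stream() -> Iterator[Chunk]:
    """Stand-in generator that mimics a short streamed reply."""
    yield Chunk(Delta("Hel"))
    yield Chunk(Delta("lo"))
    yield Chunk(Delta(""), is_final=True, usage={"total_tokens": 2})
```

Passing kit.stream(config) in place of fake_stream() should work the same way, provided the real chunks expose those three attributes.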

async astream(config)

Async streaming — yields StreamChunk objects.

Return type:

AsyncIterator[StreamChunk]
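
An AsyncIterator is consumed with async for inside a coroutine. The following sketch shows that pattern with a stand-in async generator; fake_astream and collect_astream are hypothetical names, and the chunk attributes mirror those used in the synchronous stream example.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator


async def fake_astream() -> AsyncIterator[SimpleNamespace]:
    """Stand-in async generator that mimics a short streamed reply."""
    for text in ("Hel", "lo"):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield SimpleNamespace(delta=SimpleNamespace(text=text), is_final=False, usage=None)
    yield SimpleNamespace(delta=SimpleNamespace(text=""), is_final=True, usage={"total_tokens": 2})


async def collect_astream(chunks: AsyncIterator[SimpleNamespace]) -> str:
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk.delta.text)
    return "".join(parts)
```

From synchronous code the coroutine can be driven with asyncio.run(collect_astream(fake_astream())); with a real kit, kit.astream(config) would take the place of fake_astream().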

embed(config)

Synchronous embedding via Ollama’s embed API.

Example:

resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])

Return type:

EmbeddingResponse
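
A common next step with embeddings is comparing two texts by cosine similarity. The helper below is a plain-Python sketch that is independent of the kit; cosine_similarity is not a ractogateway function.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; report 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)
```

Given the response from the example above, cosine_similarity(resp.vectors[0].embedding, resp.vectors[1].embedding) would score how close "hello" and "world" are in the embedding space, assuming embedding is a sequence of floats.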

async aembed(config)

Async embedding via Ollama’s embed API.

Return type:

EmbeddingResponse