ractogateway.ollama_developer_kit.kit
Ollama Developer Kit — production-grade local model interface.
Usage:
from ractogateway import ollama_developer_kit as local
kit = local.OllamaDeveloperKit(model="llama3.2", default_prompt=my_prompt)
response = kit.chat(local.ChatConfig(user_message="Hello"))
for chunk in kit.stream(local.ChatConfig(user_message="Hello")):
    print(chunk.delta.text, end="", flush=True)
No API key is needed. Start the Ollama server and pull a model first:
ollama serve # starts server at http://localhost:11434
ollama pull llama3.2 # download the model
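Before constructing the kit it can help to confirm that the server is actually answering. The following is a minimal sketch using only the standard library; it is not part of ractogateway. It assumes the Ollama REST endpoint /api/tags (which lists locally pulled models) is available at the base URL, and server_is_reachable is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def server_is_reachable(base_url: str = "http://localhost:11434",
                        timeout: float = 2.0) -> bool:
    """Return True if an HTTP 200 comes back from base_url + /api/tags."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A False result usually means ollama serve is not running or base_url points at the wrong host or port.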
- class ractogateway.ollama_developer_kit.kit.OllamaDeveloperKit(model='llama3.2', *, base_url='http://localhost:11434', embedding_model='nomic-embed-text', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)[source]
Bases: object
Complete Ollama local-model developer kit: chat, streaming, embeddings, and optional performance/cost optimisation middleware.
Connects to a locally-running Ollama server. No API key required.
- Parameters:
  - model (str) – Model name as reported by ollama list (e.g. "llama3.2", "mistral", "qwen2.5"). Use "auto" when a CostAwareRouter is provided; the router then selects the model per request.
  - base_url (str) – Ollama server base URL. Defaults to http://localhost:11434.
  - embedding_model (str) – Default model for embedding calls. Defaults to "nomic-embed-text".
  - default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.
  - exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.
  - semantic_cache (SemanticCache | None) – Optional SemanticCache.
  - router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".
  - truncator (TokenTruncator | None) – Optional TokenTruncator.
  - tracer (RactoTracer | None) – Optional RactoTracer.
  - metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.
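The rule that model="auto" needs a router can be illustrated with a small self-contained sketch. Everything here is a stand-in: Router, LengthRouter, pick and resolve_model are hypothetical names, the length-based routing rule is invented for illustration, and raising ValueError is an assumption about how a missing router might be reported, not documented behaviour of the kit.

```python
from typing import Optional, Protocol


class Router(Protocol):
    """Hypothetical routing interface; not the real CostAwareRouter API."""

    def pick(self, user_message: str) -> str: ...


class LengthRouter:
    """Toy router: short prompts go to a small model, long ones to a larger one."""

    def __init__(self, small: str, large: str, threshold: int = 200) -> None:
        self.small = small
        self.large = large
        self.threshold = threshold

    def pick(self, user_message: str) -> str:
        return self.small if len(user_message) < self.threshold else self.large


def resolve_model(model: str, router: Optional[Router], user_message: str) -> str:
    """Return the model to call, applying the documented "auto" rule."""
    if model != "auto":
        return model  # a fixed model name is used as given
    if router is None:
        # Assumed failure mode; the real kit may signal this differently.
        raise ValueError('model="auto" requires a router')
    return router.pick(user_message)
```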
- provider: str = 'ollama'
- chat(config)[source]
Synchronous chat completion with optional middleware pipeline.
Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.
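The order above can be sketched as a plain function. This is a toy model of the sequence only, not the kit's implementation: the caches are dictionaries, "semantic" matching is reduced to ignoring case and whitespace, routing is a made-up length rule, and run_pipeline, normalise and the model names are hypothetical. The source does not say what happens on a cache hit; the sketch simply returns early.

```python
from typing import Callable, Dict, List


def normalise(text: str) -> str:
    """Crude stand-in for semantic matching: ignore case and extra whitespace."""
    return " ".join(text.lower().split())


def run_pipeline(
    message: str,
    call_model: Callable[[str, str], str],
    exact: Dict[str, str],
    semantic: Dict[str, str],
    trace: List[str],
    max_chars: int = 2000,
) -> str:
    # 1. Truncate (by characters here; a real truncator would count tokens).
    message = message[:max_chars]
    trace.append("truncate")

    # 2. Exact cache: identical input returns the stored answer.
    trace.append("exact_cache")
    if message in exact:
        return exact[message]

    # 3. Semantic cache: near-identical input returns the stored answer.
    trace.append("semantic_cache")
    key = normalise(message)
    if key in semantic:
        return semantic[key]

    # 4. Route: pick a model (toy rule based on input length).
    model = "small-model" if len(message) < 200 else "large-model"
    trace.append("route:" + model)

    # 5. Call the model.
    answer = call_model(model, message)
    trace.append("api_call")

    # 6. Write both caches so later requests can skip the call.
    exact[message] = answer
    semantic[key] = answer
    trace.append("write_caches")

    # 7. Record telemetry (here: just a trace entry).
    trace.append("telemetry")
    return answer
```

Putting the cache lookups before routing and the model call means a hit costs no inference at all, which is the point of running them first.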
- Return type:
- async achat(config)[source]
Async chat completion with optional middleware pipeline.
- Return type:
- stream(config)[source]
Synchronous streaming; yields StreamChunk objects.
Example:
for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")
- Return type:
Iterator[StreamChunk]
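When the full text is needed rather than incremental printing, the chunks can be accumulated. The sketch below uses stand-in dataclasses that carry only the three attributes read in the example above (delta.text, is_final, usage); Delta, Chunk and collect are hypothetical names, and treating usage as a dictionary is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class Delta:
    text: str


@dataclass
class Chunk:
    """Stand-in with the attributes the streaming example reads."""
    delta: Delta
    is_final: bool = False
    usage: Optional[Dict[str, int]] = None  # assumed shape


def collect(chunks: Iterable[Chunk]) -> Tuple[str, Optional[Dict[str, int]]]:
    """Join streamed text and return it with the usage from the final chunk."""
    parts = []
    usage = None
    for chunk in chunks:
        parts.append(chunk.delta.text)
        if chunk.is_final:
            usage = chunk.usage
    return "".join(parts), usage
```

Because collect only relies on those three attributes, the same loop should work on the iterator returned by kit.stream(config) if the real objects expose them as the example suggests.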
- async astream(config)[source]
Async streaming; yields StreamChunk objects.
- Return type:
AsyncIterator[StreamChunk]
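An AsyncIterator is consumed with async for inside a coroutine. The sketch below shows that pattern with a stand-in generator; fake_astream and read_all are hypothetical names, and the stand-in yields plain strings rather than StreamChunk objects.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_astream(words: List[str]) -> AsyncIterator[str]:
    """Stand-in for kit.astream(config); yields plain strings."""
    for word in words:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def read_all(stream: AsyncIterator[str]) -> str:
    """Consume an async iterator with async for and join the pieces."""
    pieces = []
    async for piece in stream:
        pieces.append(piece)
    return "".join(pieces)


result = asyncio.run(read_all(fake_astream(["local ", "models ", "stream"])))
```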
- embed(config)[source]
Synchronous embedding via Ollama’s embed API.
Example:
resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])
- Return type:
EmbeddingResponse
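Embedding vectors are typically compared with cosine similarity. The helper below is a minimal standard-library sketch, not part of the kit; cosine_similarity is a hypothetical name, and it assumes each embedding is a flat sequence of floats, as the slice in the example above suggests.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (norm_a * norm_b)
```

With a response like the one above, the two texts could be compared as cosine_similarity(resp.vectors[0].embedding, resp.vectors[1].embedding).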
- async aembed(config)[source]
Async embedding via Ollama’s embed API.
- Return type:
EmbeddingResponse
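One use of an async embedding call is to issue several batches concurrently with asyncio.gather. The sketch below shows the pattern with a stand-in coroutine; embed_in_batches and fake_embed are hypothetical names, and whether a local Ollama server actually processes concurrent requests in parallel depends on its configuration.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

Vector = List[float]


async def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], Awaitable[List[Vector]]],
    batch_size: int = 2,
) -> List[Vector]:
    """Split texts into batches, embed them concurrently, keep input order."""
    batches = [list(texts[i:i + batch_size])
               for i in range(0, len(texts), batch_size)]
    # gather returns results in the order the awaitables were passed in.
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    return [vec for batch in results for vec in batch]


async def fake_embed(batch: List[str]) -> List[Vector]:
    """Stand-in for an async embedding call: one number per text."""
    await asyncio.sleep(0)
    return [[float(len(text))] for text in batch]
```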