ractogateway.ollama_developer_kit
Ollama Developer Kit — from ractogateway import ollama_developer_kit as local.
Short usage:
from ractogateway import ollama_developer_kit as local
kit = local.Chat(model="llama3.2") # short alias
kit = local.OllamaDeveloperKit(model="llama3.2") # full name (same class)
No API key required — Ollama runs locally:
ollama serve
ollama pull llama3.2
- ractogateway.ollama_developer_kit.Chat
Short alias: local.Chat(model="llama3.2") is identical to local.OllamaDeveloperKit(...).
- class ractogateway.ollama_developer_kit.ChatConfig(**data)[source]
Bases: BaseModel
Validated input for every chat / achat / stream / astream call.
Pass a single ChatConfig to any developer-kit method. Every field has a safe default, so you only need to supply the fields you actually use.
Minimal example:
config = ChatConfig(user_message="Explain Python generators.")
response = kit.chat(config)
Vision / multimodal example:
from ractogateway.prompts.engine import RactoFile

config = ChatConfig(
    user_message="Describe this chart.",
    attachments=[RactoFile.from_path("sales_q4.png")],
)
Structured JSON output example:
from pydantic import BaseModel

class Sentiment(BaseModel):
    label: str
    score: float

config = ChatConfig(
    user_message="I love this library!",
    response_model=Sentiment,
)
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- user_message: str
- prompt: RactoPrompt | None
- temperature: float
- max_tokens: int
- tools: ToolRegistry | None
- auto_execute_tools: bool
- max_tool_turns: int
- max_validation_retries: int
- history: list[Message]
- chain_of_thought: bool
- native_thinking: bool
- thinking_budget: int
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class ractogateway.ollama_developer_kit.CostAwareRouter(tiers)[source]
Bases: object
Routes LLM requests to the appropriate model tier based on message complexity, without making any extra API calls.
- Parameters:
tiers (list[RoutingTier]) – Ordered list of RoutingTier objects, sorted ascending by max_score (cheapest first). The last tier’s max_score should be 100 to act as the fallback.
- Raises:
ValueError – If tiers is empty or not sorted ascending by max_score.
Example (3-tier OpenAI ladder):
from ractogateway.routing import CostAwareRouter, RoutingTier

router = CostAwareRouter([
    RoutingTier(model="gpt-4o-mini", max_score=30),
    RoutingTier(model="gpt-4o", max_score=70),
    RoutingTier(model="o3-mini", max_score=100),
])
model = router.route("What is 2+2?")  # → "gpt-4o-mini"
model = router.route(
    "Analyze the trade-offs between Redis Cluster and "
    "Cassandra for a write-heavy time-series workload …"
)  # → "o3-mini"
Example (binary routing, 2 tiers):
router = CostAwareRouter([
    RoutingTier(model="claude-haiku-4-5-20251001", max_score=40),
    RoutingTier(model="claude-opus-4-6", max_score=100),
])
- score(text)[source]
Compute a complexity score in [0, 100] for text.
A higher score means a more complex task.
- Return type:
Algorithm:
token_pts = min(len(text) // 4, SAT) * (MAX_TP / SAT)
kw_pts = min(matches * PPK, MAX_KP)
score = clamp(token_pts + kw_pts, 0, 100)
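As a rough illustration of the formula above, the sketch below implements the same three steps in plain Python. The constant values (SAT, MAX_TP, PPK, MAX_KP), the keyword list, and the way matches is counted are placeholders chosen for this example; the documentation names the constants but does not state the values the library uses.

```python
# Hypothetical constants: the documentation names them but gives no values.
SAT = 500        # approximate token count at which the length component saturates
MAX_TP = 60.0    # maximum points contributed by length
PPK = 10.0       # points per matched keyword
MAX_KP = 40.0    # maximum points contributed by keywords

# Hypothetical keyword list, for illustration only.
KEYWORDS = ("analyze", "compare", "trade-off", "optimize", "prove", "design")


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def score(text: str) -> float:
    """Complexity score in [0, 100]; higher means a more complex task."""
    lowered = text.lower()
    # len(text) // 4 is the approximate token count used in the formula.
    token_pts = min(len(text) // 4, SAT) * (MAX_TP / SAT)
    # Assumption: "matches" counts how many distinct keywords appear.
    matches = sum(1 for kw in KEYWORDS if kw in lowered)
    kw_pts = min(matches * PPK, MAX_KP)
    return clamp(token_pts + kw_pts, 0.0, 100.0)
```

With these placeholder values a short factual question scores close to 0, while a long prompt that contains several of the keywords saturates at 100.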
- route(text)[source]
Return the model identifier for text.
Walks tiers (cheapest first) and returns the first model whose max_score ≥ complexity_score. Always returns a model because the last tier has max_score == 100 (validated at construction).
Complexity: O(k), where k = number of tiers.
- Return type:
- property tiers: tuple[RoutingTier, ...]
Immutable view of the configured tiers.
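The tier walk described for route() can be sketched in a few lines. This is a minimal stand-in under stated assumptions, not the library's code: Tier, validate_tiers, and pick_model are hypothetical names, and the validation mirrors only the two ValueError conditions documented for the constructor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Stand-in for RoutingTier (hypothetical, not the library class)."""
    model: str
    max_score: float


def validate_tiers(tiers: list[Tier]) -> tuple[Tier, ...]:
    """Reject an empty or unsorted ladder and return an immutable tuple."""
    if not tiers:
        raise ValueError("tiers must not be empty")
    scores = [t.max_score for t in tiers]
    if scores != sorted(scores):
        raise ValueError("tiers must be sorted ascending by max_score")
    return tuple(tiers)


def pick_model(tiers: tuple[Tier, ...], complexity: float) -> str:
    """Return the first (cheapest) model whose max_score >= complexity."""
    for tier in tiers:
        if tier.max_score >= complexity:
            return tier.model
    # Only reachable when the last tier's max_score is below the score.
    return tiers[-1].model


ladder = validate_tiers([
    Tier("small", 30),
    Tier("medium", 70),
    Tier("large", 100),
])
```

Because max_score is an inclusive bound, a score of exactly 30 still routes to the cheapest tier in this ladder.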
- class ractogateway.ollama_developer_kit.EmbeddingConfig(**data)[source]
Bases: BaseModel
Validated input for embed / aembed calls.
Example:
config = EmbeddingConfig(texts=["Hello world", "Goodbye world"])
response = kit.embed(config)
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.ollama_developer_kit.EmbeddingResponse(**data)[source]
Bases: BaseModel
Unified response from an embedding call.
- vectors: list[EmbeddingVector]
- model: str
- raw: Any
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
- class ractogateway.ollama_developer_kit.EmbeddingVector(**data)[source]
Bases: BaseModel
A single embedding result.
- index: int
- text: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.ollama_developer_kit.ExactMatchCache(max_size=1024, ttl_seconds=None)[source]
Bases: object
Ultra-low-latency key-value cache for identical LLM requests.
- Parameters:
max_size (int) – LRU capacity. 0 = unlimited (no eviction).
ttl_seconds (float | None) – Entries older than ttl_seconds are treated as misses and transparently evicted. None disables expiry.
Example:
from ractogateway.cache import ExactMatchCache

cache = ExactMatchCache(max_size=512, ttl_seconds=3600)

# Wire into a kit:
kit = OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache)
- get(user_message, system_prompt, model, temperature, max_tokens)[source]
Return a cached response, or None on a miss.
O(1): dictionary lookup plus optional move-to-end.
- Return type:
- put(user_message, system_prompt, model, temperature, max_tokens, response)[source]
Store a response. Evicts LRU entry when at capacity.
O(1) amortised — dictionary insert + optional popitem(last=False).
- Return type:
- invalidate(user_message, system_prompt, model, temperature, max_tokens)[source]
Remove a specific entry. Returns True if it was present.
- Return type:
bool
- property stats: CacheStats
Return a snapshot of hit/miss/size counters.
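The O(1) get / put behaviour described above (dictionary lookup, move-to-end, popitem(last=False)) matches the standard OrderedDict LRU pattern. The following is a minimal sketch of that pattern with TTL expiry added; TinyExactCache and make_key are hypothetical names, and keying on the tuple of the five documented arguments is an assumption about how requests are identified.

```python
import time
from collections import OrderedDict
from typing import Any, Hashable, Optional


class TinyExactCache:
    """Illustrative LRU + TTL cache (hypothetical, not the library class)."""

    def __init__(self, max_size: int = 1024, ttl_seconds: Optional[float] = None):
        self.max_size = max_size          # 0 means unlimited
        self.ttl_seconds = ttl_seconds    # None disables expiry
        self._store: "OrderedDict[Hashable, tuple[float, Any]]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        expired = (
            self.ttl_seconds is not None
            and time.monotonic() - stored_at > self.ttl_seconds
        )
        if expired:
            del self._store[key]          # expired: evict and report a miss
            return None
        self._store.move_to_end(key)      # mark as most recently used
        return response

    def put(self, key: Hashable, response: Any) -> None:
        self._store[key] = (time.monotonic(), response)
        self._store.move_to_end(key)
        if self.max_size and len(self._store) > self.max_size:
            self._store.popitem(last=False)   # drop the least recently used

    def invalidate(self, key: Hashable) -> bool:
        return self._store.pop(key, None) is not None


def make_key(user_message, system_prompt, model, temperature, max_tokens):
    """Assumed cache key: the tuple of the five documented arguments."""
    return (user_message, system_prompt, model, temperature, max_tokens)
```

Reading an entry refreshes its position, so a frequently requested prompt survives eviction even when it was stored first.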
- class ractogateway.ollama_developer_kit.FinishReason(*values)[source]
Why the model stopped generating.
- STOP = 'stop'
- TOOL_CALL = 'tool_call'
- LENGTH = 'length'
- CONTENT_FILTER = 'content_filter'
- ERROR = 'error'
- class ractogateway.ollama_developer_kit.LLMResponse(**data)[source]
Bases: BaseModel
Unified, provider-agnostic response envelope.
Every adapter’s run() method returns one of these, regardless of whether the underlying provider is OpenAI, Gemini, or Anthropic.
- tool_calls: list[ToolCallResult]
- finish_reason: FinishReason
- raw: Any
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
- class ractogateway.ollama_developer_kit.Message(**data)[source]
Bases: BaseModel
A single conversation turn.
Used inside ChatConfig.history to provide prior conversation context to the model for multi-turn conversations.
- role: MessageRole
- content: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.ollama_developer_kit.MessageRole(*values)[source]
Role of a single message in a conversation.
- SYSTEM = 'system'
- USER = 'user'
- ASSISTANT = 'assistant'
- class ractogateway.ollama_developer_kit.OllamaDeveloperKit(model='llama3.2', *, base_url='http://localhost:11434', embedding_model='nomic-embed-text', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)[source]
Bases: object
Complete Ollama local-model developer kit: chat, stream, embeddings, and optional performance/cost optimisation middleware.
Connects to a locally running Ollama server. No API key required.
- Parameters:
model (str) – Model name as reported by ollama list (e.g. "llama3.2", "mistral", "qwen2.5"). Use "auto" when a CostAwareRouter is provided; the router then selects the model per request.
base_url (str) – Ollama server base URL. Defaults to http://localhost:11434.
embedding_model (str) – Default model for embedding calls. Defaults to "nomic-embed-text".
default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.
exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.
semantic_cache (SemanticCache | None) – Optional SemanticCache.
router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".
truncator (TokenTruncator | None) – Optional TokenTruncator.
tracer (RactoTracer | None) – Optional RactoTracer.
metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.
- provider: str = 'ollama'
- chat(config)[source]
Synchronous chat completion with optional middleware pipeline.
Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.
- Return type:
- async achat(config)[source]
Async chat completion with optional middleware pipeline.
- Return type:
- stream(config)[source]
Synchronous streaming; yields StreamChunk objects.
Example:
for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")
- Return type:
Iterator[StreamChunk]
- async astream(config)[source]
Async streaming; yields StreamChunk objects.
- Return type:
AsyncIterator[StreamChunk]
- embed(config)[source]
Synchronous embedding via Ollama’s embed API.
Example:
resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])
- Return type:
EmbeddingResponse
- async aembed(config)[source]
Async embedding via Ollama’s embed API.
- Return type:
EmbeddingResponse
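The middleware order documented for chat() (truncate, exact cache, semantic cache, route model, API call, write caches, record telemetry) can be pictured as a single function that threads a message through optional stages. The sketch below is a simplified illustration only: run_chat_pipeline and every callable it takes are hypothetical, messages are plain strings, and recording telemetry on a cache hit is an assumption the source does not confirm.

```python
from typing import Callable, Optional


def run_chat_pipeline(
    message: str,
    *,
    truncate: Callable[[str], str],
    exact_lookup: Callable[[str], Optional[str]],
    semantic_lookup: Callable[[str], Optional[str]],
    route: Callable[[str], str],
    call_api: Callable[[str, str], str],
    write_caches: Callable[[str, str], None],
    record: Callable[[str], None],
) -> str:
    """Run one request through the stages in the documented order."""
    message = truncate(message)                  # 1. truncate
    cached = exact_lookup(message)               # 2. exact cache
    if cached is None:
        cached = semantic_lookup(message)        # 3. semantic cache
    if cached is not None:
        record("cache_hit")                      # assumption: hits are recorded too
        return cached
    model = route(message)                       # 4. route model
    answer = call_api(model, message)            # 5. API call
    write_caches(message, answer)                # 6. write caches
    record("api_call")                           # 7. record telemetry
    return answer
```

Placing both cache lookups before routing means a hit skips model selection and the model call entirely.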
- class ractogateway.ollama_developer_kit.OllamaServerManager(*, host='127.0.0.1', port=11434, startup_timeout=30.0, ollama_bin='ollama')[source]
Bases: object
Manage the lifecycle of an Ollama server subprocess.
The server is started with the OLLAMA_HOST environment variable set to {host}:{port}, which makes Ollama listen on the requested address.
- Parameters:
host (str) – Bind address. Defaults to "127.0.0.1" (localhost only).
port (int) – TCP port for the Ollama REST API. Defaults to 11434 (the standard Ollama port). Change this to run multiple Ollama instances or to avoid conflicts with an already-running server.
startup_timeout (float) – Seconds to wait for the server to become ready after starting the subprocess. Raises TimeoutError if the server doesn’t respond within this window.
ollama_bin (str) – Path to the ollama executable. Defaults to "ollama" (looked up via PATH).
- base_url
The full http://{host}:{port} URL of the managed server. Use this to construct an OllamaDeveloperKit:
kit = local.Chat(model="llama3.2", base_url=srv.base_url)
- Type:
str
Examples
Context manager (recommended — guarantees cleanup):
with OllamaServerManager(port=11500) as srv:
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
    print(kit.chat(local.ChatConfig(user_message="Hi")).content)
Manual start / stop:
srv = OllamaServerManager(port=11500)
srv.start()
try:
    ...
finally:
    srv.stop()
- property base_url: str
Return
http://{host}:{port}.
- property is_running: bool
True when the subprocess is alive.
- start()[source]
Start the Ollama server subprocess.
Returns self so that the call can be chained:
srv = OllamaServerManager(port=11500).start()
- Raises:
RuntimeError – If the server is already running.
FileNotFoundError – If the ollama binary cannot be found.
TimeoutError – If the server does not become ready within startup_timeout seconds.
- Return type:
OllamaServerManager
- stop()[source]
Stop the Ollama server subprocess gracefully.
Sends SIGTERM first; if the process doesn’t exit within 5 seconds, SIGKILL is used. Silently does nothing if the server is not running.
- Return type:
- pull(model)[source]
Pull model from the Ollama library into the running server.
Equivalent to running ollama pull <model> in a shell, but scoped to the server managed by this instance via OLLAMA_HOST.
- Parameters:
model (str) – Model name, e.g. "llama3.2", "nomic-embed-text".
- Raises:
RuntimeError – If the server is not running.
subprocess.CalledProcessError – If ollama pull exits with a non-zero status.
- Return type:
- list_models()[source]
Return the names of locally-available models on this server.
Uses a lightweight HTTP request to the /api/tags endpoint instead of a subprocess, so it works even when the ollama CLI is not on PATH.
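To make the /api/tags approach concrete, here is a small standard-library sketch. It assumes the endpoint returns a JSON object with a "models" array whose entries carry a "name" field, which is how Ollama's REST API is commonly documented; parse_model_names and fetch_model_names are hypothetical helpers, and the library's own implementation may differ.

```python
import json
import urllib.request


def parse_model_names(payload: bytes) -> list[str]:
    """Extract model names from an /api/tags JSON body."""
    data = json.loads(payload)
    # Assumed response shape: {"models": [{"name": "..."}, ...]}
    return [entry["name"] for entry in data.get("models", [])]


def fetch_model_names(base_url: str, timeout: float = 5.0) -> list[str]:
    """GET {base_url}/api/tags and return the model names it lists."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(resp.read())
```

Keeping the parsing separate from the HTTP call makes the name extraction testable without a running server; fetch_model_names("http://127.0.0.1:11500") would be the call against a server started on that port.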
- class ractogateway.ollama_developer_kit.RoutingTier(**data)[source]
Bases: BaseModel
One tier in the cost-aware routing ladder.
The router evaluates a complexity score (0-100) for each incoming message and selects the first tier whose max_score is >= that score. The last tier in the list always acts as the catch-all fallback.
- Parameters:
model (str) – The LLM model identifier to use for requests that fall in this tier (e.g. "gpt-4o-mini", "gemini-2.0-flash", "claude-haiku-4-5-20251001").
max_score (float) – Inclusive upper bound on the complexity score that routes to this model. Range: 0-100. Set to 100 for the last (most powerful) tier so it catches everything.
Examples
tiers = [
    RoutingTier(model="gpt-4o-mini", max_score=30),
    RoutingTier(model="gpt-4o", max_score=70),
    RoutingTier(model="o3-mini", max_score=100),
]
- model: str
- max_score: float
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.ollama_developer_kit.SemanticCache(embed_fn, similarity_threshold=0.95, max_size=512, ttl_seconds=None)[source]
Bases: object
Vector-similarity cache: returns cached answers for semantically similar queries, costing $0 in API calls.
- Parameters:
embed_fn (Callable[[str], list[float]]) – Any callable (text: str) -> list[float]. Called once per new query (cache miss) and once at put() time.
similarity_threshold (float) – Minimum cosine similarity to declare a hit. The default 0.95 is intentionally strict to avoid incorrect responses.
max_size (int) – Maximum number of entries (LRU eviction). 0 = unlimited.
ttl_seconds (float | None) – Optional per-entry TTL. None disables expiry.
Examples
import ractogateway.openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

def embed(text: str) -> list[float]:
    import openai
    r = openai.OpenAI().embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return r.data[0].embedding

cache = SemanticCache(embed_fn=embed, similarity_threshold=0.95)
- get(query)[source]
Embed query and return a cached response if cosine-sim ≥ threshold.
Returns None on a cache miss (the caller should make the real API call and then invoke put()).
Complexity: O(n·d), where n = number of entries and d = embedding dimension.
- Return type:
- put(query, response)[source]
Embed query and store response for future similar queries.
Evicts LRU entry when at capacity.
- Return type:
- property stats: CacheStats
Return a snapshot of hit/miss/size counters.
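The O(n·d) lookup described for get() is a linear scan that compares the query embedding with every stored embedding. A minimal sketch of that idea follows; TinySemanticCache and toy_embed are hypothetical, the toy embedding (letter counts) exists only so the example runs without a model, and LRU eviction and TTL are left out for brevity.

```python
import math
from typing import Any, Callable, Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinySemanticCache:
    """Illustrative similarity cache (hypothetical, not the library class)."""

    def __init__(self, embed_fn: Callable[[str], list[float]],
                 similarity_threshold: float = 0.95):
        self.embed_fn = embed_fn
        self.similarity_threshold = similarity_threshold
        self._entries: list[tuple[list[float], Any]] = []

    def get(self, query: str) -> Optional[Any]:
        vector = self.embed_fn(query)
        best_score, best_response = 0.0, None
        for stored_vector, response in self._entries:   # O(n·d) scan
            similarity = cosine_similarity(vector, stored_vector)
            if similarity > best_score:
                best_score, best_response = similarity, response
        if best_score >= self.similarity_threshold:
            return best_response
        return None

    def put(self, query: str, response: Any) -> None:
        self._entries.append((self.embed_fn(query), response))


def toy_embed(text: str) -> list[float]:
    """Toy embedding for the demo: one dimension per letter a-z."""
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]
```

With a real embedding model, a threshold near 0.95 accepts only close paraphrases; lowering it trades accuracy for a higher hit rate.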
- class ractogateway.ollama_developer_kit.StreamChunk(**data)[source]
Bases: BaseModel
A single piece of a streaming response.
Consumers iterate over StreamChunk objects; they never touch raw provider events directly.
- delta
The incremental content for this chunk.
- accumulated_text
Running concatenation of all delta.text values so far.
- finish_reason
None for intermediate chunks; set on the final chunk.
- tool_calls
Empty until the final chunk (is_final=True).
- usage
Token counts, populated on the final chunk only.
- is_final
True only for the very last chunk in the stream.
- raw
The underlying provider event (escape hatch for advanced users).
- delta: StreamDelta
- accumulated_text: str
- accumulated_thinking: str
- is_thinking: bool
- finish_reason: FinishReason | None
- tool_calls: list[ToolCallResult]
- is_final: bool
- raw: Any
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
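The relationship between delta, accumulated_text, and is_final can be shown with a small generator. This is a simplified stand-in, not the library's streaming code: Chunk and accumulate are hypothetical, and only the text fields are modelled.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Chunk:
    """Reduced stand-in for StreamChunk (hypothetical)."""
    delta_text: str          # incremental text for this chunk
    accumulated_text: str    # running concatenation of all deltas so far
    is_final: bool           # True only for the last chunk


def accumulate(deltas: Iterable[str]) -> Iterator[Chunk]:
    """Wrap raw text deltas into chunks, flagging the last one as final."""
    iterator = iter(deltas)
    try:
        current = next(iterator)
    except StopIteration:
        return                       # empty stream: yield nothing
    running = ""
    # Look one item ahead so the final chunk can be flagged without buffering.
    for upcoming in iterator:
        running += current
        yield Chunk(current, running, False)
        current = upcoming
    running += current
    yield Chunk(current, running, True)
```

A consumer that only needs the complete text can ignore the deltas and read accumulated_text from the chunk whose is_final is True.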
- class ractogateway.ollama_developer_kit.StreamDelta(**data)[source]
Bases: BaseModel
Incremental content produced by a single streaming event.
- text: str
- thinking: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.ollama_developer_kit.TokenTruncator(config=None)[source]
Bases: object
Smart conversation-history trimmer.
- Parameters:
config (TruncationConfig | None) – TruncationConfig instance. If omitted, a default config is used (approximate counter, 8 k limit).
Examples
from ractogateway.truncation import TokenTruncator, TruncationConfig
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
truncator = TokenTruncator(
    TruncationConfig(
        token_counter=lambda t: len(enc.encode(t)),
        keep_first_n=2,
        keep_last_n=8,
    )
)
kit = OpenAIDeveloperKit(model="gpt-4o", truncator=truncator)
- truncate(chat_config, model)[source]
Return a copy of chat_config with trimmed history if necessary.
If the total estimated token count (system prompt + history + user_message) fits within the model’s context limit, the original ChatConfig is returned unchanged.
- Parameters:
chat_config (ChatConfig) – The chat configuration to potentially truncate.
model (str) – The resolved model name used to look up the context-window limit.
- Return type:
ChatConfig
- Returns:
ChatConfig – A new ChatConfig instance with (possibly shorter) history. The user_message and all other fields are preserved verbatim.
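One way to picture the trimming step is a function that protects the first and last few history messages and drops from the middle until the total fits. The sketch below is an assumption-laden illustration: trim_history is hypothetical, messages are plain strings, the budget covers history only (the system prompt, user message, and safety margin are ignored), and dropping the oldest unprotected message first is a guess at the policy rather than documented behaviour.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Approximate counter: roughly one token per four characters."""
    return len(text) // 4


def trim_history(
    history: list[str],
    budget: int,
    keep_first_n: int = 2,
    keep_last_n: int = 6,
    token_counter: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Return a copy of history that fits budget where possible."""
    total = sum(token_counter(m) for m in history)
    # Nothing to do if it already fits or every message is protected.
    if total <= budget or len(history) <= keep_first_n + keep_last_n:
        return list(history)
    split = len(history) - keep_last_n
    head = history[:keep_first_n]
    middle = history[keep_first_n:split]
    tail = history[split:]
    # Assumption: drop the oldest unprotected messages first.
    while middle and total > budget:
        total -= token_counter(middle.pop(0))
    return head + middle + tail
```

If the protected head and tail alone exceed the budget, this sketch returns them anyway; how the library handles that case is not stated in the source.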
- class ractogateway.ollama_developer_kit.ToolCallResult(**data)[source]
Bases: BaseModel
A single tool/function call returned by the model.
- id: str
- name: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.ollama_developer_kit.TruncationConfig(**data)[source]
Bases: BaseModel
Configuration for TokenTruncator.
- Parameters:
max_context_tokens (int | None) – Hard cap on total prompt tokens before calling the API. When None, the truncator looks up the model in MODEL_CONTEXT_LIMITS (falling back to 8192).
keep_first_n (int) – Number of history messages to always preserve from the start of the conversation (anchors context). Defaults to 2.
keep_last_n (int) – Number of history messages to always preserve from the most recent end of the conversation. Defaults to 6.
token_counter (Callable[[str], int]) – Callable (text: str) -> int. Defaults to the built-in approximate counter (len // 4). Swap in tiktoken for exact OpenAI token counts:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
config = TruncationConfig(token_counter=lambda t: len(enc.encode(t)))
safety_margin (int) – Extra token budget reserved beyond the system prompt and user message. Defaults to 512.
- keep_first_n: int
- keep_last_n: int
- safety_margin: int
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
- resolve_limit(model)[source]
Return the effective token limit for model.
Priority: max_context_tokens → MODEL_CONTEXT_LIMITS lookup → _DEFAULT_CONTEXT.
- Return type:
int
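The three-step priority can be written as a short fallback chain. In this sketch the table contents are invented for illustration (the source does not list the entries of MODEL_CONTEXT_LIMITS), and the function is a free-standing stand-in for the method, taking max_context_tokens as an argument instead of reading it from the config.

```python
from typing import Optional

# Hypothetical table: the real MODEL_CONTEXT_LIMITS entries are not shown in the source.
MODEL_CONTEXT_LIMITS = {"example-model": 32_768}
DEFAULT_CONTEXT = 8_192  # fallback value stated in the TruncationConfig parameters


def resolve_limit(model: str, max_context_tokens: Optional[int] = None) -> int:
    """Explicit cap first, then the per-model table, then the default."""
    if max_context_tokens is not None:
        return max_context_tokens
    return MODEL_CONTEXT_LIMITS.get(model, DEFAULT_CONTEXT)
```

An explicit max_context_tokens always wins, which is useful for local models whose context window is configured at load time and may not appear in any lookup table.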