RactoGateway — Complete User Guide

Who this guide is for: complete beginners who have never used an LLM library before, as well as experienced developers who want a deep-dive reference. Every parameter is explained in plain English and in technical terms, with working code examples and expected output.

Table of Contents

1. Jargon Buster: Know the Words Before You Write the Code
2. What is RactoGateway?
3. Installation
4. Core Mental Model
5. RactoPrompt: The Heart of Every Request
6. Developer Kits: Your Chat Interface
7. Your First Chat
8. ChatConfig: Controlling Every Request
9. Getting Structured / Typed Output
- 9.1 Complex Nested Structured Output
- 9.2 Validation Retries and ResponseModelValidationError
10. Multi-Turn Conversations (History)
11. Streaming: Real-Time Token-by-Token Output
12. Tool Calling: LLM Calls Your Python Functions
13. File Attachments: Vision & PDFs
14. Embeddings: Teaching Machines to Understand Text
15. Performance & Cost Optimisation
- 15.1 Exact Match Cache
- 15.2 Semantic Cache
- 15.3 Token Truncation
- 15.4 Cost-Aware Routing
16. All Five Developer Kits
- 16.1 OpenAIDeveloperKit (GPT)
- 16.2 GoogleDeveloperKit (Gemini)
- 16.3 AnthropicDeveloperKit (Claude)
- 16.4 OllamaDeveloperKit (Local / Offline)
- 16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM)
17. RAG: Retrieval-Augmented Generation
18. Redis: Production Infrastructure
19. Common Mistakes & How to Fix Them
20. Prebuilt Pipelines: Production Workflows
- SQL Analyst, List Classifier, Video Processor, Agent
21. Chain of Thought Reasoning
22. Native Thinking / Extended Reasoning
23. PageIndexRAG: Vectorless RAG

1. Jargon Buster

Before diving into code, here are the key terms you will encounter. Skip to §2 if you already know these.

Term	Plain-English Meaning	Technical Definition
LLM	A very powerful autocomplete that understands meaning	Large Language Model — a neural network trained on vast text corpora to predict/generate natural language
Prompt	What you say to the AI	The input text (plus optional instructions) sent to an LLM
Completion / Response	What the AI says back	The LLM’s generated output tokens
Token	Roughly one word (sometimes less)	The smallest unit an LLM processes; ~4 chars for English
System Prompt	The AI’s job description	An instruction block sent before the conversation; sets behaviour and constraints
Temperature	How creative vs. predictable the AI is	Float, typically 0–2 (the exact range depends on the provider). 0 = near-deterministic (the most likely token is always chosen). Higher = more random/creative
Streaming	Getting the answer word-by-word in real time	Server-sent events where each token is pushed to the client as it is generated
Embedding	Converting text into a list of numbers	A dense vector representation where semantically similar texts are numerically close
RAG	Letting the AI “look things up” before answering	Retrieval-Augmented Generation — retrieve relevant chunks from a knowledge base and inject them into the prompt
Tool Calling	The AI can trigger your Python functions	Function-calling protocol where the LLM emits a structured intent and the client executes a real function
Pydantic Model	A Python class that validates data automatically	A `BaseModel` subclass that enforces types and field rules at runtime
Cache	Store an answer so you don’t ask the AI twice	In-memory or distributed key-value store keyed on request fingerprint
Context Window	The AI’s short-term memory	Maximum number of tokens the model can process in one request
Adapter	The translator between our library and the AI provider	A thin class that converts our internal format to the OpenAI / Google / Anthropic API wire format
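To make the Embedding row concrete, here is a minimal sketch of what "numerically close" means. It uses cosine similarity, the measure that the semantic cache in §6 is described as using. The three vectors are invented toy values, not real model output, and the function name cosine_similarity is chosen for this example only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" invented for this example.
# Real embedding models return hundreds or thousands of dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(f"cat vs kitten:  {cosine_similarity(cat, kitten):.3f}")
print(f"cat vs invoice: {cosine_similarity(cat, invoice):.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property that retrieval and semantic caching rely on.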

2. What is RactoGateway?

Plain English: RactoGateway is a Python library that lets you talk to AI models from different providers (OpenAI, Google, and Anthropic, plus local Ollama and Hugging Face models; see §16) using the same code. You don’t need to learn a separate API for each provider. You write your prompts using a structured template (the RACTO principle), and the library takes care of formatting, caching, routing, and more.

Technical: RactoGateway is a provider-agnostic LLM orchestration SDK built on Pydantic. It provides:

- A unified RactoPrompt structured prompt compiler (the RACTO principle)
- Provider-specific developer kits (OpenAIDeveloperKit, GoogleDeveloperKit, AnthropicDeveloperKit, OllamaDeveloperKit, HuggingFaceDeveloperKit)
- Sync and async parity on every method
- Optional middleware: exact-match cache, semantic cache, cost-aware router, token truncator
- Tool calling, file attachments, streaming, embeddings, RAG, fine-tuning, and production infra (Redis, Celery, Kafka)

Why does this exist? Without RactoGateway, switching from OpenAI to Anthropic means rewriting all your code. With RactoGateway, you swap one class name.

3. Installation

# Minimum — no LLM provider yet
pip install ractogateway

# OpenAI (GPT models)
pip install "ractogateway[openai]"

# Google (Gemini models)
pip install "ractogateway[google]"

# Anthropic (Claude models)
pip install "ractogateway[anthropic]"

# All three providers at once
pip install "ractogateway[all]"

# RAG (document reading, chunking, embedding, stores)
pip install "ractogateway[rag-all]"

# Redis (distributed cache, rate limiting, chat memory)
pip install "ractogateway[redis]"

Requires Python 3.10 or later.
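If you are unsure whether your environment meets these requirements, a quick check like the following can help. This is a small standard-library sketch; the helper names python_is_supported and package_is_installed are invented for this example and are not part of RactoGateway.

```python
import importlib.util
import sys


def python_is_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True when the interpreter is Python 3.10 or later."""
    return version >= (3, 10)


def package_is_installed(name: str) -> bool:
    """Return True when a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


print("Python 3.10+:", python_is_supported())
print("ractogateway installed:", package_is_installed("ractogateway"))
```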

4. Core Mental Model

Think of RactoGateway in three layers:

┌─────────────────────────────────────────────────────┐
│  YOUR CODE                                          │
│  RactoPrompt → ChatConfig → kit.chat()              │
├─────────────────────────────────────────────────────┤
│  DEVELOPER KIT  (OpenAIDeveloperKit, etc.)           │
│  middleware: cache → route → truncate → API call    │
├─────────────────────────────────────────────────────┤
│  ADAPTER  (OpenAILLMKit, GoogleLLMKit, etc.)         │
│  Translates our format → provider wire format       │
├─────────────────────────────────────────────────────┤
│  PROVIDER API  (OpenAI, Google, Anthropic)           │
└─────────────────────────────────────────────────────┘

You only ever touch the top layer. The kit and adapter layers are managed for you.

5. RactoPrompt

RactoPrompt is how you write instructions for the AI. It enforces the RACTO principle, a structured format designed to reduce hallucinations and ambiguous outputs by making the role, task, limits, tone, and output shape explicit.

RACTO stands for:

Letter	Field	Plain English	Technical
R	`role`	Who is the AI?	System identity; primes the model’s behaviour via persona specification
A	`aim`	What should it do?	Objective statement; the task the model must complete
C	`constraints`	What must it never do?	Hard invariants; rule set injected into `[CONSTRAINTS]` block
T	`tone`	How should it talk?	Communication register; affects lexical and stylistic choices
O	`output_format`	What shape should the answer be in?	Output schema; can be a keyword, a string, or a Pydantic model class

Plus two optional helpers: context (background knowledge) and examples (few-shot examples).

5.1 Minimal Example

from ractogateway.prompts.engine import RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful customer-support agent for a software company.",
    aim="Answer the user's question about our product.",
    constraints=[
        "Never make up features that don't exist.",
        "If you don't know the answer, say so.",
    ],
    tone="Friendly and concise.",
    output_format="text",
)

# See what the compiled system prompt looks like:
print(prompt.compile())

Expected output:

[ROLE]
You are a helpful customer-support agent for a software company.

[AIM]
Answer the user's question about our product.

[CONSTRAINTS]
- Never make up features that don't exist.
- If you don't know the answer, say so.

[TONE]
Friendly and concise.

[OUTPUT]
Respond in plain text with no special formatting.

[GUARDRAILS]
- If you are unsure or lack sufficient information, state it explicitly rather than guessing.
- Do NOT fabricate facts, citations, URLs, statistics, or code that you cannot verify.
- Stick strictly to what is asked. Do not add unrequested information.
- If the answer requires assumptions, list each assumption explicitly before proceeding.

Notice the [GUARDRAILS] section at the bottom. This is auto-generated by anti_hallucination=True (the default). It tells the model to be honest about uncertainty. You can disable it with anti_hallucination=False if you need maximum creative freedom.

5.2 Full Parameter Reference

from pydantic import BaseModel

class Summary(BaseModel):
    headline: str
    bullet_points: list[str]
    confidence_score: float  # 0.0 to 1.0

prompt = RactoPrompt(
    # ── REQUIRED ──────────────────────────────────────────────────────
    role="You are a senior financial analyst.",
    # Plain: "Tell the AI who it is"
    # Technical: Persona string prepended to the [ROLE] block; primes
    #            the model's prior distribution toward domain-specific vocabulary

    aim="Summarise the provided earnings report into key takeaways.",
    # Plain: "Tell the AI what job it has to do"
    # Technical: Task objective injected into [AIM]; should be one clear imperative sentence

    constraints=[
        "Only use numbers that appear in the report — never invent figures.",
        "Keep bullet points to at most 15 words each.",
        "Do not provide investment advice.",
    ],
    # Plain: "Red lines the AI must never cross"
    # Technical: List[str]; each item becomes a bullet in [CONSTRAINTS].
    #            Minimum one constraint required.

    tone="Professional, concise, and factual.",
    # Plain: "How the AI should sound"
    # Technical: Register specification injected into [TONE]; affects lexical
    #            and stylistic choices (formality, verbosity)

    output_format=Summary,
    # Plain: "Exactly what shape should the answer be in?"
    # Technical: Union[str, type[BaseModel]].
    #   - "text"     → plain text
    #   - "json"     → raw JSON object
    #   - "markdown" → markdown-formatted response
    #   - A Pydantic model class → the full JSON Schema is embedded in the prompt;
    #     the LLM must return JSON that validates against it.

    # ── OPTIONAL ──────────────────────────────────────────────────────
    context="Q3 2025 earnings call. Revenue: $4.2B (+12% YoY). EPS: $1.87.",
    # Plain: "Background knowledge the AI needs to do its job"
    # Technical: Domain-specific text injected between [AIM] and [CONSTRAINTS].
    #            Ideal for passing documents, retrieved chunks, or facts.

    examples=[
        {
            "input":  "Revenue grew 5% but EPS fell 10%.",
            "output": '{"headline": "Mixed signals: top-line growth masked by margin compression", ...}'
        },
    ],
    # Plain: "Show the AI what a good answer looks like"
    # Technical: Few-shot exemplars injected into [EXAMPLES] block; each dict
    #            must contain exactly "input" and "output" keys.

    anti_hallucination=True,
    # Plain: "Should the AI be told to say 'I don't know' instead of guessing?"
    # Technical: Boolean flag. When True, appends [GUARDRAILS] block with
    #            explicit uncertainty-disclosure directives. Default: True.
)

6. Developer Kits

A Developer Kit is your interface to a specific LLM provider. All five kits (OpenAIDeveloperKit, GoogleDeveloperKit, AnthropicDeveloperKit, OllamaDeveloperKit, HuggingFaceDeveloperKit) share the same six method names; the ones used in this guide are chat(), achat(), stream(), astream(), embed(), and aembed().

OpenAIDeveloperKit — Full Parameter Reference

from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",
    # Plain: "Which AI model should I use?"
    # Technical: Chat model ID passed to openai.chat.completions.create(model=...).
    #            Use "auto" to enable cost-aware routing (requires router= param).
    #            Common values: "gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "o3-mini"

    api_key="sk-...",
    # Plain: "The secret key that identifies my OpenAI account (treat it like a password)"
    # Technical: Bearer token for OpenAI API auth. Falls back to
    #            os.environ["OPENAI_API_KEY"] when omitted.

    base_url=None,
    # Plain: "Send requests to a different server (e.g. Azure or your own proxy)"
    # Technical: Override for openai.base_url. Used for Azure OpenAI endpoints or
    #            local model servers that implement the OpenAI protocol.

    embedding_model="text-embedding-3-small",
    # Plain: "Which model to use when converting text to numbers (embeddings)"
    # Technical: Default model ID for embed() / aembed() calls.
    #            Passed to openai.embeddings.create(model=...).

    default_prompt=None,
    # Plain: "A prompt to use for every request unless I override it"
    # Technical: RactoPrompt instance used when ChatConfig.prompt is None.
    #            If both are None, kit.chat() raises ValueError.

    exact_cache=None,
    # Plain: "Store answers so I don't pay for the same question twice"
    # Technical: ExactMatchCache instance. On a byte-identical request the cached
    #            LLMResponse is returned without an API call. O(1) lookup.

    semantic_cache=None,
    # Plain: "Store answers and also reuse them for questions that mean the same thing"
    # Technical: SemanticCache instance. Uses cosine similarity on embeddings.
    #            Returns cached response when similarity >= threshold.

    router=None,
    # Plain: "Automatically pick the cheapest model that can handle each question"
    # Technical: CostAwareRouter instance. Routes each request to the first tier
    #            whose max_score >= the computed prompt complexity score.
    #            Required when model="auto".

    truncator=None,
    # Plain: "Automatically shorten old conversation history if it gets too long"
    # Technical: TokenTruncator instance. Trims history messages to keep total
    #            token count within the model's context window before each API call.
)
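The router comment above describes a simple rule: pick the first tier whose max_score is at least the prompt's complexity score. Here is a minimal sketch of that selection rule. The Tier class, the word-count heuristic, and the thresholds are assumptions invented for this example; the real CostAwareRouter may score prompts very differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    model: str
    max_score: int  # highest complexity score this tier should handle


def complexity_score(prompt: str) -> int:
    """Hypothetical heuristic: one point per word, capped at 100."""
    return min(100, len(prompt.split()))


def pick_model(tiers: list[Tier], score: int) -> str:
    """Return the model of the first (cheapest) tier whose max_score >= score."""
    for tier in sorted(tiers, key=lambda t: t.max_score):
        if tier.max_score >= score:
            return tier.model
    raise ValueError(f"no tier can handle complexity score {score}")


# Illustrative thresholds only; tune them for your own workload.
TIERS = [Tier("gpt-4o-mini", max_score=30), Tier("gpt-4o", max_score=100)]

question = "What is a Python list?"
print(pick_model(TIERS, complexity_score(question)))  # gpt-4o-mini
```

Sorting by max_score first means the tiers can be declared in any order and the cheapest adequate tier still wins.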

7. Your First Chat

Let’s put it all together — a complete, working example.

import os
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

# 1. Define who the AI is and what it should do
prompt = RactoPrompt(
    role="You are a helpful Python tutor.",
    aim="Explain the concept the user asks about in simple terms.",
    constraints=["Use beginner-friendly language.", "Keep the answer to at most 3 sentences."],
    tone="Warm, encouraging, and clear.",
    output_format="text",
)

# 2. Create the kit (reads OPENAI_API_KEY from environment automatically)
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=prompt,
)

# 3. Send a message and get a response
response = kit.chat(gpt.ChatConfig(user_message="What is a Python list?"))

print(response.content)
# A list in Python is an ordered collection of items that can hold any type
# of data — numbers, strings, even other lists. You create one with square
# brackets, like my_list = [1, "hello", True]. You can add, remove, or
# change items at any time!

print(f"Tokens used: {response.usage}")
# Tokens used: {'prompt_tokens': 127, 'completion_tokens': 54, 'total_tokens': 181}

print(f"Why did generation stop: {response.finish_reason}")
# Why did generation stop: FinishReason.STOP

# Provider-specific fields (e.g. which model ran) live in the raw response:
print(response.raw.model)   # gpt-4o-mini  (OpenAI ChatCompletion object)

What is `LLMResponse`?

The return type of kit.chat() is an LLMResponse object. Here are its key fields:

Field	Type	Plain English	Technical
`content`	`str | None`	The AI’s answer as a string	Raw text of the completion (markdown fences auto-stripped)
`parsed`	`dict | list | None`	The answer as structured data (when response is valid JSON)	JSON-decoded via `try_parse_json()`; further validated when `response_model` is set
`finish_reason`	`FinishReason`	Why the AI stopped generating	Enum: `STOP` (natural end), `LENGTH` (hit max_tokens), `TOOL_CALL`
`usage`	`dict[str, int]`	How many tokens were used	`prompt_tokens`, `completion_tokens`, `total_tokens`
`tool_calls`	`list[ToolCallResult]`	Any tools the AI wanted to call	Non-empty when the model returns a function-call intent
`raw`	`Any`	The raw provider response object	Original SDK object (e.g. `openai.types.chat.ChatCompletion`); use `response.raw.model` to get the model name
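The content and parsed rows mention two behaviours: markdown fences are stripped, and valid JSON is decoded into a dict or list. The sketch below shows one way such a step could work. It is an illustration under assumptions, not the library's try_parse_json(); the names strip_fences and parse_json_or_none are made up for this example.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to read

# Matches a whole string wrapped in a fenced block, with an optional language tag.
_FENCED = re.compile(rf"^{FENCE}[\w-]*[ \t]*\n(.*?)\n?{FENCE}$", re.DOTALL)


def strip_fences(text: str) -> str:
    """Remove one surrounding markdown code fence, if present."""
    stripped = text.strip()
    match = _FENCED.match(stripped)
    return match.group(1).strip() if match else stripped


def parse_json_or_none(text: str):
    """Return a dict or list when the text is a JSON object/array, else None."""
    try:
        value = json.loads(strip_fences(text))
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, (dict, list)) else None


fenced_reply = FENCE + 'json\n{"city": "London", "uv_index": 3}\n' + FENCE
print(parse_json_or_none(fenced_reply))     # a dict
print(parse_json_or_none("Just a sentence."))  # None
```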

8. ChatConfig

ChatConfig is the object you pass to every chat(), achat(), stream(), and astream() call. It controls the details of a single request.

from pydantic import BaseModel
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

class ProductReview(BaseModel):
    sentiment: str          # "positive" | "neutral" | "negative"
    score: int              # 1–10
    summary: str

config = gpt.ChatConfig(
    user_message="The keyboard is amazing but the battery dies in 3 hours.",
    # Plain: "The question or text you want to send to the AI"
    # Technical: The human turn content. Minimum 1 character (enforced by Pydantic).

    prompt=RactoPrompt(
        role="You are a product review classifier.",
        aim="Classify the review and return a structured analysis.",
        constraints=["Scores must be integers from 1 to 10."],
        tone="Neutral and objective.",
        output_format=ProductReview,
    ),
    # Plain: "Override the kit's default prompt for just this one request"
    # Technical: Per-request RactoPrompt. Takes precedence over kit.default_prompt.
    #            If both are None, raises ValueError.

    temperature=0.0,
    # Plain: "How predictable vs. creative should the answer be?"
    # Technical: Sampling temperature. Float in [0.0, 2.0].
    #   0.0 → greedy decoding (near-deterministic; providers do not guarantee
    #         identical output for identical input)
    #   ~0.7 → balanced creativity/coherence (good for most tasks)
    #   1.5+ → very random; may become incoherent for structured tasks

    max_tokens=512,
    # Plain: "Maximum length of the AI's answer"
    # Technical: Hard cap on completion tokens. If the model hasn't finished,
    #            generation stops and finish_reason becomes LENGTH.
    #            Default is 4096. Keep lower for short structured tasks to save cost.

    response_model=ProductReview,
    # Plain: "Validate the AI's JSON answer against this Python class"
    # Technical: type[BaseModel]. After the API call, the raw JSON content is
    #            parsed and validated via ProductReview.model_validate().
    #            On repeated failure, ResponseModelValidationError is raised.
    #            If omitted and prompt.output_format is a BaseModel, the kit
    #            infers response_model automatically.

    history=[],
    # Plain: "Previous messages in the conversation (for multi-turn chat)"
    # Technical: list[Message]. Each Message has role (user/assistant/system) and
    #            content (str). Injected between the system prompt and the current
    #            user message. Managed manually or via RedisChatMemory.

    tools=None,
    # Plain: "Python functions the AI is allowed to call"
    # Technical: ToolRegistry instance. The adapter serialises its schemas into
    #            provider-specific function-calling format before the API call.

    auto_execute_tools=False,
    # Plain: "Should the kit execute tool calls automatically and return final content?"
    # Technical: If True, chat()/achat() run a local tool loop:
    #            LLM tool call -> execute registry callables -> follow-up LLM call.

    max_tool_turns=3,
    # Plain: "How many tool-call rounds are allowed in auto mode?"
    # Technical: Safety cap for auto_execute_tools loop. Range 1..10.

    extra={},
    # Plain: "Any other provider-specific settings I want to pass"
    # Technical: Pass-through dict merged into the API request kwargs.
    #            E.g. extra={"seed": 42, "top_p": 0.9, "stop": ["\n\n"]}
)

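# Create a kit to send the request (reads OPENAI_API_KEY from the environment).
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

response = kit.chat(config)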
print(response.parsed)
# {'sentiment': 'neutral', 'score': 5, 'summary': 'Great keyboard but very poor battery life.'}
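To see why temperature changes how predictable the output is, here is the textbook view of temperature-scaled sampling: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. This is a general illustration, not RactoGateway or provider code; the function name and the example logits are invented.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy limit: all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.0, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

Lower temperatures concentrate probability on the top token, which is why 0.0 suits structured tasks; higher values spread it out, which adds variety at the cost of consistency.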

9. Structured Output

One of the most powerful features: getting a validated Python object back from the AI instead of raw text.

Step 1 — Define your output shape with Pydantic

from pydantic import BaseModel

class WeatherReport(BaseModel):
    city: str
    temperature_celsius: float
    condition: str          # e.g. "sunny", "rainy", "cloudy"
    uv_index: int

Step 2 — Pass the class as `output_format` in RactoPrompt

from ractogateway.prompts.engine import RactoPrompt

prompt = RactoPrompt(
    role="You are a weather data formatter.",
    aim="Parse the user's description into a structured weather report.",
    constraints=["Always use Celsius.", "UV index must be 0–11."],
    tone="Concise and data-focused.",
    output_format=WeatherReport,   # <-- the Pydantic class
)

Step 3 — Also pass it as `response_model` in ChatConfig

from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)

config = gpt.ChatConfig(
    user_message="London, 18 degrees, overcast, UV 3.",
    response_model=WeatherReport,   # <-- validates the parsed JSON
)

response = kit.chat(config)

# response.parsed is a dict already validated against WeatherReport
print(response.parsed)
# {'city': 'London', 'temperature_celsius': 18.0, 'condition': 'overcast', 'uv_index': 3}

# To get a proper WeatherReport instance:
report = WeatherReport(**response.parsed)
print(report.city)           # London
print(report.uv_index)       # 3
print(type(report))          # <class '__main__.WeatherReport'>

Why two places? output_format in RactoPrompt tells the LLM what to generate (it embeds the JSON Schema in the system prompt). response_model in ChatConfig validates the output in Python. Passing both makes the intent explicit. If you omit response_model, the kits infer it automatically when prompt.output_format is a Pydantic model class.

9.1 Complex Nested Structured Output — Enterprise Vendor Evaluation

Real-world schemas are deeply nested with enums, constrained integers, and lists of sub-models. This example shows a board-level vendor risk evaluation with six sub-models.

Key Rule: always make score ranges explicit in your constraints. Pydantic enforces bounds only after the response arrives (as a validation error, not an API error), so the LLM may miss a range that appears only in the embedded schema. Use conint(ge=1, le=100) for percentage-like scores and tell the model "all scores are integers on a 1–100 scale" in the constraints list.

from typing import List, Literal
from pydantic import BaseModel, conint, confloat
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt


# ── Sub-models ─────────────────────────────────────────────────────────────

class FinancialRisk(BaseModel):
    burn_rate_risk: Literal["low", "medium", "high"]
    runway_months: conint(ge=0, le=60)
    profitability_projection_years: conint(ge=0, le=10)
    financial_score: conint(ge=1, le=100)          # 1–100, higher = healthier finances


class SecurityAssessment(BaseModel):
    data_encryption: Literal["none", "at_rest_only", "at_rest_and_in_transit"]
    iso_certified: bool
    soc2_certified: bool
    gdpr_compliant: bool
    vulnerabilities_found: conint(ge=0, le=100)
    security_score: conint(ge=1, le=100)           # 1–100, higher = more secure


class TechnicalArchitecture(BaseModel):
    architecture_style: Literal["monolith", "microservices", "serverless", "hybrid"]
    cloud_provider: Literal["aws", "gcp", "azure", "multi-cloud", "on-prem"]
    scalability_rating: conint(ge=1, le=100)       # 1–100, higher = more scalable
    reliability_sla: confloat(ge=0.0, le=100.0)
    vendor_lock_in_risk: Literal["low", "medium", "high"]


class RiskMatrix(BaseModel):
    category: Literal["financial", "security", "technical", "operational"]
    probability: Literal["low", "medium", "high"]
    impact: Literal["low", "medium", "high"]
    mitigation_strategy: str


class MigrationPhase(BaseModel):
    phase_name: str
    duration_months: conint(ge=1, le=36)
    complexity_score: conint(ge=1, le=10)          # 1–10 scale (task complexity)
    key_deliverables: List[str]


class FinalRecommendation(BaseModel):
    decision: Literal["approve", "approve_with_conditions", "reject"]
    confidence_score: conint(ge=1, le=100)
    key_strengths: List[str]
    critical_weaknesses: List[str]
    board_summary: str


class VendorEvaluation(BaseModel):
    vendor_name: str
    industry: str
    annual_contract_value_usd: conint(ge=10_000, le=10_000_000)

    financial_risk: FinancialRisk
    security_assessment: SecurityAssessment
    technical_architecture: TechnicalArchitecture

    top_risks: List[RiskMatrix]
    migration_plan: List[MigrationPhase]

    overall_risk_score: conint(ge=1, le=100)       # 1–100, higher = riskier

    final_recommendation: FinalRecommendation


# ── User input ─────────────────────────────────────────────────────────────

vendor_brief = """
We are evaluating NeuroStack AI as a strategic enterprise AI vendor.

Company Profile:
- 3 years old, monthly burn rate: $1.2M, raised $25M Series A
- Not profitable; expected profitability in 4–5 years

Security:
- ISO 27001 certified, no SOC 2, encryption at rest and in transit
- 3 minor vulnerabilities last year, GDPR compliant

Technical:
- Hybrid architecture hosted on AWS, SLA 99.2%
- Heavy proprietary API usage; deep workflow integration required

Financials:
- Annual contract: $2.4M, operational dependency: Critical
- Moderate probability of vendor collapse in next 18 months
"""

# ── Prompt ─────────────────────────────────────────────────────────────────

kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

config = gpt.ChatConfig(
    user_message=vendor_brief,
    prompt=RactoPrompt(
        role="You are a Chief Risk Officer conducting a board-level enterprise vendor risk evaluation.",
        aim="Produce a structured, multi-dimensional vendor evaluation strictly matching the schema.",
        constraints=[
            # ✅ Always state numeric ranges explicitly — do not rely on the model
            #    guessing Pydantic bounds from the schema description alone.
            "financial_score, security_score, scalability_rating, overall_risk_score, and confidence_score are all integers on a 1–100 scale.",
            "complexity_score inside each MigrationPhase is an integer on a 1–10 scale.",
            "runway_months must be derived from (cash raised ÷ monthly burn) realistically.",
            "overall_risk_score must reflect the sub-scores logically.",
            "decision must align with overall_risk_score: ≤35 approve, 36–65 approve_with_conditions, >65 reject.",
            "Provide at least 3 top_risks entries.",
            "Provide exactly 3 migration phases.",
        ],
        tone="Executive, analytical, objective.",
        output_format=VendorEvaluation,
    ),
    temperature=0.0,
    max_tokens=2000,
    response_model=VendorEvaluation,
)

# ── Execute ────────────────────────────────────────────────────────────────

from ractogateway.exceptions import ResponseModelValidationError

try:
    response = kit.chat(config)
    print("======== PARSED STRUCTURED OUTPUT ========")
    print(response.parsed)
    print("\n======== RAW JSON OUTPUT ========")
    print(response.content)
except ResponseModelValidationError as e:
    print(f"Validation failed after {e.attempts} attempt(s)")
    print(f"Last error: {e.last_error}")
    print(f"Raw output: {e.raw_response}")

Expected output (values will vary slightly with the model):

======== PARSED STRUCTURED OUTPUT ========
{
  'vendor_name': 'NeuroStack AI',
  'industry': 'Artificial Intelligence',
  'annual_contract_value_usd': 2400000,
  'financial_risk': {
    'burn_rate_risk': 'high', 'runway_months': 20,
    'profitability_projection_years': 4, 'financial_score': 40
  },
  'security_assessment': {
    'data_encryption': 'at_rest_and_in_transit',
    'iso_certified': True, 'soc2_certified': False, 'gdpr_compliant': True,
    'vulnerabilities_found': 3, 'security_score': 70
  },
  'technical_architecture': {
    'architecture_style': 'hybrid', 'cloud_provider': 'aws',
    'scalability_rating': 75, 'reliability_sla': 99.2, 'vendor_lock_in_risk': 'high'
  },
  ...
  'overall_risk_score': 55,
  'final_recommendation': {
    'decision': 'approve_with_conditions', 'confidence_score': 65, ...
  }
}

9.2 Validation Retries and `ResponseModelValidationError`

When response_model is set, RactoGateway automatically retries the API call with a targeted correction prompt if Pydantic rejects the output. This is controlled by max_validation_retries in ChatConfig (default: 2).

Retry flow:

1. Initial API call → Pydantic validation attempt.
2. On failure → the exact field errors and the bad JSON are fed back to the LLM.
3. The LLM is asked to return a corrected JSON (keeping all valid fields).
4. Steps 2–3 repeat up to max_validation_retries times.
5. If all attempts fail → ResponseModelValidationError is raised.
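The retry flow above can be sketched as a small loop. This is an offline illustration under assumptions, not the library's code: ToyValidationError stands in for ResponseModelValidationError, validate_score stands in for Pydantic validation, and call_llm is any function that returns text (a scripted fake here).

```python
import json
from typing import Callable


class ToyValidationError(Exception):
    """Stand-in for ResponseModelValidationError in this sketch."""

    def __init__(self, attempts: int, last_error: Exception, raw_response: str) -> None:
        super().__init__(f"validation failed after {attempts} attempt(s)")
        self.attempts = attempts
        self.last_error = last_error
        self.raw_response = raw_response


def validate_score(raw: str) -> dict:
    """Stand-in validator: expects {"label": ..., "value": <int 1..10>}."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    value = data.get("value")
    if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
        raise ValueError(f"value must be an integer from 1 to 10, got {value!r}")
    return data


def chat_with_retries(
    call_llm: Callable[[str], str], user_message: str, max_validation_retries: int = 2
) -> dict:
    prompt = user_message
    total_attempts = 1 + max_validation_retries  # 1 initial call + N retries
    for _ in range(total_attempts):
        raw = call_llm(prompt)
        try:
            return validate_score(raw)
        except ValueError as exc:
            last_error, last_raw = exc, raw
            # Feed the error and the bad answer back so the model can correct it.
            prompt = (
                f"{user_message}\n\nYour previous answer was invalid.\n"
                f"Error: {exc}\nPrevious answer: {raw}\nReturn corrected JSON only."
            )
    raise ToyValidationError(total_attempts, last_error, last_raw)


# Scripted fake model: the first answer is out of range, the second is valid.
replies = iter(['{"label": "Python", "value": 15}', '{"label": "Python", "value": 9}'])
result = chat_with_retries(lambda prompt: next(replies), "Rate 'Python'.")
print(result)  # {'label': 'Python', 'value': 9}
```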

from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
from ractogateway.exceptions import ResponseModelValidationError
from pydantic import BaseModel, conint

class Score(BaseModel):
    label: str
    value: conint(ge=1, le=10)   # strict 1–10

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

config = gpt.ChatConfig(
    user_message="Rate 'Python' as a programming language.",
    prompt=RactoPrompt(
        role="You are a language evaluator.",
        aim="Return a score for the given language.",
        constraints=["value must be an integer from 1 to 10."],
        tone="Concise.",
        output_format=Score,
    ),
    response_model=Score,
    max_validation_retries=2,   # default — retry up to 2 times on bad output
)

try:
    response = kit.chat(config)
    print(response.parsed)   # {'label': 'Python', 'value': 9}
except ResponseModelValidationError as e:
    # All retries exhausted — inspect what went wrong
    print(f"Failed after {e.attempts} attempt(s)")
    print(f"Last Pydantic error: {e.last_error}")
    print(f"Raw LLM output:      {e.raw_response}")

ResponseModelValidationError attributes:

Attribute	Type	Meaning
`attempts`	`int`	Total API calls made (1 initial + N retries)
`last_error`	`pydantic.ValidationError`	The final Pydantic error
`raw_response`	`str | None`	Raw text from the last LLM attempt

max_validation_retries in ChatConfig:

Value	Behaviour
`0`	No retries — raise immediately on first validation failure
`1`	One retry after the initial call
`2`	Two retries (default)
`3–5`	More retries for complex schemas (max allowed: 5)

Streaming note: stream() and astream() cannot retry because content is already delivered token-by-token. If validation fails on the final chunk, ResponseModelValidationError is raised directly. Wrap your stream loop in try/except ResponseModelValidationError if you use response_model with streaming.

10. Multi-Turn Conversations

To have a conversation with memory, pass the history list to each ChatConfig:

from ractogateway import openai_developer_kit as gpt
from ractogateway._models.chat import Message, MessageRole
from ractogateway.prompts.engine import RactoPrompt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=RactoPrompt(
        role="You are a helpful AI assistant.",
        aim="Carry on a friendly conversation.",
        constraints=["Remember what the user said earlier."],
        tone="Casual and friendly.",
        output_format="text",
    ),
)

# Turn 1
response1 = kit.chat(gpt.ChatConfig(user_message="My name is Alice."))
print(response1.content)
# Nice to meet you, Alice! How can I help you today?

# Build the history from turn 1
history = [
    Message(role=MessageRole.USER, content="My name is Alice."),
    Message(role=MessageRole.ASSISTANT, content=response1.content),
]

# Turn 2 — the model now "remembers" turn 1
response2 = kit.chat(gpt.ChatConfig(
    user_message="What is my name?",
    history=history,   # <-- inject previous turns
))
print(response2.content)
# Your name is Alice! 😊

Tip: For production multi-user apps, use RedisChatMemory (see §18) to store history in Redis so it survives server restarts.
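Building the history list by hand gets repetitive after a few turns. A small helper like the one below can record each exchange and cap how many messages are kept. It is a plain-Python sketch: Turn and Conversation are invented stand-ins, and in real code you would convert each entry to a Message before passing it as history=.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    content: str


@dataclass
class Conversation:
    max_messages: int = 20  # hypothetical cap to keep requests small
    history: list[Turn] = field(default_factory=list)

    def record(self, user_message: str, assistant_reply: str) -> None:
        """Append one user/assistant exchange and drop the oldest messages."""
        self.history.append(Turn("user", user_message))
        self.history.append(Turn("assistant", assistant_reply))
        excess = len(self.history) - self.max_messages
        if excess > 0:
            del self.history[:excess]


conversation = Conversation(max_messages=4)
conversation.record("My name is Alice.", "Nice to meet you, Alice!")
conversation.record("I live in Paris.", "Paris is lovely.")
conversation.record("I like tea.", "Tea is a great choice.")

for turn in conversation.history:
    print(f"{turn.role}: {turn.content}")
```

A message cap like this is a blunt instrument: once the first exchange is dropped, the model no longer "knows" the user's name. Token-based truncation (§15.3) is the more precise tool.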

11. Streaming

Streaming lets you display the AI’s answer token by token as it is generated, so users see the first words almost immediately instead of waiting for the full response.

Synchronous Streaming

from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=RactoPrompt(
        role="You are a storyteller.",
        aim="Write a short story based on the user's prompt.",
        constraints=["Keep it under 100 words."],
        tone="Vivid and imaginative.",
        output_format="text",
    ),
)

config = gpt.ChatConfig(user_message="A robot discovers it can dream.")

for chunk in kit.stream(config):
    # chunk.delta.text is the new text in this chunk (may be empty string)
    print(chunk.delta.text, end="", flush=True)

    if chunk.is_final:
        print()  # newline after the story
        print(f"Finish reason: {chunk.finish_reason}")
        print(f"Total tokens:  {chunk.usage.get('total_tokens', '?')}")

Expected output (streaming, printed token-by-token):

In the hum of the server room, Unit-7 closed its optical sensors...
and dreamed of open fields and laughter it had never known.
When it woke, it understood why humans called sleep a gift.

Finish reason: FinishReason.STOP
Total tokens:  112

Asynchronous Streaming

import asyncio
from ractogateway import openai_developer_kit as gpt

# Reuses `kit` and `config` from the synchronous example above.
async def main():
    async for chunk in kit.astream(config):
        print(chunk.delta.text, end="", flush=True)
        if chunk.is_final:
            break

asyncio.run(main())

What is `StreamChunk`?

Field	Plain English	Technical
`delta.text`	New text arrived in this chunk	Incremental token string from the current event
`accumulated_text`	Everything generated so far	Concatenation of all previous `delta.text` values
`is_final`	Is this the last chunk?	`True` when `finish_reason` is set
`finish_reason`	Why did generation end?	`FinishReason.STOP`, `LENGTH`, or `TOOL_CALL`
`usage`	Token counts (only in final chunk)	Dict with `prompt_tokens`, `completion_tokens`, `total_tokens`
`tool_calls`	Tools the model wants to call	Non-empty list when `finish_reason == TOOL_CALL`
`parsed`	Parsed + validated object (if `response_model` set)	Available on final chunk only
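
To make the relationship between `delta.text`, `accumulated_text`, and `is_final` concrete, here is a small sketch that simulates a stream without calling any API. `Chunk` and `fake_stream` are hypothetical stand-ins that mirror only a few of the fields in the table; the real `StreamChunk` may be structured differently.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in mirroring a few StreamChunk fields."""
    delta_text: str
    accumulated_text: str
    is_final: bool
    finish_reason: Optional[str] = None


def fake_stream(tokens: list[str]) -> Iterator[Chunk]:
    """Yield one chunk per token, tracking the running concatenation."""
    accumulated = ""
    for i, token in enumerate(tokens):
        accumulated += token
        last = i == len(tokens) - 1
        # finish_reason is only set on the last chunk, which is what makes it final
        yield Chunk(token, accumulated, last, "stop" if last else None)


chunks = list(fake_stream(["Unit-7 ", "closed ", "its ", "sensors."]))
final = chunks[-1]
print(final.accumulated_text)
```

The final chunk's accumulated text equals the concatenation of every delta, so a UI can either append deltas as they arrive or simply redraw from `accumulated_text`.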

12. Tool Calling

Tool calling lets the LLM trigger your Python functions. It is useful for live data, calculators, search, and business actions.

Step 1 — Define tools and register them

from ractogateway.tools.registry import tool, ToolRegistry

registry = ToolRegistry()

@tool(registry)
def get_weather(city: str, unit: str = "celsius") -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is 22°{'C' if unit == 'celsius' else 'F'} and sunny."

@tool(registry)
def get_time(timezone: str) -> str:
    """Return the current time in the given timezone."""
    from datetime import datetime
    import zoneinfo

    tz = zoneinfo.ZoneInfo(timezone)
    return datetime.now(tz).strftime("%H:%M on %A, %d %B %Y")

print(list(registry.tools.keys()))  # ['get_weather', 'get_time']

You can also use @tool without a registry and register later:

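@tool
def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression such as "12 * 8"."""
    # Demo only: eval() runs arbitrary code, so never pass it untrusted input.
    return eval(expression)  # noqa: S307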

registry.register(calculate)

Step 2 — One-call final answer (recommended)

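Set auto_execute_tools=True to let the kit run the requested tools and feed their results back to the model for you. response.content then holds the final answer, just as it does for a request without tools.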

from ractogateway.prompts.engine import RactoPrompt
from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",
    default_prompt=RactoPrompt(
        role="You are a helpful assistant with access to live data tools.",
        aim="Answer the user's question using the available tools.",
        constraints=["Always use the tools when relevant."],
        tone="Helpful and precise.",
        output_format="text",
    ),
)

config = gpt.ChatConfig(
    user_message="What's the weather like in Paris and what time is it there?",
    tools=registry,
    auto_execute_tools=True,
    max_tool_turns=3,
)

response = kit.chat(config)
print(response.content)  # Final integrated answer

Step 3 — Manual tool loop (advanced)

If you prefer full control, keep auto_execute_tools=False (default) and execute response.tool_calls yourself.

response = kit.chat(
    gpt.ChatConfig(
        user_message="What's the weather in Tokyo and what is 12 * 8?",
        tools=registry,
    )
)

if response.tool_calls:
    for tc in response.tool_calls:
        fn = registry.get_callable(tc.name)
        if fn:
            print(tc.name, tc.arguments, "->", fn(**tc.arguments))

What is ToolCallResult? It has three fields: id (unique call ID from the API), name (function name), and arguments (dict ready to **unpack into your function).
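
A manual loop usually needs a little more than the print statement above: unknown tool names and bad arguments should be handled, and each result should be tied back to its call ID. The following sketch shows one way to do that with no library dependency. `ToolCall`, `TOOLS`, and `run_tool_calls` are hypothetical names, a plain dict stands in for `ToolRegistry`, and the shape of the result records is an assumption rather than a format RactoGateway requires.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    """Hypothetical stand-in with the three ToolCallResult fields."""
    id: str
    name: str
    arguments: dict[str, Any]


def get_weather(city: str, unit: str = "celsius") -> str:
    return f"The weather in {city} is 22°{'C' if unit == 'celsius' else 'F'} and sunny."


# Plain dict used in place of ToolRegistry for this sketch.
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(calls: list[ToolCall]) -> list[dict[str, str]]:
    """Execute each call and return one result record per call ID."""
    results = []
    for call in calls:
        fn = TOOLS.get(call.name)
        if fn is None:
            output = f"error: unknown tool {call.name!r}"
        else:
            try:
                output = str(fn(**call.arguments))
            except TypeError as exc:  # wrong or missing arguments from the model
                output = f"error: {exc}"
        results.append({"tool_call_id": call.id, "name": call.name, "content": output})
    return results


results = run_tool_calls([
    ToolCall("call_1", "get_weather", {"city": "Tokyo"}),
    ToolCall("call_2", "get_stock", {"ticker": "ACME"}),
])
print(json.dumps(results, indent=2, ensure_ascii=False))
```

Returning an error string instead of raising keeps the loop going, so the model can be told what went wrong on the next turn.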

13. File Attachments

Send images, PDFs, and text files alongside your text message using RactoFile.

from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway import openai_developer_kit as gpt

prompt = RactoPrompt(
    role="You are a visual QA assistant.",
    aim="Describe what you see in the attached image.",
    constraints=["Be specific about colours, shapes, and text visible in the image."],
    tone="Descriptive and precise.",
    output_format="text",
)

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",   # must be a vision-capable model
    default_prompt=prompt,
)

# Load an image from disk (MIME type is auto-detected)
image = RactoFile.from_path("/path/to/screenshot.png")

# Or from raw bytes:
# image = RactoFile.from_bytes(open("photo.jpg","rb").read(), "image/jpeg")

messages = prompt.to_messages(
    user_message="What is shown in this image?",
    attachments=[image],
    provider="openai",   # formats content blocks for the correct provider
)

# You can also just use kit.chat() with a ChatConfig — attachments can be
# baked into the prompt's to_messages() call directly

`RactoFile` Parameter Reference

Method / Param	Plain English	Technical
`RactoFile.from_path(path)`	Load a file from your disk	Reads bytes and auto-detects MIME type via `mimetypes.guess_type`
`RactoFile.from_bytes(data, mime_type)`	Create from raw bytes you already have	No disk I/O; pass `bytes` + an explicit MIME type string
`data`	The file’s raw bytes	`bytes` object
`mime_type`	What type of file it is	MIME string: `"image/png"`, `"image/jpeg"`, `"application/pdf"`, `"text/plain"`, etc.
`name`	An optional filename label	`str`; used for display/debugging only
`is_image`	Is it a picture?	`True` for JPEG, PNG, GIF, WEBP
`is_pdf`	Is it a PDF?	`True` for `application/pdf`
`base64_data`	File as a base64 string	Used internally by the provider adapters
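
The table mentions MIME detection via `mimetypes.guess_type` and base64 encoding for the provider adapters. The sketch below shows both steps using only the standard library. `FileBlob` and `blob_from_name` are hypothetical names for illustration and are not the `RactoFile` implementation; the fallback to `application/octet-stream` is also an assumption.

```python
import base64
import mimetypes
from dataclasses import dataclass


@dataclass
class FileBlob:
    """Hypothetical stand-in for RactoFile; not the library's class."""
    data: bytes
    mime_type: str
    name: str = ""

    @property
    def is_image(self) -> bool:
        return self.mime_type in {"image/jpeg", "image/png", "image/gif", "image/webp"}

    @property
    def is_pdf(self) -> bool:
        return self.mime_type == "application/pdf"

    @property
    def base64_data(self) -> str:
        return base64.b64encode(self.data).decode("ascii")


def blob_from_name(name: str, data: bytes) -> FileBlob:
    """Guess the MIME type from the filename extension."""
    mime, _ = mimetypes.guess_type(name)
    # Assumed fallback for unknown extensions; the library may behave differently.
    return FileBlob(data=data, mime_type=mime or "application/octet-stream", name=name)


png = blob_from_name("screenshot.png", b"\x89PNG\r\n\x1a\n")
pdf = blob_from_name("manual.pdf", b"%PDF-1.7")
print(png.mime_type, png.is_image, pdf.mime_type, pdf.is_pdf)
print(png.base64_data)
```

Note that `mimetypes.guess_type` looks only at the file extension, never at the bytes, so a mislabelled file gets the wrong MIME type. Pass an explicit type in that case.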

14. Embeddings

Embeddings convert text into lists of numbers (vectors) where semantically similar texts end up numerically close. This powers semantic search, clustering, and RAG.

from ractogateway import openai_developer_kit as gpt
from ractogateway._models.embedding import EmbeddingConfig

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

config = EmbeddingConfig(
    texts=["Python is a programming language.", "I love apples.", "Java is also a language."],
    # Plain: "The list of strings to convert into number vectors"
    # Technical: List[str] passed to openai.embeddings.create(input=...)

    model="text-embedding-3-small",
    # Plain: "Which embedding model to use"
    # Technical: Overrides kit.embedding_model for this specific call.
    #            None means use the kit's default.

    dimensions=None,
    # Plain: "How many numbers should each vector have?"
    # Technical: Optional int. None keeps the model's native size (1536 for
    #            text-embedding-3-small, 3072 for text-embedding-3-large). For
    #            text-embedding-3-*, pass a smaller value (e.g. 256) for faster
    #            similarity search.
)

response = kit.embed(config)

for vec in response.vectors:
    print(f"Text:    {vec.text!r}")
    print(f"Index:   {vec.index}")
    print(f"Vector:  [{vec.embedding[0]:.4f}, {vec.embedding[1]:.4f}, ...]  (length {len(vec.embedding)})")
    print()

Expected output:

Text:    'Python is a programming language.'
Index:   0
Vector:  [0.0123, -0.0456, ...]  (length 1536)

Text:    'I love apples.'
Index:   1
Vector:  [-0.0234, 0.0789, ...]  (length 1536)

Text:    'Java is also a language.'
Index:   2
Vector:  [0.0118, -0.0451, ...]  (length 1536)

Pro tip: Texts 0 and 2 will have very similar vectors because they are semantically related (“programming languages”). Text 1 will be far from both. This is the essence of embedding-powered semantic search.
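
"Numerically close" is usually measured with cosine similarity. The sketch below computes it for three toy vectors. The numbers are invented for illustration (real embeddings have hundreds or thousands of dimensions), so only the relative ordering matters.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for this example.
python_vec = [0.9, 0.1, 0.0]
apples_vec = [0.0, 0.2, 0.9]
java_vec = [0.8, 0.2, 0.1]

print(f"python vs java:   {cosine_similarity(python_vec, java_vec):.3f}")
print(f"python vs apples: {cosine_similarity(python_vec, apples_vec):.3f}")
```

With real embeddings you would pass `response.vectors[i].embedding` in place of the toy lists.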

15. Performance & Cost Optimisation

15.1 Exact Match Cache

Plain English: If someone asks the exact same question again (same words, same settings), return the cached answer instantly — no API call, no cost.

Technical: SHA-256 keyed over (user_message, system_prompt, model, temperature, max_tokens). LRU eviction with optional TTL. Thread-safe via threading.Lock.

from ractogateway import openai_developer_kit as gpt
from ractogateway.cache import ExactMatchCache

cache = ExactMatchCache(
    max_size=1024,
    # Plain: "How many answers to remember at most"
    # Technical: LRU capacity. When full, the least-recently-used entry is evicted.
    #            0 = unlimited (no eviction ever).

    ttl_seconds=3600,
    # Plain: "Forget an answer after this many seconds"
    # Technical: Float. Entries older than ttl_seconds are treated as cache misses
    #            and lazily evicted on next access. None = never expire.
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", exact_cache=cache)

# First call — hits the API
r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
print(r1.content)   # Paris is the capital of France.

# Second call (identical) — served from cache in microseconds, $0 cost
r2 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
print(r2.content)   # Paris is the capital of France.

print(cache.stats)  # CacheStats(hits=1, misses=1, size=1)
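
The mechanism described above (SHA-256 key, LRU eviction, lazy TTL expiry) can be sketched with the standard library. This is an illustration under stated assumptions, not the `ExactMatchCache` source: `cache_key` and `TinyLRUCache` are hypothetical names, the JSON encoding of the key fields is an assumption, and the sketch has no locking.

```python
import hashlib
import json
import time
from collections import OrderedDict


def cache_key(user_message, system_prompt, model, temperature, max_tokens):
    """SHA-256 over a canonical JSON encoding of the request fields."""
    payload = json.dumps(
        [user_message, system_prompt, model, temperature, max_tokens],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TinyLRUCache:
    """Hypothetical LRU + TTL cache for illustration; no locking, so not thread-safe."""

    def __init__(self, max_size=1024, ttl_seconds=None, clock=time.monotonic):
        self.max_size, self.ttl, self.clock = max_size, ttl_seconds, clock
        self._store = OrderedDict()  # key -> (stored_at, value)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and self.ttl is not None and self.clock() - entry[0] > self.ttl:
            del self._store[key]  # lazily evict the expired entry
            entry = None
        if entry is None:
            self.misses += 1
            return None
        self._store.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return entry[1]

    def put(self, key, value):
        self._store[key] = (self.clock(), value)
        self._store.move_to_end(key)
        if self.max_size and len(self._store) > self.max_size:
            self._store.popitem(last=False)  # drop the least recently used entry


demo_cache = TinyLRUCache(max_size=2, ttl_seconds=3600)
key = cache_key("What is the capital of France?", "You are helpful.", "gpt-4o-mini", 0.0, 256)
first = demo_cache.get(key)            # miss: nothing stored yet
demo_cache.put(key, "Paris is the capital of France.")
second = demo_cache.get(key)           # hit: identical request fields give the same key
print(first, second, demo_cache.hits, demo_cache.misses)
```

Because every field is part of the key, changing only the temperature or the system prompt produces a different hash and therefore a cache miss.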

15.2 Semantic Cache

Plain English: Even if the question is worded differently, return the cached answer if it means the same thing.

Technical: Embeds each new query and computes cosine similarity against stored embeddings. Returns the cached response when similarity ≥ threshold.

from ractogateway.cache import SemanticCache
import ractogateway.openai_developer_kit as gpt

# You supply an embedding function — any callable (str) -> list[float]
kit_for_embed = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

def embed(text: str) -> list[float]:
    from ractogateway._models.embedding import EmbeddingConfig
    resp = kit_for_embed.embed(EmbeddingConfig(texts=[text]))
    return resp.vectors[0].embedding

sem_cache = SemanticCache(
    embedder=embed,
    # Plain: "A function that converts text to a list of numbers"
    # Technical: Callable[[str], list[float]]. Called once for each new query
    #            to compute its embedding for similarity comparison.

    similarity_threshold=0.92,
    # Plain: "How similar does a question have to be to reuse a cached answer?"
    # Technical: Float in (0, 1]. Cosine similarity minimum. Higher = stricter match.
    #            0.92 works well; lower (e.g. 0.85) gives more cache hits but may
    #            return wrong answers for loosely-related questions.

    max_size=512,
    # Plain: "How many answers to remember"
    # Technical: LRU capacity for the semantic cache store.
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", semantic_cache=sem_cache)

# First call
r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
# → API call happens

# Different wording, same meaning — cache HIT (if similarity >= 0.92)
r2 = kit.chat(gpt.ChatConfig(user_message="Which city is France's capital?"))
# → No API call; cached answer returned
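
The lookup logic can be sketched without any API calls. In the example below, `toy_embed` is a deliberately crude bag-of-letters embedder and `TinySemanticCache` is a hypothetical class; both are assumptions for illustration. A real embedding model captures meaning far better, and the library's `SemanticCache` may store and search entries differently.

```python
import math
from collections import Counter


def toy_embed(text: str) -> list[float]:
    """Bag-of-letters vector; a crude stand-in for a real embedding model."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinySemanticCache:
    """Hypothetical linear-scan semantic cache (no LRU, no persistence)."""

    def __init__(self, embedder, similarity_threshold=0.92):
        self.embedder = embedder
        self.threshold = similarity_threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query):
        q = self.embedder(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        # Only reuse the answer when the best match clears the threshold.
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embedder(query), response))


strict = TinySemanticCache(toy_embed, similarity_threshold=0.92)
strict.put("What is the capital of France?", "Paris is the capital of France.")
print(strict.get("what is the capital of france"))   # identical letter counts: hit
print(strict.get("How do I bake bread?"))            # unrelated question: miss
```

With this toy embedder the paraphrase "Which city is France's capital?" scores roughly 0.87, so it misses at a threshold of 0.92 but hits at 0.85. That is the same trade-off the `similarity_threshold` comment describes: a lower threshold gives more hits at a higher risk of returning the wrong answer.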

15.3 Token Truncation

Plain English: Long conversations can overflow the AI’s memory limit. The truncator automatically cuts old messages to keep things within bounds.

Technical: Sliding-window strategy over ChatConfig.history. Keeps keep_first_n messages and keep_last_n messages; drops the middle. Uses len(text) // 4 as a token estimator by default, or tiktoken for precision.

from ractogateway.truncation import TokenTruncator, TruncationConfig, MODEL_CONTEXT_LIMITS
from ractogateway import openai_developer_kit as gpt

truncator = TokenTruncator(TruncationConfig(
    keep_first_n=2,
    # Plain: "Always keep the first N history messages (e.g. important instructions)"
    # Technical: int. These messages are never evicted, regardless of token count.

    keep_last_n=8,
    # Plain: "Always keep the most recent N messages"
    # Technical: int. Recent context is preserved; only the 'middle' is dropped.

    safety_margin=512,
    # Plain: "Leave room for the model's reply"
    # Technical: Tokens reserved for the completion. Effective limit =
    #            context_window - safety_margin.

    token_counter=None,
    # Plain: "How to count tokens (leave blank for fast estimate)"
    # Technical: Optional Callable[[str], int]. When None, uses len(text) // 4.
    #            For precision, pass a tiktoken-based counter, e.g. with
    #            enc = tiktoken.encoding_for_model("gpt-4o"):
    #            lambda t: len(enc.encode(t))
))

kit = gpt.OpenAIDeveloperKit(model="gpt-4o", truncator=truncator)
# Now every kit.chat() / kit.achat() call will auto-trim history before sending.

# Check the context limit for any model:
print(MODEL_CONTEXT_LIMITS["gpt-4o"])         # 128000
print(MODEL_CONTEXT_LIMITS["gpt-4o-mini"])    # 128000
print(MODEL_CONTEXT_LIMITS["claude-opus-4-6"])  # 200000
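
The sliding-window idea can be sketched in a few lines. This is an illustration, not the `TokenTruncator` source: `truncate_history` is a hypothetical function, messages are plain strings, and the choice to drop the oldest middle messages one at a time (rather than the whole middle at once) is an assumption.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token."""
    return len(text) // 4


def truncate_history(messages, context_window, keep_first_n=2, keep_last_n=8,
                     safety_margin=512, token_counter=estimate_tokens):
    """Drop the oldest 'middle' messages until the history fits the budget."""
    budget = context_window - safety_margin
    head = messages[:keep_first_n]
    rest = messages[keep_first_n:]
    tail = rest[-keep_last_n:] if keep_last_n else []
    middle = rest[:len(rest) - len(tail)]

    def total(msgs):
        return sum(token_counter(m) for m in msgs)

    # Evict from the oldest end of the middle first; head and tail are pinned.
    while middle and total(head) + total(middle) + total(tail) > budget:
        middle.pop(0)
    return head + middle + tail


# Twenty messages of roughly 100 estimated tokens each.
messages = [f"message {i}: " + "x" * 400 for i in range(20)]
kept = truncate_history(messages, context_window=2000)
print(len(messages), "->", len(kept), "messages kept")
```

Note that the pinned head and tail are never evicted in this sketch, so if they alone exceed the budget the result is still over the limit. A production truncator needs a policy for that case.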

15.4 Cost-Aware Routing

Plain English: Not every question needs the most expensive model. Automatically send simple questions to a cheap model and hard questions to a powerful one.

Technical: Scores each prompt (0–100) based on length, question complexity markers, and keyword signals. Routes to the first RoutingTier whose max_score >= score. Adapters are pooled for O(1) model switching.

from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway import openai_developer_kit as gpt

router = CostAwareRouter([
    RoutingTier(
        model="gpt-4o-mini",
        max_score=30,
        # Plain: "Use this cheap model for easy questions (score 0–30)"
        # Technical: First tier. model= is the ID passed to the adapter.
        #            max_score= is the upper bound of the score range this tier handles.
    ),
    RoutingTier(
        model="gpt-4o",
        max_score=70,
        # Plain: "Use this mid-tier model for moderate questions (score 31–70)"
    ),
    RoutingTier(
        model="o3-mini",
        max_score=100,
        # Plain: "Use this powerful (expensive) model for hard questions (score 71–100)"
        # Technical: Final tier; also the fallback if no earlier tier matches.
    ),
])

kit = gpt.OpenAIDeveloperKit(
    model="auto",    # <-- REQUIRED when using a router
    router=router,
)

# "2+2" → very low complexity score → routed to gpt-4o-mini (cheapest)
r1 = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?"))
print(r1.content)        # 4
print(r1.raw.model)      # gpt-4o-mini  (model name lives in the raw provider object)

# Complex reasoning → high score → routed to o3-mini
r2 = kit.chat(gpt.ChatConfig(
    user_message=(
        "Explain the mathematical proof of Gödel's incompleteness theorem "
        "and its implications for formal systems and computability theory."
    )
))
print(r2.raw.model)      # o3-mini
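
The tier selection rule ("first tier whose max_score >= score") is easy to sketch on its own. In the example below, `Tier`, `toy_score`, and `pick_model` are hypothetical names, and the scoring heuristic is invented purely for illustration; the real scorer's signals and weights are not documented here.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    model: str
    max_score: int


def toy_score(prompt: str) -> int:
    """Invented heuristic: length plus a bonus per 'hard' keyword, capped at 100."""
    hard_words = ("prove", "proof", "explain", "derive", "theorem", "implications")
    length_points = min(len(prompt) // 4, 50)
    keyword_points = 15 * sum(word in prompt.lower() for word in hard_words)
    return min(length_points + keyword_points, 100)


def pick_model(tiers: list[Tier], score: int) -> str:
    """Return the first tier whose max_score covers the score; else the top tier."""
    for tier in sorted(tiers, key=lambda t: t.max_score):
        if score <= tier.max_score:
            return tier.model
    return max(tiers, key=lambda t: t.max_score).model


tiers = [Tier("gpt-4o-mini", 30), Tier("gpt-4o", 70), Tier("o3-mini", 100)]
easy = "What is 2 + 2?"
hard = ("Explain the mathematical proof of Gödel's incompleteness theorem "
        "and its implications for formal systems and computability theory.")
print(pick_model(tiers, toy_score(easy)))
print(pick_model(tiers, toy_score(hard)))
```

Sorting by `max_score` makes the result independent of the order in which tiers were declared, and the fallback to the top tier covers any score above the highest bound.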

Combining All Middleware

from ractogateway import openai_developer_kit as gpt
from ractogateway.cache import ExactMatchCache, SemanticCache
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway.truncation import TokenTruncator, TruncationConfig

kit = gpt.OpenAIDeveloperKit(
    model="auto",
    router=CostAwareRouter([
        RoutingTier(model="gpt-4o-mini", max_score=30),
        RoutingTier(model="gpt-4o",      max_score=100),
    ]),
    exact_cache=ExactMatchCache(max_size=2048, ttl_seconds=7200),
    semantic_cache=SemanticCache(embedder=embed, similarity_threshold=0.90),
    truncator=TokenTruncator(TruncationConfig(keep_last_n=10, safety_margin=1024)),
)
# Each request flows: exact cache → semantic cache → route → truncate → API call

16. All Five Developer Kits

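All five kits share the same method names and signatures: chat(), achat(), stream(), astream(), embed(), aembed(). Swap the import alias and kit name and everything else stays the same. The one caveat is embeddings with Claude: Anthropic has no native embeddings API (see §16.3).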

Kit	Alias	Env var	Offline?
`OpenAIDeveloperKit`	`gpt`	`OPENAI_API_KEY`	No
`GoogleDeveloperKit`	`gemini`	`GOOGLE_API_KEY`	No
`AnthropicDeveloperKit`	`claude`	`ANTHROPIC_API_KEY`	No
`OllamaDeveloperKit`	`local`	—	Yes
`HuggingFaceDeveloperKit`	`hf`	`HF_TOKEN` (optional)	Optional

16.1 OpenAIDeveloperKit (GPT)

The primary examples throughout this guide use OpenAIDeveloperKit. A quick recap:

from ractogateway import openai_developer_kit as gpt, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly.",
    constraints=["Be concise."],
    tone="Friendly",
    output_format="text",
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)
response = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?"))
print(response.content)  # "4"

Install: pip install ractogateway[openai] · Key env var: OPENAI_API_KEY

16.2 GoogleDeveloperKit (Gemini)

from ractogateway import google_developer_kit as gemini
from ractogateway.prompts.engine import RactoPrompt

kit = gemini.GoogleDeveloperKit(
    model="gemini-2.0-flash",    # or "gemini-2.0-pro"
    api_key="AIza...",           # or set GOOGLE_API_KEY env var
)

prompt = RactoPrompt(
    role="You are a creative writing assistant.",
    aim="Write a haiku about the given subject.",
    constraints=["Must follow 5-7-5 syllable structure."],
    tone="Poetic and thoughtful.",
    output_format="text",
)

response = kit.chat(gemini.ChatConfig(
    user_message="Write a haiku about rain.",
    prompt=prompt,
))
print(response.content)
# Silver drops descend —
# Earth drinks its ancient thirst deep.
# Mud sings after rain.

16.3 AnthropicDeveloperKit (Claude)

from ractogateway import anthropic_developer_kit as claude
from ractogateway.prompts.engine import RactoPrompt

kit = claude.AnthropicDeveloperKit(
    model="claude-sonnet-4-6",
    # or "claude-opus-4-6", "claude-haiku-4-5-20251001"
    api_key="sk-ant-...",  # or set ANTHROPIC_API_KEY env var
)

prompt = RactoPrompt(
    role="You are an expert code reviewer.",
    aim="Review the code snippet and identify any bugs or improvements.",
    constraints=[
        "Be specific — cite line numbers.",
        "Prioritise correctness over style.",
    ],
    tone="Technical and direct.",
    output_format="markdown",
)

response = kit.chat(claude.ChatConfig(
    user_message="def divide(a, b): return a / b",
    prompt=prompt,
))
print(response.content)

Install: pip install ractogateway[anthropic] · Key env var: ANTHROPIC_API_KEY

Note: Anthropic does not provide a native embeddings API. Call embed() / aembed() via OpenAIDeveloperKit or GoogleDeveloperKit instead when you need vectors alongside Claude chat.

16.4 OllamaDeveloperKit (Local / Offline)

Run any open-source model on your own hardware — no API key, no data leaving your machine.

Prerequisites:

# 1. Install Ollama  →  https://ollama.com/download
# 2. Pull a model
ollama pull llama3.2          # 2 GB general-purpose
ollama pull nomic-embed-text  # 274 MB embeddings model
# 3. Install the Python extra
pip install ractogateway[ollama]

from ractogateway import ollama_developer_kit as local, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer questions concisely.",
    constraints=["Do not hallucinate."],
    tone="Friendly",
    output_format="text",
)

# Ollama listens at http://localhost:11434 by default — no key needed
kit = local.Chat(model="llama3.2", default_prompt=prompt)

response = kit.chat(local.ChatConfig(user_message="What is a neural network?"))
print(response.content)

Streaming:

for chunk in kit.stream(local.ChatConfig(user_message="Tell me a joke.")):
    print(chunk.delta.text, end="", flush=True)

Embeddings (requires a dedicated embedding model):

# nomic-embed-text was pulled in the prerequisites above
resp = kit.embed(local.EmbeddingConfig(texts=["hello", "world"], model="nomic-embed-text"))
print(resp.vectors[0].embedding[:5])

Embedded server management — start Ollama programmatically:

with local.OllamaServerManager(port=11500) as srv:
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
    print(kit.chat(local.ChatConfig(user_message="Hello!")).content)
# server stops automatically

See the full guide: Ollama — Local Model Inference

16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM)

Three deployment modes through one interface:

Mode	When to use
HF Inference API (cloud)	Quick prototyping; set `HF_TOKEN`
Local TGI	Self-hosted Text Generation Inference
Local vLLM / llama.cpp	Any OpenAI-compatible HTTP server

pip install ractogateway[huggingface]
export HF_TOKEN="hf_..."   # obtain at https://huggingface.co/settings/tokens

Cloud inference:

from ractogateway import huggingface_developer_kit as hf, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly.",
    constraints=["Stay on topic."],
    tone="Friendly",
    output_format="text",
)

kit = hf.Chat(
    model="meta-llama/Llama-3.2-3B-Instruct",
    default_prompt=prompt,
)
response = kit.chat(hf.ChatConfig(user_message="Explain transformers briefly."))
print(response.content)

Local TGI server (no API key):

kit = hf.Chat(
    model="tgi",
    base_url="http://localhost:8080",
    default_prompt=prompt,
)

Embeddings:

resp = kit.embed(
    hf.EmbeddingConfig(texts=["hello world", "goodbye world"])
)
print(f"dim={len(resp.vectors[0].embedding)}")

See the full guide: HuggingFace — Cloud and Local Inference

17. RAG — Retrieval-Augmented Generation

Plain English: RAG lets the AI answer questions about your own documents. You feed it your files, it converts them into searchable number vectors, and when someone asks a question, it finds the relevant parts and feeds them to the AI.

Technical: Full pipeline: FileReaderRegistry → chunker → ProcessingPipeline → embedder → vector store → similarity search → RactoPrompt context injection.

Complete RAG Pipeline Example

from ractogateway.rag import RactoRAG
from ractogateway.rag.embedders import OpenAIEmbedder
from ractogateway.rag.stores import InMemoryVectorStore
from ractogateway.rag.chunkers import RecursiveChunker
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

# 1. Build the RAG pipeline
rag = RactoRAG(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    store=InMemoryVectorStore(),   # swap for ChromaStore, FAISSStore, etc. in production
    chunker=RecursiveChunker(chunk_size=512, overlap=64),
)

# 2. Ingest your documents
rag.add_documents([
    "/path/to/product_manual.pdf",
    "/path/to/faq.docx",
    "/path/to/release_notes.txt",
])

# 3. At query time, retrieve relevant chunks
results = rag.retrieve("How do I reset my password?", top_k=3)

# 4. Inject retrieved context into a RactoPrompt
context = "\n\n".join(r.chunk.text for r in results)

prompt = RactoPrompt(
    role="You are a product support assistant.",
    aim="Answer the user's question based strictly on the provided documentation.",
    constraints=["Only use information from the CONTEXT section.", "Quote the source if possible."],
    tone="Helpful and precise.",
    output_format="text",
    context=context,    # <-- the retrieved chunks go here
)

# 5. Ask the AI
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)
response = kit.chat(gpt.ChatConfig(user_message="How do I reset my password?"))
print(response.content)

Chunkers Explained

Chunker	Plain English	Best For
`FixedChunker`	Split every N characters, no mercy	Quick prototyping, structured data
`RecursiveChunker`	Split at sentence/paragraph boundaries, then fall back to characters	General documents (best default)
`SentenceChunker`	Always split at sentence boundaries	Articles, legal text, Q&A content
`SemanticChunker`	Group sentences that are about the same topic	Complex documents with topic shifts
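
To show what `chunk_size` and `overlap` mean in practice, here is a minimal fixed-size chunker. `fixed_chunks` is a hypothetical function that measures sizes in characters; it is not the library's `FixedChunker`, and whether the real chunkers count characters or tokens is not stated in this guide.

```python
def fixed_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 100 characters of repeating alphabet, split into 40-character windows.
text = "".join(chr(ord("a") + i % 26) for i in range(100))
chunks = fixed_chunks(text, chunk_size=40, overlap=10)
for c in chunks:
    print(len(c), c[:10])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing and embedding some text twice.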

Vector Stores Explained

Store	Plain English	When to Use
`InMemoryVectorStore`	Fast in-RAM store; lost on restart	Development, prototyping, tests
`ChromaStore`	Local persistent store	Single-server apps, local dev
`FAISSStore`	Facebook’s ultra-fast similarity search	Millions of vectors, CPU-only
`PineconeStore`	Fully managed cloud vector DB	Production, no infra to manage
`QdrantStore`	Open-source, filterable, scalable	Production with metadata filtering
`WeaviateStore`	Open-source with built-in ML	Multi-modal + graph features
`MilvusStore`	Distributed vector DB	Billions of vectors at scale
`PGVectorStore`	PostgreSQL extension	Already using Postgres

18. Redis — Production Infrastructure

Redis tools make your app production-ready: distributed cache, per-user rate limiting, and persistent chat memory that survives deployments.

pip install "ractogateway[redis]"

18.1 Distributed Exact Cache

Drop-in replacement for ExactMatchCache that works across multiple server replicas.

from ractogateway.redis import RedisExactCache
from ractogateway import openai_developer_kit as gpt

cache = RedisExactCache(
    url="redis://localhost:6379/0",
    # Plain: "Where is your Redis server?"
    # Technical: Redis connection URL. Alternatively pass client= with a pre-built
    #            redis.Redis instance.

    ttl_seconds=3600,
    # Plain: "Forget cached answers after 1 hour"
    # Technical: TTL applied via Redis EXPIRE on each key write.
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache)
# Now all your servers share the same cache!

18.2 Rate Limiter

Prevent users from making too many expensive requests.

from ractogateway.redis import RedisRateLimiter, RateLimitConfig

limiter = RedisRateLimiter(
    url="redis://localhost:6379/0",
    config=RateLimitConfig(
        max_tokens_per_minute=5_000,
        # Plain: "Each user can use at most 5,000 tokens per minute"
        # Technical: Sliding 1-minute window. Counter stored as Redis sorted set per user_id.

        key_prefix="rl:",
        # Plain: "A label to group all rate limit keys in Redis"
        # Technical: String prefix for Redis keys: "{key_prefix}{user_id}"
    ),
)

# In your request handler:
user_id = "user-42"
estimated_tokens = 200

if not limiter.check_and_consume(user_id, tokens=estimated_tokens):
    raise RuntimeError("Rate limit exceeded — please try again in a minute.")

remaining = limiter.get_remaining(user_id)
print(f"Tokens remaining this minute: {remaining}")
# Tokens remaining this minute: 4800
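
The sliding-window accounting can be illustrated in memory. The sketch below is an assumption-laden stand-in, not `RedisRateLimiter`: `SlidingWindowLimiter` is a hypothetical class that keeps a per-user deque of (timestamp, tokens) entries where the real limiter is described as using a Redis sorted set.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory sketch of a per-user token budget over a sliding window."""

    def __init__(self, max_tokens_per_minute, window_seconds=60.0, clock=time.monotonic):
        self.limit = max_tokens_per_minute
        self.window = window_seconds
        self.clock = clock
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, tokens)

    def _prune(self, user_id):
        events = self._events[user_id]
        cutoff = self.clock() - self.window
        while events and events[0][0] <= cutoff:
            events.popleft()  # entry has slid out of the window
        return events

    def get_remaining(self, user_id):
        used = sum(tokens for _, tokens in self._prune(user_id))
        return self.limit - used

    def check_and_consume(self, user_id, tokens):
        if tokens > self.get_remaining(user_id):
            return False
        self._events[user_id].append((self.clock(), tokens))
        return True


# A fake clock makes the window behaviour easy to observe.
now = [0.0]
demo_limiter = SlidingWindowLimiter(5_000, clock=lambda: now[0])
print(demo_limiter.check_and_consume("user-42", 200))    # allowed
print(demo_limiter.get_remaining("user-42"))             # budget left in this window
print(demo_limiter.check_and_consume("user-42", 4_900))  # would exceed the budget
now[0] = 61.0                                            # first request has aged out
print(demo_limiter.get_remaining("user-42"))
```

Unlike a fixed per-minute counter, a sliding window never resets all at once: each request frees its tokens exactly one window after it was made.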

18.3 Chat Memory

Store conversation history in Redis so it survives server restarts and scales across replicas.

from ractogateway.redis import RedisChatMemory, ChatMemoryConfig
from ractogateway._models.chat import Message, MessageRole

memory = RedisChatMemory(
    url="redis://localhost:6379/0",
    config=ChatMemoryConfig(
        max_turns=20,
        # Plain: "Remember the last 20 turns (up to 40 messages) per conversation"
        # Technical: Redis List capped to 2*max_turns entries (each turn = 2 messages).
        #            Older messages are popped from the front automatically.

        ttl_seconds=1800,
        # Plain: "Forget the conversation after 30 minutes of inactivity"
        # Technical: TTL reset on every append() call.

        key_prefix="chat:",
        # Plain: "Label all conversation keys in Redis"
        # Technical: Redis keys = "{key_prefix}{conv_id}"
    ),
)

# When a user sends a message:
conv_id = "session-abc123"
memory.append(conv_id, "user", "What's the best way to learn Python?")

# After getting the AI response:
memory.append(conv_id, "assistant", "Start with the official tutorial, then build projects!")

# Reconstruct history for the next request:
history_dicts = memory.get_history(conv_id)
# [{"role": "user", "content": "What's the best way..."}, {"role": "assistant", "content": "..."}]

history = [Message(role=m["role"], content=m["content"]) for m in history_dicts]

# Pass to ChatConfig:
response = kit.chat(gpt.ChatConfig(
    user_message="What resources do you recommend?",
    history=history,
))

# Wipe the conversation when the session ends:
memory.clear(conv_id)
print(memory.count(conv_id))  # 0
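
The capping behaviour (at most 2 * max_turns messages, oldest dropped first) can be sketched in memory with `collections.deque`. `TinyChatMemory` is a hypothetical stand-in for illustration; it has no TTL and no Redis, and the real `RedisChatMemory` may trim its list differently.

```python
from collections import deque


class TinyChatMemory:
    """In-memory stand-in for a capped per-conversation message list (no TTL)."""

    def __init__(self, max_turns=20):
        self.max_messages = 2 * max_turns  # one turn = user message + assistant reply
        self._conversations = {}

    def append(self, conv_id, role, content):
        history = self._conversations.setdefault(conv_id, deque(maxlen=self.max_messages))
        history.append({"role": role, "content": content})  # oldest entry drops off when full

    def get_history(self, conv_id):
        return list(self._conversations.get(conv_id, ()))

    def count(self, conv_id):
        return len(self._conversations.get(conv_id, ()))

    def clear(self, conv_id):
        self._conversations.pop(conv_id, None)


demo_memory = TinyChatMemory(max_turns=2)   # keeps at most 4 messages
for i in range(3):
    demo_memory.append("session-abc123", "user", f"question {i}")
    demo_memory.append("session-abc123", "assistant", f"answer {i}")

for message in demo_memory.get_history("session-abc123"):
    print(message["role"], "-", message["content"])
```

After three turns only the last two survive, which is the same effect the `max_turns` setting has on a long-running conversation.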

19. Common Mistakes & How to Fix Them

Mistake 1: Using `output` instead of `output_format` in RactoPrompt

# WRONG — this will raise a Pydantic ValidationError
prompt = RactoPrompt(
    role="...", aim="...", constraints=["..."], tone="...",
    output="text",    # ❌  field is called output_format, not output!
)

# CORRECT
prompt = RactoPrompt(
    role="...", aim="...", constraints=["..."], tone="...",
    output_format="text",   # ✅
)

Mistake 2: Forgetting at least one constraint

# WRONG — constraints cannot be an empty list
prompt = RactoPrompt(
    role="...", aim="...", constraints=[],   # ❌ ValidationError: min_length=1
    tone="...", output_format="text",
)

# CORRECT
prompt = RactoPrompt(
    role="...", aim="...",
    constraints=["Be helpful."],   # ✅ at least one constraint required
    tone="...", output_format="text",
)

Mistake 3: Using `model="auto"` without a router

# WRONG — raises ValueError immediately
kit = gpt.OpenAIDeveloperKit(model="auto")   # ❌

# CORRECT
kit = gpt.OpenAIDeveloperKit(
    model="auto",
    router=CostAwareRouter([...]),   # ✅
)

Mistake 4: Neither ChatConfig.prompt nor kit.default_prompt is set

# WRONG — raises ValueError when chat() is called
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")   # no default_prompt
response = kit.chat(gpt.ChatConfig(user_message="Hello"))  # ❌

# FIX OPTION 1: Set default_prompt on the kit
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=my_prompt)

# FIX OPTION 2: Pass prompt in ChatConfig
response = kit.chat(gpt.ChatConfig(user_message="Hello", prompt=my_prompt))

Mistake 5: Expecting typed validation but not setting it explicitly

# BEST PRACTICE — set response_model explicitly
prompt = RactoPrompt(..., output_format=WeatherReport)
config = gpt.ChatConfig(
    user_message="...",
    response_model=WeatherReport,   # ✅ explicit validation contract
)

# ALSO SUPPORTED — inferred automatically from output_format model
prompt = RactoPrompt(..., output_format=WeatherReport)
config = gpt.ChatConfig(user_message="...")  # ✅ inferred from prompt.output_format

Mistake 6: Missing `await` on async methods

# WRONG — this returns a coroutine object, not a response
response = kit.achat(config)   # ❌

# CORRECT
response = await kit.achat(config)   # ✅  (inside an async function)
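
The difference is easy to observe without any API call. In this sketch, `achat` is a hypothetical async function standing in for `kit.achat()`; the point is only what Python returns with and without `await`.

```python
import asyncio


async def achat(message: str) -> str:
    """Hypothetical async call standing in for kit.achat()."""
    await asyncio.sleep(0)  # pretend to wait on the network
    return f"echo: {message}"


async def main():
    pending = achat("Hello")            # no await: just a coroutine object
    is_coroutine = asyncio.iscoroutine(pending)
    result = await pending              # awaiting it runs the call and yields the value
    return is_coroutine, result


is_coroutine, result = asyncio.run(main())
print(is_coroutine, result)
```

If the coroutine object is never awaited, Python emits a "coroutine ... was never awaited" RuntimeWarning when it is garbage-collected, which is a useful hint when debugging this mistake.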

Mistake 7: Not installing the provider extra

# WRONG — if you only ran  pip install ractogateway
from ractogateway import openai_developer_kit as gpt
kit = gpt.OpenAIDeveloperKit(model="gpt-4o")
kit.chat(...)   # ❌  ImportError: The 'openai' package is required

# FIX
# pip install "ractogateway[openai]"

Mistake 8: Not handling `ResponseModelValidationError`

When response_model is set, validation failures now raise ResponseModelValidationError after all retries are exhausted — they no longer silently append a warning string to response.content.

# WRONG — this will now raise, not return a response with garbled content
response = kit.chat(config)   # ❌ unhandled ResponseModelValidationError

# CORRECT — wrap in try/except to handle gracefully
from ractogateway.exceptions import ResponseModelValidationError

try:
    response = kit.chat(config)
    report = MyModel(**response.parsed)
except ResponseModelValidationError as e:
    # Inspect what happened and decide how to recover
    print(f"Validation failed after {e.attempts} attempt(s): {e.last_error}")
    # e.raw_response holds the last raw JSON string from the LLM

Tip: With the default max_validation_retries=2, the kit retries up to twice before raising, so the error surfaces only after three consecutive attempts fail validation. Set max_validation_retries=0 to disable retries and fail fast.

19. Telemetry & Observability

RactoGateway ships production-grade observability with zero changes to existing call sites. Attach a RactoTracer and/or GatewayMetricsMiddleware to any kit and every LLM call is automatically instrumented.

Installation

pip install "ractogateway[observability]"   # OTEL tracing + Prometheus metrics
pip install "ractogateway[telemetry]"        # OTEL tracing only
pip install "ractogateway[prometheus]"       # Prometheus metrics only

Quick start

from ractogateway import openai_developer_kit as opd
from ractogateway.telemetry import RactoTracer, GatewayMetricsMiddleware, PrometheusExporter

tracer  = RactoTracer(otlp_endpoint="http://localhost:4317", console=True)
metrics = GatewayMetricsMiddleware()
PrometheusExporter(port=8000).start()    # scrape http://localhost:8000/metrics

kit = opd.OpenAIDeveloperKit(
    model="gpt-4o",
    default_prompt=prompt,   # a RactoPrompt you built earlier (see §5)
    tracer=tracer,
    metrics=metrics,
)
response = kit.chat(opd.ChatConfig(user_message="Hello!"))
# One OTEL span emitted, one Prometheus data-point recorded.

The same tracer= / metrics= parameters work on GoogleDeveloperKit and AnthropicDeveloperKit.

What is recorded automatically

Event	Tracer span	Prometheus metrics
Successful chat/stream	`llm.chat` with latency, tokens, cost	`requests_total`, `duration_seconds`, `tokens_total`, `cost_usd_total`
Cache hit (exact/semantic)	`llm.chat` with `cache_hit="exact"/"semantic"`, 0 tokens	`cache_hits_total`
Cache miss	—	`cache_misses_total`
Tool call	`tool_calls` attribute on span	`tool_calls_total{tool_name}`
Error	`status="error"`, `error_type=ExcName`	`requests_total{status="error"}`
Embedding	`llm.embed`	`requests_total{operation="embed"}`
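
Counters such as `cache_hits_total` and `cache_misses_total` are most useful when combined. The sketch below parses a small piece of Prometheus text output and derives a cache hit ratio. The sample text is made up for illustration, and the real metric names exposed by the exporter may carry a prefix or extra labels:

```python
import re

# Illustrative scrape output, not real data.
SAMPLE = """\
# TYPE cache_hits_total counter
cache_hits_total 30
# TYPE cache_misses_total counter
cache_misses_total 10
requests_total{status="ok"} 38
requests_total{status="error"} 2
"""

# name, optional {labels}, value (lines with a trailing timestamp are skipped)
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def sum_by_name(text: str) -> dict[str, float]:
    """Sum every sample of each metric name, ignoring comments and labels."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name, _labels, value = match.groups()
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


totals = sum_by_name(SAMPLE)
hits, misses = totals["cache_hits_total"], totals["cache_misses_total"]
hit_ratio = hits / (hits + misses)
print(f"cache hit ratio: {hit_ratio:.2f}")  # cache hit ratio: 0.75
```

In production you would normally compute this ratio in Prometheus or Grafana; the Python version is only meant to show what the two counters mean together.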

OTEL export backends

# Jaeger / Grafana Tempo (gRPC)
RactoTracer(otlp_endpoint="http://jaeger:4317")

# Zipkin / Tempo (HTTP)
RactoTracer(otlp_http_endpoint="http://tempo:4318")

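# In-memory capture for unit tests (no external backend needed)
tracer = RactoTracer(in_memory=True)
kit = opd.OpenAIDeveloperKit(model="gpt-4o", default_prompt=prompt, tracer=tracer)
kit.chat(opd.ChatConfig(user_message="Hello!"))
assert tracer.spans[0].provider == "openai"
tracer.clear_spans()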

Custom pricing

from ractogateway.telemetry import ModelPricing, RactoTracer

custom = {"my-ft-gpt4": ModelPricing(input_per_million=5.0, output_per_million=15.0)}
tracer = RactoTracer(otlp_endpoint="...", price_table=custom)
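
Per-million pricing turns into a per-request cost with simple arithmetic. The sketch below shows the calculation for the prices above; `PricingSketch` and `estimate_cost_usd` are hypothetical names, and the tracer's real cost logic may differ (for example in rounding or cached-token handling):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingSketch:
    """Hypothetical mirror of ModelPricing: USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def estimate_cost_usd(pricing: PricingSketch, input_tokens: int, output_tokens: int) -> float:
    """Cost = tokens / 1,000,000 * price per million, summed over input and output."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_million
        + output_tokens / 1_000_000 * pricing.output_per_million
    )


pricing = PricingSketch(input_per_million=5.0, output_per_million=15.0)
cost = estimate_cost_usd(pricing, input_tokens=1_000, output_tokens=500)
print(f"${cost:.4f}")  # $0.0125
```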

Grafana dashboard

Import dashboards/grafana_dashboard.json into Grafana to get 20+ pre-built panels covering latency percentiles (p50/p95/p99), token rate, cost rate, cache hit/miss ratio, error rate, tool call distribution, and a per-model summary table.

Full reference: Telemetry guide | API reference

20. Prebuilt Pipelines — Production Workflows

RactoGateway includes prebuilt pipelines for common end-to-end tasks where a single chat() call is not enough.

Available pipelines

Pipeline	Classes	Use case
SQL Analyst	`SQLAnalystPipeline`, `AsyncSQLAnalystPipeline`	Natural language analytics over SQL databases
List Classifier	`ListClassifierPipeline`, `AsyncListClassifierPipeline`	Map user text to one or more options from a list
Video Processor	`VideoProcessorPipeline`, `AsyncVideoProcessorPipeline`	Extract frames, transcribe audio, analyse with vision LLM, summarise
Agent	`AgentPipeline`, `AsyncAgentPipeline`	Autonomous ReAct agent — reason + call tools + observe → answer

Install extras

# SQL Analyst
pip install "ractogateway[pipelines-sql]"           # core (no charts)
pip install "ractogateway[pipelines-sql-viz]"        # + Plotly charts

# Video Processor
pip install "ractogateway[pipelines-video]"          # OpenCV + ffmpeg + pHash
pip install "ractogateway[pipelines-video-whisper]"  # + faster-whisper (local ASR)
pip install "ractogateway[pipelines-video-yt]"       # + yt-dlp (YouTube download)

# Agent
pip install "ractogateway[pipelines-agent]"          # core (no extra deps)
pip install "ractogateway[pipelines-agent-http]"     # + httpx (http_get tool)

SQL Analyst — quick example

from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import SQLAnalystPipeline

sql_pipeline = SQLAnalystPipeline(kit=gpt.Chat(model="gpt-4o"))
result = sql_pipeline.run(
    user_query="Top 5 products by revenue",
    connection_string="postgresql://user:pass@localhost:5432/shop",
)
print(result.answer)

List Classifier — quick example

from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import ListClassifierPipeline

classifier = ListClassifierPipeline(
    kit=gpt.Chat(model="gpt-4o-mini"),
    options=["Billing", "Technical Support", "Sales"],
    include_confidence=True,
    include_reasoning=True,
)
result = classifier.run("I cannot update my payment method")
print(result.first)           # "Billing"
print(result.top_confidence)  # e.g. 0.96

Video Processor — quick example

Process a lecture or tutorial video end-to-end — extract key frames, transcribe speech, use a vision LLM to read whiteboards/screens, and produce a structured Markdown report.

from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import VideoProcessorPipeline, TranscriberBackend, DeduplicationMethod

pipeline = VideoProcessorPipeline(
    kit=gpt.Chat(model="gpt-4o"),        # vision LLM + summary
    fps=1.0,                              # sample one frame per second
    similarity_threshold=85.0,            # drop frames that are ≥85% similar to the previous
    dedup_method=DeduplicationMethod.PHASH,
    transcriber=TranscriberBackend.FASTER_WHISPER,
    transcriber_model="base",
    analyze_frames=True,
    generate_summary=True,
    safe_mode=True,
)

# Accepts: local path, HTTP URL, YouTube URL, raw bytes, or pre-extracted frame list
result = pipeline.run("lecture.mp4")

print(f"Frames kept : {result.usage.frames_kept}/{result.usage.frames_extracted}")
print(f"Tokens used : {result.usage.total_tokens}")
print(result.summary)          # structured Markdown summary
result.to_markdown("report.md")  # save full report
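
To build intuition for `similarity_threshold`, here is a minimal sketch of threshold-based frame deduplication on perceptual hashes. It assumes 64-bit hashes, similarity measured as the share of matching bits, and comparison against the last kept frame; the pipeline's real pHash logic is not shown in this guide and may differ. The function names are hypothetical:

```python
def similarity_percent(hash_a: int, hash_b: int, bits: int = 64) -> float:
    """Percentage of matching bits between two perceptual hashes."""
    differing = bin(hash_a ^ hash_b).count("1")  # Hamming distance
    return (1 - differing / bits) * 100


def keep_distinct(hashes: list[int], threshold: float = 85.0) -> list[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: list[int] = []
    for index, frame_hash in enumerate(hashes):
        if not kept or similarity_percent(hashes[kept[-1]], frame_hash) < threshold:
            kept.append(index)
    return kept


# Made-up hashes: frame 1 differs from frame 0 by 2 bits, frame 2 by 32 bits.
frames = [0x0, 0x3, 0xFFFFFFFF]
print(keep_distinct(frames))  # [0, 2]
```

Frame 1 is about 97% similar to frame 0 and is dropped; frame 2 is 50% similar and is kept. Raising the threshold keeps more frames (and costs more vision tokens), lowering it keeps fewer.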

What it produces (VideoProcessorResult):

Field	Type	Description
`frames`	`list[FrameEntry]`	Every extracted frame with its LLM analysis
`transcript`	`list[TranscriptSegment]`	Timed speech-to-text segments
`sections`	`list[VideoSection]`	Time windows merging visual + audio content
`summary`	`str`	7-section Markdown summary
`usage`	`VideoProcessorUsage`	Token counts + frame statistics

Supported transcription backends (TranscriberBackend):

Backend	Value	Requires
Faster Whisper (default)	`"faster-whisper"`	`pip install ractogateway[pipelines-video-whisper]`
OpenAI Whisper (local)	`"openai-whisper"`	`pip install openai-whisper`
OpenAI API	`"openai-api"`	OpenAI API key
Groq API (ultra-fast)	`"groq-api"`	`pip install groq` + Groq API key
Deepgram	`"deepgram-api"`	`pip install deepgram-sdk` + key
Google Cloud STT	`"google-api"`	`pip install google-cloud-speech` + key
HuggingFace local	`"huggingface-local"`	`pip install transformers torch`
HuggingFace API	`"huggingface-api"`	`pip install huggingface_hub` + key
Ollama	`"ollama"`	Running Ollama server

Agent — quick example

An autonomous ReAct (Reason + Act) agent that loops: think → call tool → observe → repeat until it calls the built-in finish() tool.

from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import AgentPipeline

def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"

def unit_convert(value: float, from_unit: str, to_unit: str) -> str:
    """Convert a value between units."""
    # ... your logic here ...
    return f"{value} {from_unit} = ... {to_unit}"

agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather, unit_convert],
    max_steps=8,
    safe_mode=True,
)

result = agent.run("What is the weather in Paris, and convert 22°C to Fahrenheit?")
print(result.final_answer)
print(result.to_markdown())   # step-by-step trace
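
The think → call tool → observe loop can be sketched in a few lines without any LLM. In the toy version below, `scripted_model` stands in for the model's decisions and `run_agent` is a hypothetical helper, not the `AgentPipeline` implementation; it only illustrates how `finish` and `max_steps` end the loop:

```python
from typing import Callable, Optional


def get_weather(city: str) -> str:
    """Toy tool with a canned answer."""
    return f"Sunny, 22 °C in {city}"


def run_agent(
    decide: Callable[[list[str]], tuple[str, dict]],
    tools: dict[str, Callable[..., str]],
    max_steps: int = 8,
) -> tuple[Optional[str], str]:
    """Loop think -> act -> observe until finish() or the step budget runs out."""
    observations: list[str] = []
    for _ in range(max_steps):
        tool_name, arguments = decide(observations)          # "think": pick the next action
        if tool_name == "finish":
            return arguments["answer"], "finish"
        observations.append(tools[tool_name](**arguments))   # "act", then "observe"
    return None, "max_steps"


def scripted_model(observations: list[str]) -> tuple[str, dict]:
    """Stand-in for the LLM: call the tool once, then finish with what it saw."""
    if not observations:
        return "get_weather", {"city": "Paris"}
    return "finish", {"answer": observations[-1]}


answer, stop_reason = run_agent(scripted_model, {"get_weather": get_weather})
print(stop_reason, "|", answer)  # finish | Sunny, 22 °C in Paris
```

A model that never calls `finish` would return `(None, "max_steps")`, which is why a step limit is worth setting on real agents.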

Agent result fields (AgentResult):

Field	Type	Description
`final_answer`	`str | None`	The agent’s concluded answer
`steps`	`list[AgentStep]`	Every thought / tool call / observation
`stop_reason`	`StopReason`	`"finish"`, `"max_steps"`, or `"error"`
`usage`	`AgentUsage`	Cumulative token counts across all steps

Built-in tool factories:

from ractogateway.pipelines import (
    make_rag_tool,        # rag_search(query) → relevant chunks from RactoRAG
    make_sql_tool,        # sql_query(question) → answer from SQLAnalystPipeline
    make_http_tool,       # http_get(url) → page text (requires httpx)
    make_memory_tools,    # memory_read(key) + memory_write(key, value)
)

agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather],               # your custom tools
    rag_pipeline=my_rag,               # auto-registers rag_search
    sql_pipeline=my_sql,               # auto-registers sql_query
    agent_memory={},                   # dict → auto-registers memory_read/write
    extra_tools=[make_http_tool()],    # opt-in http_get
)
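
Conceptually, dict-backed memory tools are just two functions sharing one dictionary. The sketch below shows that idea with a closure; `make_memory_tools_sketch` is a hypothetical illustration, and the real `make_memory_tools` may use different signatures and return values:

```python
from typing import Callable


def make_memory_tools_sketch(
    store: dict[str, str],
) -> tuple[Callable[[str], str], Callable[[str, str], str]]:
    """Build read/write tool functions that close over one shared dict."""

    def memory_read(key: str) -> str:
        """Return the stored value, or a readable marker when the key is unknown."""
        return store.get(key, f"(no value stored for '{key}')")

    def memory_write(key: str, value: str) -> str:
        """Store a value and confirm, so the agent gets an observation back."""
        store[key] = value
        return f"stored '{key}'"

    return memory_read, memory_write


agent_memory: dict[str, str] = {}
memory_read, memory_write = make_memory_tools_sketch(agent_memory)
memory_write("user_city", "Paris")
print(memory_read("user_city"))  # Paris
print(agent_memory)              # {'user_city': 'Paris'}
```

Because the dict lives outside the tools, you can inspect or pre-fill it from your own code before and after an agent run.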

21. Chain of Thought Reasoning

Chain of Thought (CoT) prompts the model to reason step-by-step before giving its final answer. RactoGateway exposes this as a single ChatConfig flag — no prompt engineering required.

How to enable

from ractogateway import openai_developer_kit as gpt

kit = gpt.Chat(model="gpt-4o")
response = kit.chat(
    gpt.ChatConfig(
        user_message="If a train travels 300 km in 2.5 hours, what is its average speed?",
        chain_of_thought=True,   # ← flip this flag
    )
)
print(response.content)
# The model will reason through the problem before stating "120 km/h"

What it does internally

Setting chain_of_thought=True appends a step-by-step reasoning constraint to the RactoPrompt before the request is sent. The constraint instructs the model to:

Break the problem into numbered reasoning steps.
Show its working at each step.
State the final answer clearly at the end.

This is applied per request — it does not modify the kit’s default prompt permanently.
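
The "per request, not permanent" behaviour can be pictured as copying the prompt and appending one constraint to the copy. This is a minimal sketch under that assumption; `PromptSketch` is a simplified, hypothetical stand-in for `RactoPrompt`, and the constraint wording is illustrative because the library's actual text is not shown in this guide:

```python
from dataclasses import dataclass, replace

# Illustrative wording only.
COT_CONSTRAINT = (
    "Break the problem into numbered reasoning steps, show your working at each "
    "step, and state the final answer clearly at the end."
)


@dataclass(frozen=True)
class PromptSketch:
    """Hypothetical, simplified stand-in for RactoPrompt."""
    role: str
    constraints: tuple[str, ...] = ()


def with_chain_of_thought(prompt: PromptSketch) -> PromptSketch:
    """Return a copy with the CoT constraint appended; the original is untouched."""
    return replace(prompt, constraints=prompt.constraints + (COT_CONSTRAINT,))


default_prompt = PromptSketch(role="Math tutor", constraints=("Use SI units.",))
request_prompt = with_chain_of_thought(default_prompt)

print(len(default_prompt.constraints))  # 1
print(len(request_prompt.constraints))  # 2
```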

When to use CoT

Scenario	Benefit
Math / logic problems	Forces explicit calculation steps → fewer errors
Multi-step planning	Surfaces assumptions and intermediate decisions
Debugging assistance	Produces a traceable reasoning chain
Exam / quiz apps	Provides explanation alongside the answer

Combining with structured output

from pydantic import BaseModel

class ReasonedAnswer(BaseModel):
    steps: list[str]
    final_answer: str

response = kit.chat(
    gpt.ChatConfig(
        user_message="How many seconds are in a leap year?",
        chain_of_thought=True,
        response_model=ReasonedAnswer,   # parse result into Pydantic model
    )
)
answer = ReasonedAnswer(**response.parsed)   # assumes response.parsed is a dict, as in the Mistake 8 example
print(answer.steps)
print(answer.final_answer)

22. Native Thinking / Extended Reasoning

Native Thinking exposes the model’s internal chain-of-thought reasoning tokens — the model genuinely thinks before answering rather than being instructed to write steps. Supported by Anthropic Claude (extended thinking) and Google Gemini (thinking mode). OpenAI o-series models expose reasoning token counts but not the text.

Enable native thinking

from ractogateway import anthropic_developer_kit as claude

kit = claude.Chat(model="claude-opus-4-6")
response = kit.chat(
    claude.ChatConfig(
        user_message="Prove that √2 is irrational.",
        native_thinking=True,
        thinking_budget=8000,   # max thinking tokens (Anthropic/Google)
    )
)
print(response.thinking)   # raw model reasoning (may be hundreds of tokens)
print(response.content)    # final polished answer

Streaming with native thinking

accumulated_thinking = ""
for chunk in kit.stream(
    claude.ChatConfig(
        user_message="Design a cache-invalidation strategy for a distributed system.",
        native_thinking=True,
        thinking_budget=10000,
    )
):
    if chunk.is_thinking:
        accumulated_thinking += chunk.delta.thinking   # keep the reasoning for later inspection
        print(chunk.delta.thinking, end="", flush=True)
    else:
        print(chunk.delta.text, end="", flush=True)

Provider behaviour summary

Provider	Thinking text visible	Thinking budget param	Notes
Anthropic Claude	✅ `response.thinking`	`thinking_budget`	Forces `temperature=1`
Google Gemini	✅ `response.thinking`	`thinking_budget`	`ThinkingConfig` injected
OpenAI (o-series)	❌ not exposed	N/A	`reasoning_tokens` count in `usage`

`LLMResponse` fields added by native thinking

Field	Type	Description
`thinking`	`str | None`	Raw model reasoning text
`StreamDelta.thinking`	`str`	Incremental thinking token (streaming)
`StreamChunk.accumulated_thinking`	`str`	Full thinking so far (streaming)
`StreamChunk.is_thinking`	`bool`	`True` while in a thinking block
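
To show how these fields fit together, the sketch below separates thinking deltas from answer deltas using scripted chunks. `DeltaSketch` and `ChunkSketch` are hypothetical mirrors of `StreamDelta` and `StreamChunk` that carry only the fields from the table; no API call is made:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DeltaSketch:
    """Hypothetical mirror of StreamDelta: one increment of output."""
    text: str = ""
    thinking: str = ""


@dataclass
class ChunkSketch:
    """Hypothetical mirror of StreamChunk."""
    delta: DeltaSketch
    is_thinking: bool


def split_stream(chunks: Iterable[ChunkSketch]) -> tuple[str, str]:
    """Collect thinking deltas and answer deltas into two separate strings."""
    thinking_parts: list[str] = []
    answer_parts: list[str] = []
    for chunk in chunks:
        if chunk.is_thinking:
            thinking_parts.append(chunk.delta.thinking)
        else:
            answer_parts.append(chunk.delta.text)
    return "".join(thinking_parts), "".join(answer_parts)


# Scripted chunks standing in for kit.stream(...).
fake_stream = [
    ChunkSketch(DeltaSketch(thinking="Assume sqrt(2) = p/q. "), is_thinking=True),
    ChunkSketch(DeltaSketch(thinking="Then p is even."), is_thinking=True),
    ChunkSketch(DeltaSketch(text="sqrt(2) is irrational."), is_thinking=False),
]
thinking, answer = split_stream(fake_stream)
print(answer)  # sqrt(2) is irrational.
```

Keeping the two streams apart makes it straightforward to show the answer to users while logging the reasoning separately.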

When to use native thinking

Use native_thinking=True when accuracy matters more than latency:

Complex proofs, theorem verification
Code architecture reviews
Medical / legal / scientific reasoning
Any task where you want to inspect the model’s reasoning, not just the answer

Cost note: thinking tokens count toward your bill but are not included in response.content. Set thinking_budget conservatively; 4000–8000 tokens is enough for most tasks.

23. PageIndexRAG — Vectorless RAG

PageIndexRAG is a lightweight RAG pipeline that requires no embeddings and no vector database. It uses a two-stage keyword index + BM25 scoring to retrieve relevant pages from documents. Perfect for CPU-only environments, offline use, or when you want instant setup without configuring a vector store.

How it works

Document → page split → DecisionIndex (inverted keyword index)
                       → BM25 scorer (Okapi BM25) → top-k pages → LLM

Page split — PDFs are split page-by-page; all other documents use fixed character windows (page_size=1000, page_overlap=100).
DecisionIndex — builds an inverted keyword index over all pages for fast candidate retrieval (no embeddings needed).
BM25 scoring — ranks candidates with Okapi BM25, the same algorithm used by Elasticsearch and Solr.
LLM answer — top-k pages are passed to the LLM as context.
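
The first three stages can be sketched with the standard library alone. The code below is an illustration of the general technique (fixed character windows, an inverted index for candidate lookup, Okapi BM25 for ranking), not the library's `DecisionIndex` code. The class name, the tokenizer, and the `k1`/`b` defaults are assumptions:

```python
import math
import re
from collections import Counter, defaultdict


def split_pages(text: str, page_size: int = 1000, page_overlap: int = 100) -> list[str]:
    """Cut text into fixed character windows that overlap by page_overlap."""
    step = page_size - page_overlap  # assumes page_size > page_overlap
    return [text[start:start + page_size] for start in range(0, len(text), step)]


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class KeywordIndexSketch:
    """Inverted index plus Okapi BM25 ranking."""

    def __init__(self, pages: list[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.pages, self.k1, self.b = pages, k1, b
        self.term_counts = [Counter(tokenize(page)) for page in pages]
        self.lengths = [sum(counts.values()) for counts in self.term_counts]
        self.avg_length = sum(self.lengths) / max(len(pages), 1)
        self.postings: dict[str, set[int]] = defaultdict(set)  # term -> page ids
        for page_id, counts in enumerate(self.term_counts):
            for term in counts:
                self.postings[term].add(page_id)

    def search(self, query: str, top_k: int = 3) -> list[tuple[int, float]]:
        terms = tokenize(query)
        # Stage 1: candidates are pages containing at least one query term.
        candidates = set().union(*(self.postings.get(t, set()) for t in terms))
        # Stage 2: score each candidate with BM25.
        scores: dict[int, float] = {}
        for page_id in candidates:
            score = 0.0
            for term in terms:
                tf = self.term_counts[page_id][term]
                if tf == 0:
                    continue
                n = len(self.postings[term])  # pages containing the term
                idf = math.log((len(self.pages) - n + 0.5) / (n + 0.5) + 1)
                norm = 1 - self.b + self.b * self.lengths[page_id] / self.avg_length
                score += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
            scores[page_id] = score
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


pages = [
    "Refunds are processed within five business days.",
    "The handbook covers vacation policy and sick leave.",
    "Vacation requests need manager approval.",
]
index = KeywordIndexSketch(pages)
best_page, _score = index.search("vacation policy")[0]
print(pages[best_page])  # The handbook covers vacation policy and sick leave.
```

The second page wins because it matches both query terms, and the rarer term ("policy") carries a higher IDF weight than the more common one ("vacation").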

Quick example

from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG

kit = gpt.Chat(model="gpt-4o-mini")

# Build the index
rag = PageIndexRAG(kit=kit)
rag.add_document("docs/handbook.pdf")      # PDF — split page-by-page
rag.add_document("docs/faq.txt")           # Plain text — split by char window
rag.add_texts(["RactoGateway supports 5 developer kits.", "..."])

# Query
result = rag.search("What developer kits are supported?")
print(result.answer)          # LLM answer grounded in the retrieved pages
print(result.pages[0].text)   # raw page text that was used as context

No extra install

PageIndexRAG ships in the core package — no vector store or embedding model required:

pip install ractogateway          # PageIndexRAG included by default
pip install "ractogateway[rag]"   # if you also want readers (PDF, Word, Excel…)

Comparison: PageIndexRAG vs. RactoRAG

Feature	`PageIndexRAG`	`RactoRAG`
Embeddings needed	❌ No	✅ Yes
Vector store needed	❌ No	✅ Yes (Chroma, FAISS, Pinecone…)
Retrieval algorithm	BM25 (keyword)	Cosine similarity (semantic)
Best for	Quick setup, keyword-rich docs	Deep semantic search
GPU/CPU	Pure CPU	CPU or GPU (embedding model)
Offline use	✅ Fully offline	⚠️ Depends on embedder

When to use PageIndexRAG

Prototyping a Q&A feature without setting up a vector DB
Compliance / legal documents where exact keyword match matters
Offline / air-gapped environments
Structured documents (manuals, handbooks) where pages map naturally to topics

Advanced: async + custom top-k and page size

import asyncio

async def main():
    rag = PageIndexRAG(kit=kit, top_k=5, page_size=800, page_overlap=80)
    rag.add_document("research_paper.pdf")
    result = await rag.asearch("What methodology did the authors use?")
    print(result.answer)

asyncio.run(main())

Full reference: PageIndexRAG API

Quick Reference Card

# ── Imports ──────────────────────────────────────────────────────────
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway.tools.registry import tool, ToolRegistry
from ractogateway.cache import ExactMatchCache, SemanticCache
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway.truncation import TokenTruncator, TruncationConfig

# ── Build a prompt ───────────────────────────────────────────────────
prompt = RactoPrompt(
    role="...", aim="...", constraints=["..."], tone="...",
    output_format="text",    # or "json", "markdown", or a Pydantic class
    context="...",           # optional background knowledge
    examples=[{"input": "...", "output": "..."}],  # optional few-shot
)

# ── Create the kit ───────────────────────────────────────────────────
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=prompt,
    exact_cache=ExactMatchCache(max_size=512),
)

# ── Sync chat ────────────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(user_message="Hello!"))
print(response.content)

# ── Async chat (run inside an async function) ────────────────────────
response = await kit.achat(gpt.ChatConfig(user_message="Hello!"))

# ── Streaming ────────────────────────────────────────────────────────
for chunk in kit.stream(gpt.ChatConfig(user_message="Tell me a story.")):
    print(chunk.delta.text, end="", flush=True)

# ── Embeddings ───────────────────────────────────────────────────────
from ractogateway._models.embedding import EmbeddingConfig
resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
vec = resp.vectors[0].embedding   # list[float]

# ── Tool calling ─────────────────────────────────────────────────────
@tool
def get_price(product: str) -> float:
    """Get the price of a product."""
    return 9.99

registry = ToolRegistry()
registry.register(get_price)
response = kit.chat(gpt.ChatConfig(
    user_message="How much is a widget?",
    tools=registry,
))

# ── Chain of Thought ─────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(
    user_message="Explain why √2 is irrational.",
    chain_of_thought=True,           # step-by-step reasoning in the answer
))

# ── Native Thinking (Anthropic / Gemini) ─────────────────────────────
from ractogateway import anthropic_developer_kit as claude
claude_kit = claude.Chat(model="claude-opus-4-6")
response = claude_kit.chat(claude.ChatConfig(
    user_message="Design a cache-invalidation strategy.",
    native_thinking=True,
    thinking_budget=8000,            # max internal reasoning tokens
))
print(response.thinking)            # raw reasoning
print(response.content)             # polished answer

# ── PageIndexRAG (no embeddings) ─────────────────────────────────────
from ractogateway.rag.page_index import PageIndexRAG
rag = PageIndexRAG(kit=kit)
rag.add_document("handbook.pdf")
result = rag.search("What developer kits are supported?")
print(result.answer)

# ── Pipelines ────────────────────────────────────────────────────────
from ractogateway.pipelines import (
    SQLAnalystPipeline,
    ListClassifierPipeline,
    VideoProcessorPipeline,
    AgentPipeline,
    TranscriberBackend,
)

# SQL
sql = SQLAnalystPipeline(kit=kit)
sql_result = sql.run("Top 5 products", connection_string="postgresql://...")
print(sql_result.answer)

# Classifier
clf = ListClassifierPipeline(kit=kit, options=["Billing", "Tech Support"])
print(clf.run("I can't log in").first)

# Video
vp = VideoProcessorPipeline(
    kit=kit,
    transcriber=TranscriberBackend.FASTER_WHISPER,
    generate_summary=True,
)
vp_result = vp.run("lecture.mp4")
print(vp_result.summary)

# Agent
def search_web(query: str) -> str:
    """Search the web for information."""
    return f"Results for: {query}"

agent = AgentPipeline(kit=kit, tools=[search_web], max_steps=6)
print(agent.run("What is the capital of France?").final_answer)