# RactoGateway — Complete User Guide

> **Who this guide is for:** complete beginners who have never used an LLM library before, as well as experienced developers who want a deep-dive reference. Every parameter is explained in plain English *and* in technical terms, with working code examples and expected output.

---

## Table of Contents

1. [Jargon Buster — Know the Words Before You Write the Code](#1-jargon-buster)
2. [What is RactoGateway?](#2-what-is-ractogateway)
3. [Installation](#3-installation)
4. [Core Mental Model](#4-core-mental-model)
5. [RactoPrompt — The Heart of Every Request](#5-ractoprompt)
6. [Developer Kits — Your Chat Interface](#6-developer-kits)
7. [Your First Chat](#7-your-first-chat)
8. [ChatConfig — Controlling Every Request](#8-chatconfig)
9. [Getting Structured / Typed Output](#9-structured-output)
    - 9.1 Complex Nested Structured Output
    - 9.2 Validation Retries and `ResponseModelValidationError`
10. [Multi-Turn Conversations (History)](#10-multi-turn-conversations)
11. [Streaming — Real-Time Token-by-Token Output](#11-streaming)
12. [Tool Calling — LLM Calls Your Python Functions](#12-tool-calling)
13. [File Attachments — Vision & PDFs](#13-file-attachments)
14. [Embeddings — Teaching Machines to Understand Text](#14-embeddings)
15. [Performance & Cost Optimisation](#15-performance--cost-optimisation)
    - 15.1 Exact Match Cache
    - 15.2 Semantic Cache
    - 15.3 Token Truncation
    - 15.4 Cost-Aware Routing
16. [All Five Developer Kits](#16-all-five-developer-kits)
    - 16.1 OpenAIDeveloperKit (GPT)
    - 16.2 GoogleDeveloperKit (Gemini)
    - 16.3 AnthropicDeveloperKit (Claude)
    - 16.4 OllamaDeveloperKit (Local / Offline)
    - 16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM)
17. [RAG — Retrieval-Augmented Generation](#17-rag--retrieval-augmented-generation)
18. [Redis — Production Infrastructure](#18-redis--production-infrastructure)
19. [Common Mistakes & How to Fix Them](#19-common-mistakes--how-to-fix-them)
20. [Prebuilt Pipelines — Production Workflows](#20-prebuilt-pipelines--production-workflows)
    - SQL Analyst, List Classifier, Video Processor, Agent
21. [Chain of Thought Reasoning](#21-chain-of-thought-reasoning)
22. [Native Thinking / Extended Reasoning](#22-native-thinking--extended-reasoning)
23. [PageIndexRAG — Vectorless RAG](#23-pageindexrag--vectorless-rag)

---

## 1. Jargon Buster

Before diving into code, here are the key terms you will encounter. Skip to §2 if you already know these.

| Term | Plain-English Meaning | Technical Definition |
|---|---|---|
| **LLM** | A very powerful autocomplete that understands meaning | Large Language Model — a neural network trained on vast text corpora to predict/generate natural language |
| **Prompt** | What you say to the AI | The input text (plus optional instructions) sent to an LLM |
| **Completion / Response** | What the AI says back | The LLM's generated output tokens |
| **Token** | Roughly one word (sometimes less) | The smallest unit an LLM processes; ~4 chars for English |
| **System Prompt** | The AI's job description | An instruction block sent before the conversation; sets behaviour and constraints |
| **Temperature** | How creative vs. predictable the AI is | Float 0–2. 0 = near-deterministic (the most likely token is always chosen). Higher = more random/creative |
| **Streaming** | Getting the answer word-by-word in real time | Server-sent events where each token is pushed to the client as it is generated |
| **Embedding** | Converting text into a list of numbers | A dense vector representation where semantically similar texts are numerically close |
| **RAG** | Letting the AI "look things up" before answering | Retrieval-Augmented Generation — retrieve relevant chunks from a knowledge base and inject them into the prompt |
| **Tool Calling** | The AI can trigger your Python functions | Function-calling protocol where the LLM emits a structured intent and the client executes a real function |
| **Pydantic Model** | A Python class that validates data automatically | A `BaseModel` subclass that enforces types and field rules at runtime |
| **Cache** | Store an answer so you don't ask the AI twice | In-memory or distributed key-value store keyed on request fingerprint |
| **Context Window** | The AI's short-term memory | Maximum number of tokens (prompt plus completion) the model can process in one request |
| **Adapter** | The translator between our library and the AI provider | A thin class that converts our internal format to the OpenAI / Google / Anthropic API wire format |

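To make the **Embedding** row concrete, here is a minimal sketch of cosine similarity, the usual way to measure how "numerically close" two vectors are. The three vectors are made-up toy values, not output from any real embedding model, and `cosine_similarity` is a helper written for this illustration rather than a RactoGateway function.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made up; real ones have hundreds of dimensions)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related words score higher than unrelated ones
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```
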
---

## 2. What is RactoGateway?

**Plain English:** RactoGateway is a Python library that lets you talk to AI models from different providers (OpenAI, Google, Anthropic, plus local Ollama and Hugging Face models) using the same code. You don't need to learn a separate API for each one. You write your prompts using a structured template (the RACTO principle), and the library takes care of formatting, caching, routing, and more.

**Technical:** RactoGateway is a provider-agnostic LLM orchestration SDK built on Pydantic. It provides:

- A unified `RactoPrompt` structured prompt compiler (the RACTO principle)
- Provider-specific developer kits (`OpenAIDeveloperKit`, `GoogleDeveloperKit`, `AnthropicDeveloperKit`, `OllamaDeveloperKit`, `HuggingFaceDeveloperKit`)
- Sync **and** async parity on every method
- Optional middleware: exact-match cache, semantic cache, cost-aware router, token truncator
- Tool calling, file attachments, streaming, embeddings, RAG, fine-tuning, and production infra (Redis, Celery, Kafka)

**Why does this exist?** Without RactoGateway, switching from OpenAI to Anthropic means rewriting every provider-specific call. With RactoGateway, you swap one class name.

---

## 3. Installation

```bash
# Minimum — no LLM provider yet
pip install ractogateway

# OpenAI (GPT models)
pip install "ractogateway[openai]"

# Google (Gemini models)
pip install "ractogateway[google]"

# Anthropic (Claude models)
pip install "ractogateway[anthropic]"

# All three providers at once
pip install "ractogateway[all]"

# RAG (document reading, chunking, embedding, stores)
pip install "ractogateway[rag-all]"

# Redis (distributed cache, rate limiting, chat memory)
pip install "ractogateway[redis]"
```

**Requires Python 3.10 or later.**

---

## 4. Core Mental Model

Think of RactoGateway in four layers:

```
┌─────────────────────────────────────────────────────┐
│  YOUR CODE                                          │
│  RactoPrompt → ChatConfig → kit.chat()              │
├─────────────────────────────────────────────────────┤
│  DEVELOPER KIT  (OpenAIDeveloperKit, etc.)           │
│  middleware: cache → route → truncate → API call    │
├─────────────────────────────────────────────────────┤
│  ADAPTER  (OpenAILLMKit, GoogleLLMKit, etc.)         │
│  Translates our format → provider wire format       │
├─────────────────────────────────────────────────────┤
│  PROVIDER API  (OpenAI, Google, Anthropic)           │
└─────────────────────────────────────────────────────┘
```

**You only ever touch the top layer.** The kit and adapter layers are managed for you.

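The middleware order in the diagram (cache, then route, then truncate, then the API call) can be sketched with a toy class. This is only an illustration under stated assumptions: `ToyKit` and `fingerprint` are hypothetical names, the "provider" is a local function, and the real kits may order or implement these steps differently.

```python
import hashlib
import json

def fingerprint(request: dict) -> str:
    """Stable hash of a request; identical requests give identical keys."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class ToyKit:
    """Hypothetical stand-in for a developer kit; not the RactoGateway class."""

    def __init__(self, call_provider):
        self._call_provider = call_provider  # function: dict -> str
        self._cache: dict[str, str] = {}
        self.provider_calls = 0

    def chat(self, request: dict) -> str:
        key = fingerprint(request)
        if key in self._cache:               # 1. cache hit: skip everything else
            return self._cache[key]
        # 2. routing and 3. truncation would adjust `request` here
        self.provider_calls += 1
        answer = self._call_provider(request)  # 4. the actual API call
        self._cache[key] = answer
        return answer

kit = ToyKit(lambda req: f"echo: {req['user_message']}")
first = kit.chat({"model": "demo", "user_message": "hi"})
second = kit.chat({"user_message": "hi", "model": "demo"})  # same request, other key order
print(first == second, kit.provider_calls)  # True 1
```
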
---

## 5. RactoPrompt

`RactoPrompt` is how you write instructions for the AI. It enforces the **RACTO principle**, a structured format designed to reduce hallucinations and ambiguous outputs.

**RACTO stands for:**

| Letter | Field | Plain English | Technical |
|---|---|---|---|
| **R** | `role` | Who is the AI? | System identity; primes the model's behaviour via persona specification |
| **A** | `aim` | What should it do? | Objective statement; the task the model must complete |
| **C** | `constraints` | What must it never do? | Hard invariants; rule set injected into `[CONSTRAINTS]` block |
| **T** | `tone` | How should it talk? | Communication register; affects lexical and stylistic choices |
| **O** | `output_format` | What shape should the answer be in? | Output schema; can be a keyword, a string, or a Pydantic model class |

Plus two optional helpers: `context` (background knowledge) and `examples` (few-shot examples).

### 5.1 Minimal Example

```python
from ractogateway.prompts.engine import RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful customer-support agent for a software company.",
    aim="Answer the user's question about our product.",
    constraints=[
        "Never make up features that don't exist.",
        "If you don't know the answer, say so.",
    ],
    tone="Friendly and concise.",
    output_format="text",
)

# See what the compiled system prompt looks like:
print(prompt.compile())
```

**Expected output:**

```
[ROLE]
You are a helpful customer-support agent for a software company.

[AIM]
Answer the user's question about our product.

[CONSTRAINTS]
- Never make up features that don't exist.
- If you don't know the answer, say so.

[TONE]
Friendly and concise.

[OUTPUT]
Respond in plain text with no special formatting.

[GUARDRAILS]
- If you are unsure or lack sufficient information, state it explicitly rather than guessing.
- Do NOT fabricate facts, citations, URLs, statistics, or code that you cannot verify.
- Stick strictly to what is asked. Do not add unrequested information.
- If the answer requires assumptions, list each assumption explicitly before proceeding.
```

> **Notice the `[GUARDRAILS]` section at the bottom.** This is auto-generated by `anti_hallucination=True` (the default). It tells the model to be honest about uncertainty. You can disable it with `anti_hallucination=False` if you need maximum creative freedom.

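To see how a compiled prompt like the one above could be assembled, here is a minimal sketch that joins labelled blocks in the same `[SECTION]` layout. `compile_sections` is a hypothetical helper written for this guide, not the library's `compile()` implementation, and it leaves out the `[GUARDRAILS]`, context, and examples blocks.

```python
def compile_sections(role, aim, constraints, tone, output_instruction):
    """Assemble a RACTO-style system prompt from its parts."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    blocks = [
        ("ROLE", role),
        ("AIM", aim),
        ("CONSTRAINTS", "\n".join(f"- {c}" for c in constraints)),
        ("TONE", tone),
        ("OUTPUT", output_instruction),
    ]
    # Each block is "[NAME]" on one line, its body below, blocks separated by a blank line
    return "\n\n".join(f"[{name}]\n{body}" for name, body in blocks)

text = compile_sections(
    role="You are a helpful Python tutor.",
    aim="Explain the concept the user asks about.",
    constraints=["Use beginner-friendly language."],
    tone="Warm and clear.",
    output_instruction="Respond in plain text with no special formatting.",
)
print(text)
```
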
---

### 5.2 Full Parameter Reference

```python
from pydantic import BaseModel
from ractogateway.prompts.engine import RactoPrompt

class Summary(BaseModel):
    headline: str
    bullet_points: list[str]
    confidence_score: float  # 0.0 to 1.0

prompt = RactoPrompt(
    # ── REQUIRED ──────────────────────────────────────────────────────
    role="You are a senior financial analyst.",
    # Plain: "Tell the AI who it is"
    # Technical: Persona string prepended to the [ROLE] block; primes
    #            the model's prior distribution toward domain-specific vocabulary

    aim="Summarise the provided earnings report into key takeaways.",
    # Plain: "Tell the AI what job it has to do"
    # Technical: Task objective injected into [AIM]; should be one clear imperative sentence

    constraints=[
        "Only use numbers that appear in the report — never invent figures.",
        "Keep bullet points to at most 15 words each.",
        "Do not provide investment advice.",
    ],
    # Plain: "Red lines the AI must never cross"
    # Technical: List[str]; each item becomes a bullet in [CONSTRAINTS].
    #            Minimum one constraint required.

    tone="Professional, concise, and factual.",
    # Plain: "How the AI should sound"
    # Technical: Register specification injected into [TONE]; influences
    #            lexical formality and style

    output_format=Summary,
    # Plain: "Exactly what shape should the answer be in?"
    # Technical: Union[str, type[BaseModel]].
    #   - "text"     → plain text
    #   - "json"     → raw JSON object
    #   - "markdown" → markdown-formatted response
    #   - A Pydantic model class → the full JSON Schema is embedded in the prompt;
    #     the LLM must return JSON that validates against it.

    # ── OPTIONAL ──────────────────────────────────────────────────────
    context="Q3 2025 earnings call. Revenue: $4.2B (+12% YoY). EPS: $1.87.",
    # Plain: "Background knowledge the AI needs to do its job"
    # Technical: Domain-specific text injected between [AIM] and [CONSTRAINTS].
    #            Ideal for passing documents, retrieved chunks, or facts.

    examples=[
        {
            "input":  "Revenue grew 5% but EPS fell 10%.",
            "output": '{"headline": "Mixed signals: top-line growth masked by margin compression", ...}'
        },
    ],
    # Plain: "Show the AI what a good answer looks like"
    # Technical: Few-shot exemplars injected into [EXAMPLES] block; each dict
    #            must contain exactly "input" and "output" keys.

    anti_hallucination=True,
    # Plain: "Should the AI be told to say 'I don't know' instead of guessing?"
    # Technical: Boolean flag. When True, appends [GUARDRAILS] block with
    #            explicit uncertainty-disclosure directives. Default: True.
)
```

---

## 6. Developer Kits

A **Developer Kit** is your interface to a specific LLM provider.
All five kits (`OpenAIDeveloperKit`, `GoogleDeveloperKit`,
`AnthropicDeveloperKit`, `OllamaDeveloperKit`, `HuggingFaceDeveloperKit`)
share the same six method names.

### OpenAIDeveloperKit — Full Parameter Reference

```python
from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",
    # Plain: "Which AI model should I use?"
    # Technical: Chat model ID passed to openai.chat.completions.create(model=...).
    #            Use "auto" to enable cost-aware routing (requires router= param).
    #            Common values: "gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "o3-mini"

    api_key="sk-...",
    # Plain: "My secret OpenAI API key (a token, not my account password)"
    # Technical: Bearer token for OpenAI API auth. Falls back to
    #            os.environ["OPENAI_API_KEY"] when omitted.

    base_url=None,
    # Plain: "Send requests to a different server (e.g. Azure or your own proxy)"
    # Technical: Override for openai.base_url. Used for Azure OpenAI endpoints or
    #            local model servers that implement the OpenAI protocol.

    embedding_model="text-embedding-3-small",
    # Plain: "Which model to use when converting text to numbers (embeddings)"
    # Technical: Default model ID for embed() / aembed() calls.
    #            Passed to openai.embeddings.create(model=...).

    default_prompt=None,
    # Plain: "A prompt to use for every request unless I override it"
    # Technical: RactoPrompt instance used when ChatConfig.prompt is None.
    #            If both are None, kit.chat() raises ValueError.

    exact_cache=None,
    # Plain: "Store answers so I don't pay for the same question twice"
    # Technical: ExactMatchCache instance. On a byte-identical request the cached
    #            LLMResponse is returned without an API call. O(1) lookup.

    semantic_cache=None,
    # Plain: "Store answers and also reuse them for questions that mean the same thing"
    # Technical: SemanticCache instance. Uses cosine similarity on embeddings.
    #            Returns cached response when similarity >= threshold.

    router=None,
    # Plain: "Automatically pick the cheapest model that can handle each question"
    # Technical: CostAwareRouter instance. Routes each request to the first tier
    #            whose max_score >= the computed prompt complexity score.
    #            Required when model="auto".

    truncator=None,
    # Plain: "Automatically shorten old conversation history if it gets too long"
    # Technical: TokenTruncator instance. Trims history messages to keep total
    #            token count within the model's context window before each API call.
)
```
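
The `router` comment above describes picking "the first tier whose `max_score` >= the computed prompt complexity score". A minimal sketch of that selection rule follows. The `Tier` dataclass, the 0 to 100 score scale, the cheapest-first ordering, and the fallback to the last tier are all assumptions made for illustration; they are not taken from the `CostAwareRouter` API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    model: str
    max_score: int  # highest complexity score this tier should handle

def pick_model(tiers: list[Tier], score: int) -> str:
    """Return the model of the first tier whose max_score covers the score."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for tier in tiers:  # tiers are assumed to be ordered cheapest first
        if tier.max_score >= score:
            return tier.model
    return tiers[-1].model  # assumed fallback: the most capable tier

tiers = [Tier("gpt-4o-mini", 40), Tier("gpt-4o", 100)]
print(pick_model(tiers, 25))  # gpt-4o-mini
print(pick_model(tiers, 70))  # gpt-4o
```

Ordering tiers from cheapest to most capable means a simple question never reaches the expensive model unless its score demands it.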

---

## 7. Your First Chat

Let's put it all together — a complete, working example.

```python
import os
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

# 1. Define who the AI is and what it should do
prompt = RactoPrompt(
    role="You are a helpful Python tutor.",
    aim="Explain the concept the user asks about in simple terms.",
    constraints=["Use beginner-friendly language.", "Keep the answer under 3 sentences."],
    tone="Warm, encouraging, and clear.",
    output_format="text",
)

# 2. Create the kit (reads OPENAI_API_KEY from environment automatically)
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=prompt,
)

# 3. Send a message and get a response
response = kit.chat(gpt.ChatConfig(user_message="What is a Python list?"))

print(response.content)
# A list in Python is an ordered collection of items that can hold any type
# of data — numbers, strings, even other lists. You create one with square
# brackets, like my_list = [1, "hello", True]. You can add, remove, or
# change items at any time!

print(f"Tokens used: {response.usage}")
# Tokens used: {'prompt_tokens': 127, 'completion_tokens': 54, 'total_tokens': 181}

print(f"Why did generation stop: {response.finish_reason}")
# Why did generation stop: FinishReason.STOP

# Provider-specific fields (e.g. which model ran) live in the raw response:
print(response.raw.model)   # gpt-4o-mini  (OpenAI ChatCompletion object)
```

### What is `LLMResponse`?

The return type of `kit.chat()` is an `LLMResponse` object. Here are its key fields:

| Field | Type | Plain English | Technical |
|---|---|---|---|
| `content` | `str \| None` | The AI's answer as a string | Raw text of the completion (markdown fences auto-stripped) |
| `parsed` | `dict \| list \| None` | The answer as structured data (when response is valid JSON) | JSON-decoded via `try_parse_json()`; further validated when `response_model` is set |
| `finish_reason` | `FinishReason` | Why the AI stopped generating | Enum: `STOP` (natural end), `LENGTH` (hit max_tokens), `TOOL_CALL` |
| `usage` | `dict[str, int]` | How many tokens were used | `prompt_tokens`, `completion_tokens`, `total_tokens` |
| `tool_calls` | `list[ToolCallResult]` | Any tools the AI wanted to call | Non-empty when the model returns a function-call intent |
| `raw` | `Any` | The raw provider response object | Original SDK object (e.g. `openai.types.chat.ChatCompletion`); use `response.raw.model` to get the model name |

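Because `usage` is a plain dictionary of token counts, you can turn it into a rough cost estimate yourself. The sketch below uses hypothetical prices (the numbers in `PRICE_PER_MILLION` are invented for illustration, so check your provider's current price list), and `estimate_cost` is not a RactoGateway function.

```python
# Hypothetical prices in USD per 1 million tokens; replace with real ones.
PRICE_PER_MILLION = {"prompt": 0.50, "completion": 1.50}

def estimate_cost(usage: dict[str, int]) -> float:
    """Rough request cost in USD from prompt and completion token counts."""
    prompt_cost = usage["prompt_tokens"] * PRICE_PER_MILLION["prompt"] / 1_000_000
    completion_cost = usage["completion_tokens"] * PRICE_PER_MILLION["completion"] / 1_000_000
    return prompt_cost + completion_cost

# Same shape as the usage dictionary printed in the first-chat example
usage = {"prompt_tokens": 127, "completion_tokens": 54, "total_tokens": 181}
print(f"Estimated cost: ${estimate_cost(usage):.6f}")
```
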
---

## 8. ChatConfig

`ChatConfig` is the object you pass to every `chat()`, `achat()`, `stream()`, and `astream()` call. It controls the details of a single request.

```python
from pydantic import BaseModel
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

class ProductReview(BaseModel):
    sentiment: str          # "positive" | "neutral" | "negative"
    score: int              # 1–10
    summary: str

config = gpt.ChatConfig(
    user_message="The keyboard is amazing but the battery dies in 3 hours.",
    # Plain: "The question or text you want to send to the AI"
    # Technical: The human turn content. Minimum 1 character (enforced by Pydantic).

    prompt=RactoPrompt(
        role="You are a product review classifier.",
        aim="Classify the review and return a structured analysis.",
        constraints=["Scores must be integers from 1 to 10."],
        tone="Neutral and objective.",
        output_format=ProductReview,
    ),
    # Plain: "Override the kit's default prompt for just this one request"
    # Technical: Per-request RactoPrompt. Takes precedence over kit.default_prompt.
    #            If both are None, raises ValueError.

    temperature=0.0,
    # Plain: "How predictable vs. creative should the answer be?"
    # Technical: Sampling temperature. Float in [0.0, 2.0].
    #   0.0 → greedy (argmax) decoding; near-deterministic, though providers do not guarantee identical output
    #   ~0.7 → balanced creativity/coherence (good for most tasks)
    #   1.5+ → very random; may become incoherent for structured tasks

    max_tokens=512,
    # Plain: "Maximum length of the AI's answer"
    # Technical: Hard cap on completion tokens. If the model hasn't finished,
    #            generation stops and finish_reason becomes LENGTH.
    #            Default is 4096. Keep lower for short structured tasks to save cost.

    response_model=ProductReview,
    # Plain: "Validate the AI's JSON answer against this Python class"
    # Technical: type[BaseModel]. After the API call, the raw JSON content is
    #            parsed and validated via ProductReview.model_validate().
    #            On repeated failure, ResponseModelValidationError is raised.
    #            If omitted and prompt.output_format is a BaseModel, the kit
    #            infers response_model automatically.

    history=[],
    # Plain: "Previous messages in the conversation (for multi-turn chat)"
    # Technical: list[Message]. Each Message has role (user/assistant/system) and
    #            content (str). Injected between the system prompt and the current
    #            user message. Managed manually or via RedisChatMemory.

    tools=None,
    # Plain: "Python functions the AI is allowed to call"
    # Technical: ToolRegistry instance. The adapter serialises its schemas into
    #            provider-specific function-calling format before the API call.

    auto_execute_tools=False,
    # Plain: "Should the kit execute tool calls automatically and return final content?"
    # Technical: If True, chat()/achat() run a local tool loop:
    #            LLM tool call -> execute registry callables -> follow-up LLM call.

    max_tool_turns=3,
    # Plain: "How many tool-call rounds are allowed in auto mode?"
    # Technical: Safety cap for auto_execute_tools loop. Range 1..10.

    extra={},
    # Plain: "Any other provider-specific settings I want to pass"
    # Technical: Pass-through dict merged into the API request kwargs.
    #            E.g. extra={"seed": 42, "top_p": 0.9, "stop": ["\n\n"]}
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")  # reads OPENAI_API_KEY from the environment
response = kit.chat(config)
print(response.parsed)
# {'sentiment': 'neutral', 'score': 5, 'summary': 'Great keyboard but very poor battery life.'}
```
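
To build intuition for the `temperature` setting, here is the textbook formulation: divide the model's raw scores (logits) by the temperature, then apply softmax. This is a sketch of the general idea, not a description of how any particular provider implements sampling; the logits are made-up numbers and `next_token_probs` is a hypothetical helper.

```python
import math

def next_token_probs(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits scaled by 1/temperature; temperature 0 means argmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"list": 2.0, "array": 1.0, "banana": -1.0}  # made-up scores
for t in (0.0, 0.7, 1.5):
    probs = next_token_probs(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

As the temperature rises, probability mass moves away from the top candidate toward less likely tokens, which is why high values can produce incoherent structured output.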

---

## 9. Structured Output

Structured output is one of the most powerful features: you get validated, structured data back from the AI instead of raw text.

### Step 1 — Define your output shape with Pydantic

```python
from pydantic import BaseModel

class WeatherReport(BaseModel):
    city: str
    temperature_celsius: float
    condition: str          # e.g. "sunny", "rainy", "cloudy"
    uv_index: int
```

### Step 2 — Pass the class as `output_format` in RactoPrompt

```python
from ractogateway.prompts.engine import RactoPrompt

prompt = RactoPrompt(
    role="You are a weather data formatter.",
    aim="Parse the user's description into a structured weather report.",
    constraints=["Always use Celsius.", "UV index must be 0–11."],
    tone="Concise and data-focused.",
    output_format=WeatherReport,   # <-- the Pydantic class
)
```

### Step 3 — Also pass it as `response_model` in ChatConfig

```python
from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)

config = gpt.ChatConfig(
    user_message="London, 18 degrees, overcast, UV 3.",
    response_model=WeatherReport,   # <-- validates the parsed JSON
)

response = kit.chat(config)

# response.parsed is a dict already validated against WeatherReport
print(response.parsed)
# {'city': 'London', 'temperature_celsius': 18.0, 'condition': 'overcast', 'uv_index': 3}

# To get a proper WeatherReport instance:
report = WeatherReport(**response.parsed)
print(report.city)           # London
print(report.uv_index)       # 3
print(type(report))          # <class '__main__.WeatherReport'>
```
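
Models sometimes wrap JSON in a markdown code fence, which is why `content` is described earlier as having fences stripped before parsing. A minimal sketch of that kind of lenient parsing is shown below. `parse_json_loose` is a hypothetical helper written for this guide; it is not the library's `try_parse_json()` and may behave differently.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def parse_json_loose(text: str):
    """Strip one surrounding markdown fence, then try to decode JSON."""
    lines = text.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        lines = lines[1:-1]  # drop the opening and closing fence lines
    try:
        return json.loads("\n".join(lines))
    except json.JSONDecodeError:
        return None  # not JSON: the caller falls back to the raw text

reply = FENCE + 'json\n{"city": "London", "uv_index": 3}\n' + FENCE
print(parse_json_loose(reply))         # {'city': 'London', 'uv_index': 3}
print(parse_json_loose("Just text."))  # None
```
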

> **Why two places?** `output_format` in `RactoPrompt` tells the LLM what to generate by embedding the JSON Schema in the system prompt. `response_model` in `ChatConfig` validates the output in Python. Setting both makes the intent explicit; if you omit `response_model`, the kit infers it automatically when `prompt.output_format` is a Pydantic model class.

---

### 9.1 Complex Nested Structured Output — Enterprise Vendor Evaluation

Real-world schemas are deeply nested with enums, constrained integers,
and lists of sub-models. This example shows a board-level vendor risk
evaluation with six sub-models.

> **Key Rule: always make score ranges explicit in your constraints.**
> Pydantic enforces bounds only after the response arrives (as a
> validation error, not an API error), so the model may ignore a range
> that appears only in the embedded schema. Use `conint(ge=1, le=100)`
> for percentage-like scores and tell the model
> `"all scores are integers on a 1–100 scale"` in the constraints list.

```python
from typing import List, Literal
from pydantic import BaseModel, conint, confloat
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt


# ── Sub-models ─────────────────────────────────────────────────────────────

class FinancialRisk(BaseModel):
    burn_rate_risk: Literal["low", "medium", "high"]
    runway_months: conint(ge=0, le=60)
    profitability_projection_years: conint(ge=0, le=10)
    financial_score: conint(ge=1, le=100)          # 1–100, higher = healthier finances


class SecurityAssessment(BaseModel):
    data_encryption: Literal["none", "at_rest_only", "at_rest_and_in_transit"]
    iso_certified: bool
    soc2_certified: bool
    gdpr_compliant: bool
    vulnerabilities_found: conint(ge=0, le=100)
    security_score: conint(ge=1, le=100)           # 1–100, higher = more secure


class TechnicalArchitecture(BaseModel):
    architecture_style: Literal["monolith", "microservices", "serverless", "hybrid"]
    cloud_provider: Literal["aws", "gcp", "azure", "multi-cloud", "on-prem"]
    scalability_rating: conint(ge=1, le=100)       # 1–100, higher = more scalable
    reliability_sla: confloat(ge=0.0, le=100.0)
    vendor_lock_in_risk: Literal["low", "medium", "high"]


class RiskMatrix(BaseModel):
    category: Literal["financial", "security", "technical", "operational"]
    probability: Literal["low", "medium", "high"]
    impact: Literal["low", "medium", "high"]
    mitigation_strategy: str


class MigrationPhase(BaseModel):
    phase_name: str
    duration_months: conint(ge=1, le=36)
    complexity_score: conint(ge=1, le=10)          # 1–10 scale (task complexity)
    key_deliverables: List[str]


class FinalRecommendation(BaseModel):
    decision: Literal["approve", "approve_with_conditions", "reject"]
    confidence_score: conint(ge=1, le=100)
    key_strengths: List[str]
    critical_weaknesses: List[str]
    board_summary: str


class VendorEvaluation(BaseModel):
    vendor_name: str
    industry: str
    annual_contract_value_usd: conint(ge=10_000, le=10_000_000)

    financial_risk: FinancialRisk
    security_assessment: SecurityAssessment
    technical_architecture: TechnicalArchitecture

    top_risks: List[RiskMatrix]
    migration_plan: List[MigrationPhase]

    overall_risk_score: conint(ge=1, le=100)       # 1–100, higher = riskier

    final_recommendation: FinalRecommendation


# ── User input ─────────────────────────────────────────────────────────────

vendor_brief = """
We are evaluating NeuroStack AI as a strategic enterprise AI vendor.

Company Profile:
- 3 years old, monthly burn rate: $1.2M, raised $25M Series A
- Not profitable; expected profitability in 4–5 years

Security:
- ISO 27001 certified, no SOC 2, encryption at rest and in transit
- 3 minor vulnerabilities last year, GDPR compliant

Technical:
- Hybrid architecture hosted on AWS, SLA 99.2%
- Heavy proprietary API usage; deep workflow integration required

Financials:
- Annual contract: $2.4M, operational dependency: Critical
- Moderate probability of vendor collapse in next 18 months
"""

# ── Prompt ─────────────────────────────────────────────────────────────────

kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

config = gpt.ChatConfig(
    user_message=vendor_brief,
    prompt=RactoPrompt(
        role="You are a Chief Risk Officer conducting a board-level enterprise vendor risk evaluation.",
        aim="Produce a structured, multi-dimensional vendor evaluation strictly matching the schema.",
        constraints=[
            # ✅ Always state numeric ranges explicitly — do not rely on the model
            #    guessing Pydantic bounds from the schema description alone.
            "financial_score, security_score, scalability_rating, overall_risk_score, and confidence_score are all integers on a 1–100 scale.",
            "complexity_score inside each MigrationPhase is an integer on a 1–10 scale.",
            "runway_months must be derived from (cash raised ÷ monthly burn) realistically.",
            "overall_risk_score must reflect the sub-scores logically.",
            "decision must align with overall_risk_score: ≤35 approve, 36–65 approve_with_conditions, >65 reject.",
            "Provide at least 3 top_risks entries.",
            "Provide exactly 3 migration phases.",
        ],
        tone="Executive, analytical, objective.",
        output_format=VendorEvaluation,
    ),
    temperature=0.0,
    max_tokens=2000,
    response_model=VendorEvaluation,
)

# ── Execute ────────────────────────────────────────────────────────────────

from ractogateway.exceptions import ResponseModelValidationError

try:
    response = kit.chat(config)
    print("======== PARSED STRUCTURED OUTPUT ========")
    print(response.parsed)
    print("\n======== RAW JSON OUTPUT ========")
    print(response.content)
except ResponseModelValidationError as e:
    print(f"Validation failed after {e.attempts} attempt(s)")
    print(f"Last error: {e.last_error}")
    print(f"Raw output: {e.raw_response}")
```

**Expected output (values will vary slightly with the model):**

```text
======== PARSED STRUCTURED OUTPUT ========
{
  'vendor_name': 'NeuroStack AI',
  'industry': 'Artificial Intelligence',
  'annual_contract_value_usd': 2400000,
  'financial_risk': {
    'burn_rate_risk': 'high', 'runway_months': 20,
    'profitability_projection_years': 4, 'financial_score': 40
  },
  'security_assessment': {
    'data_encryption': 'at_rest_and_in_transit',
    'iso_certified': True, 'soc2_certified': False, 'gdpr_compliant': True,
    'vulnerabilities_found': 3, 'security_score': 70
  },
  'technical_architecture': {
    'architecture_style': 'hybrid', 'cloud_provider': 'aws',
    'scalability_rating': 75, 'reliability_sla': 99.2, 'vendor_lock_in_risk': 'high'
  },
  ...
  'overall_risk_score': 55,
  'final_recommendation': {
    'decision': 'approve_with_conditions', 'confidence_score': 65, ...
  }
}
```
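
Pydantic checks types and bounds, but cross-field rules such as "decision must align with overall_risk_score" live only in the prompt. A minimal sketch of checking them yourself after the call is shown below. The helper names are hypothetical, and rounding the runway down to whole months is an assumption; the values come from the vendor brief and the sample output above.

```python
import math

def expected_decision(overall_risk_score: int) -> str:
    """Map a 1-100 risk score to the decision band stated in the constraints."""
    if not 1 <= overall_risk_score <= 100:
        raise ValueError("overall_risk_score must be between 1 and 100")
    if overall_risk_score <= 35:
        return "approve"
    if overall_risk_score <= 65:
        return "approve_with_conditions"
    return "reject"

def runway_months(cash_raised_usd: float, monthly_burn_usd: float) -> int:
    """Whole months of runway: cash raised divided by monthly burn, rounded down."""
    return math.floor(cash_raised_usd / monthly_burn_usd)

parsed = {  # trimmed copy of the sample output above
    "financial_risk": {"runway_months": 20},
    "overall_risk_score": 55,
    "final_recommendation": {"decision": "approve_with_conditions"},
}

decision_ok = parsed["final_recommendation"]["decision"] == expected_decision(parsed["overall_risk_score"])
runway_ok = parsed["financial_risk"]["runway_months"] == runway_months(25_000_000, 1_200_000)
print(decision_ok, runway_ok)  # True True
```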

---

### 9.2 Validation Retries and `ResponseModelValidationError`

When `response_model` is set, RactoGateway automatically retries the API call
with a targeted correction prompt if Pydantic rejects the output. This is
controlled by `max_validation_retries` in `ChatConfig` (default: **2**).

**Retry flow:**

1. Initial API call → Pydantic validation attempt.
2. On failure → the exact field errors and the bad JSON are fed back to the LLM.
3. The LLM is asked to return a corrected JSON (keeping all valid fields).
4. Steps 2–3 repeat up to `max_validation_retries` times.
5. If all attempts fail → `ResponseModelValidationError` is raised.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
from ractogateway.exceptions import ResponseModelValidationError
from pydantic import BaseModel, conint

class Score(BaseModel):
    label: str
    value: conint(ge=1, le=10)   # strict 1–10

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

config = gpt.ChatConfig(
    user_message="Rate 'Python' as a programming language.",
    prompt=RactoPrompt(
        role="You are a language evaluator.",
        aim="Return a score for the given language.",
        constraints=["value must be an integer from 1 to 10."],
        tone="Concise.",
        output_format=Score,
    ),
    response_model=Score,
    max_validation_retries=2,   # default — retry up to 2 times on bad output
)

try:
    response = kit.chat(config)
    print(response.parsed)   # {'label': 'Python', 'value': 9}
except ResponseModelValidationError as e:
    # All retries exhausted — inspect what went wrong
    print(f"Failed after {e.attempts} attempt(s)")
    print(f"Last Pydantic error: {e.last_error}")
    print(f"Raw LLM output:      {e.raw_response}")
```
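
The retry flow described at the start of this section can be sketched without any provider or Pydantic dependency. Everything here is a stand-in under stated assumptions: `ValidationFailed` mimics `ResponseModelValidationError`, `validate_score` replaces Pydantic validation, `call_llm` is any function that takes a message and returns text, and the wording of the correction message is invented.

```python
import json

class ValidationFailed(Exception):
    """Hypothetical stand-in for ResponseModelValidationError."""

    def __init__(self, attempts: int, last_error: Exception, raw_response: str):
        super().__init__(f"validation failed after {attempts} attempt(s)")
        self.attempts = attempts
        self.last_error = last_error
        self.raw_response = raw_response

def validate_score(raw: str) -> dict:
    """Stand-in for Pydantic: 'value' must be an integer from 1 to 10."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    value = data.get("value") if isinstance(data, dict) else None
    if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
        raise ValueError(f"value must be an integer from 1 to 10, got {value!r}")
    return data

def chat_with_retries(call_llm, user_message: str, max_validation_retries: int = 2) -> dict:
    if max_validation_retries < 0:
        raise ValueError("max_validation_retries must be >= 0")
    message = user_message
    for attempt in range(1, max_validation_retries + 2):  # 1 initial call + N retries
        raw = call_llm(message)
        try:
            return validate_score(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the error and the rejected JSON back and ask for a correction
            message = (
                f"{user_message}\n\nYour previous reply failed validation: {exc}\n"
                f"Previous reply: {raw}\nReturn corrected JSON only."
            )
    raise ValidationFailed(attempt, last_error, raw)

# Fake model: wrong on the first call, correct on the second
replies = iter(['{"label": "Python", "value": 11}', '{"label": "Python", "value": 9}'])
result = chat_with_retries(lambda _msg: next(replies), "Rate 'Python'.")
print(result)  # {'label': 'Python', 'value': 9}
```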

**`ResponseModelValidationError` attributes:**

| Attribute | Type | Meaning |
|---|---|---|
| `attempts` | `int` | Total API calls made (1 initial + N retries) |
| `last_error` | `pydantic.ValidationError` | The final Pydantic error |
| `raw_response` | `str \| None` | Raw text from the last LLM attempt |

**`max_validation_retries` in `ChatConfig`:**

| Value | Behaviour |
|---|---|
| `0` | No retries — raise immediately on first validation failure |
| `1` | One retry after the initial call |
| `2` | Two retries (default) |
| `3–5` | More retries for complex schemas (max allowed: 5) |
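
The retry flow above can be sketched without the library. This is a minimal illustration, not RactoGateway's implementation: `ValidationFailed`, `validate_score`, `chat_with_retries`, and `fake_llm` are hypothetical names, and a hand-written check stands in for Pydantic so the snippet runs on the standard library alone.

```python
import json
from typing import Callable


class ValidationFailed(Exception):
    """Hypothetical stand-in for ResponseModelValidationError."""

    def __init__(self, attempts: int, last_error: str, raw_response: str):
        super().__init__(f"validation failed after {attempts} attempt(s)")
        self.attempts = attempts
        self.last_error = last_error
        self.raw_response = raw_response


def validate_score(raw: str) -> dict:
    """Hand-written check standing in for Pydantic: value must be an int in 1..10."""
    data = json.loads(raw)
    value = data.get("value")
    if not isinstance(value, int) or not 1 <= value <= 10:
        raise ValueError(f"value must be an integer from 1 to 10, got {value!r}")
    return data


def chat_with_retries(
    call_llm: Callable[[str], str],
    prompt: str,
    max_validation_retries: int = 2,
) -> dict:
    attempts = 0
    message = prompt
    while True:
        raw = call_llm(message)
        attempts += 1
        try:
            return validate_score(raw)
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            if attempts > max_validation_retries:
                raise ValidationFailed(attempts, str(err), raw) from err
            # Feed the bad output and the exact error back to the model.
            message = (
                f"{prompt}\n\nYour previous reply was:\n{raw}\n"
                f"It failed validation: {err}\n"
                "Return corrected JSON, keeping all valid fields."
            )


# Fake model: the first reply is out of range, the second one is valid.
replies = iter(['{"label": "Python", "value": 42}', '{"label": "Python", "value": 9}'])


def fake_llm(message: str) -> str:
    return next(replies)


result = chat_with_retries(fake_llm, "Rate 'Python' as a programming language.")
print(result)
```

With `max_validation_retries=2`, a model that never produces valid output is called three times in total, which matches the "1 initial + N retries" meaning of `attempts`.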

> **Streaming note:** `stream()` and `astream()` cannot retry because content
> is already delivered token-by-token. If validation fails on the final chunk,
> `ResponseModelValidationError` is raised directly. Wrap your stream loop in
> `try/except ResponseModelValidationError` if you use `response_model` with
> streaming.

---

## 10. Multi-Turn Conversations

To have a conversation with memory, pass the `history` list to each `ChatConfig`:

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway._models.chat import Message, MessageRole
from ractogateway.prompts.engine import RactoPrompt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=RactoPrompt(
        role="You are a helpful AI assistant.",
        aim="Carry on a friendly conversation.",
        constraints=["Remember what the user said earlier."],
        tone="Casual and friendly.",
        output_format="text",
    ),
)

# Turn 1
response1 = kit.chat(gpt.ChatConfig(user_message="My name is Alice."))
print(response1.content)
# Nice to meet you, Alice! How can I help you today?

# Build the history from turn 1
history = [
    Message(role=MessageRole.USER, content="My name is Alice."),
    Message(role=MessageRole.ASSISTANT, content=response1.content),
]

# Turn 2 — the model now "remembers" turn 1
response2 = kit.chat(gpt.ChatConfig(
    user_message="What is my name?",
    history=history,   # <-- inject previous turns
))
print(response2.content)
# Your name is Alice! 😊
```

**Tip:** For production multi-user apps, use `RedisChatMemory` (see §18) to store history in Redis so it survives server restarts.
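
Rebuilding the history by hand gets repetitive after a few turns. The sketch below shows the bookkeeping pattern with plain dictionaries; `run_turn` and `fake_ask` are hypothetical helpers (not part of RactoGateway), and `fake_ask` stands in for a real `kit.chat()` call.

```python
from typing import Callable

History = list[dict[str, str]]


def run_turn(history: History, user_message: str, ask: Callable[[History, str], str]) -> str:
    """Send one turn, then record both sides of it in the history."""
    reply = ask(history, user_message)
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_ask(history: History, user_message: str) -> str:
    """Hypothetical model stand-in: reports how many earlier messages it was given."""
    return f"I can see {len(history)} earlier message(s)."


history: History = []
print(run_turn(history, "My name is Alice.", fake_ask))
print(run_turn(history, "What is my name?", fake_ask))
print(len(history))
```

Each turn adds two entries (user, then assistant), so the model always receives every earlier exchange in order.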

---

## 11. Streaming

Streaming lets you display the AI's answer word-by-word as it is generated — much better UX than waiting for the full response.

### Synchronous Streaming

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=RactoPrompt(
        role="You are a storyteller.",
        aim="Write a short story based on the user's prompt.",
        constraints=["Keep it under 100 words."],
        tone="Vivid and imaginative.",
        output_format="text",
    ),
)

config = gpt.ChatConfig(user_message="A robot discovers it can dream.")

for chunk in kit.stream(config):
    # chunk.delta.text is the new text in this chunk (may be empty string)
    print(chunk.delta.text, end="", flush=True)

    if chunk.is_final:
        print()  # newline after the story
        print(f"Finish reason: {chunk.finish_reason}")
        print(f"Total tokens:  {chunk.usage.get('total_tokens', '?')}")
```

**Expected output (streaming, printed token-by-token):**

```
In the hum of the server room, Unit-7 closed its optical sensors...
and dreamed of open fields and laughter it had never known.
When it woke, it understood why humans called sleep a gift.

Finish reason: FinishReason.STOP
Total tokens:  112
```

### Asynchronous Streaming

```python
import asyncio
from ractogateway import openai_developer_kit as gpt

# Reuses `kit` and `config` from the synchronous example above.
async def main():
    async for chunk in kit.astream(config):
        print(chunk.delta.text, end="", flush=True)
        if chunk.is_final:
            break

asyncio.run(main())
```

### What is `StreamChunk`?

| Field | Plain English | Technical |
|---|---|---|
| `delta.text` | New text arrived in this chunk | Incremental token string from the current event |
| `accumulated_text` | Everything generated so far | Concatenation of all `delta.text` values received so far |
| `is_final` | Is this the last chunk? | `True` when `finish_reason` is set |
| `finish_reason` | Why did generation end? | `FinishReason.STOP`, `LENGTH`, or `TOOL_CALL` |
| `usage` | Token counts (only in final chunk) | Dict with `prompt_tokens`, `completion_tokens`, `total_tokens` |
| `tool_calls` | Tools the model wants to call | Non-empty list when `finish_reason == TOOL_CALL` |
| `parsed` | Parsed + validated object (if `response_model` set) | Available on final chunk only |
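
To make the relationship between `delta.text`, `accumulated_text`, and `is_final` concrete, here is a small simulation. `Chunk` and `fake_stream` are hypothetical, simplified stand-ins for `StreamChunk` and `kit.stream()`; the real objects carry more fields.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Chunk:
    """Hypothetical, simplified stand-in for StreamChunk."""
    delta_text: str
    accumulated_text: str
    is_final: bool
    finish_reason: Optional[str] = None


def fake_stream(tokens: list[str]) -> Iterator[Chunk]:
    accumulated = ""
    for i, token in enumerate(tokens):
        accumulated += token
        last = i == len(tokens) - 1
        # Only the last chunk carries a finish reason.
        yield Chunk(token, accumulated, last, "stop" if last else None)


chunks = list(fake_stream(["Unit-7 ", "closed ", "its ", "sensors."]))
full_text = "".join(c.delta_text for c in chunks)
print(full_text)
```

Joining every delta gives the same string as the final chunk's accumulated text, so you can rely on either approach when you need the full answer.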

---

## 12. Tool Calling

Tool calling lets the LLM trigger your Python functions. Useful for live data,
calculators, search, and business actions.

### Step 1 — Define tools and register them

```python
from ractogateway.tools.registry import tool, ToolRegistry

registry = ToolRegistry()

@tool(registry)
def get_weather(city: str, unit: str = "celsius") -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is 22°{'C' if unit == 'celsius' else 'F'} and sunny."

@tool(registry)
def get_time(timezone: str) -> str:
    """Return the current time in the given timezone."""
    from datetime import datetime
    import zoneinfo

    tz = zoneinfo.ZoneInfo(timezone)
    return datetime.now(tz).strftime("%H:%M on %A, %d %B %Y")

print(list(registry.tools.keys()))  # ['get_weather', 'get_time']
```

You can also use `@tool` without a registry and register later:

```python
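@tool
def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression (demo only)."""
    # eval() executes arbitrary code; never expose it to untrusted input.
    return eval(expression)  # noqa: S307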

registry.register(calculate)
```

### Step 2 — One-call final answer (recommended)

Set `auto_execute_tools=True` so the kit runs the requested tools for you and
`response.content` holds the final answer, just as it does for a non-tool request.

```python
from ractogateway.prompts.engine import RactoPrompt
from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",
    default_prompt=RactoPrompt(
        role="You are a helpful assistant with access to live data tools.",
        aim="Answer the user's question using the available tools.",
        constraints=["Always use the tools when relevant."],
        tone="Helpful and precise.",
        output_format="text",
    ),
)

config = gpt.ChatConfig(
    user_message="What's the weather like in Paris and what time is it there?",
    tools=registry,
    auto_execute_tools=True,
    max_tool_turns=3,
)

response = kit.chat(config)
print(response.content)  # Final integrated answer
```

### Step 3 — Manual tool loop (advanced)

If you prefer full control, keep `auto_execute_tools=False` (default) and execute
`response.tool_calls` yourself.

```python
response = kit.chat(
    gpt.ChatConfig(
        user_message="What's the weather in Tokyo and what is 12 * 8?",
        tools=registry,
    )
)

if response.tool_calls:
    for tc in response.tool_calls:
        fn = registry.get_callable(tc.name)
        if fn:
            print(tc.name, tc.arguments, "->", fn(**tc.arguments))
```

> **What is `ToolCallResult`?** It has three fields: `id` (unique call ID from the API),
> `name` (function name), and `arguments` (dict ready to `**unpack` into your function).
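
The manual loop boils down to "look up the function by name, unpack the arguments, collect the result". The sketch below shows that dispatch step with a plain dictionary; the `tools` dict, the `register` decorator, and the sample `tool_calls` are hypothetical stand-ins that only mirror the `id` / `name` / `arguments` shape described above.

```python
from typing import Any, Callable

tools: dict[str, Callable[..., Any]] = {}


def register(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical mini-registry keyed by function name."""
    tools[fn.__name__] = fn
    return fn


@register
def get_weather(city: str, unit: str = "celsius") -> str:
    return f"The weather in {city} is 22°{'C' if unit == 'celsius' else 'F'} and sunny."


@register
def multiply(a: float, b: float) -> float:
    return a * b


# Made-up tool calls in the id / name / arguments shape.
tool_calls = [
    {"id": "call_1", "name": "get_weather", "arguments": {"city": "Tokyo"}},
    {"id": "call_2", "name": "multiply", "arguments": {"a": 12, "b": 8}},
    {"id": "call_3", "name": "unknown_tool", "arguments": {}},
]


def execute(calls: list[dict]) -> dict[str, Any]:
    results: dict[str, Any] = {}
    for call in calls:
        fn = tools.get(call["name"])
        if fn is None:
            # Report unknown tools instead of crashing the whole loop.
            results[call["id"]] = f"error: no tool named {call['name']!r}"
            continue
        results[call["id"]] = fn(**call["arguments"])
    return results


results = execute(tool_calls)
print(results["call_2"])
```

Keying results by call `id` keeps each answer tied to the request that produced it, which matters when the model asks for several tools at once.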

---

## 13. File Attachments

Send images, PDFs, and text files alongside your text message using `RactoFile`.

```python
from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway import openai_developer_kit as gpt

prompt = RactoPrompt(
    role="You are a visual QA assistant.",
    aim="Describe what you see in the attached image.",
    constraints=["Be specific about colours, shapes, and text visible in the image."],
    tone="Descriptive and precise.",
    output_format="text",
)

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",   # must be a vision-capable model
    default_prompt=prompt,
)

# Load an image from disk (MIME type is auto-detected)
image = RactoFile.from_path("/path/to/screenshot.png")

# Or from raw bytes:
# image = RactoFile.from_bytes(open("photo.jpg","rb").read(), "image/jpeg")

messages = prompt.to_messages(
    user_message="What is shown in this image?",
    attachments=[image],
    provider="openai",   # formats content blocks for the correct provider
)

# You can also just use kit.chat() with a ChatConfig — attachments can be
# baked into the prompt's to_messages() call directly
```

### `RactoFile` Parameter Reference

| Method / Param | Plain English | Technical |
|---|---|---|
| `RactoFile.from_path(path)` | Load a file from your disk | Reads bytes and auto-detects MIME type via `mimetypes.guess_type` |
| `RactoFile.from_bytes(data, mime_type)` | Create from raw bytes you already have | No disk I/O; pass `bytes` + an explicit MIME type string |
| `data` | The file's raw bytes | `bytes` object |
| `mime_type` | What type of file it is | MIME string: `"image/png"`, `"image/jpeg"`, `"application/pdf"`, `"text/plain"`, etc. |
| `name` | An optional filename label | `str`; used for display/debugging only |
| `is_image` | Is it a picture? | `True` for JPEG, PNG, GIF, WEBP |
| `is_pdf` | Is it a PDF? | `True` for `application/pdf` |
| `base64_data` | File as a base64 string | Used internally by the provider adapters |
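
The table mentions MIME detection via `mimetypes.guess_type` and a base64 form of the file. The sketch below shows what those two steps look like with the standard library; `describe_file` is a hypothetical helper for illustration, not how `RactoFile` is necessarily implemented.

```python
import base64
import mimetypes

IMAGE_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def describe_file(name: str, data: bytes) -> dict:
    """Guess the MIME type from the filename and base64-encode the bytes."""
    mime_type, _ = mimetypes.guess_type(name)
    mime_type = mime_type or "application/octet-stream"  # fallback for unknown types
    return {
        "name": name,
        "mime_type": mime_type,
        "is_image": mime_type in IMAGE_TYPES,
        "is_pdf": mime_type == "application/pdf",
        "base64_data": base64.b64encode(data).decode("ascii"),
    }


info = describe_file("screenshot.png", b"\x89PNG\r\n\x1a\n")
print(info["mime_type"], info["is_image"], info["base64_data"])
```

Note that `mimetypes.guess_type` looks only at the file extension, not the file contents, so a wrongly named file gets the wrong MIME type. Pass the type explicitly with `from_bytes` when the extension cannot be trusted.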

---

## 14. Embeddings

Embeddings convert text into lists of numbers (vectors) where semantically similar texts end up numerically close. This powers semantic search, clustering, and RAG.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway._models.embedding import EmbeddingConfig

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

config = EmbeddingConfig(
    texts=["Python is a programming language.", "I love apples.", "Java is also a language."],
    # Plain: "The list of strings to convert into number vectors"
    # Technical: List[str] passed to openai.embeddings.create(input=...)

    model="text-embedding-3-small",
    # Plain: "Which embedding model to use"
    # Technical: Overrides kit.embedding_model for this specific call.
    #            None means use the kit's default.

    dimensions=None,
    # Plain: "How many numbers should each vector have?"
    # Technical: Optional int. text-embedding-3-small returns 1536 dimensions by
    #            default (text-embedding-3-large returns 3072); either can be
    #            reduced (e.g. to 256) for faster similarity search.
)

response = kit.embed(config)

for vec in response.vectors:
    print(f"Text:    {vec.text!r}")
    print(f"Index:   {vec.index}")
    print(f"Vector:  [{vec.embedding[0]:.4f}, {vec.embedding[1]:.4f}, ...]  (length {len(vec.embedding)})")
    print()
```

**Expected output:**

```
Text:    'Python is a programming language.'
Index:   0
Vector:  [0.0123, -0.0456, ...]  (length 1536)

Text:    'I love apples.'
Index:   1
Vector:  [-0.0234, 0.0789, ...]  (length 1536)

Text:    'Java is also a language.'
Index:   2
Vector:  [0.0118, -0.0451, ...]  (length 1536)
```

> **Pro tip:** Texts 0 and 2 should end up with noticeably closer vectors because both are about programming languages, while text 1 should sit farther from both. (The numbers in the expected output are illustrative.) This closeness is what powers embedding-based semantic search.
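
"Numerically close" is usually measured with cosine similarity. The sketch below computes it by hand on made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions); the vector values are invented purely for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example.
python_vec = [0.9, 0.1, 0.0]
java_vec = [0.8, 0.2, 0.1]
apples_vec = [0.0, 0.1, 0.9]

print(f"python vs java:   {cosine_similarity(python_vec, java_vec):.2f}")
print(f"python vs apples: {cosine_similarity(python_vec, apples_vec):.2f}")
```

The two "language" vectors score close to 1, while the "apples" vector scores close to 0 against either of them.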

---

## 15. Performance & Cost Optimisation

### 15.1 Exact Match Cache

**Plain English:** If someone asks the exact same question again (same words, same settings), return the cached answer instantly — no API call, no cost.

**Technical:** SHA-256 keyed over `(user_message, system_prompt, model, temperature, max_tokens)`. LRU eviction with optional TTL. Thread-safe via `threading.Lock`.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.cache import ExactMatchCache

cache = ExactMatchCache(
    max_size=1024,
    # Plain: "How many answers to remember at most"
    # Technical: LRU capacity. When full, the least-recently-used entry is evicted.
    #            0 = unlimited (no eviction ever).

    ttl_seconds=3600,
    # Plain: "Forget an answer after this many seconds"
    # Technical: Float. Entries older than ttl_seconds are treated as cache misses
    #            and lazily evicted on next access. None = never expire.
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", exact_cache=cache)

# First call — hits the API
r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
print(r1.content)   # Paris is the capital of France.

# Second call (identical) — served from cache in microseconds, $0 cost
r2 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
print(r2.content)   # Paris is the capital of France.

print(cache.stats)  # CacheStats(hits=1, misses=1, size=1)
```
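
For the curious, here is roughly how a SHA-256 cache key with LRU eviction and a TTL can be put together. This is a simplified sketch under stated assumptions: `cache_key` and `TinyLRUCache` are hypothetical names, the exact serialisation used by `ExactMatchCache` is not shown in this guide, and the sketch omits the thread lock.

```python
import hashlib
import json
import time
from collections import OrderedDict
from typing import Optional


def cache_key(user_message, system_prompt, model, temperature, max_tokens) -> str:
    """Hash every setting that affects the answer into one fixed-length key."""
    payload = json.dumps([user_message, system_prompt, model, temperature, max_tokens])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TinyLRUCache:
    def __init__(self, max_size: int, ttl_seconds: Optional[float] = None):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._entries: OrderedDict = OrderedDict()  # key -> (stored_at, value)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.ttl_seconds is not None and time.monotonic() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: evict lazily on access
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        if self.max_size and len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # drop the least recently used entry


cache = TinyLRUCache(max_size=2, ttl_seconds=3600)
key = cache_key("What is the capital of France?", "You are helpful.", "gpt-4o-mini", 0.0, 256)
cache.put(key, "Paris is the capital of France.")
same_key = cache_key("What is the capital of France?", "You are helpful.", "gpt-4o-mini", 0.0, 256)
print(cache.get(same_key))
```

Because the temperature and token limit are part of the key, changing either setting produces a different key and therefore a cache miss.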

---

### 15.2 Semantic Cache

**Plain English:** Even if the question is *worded differently*, return the cached answer if it means the same thing.

**Technical:** Embeds each new query and computes cosine similarity against stored embeddings. Returns the cached response when similarity ≥ threshold.

```python
from ractogateway.cache import SemanticCache
import ractogateway.openai_developer_kit as gpt

# You supply an embedding function — any callable (str) -> list[float]
kit_for_embed = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

def embed(text: str) -> list[float]:
    from ractogateway._models.embedding import EmbeddingConfig
    resp = kit_for_embed.embed(EmbeddingConfig(texts=[text]))
    return resp.vectors[0].embedding

sem_cache = SemanticCache(
    embedder=embed,
    # Plain: "A function that converts text to a list of numbers"
    # Technical: Callable[[str], list[float]]. Called once for each new query
    #            to compute its embedding for similarity comparison.

    similarity_threshold=0.92,
    # Plain: "How similar does a question have to be to reuse a cached answer?"
    # Technical: Float in (0, 1]. Cosine similarity minimum. Higher = stricter match.
    #            0.92 works well; lower (e.g. 0.85) gives more cache hits but may
    #            return wrong answers for loosely-related questions.

    max_size=512,
    # Plain: "How many answers to remember"
    # Technical: LRU capacity for the semantic cache store.
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", semantic_cache=sem_cache)

# First call
r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
# → API call happens

# Different wording, same meaning — cache HIT (if similarity >= 0.92)
r2 = kit.chat(gpt.ChatConfig(user_message="Which city is France's capital?"))
# → No API call; cached answer returned
```
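
The lookup logic can be pictured as "embed the query, find the closest stored embedding, reuse its answer if it is close enough". The sketch below is a toy version of that idea: `TinySemanticCache` and the keyword-counting `toy_embedder` are hypothetical stand-ins, and a real embedding model would replace the embedder.

```python
import math
from typing import Callable, Optional


def _cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # an all-zero vector matches nothing
    return sum(x * y for x, y in zip(a, b)) / norm


class TinySemanticCache:
    def __init__(self, embedder: Callable[[str], list[float]], similarity_threshold: float = 0.92):
        self.embedder = embedder
        self.similarity_threshold = similarity_threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self.embedder(query), answer))

    def get(self, query: str) -> Optional[str]:
        query_vec = self.embedder(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = _cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only reuse the answer when the closest match clears the threshold.
        return best_answer if best_score >= self.similarity_threshold else None


def toy_embedder(text: str) -> list[float]:
    """Hypothetical embedder: counts a few keywords instead of calling a model."""
    words = text.lower().replace("?", "").replace("'s", "").split()
    return [float(words.count(w)) for w in ("capital", "france", "weather")]


sem = TinySemanticCache(toy_embedder, similarity_threshold=0.92)
sem.put("What is the capital of France?", "Paris")
print(sem.get("Which city is France's capital?"))
print(sem.get("What is the weather in France?"))
```

The second lookup misses on purpose: it shares one keyword with the cached question but its similarity is well below the threshold, which is the behaviour a strict threshold is meant to guarantee.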

---

### 15.3 Token Truncation

**Plain English:** Long conversations can overflow the AI's memory limit. The truncator automatically cuts old messages to keep things within bounds.

**Technical:** Sliding-window strategy over `ChatConfig.history`. Keeps `keep_first_n` messages and `keep_last_n` messages; drops the middle. Uses `len(text) // 4` as a token estimator by default, or `tiktoken` for precision.

```python
from ractogateway.truncation import TokenTruncator, TruncationConfig, MODEL_CONTEXT_LIMITS
from ractogateway import openai_developer_kit as gpt

truncator = TokenTruncator(TruncationConfig(
    keep_first_n=2,
    # Plain: "Always keep the first N history messages (e.g. important instructions)"
    # Technical: int. These messages are never evicted, regardless of token count.

    keep_last_n=8,
    # Plain: "Always keep the most recent N messages"
    # Technical: int. Recent context is preserved; only the 'middle' is dropped.

    safety_margin=512,
    # Plain: "Leave room for the model's reply"
    # Technical: Tokens reserved for the completion. Effective limit =
    #            context_window - safety_margin.

    token_counter=None,
    # Plain: "How to count tokens (leave blank for fast estimate)"
    # Technical: Optional Callable[[str], int]. When None, uses len(text) // 4.
    #            For precision, pass tiktoken: lambda t: len(enc.encode(t))
))

kit = gpt.OpenAIDeveloperKit(model="gpt-4o", truncator=truncator)
# Now every kit.chat() / kit.achat() call will auto-trim history before sending.

# Check the context limit for any model:
print(MODEL_CONTEXT_LIMITS["gpt-4o"])         # 128000
print(MODEL_CONTEXT_LIMITS["gpt-4o-mini"])    # 128000
print(MODEL_CONTEXT_LIMITS["claude-opus-4-6"])  # 200000
```
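
Here is a small, library-free sketch of the sliding-window idea using the `len(text) // 4` estimate. `truncate_history` is a hypothetical function; in this sketch the oldest middle messages are dropped first until the rest fits, which is one reasonable reading of "drops the middle" and may differ from what `TokenTruncator` does internally.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token."""
    return len(text) // 4


def truncate_history(
    history: list[str],
    max_tokens: int,
    keep_first_n: int = 2,
    keep_last_n: int = 8,
) -> list[str]:
    total = sum(estimate_tokens(m) for m in history)
    if total <= max_tokens or len(history) <= keep_first_n + keep_last_n:
        return list(history)  # already fits, or there is no middle to drop

    head = history[:keep_first_n]
    tail = history[-keep_last_n:] if keep_last_n else []
    middle = history[keep_first_n:len(history) - keep_last_n]

    # Spend whatever budget is left on the newest middle messages.
    budget = max_tokens - sum(estimate_tokens(m) for m in head + tail)
    kept_middle: list[str] = []
    for message in reversed(middle):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept_middle.append(message)
        budget -= cost
    kept_middle.reverse()
    return head + kept_middle + tail


# 20 messages of 40 characters each, so about 10 tokens per message.
history = [f"message {i:02d} " + "x" * 29 for i in range(20)]
trimmed = truncate_history(history, max_tokens=120)
print(len(history), "->", len(trimmed))
```

Here `max_tokens` plays the role of the effective limit (context window minus `safety_margin`). The first two and last eight messages always survive; only the middle shrinks.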

---

### 15.4 Cost-Aware Routing

**Plain English:** Not every question needs the most expensive model. Automatically send simple questions to a cheap model and hard questions to a powerful one.

**Technical:** Scores each prompt (0–100) based on length, question complexity markers, and keyword signals. Routes to the first `RoutingTier` whose `max_score >= score`. Adapters are pooled for O(1) model switching.

```python
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway import openai_developer_kit as gpt

router = CostAwareRouter([
    RoutingTier(
        model="gpt-4o-mini",
        max_score=30,
        # Plain: "Use this cheap model for easy questions (score 0–30)"
        # Technical: First tier. model= is the ID passed to the adapter.
        #            max_score= is the upper bound of the score range this tier handles.
    ),
    RoutingTier(
        model="gpt-4o",
        max_score=70,
        # Plain: "Use this mid-tier model for moderate questions (score 31–70)"
    ),
    RoutingTier(
        model="o3-mini",
        max_score=100,
        # Plain: "Use this powerful (expensive) model for hard questions (score 71–100)"
        # Technical: Final tier; also the fallback if no earlier tier matches.
    ),
])

kit = gpt.OpenAIDeveloperKit(
    model="auto",    # <-- REQUIRED when using a router
    router=router,
)

# "2+2" → very low complexity score → routed to gpt-4o-mini (cheapest)
r1 = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?"))
print(r1.content)        # 4
print(r1.raw.model)      # gpt-4o-mini  (model name lives in the raw provider object)

# Complex reasoning → high score → routed to o3-mini
r2 = kit.chat(gpt.ChatConfig(
    user_message=(
        "Explain the mathematical proof of Gödel's incompleteness theorem "
        "and its implications for formal systems and computability theory."
    )
))
print(r2.raw.model)      # o3-mini
```
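
The "first tier whose `max_score >= score`" rule is easy to see in isolation. In the sketch below, `Tier`, `score_prompt`, and `pick_tier` are hypothetical names, and the scoring heuristic (length plus a few keywords) is invented for illustration; the real scorer is internal to `CostAwareRouter`.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    model: str
    max_score: int


def score_prompt(text: str) -> int:
    """Hypothetical scorer: length plus keyword signals, clamped to 0..100."""
    score = min(len(text) // 4, 40)
    lowered = text.lower()
    for keyword in ("proof", "explain", "implications", "theorem"):
        if keyword in lowered:
            score += 20
    return min(score, 100)


def pick_tier(tiers: list[Tier], score: int) -> Tier:
    for tier in tiers:
        if tier.max_score >= score:
            return tier  # first tier that covers this score
    return tiers[-1]     # fallback: the final tier


tiers = [Tier("gpt-4o-mini", 30), Tier("gpt-4o", 70), Tier("o3-mini", 100)]

easy = "What is 2 + 2?"
hard = (
    "Explain the mathematical proof of Gödel's incompleteness theorem "
    "and its implications for formal systems and computability theory."
)
print(pick_tier(tiers, score_prompt(easy)).model)
print(pick_tier(tiers, score_prompt(hard)).model)
```

Because tiers are checked in order, list them from cheapest to most expensive with ascending `max_score` values; a score of exactly 30 still goes to the first tier.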

### Combining All Middleware

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.cache import ExactMatchCache, SemanticCache
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway.truncation import TokenTruncator, TruncationConfig

kit = gpt.OpenAIDeveloperKit(
    model="auto",
    router=CostAwareRouter([
        RoutingTier(model="gpt-4o-mini", max_score=30),
        RoutingTier(model="gpt-4o",      max_score=100),
    ]),
    exact_cache=ExactMatchCache(max_size=2048, ttl_seconds=7200),
    semantic_cache=SemanticCache(embedder=embed, similarity_threshold=0.90),  # embed() is defined in §15.2
    truncator=TokenTruncator(TruncationConfig(keep_last_n=10, safety_margin=1024)),
)
# Each request flows: exact cache → semantic cache → route → truncate → API call
```

---

## 16. All Five Developer Kits

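All five kits share identical method signatures:
`chat()`, `achat()`, `stream()`, `astream()`, `embed()`, `aembed()`.
Swap the import alias and kit name, and everything else stays the same.
The one exception is embeddings with Claude: Anthropic has no native
embeddings API (see §16.3).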

| Kit | Alias | Env var | Offline? |
| --- | --- | --- | --- |
| `OpenAIDeveloperKit` | `gpt` | `OPENAI_API_KEY` | No |
| `GoogleDeveloperKit` | `gemini` | `GOOGLE_API_KEY` | No |
| `AnthropicDeveloperKit` | `claude` | `ANTHROPIC_API_KEY` | No |
| `OllamaDeveloperKit` | `local` | — | **Yes** |
| `HuggingFaceDeveloperKit` | `hf` | `HF_TOKEN` (optional) | Optional |

### 16.1 OpenAIDeveloperKit (GPT)

The primary examples throughout this guide use `OpenAIDeveloperKit`.
A quick recap:

```python
from ractogateway import openai_developer_kit as gpt, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly.",
    constraints=["Be concise."],
    tone="Friendly",
    output_format="text",
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)
response = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?"))
print(response.content)  # "4"
```

Install: `pip install ractogateway[openai]`  ·  Key env var: `OPENAI_API_KEY`

---

### 16.2 GoogleDeveloperKit (Gemini)

```python
from ractogateway import google_developer_kit as gemini
from ractogateway.prompts.engine import RactoPrompt

kit = gemini.GoogleDeveloperKit(
    model="gemini-2.0-flash",    # or "gemini-2.0-pro"
    api_key="AIza...",           # or set GOOGLE_API_KEY env var
)

prompt = RactoPrompt(
    role="You are a creative writing assistant.",
    aim="Write a haiku about the given subject.",
    constraints=["Must follow 5-7-5 syllable structure."],
    tone="Poetic and thoughtful.",
    output_format="text",
)

response = kit.chat(gemini.ChatConfig(
    user_message="Write a haiku about rain.",
    prompt=prompt,
))
print(response.content)
# Silver drops descend —
# Earth drinks its ancient thirst deep.
# Mud sings after rain.
```

---

### 16.3 AnthropicDeveloperKit (Claude)

```python
from ractogateway import anthropic_developer_kit as claude
from ractogateway.prompts.engine import RactoPrompt

kit = claude.AnthropicDeveloperKit(
    model="claude-sonnet-4-6",
    # or "claude-opus-4-6", "claude-haiku-4-5-20251001"
    api_key="sk-ant-...",  # or set ANTHROPIC_API_KEY env var
)

prompt = RactoPrompt(
    role="You are an expert code reviewer.",
    aim="Review the code snippet and identify any bugs or improvements.",
    constraints=[
        "Be specific — cite line numbers.",
        "Prioritise correctness over style.",
    ],
    tone="Technical and direct.",
    output_format="markdown",
)

response = kit.chat(claude.ChatConfig(
    user_message="def divide(a, b): return a / b",
    prompt=prompt,
))
print(response.content)
```

Install: `pip install ractogateway[anthropic]`  ·  Key env var: `ANTHROPIC_API_KEY`

> **Note:** Anthropic does not provide a native embeddings API.
> Call `embed()` / `aembed()` via `OpenAIDeveloperKit` or `GoogleDeveloperKit`
> instead when you need vectors alongside Claude chat.

---

### 16.4 OllamaDeveloperKit (Local / Offline)

Run any open-source model on your own hardware — no API key, no data leaving
your machine.

**Prerequisites:**

```bash
# 1. Install Ollama  →  https://ollama.com/download
# 2. Pull a model
ollama pull llama3.2          # 2 GB general-purpose
ollama pull nomic-embed-text  # 274 MB embeddings model
# 3. Install the Python extra
pip install ractogateway[ollama]
```

```python
from ractogateway import ollama_developer_kit as local, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer questions concisely.",
    constraints=["Do not hallucinate."],
    tone="Friendly",
    output_format="text",
)

# Ollama listens at http://localhost:11434 by default — no key needed
kit = local.Chat(model="llama3.2", default_prompt=prompt)

response = kit.chat(local.ChatConfig(user_message="What is a neural network?"))
print(response.content)
```

**Streaming:**

```python
for chunk in kit.stream(local.ChatConfig(user_message="Tell me a joke.")):
    print(chunk.delta.text, end="", flush=True)
```

**Embeddings** (requires a dedicated embedding model):

```python
resp = kit.embed(local.EmbeddingConfig(
    texts=["hello", "world"],
    model="nomic-embed-text",   # the embedding model pulled in the prerequisites
))
print(resp.vectors[0].embedding[:5])
```

**Embedded server management** — start Ollama programmatically:

```python
with local.OllamaServerManager(port=11500) as srv:
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
    print(kit.chat(local.ChatConfig(user_message="Hello!")).content)
# server stops automatically
```

See the full guide: {doc}`ollama`

---

### 16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM)

Three deployment modes through one interface:

| Mode | When to use |
| --- | --- |
| HF Inference API (cloud) | Quick prototyping; set `HF_TOKEN` |
| Local TGI | Self-hosted Text Generation Inference |
| Local vLLM / llama.cpp | Any OpenAI-compatible HTTP server |

```bash
pip install ractogateway[huggingface]
export HF_TOKEN="hf_..."   # obtain at https://huggingface.co/settings/tokens
```

**Cloud inference:**

```python
from ractogateway import huggingface_developer_kit as hf, RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful assistant.",
    aim="Answer the user clearly.",
    constraints=["Stay on topic."],
    tone="Friendly",
    output_format="text",
)

kit = hf.Chat(
    model="meta-llama/Llama-3.2-3B-Instruct",
    default_prompt=prompt,
)
response = kit.chat(hf.ChatConfig(user_message="Explain transformers briefly."))
print(response.content)
```

**Local TGI server** (no API key):

```python
kit = hf.Chat(
    model="tgi",
    base_url="http://localhost:8080",
    default_prompt=prompt,
)
```

**Embeddings:**

```python
resp = kit.embed(
    hf.EmbeddingConfig(texts=["hello world", "goodbye world"])
)
print(f"dim={len(resp.vectors[0].embedding)}")
```

See the full guide: {doc}`huggingface`

---

## 17. RAG — Retrieval-Augmented Generation

**Plain English:** RAG lets the AI answer questions about your own documents. You feed it your files, it converts them into searchable number vectors, and when someone asks a question, it finds the relevant parts and feeds them to the AI.

**Technical:** Full pipeline: `FileReaderRegistry` → chunker → `ProcessingPipeline` → embedder → vector store → similarity search → `RactoPrompt` context injection.

### Complete RAG Pipeline Example

```python
from ractogateway.rag import RactoRAG
from ractogateway.rag.embedders import OpenAIEmbedder
from ractogateway.rag.stores import InMemoryVectorStore
from ractogateway.rag.chunkers import RecursiveChunker
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

# 1. Build the RAG pipeline
rag = RactoRAG(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    store=InMemoryVectorStore(),   # swap for ChromaStore, FAISSStore, etc. in production
    chunker=RecursiveChunker(chunk_size=512, overlap=64),
)

# 2. Ingest your documents
rag.add_documents([
    "/path/to/product_manual.pdf",
    "/path/to/faq.docx",
    "/path/to/release_notes.txt",
])

# 3. At query time, retrieve relevant chunks
results = rag.retrieve("How do I reset my password?", top_k=3)

# 4. Inject retrieved context into a RactoPrompt
context = "\n\n".join(r.chunk.text for r in results)

prompt = RactoPrompt(
    role="You are a product support assistant.",
    aim="Answer the user's question based strictly on the provided documentation.",
    constraints=["Only use information from the CONTEXT section.", "Quote the source if possible."],
    tone="Helpful and precise.",
    output_format="text",
    context=context,    # <-- the retrieved chunks go here
)

# 5. Ask the AI
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)
response = kit.chat(gpt.ChatConfig(user_message="How do I reset my password?"))
print(response.content)
```

### Chunkers Explained

| Chunker | Plain English | Best For |
|---|---|---|
| `FixedChunker` | Split every N characters, no mercy | Quick prototyping, structured data |
| `RecursiveChunker` | Split at sentence/paragraph boundaries, then fall back to characters | General documents (best default) |
| `SentenceChunker` | Always split at sentence boundaries | Articles, legal text, Q&A content |
| `SemanticChunker` | Group sentences that are about the same topic | Complex documents with topic shifts |
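
To show what `chunk_size` and `overlap` mean in practice, here is a minimal fixed-size chunker written from scratch. `fixed_chunks` is a hypothetical function that works on characters; it illustrates the simplest strategy in the table and is not the library's `FixedChunker`.

```python
def fixed_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text every chunk_size characters, repeating `overlap` characters between chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


print(fixed_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means the last character of each chunk reappears at the start of the next one, so a sentence cut at a chunk boundary still appears whole in at least part of a neighbouring chunk.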

### Vector Stores Explained

| Store | Plain English | When to Use |
|---|---|---|
| `InMemoryVectorStore` | Fast in-RAM store; lost on restart | Development, prototyping, tests |
| `ChromaStore` | Local persistent store | Single-server apps, local dev |
| `FAISSStore` | Facebook's ultra-fast similarity search | Millions of vectors, CPU-only |
| `PineconeStore` | Fully managed cloud vector DB | Production, no infra to manage |
| `QdrantStore` | Open-source, filterable, scalable | Production with metadata filtering |
| `WeaviateStore` | Open-source with built-in ML | Multi-modal + graph features |
| `MilvusStore` | Distributed vector DB | Billions of vectors at scale |
| `PGVectorStore` | PostgreSQL extension | Already using Postgres |

---

## 18. Redis — Production Infrastructure

Redis tools make your app production-ready: distributed cache, per-user rate limiting, and persistent chat memory that survives deployments.

```bash
pip install "ractogateway[redis]"
```

### 18.1 Distributed Exact Cache

Drop-in replacement for `ExactMatchCache` that works across multiple server replicas.

```python
from ractogateway.redis import RedisExactCache
from ractogateway import openai_developer_kit as gpt

cache = RedisExactCache(
    url="redis://localhost:6379/0",
    # Plain: "Where is your Redis server?"
    # Technical: Redis connection URL. Alternatively pass client= with a pre-built
    #            redis.Redis instance.

    ttl_seconds=3600,
    # Plain: "Forget cached answers after 1 hour"
    # Technical: TTL applied via Redis EXPIRE on each key write.
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache)
# Now all your servers share the same cache!
```

### 18.2 Rate Limiter

Prevent users from making too many expensive requests.

```python
from ractogateway.redis import RedisRateLimiter, RateLimitConfig

limiter = RedisRateLimiter(
    url="redis://localhost:6379/0",
    config=RateLimitConfig(
        max_tokens_per_minute=5_000,
        # Plain: "Each user can use at most 5,000 tokens per minute"
        # Technical: Sliding 1-minute window. Counter stored as Redis sorted set per user_id.

        key_prefix="rl:",
        # Plain: "A label to group all rate limit keys in Redis"
        # Technical: String prefix for Redis keys: "{key_prefix}{user_id}"
    ),
)

# In your request handler:
user_id = "user-42"
estimated_tokens = 200

if not limiter.check_and_consume(user_id, tokens=estimated_tokens):
    raise RuntimeError("Rate limit exceeded — please try again in a minute.")

remaining = limiter.get_remaining(user_id)
print(f"Tokens remaining this minute: {remaining}")
# Tokens remaining this minute: 4800
```
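
The sliding-window idea can be tried out without Redis. The sketch below keeps the per-user events in memory; `SlidingWindowLimiter` is a hypothetical class, and its optional `now` argument exists only to make the demo deterministic. `RedisRateLimiter` stores the equivalent data in a Redis sorted set so that all replicas share it.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    def __init__(self, max_tokens_per_minute: int, window_seconds: float = 60.0):
        self.max_tokens_per_minute = max_tokens_per_minute
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, tokens)

    def _prune(self, user_id: str, now: float) -> None:
        events = self._events[user_id]
        # Forget usage that has slid out of the window.
        while events and now - events[0][0] >= self.window_seconds:
            events.popleft()

    def get_remaining(self, user_id: str, now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        self._prune(user_id, now)
        used = sum(tokens for _, tokens in self._events[user_id])
        return self.max_tokens_per_minute - used

    def check_and_consume(self, user_id: str, tokens: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if tokens > self.get_remaining(user_id, now):
            return False  # over budget: nothing is consumed
        self._events[user_id].append((now, tokens))
        return True


limiter = SlidingWindowLimiter(max_tokens_per_minute=5_000)
print(limiter.check_and_consume("user-42", 200, now=0.0))    # True
print(limiter.get_remaining("user-42", now=1.0))             # 4800
print(limiter.check_and_consume("user-42", 4_900, now=2.0))  # False
print(limiter.get_remaining("user-42", now=61.0))            # 5000
```

A rejected request consumes nothing, and the budget recovers gradually as old events leave the window rather than resetting all at once on the minute.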

### 18.3 Chat Memory

Store conversation history in Redis so it survives server restarts and scales across replicas.

```python
from ractogateway.redis import RedisChatMemory, ChatMemoryConfig
from ractogateway._models.chat import Message, MessageRole

memory = RedisChatMemory(
    url="redis://localhost:6379/0",
    config=ChatMemoryConfig(
        max_turns=20,
        # Plain: "Remember the last 20 turns (up to 40 messages) per conversation"
        # Technical: Redis List capped to 2*max_turns entries (each turn = 2 messages).
        #            Older messages are popped from the front automatically.

        ttl_seconds=1800,
        # Plain: "Forget the conversation after 30 minutes of inactivity"
        # Technical: TTL reset on every append() call.

        key_prefix="chat:",
        # Plain: "Label all conversation keys in Redis"
        # Technical: Redis keys = "{key_prefix}{conv_id}"
    ),
)

# When a user sends a message:
conv_id = "session-abc123"
memory.append(conv_id, "user", "What's the best way to learn Python?")

# After getting the AI response:
memory.append(conv_id, "assistant", "Start with the official tutorial, then build projects!")

# Reconstruct history for the next request:
history_dicts = memory.get_history(conv_id)
# [{"role": "user", "content": "What's the best way..."}, {"role": "assistant", "content": "..."}]

history = [Message(role=MessageRole(m["role"]), content=m["content"]) for m in history_dicts]

# Pass to ChatConfig (kit and gpt are the objects created in §18.1):
response = kit.chat(gpt.ChatConfig(
    user_message="What resources do you recommend?",
    history=history,
))

# Wipe the conversation when the session ends:
memory.clear(conv_id)
print(memory.count(conv_id))  # 0
```
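
The "capped list" behaviour is the part people most often misread, so here is an in-memory sketch of it. `TinyChatMemory` is a hypothetical class built on `collections.deque`; it leaves out the TTL and, unlike `RedisChatMemory`, does not survive a restart.

```python
from collections import deque


class TinyChatMemory:
    def __init__(self, max_turns: int):
        # One turn = one user message + one assistant message.
        self._max_messages = 2 * max_turns
        self._conversations: dict[str, deque] = {}

    def append(self, conv_id: str, role: str, content: str) -> None:
        messages = self._conversations.setdefault(conv_id, deque(maxlen=self._max_messages))
        messages.append({"role": role, "content": content})  # oldest entry falls off when full

    def get_history(self, conv_id: str) -> list[dict]:
        return list(self._conversations.get(conv_id, ()))

    def count(self, conv_id: str) -> int:
        return len(self._conversations.get(conv_id, ()))

    def clear(self, conv_id: str) -> None:
        self._conversations.pop(conv_id, None)


memory = TinyChatMemory(max_turns=2)
for i in range(6):
    role = "user" if i % 2 == 0 else "assistant"
    memory.append("session-abc123", role, f"message {i}")

print(memory.count("session-abc123"))
print(memory.get_history("session-abc123")[0]["content"])
```

With `max_turns=2` only the four most recent messages remain, so the first surviving entry is `message 2`.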

---

## 19. Common Mistakes & How to Fix Them

### Mistake 1: Using `output` instead of `output_format` in RactoPrompt

```python
# WRONG — this will raise a Pydantic ValidationError
prompt = RactoPrompt(
    role="...", aim="...", constraints=["..."], tone="...",
    output="text",    # ❌  field is called output_format, not output!
)

# CORRECT
prompt = RactoPrompt(
    role="...", aim="...", constraints=["..."], tone="...",
    output_format="text",   # ✅
)
```

### Mistake 2: Forgetting at least one constraint

```python
# WRONG — constraints cannot be an empty list
prompt = RactoPrompt(
    role="...", aim="...", constraints=[],   # ❌ ValidationError: min_length=1
    tone="...", output_format="text",
)

# CORRECT
prompt = RactoPrompt(
    role="...", aim="...",
    constraints=["Be helpful."],   # ✅ at least one constraint required
    tone="...", output_format="text",
)
```

### Mistake 3: Using `model="auto"` without a router

```python
# WRONG — raises ValueError immediately
kit = gpt.OpenAIDeveloperKit(model="auto")   # ❌

# CORRECT
kit = gpt.OpenAIDeveloperKit(
    model="auto",
    router=CostAwareRouter([...]),   # ✅
)
```

### Mistake 4: Neither ChatConfig.prompt nor kit.default_prompt is set

```python
# WRONG — raises ValueError when chat() is called
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")   # no default_prompt
response = kit.chat(gpt.ChatConfig(user_message="Hello"))  # ❌

# FIX OPTION 1: Set default_prompt on the kit
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=my_prompt)

# FIX OPTION 2: Pass prompt in ChatConfig
response = kit.chat(gpt.ChatConfig(user_message="Hello", prompt=my_prompt))
```
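The rule behind this mistake can be pictured as a small lookup. The sketch below uses hypothetical stand-in classes (`SketchConfig`, `SketchKit`), and the precedence shown (per-request prompt first, then the kit default) is an assumption made for illustration, not something this guide confirms about the library internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ChatConfig."""
    user_message: str
    prompt: Optional[str] = None


@dataclass
class SketchKit:
    """Hypothetical stand-in for a developer kit."""
    default_prompt: Optional[str] = None

    def resolve_prompt(self, config: SketchConfig) -> str:
        # Assumed order: the per-request prompt wins, otherwise the kit default.
        prompt = config.prompt if config.prompt is not None else self.default_prompt
        if prompt is None:
            raise ValueError("Set ChatConfig.prompt or kit.default_prompt")
        return prompt


kit_with_default = SketchKit(default_prompt="default prompt")
from_default = kit_with_default.resolve_prompt(SketchConfig("Hello"))
from_config = kit_with_default.resolve_prompt(SketchConfig("Hello", prompt="per-request prompt"))
print(from_default, "|", from_config)

error_message = None
try:
    SketchKit().resolve_prompt(SketchConfig("Hello"))  # neither source is set
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```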

### Mistake 5: Expecting typed validation but not setting it explicitly

```python
# BEST PRACTICE — set response_model explicitly
prompt = RactoPrompt(..., output_format=WeatherReport)
config = gpt.ChatConfig(
    user_message="...",
    response_model=WeatherReport,   # ✅ explicit validation contract
)

# ALSO SUPPORTED — inferred automatically from output_format model
prompt = RactoPrompt(..., output_format=WeatherReport)
config = gpt.ChatConfig(user_message="...")  # ✅ inferred from prompt.output_format
```

### Mistake 6: Missing `await` on async methods

```python
# WRONG — this returns a coroutine object, not a response
response = kit.achat(config)   # ❌

# CORRECT
response = await kit.achat(config)   # ✅  (inside an async function)
```
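If you want to see the difference for yourself without making an API call, the following self-contained sketch uses a hypothetical `fake_achat` coroutine in place of `kit.achat()`. It only relies on the standard library.

```python
import asyncio
import inspect


async def fake_achat(message: str) -> str:
    """Hypothetical stand-in for kit.achat(); no network call is made."""
    await asyncio.sleep(0)
    return f"echo: {message}"


async def main() -> str:
    pending = fake_achat("Hello")         # no await: this is only a coroutine object
    is_coroutine = inspect.iscoroutine(pending)
    pending.close()                       # discard it to avoid a "never awaited" warning
    print(f"without await -> coroutine object? {is_coroutine}")

    response = await fake_achat("Hello")  # await: the coroutine runs to completion
    print(f"with await    -> {response}")
    return response


result = asyncio.run(main())
```

In a plain script, `asyncio.run(...)` is the usual entry point. Inside a framework that already runs an event loop (FastAPI, Jupyter), use `await` directly instead.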

### Mistake 7: Not installing the provider extra

```python
# WRONG — if you only ran  pip install ractogateway
from ractogateway import openai_developer_kit as gpt
kit = gpt.OpenAIDeveloperKit(model="gpt-4o")
kit.chat(...)   # ❌  ImportError: The 'openai' package is required

# FIX
# pip install "ractogateway[openai]"
```

### Mistake 8: Not handling `ResponseModelValidationError`

When `response_model` is set, validation failures now raise
`ResponseModelValidationError` after all retries are exhausted — they no
longer silently append a warning string to `response.content`.

```python
# WRONG — this will now raise, not return a response with garbled content
response = kit.chat(config)   # ❌ unhandled ResponseModelValidationError

# CORRECT — wrap in try/except to handle gracefully
from ractogateway.exceptions import ResponseModelValidationError

try:
    response = kit.chat(config)
    report = MyModel(**response.parsed)
except ResponseModelValidationError as e:
    # Inspect what happened and decide how to recover
    print(f"Validation failed after {e.attempts} attempt(s): {e.last_error}")
    # e.raw_response holds the last raw JSON string from the LLM
```
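The "retry, then raise" behaviour described above can be sketched in plain Python. Everything here is hypothetical and for illustration only: `SketchValidationError` mirrors the attribute names used in the example (`attempts`, `last_error`, `raw_response`), `validate_report` stands in for a Pydantic model, and the sketch assumes that `max_validation_retries` counts extra attempts after the first call.

```python
import json


class SketchValidationError(Exception):
    """Hypothetical stand-in mirroring the attributes used in the example above."""

    def __init__(self, attempts, last_error, raw_response):
        super().__init__(f"validation failed after {attempts} attempt(s): {last_error}")
        self.attempts = attempts
        self.last_error = last_error
        self.raw_response = raw_response


def call_with_validation(ask_model, validate, max_validation_retries=2):
    """Call ask_model() up to 1 + max_validation_retries times until validate() passes."""
    last_error, raw = None, None
    total_attempts = 1 + max_validation_retries
    for attempt in range(1, total_attempts + 1):
        raw = ask_model()
        try:
            return validate(json.loads(raw)), attempt
        except (ValueError, KeyError, TypeError) as exc:  # JSONDecodeError is a ValueError
            last_error = exc
    raise SketchValidationError(total_attempts, last_error, raw)


def validate_report(data):
    # Minimal hand-written check standing in for a Pydantic model.
    return {"city": str(data["city"]), "temp_c": float(data["temp_c"])}


# A scripted "model": the first reply is missing a field, the second is valid.
replies = iter(['{"city": "Paris"}', '{"city": "Paris", "temp_c": 22}'])
parsed, attempts_used = call_with_validation(lambda: next(replies), validate_report)
print(parsed, attempts_used)

# With retries disabled, the first bad reply fails fast.
failure = None
try:
    call_with_validation(lambda: "not json", validate_report, max_validation_retries=0)
except SketchValidationError as err:
    failure = err
print(failure.attempts, failure.raw_response)
```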

> **Tip:** The default `max_validation_retries=2` means the kit will
> automatically retry twice before raising — most transient issues resolve
> in the first retry. Set `max_validation_retries=0` to disable retries and
> fail fast.

---

## 19. Telemetry & Observability

RactoGateway ships production-grade observability with **zero changes** to existing call sites.
Attach a `RactoTracer` and/or `GatewayMetricsMiddleware` to any kit and every LLM call is
automatically instrumented.

### Installation

```bash
pip install "ractogateway[observability]"   # OTEL tracing + Prometheus metrics
pip install "ractogateway[telemetry]"        # OTEL tracing only
pip install "ractogateway[prometheus]"       # Prometheus metrics only
```

### Quick start

```python
from ractogateway import openai_developer_kit as opd
from ractogateway.telemetry import RactoTracer, GatewayMetricsMiddleware, PrometheusExporter

tracer  = RactoTracer(otlp_endpoint="http://localhost:4317", console=True)
metrics = GatewayMetricsMiddleware()
PrometheusExporter(port=8000).start()    # scrape http://localhost:8000/metrics

kit = opd.OpenAIDeveloperKit(
    model="gpt-4o",
    default_prompt=prompt,
    tracer=tracer,
    metrics=metrics,
)
response = kit.chat(opd.ChatConfig(user_message="Hello!"))
# One OTEL span emitted, one Prometheus data-point recorded.
```

The same `tracer=` / `metrics=` parameters work on **GoogleDeveloperKit** and
**AnthropicDeveloperKit**.

### What is recorded automatically

| Event | Tracer span | Prometheus metrics |
|---|---|---|
| Successful chat/stream | `llm.chat` with latency, tokens, cost | `requests_total`, `duration_seconds`, `tokens_total`, `cost_usd_total` |
| Cache hit (exact/semantic) | `llm.chat` with `cache_hit="exact"/"semantic"`, 0 tokens | `cache_hits_total` |
| Cache miss | — | `cache_misses_total` |
| Tool call | `tool_calls` attribute on span | `tool_calls_total{tool_name}` |
| Error | `status="error"`, `error_type=ExcName` | `requests_total{status="error"}` |
| Embedding | `llm.embed` | `requests_total{operation="embed"}` |

### OTEL export backends

```python
# Jaeger / Grafana Tempo (gRPC)
RactoTracer(otlp_endpoint="http://jaeger:4317")

# Zipkin / Tempo (HTTP)
RactoTracer(otlp_http_endpoint="http://tempo:4318")

# In-memory capture for unit tests — no external backend needed
tracer = RactoTracer(in_memory=True)
kit.chat(...)
assert tracer.spans[0].provider == "openai"
tracer.clear_spans()
```

### Custom pricing

```python
from ractogateway.telemetry import ModelPricing, RactoTracer

custom = {"my-ft-gpt4": ModelPricing(input_per_million=5.0, output_per_million=15.0)}
tracer = RactoTracer(otlp_endpoint="...", price_table=custom)
```
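As a rough illustration of what a per-million price table implies, the sketch below computes the cost of one call. `SketchPricing` is a hypothetical stand-in that borrows the two field names shown above, and the linear formula is an assumption about how such prices are normally applied, not a statement of how the tracer computes cost internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPricing:
    """Hypothetical stand-in using the same field names as ModelPricing above."""
    input_per_million: float   # USD per 1,000,000 input tokens
    output_per_million: float  # USD per 1,000,000 output tokens


def estimate_cost_usd(pricing, input_tokens, output_tokens):
    """Linear cost: tokens multiplied by the per-million rate, divided by one million."""
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000


pricing = SketchPricing(input_per_million=5.0, output_per_million=15.0)
cost = estimate_cost_usd(pricing, input_tokens=1_200, output_tokens=300)
print(f"${cost:.4f}")  # (1200 * 5 + 300 * 15) / 1e6 = $0.0105
```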

### Grafana dashboard

Import `dashboards/grafana_dashboard.json` into Grafana to get 20+ pre-built panels covering
latency percentiles (p50/p95/p99), token rate, cost rate, cache hit/miss ratio, error rate,
tool call distribution, and a per-model summary table.

Full reference: [Telemetry guide](telemetry.md) | [API reference](../api/telemetry.md)

---

## 20. Prebuilt Pipelines — Production Workflows

RactoGateway includes prebuilt pipelines for common end-to-end tasks where a
single `chat()` call is not enough.

### Available pipelines

| Pipeline | Classes | Use case |
|---|---|---|
| SQL Analyst | `SQLAnalystPipeline`, `AsyncSQLAnalystPipeline` | Natural language analytics over SQL databases |
| List Classifier | `ListClassifierPipeline`, `AsyncListClassifierPipeline` | Map user text to one or more options from a list |
| Video Processor | `VideoProcessorPipeline`, `AsyncVideoProcessorPipeline` | Extract frames, transcribe audio, analyse with vision LLM, summarise |
| Agent | `AgentPipeline`, `AsyncAgentPipeline` | Autonomous ReAct agent — reason + call tools + observe → answer |

### Install extras

```bash
# SQL Analyst
pip install "ractogateway[pipelines-sql]"            # core (no charts)
pip install "ractogateway[pipelines-sql-viz]"        # + Plotly charts

# Video Processor
pip install "ractogateway[pipelines-video]"          # OpenCV + ffmpeg + pHash
pip install "ractogateway[pipelines-video-whisper]"  # + faster-whisper (local ASR)
pip install "ractogateway[pipelines-video-yt]"       # + yt-dlp (YouTube download)

# Agent
pip install "ractogateway[pipelines-agent]"          # core (no extra deps)
pip install "ractogateway[pipelines-agent-http]"     # + httpx (http_get tool)
```

### SQL Analyst — quick example

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import SQLAnalystPipeline

sql_pipeline = SQLAnalystPipeline(kit=gpt.Chat(model="gpt-4o"))
result = sql_pipeline.run(
    user_query="Top 5 products by revenue",
    connection_string="postgresql://user:pass@localhost:5432/shop",
)
print(result.answer)
```

### List Classifier — quick example

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import ListClassifierPipeline

classifier = ListClassifierPipeline(
    kit=gpt.Chat(model="gpt-4o-mini"),
    options=["Billing", "Technical Support", "Sales"],
    include_confidence=True,
    include_reasoning=True,
)
result = classifier.run("I cannot update my payment method")
print(result.first)           # "Billing"
print(result.top_confidence)  # e.g. 0.96
```

### Video Processor — quick example

Process a lecture or tutorial video end-to-end — extract key frames, transcribe speech, use a vision LLM to read whiteboards/screens, and produce a structured Markdown report.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import VideoProcessorPipeline, TranscriberBackend, DeduplicationMethod

pipeline = VideoProcessorPipeline(
    kit=gpt.Chat(model="gpt-4o"),        # vision LLM + summary
    fps=1.0,                              # sample one frame per second
    similarity_threshold=85.0,            # drop frames that are ≥85% similar to the previous
    dedup_method=DeduplicationMethod.PHASH,
    transcriber=TranscriberBackend.FASTER_WHISPER,
    transcriber_model="base",
    analyze_frames=True,
    generate_summary=True,
    safe_mode=True,
)

# Accepts: local path, HTTP URL, YouTube URL, raw bytes, or pre-extracted frame list
result = pipeline.run("lecture.mp4")

print(f"Frames kept : {result.usage.frames_kept}/{result.usage.frames_extracted}")
print(f"Tokens used : {result.usage.total_tokens}")
print(result.summary)          # structured Markdown summary
result.to_markdown("report.md")  # save full report
```
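To give a feel for what `similarity_threshold` does, here is a small sketch of hash-based frame de-duplication. It is an illustration under stated assumptions, not the pipeline's actual code: frames are represented by made-up 64-bit perceptual hashes, similarity is taken as the percentage of matching bits, and each frame is compared with the last frame that was kept. `hash_similarity` and `dedupe_frames` are hypothetical helper names.

```python
def hash_similarity(a: int, b: int, bits: int = 64) -> float:
    """Percentage of matching bits between two perceptual hashes."""
    differing = bin((a ^ b) & ((1 << bits) - 1)).count("1")
    return 100.0 * (bits - differing) / bits


def dedupe_frames(hashes, similarity_threshold=85.0):
    """Return indices of frames to keep, comparing each frame with the last kept one."""
    kept = []
    for index, frame_hash in enumerate(hashes):
        if kept and hash_similarity(hashes[kept[-1]], frame_hash) >= similarity_threshold:
            continue  # near-duplicate of the last kept frame: drop it
        kept.append(index)
    return kept


# Made-up hashes: frame 1 is almost identical to frame 0, frame 4 to frame 3.
frame_hashes = [
    0x0000000000000000,
    0x0000000000000001,  # 1 bit differs from frame 0  -> ~98% similar, dropped
    0x000000000000FFFF,  # 16 bits differ from frame 0 -> 75% similar, kept
    0xFFFFFFFF00000000,  # 48 bits differ from frame 2 -> 25% similar, kept
    0xFFFFFFFF00000001,  # 1 bit differs from frame 3  -> dropped
]
kept_indices = dedupe_frames(frame_hashes, similarity_threshold=85.0)
print(kept_indices)
```

A higher threshold keeps more frames (only near-identical ones are dropped); a lower threshold is more aggressive and reduces the number of vision-LLM calls.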

**What it produces (`VideoProcessorResult`):**

| Field | Type | Description |
|---|---|---|
| `frames` | `list[FrameEntry]` | Every extracted frame with its LLM analysis |
| `transcript` | `list[TranscriptSegment]` | Timed speech-to-text segments |
| `sections` | `list[VideoSection]` | Time windows merging visual + audio content |
| `summary` | `str` | 7-section Markdown summary |
| `usage` | `VideoProcessorUsage` | Token counts + frame statistics |

**Supported transcription backends (`TranscriberBackend`):**

| Backend | Value | Requires |
|---|---|---|
| Faster Whisper (default) | `"faster-whisper"` | `pip install "ractogateway[pipelines-video-whisper]"` |
| OpenAI Whisper (local) | `"openai-whisper"` | `pip install openai-whisper` |
| OpenAI API | `"openai-api"` | OpenAI API key |
| Groq API (ultra-fast) | `"groq-api"` | `pip install groq` + Groq API key |
| Deepgram | `"deepgram-api"` | `pip install deepgram-sdk` + key |
| Google Cloud STT | `"google-api"` | `pip install google-cloud-speech` + key |
| HuggingFace local | `"huggingface-local"` | `pip install transformers torch` |
| HuggingFace API | `"huggingface-api"` | `pip install huggingface_hub` + key |
| Ollama | `"ollama"` | Running Ollama server |

### Agent — quick example

An autonomous **ReAct** (Reason + Act) agent that loops: think → call tool → observe → repeat until it calls the built-in `finish()` tool.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import AgentPipeline

def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"

def unit_convert(value: float, from_unit: str, to_unit: str) -> str:
    """Convert a value between units."""
    # ... your logic here ...
    return f"{value} {from_unit} = ... {to_unit}"

agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather, unit_convert],
    max_steps=8,
    safe_mode=True,
)

result = agent.run("What is the weather in Paris, and convert 22°C to Fahrenheit?")
print(result.final_answer)
print(result.to_markdown())   # step-by-step trace
```
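The think → call tool → observe loop can be sketched in a few lines of plain Python. This is a toy illustration, not `AgentPipeline` itself: `scripted_model` is a hypothetical stand-in that replays a fixed script instead of calling an LLM, and `run_agent` returns a plain dict whose keys loosely mirror the result fields listed below.

```python
def get_weather(city: str) -> str:
    return f"Sunny, 22 °C in {city}"


def c_to_f(celsius: float) -> str:
    return f"{celsius * 9 / 5 + 32:.1f} °F"


def scripted_model(history):
    """Hypothetical stand-in for the LLM: picks the next action from a fixed script."""
    script = [
        ("Look up the weather first.", "get_weather", {"city": "Paris"}),
        ("Now convert the temperature.", "c_to_f", {"celsius": 22}),
        ("I have everything I need.", "finish", {"answer": "Sunny in Paris, 22 °C (71.6 °F)."}),
    ]
    return script[len(history)]


def run_agent(model, tools, max_steps=8):
    history = []
    for _ in range(max_steps):
        thought, tool_name, args = model(history)           # reason
        if tool_name == "finish":                           # built-in stop signal
            return {"final_answer": args["answer"], "steps": history, "stop_reason": "finish"}
        try:
            observation = tools[tool_name](**args)          # act
        except Exception as exc:                            # feed tool errors back as observations
            observation = f"error: {exc}"
        history.append({"thought": thought, "tool": tool_name, "observation": observation})
    return {"final_answer": None, "steps": history, "stop_reason": "max_steps"}


tools = {"get_weather": get_weather, "c_to_f": c_to_f}
result = run_agent(scripted_model, tools, max_steps=8)
print(result["final_answer"])

# With too small a step budget the loop stops before finish() is reached.
truncated = run_agent(scripted_model, tools, max_steps=2)
print(truncated["stop_reason"])
```

The step cap is what keeps a confused model from looping forever, which is why `max_steps` is worth setting explicitly.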

**Agent result fields (`AgentResult`):**

| Field | Type | Description |
|---|---|---|
| `final_answer` | `str \| None` | The agent's concluded answer |
| `steps` | `list[AgentStep]` | Every thought / tool call / observation |
| `stop_reason` | `StopReason` | `"finish"`, `"max_steps"`, or `"error"` |
| `usage` | `AgentUsage` | Cumulative token counts across all steps |

**Built-in tool factories:**

```python
from ractogateway.pipelines import (
    make_rag_tool,        # rag_search(query) → relevant chunks from RactoRAG
    make_sql_tool,        # sql_query(question) → answer from SQLAnalystPipeline
    make_http_tool,       # http_get(url) → page text (requires httpx)
    make_memory_tools,    # memory_read(key) + memory_write(key, value)
)

agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather],               # your custom tools
    rag_pipeline=my_rag,               # auto-registers rag_search
    sql_pipeline=my_sql,               # auto-registers sql_query
    agent_memory={},                   # dict → auto-registers memory_read/write
    extra_tools=[make_http_tool()],    # opt-in http_get
)
```

### Full guides

- [Pipelines overview](pipelines.md)
- [SQL Analyst pipeline](pipelines/sql_analyst.md)
- [List Classifier pipeline](pipelines/list_classifier.md)
- [Video Processor pipeline](pipelines/video_processor.md)
- [Agent pipeline](pipelines/agent.md)

---

## 21. Chain of Thought Reasoning

**Chain of Thought (CoT)** prompts the model to reason step-by-step before giving its
final answer. RactoGateway exposes this as a single `ChatConfig` flag — no prompt
engineering required.

### How to enable

```python
from ractogateway import openai_developer_kit as gpt

kit = gpt.Chat(model="gpt-4o")
response = kit.chat(
    gpt.ChatConfig(
        user_message="If a train travels 300 km in 2.5 hours, what is its average speed?",
        chain_of_thought=True,   # ← flip this flag
    )
)
print(response.content)
# The model will reason through the problem before stating "120 km/h"
```

### What it does internally

Setting `chain_of_thought=True` appends a step-by-step reasoning constraint to the
`RactoPrompt` before the request is sent. The constraint instructs the model to:

1. Break the problem into numbered reasoning steps.
2. Show its working at each step.
3. State the final answer clearly at the end.

This is applied *per request* — it does not modify the kit's default prompt permanently.
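
A minimal sketch of the "per request, not permanent" idea, using a hypothetical `SketchPrompt` dataclass in place of `RactoPrompt`. The wording of `COT_CONSTRAINT` is invented for illustration; the library's actual constraint text is not shown in this guide.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# Hypothetical wording, written for this example only.
COT_CONSTRAINT = (
    "Reason step by step: number each step, show your working, "
    "then state the final answer clearly at the end."
)


@dataclass(frozen=True)
class SketchPrompt:
    """Hypothetical stand-in for RactoPrompt with only the fields needed here."""
    role: str
    aim: str
    constraints: Tuple[str, ...]


def prompt_for_request(default_prompt: SketchPrompt, chain_of_thought: bool) -> SketchPrompt:
    if not chain_of_thought:
        return default_prompt
    # replace() builds a new object, so the default prompt is left untouched.
    return replace(default_prompt, constraints=default_prompt.constraints + (COT_CONSTRAINT,))


default_prompt = SketchPrompt(role="Tutor", aim="Answer maths questions", constraints=("Be concise.",))
cot_prompt = prompt_for_request(default_prompt, chain_of_thought=True)
print(len(default_prompt.constraints), len(cot_prompt.constraints))
```

Because the default prompt is never mutated, a later request with `chain_of_thought=False` sees the original constraints.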

### When to use CoT

| Scenario | Benefit |
|---|---|
| Math / logic problems | Forces explicit calculation steps → fewer errors |
| Multi-step planning | Surfaces assumptions and intermediate decisions |
| Debugging assistance | Produces a traceable reasoning chain |
| Exam / quiz apps | Provides explanation alongside the answer |

### Combining with structured output

```python
from pydantic import BaseModel

class ReasonedAnswer(BaseModel):
    steps: list[str]
    final_answer: str

response = kit.chat(
    gpt.ChatConfig(
        user_message="How many seconds are in a leap year?",
        chain_of_thought=True,
        response_model=ReasonedAnswer,   # parse result into Pydantic model
    )
)
print(response.parsed.steps)
print(response.parsed.final_answer)
```

---

## 22. Native Thinking / Extended Reasoning

**Native Thinking** exposes the model's *internal* chain-of-thought reasoning tokens —
the model genuinely thinks before answering rather than being instructed to write steps.
Supported by **Anthropic Claude** (extended thinking) and **Google Gemini** (thinking
mode). OpenAI o-series models expose reasoning token *counts* but not the text.

### Enable native thinking

```python
from ractogateway import anthropic_developer_kit as claude

kit = claude.Chat(model="claude-opus-4-6")
response = kit.chat(
    claude.ChatConfig(
        user_message="Prove that √2 is irrational.",
        native_thinking=True,
        thinking_budget=8000,   # max thinking tokens (Anthropic/Google)
    )
)
print(response.thinking)   # raw model reasoning (may be hundreds of tokens)
print(response.content)    # final polished answer
```

### Streaming with native thinking

```python
# Reuses the Claude `kit` created in the previous example.
for chunk in kit.stream(
    claude.ChatConfig(
        user_message="Design a cache-invalidation strategy for a distributed system.",
        native_thinking=True,
        thinking_budget=10000,
    )
):
    if chunk.is_thinking:
        print(chunk.delta.thinking, end="", flush=True)
    else:
        print(chunk.delta.text, end="", flush=True)
```
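If you want to keep the reasoning and the answer as two separate strings rather than printing them, the pattern looks like this. The sketch runs offline: `SketchDelta`, `SketchChunk`, and `fake_stream` are hypothetical stand-ins that only borrow the attribute names used above (`is_thinking`, `delta.thinking`, `delta.text`).

```python
from dataclasses import dataclass


@dataclass
class SketchDelta:
    """Hypothetical stand-in for StreamDelta."""
    text: str = ""
    thinking: str = ""


@dataclass
class SketchChunk:
    """Hypothetical stand-in for StreamChunk."""
    delta: SketchDelta
    is_thinking: bool


def fake_stream():
    """Scripted chunks: thinking tokens first, then the answer tokens."""
    yield SketchChunk(SketchDelta(thinking="Compare TTL "), is_thinking=True)
    yield SketchChunk(SketchDelta(thinking="vs. events."), is_thinking=True)
    yield SketchChunk(SketchDelta(text="Use event-driven "), is_thinking=False)
    yield SketchChunk(SketchDelta(text="invalidation."), is_thinking=False)


thinking_parts, answer_parts = [], []
for chunk in fake_stream():
    if chunk.is_thinking:
        thinking_parts.append(chunk.delta.thinking)
    else:
        answer_parts.append(chunk.delta.text)

thinking_text = "".join(thinking_parts)
answer_text = "".join(answer_parts)
print(f"[thinking] {thinking_text}")
print(f"[answer]   {answer_text}")
```

Collecting parts in a list and joining once at the end avoids repeated string concatenation on long streams.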

### Provider behaviour summary

| Provider | Thinking text visible | Thinking budget param | Notes |
|---|---|---|---|
| Anthropic Claude | ✅ `response.thinking` | `thinking_budget` | Forces `temperature=1` |
| Google Gemini | ✅ `response.thinking` | `thinking_budget` | `ThinkingConfig` injected |
| OpenAI (o-series) | ❌ not exposed | N/A | `reasoning_tokens` count in `usage` |

### Response and stream fields added by native thinking

| Field | Type | Description |
|---|---|---|
| `LLMResponse.thinking` | `str \| None` | Raw model reasoning text |
| `StreamDelta.thinking` | `str` | Incremental thinking token (streaming) |
| `StreamChunk.accumulated_thinking` | `str` | Full thinking so far (streaming) |
| `StreamChunk.is_thinking` | `bool` | `True` while in a thinking block |

### When to use native thinking

Use `native_thinking=True` when accuracy matters more than latency:

- Complex proofs, theorem verification
- Code architecture reviews
- Medical / legal / scientific reasoning
- Any task where you want to inspect the model's reasoning, not just the answer

> **Cost note:** thinking tokens count toward your bill but are not included in
> `response.content`. Set `thinking_budget` conservatively; 4000–8000 is usually enough
> for most tasks.

---

## 23. PageIndexRAG — Vectorless RAG

**PageIndexRAG** is a lightweight RAG pipeline that requires *no embeddings* and *no
vector database*. It uses a two-stage keyword index + BM25 scoring to retrieve relevant
pages from documents. Perfect for CPU-only environments, offline use, or when you want
instant setup without configuring a vector store.

### How it works

```text
Document → page split → DecisionIndex (inverted keyword index)
                       → BM25 scorer (Okapi BM25) → top-k pages → LLM
```

1. **Page split** — PDFs are split page-by-page; all other documents use fixed character
   windows (`page_size=1000`, `page_overlap=100`).
2. **DecisionIndex** — builds an inverted keyword index over all pages for fast candidate
   retrieval (no embeddings needed).
3. **BM25 scoring** — ranks candidates with Okapi BM25, the same algorithm used by
   Elasticsearch and Solr.
4. **LLM answer** — top-k pages are passed to the LLM as context.
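
The retrieval half of these steps (everything before the LLM call) can be sketched with the standard library alone. This is an illustrative re-implementation under assumptions, not the library's code: `split_pages`, `tokenize`, and `Bm25PageIndex` are hypothetical names, the tokenizer is a simple lowercase regex, and `k1=1.5`, `b=0.75` are common textbook BM25 defaults rather than values taken from RactoGateway.

```python
import math
import re
from collections import Counter, defaultdict


def split_pages(text, page_size=1000, page_overlap=100):
    """Fixed character windows; consecutive windows share page_overlap characters."""
    if not 0 <= page_overlap < page_size:
        raise ValueError("page_overlap must be smaller than page_size")
    step = page_size - page_overlap
    last_start = max(len(text) - page_overlap, 1)
    return [text[start:start + page_size] for start in range(0, last_start, step)]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class Bm25PageIndex:
    def __init__(self, pages, k1=1.5, b=0.75):
        self.pages, self.k1, self.b = pages, k1, b
        self.term_counts = [Counter(tokenize(page)) for page in pages]
        self.lengths = [sum(counts.values()) for counts in self.term_counts]
        self.avg_len = sum(self.lengths) / len(pages)
        self.postings = defaultdict(set)  # inverted index: term -> page ids
        for page_id, counts in enumerate(self.term_counts):
            for term in counts:
                self.postings[term].add(page_id)

    def search(self, query, top_k=3):
        terms = tokenize(query)
        # Stage 1: candidate pages from the inverted index.
        candidates = set().union(*(self.postings.get(t, set()) for t in terms))
        # Stage 2: Okapi BM25 score for each candidate.
        n, scores = len(self.pages), {}
        for page_id in candidates:
            norm = 1 - self.b + self.b * self.lengths[page_id] / self.avg_len
            score = 0.0
            for term in terms:
                tf = self.term_counts[page_id][term]
                if tf:
                    df = len(self.postings[term])
                    idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                    score += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
            scores[page_id] = score
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


windows = split_pages("".join(str(i % 10) for i in range(2500)))
print(len(windows))  # 2500 characters -> 3 overlapping windows

pages = [
    "RactoGateway supports five developer kits: OpenAI, Google, Anthropic, Ollama and HuggingFace.",
    "Install the package with pip and set your API key.",
    "Streaming returns tokens one chunk at a time.",
    "Each developer kit exposes chat and stream methods.",
]
results = Bm25PageIndex(pages).search("Which developer kits are supported?")
ranked_ids = [page_id for page_id, _ in results]
print(ranked_ids)
```

Page 0 ranks first because it matches both "developer" and the rarer term "kits"; page 3 matches only "developer". Note that "supported" does not match "supports": keyword retrieval is literal unless you add stemming.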

### Quick example

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG

kit = gpt.Chat(model="gpt-4o-mini")

# Build the index
rag = PageIndexRAG(kit=kit)
rag.add_document("docs/handbook.pdf")      # PDF — split page-by-page
rag.add_document("docs/faq.txt")           # Plain text — split by char window
rag.add_texts(["RactoGateway supports 5 developer kits.", "..."])

# Query
result = rag.search("What developer kits are supported?")
print(result.answer)          # LLM answer grounded in the retrieved pages
print(result.pages[0].text)   # raw page text that was used as context
```

### No extra install

`PageIndexRAG` ships in the core package — no vector store or embedding model required:

```bash
pip install ractogateway        # PageIndexRAG included by default
pip install "ractogateway[rag]" # if you also want readers (PDF, Word, Excel…)
```

### Comparison: PageIndexRAG vs. RactoRAG

| Feature | `PageIndexRAG` | `RactoRAG` |
|---|---|---|
| Embeddings needed | ❌ No | ✅ Yes |
| Vector store needed | ❌ No | ✅ Yes (Chroma, FAISS, Pinecone…) |
| Retrieval algorithm | BM25 (keyword) | Cosine similarity (semantic) |
| Best for | Quick setup, keyword-rich docs | Deep semantic search |
| GPU/CPU | Pure CPU | CPU or GPU (embedding model) |
| Offline use | ✅ Fully offline | ⚠️ Depends on embedder |

### When to use PageIndexRAG

- Prototyping a Q&A feature without setting up a vector DB
- Compliance / legal documents where exact keyword match matters
- Offline / air-gapped environments
- Structured documents (manuals, handbooks) where pages map naturally to topics

### Advanced: async search with custom `top_k` and page settings

```python
import asyncio

async def main():
    rag = PageIndexRAG(kit=kit, top_k=5, page_size=800, page_overlap=80)
    rag.add_document("research_paper.pdf")
    result = await rag.asearch("What methodology did the authors use?")
    print(result.answer)

asyncio.run(main())
```

Full reference: [PageIndexRAG API](../api/page_index_rag.md)

---

## Quick Reference Card

```python
# ── Imports ──────────────────────────────────────────────────────────
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway.tools.registry import tool, ToolRegistry
from ractogateway.cache import ExactMatchCache, SemanticCache
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway.truncation import TokenTruncator, TruncationConfig

# ── Build a prompt ───────────────────────────────────────────────────
prompt = RactoPrompt(
    role="...", aim="...", constraints=["..."], tone="...",
    output_format="text",    # or "json", "markdown", or a Pydantic class
    context="...",           # optional background knowledge
    examples=[{"input": "...", "output": "..."}],  # optional few-shot
)

# ── Create the kit ───────────────────────────────────────────────────
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=prompt,
    exact_cache=ExactMatchCache(max_size=512),
)

# ── Sync chat ────────────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(user_message="Hello!"))
print(response.content)

# ── Async chat ───────────────────────────────────────────────────────
response = await kit.achat(gpt.ChatConfig(user_message="Hello!"))  # inside an async function

# ── Streaming ────────────────────────────────────────────────────────
for chunk in kit.stream(gpt.ChatConfig(user_message="Tell me a story.")):
    print(chunk.delta.text, end="", flush=True)

# ── Embeddings ───────────────────────────────────────────────────────
from ractogateway._models.embedding import EmbeddingConfig
resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
vec = resp.vectors[0].embedding   # list[float]

# ── Tool calling ─────────────────────────────────────────────────────
@tool
def get_price(product: str) -> float:
    """Get the price of a product."""
    return 9.99

registry = ToolRegistry()
registry.register(get_price)
response = kit.chat(gpt.ChatConfig(
    user_message="How much is a widget?",
    tools=registry,
))

# ── Chain of Thought ─────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(
    user_message="Explain why √2 is irrational.",
    chain_of_thought=True,           # step-by-step reasoning in the answer
))

# ── Native Thinking (Anthropic / Gemini) ─────────────────────────────
from ractogateway import anthropic_developer_kit as claude
claude_kit = claude.Chat(model="claude-opus-4-6")
response = claude_kit.chat(claude.ChatConfig(
    user_message="Design a cache-invalidation strategy.",
    native_thinking=True,
    thinking_budget=8000,            # max internal reasoning tokens
))
print(response.thinking)            # raw reasoning
print(response.content)             # polished answer

# ── PageIndexRAG (no embeddings) ─────────────────────────────────────
from ractogateway.rag.page_index import PageIndexRAG
rag = PageIndexRAG(kit=kit)
rag.add_document("handbook.pdf")
result = rag.search("What developer kits are supported?")
print(result.answer)

# ── Pipelines ────────────────────────────────────────────────────────
from ractogateway.pipelines import (
    SQLAnalystPipeline,
    ListClassifierPipeline,
    VideoProcessorPipeline,
    AgentPipeline,
    TranscriberBackend,
)

# SQL
sql = SQLAnalystPipeline(kit=kit)
sql_result = sql.run("Top 5 products", connection_string="postgresql://...")
print(sql_result.answer)

# Classifier
clf = ListClassifierPipeline(kit=kit, options=["Billing", "Tech Support"])
print(clf.run("I can't log in").first)

# Video
vp = VideoProcessorPipeline(
    kit=kit,
    transcriber=TranscriberBackend.FASTER_WHISPER,
    generate_summary=True,
)
vp_result = vp.run("lecture.mp4")
print(vp_result.summary)

# Agent
def search_web(query: str) -> str:
    """Search the web for information."""
    return f"Results for: {query}"

agent = AgentPipeline(kit=kit, tools=[search_web], max_steps=6)
print(agent.run("What is the capital of France?").final_answer)
```