# RactoGateway: Complete User Guide

> **Who this guide is for:** complete beginners who have never used an LLM library before, as well as experienced developers who want a deep-dive reference. Every parameter is explained in plain English *and* in technical terms, with working code examples and expected output.

---

## Table of Contents

1. [Jargon Buster: Know the Words Before You Write the Code](#1-jargon-buster)
2. [What is RactoGateway?](#2-what-is-ractogateway)
3. [Installation](#3-installation)
4. [Core Mental Model](#4-core-mental-model)
5. [RactoPrompt: The Heart of Every Request](#5-ractoprompt)
6. [Developer Kits: Your Chat Interface](#6-developer-kits)
7. [Your First Chat](#7-your-first-chat)
8. [ChatConfig: Controlling Every Request](#8-chatconfig)
9. [Getting Structured / Typed Output](#9-structured-output)
   - 9.1 Complex Nested Structured Output
   - 9.2 Validation Retries and `ResponseModelValidationError`
10. [Multi-Turn Conversations (History)](#10-multi-turn-conversations)
11. [Streaming: Real-Time Token-by-Token Output](#11-streaming)
12. [Tool Calling: LLM Calls Your Python Functions](#12-tool-calling)
13. [File Attachments: Vision & PDFs](#13-file-attachments)
14. [Embeddings: Teaching Machines to Understand Text](#14-embeddings)
15. [Performance & Cost Optimisation](#15-performance--cost-optimisation)
    - 15.1 Exact Match Cache
    - 15.2 Semantic Cache
    - 15.3 Token Truncation
    - 15.4 Cost-Aware Routing
16. [All Five Developer Kits](#16-all-five-developer-kits)
    - 16.1 OpenAIDeveloperKit (GPT)
    - 16.2 GoogleDeveloperKit (Gemini)
    - 16.3 AnthropicDeveloperKit (Claude)
    - 16.4 OllamaDeveloperKit (Local / Offline)
    - 16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM)
17. [RAG: Retrieval-Augmented Generation](#17-rag--retrieval-augmented-generation)
18. [Redis: Production Infrastructure](#18-redis--production-infrastructure)
19. [Common Mistakes & How to Fix Them](#19-common-mistakes--how-to-fix-them)
20. [Prebuilt Pipelines: Production Workflows](#20-prebuilt-pipelines--production-workflows)
    - SQL Analyst, List Classifier, Video Processor, Agent
21. [Chain of Thought Reasoning](#21-chain-of-thought-reasoning)
22. [Native Thinking / Extended Reasoning](#22-native-thinking--extended-reasoning)
23. [PageIndexRAG: Vectorless RAG](#23-pageindexrag--vectorless-rag)

---

## 1. Jargon Buster

Before diving into code, here are the key terms you will encounter. Skip to §2 if you already know these.

| Term | Plain-English Meaning | Technical Definition |
|---|---|---|
| **LLM** | A very powerful autocomplete that understands meaning | Large Language Model: a neural network trained on vast text corpora to predict/generate natural language |
| **Prompt** | What you say to the AI | The input text (plus optional instructions) sent to an LLM |
| **Completion / Response** | What the AI says back | The LLM's generated output tokens |
| **Token** | Roughly one word (sometimes less) | The smallest unit an LLM processes; ~4 chars for English |
| **System Prompt** | The AI's job description | An instruction block sent before the conversation; sets behaviour and constraints |
| **Temperature** | How creative vs. predictable the AI is | Float 0–2. 0 = near-deterministic (almost always the same output). Higher = more random/creative |
| **Streaming** | Getting the answer word-by-word in real time | Server-sent events where each token is pushed to the client as it is generated |
| **Embedding** | Converting text into a list of numbers | A dense vector representation where semantically similar texts are numerically close |
| **RAG** | Letting the AI "look things up" before answering | Retrieval-Augmented Generation: retrieve relevant chunks from a knowledge base and inject them into the prompt |
| **Tool Calling** | The AI can trigger your Python functions | Function-calling protocol where the LLM emits a structured intent and the client executes a real function |
| **Pydantic Model** | A Python class that validates data automatically | A `BaseModel` subclass that enforces types and field rules at runtime |
| **Cache** | Store an answer so you don't ask the AI twice | In-memory or distributed key-value store keyed on request fingerprint |
| **Context Window** | The AI's short-term memory | Maximum number of tokens the model can process in one request |
| **Adapter** | The translator between our library and the AI provider | A thin class that converts our internal format to the OpenAI / Google / Anthropic API wire format |

---

## 2. What is RactoGateway?

**Plain English:** RactoGateway is a Python library that lets you talk to different AI models (OpenAI, Google, Anthropic) using the same code. You don't need to learn three different APIs. You write your prompts using a structured template (the RACTO principle), and the library takes care of formatting, caching, routing, and more.

**Technical:** RactoGateway is a provider-agnostic LLM orchestration SDK built on Pydantic.
It provides:

- A unified `RactoPrompt` structured prompt compiler (the RACTO principle)
- Provider-specific developer kits (`OpenAIDeveloperKit`, `GoogleDeveloperKit`, `AnthropicDeveloperKit`)
- Sync **and** async parity on every method
- Optional middleware: exact-match cache, semantic cache, cost-aware router, token truncator
- Tool calling, file attachments, streaming, embeddings, RAG, fine-tuning, and production infra (Redis, Celery, Kafka)

**Why does this exist?** Without RactoGateway, switching from OpenAI to Anthropic means rewriting all your code. With RactoGateway, you swap one class name.

---

## 3. Installation

```bash
# Minimum: no LLM provider yet
pip install ractogateway

# OpenAI (GPT models)
pip install "ractogateway[openai]"

# Google (Gemini models)
pip install "ractogateway[google]"

# Anthropic (Claude models)
pip install "ractogateway[anthropic]"

# All three providers at once
pip install "ractogateway[all]"

# RAG (document reading, chunking, embedding, stores)
pip install "ractogateway[rag-all]"

# Redis (distributed cache, rate limiting, chat memory)
pip install "ractogateway[redis]"
```

**Requires Python 3.10 or later.**

---

## 4. Core Mental Model

Think of RactoGateway in three layers:

```
┌─────────────────────────────────────────────────────┐
│                      YOUR CODE                      │
│  RactoPrompt → ChatConfig → kit.chat()              │
├─────────────────────────────────────────────────────┤
│  DEVELOPER KIT (OpenAIDeveloperKit, etc.)           │
│  middleware: cache → route → truncate → API call    │
├─────────────────────────────────────────────────────┤
│  ADAPTER (OpenAILLMKit, GoogleLLMKit, etc.)         │
│  Translates our format → provider wire format       │
├─────────────────────────────────────────────────────┤
│  PROVIDER API (OpenAI, Google, Anthropic)           │
└─────────────────────────────────────────────────────┘
```

**You only ever touch the top layer.** The kit and adapter layers are managed for you.

---

## 5. RactoPrompt

`RactoPrompt` is how you write instructions for the AI.
It enforces the **RACTO principle**, a structured format intended to reduce hallucinations and ambiguous outputs.

**RACTO stands for:**

| Letter | Field | Plain English | Technical |
|---|---|---|---|
| **R** | `role` | Who is the AI? | System identity; primes the model's behaviour via persona specification |
| **A** | `aim` | What should it do? | Objective statement; the task the model must complete |
| **C** | `constraints` | What must it never do? | Hard invariants; rule set injected into `[CONSTRAINTS]` block |
| **T** | `tone` | How should it talk? | Communication register; affects lexical and stylistic choices |
| **O** | `output_format` | What shape should the answer be in? | Output schema; can be a keyword, a string, or a Pydantic model class |

Plus two optional helpers: `context` (background knowledge) and `examples` (few-shot examples).

### 5.1 Minimal Example

```python
from ractogateway.prompts.engine import RactoPrompt

prompt = RactoPrompt(
    role="You are a helpful customer-support agent for a software company.",
    aim="Answer the user's question about our product.",
    constraints=[
        "Never make up features that don't exist.",
        "If you don't know the answer, say so.",
    ],
    tone="Friendly and concise.",
    output_format="text",
)

# See what the compiled system prompt looks like:
print(prompt.compile())
```

**Expected output:**

```
[ROLE]
You are a helpful customer-support agent for a software company.

[AIM]
Answer the user's question about our product.

[CONSTRAINTS]
- Never make up features that don't exist.
- If you don't know the answer, say so.

[TONE]
Friendly and concise.

[OUTPUT]
Respond in plain text with no special formatting.

[GUARDRAILS]
- If you are unsure or lack sufficient information, state it explicitly rather than guessing.
- Do NOT fabricate facts, citations, URLs, statistics, or code that you cannot verify.
- Stick strictly to what is asked. Do not add unrequested information.
- If the answer requires assumptions, list each assumption explicitly before proceeding.
```

> **Notice the `[GUARDRAILS]` section at the bottom.** This is auto-generated by `anti_hallucination=True` (the default). It tells the model to be honest about uncertainty. You can disable it with `anti_hallucination=False` if you need maximum creative freedom.

---

### 5.2 Full Parameter Reference

```python
from pydantic import BaseModel

from ractogateway.prompts.engine import RactoPrompt


class Summary(BaseModel):
    headline: str
    bullet_points: list[str]
    confidence_score: float  # 0.0 to 1.0


prompt = RactoPrompt(
    # ── REQUIRED ──────────────────────────────────────────────────────
    role="You are a senior financial analyst.",
    # Plain: "Tell the AI who it is"
    # Technical: Persona string prepended to the [ROLE] block; primes
    # the model's prior distribution toward domain-specific vocabulary

    aim="Summarise the provided earnings report into key takeaways.",
    # Plain: "Tell the AI what job it has to do"
    # Technical: Task objective injected into [AIM]; should be one clear imperative sentence

    constraints=[
        "Only use numbers that appear in the report; never invent figures.",
        "Keep bullet points to at most 15 words each.",
        "Do not provide investment advice.",
    ],
    # Plain: "Red lines the AI must never cross"
    # Technical: List[str]; each item becomes a bullet in [CONSTRAINTS].
    # Minimum one constraint required.

    tone="Professional, concise, and factual.",
    # Plain: "How the AI should sound"
    # Technical: Register specification injected into [TONE]; affects
    # lexical formality and style

    output_format=Summary,
    # Plain: "Exactly what shape should the answer be in?"
    # Technical: Union[str, type[BaseModel]].
    #   - "text"     → plain text
    #   - "json"     → raw JSON object
    #   - "markdown" → markdown-formatted response
    #   - A Pydantic model class → the full JSON Schema is embedded in the prompt;
    #     the LLM must return JSON that validates against it.

    # ── OPTIONAL ──────────────────────────────────────────────────────
    context="Q3 2025 earnings call. Revenue: $4.2B (+12% YoY). EPS: $1.87.",
    # Plain: "Background knowledge the AI needs to do its job"
    # Technical: Domain-specific text injected between [AIM] and [CONSTRAINTS].
    # Ideal for passing documents, retrieved chunks, or facts.

    examples=[
        {
            "input": "Revenue grew 5% but EPS fell 10%.",
            "output": '{"headline": "Mixed signals: top-line growth masked by margin compression", ...}'
        },
    ],
    # Plain: "Show the AI what a good answer looks like"
    # Technical: Few-shot exemplars injected into [EXAMPLES] block; each dict
    # must contain exactly "input" and "output" keys.

    anti_hallucination=True,
    # Plain: "Should the AI be told to say 'I don't know' instead of guessing?"
    # Technical: Boolean flag. When True, appends [GUARDRAILS] block with
    # explicit uncertainty-disclosure directives. Default: True.
)
```

---

## 6. Developer Kits

A **Developer Kit** is your interface to a specific LLM provider. All five kits (`OpenAIDeveloperKit`, `GoogleDeveloperKit`, `AnthropicDeveloperKit`, `OllamaDeveloperKit`, `HuggingFaceDeveloperKit`) share the same six method names.

### OpenAIDeveloperKit: Full Parameter Reference

```python
from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",
    # Plain: "Which AI model should I use?"
    # Technical: Chat model ID passed to openai.chat.completions.create(model=...).
    # Use "auto" to enable cost-aware routing (requires router= param).
    # Common values: "gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "o3-mini"

    api_key="sk-...",
    # Plain: "My OpenAI account password"
    # Technical: Bearer token for OpenAI API auth. Falls back to
    # os.environ["OPENAI_API_KEY"] when omitted.

    base_url=None,
    # Plain: "Send requests to a different server (e.g. Azure or your own proxy)"
    # Technical: Override for openai.base_url. Used for Azure OpenAI endpoints or
    # local model servers that implement the OpenAI protocol.

    embedding_model="text-embedding-3-small",
    # Plain: "Which model to use when converting text to numbers (embeddings)"
    # Technical: Default model ID for embed() / aembed() calls.
    # Passed to openai.embeddings.create(model=...).

    default_prompt=None,
    # Plain: "A prompt to use for every request unless I override it"
    # Technical: RactoPrompt instance used when ChatConfig.prompt is None.
    # If both are None, kit.chat() raises ValueError.

    exact_cache=None,
    # Plain: "Store answers so I don't pay for the same question twice"
    # Technical: ExactMatchCache instance. On a byte-identical request the cached
    # LLMResponse is returned without an API call. O(1) lookup.

    semantic_cache=None,
    # Plain: "Store answers and also reuse them for questions that mean the same thing"
    # Technical: SemanticCache instance. Uses cosine similarity on embeddings.
    # Returns cached response when similarity >= threshold.

    router=None,
    # Plain: "Automatically pick the cheapest model that can handle each question"
    # Technical: CostAwareRouter instance. Routes each request to the first tier
    # whose max_score >= the computed prompt complexity score.
    # Required when model="auto".

    truncator=None,
    # Plain: "Automatically shorten old conversation history if it gets too long"
    # Technical: TokenTruncator instance. Trims history messages to keep total
    # token count within the model's context window before each API call.
)
```

---

## 7. Your First Chat

Let's put it all together in a complete, working example.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt

# 1. Define who the AI is and what it should do
prompt = RactoPrompt(
    role="You are a helpful Python tutor.",
    aim="Explain the concept the user asks about in simple terms.",
    constraints=["Use beginner-friendly language.", "Keep the answer under 3 sentences."],
    tone="Warm, encouraging, and clear.",
    output_format="text",
)

# 2. Create the kit (reads OPENAI_API_KEY from environment automatically)
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=prompt,
)

# 3. Send a message and get a response
response = kit.chat(gpt.ChatConfig(user_message="What is a Python list?"))

print(response.content)
# A list in Python is an ordered collection of items that can hold any type
# of data: numbers, strings, even other lists. You create one with square
# brackets, like my_list = [1, "hello", True]. You can add, remove, or
# change items at any time!

print(f"Tokens used: {response.usage}")
# Tokens used: {'prompt_tokens': 127, 'completion_tokens': 54, 'total_tokens': 181}

print(f"Why did generation stop: {response.finish_reason}")
# Why did generation stop: FinishReason.STOP

# Provider-specific fields (e.g. which model ran) live in the raw response:
print(response.raw.model)  # gpt-4o-mini (OpenAI ChatCompletion object)
```

### What is `LLMResponse`?

The return type of `kit.chat()` is an `LLMResponse` object. Here are its key fields:

| Field | Type | Plain English | Technical |
|---|---|---|---|
| `content` | `str \| None` | The AI's answer as a string | Raw text of the completion (markdown fences auto-stripped) |
| `parsed` | `dict \| list \| None` | The answer as structured data (when response is valid JSON) | JSON-decoded via `try_parse_json()`; further validated when `response_model` is set |
| `finish_reason` | `FinishReason` | Why the AI stopped generating | Enum: `STOP` (natural end), `LENGTH` (hit max_tokens), `TOOL_CALL` |
| `usage` | `dict[str, int]` | How many tokens were used | `prompt_tokens`, `completion_tokens`, `total_tokens` |
| `tool_calls` | `list[ToolCallResult]` | Any tools the AI wanted to call | Non-empty when the model returns a function-call intent |
| `raw` | `Any` | The raw provider response object | Original SDK object (e.g. `openai.ChatCompletion`); use `response.raw.model` to get the model name |

---

## 8. ChatConfig

`ChatConfig` is the object you pass to every `chat()`, `achat()`, `stream()`, and `astream()` call. It controls the details of a single request.

```python
from pydantic import BaseModel

from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt


class ProductReview(BaseModel):
    sentiment: str  # "positive" | "neutral" | "negative"
    score: int      # 1–10
    summary: str


config = gpt.ChatConfig(
    user_message="The keyboard is amazing but the battery dies in 3 hours.",
    # Plain: "The question or text you want to send to the AI"
    # Technical: The human turn content. Minimum 1 character (enforced by Pydantic).

    prompt=RactoPrompt(
        role="You are a product review classifier.",
        aim="Classify the review and return a structured analysis.",
        constraints=["Scores must be integers from 1 to 10."],
        tone="Neutral and objective.",
        output_format=ProductReview,
    ),
    # Plain: "Override the kit's default prompt for just this one request"
    # Technical: Per-request RactoPrompt. Takes precedence over kit.default_prompt.
    # If both are None, raises ValueError.

    temperature=0.0,
    # Plain: "How predictable vs. creative should the answer be?"
    # Technical: Sampling temperature. Float in [0.0, 2.0].
    #   0.0  → greedy (argmax) decoding; near-deterministic for the same input
    #   ~0.7 → balanced creativity/coherence (good for most tasks)
    #   1.5+ → very random; may become incoherent for structured tasks

    max_tokens=512,
    # Plain: "Maximum length of the AI's answer"
    # Technical: Hard cap on completion tokens. If the model hasn't finished,
    # generation stops and finish_reason becomes LENGTH.
    # Default is 4096. Keep lower for short structured tasks to save cost.

    response_model=ProductReview,
    # Plain: "Validate the AI's JSON answer against this Python class"
    # Technical: type[BaseModel]. After the API call, the raw JSON content is
    # parsed and validated via ProductReview.model_validate().
    # On repeated failure, ResponseModelValidationError is raised.
    # If omitted and prompt.output_format is a BaseModel, the kit
    # infers response_model automatically.

    history=[],
    # Plain: "Previous messages in the conversation (for multi-turn chat)"
    # Technical: list[Message]. Each Message has role (user/assistant/system) and
    # content (str). Injected between the system prompt and the current
    # user message. Managed manually or via RedisChatMemory.

    tools=None,
    # Plain: "Python functions the AI is allowed to call"
    # Technical: ToolRegistry instance. The adapter serialises its schemas into
    # provider-specific function-calling format before the API call.

    auto_execute_tools=False,
    # Plain: "Should the kit execute tool calls automatically and return final content?"
    # Technical: If True, chat()/achat() run a local tool loop:
    # LLM tool call -> execute registry callables -> follow-up LLM call.

    max_tool_turns=3,
    # Plain: "How many tool-call rounds are allowed in auto mode?"
    # Technical: Safety cap for auto_execute_tools loop. Range 1..10.

    extra={},
    # Plain: "Any other provider-specific settings I want to pass"
    # Technical: Pass-through dict merged into the API request kwargs.
    # E.g. extra={"seed": 42, "top_p": 0.9, "stop": ["\n\n"]}
)

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")  # any kit works here
response = kit.chat(config)
print(response.parsed)
# {'sentiment': 'neutral', 'score': 5, 'summary': 'Great keyboard but very poor battery life.'}
```

---

## 9. Structured Output

One of the most powerful features: getting a validated Python object back from the AI instead of raw text.

### Step 1: Define your output shape with Pydantic

```python
from pydantic import BaseModel


class WeatherReport(BaseModel):
    city: str
    temperature_celsius: float
    condition: str  # e.g. "sunny", "rainy", "cloudy"
    uv_index: int
```

### Step 2: Pass the class as `output_format` in RactoPrompt

```python
from ractogateway.prompts.engine import RactoPrompt

prompt = RactoPrompt(
    role="You are a weather data formatter.",
    aim="Parse the user's description into a structured weather report.",
    constraints=["Always use Celsius.", "UV index must be 0–11."],
    tone="Concise and data-focused.",
    output_format=WeatherReport,  # <-- the Pydantic class
)
```

### Step 3: Also pass it as `response_model` in ChatConfig

```python
from ractogateway import openai_developer_kit as gpt

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)

config = gpt.ChatConfig(
    user_message="London, 18 degrees, overcast, UV 3.",
    response_model=WeatherReport,  # <-- validates the parsed JSON
)

response = kit.chat(config)

# response.parsed is a dict already validated against WeatherReport
print(response.parsed)
# {'city': 'London', 'temperature_celsius': 18.0, 'condition': 'overcast', 'uv_index': 3}

# To get a proper WeatherReport instance:
report = WeatherReport(**response.parsed)
print(report.city)      # London
print(report.uv_index)  # 3
print(type(report))     # <class '__main__.WeatherReport'>
```

> **Why two places?** `output_format` in `RactoPrompt` tells the LLM what to generate (embeds the JSON Schema in the system prompt). `response_model` in `ChatConfig` validates the output in Python. Use both together for maximum safety. If you omit `response_model`, the kits infer it automatically when `prompt.output_format` is a Pydantic model class.

---

### 9.1 Complex Nested Structured Output: Enterprise Vendor Evaluation

Real-world schemas are deeply nested with enums, constrained integers, and lists of sub-models. This example shows a board-level vendor risk evaluation with six sub-models.

> **Key Rule: always make score ranges explicit in your constraints.**
> Pydantic enforces bounds only after the response arrives (as a validation
> error, not an API error), so the LLM has no way to know the range unless
> you state it in the prompt. Use `conint(ge=1, le=100)` for percentage-like
> scores and tell the model `"all scores are integers on a 1–100 scale"` in
> the constraints list.

```python
from typing import List, Literal

from pydantic import BaseModel, conint, confloat

from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt


# ── Sub-models ─────────────────────────────────────────────────────────────
class FinancialRisk(BaseModel):
    burn_rate_risk: Literal["low", "medium", "high"]
    runway_months: conint(ge=0, le=60)
    profitability_projection_years: conint(ge=0, le=10)
    financial_score: conint(ge=1, le=100)  # 1–100, higher = healthier finances


class SecurityAssessment(BaseModel):
    data_encryption: Literal["none", "at_rest_only", "at_rest_and_in_transit"]
    iso_certified: bool
    soc2_certified: bool
    gdpr_compliant: bool
    vulnerabilities_found: conint(ge=0, le=100)
    security_score: conint(ge=1, le=100)  # 1–100, higher = more secure


class TechnicalArchitecture(BaseModel):
    architecture_style: Literal["monolith", "microservices", "serverless", "hybrid"]
    cloud_provider: Literal["aws", "gcp", "azure", "multi-cloud", "on-prem"]
    scalability_rating: conint(ge=1, le=100)  # 1–100, higher = more scalable
    reliability_sla: confloat(ge=0.0, le=100.0)
    vendor_lock_in_risk: Literal["low", "medium", "high"]


class RiskMatrix(BaseModel):
    category: Literal["financial", "security", "technical", "operational"]
    probability: Literal["low", "medium", "high"]
    impact: Literal["low", "medium", "high"]
    mitigation_strategy: str


class MigrationPhase(BaseModel):
    phase_name: str
    duration_months: conint(ge=1, le=36)
    complexity_score: conint(ge=1, le=10)  # 1–10 scale (task complexity)
    key_deliverables: List[str]


class FinalRecommendation(BaseModel):
    decision: Literal["approve", "approve_with_conditions", "reject"]
    confidence_score: conint(ge=1, le=100)
    key_strengths: List[str]
    critical_weaknesses: List[str]
    board_summary: str


class VendorEvaluation(BaseModel):
    vendor_name: str
    industry: str
    annual_contract_value_usd: conint(ge=10_000, le=10_000_000)
    financial_risk: FinancialRisk
    security_assessment: SecurityAssessment
    technical_architecture: TechnicalArchitecture
    top_risks: List[RiskMatrix]
    migration_plan: List[MigrationPhase]
    overall_risk_score: conint(ge=1, le=100)  # 1–100, higher = riskier
    final_recommendation: FinalRecommendation


# ── User input ─────────────────────────────────────────────────────────────
vendor_brief = """
We are evaluating NeuroStack AI as a strategic enterprise AI vendor.

Company Profile:
- 3 years old, monthly burn rate: $1.2M, raised $25M Series A
- Not profitable; expected profitability in 4–5 years

Security:
- ISO 27001 certified, no SOC 2, encryption at rest and in transit
- 3 minor vulnerabilities last year, GDPR compliant

Technical:
- Hybrid architecture hosted on AWS, SLA 99.2%
- Heavy proprietary API usage; deep workflow integration required

Financials:
- Annual contract: $2.4M, operational dependency: Critical
- Moderate probability of vendor collapse in next 18 months
"""

# ── Prompt ─────────────────────────────────────────────────────────────────
kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

config = gpt.ChatConfig(
    user_message=vendor_brief,
    prompt=RactoPrompt(
        role="You are a Chief Risk Officer conducting a board-level enterprise vendor risk evaluation.",
        aim="Produce a structured, multi-dimensional vendor evaluation strictly matching the schema.",
        constraints=[
            # ✅ Always state numeric ranges explicitly; do not rely on the model
            # guessing Pydantic bounds from the schema description alone.
            "financial_score, security_score, scalability_rating, overall_risk_score, and confidence_score are all integers on a 1–100 scale.",
            "complexity_score inside each MigrationPhase is an integer on a 1–10 scale.",
            "runway_months must be derived from (cash raised ÷ monthly burn) realistically.",
            "overall_risk_score must reflect the sub-scores logically.",
            "decision must align with overall_risk_score: ≤35 approve, 36–65 approve_with_conditions, >65 reject.",
            "Provide at least 3 top_risks entries.",
            "Provide exactly 3 migration phases.",
        ],
        tone="Executive, analytical, objective.",
        output_format=VendorEvaluation,
    ),
    temperature=0.0,
    max_tokens=2000,
    response_model=VendorEvaluation,
)

# ── Execute ────────────────────────────────────────────────────────────────
from ractogateway.exceptions import ResponseModelValidationError

try:
    response = kit.chat(config)
    print("======== PARSED STRUCTURED OUTPUT ========")
    print(response.parsed)
    print("\n======== RAW JSON OUTPUT ========")
    print(response.content)
except ResponseModelValidationError as e:
    print(f"Validation failed after {e.attempts} attempt(s)")
    print(f"Last error: {e.last_error}")
    print(f"Raw output: {e.raw_response}")
```

**Expected output (values will vary slightly with the model):**

```text
======== PARSED STRUCTURED OUTPUT ========
{
    'vendor_name': 'NeuroStack AI',
    'industry': 'Artificial Intelligence',
    'annual_contract_value_usd': 2400000,
    'financial_risk': {
        'burn_rate_risk': 'high',
        'runway_months': 20,
        'profitability_projection_years': 4,
        'financial_score': 40
    },
    'security_assessment': {
        'data_encryption': 'at_rest_and_in_transit',
        'iso_certified': True,
        'soc2_certified': False,
        'gdpr_compliant': True,
        'vulnerabilities_found': 3,
        'security_score': 70
    },
    'technical_architecture': {
        'architecture_style': 'hybrid',
        'cloud_provider': 'aws',
        'scalability_rating': 75,
        'reliability_sla': 99.2,
        'vendor_lock_in_risk': 'high'
    },
    ...
'overall_risk_score': 55, 'final_recommendation': { 'decision': 'approve_with_conditions', 'confidence_score': 65, ... } } ``` --- ### 9.2 Validation Retries and `ResponseModelValidationError` When `response_model` is set, RactoGateway automatically retries the API call with a targeted correction prompt if Pydantic rejects the output. This is controlled by `max_validation_retries` in `ChatConfig` (default: **2**). **Retry flow:** 1. Initial API call → Pydantic validation attempt. 2. On failure → the exact field errors and the bad JSON are fed back to the LLM. 3. The LLM is asked to return a corrected JSON (keeping all valid fields). 4. Steps 2–3 repeat up to `max_validation_retries` times. 5. If all attempts fail → `ResponseModelValidationError` is raised. ```python from ractogateway import openai_developer_kit as gpt from ractogateway.prompts.engine import RactoPrompt from ractogateway.exceptions import ResponseModelValidationError from pydantic import BaseModel, conint class Score(BaseModel): label: str value: conint(ge=1, le=10) # strict 1–10 kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini") config = gpt.ChatConfig( user_message="Rate 'Python' as a programming language.", prompt=RactoPrompt( role="You are a language evaluator.", aim="Return a score for the given language.", constraints=["value must be an integer from 1 to 10."], tone="Concise.", output_format=Score, ), response_model=Score, max_validation_retries=2, # default — retry up to 2 times on bad output ) try: response = kit.chat(config) print(response.parsed) # {'label': 'Python', 'value': 9} except ResponseModelValidationError as e: # All retries exhausted — inspect what went wrong print(f"Failed after {e.attempts} attempt(s)") print(f"Last Pydantic error: {e.last_error}") print(f"Raw LLM output: {e.raw_response}") ``` **`ResponseModelValidationError` attributes:** | Attribute | Type | Meaning | |---|---|---| | `attempts` | `int` | Total API calls made (1 initial + N retries) | | `last_error` | 
`pydantic.ValidationError` | The final Pydantic error | | `raw_response` | `str \| None` | Raw text from the last LLM attempt | **`max_validation_retries` in `ChatConfig`:** | Value | Behaviour | |---|---| | `0` | No retries — raise immediately on first validation failure | | `1` | One retry after the initial call | | `2` | Two retries (default) | | `3–5` | More retries for complex schemas (max allowed: 5) | > **Streaming note:** `stream()` and `astream()` cannot retry because content > is already delivered token-by-token. If validation fails on the final chunk, > `ResponseModelValidationError` is raised directly. Wrap your stream loop in > `try/except ResponseModelValidationError` if you use `response_model` with > streaming. --- ## 10. Multi-Turn Conversations To have a conversation with memory, pass the `history` list to each `ChatConfig`: ```python from ractogateway import openai_developer_kit as gpt from ractogateway._models.chat import Message, MessageRole from ractogateway.prompts.engine import RactoPrompt kit = gpt.OpenAIDeveloperKit( model="gpt-4o-mini", default_prompt=RactoPrompt( role="You are a helpful AI assistant.", aim="Carry on a friendly conversation.", constraints=["Remember what the user said earlier."], tone="Casual and friendly.", output_format="text", ), ) # Turn 1 response1 = kit.chat(gpt.ChatConfig(user_message="My name is Alice.")) print(response1.content) # Nice to meet you, Alice! How can I help you today? # Build the history from turn 1 history = [ Message(role=MessageRole.USER, content="My name is Alice."), Message(role=MessageRole.ASSISTANT, content=response1.content), ] # Turn 2 — the model now "remembers" turn 1 response2 = kit.chat(gpt.ChatConfig( user_message="What is my name?", history=history, # <-- inject previous turns )) print(response2.content) # Your name is Alice! 😊 ``` **Tip:** For production multi-user apps, use `RedisChatMemory` (see §18) to store history in Redis so it survives server restarts. --- ## 11. 
Streaming Streaming lets you display the AI's answer word-by-word as it is generated — much better UX than waiting for the full response. ### Synchronous Streaming ```python from ractogateway import openai_developer_kit as gpt from ractogateway.prompts.engine import RactoPrompt kit = gpt.OpenAIDeveloperKit( model="gpt-4o-mini", default_prompt=RactoPrompt( role="You are a storyteller.", aim="Write a short story based on the user's prompt.", constraints=["Keep it under 100 words."], tone="Vivid and imaginative.", output_format="text", ), ) config = gpt.ChatConfig(user_message="A robot discovers it can dream.") for chunk in kit.stream(config): # chunk.delta.text is the new text in this chunk (may be empty string) print(chunk.delta.text, end="", flush=True) if chunk.is_final: print() # newline after the story print(f"Finish reason: {chunk.finish_reason}") print(f"Total tokens: {chunk.usage.get('total_tokens', '?')}") ``` **Expected output (streaming, printed token-by-token):** ``` In the hum of the server room, Unit-7 closed its optical sensors... and dreamed of open fields and laughter it had never known. When it woke, it understood why humans called sleep a gift. Finish reason: FinishReason.STOP Total tokens: 112 ``` ### Asynchronous Streaming ```python import asyncio from ractogateway import openai_developer_kit as gpt async def main(): async for chunk in kit.astream(config): print(chunk.delta.text, end="", flush=True) if chunk.is_final: break asyncio.run(main()) ``` ### What is `StreamChunk`? | Field | Plain English | Technical | |---|---|---| | `delta.text` | New text arrived in this chunk | Incremental token string from the current event | | `accumulated_text` | Everything generated so far | Concatenation of all previous `delta.text` values | | `is_final` | Is this the last chunk? | `True` when `finish_reason` is set | | `finish_reason` | Why did generation end? 
| `FinishReason.STOP`, `LENGTH`, or `TOOL_CALL` |
| `usage` | Token counts (only in final chunk) | Dict with `prompt_tokens`, `completion_tokens`, `total_tokens` |
| `tool_calls` | Tools the model wants to call | Non-empty list when `finish_reason == TOOL_CALL` |
| `parsed` | Parsed + validated object (if `response_model` set) | Available on final chunk only |

---

## 12. Tool Calling

Tool calling lets the LLM trigger your Python functions. This is useful for live data, calculators, search, and business actions.

### Step 1 — Define tools and register them

```python
from datetime import datetime
import zoneinfo

from ractogateway.tools.registry import tool, ToolRegistry

registry = ToolRegistry()

@tool(registry)
def get_weather(city: str, unit: str = "celsius") -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is 22°{'C' if unit == 'celsius' else 'F'} and sunny."

@tool(registry)
def get_time(timezone: str) -> str:
    """Return the current time in the given timezone."""
    tz = zoneinfo.ZoneInfo(timezone)
    return datetime.now(tz).strftime("%H:%M on %A, %d %B %Y")

print(list(registry.tools.keys()))
# ['get_weather', 'get_time']
```

You can also use `@tool` without a registry and register later:

```python
@tool
def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression."""
    # WARNING: eval() executes arbitrary Python. Only use it with trusted input.
    return eval(expression)  # noqa: S307

registry.register(calculate)
```

### Step 2 — One-call final answer (recommended)

Set `auto_execute_tools=True` so that `response.content` holds the final answer, just as it does for a request without tools.
```python from ractogateway.prompts.engine import RactoPrompt from ractogateway import openai_developer_kit as gpt kit = gpt.OpenAIDeveloperKit( model="gpt-4o", default_prompt=RactoPrompt( role="You are a helpful assistant with access to live data tools.", aim="Answer the user's question using the available tools.", constraints=["Always use the tools when relevant."], tone="Helpful and precise.", output_format="text", ), ) config = gpt.ChatConfig( user_message="What's the weather like in Paris and what time is it there?", tools=registry, auto_execute_tools=True, max_tool_turns=3, ) response = kit.chat(config) print(response.content) # Final integrated answer ``` ### Step 3 — Manual tool loop (advanced) If you prefer full control, keep `auto_execute_tools=False` (default) and execute `response.tool_calls` yourself. ```python response = kit.chat( gpt.ChatConfig( user_message="What's the weather in Tokyo and what is 12 * 8?", tools=registry, ) ) if response.tool_calls: for tc in response.tool_calls: fn = registry.get_callable(tc.name) if fn: print(tc.name, tc.arguments, "->", fn(**tc.arguments)) ``` > **What is `ToolCallResult`?** It has three fields: `id` (unique call ID from the API), > `name` (function name), and `arguments` (dict ready to `**unpack` into your function). --- ## 13. File Attachments Send images, PDFs, and text files alongside your text message using `RactoFile`. 
```python
from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway import openai_developer_kit as gpt

# Define the prompt once so both the kit and to_messages() can use it
prompt = RactoPrompt(
    role="You are a visual QA assistant.",
    aim="Describe what you see in the attached image.",
    constraints=["Be specific about colours, shapes, and text visible in the image."],
    tone="Descriptive and precise.",
    output_format="text",
)

kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",  # must be a vision-capable model
    default_prompt=prompt,
)

# Load an image from disk (MIME type is auto-detected)
image = RactoFile.from_path("/path/to/screenshot.png")

# Or from raw bytes:
# image = RactoFile.from_bytes(open("photo.jpg", "rb").read(), "image/jpeg")

messages = prompt.to_messages(
    user_message="What is shown in this image?",
    attachments=[image],
    provider="openai",  # formats content blocks for the correct provider
)

# You can also just use kit.chat() with a ChatConfig; attachments can be
# baked into the prompt's to_messages() call directly
```

### `RactoFile` Parameter Reference

| Method / Param | Plain English | Technical |
|---|---|---|
| `RactoFile.from_path(path)` | Load a file from your disk | Reads bytes and auto-detects MIME type via `mimetypes.guess_type` |
| `RactoFile.from_bytes(data, mime_type)` | Create from raw bytes you already have | No disk I/O; pass `bytes` + an explicit MIME type string |
| `data` | The file's raw bytes | `bytes` object |
| `mime_type` | What type of file it is | MIME string: `"image/png"`, `"image/jpeg"`, `"application/pdf"`, `"text/plain"`, etc. |
| `name` | An optional filename label | `str`; used for display/debugging only |
| `is_image` | Is it a picture? | `True` for JPEG, PNG, GIF, WEBP |
| `is_pdf` | Is it a PDF? | `True` for `application/pdf` |
| `base64_data` | File as a base64 string | Used internally by the provider adapters |

---

## 14. Embeddings

Embeddings convert text into lists of numbers (vectors) where semantically similar texts end up numerically close.
This powers semantic search, clustering, and RAG.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway._models.embedding import EmbeddingConfig

kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")

config = EmbeddingConfig(
    texts=["Python is a programming language.", "I love apples.", "Java is also a language."],
    # Plain: "The list of strings to convert into number vectors"
    # Technical: List[str] passed to openai.embeddings.create(input=...)

    model="text-embedding-3-small",
    # Plain: "Which embedding model to use"
    # Technical: Overrides kit.embedding_model for this specific call.
    #            None means use the kit's default.

    dimensions=None,
    # Plain: "How many numbers should each vector have?"
    # Technical: Optional int. text-embedding-3-* models accept a reduced size
    #            (e.g. 256 instead of the 1536 default of text-embedding-3-small)
    #            for faster similarity search.
)

response = kit.embed(config)

for vec in response.vectors:
    print(f"Text: {vec.text!r}")
    print(f"Index: {vec.index}")
    print(f"Vector: [{vec.embedding[0]:.4f}, {vec.embedding[1]:.4f}, ...] (length {len(vec.embedding)})")
    print()
```

**Expected output:**

```
Text: 'Python is a programming language.'
Index: 0
Vector: [0.0123, -0.0456, ...] (length 1536)

Text: 'I love apples.'
Index: 1
Vector: [-0.0234, 0.0789, ...] (length 1536)

Text: 'Java is also a language.'
Index: 2
Vector: [0.0118, -0.0451, ...] (length 1536)
```

> **Pro tip:** Texts 0 and 2 should score as much more similar to each other than either does to text 1, because both are about programming languages. This is the essence of embedding-powered semantic search.

---

## 15. Performance & Cost Optimisation

### 15.1 Exact Match Cache

**Plain English:** If someone asks the exact same question again (same words, same settings), return the cached answer instantly, with no API call and no cost.

**Technical:** SHA-256 keyed over `(user_message, system_prompt, model, temperature, max_tokens)`. LRU eviction with optional TTL. Thread-safe via `threading.Lock`.
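To make the key derivation concrete, here is a minimal standalone sketch of how an exact-match key could be built from those five fields. It assumes the fields are serialised as compact JSON before hashing; the helper name `make_cache_key` and the serialisation format are illustrative assumptions, not RactoGateway's actual internals.

```python
import hashlib
import json


def make_cache_key(user_message, system_prompt, model, temperature, max_tokens):
    """Hypothetical helper: derive a SHA-256 cache key from the request fields."""
    # Serialise in a fixed field order with compact separators so that
    # identical requests always produce identical bytes.
    payload = json.dumps(
        [user_message, system_prompt, model, temperature, max_tokens],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


question = "What is the capital of France?"
key_a = make_cache_key(question, "You are helpful.", "gpt-4o-mini", 0.0, 256)
key_b = make_cache_key(question, "You are helpful.", "gpt-4o-mini", 0.0, 256)
key_c = make_cache_key(question, "You are helpful.", "gpt-4o-mini", 0.7, 256)

print(key_a == key_b)  # True: identical inputs map to the same key (cache hit)
print(key_a == key_c)  # False: a different temperature yields a different key
print(len(key_a))      # 64 hexadecimal characters
```

Because any change in wording or settings changes the hash, a cache keyed this way only helps with requests that repeat exactly.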
```python from ractogateway import openai_developer_kit as gpt from ractogateway.cache import ExactMatchCache cache = ExactMatchCache( max_size=1024, # Plain: "How many answers to remember at most" # Technical: LRU capacity. When full, the least-recently-used entry is evicted. # 0 = unlimited (no eviction ever). ttl_seconds=3600, # Plain: "Forget an answer after this many seconds" # Technical: Float. Entries older than ttl_seconds are treated as cache misses # and lazily evicted on next access. None = never expire. ) kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", exact_cache=cache) # First call — hits the API r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?")) print(r1.content) # Paris is the capital of France. # Second call (identical) — served from cache in microseconds, $0 cost r2 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?")) print(r2.content) # Paris is the capital of France. print(cache.stats) # CacheStats(hits=1, misses=1, size=1) ``` --- ### 15.2 Semantic Cache **Plain English:** Even if the question is *worded differently*, return the cached answer if it means the same thing. **Technical:** Embeds each new query and computes cosine similarity against stored embeddings. Returns the cached response when similarity ≥ threshold. ```python from ractogateway.cache import SemanticCache import ractogateway.openai_developer_kit as gpt # You supply an embedding function — any callable (str) -> list[float] kit_for_embed = gpt.OpenAIDeveloperKit(model="gpt-4o-mini") def embed(text: str) -> list[float]: from ractogateway._models.embedding import EmbeddingConfig resp = kit_for_embed.embed(EmbeddingConfig(texts=[text])) return resp.vectors[0].embedding sem_cache = SemanticCache( embedder=embed, # Plain: "A function that converts text to a list of numbers" # Technical: Callable[[str], list[float]]. Called once for each new query # to compute its embedding for similarity comparison. 
similarity_threshold=0.92, # Plain: "How similar does a question have to be to reuse a cached answer?" # Technical: Float in (0, 1]. Cosine similarity minimum. Higher = stricter match. # 0.92 works well; lower (e.g. 0.85) gives more cache hits but may # return wrong answers for loosely-related questions. max_size=512, # Plain: "How many answers to remember" # Technical: LRU capacity for the semantic cache store. ) kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", semantic_cache=sem_cache) # First call r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?")) # → API call happens # Different wording, same meaning — cache HIT (if similarity >= 0.92) r2 = kit.chat(gpt.ChatConfig(user_message="Which city is France's capital?")) # → No API call; cached answer returned ``` --- ### 15.3 Token Truncation **Plain English:** Long conversations can overflow the AI's memory limit. The truncator automatically cuts old messages to keep things within bounds. **Technical:** Sliding-window strategy over `ChatConfig.history`. Keeps `keep_first_n` messages and `keep_last_n` messages; drops the middle. Uses `len(text) // 4` as a token estimator by default, or `tiktoken` for precision. ```python from ractogateway.truncation import TokenTruncator, TruncationConfig, MODEL_CONTEXT_LIMITS from ractogateway import openai_developer_kit as gpt truncator = TokenTruncator(TruncationConfig( keep_first_n=2, # Plain: "Always keep the first N history messages (e.g. important instructions)" # Technical: int. These messages are never evicted, regardless of token count. keep_last_n=8, # Plain: "Always keep the most recent N messages" # Technical: int. Recent context is preserved; only the 'middle' is dropped. safety_margin=512, # Plain: "Leave room for the model's reply" # Technical: Tokens reserved for the completion. Effective limit = # context_window - safety_margin. 
token_counter=None, # Plain: "How to count tokens (leave blank for fast estimate)" # Technical: Optional Callable[[str], int]. When None, uses len(text) // 4. # For precision, pass tiktoken: lambda t: len(enc.encode(t)) )) kit = gpt.OpenAIDeveloperKit(model="gpt-4o", truncator=truncator) # Now every kit.chat() / kit.achat() call will auto-trim history before sending. # Check the context limit for any model: print(MODEL_CONTEXT_LIMITS["gpt-4o"]) # 128000 print(MODEL_CONTEXT_LIMITS["gpt-4o-mini"]) # 128000 print(MODEL_CONTEXT_LIMITS["claude-opus-4-6"]) # 200000 ``` --- ### 15.4 Cost-Aware Routing **Plain English:** Not every question needs the most expensive model. Automatically send simple questions to a cheap model and hard questions to a powerful one. **Technical:** Scores each prompt (0–100) based on length, question complexity markers, and keyword signals. Routes to the first `RoutingTier` whose `max_score >= score`. Adapters are pooled for O(1) model switching. ```python from ractogateway.routing import CostAwareRouter, RoutingTier from ractogateway import openai_developer_kit as gpt router = CostAwareRouter([ RoutingTier( model="gpt-4o-mini", max_score=30, # Plain: "Use this cheap model for easy questions (score 0–30)" # Technical: First tier. model= is the ID passed to the adapter. # max_score= is the upper bound of the score range this tier handles. ), RoutingTier( model="gpt-4o", max_score=70, # Plain: "Use this mid-tier model for moderate questions (score 31–70)" ), RoutingTier( model="o3-mini", max_score=100, # Plain: "Use this powerful (expensive) model for hard questions (score 71–100)" # Technical: Final tier; also the fallback if no earlier tier matches. 
), ]) kit = gpt.OpenAIDeveloperKit( model="auto", # <-- REQUIRED when using a router router=router, ) # "2+2" → very low complexity score → routed to gpt-4o-mini (cheapest) r1 = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?")) print(r1.content) # 4 print(r1.raw.model) # gpt-4o-mini (model name lives in the raw provider object) # Complex reasoning → high score → routed to o3-mini r2 = kit.chat(gpt.ChatConfig( user_message=( "Explain the mathematical proof of Gödel's incompleteness theorem " "and its implications for formal systems and computability theory." ) )) print(r2.raw.model) # o3-mini ``` ### Combining All Middleware ```python from ractogateway import openai_developer_kit as gpt from ractogateway.cache import ExactMatchCache, SemanticCache from ractogateway.routing import CostAwareRouter, RoutingTier from ractogateway.truncation import TokenTruncator, TruncationConfig kit = gpt.OpenAIDeveloperKit( model="auto", router=CostAwareRouter([ RoutingTier(model="gpt-4o-mini", max_score=30), RoutingTier(model="gpt-4o", max_score=100), ]), exact_cache=ExactMatchCache(max_size=2048, ttl_seconds=7200), semantic_cache=SemanticCache(embedder=embed, similarity_threshold=0.90), truncator=TokenTruncator(TruncationConfig(keep_last_n=10, safety_margin=1024)), ) # Each request flows: exact cache → semantic cache → route → truncate → API call ``` --- ## 16. All Five Developer Kits All five kits share identical method signatures: `chat()`, `achat()`, `stream()`, `astream()`, `embed()`, `aembed()`. Swap the import alias and kit name — everything else stays the same. | Kit | Alias | Env var | Offline? 
| | --- | --- | --- | --- | | `OpenAIDeveloperKit` | `gpt` | `OPENAI_API_KEY` | No | | `GoogleDeveloperKit` | `gemini` | `GOOGLE_API_KEY` | No | | `AnthropicDeveloperKit` | `claude` | `ANTHROPIC_API_KEY` | No | | `OllamaDeveloperKit` | `local` | — | **Yes** | | `HuggingFaceDeveloperKit` | `hf` | `HF_TOKEN` (optional) | Optional | ### 16.1 OpenAIDeveloperKit (GPT) The primary examples throughout this guide use `OpenAIDeveloperKit`. A quick recap: ```python from ractogateway import openai_developer_kit as gpt, RactoPrompt prompt = RactoPrompt( role="You are a helpful assistant.", aim="Answer the user clearly.", constraints=["Be concise."], tone="Friendly", output_format="text", ) kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt) response = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?")) print(response.content) # "4" ``` Install: `pip install ractogateway[openai]` · Key env var: `OPENAI_API_KEY` --- ### 16.2 GoogleDeveloperKit (Gemini) ```python from ractogateway import google_developer_kit as gemini from ractogateway.prompts.engine import RactoPrompt kit = gemini.GoogleDeveloperKit( model="gemini-2.0-flash", # or "gemini-2.0-pro" api_key="AIza...", # or set GOOGLE_API_KEY env var ) prompt = RactoPrompt( role="You are a creative writing assistant.", aim="Write a haiku about the given subject.", constraints=["Must follow 5-7-5 syllable structure."], tone="Poetic and thoughtful.", output_format="text", ) response = kit.chat(gemini.ChatConfig( user_message="Write a haiku about rain.", prompt=prompt, )) print(response.content) # Silver drops descend — # Earth drinks its ancient thirst deep. # Mud sings after rain. 
``` ### 16.3 AnthropicDeveloperKit (Claude) ```python from ractogateway import anthropic_developer_kit as claude from ractogateway.prompts.engine import RactoPrompt kit = claude.AnthropicDeveloperKit( model="claude-sonnet-4-6", # or "claude-opus-4-6", "claude-haiku-4-5-20251001" api_key="sk-ant-...", # or set ANTHROPIC_API_KEY env var ) prompt = RactoPrompt( role="You are an expert code reviewer.", aim="Review the code snippet and identify any bugs or improvements.", constraints=[ "Be specific — cite line numbers.", "Prioritise correctness over style.", ], tone="Technical and direct.", output_format="markdown", ) response = kit.chat(claude.ChatConfig( user_message="def divide(a, b): return a / b", prompt=prompt, )) print(response.content) ``` Install: `pip install ractogateway[anthropic]` · Key env var: `ANTHROPIC_API_KEY` > **Note:** Anthropic does not provide a native embeddings API. > Call `embed()` / `aembed()` via `OpenAIDeveloperKit` or `GoogleDeveloperKit` > instead when you need vectors alongside Claude chat. --- ### 16.4 OllamaDeveloperKit (Local / Offline) Run any open-source model on your own hardware — no API key, no data leaving your machine. **Prerequisites:** ```bash # 1. Install Ollama → https://ollama.com/download # 2. Pull a model ollama pull llama3.2 # 2 GB general-purpose ollama pull nomic-embed-text # 274 MB embeddings model # 3. 
Install the Python extra pip install ractogateway[ollama] ``` ```python from ractogateway import ollama_developer_kit as local, RactoPrompt prompt = RactoPrompt( role="You are a helpful assistant.", aim="Answer questions concisely.", constraints=["Do not hallucinate."], tone="Friendly", output_format="text", ) # Ollama listens at http://localhost:11434 by default — no key needed kit = local.Chat(model="llama3.2", default_prompt=prompt) response = kit.chat(local.ChatConfig(user_message="What is a neural network?")) print(response.content) ``` **Streaming:** ```python for chunk in kit.stream(local.ChatConfig(user_message="Tell me a joke.")): print(chunk.delta.text, end="", flush=True) ``` **Embeddings** (requires a dedicated embedding model): ```python resp = kit.embed(local.EmbeddingConfig(texts=["hello", "world"])) print(resp.vectors[0].embedding[:5]) ``` **Embedded server management** — start Ollama programmatically: ```python with local.OllamaServerManager(port=11500) as srv: kit = local.Chat(model="llama3.2", base_url=srv.base_url) print(kit.chat(local.ChatConfig(user_message="Hello!")).content) # server stops automatically ``` See the full guide: {doc}`ollama` --- ### 16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM) Three deployment modes through one interface: | Mode | When to use | | --- | --- | | HF Inference API (cloud) | Quick prototyping; set `HF_TOKEN` | | Local TGI | Self-hosted Text Generation Inference | | Local vLLM / Llama.cpp | Any OpenAI-compatible HTTP server | ```bash pip install ractogateway[huggingface] export HF_TOKEN="hf_..." 
# obtain at https://huggingface.co/settings/tokens ``` **Cloud inference:** ```python from ractogateway import huggingface_developer_kit as hf, RactoPrompt prompt = RactoPrompt( role="You are a helpful assistant.", aim="Answer the user clearly.", constraints=["Stay on topic."], tone="Friendly", output_format="text", ) kit = hf.Chat( model="meta-llama/Llama-3.2-3B-Instruct", default_prompt=prompt, ) response = kit.chat(hf.ChatConfig(user_message="Explain transformers briefly.")) print(response.content) ``` **Local TGI server** (no API key): ```python kit = hf.Chat( model="tgi", base_url="http://localhost:8080", default_prompt=prompt, ) ``` **Embeddings:** ```python resp = kit.embed( hf.EmbeddingConfig(texts=["hello world", "goodbye world"]) ) print(f"dim={len(resp.vectors[0].embedding)}") ``` See the full guide: {doc}`huggingface` --- ## 17. RAG — Retrieval-Augmented Generation **Plain English:** RAG lets the AI answer questions about your own documents. You feed it your files, it converts them into searchable number vectors, and when someone asks a question, it finds the relevant parts and feeds them to the AI. **Technical:** Full pipeline: `FileReaderRegistry` → chunker → `ProcessingPipeline` → embedder → vector store → similarity search → `RactoPrompt` context injection. ### Complete RAG Pipeline Example ```python from ractogateway.rag import RactoRAG from ractogateway.rag.embedders import OpenAIEmbedder from ractogateway.rag.stores import InMemoryVectorStore from ractogateway.rag.chunkers import RecursiveChunker from ractogateway import openai_developer_kit as gpt from ractogateway.prompts.engine import RactoPrompt # 1. Build the RAG pipeline rag = RactoRAG( embedder=OpenAIEmbedder(api_key="sk-..."), store=InMemoryVectorStore(), # swap for ChromaStore, FAISSStore, etc. in production chunker=RecursiveChunker(chunk_size=512, overlap=64), ) # 2. 
Ingest your documents rag.add_documents([ "/path/to/product_manual.pdf", "/path/to/faq.docx", "/path/to/release_notes.txt", ]) # 3. At query time, retrieve relevant chunks results = rag.retrieve("How do I reset my password?", top_k=3) # 4. Inject retrieved context into a RactoPrompt context = "\n\n".join(r.chunk.text for r in results) prompt = RactoPrompt( role="You are a product support assistant.", aim="Answer the user's question based strictly on the provided documentation.", constraints=["Only use information from the CONTEXT section.", "Quote the source if possible."], tone="Helpful and precise.", output_format="text", context=context, # <-- the retrieved chunks go here ) # 5. Ask the AI kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt) response = kit.chat(gpt.ChatConfig(user_message="How do I reset my password?")) print(response.content) ``` ### Chunkers Explained | Chunker | Plain English | Best For | |---|---|---| | `FixedChunker` | Split every N characters, no mercy | Quick prototyping, structured data | | `RecursiveChunker` | Split at sentence/paragraph boundaries, then fall back to characters | General documents (best default) | | `SentenceChunker` | Always split at sentence boundaries | Articles, legal text, Q&A content | | `SemanticChunker` | Group sentences that are about the same topic | Complex documents with topic shifts | ### Vector Stores Explained | Store | Plain English | When to Use | |---|---|---| | `InMemoryVectorStore` | Fast in-RAM store; lost on restart | Development, prototyping, tests | | `ChromaStore` | Local persistent store | Single-server apps, local dev | | `FAISSStore` | Facebook's ultra-fast similarity search | Millions of vectors, CPU-only | | `PineconeStore` | Fully managed cloud vector DB | Production, no infra to manage | | `QdrantStore` | Open-source, filterable, scalable | Production with metadata filtering | | `WeaviateStore` | Open-source with built-in ML | Multi-modal + graph features | | 
`MilvusStore` | Distributed vector DB | Billions of vectors at scale | | `PGVectorStore` | PostgreSQL extension | Already using Postgres | --- ## 18. Redis — Production Infrastructure Redis tools make your app production-ready: distributed cache, per-user rate limiting, and persistent chat memory that survives deployments. ```bash pip install "ractogateway[redis]" ``` ### 18.1 Distributed Exact Cache Drop-in replacement for `ExactMatchCache` that works across multiple server replicas. ```python from ractogateway.redis import RedisExactCache from ractogateway import openai_developer_kit as gpt cache = RedisExactCache( url="redis://localhost:6379/0", # Plain: "Where is your Redis server?" # Technical: Redis connection URL. Alternatively pass client= with a pre-built # redis.Redis instance. ttl_seconds=3600, # Plain: "Forget cached answers after 1 hour" # Technical: TTL applied via Redis EXPIRE on each key write. ) kit = gpt.OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache) # Now all your servers share the same cache! ``` ### 18.2 Rate Limiter Prevent users from making too many expensive requests. ```python from ractogateway.redis import RedisRateLimiter, RateLimitConfig limiter = RedisRateLimiter( url="redis://localhost:6379/0", config=RateLimitConfig( max_tokens_per_minute=5_000, # Plain: "Each user can use at most 5,000 tokens per minute" # Technical: Sliding 1-minute window. Counter stored as Redis sorted set per user_id. 
        key_prefix="rl:",
        # Plain: "A label to group all rate limit keys in Redis"
        # Technical: String prefix for Redis keys: "{key_prefix}{user_id}"
    ),
)

# In your request handler:
user_id = "user-42"
estimated_tokens = 200

if not limiter.check_and_consume(user_id, tokens=estimated_tokens):
    raise RuntimeError("Rate limit exceeded — please try again in a minute.")

remaining = limiter.get_remaining(user_id)
print(f"Tokens remaining this minute: {remaining}")
# Tokens remaining this minute: 4800
```

### 18.3 Chat Memory

Store conversation history in Redis so it survives server restarts and scales across replicas.

```python
from ractogateway.redis import RedisChatMemory, ChatMemoryConfig
from ractogateway._models.chat import Message, MessageRole

memory = RedisChatMemory(
    url="redis://localhost:6379/0",
    config=ChatMemoryConfig(
        max_turns=20,
        # Plain: "Remember the last 20 turns (up to 40 messages) per conversation"
        # Technical: Redis List capped to 2*max_turns entries (each turn = 2 messages).
        #            Older messages are popped from the front automatically.

        ttl_seconds=1800,
        # Plain: "Forget the conversation after 30 minutes of inactivity"
        # Technical: TTL reset on every append() call.
key_prefix="chat:", # Plain: "Label all conversation keys in Redis" # Technical: Redis keys = "{key_prefix}{conv_id}" ), ) # When a user sends a message: conv_id = "session-abc123" memory.append(conv_id, "user", "What's the best way to learn Python?") # After getting the AI response: memory.append(conv_id, "assistant", "Start with the official tutorial, then build projects!") # Reconstruct history for the next request: history_dicts = memory.get_history(conv_id) # [{"role": "user", "content": "What's the best way..."}, {"role": "assistant", "content": "..."}] history = [Message(role=m["role"], content=m["content"]) for m in history_dicts] # Pass to ChatConfig: response = kit.chat(gpt.ChatConfig( user_message="What resources do you recommend?", history=history, )) # Wipe the conversation when the session ends: memory.clear(conv_id) print(memory.count(conv_id)) # 0 ``` --- ## 19. Common Mistakes & How to Fix Them ### Mistake 1: Using `output` instead of `output_format` in RactoPrompt ```python # WRONG — this will raise a Pydantic ValidationError prompt = RactoPrompt( role="...", aim="...", constraints=["..."], tone="...", output="text", # ❌ field is called output_format, not output! 
) # CORRECT prompt = RactoPrompt( role="...", aim="...", constraints=["..."], tone="...", output_format="text", # ✅ ) ``` ### Mistake 2: Forgetting at least one constraint ```python # WRONG — constraints cannot be an empty list prompt = RactoPrompt( role="...", aim="...", constraints=[], # ❌ ValidationError: min_length=1 tone="...", output_format="text", ) # CORRECT prompt = RactoPrompt( role="...", aim="...", constraints=["Be helpful."], # ✅ at least one constraint required tone="...", output_format="text", ) ``` ### Mistake 3: Using `model="auto"` without a router ```python # WRONG — raises ValueError immediately kit = gpt.OpenAIDeveloperKit(model="auto") # ❌ # CORRECT kit = gpt.OpenAIDeveloperKit( model="auto", router=CostAwareRouter([...]), # ✅ ) ``` ### Mistake 4: Neither ChatConfig.prompt nor kit.default_prompt is set ```python # WRONG — raises ValueError when chat() is called kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini") # no default_prompt response = kit.chat(gpt.ChatConfig(user_message="Hello")) # ❌ # FIX OPTION 1: Set default_prompt on the kit kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=my_prompt) # FIX OPTION 2: Pass prompt in ChatConfig response = kit.chat(gpt.ChatConfig(user_message="Hello", prompt=my_prompt)) ``` ### Mistake 5: Expecting typed validation but not setting it explicitly ```python # BEST PRACTICE — set response_model explicitly prompt = RactoPrompt(..., output_format=WeatherReport) config = gpt.ChatConfig( user_message="...", response_model=WeatherReport, # ✅ explicit validation contract ) # ALSO SUPPORTED — inferred automatically from output_format model prompt = RactoPrompt(..., output_format=WeatherReport) config = gpt.ChatConfig(user_message="...") # ✅ inferred from prompt.output_format ``` ### Mistake 6: Missing `await` on async methods ```python # WRONG — this returns a coroutine object, not a response response = kit.achat(config) # ❌ # CORRECT response = await kit.achat(config) # ✅ (inside an async 
function) ``` ### Mistake 7: Not installing the provider extra ```python # WRONG — if you only ran pip install ractogateway from ractogateway import openai_developer_kit as gpt kit = gpt.OpenAIDeveloperKit(model="gpt-4o") kit.chat(...) # ❌ ImportError: The 'openai' package is required # FIX # pip install "ractogateway[openai]" ``` ### Mistake 8: Not handling `ResponseModelValidationError` When `response_model` is set, validation failures now raise `ResponseModelValidationError` after all retries are exhausted — they no longer silently append a warning string to `response.content`. ```python # WRONG — this will now raise, not return a response with garbled content response = kit.chat(config) # ❌ unhandled ResponseModelValidationError # CORRECT — wrap in try/except to handle gracefully from ractogateway.exceptions import ResponseModelValidationError try: response = kit.chat(config) report = MyModel(**response.parsed) except ResponseModelValidationError as e: # Inspect what happened and decide how to recover print(f"Validation failed after {e.attempts} attempt(s): {e.last_error}") # e.raw_response holds the last raw JSON string from the LLM ``` > **Tip:** The default `max_validation_retries=2` means the kit will > automatically retry twice before raising — most transient issues resolve > in the first retry. Set `max_validation_retries=0` to disable retries and > fail fast. --- ## 19. Telemetry & Observability RactoGateway ships production-grade observability with **zero changes** to existing call sites. Attach a `RactoTracer` and/or `GatewayMetricsMiddleware` to any kit and every LLM call is automatically instrumented. 
### Installation ```bash pip install "ractogateway[observability]" # OTEL tracing + Prometheus metrics pip install "ractogateway[telemetry]" # OTEL tracing only pip install "ractogateway[prometheus]" # Prometheus metrics only ``` ### Quick start ```python from ractogateway import openai_developer_kit as opd from ractogateway.telemetry import RactoTracer, GatewayMetricsMiddleware, PrometheusExporter tracer = RactoTracer(otlp_endpoint="http://localhost:4317", console=True) metrics = GatewayMetricsMiddleware() PrometheusExporter(port=8000).start() # scrape http://localhost:8000/metrics kit = opd.OpenAIDeveloperKit( model="gpt-4o", default_prompt=prompt, tracer=tracer, metrics=metrics, ) response = kit.chat(opd.ChatConfig(user_message="Hello!")) # One OTEL span emitted, one Prometheus data-point recorded. ``` The same `tracer=` / `metrics=` parameters work on **GoogleDeveloperKit** and **AnthropicDeveloperKit**. ### What is recorded automatically | Event | Tracer span | Prometheus metrics | |---|---|---| | Successful chat/stream | `llm.chat` with latency, tokens, cost | `requests_total`, `duration_seconds`, `tokens_total`, `cost_usd_total` | | Cache hit (exact/semantic) | `llm.chat` with `cache_hit="exact"/"semantic"`, 0 tokens | `cache_hits_total` | | Cache miss | — | `cache_misses_total` | | Tool call | `tool_calls` attribute on span | `tool_calls_total{tool_name}` | | Error | `status="error"`, `error_type=ExcName` | `requests_total{status="error"}` | | Embedding | `llm.embed` | `requests_total{operation="embed"}` | ### OTEL export backends ```python # Jaeger / Grafana Tempo (gRPC) RactoTracer(otlp_endpoint="http://jaeger:4317") # Zipkin / Tempo (HTTP) RactoTracer(otlp_http_endpoint="http://tempo:4318") # In-memory capture for unit tests — no external backend needed tracer = RactoTracer(in_memory=True) kit.chat(...) 
assert tracer.spans[0].provider == "openai" tracer.clear_spans() ``` ### Custom pricing ```python from ractogateway.telemetry import ModelPricing, RactoTracer custom = {"my-ft-gpt4": ModelPricing(input_per_million=5.0, output_per_million=15.0)} tracer = RactoTracer(otlp_endpoint="...", price_table=custom) ``` ### Grafana dashboard Import `dashboards/grafana_dashboard.json` into Grafana to get 20+ pre-built panels covering latency percentiles (p50/p95/p99), token rate, cost rate, cache hit/miss ratio, error rate, tool call distribution, and a per-model summary table. Full reference: [Telemetry guide](telemetry.md) | [API reference](../api/telemetry.md) --- ## 20. Prebuilt Pipelines — Production Workflows RactoGateway includes prebuilt pipelines for common end-to-end tasks where a single `chat()` call is not enough. ### Available pipelines | Pipeline | Classes | Use case | |---|---|---| | SQL Analyst | `SQLAnalystPipeline`, `AsyncSQLAnalystPipeline` | Natural language analytics over SQL databases | | List Classifier | `ListClassifierPipeline`, `AsyncListClassifierPipeline` | Map user text to one or more options from a list | | Video Processor | `VideoProcessorPipeline`, `AsyncVideoProcessorPipeline` | Extract frames, transcribe audio, analyse with vision LLM, summarise | | Agent | `AgentPipeline`, `AsyncAgentPipeline` | Autonomous ReAct agent — reason + call tools + observe → answer | ### Install extras ```bash # SQL Analyst pip install ractogateway[pipelines-sql] # core (no charts) pip install ractogateway[pipelines-sql-viz] # + Plotly charts # Video Processor pip install ractogateway[pipelines-video] # OpenCV + ffmpeg + pHash pip install ractogateway[pipelines-video-whisper] # + faster-whisper (local ASR) pip install ractogateway[pipelines-video-yt] # + yt-dlp (YouTube download) # Agent pip install ractogateway[pipelines-agent] # core (no extra deps) pip install ractogateway[pipelines-agent-http] # + httpx (http_get tool) ``` ### SQL Analyst — quick example 
```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import SQLAnalystPipeline

sql_pipeline = SQLAnalystPipeline(kit=gpt.Chat(model="gpt-4o"))

result = sql_pipeline.run(
    user_query="Top 5 products by revenue",
    connection_string="postgresql://user:pass@localhost:5432/shop",
)
print(result.answer)
```

### List Classifier — quick example

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import ListClassifierPipeline

classifier = ListClassifierPipeline(
    kit=gpt.Chat(model="gpt-4o-mini"),
    options=["Billing", "Technical Support", "Sales"],
    include_confidence=True,
    include_reasoning=True,
)

result = classifier.run("I cannot update my payment method")
print(result.first)            # "Billing"
print(result.top_confidence)   # e.g. 0.96
```

### Video Processor — quick example

Process a lecture or tutorial video end-to-end: extract key frames, transcribe speech, use a vision LLM to read whiteboards and screens, and produce a structured Markdown report.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import (
    VideoProcessorPipeline,
    TranscriberBackend,
    DeduplicationMethod,
)

pipeline = VideoProcessorPipeline(
    kit=gpt.Chat(model="gpt-4o"),        # vision LLM + summary
    fps=1.0,                             # sample one frame per second
    similarity_threshold=85.0,           # drop frames that are ≥85% similar to the previous one
    dedup_method=DeduplicationMethod.PHASH,
    transcriber=TranscriberBackend.FASTER_WHISPER,
    transcriber_model="base",
    analyze_frames=True,
    generate_summary=True,
    safe_mode=True,
)

# Accepts: local path, HTTP URL, YouTube URL, raw bytes, or pre-extracted frame list
result = pipeline.run("lecture.mp4")

print(f"Frames kept : {result.usage.frames_kept}/{result.usage.frames_extracted}")
print(f"Tokens used : {result.usage.total_tokens}")
print(result.summary)             # structured Markdown summary
result.to_markdown("report.md")   # save full report
```

**What it produces (`VideoProcessorResult`):**

| Field | Type | Description |
|---|---|---|
| `frames` | `list[FrameEntry]` | Every extracted frame with its LLM analysis |
| `transcript` | `list[TranscriptSegment]` | Timed speech-to-text segments |
| `sections` | `list[VideoSection]` | Time windows merging visual + audio content |
| `summary` | `str` | 7-section Markdown summary |
| `usage` | `VideoProcessorUsage` | Token counts + frame statistics |

**Supported transcription backends (`TranscriberBackend`):**

| Backend | Value | Requires |
|---|---|---|
| Faster Whisper (default) | `"faster-whisper"` | `pip install "ractogateway[pipelines-video-whisper]"` |
| OpenAI Whisper (local) | `"openai-whisper"` | `pip install openai-whisper` |
| OpenAI API | `"openai-api"` | OpenAI API key |
| Groq API (ultra-fast) | `"groq-api"` | `pip install groq` + Groq API key |
| Deepgram | `"deepgram-api"` | `pip install deepgram-sdk` + key |
| Google Cloud STT | `"google-api"` | `pip install google-cloud-speech` + key |
| HuggingFace local | `"huggingface-local"` | `pip install transformers torch` |
| HuggingFace API | `"huggingface-api"` | `pip install huggingface_hub` + key |
| Ollama | `"ollama"` | Running Ollama server |

### Agent — quick example

An autonomous **ReAct** (Reason + Act) agent that loops: think → call tool → observe → repeat until it calls the built-in `finish()` tool.

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import AgentPipeline


def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"


def unit_convert(value: float, from_unit: str, to_unit: str) -> str:
    """Convert a value between units."""
    # ... your logic here ...
    return f"{value} {from_unit} = ... {to_unit}"


agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather, unit_convert],
    max_steps=8,
    safe_mode=True,
)

result = agent.run("What is the weather in Paris, and convert 22°C to Fahrenheit?")
print(result.final_answer)
print(result.to_markdown())   # step-by-step trace
```

**Agent result fields (`AgentResult`):**

| Field | Type | Description |
|---|---|---|
| `final_answer` | `str \| None` | The agent's concluded answer |
| `steps` | `list[AgentStep]` | Every thought / tool call / observation |
| `stop_reason` | `StopReason` | `"finish"`, `"max_steps"`, or `"error"` |
| `usage` | `AgentUsage` | Cumulative token counts across all steps |

**Built-in tool factories:**

```python
from ractogateway.pipelines import (
    make_rag_tool,       # rag_search(query) → relevant chunks from RactoRAG
    make_sql_tool,       # sql_query(question) → answer from SQLAnalystPipeline
    make_http_tool,      # http_get(url) → page text (requires httpx)
    make_memory_tools,   # memory_read(key) + memory_write(key, value)
)

# `my_rag` and `my_sql` are pipelines you have already constructed;
# `get_weather` is the function defined in the previous example.
agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather],               # your custom tools
    rag_pipeline=my_rag,               # auto-registers rag_search
    sql_pipeline=my_sql,               # auto-registers sql_query
    agent_memory={},                   # dict → auto-registers memory_read/write
    extra_tools=[make_http_tool()],    # opt-in http_get
)
```

### Full guides

- [Pipelines overview](pipelines.md)
- [SQL Analyst pipeline](pipelines/sql_analyst.md)
- [List Classifier pipeline](pipelines/list_classifier.md)
- [Video Processor pipeline](pipelines/video_processor.md)
- [Agent pipeline](pipelines/agent.md)

---

## 21. Chain of Thought Reasoning

**Chain of Thought (CoT)** prompts the model to reason step-by-step before giving its final answer. RactoGateway exposes this as a single `ChatConfig` flag — no prompt engineering required.
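As a rough mental model, a per-request flag like this can be pictured as copying the prompt and appending a few reasoning constraints to the copy, leaving the original untouched. The sketch below is illustrative only: `SketchPrompt`, `COT_CONSTRAINTS`, and `with_chain_of_thought` are hypothetical names that are not part of RactoGateway, and the library's actual constraint wording and internals may differ.

```python
from dataclasses import dataclass, replace

# Hypothetical wording; the constraint text the library really sends may differ.
COT_CONSTRAINTS = (
    "Break the problem into numbered reasoning steps.",
    "Show your working at each step.",
    "State the final answer clearly at the end.",
)


@dataclass(frozen=True)
class SketchPrompt:
    """Stand-in for a prompt object; this is NOT the real RactoPrompt."""

    role: str
    constraints: tuple[str, ...] = ()


def with_chain_of_thought(prompt: SketchPrompt) -> SketchPrompt:
    """Return a copy with the reasoning constraints appended.

    The input prompt is frozen and is never modified, which mirrors the
    idea of a flag that applies to one request only.
    """
    return replace(prompt, constraints=prompt.constraints + COT_CONSTRAINTS)


base = SketchPrompt(role="Maths tutor", constraints=("Answer in English.",))
per_request = with_chain_of_thought(base)

print(len(base.constraints), len(per_request.constraints))
```

Returning a new object instead of mutating the shared prompt is what keeps one request's flag from leaking into the next.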
### How to enable

```python
from ractogateway import openai_developer_kit as gpt

kit = gpt.Chat(model="gpt-4o")

response = kit.chat(
    gpt.ChatConfig(
        user_message="If a train travels 300 km in 2.5 hours, what is its average speed?",
        chain_of_thought=True,   # ← flip this flag
    )
)
print(response.content)
# The model will reason through the problem before stating "120 km/h"
```

### What it does internally

Setting `chain_of_thought=True` appends a step-by-step reasoning constraint to the `RactoPrompt` before the request is sent. The constraint instructs the model to:

1. Break the problem into numbered reasoning steps.
2. Show its working at each step.
3. State the final answer clearly at the end.

This is applied *per request* — it does not modify the kit's default prompt permanently.

### When to use CoT

| Scenario | Benefit |
|---|---|
| Math / logic problems | Forces explicit calculation steps → fewer errors |
| Multi-step planning | Surfaces assumptions and intermediate decisions |
| Debugging assistance | Produces a traceable reasoning chain |
| Exam / quiz apps | Provides explanation alongside the answer |

### Combining with structured output

```python
from pydantic import BaseModel


class ReasonedAnswer(BaseModel):
    steps: list[str]
    final_answer: str


response = kit.chat(
    gpt.ChatConfig(
        user_message="How many seconds are in a leap year?",
        chain_of_thought=True,
        response_model=ReasonedAnswer,   # parse result into Pydantic model
    )
)
print(response.parsed.steps)
print(response.parsed.final_answer)
```

---

## 22. Native Thinking / Extended Reasoning

**Native Thinking** exposes the model's *internal* chain-of-thought reasoning tokens — the model genuinely thinks before answering rather than being instructed to write steps. Supported by **Anthropic Claude** (extended thinking) and **Google Gemini** (thinking mode). OpenAI o-series models expose reasoning token *counts* but not the text.
### Enable native thinking

```python
from ractogateway import anthropic_developer_kit as claude

kit = claude.Chat(model="claude-opus-4-6")

response = kit.chat(
    claude.ChatConfig(
        user_message="Prove that √2 is irrational.",
        native_thinking=True,
        thinking_budget=8000,   # max thinking tokens (Anthropic/Google)
    )
)
print(response.thinking)   # raw model reasoning (may be hundreds of tokens)
print(response.content)    # final polished answer
```

### Streaming with native thinking

```python
for chunk in kit.stream(
    claude.ChatConfig(
        user_message="Design a cache-invalidation strategy for a distributed system.",
        native_thinking=True,
        thinking_budget=10000,
    )
):
    if chunk.is_thinking:
        # Reasoning tokens arrive first, while the model is in a thinking block.
        print(chunk.delta.thinking, end="", flush=True)
    else:
        print(chunk.delta.text, end="", flush=True)
```

### Provider behaviour summary

| Provider | Thinking text visible | Thinking budget param | Notes |
|---|---|---|---|
| Anthropic Claude | ✅ `response.thinking` | `thinking_budget` | Forces `temperature=1` |
| Google Gemini | ✅ `response.thinking` | `thinking_budget` | `ThinkingConfig` injected |
| OpenAI (o-series) | ❌ not exposed | N/A | `reasoning_tokens` count in `usage` |

### `LLMResponse` fields added by native thinking

| Field | Type | Description |
|---|---|---|
| `thinking` | `str \| None` | Raw model reasoning text |
| `StreamDelta.thinking` | `str` | Incremental thinking token (streaming) |
| `StreamChunk.accumulated_thinking` | `str` | Full thinking so far (streaming) |
| `StreamChunk.is_thinking` | `bool` | `True` while in a thinking block |

### When to use native thinking

Use `native_thinking=True` when accuracy matters more than latency:

- Complex proofs, theorem verification
- Code architecture reviews
- Medical / legal / scientific reasoning
- Any task where you want to inspect the model's reasoning, not just the answer

> **Cost note:** thinking tokens count toward your bill but are not included in
> `response.content`. Set `thinking_budget` conservatively; 4000–8000 tokens is
> usually enough.

---

## 23. PageIndexRAG — Vectorless RAG

**PageIndexRAG** is a lightweight RAG pipeline that requires *no embeddings* and *no vector database*. It retrieves relevant pages in two stages: an inverted keyword index selects candidates, and BM25 scoring ranks them. It suits CPU-only environments, offline use, and cases where you want instant setup without configuring a vector store.

### How it works

```text
Document → page split → DecisionIndex (inverted keyword index)
                      → BM25 scorer (Okapi BM25)
                      → top-k pages → LLM
```

1. **Page split** — PDFs are split page-by-page; all other documents use fixed character windows (`page_size=1000`, `page_overlap=100`).
2. **DecisionIndex** — builds an inverted keyword index over all pages for fast candidate retrieval (no embeddings needed).
3. **BM25 scoring** — ranks candidates with Okapi BM25, the same algorithm used by Elasticsearch and Solr.
4. **LLM answer** — top-k pages are passed to the LLM as context.

### Quick example

```python
from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG

kit = gpt.Chat(model="gpt-4o-mini")

# Build the index
rag = PageIndexRAG(kit=kit)
rag.add_document("docs/handbook.pdf")   # PDF — split page-by-page
rag.add_document("docs/faq.txt")        # Plain text — split by char window
rag.add_texts(["RactoGateway supports 5 developer kits.", "..."])

# Query
result = rag.search("What developer kits are supported?")
print(result.answer)          # LLM answer grounded in the retrieved pages
print(result.pages[0].text)   # raw page text that was used as context
```

### No extra install

`PageIndexRAG` ships in the core package — no vector store or embedding model required:

```bash
pip install ractogateway          # PageIndexRAG included by default
pip install "ractogateway[rag]"   # if you also want readers (PDF, Word, Excel…)
```

### Comparison: PageIndexRAG vs. RactoRAG

| Feature | `PageIndexRAG` | `RactoRAG` |
|---|---|---|
| Embeddings needed | ❌ No | ✅ Yes |
| Vector store needed | ❌ No | ✅ Yes (Chroma, FAISS, Pinecone…) |
| Retrieval algorithm | BM25 (keyword) | Cosine similarity (semantic) |
| Best for | Quick setup, keyword-rich docs | Deep semantic search |
| GPU/CPU | Pure CPU | CPU or GPU (embedding model) |
| Offline use | ✅ Fully offline | ⚠️ Depends on embedder |

### When to use PageIndexRAG

- Prototyping a Q&A feature without setting up a vector DB
- Compliance / legal documents where exact keyword match matters
- Offline / air-gapped environments
- Structured documents (manuals, handbooks) where pages map naturally to topics

### Advanced: async + per-call top-k

```python
import asyncio

from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG

kit = gpt.Chat(model="gpt-4o-mini")


async def main():
    # top_k and the page window are set when the index is constructed.
    rag = PageIndexRAG(kit=kit, top_k=5, page_size=800, page_overlap=80)
    rag.add_document("research_paper.pdf")
    result = await rag.asearch("What methodology did the authors use?")
    print(result.answer)


asyncio.run(main())
```

Full reference: [PageIndexRAG API](../api/page_index_rag.md)

---

## Quick Reference Card

```python
# ── Imports ──────────────────────────────────────────────────────────
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway.tools.registry import tool, ToolRegistry
from ractogateway.cache import ExactMatchCache, SemanticCache
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway.truncation import TokenTruncator, TruncationConfig

# ── Build a prompt ───────────────────────────────────────────────────
prompt = RactoPrompt(
    role="...",
    aim="...",
    constraints=["..."],
    tone="...",
    output_format="text",                            # or "json", "markdown", or a Pydantic class
    context="...",                                   # optional background knowledge
    examples=[{"input": "...", "output": "..."}],    # optional few-shot
)

# ── Create the kit ───────────────────────────────────────────────────
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=prompt,
    exact_cache=ExactMatchCache(max_size=512),
)

# ── Sync chat ────────────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(user_message="Hello!"))
print(response.content)

# ── Async chat (inside an async function) ────────────────────────────
response = await kit.achat(gpt.ChatConfig(user_message="Hello!"))

# ── Streaming ────────────────────────────────────────────────────────
for chunk in kit.stream(gpt.ChatConfig(user_message="Tell me a story.")):
    print(chunk.delta.text, end="", flush=True)

# ── Embeddings ───────────────────────────────────────────────────────
from ractogateway._models.embedding import EmbeddingConfig
resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
vec = resp.vectors[0].embedding   # list[float]

# ── Tool calling ─────────────────────────────────────────────────────
@tool
def get_price(product: str) -> float:
    """Get the price of a product."""
    return 9.99

registry = ToolRegistry()
registry.register(get_price)
response = kit.chat(gpt.ChatConfig(
    user_message="How much is a widget?",
    tools=registry,
))

# ── Chain of Thought ─────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(
    user_message="Explain why √2 is irrational.",
    chain_of_thought=True,   # step-by-step reasoning in the answer
))

# ── Native Thinking (Anthropic / Gemini) ─────────────────────────────
from ractogateway import anthropic_developer_kit as claude
claude_kit = claude.Chat(model="claude-opus-4-6")
response = claude_kit.chat(claude.ChatConfig(
    user_message="Design a cache-invalidation strategy.",
    native_thinking=True,
    thinking_budget=8000,    # max internal reasoning tokens
))
print(response.thinking)     # raw reasoning
print(response.content)      # polished answer

# ── PageIndexRAG (no embeddings) ─────────────────────────────────────
from ractogateway.rag.page_index import PageIndexRAG
rag = PageIndexRAG(kit=kit)
rag.add_document("handbook.pdf")
result = rag.search("What developer kits are supported?")
print(result.answer)

# ── Pipelines ────────────────────────────────────────────────────────
from ractogateway.pipelines import (
    SQLAnalystPipeline,
    ListClassifierPipeline,
    VideoProcessorPipeline,
    AgentPipeline,
    TranscriberBackend,
)

# SQL
sql = SQLAnalystPipeline(kit=kit)
sql_result = sql.run("Top 5 products", connection_string="postgresql://...")
print(sql_result.answer)

# Classifier
clf = ListClassifierPipeline(kit=kit, options=["Billing", "Tech Support"])
print(clf.run("I can't log in").first)

# Video
vp = VideoProcessorPipeline(
    kit=kit,
    transcriber=TranscriberBackend.FASTER_WHISPER,
    generate_summary=True,
)
vp_result = vp.run("lecture.mp4")
print(vp_result.summary)

# Agent
def search_web(query: str) -> str:
    """Search the web for information."""
    return f"Results for: {query}"

agent = AgentPipeline(kit=kit, tools=[search_web], max_steps=6)
print(agent.run("What is the capital of France?").final_answer)
```
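For readers curious about the retrieval step in §23, the idea of an inverted keyword index followed by Okapi BM25 ranking can be sketched in a few lines of standard-library Python. This is a teaching sketch under stated assumptions, not RactoGateway's implementation: `tokenize` and `bm25_rank` are hypothetical names, a plain dictionary stands in for the `DecisionIndex`, and `k1=1.5`, `b=0.75` are common textbook defaults that the library may not use.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it on anything that is not alphanumeric."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def bm25_rank(query: str, pages: list[str], k1: float = 1.5, b: float = 0.75,
              top_k: int = 2) -> list[tuple[int, float]]:
    """Return up to top_k (page_index, score) pairs, best first."""
    if not pages:
        return []
    docs = [tokenize(page) for page in pages]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Stage 1: inverted index, term -> {page_index: term frequency}.
    index: dict[str, dict[int, int]] = {}
    for i, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            index.setdefault(term, {})[i] = tf

    # Stage 2: score only pages that contain at least one query term.
    scores: Counter = Counter()
    for term in set(tokenize(query)):
        postings = index.get(term)
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for i, tf in postings.items():
            length_norm = k1 * (1 - b + b * len(docs[i]) / avg_len)
            scores[i] += idf * tf * (k1 + 1) / (tf + length_norm)
    return scores.most_common(top_k)


pages = [
    "Refunds are processed within five days.",
    "The gateway supports five developer kits.",
    "Developer kits wrap provider SDKs; kits share one interface.",
]
ranked = bm25_rank("developer kits supported", pages)
for page_index, score in ranked:
    print(page_index, round(score, 3))
```

Note that the query word "supported" matches nothing here because the pages contain "supports": keyword retrieval is exact unless stemming is added, which is the trade-off against semantic search noted in the comparison table.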