RactoGateway — Complete User Guide
Who this guide is for: complete beginners who have never used an LLM library before, as well as experienced developers who want a deep-dive reference. Every parameter is explained in plain English and in technical terms, with working code examples and expected output.
Table of Contents
Getting Structured / Typed Output
9.1 Complex Nested Structured Output
9.2 Validation Retries and ResponseModelValidationError
Performance & Cost Optimisation
15.1 Exact Match Cache
15.2 Semantic Cache
15.3 Token Truncation
15.4 Cost-Aware Routing
16.1 OpenAIDeveloperKit (GPT)
16.2 GoogleDeveloperKit (Gemini)
16.3 AnthropicDeveloperKit (Claude)
16.4 OllamaDeveloperKit (Local / Offline)
16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM)
Prebuilt Pipelines — Production Workflows
SQL Analyst, List Classifier, Video Processor, Agent
1. Jargon Buster
Before diving into code, here are the key terms you will encounter. Skip to §2 if you already know these.
| Term | Plain-English Meaning | Technical Definition |
|---|---|---|
| LLM | A very powerful autocomplete that understands meaning | Large Language Model: a neural network trained on vast text corpora to predict/generate natural language |
| Prompt | What you say to the AI | The input text (plus optional instructions) sent to an LLM |
| Completion / Response | What the AI says back | The LLM’s generated output tokens |
| Token | Roughly one word (sometimes less) | The smallest unit an LLM processes; ~4 chars for English |
| System Prompt | The AI’s job description | An instruction block sent before the conversation; sets behaviour and constraints |
| Temperature | How creative vs. predictable the AI is | Float 0–2. 0 = near-deterministic (almost the same output every time). Higher = more random/creative |
| Streaming | Getting the answer word-by-word in real time | Server-sent events where each token is pushed to the client as it is generated |
| Embedding | Converting text into a list of numbers | A dense vector representation where semantically similar texts are numerically close |
| RAG | Letting the AI “look things up” before answering | Retrieval-Augmented Generation: retrieve relevant chunks from a knowledge base and inject them into the prompt |
| Tool Calling | The AI can trigger your Python functions | Function-calling protocol where the LLM emits a structured intent and the client executes a real function |
| Pydantic Model | A Python class that validates data automatically | A `BaseModel` subclass |
| Cache | Store an answer so you don’t ask the AI twice | In-memory or distributed key-value store keyed on request fingerprint |
| Context Window | The AI’s short-term memory | Maximum number of tokens the model can process in one request |
| Adapter | The translator between our library and the AI provider | A thin class that converts our internal format to the OpenAI / Google / Anthropic API wire format |
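The Embedding and Token rows can be made concrete with a small standard-library sketch. The three-dimensional vectors below are made-up toys (real embeddings have hundreds or thousands of dimensions), and `estimate_tokens` is a hypothetical helper that applies the ~4 characters per token rule of thumb from the table. Neither function is part of RactoGateway.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate (hypothetical helper); real tokenizers will differ."""
    return max(1, math.ceil(len(text) / chars_per_token))

# Toy "embeddings" with illustrative values only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related words point in a similar direction, unrelated ones do not.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
print(estimate_tokens("What is a Python list?"))  # 6 (22 characters / 4, rounded up)
```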
2. What is RactoGateway?
Plain English: RactoGateway is a Python library that lets you talk to different AI models (OpenAI, Google, Anthropic) using the same code. You don’t need to learn three different APIs. You write your prompts using a structured template (the RACTO principle), and the library takes care of formatting, caching, routing, and more.
Technical: RactoGateway is a provider-agnostic LLM orchestration SDK built on Pydantic. It provides:
- A unified `RactoPrompt` structured prompt compiler (the RACTO principle)
- Provider-specific developer kits (`OpenAIDeveloperKit`, `GoogleDeveloperKit`, `AnthropicDeveloperKit`)
- Sync and async parity on every method
- Optional middleware: exact-match cache, semantic cache, cost-aware router, token truncator
- Tool calling, file attachments, streaming, embeddings, RAG, fine-tuning, and production infra (Redis, Celery, Kafka)
Why does this exist? Without RactoGateway, switching from OpenAI to Anthropic means rewriting all your code. With RactoGateway, you swap one class name.
3. Installation
# Minimum — no LLM provider yet
pip install ractogateway
# OpenAI (GPT models)
pip install "ractogateway[openai]"
# Google (Gemini models)
pip install "ractogateway[google]"
# Anthropic (Claude models)
pip install "ractogateway[anthropic]"
# All three providers at once
pip install "ractogateway[all]"
# RAG (document reading, chunking, embedding, stores)
pip install "ractogateway[rag-all]"
# Redis (distributed cache, rate limiting, chat memory)
pip install "ractogateway[redis]"
Requires Python 3.10 or later.
4. Core Mental Model
Think of RactoGateway in three layers:
┌─────────────────────────────────────────────────────┐
│ YOUR CODE │
│ RactoPrompt → ChatConfig → kit.chat() │
├─────────────────────────────────────────────────────┤
│ DEVELOPER KIT (OpenAIDeveloperKit, etc.) │
│ middleware: cache → route → truncate → API call │
├─────────────────────────────────────────────────────┤
│ ADAPTER (OpenAILLMKit, GoogleLLMKit, etc.) │
│ Translates our format → provider wire format │
├─────────────────────────────────────────────────────┤
│ PROVIDER API (OpenAI, Google, Anthropic) │
└─────────────────────────────────────────────────────┘
You only ever touch the top layer. The kit and adapter layers are managed for you.
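To make the layering less abstract, here is a toy sketch of the middleware idea: check an exact-match cache, trim the message list, then hand the request to a provider function. `ToyKit`, `fingerprint`, and `fake_api` are hypothetical names invented for this illustration, routing is left out for brevity, and none of this is RactoGateway's actual implementation.

```python
import hashlib
import json
from typing import Callable

def fingerprint(model: str, messages: list[dict]) -> str:
    """Stable hash of a request; identical requests map to the same key."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ToyKit:
    """Hypothetical stand-in for a developer kit: cache -> truncate -> API call."""

    def __init__(self, model: str, api_call: Callable[[str, list[dict]], str],
                 max_messages: int = 4):
        self.model = model
        self.api_call = api_call
        self.max_messages = max_messages
        self.cache: dict[str, str] = {}

    def chat(self, messages: list[dict]) -> str:
        key = fingerprint(self.model, messages)
        if key in self.cache:                         # 1. exact-match cache hit: no API call
            return self.cache[key]
        trimmed = messages[-self.max_messages:]       # 2. crude truncation: keep the newest messages
        answer = self.api_call(self.model, trimmed)   # 3. the "adapter" talks to the provider
        self.cache[key] = answer
        return answer

calls = []

def fake_api(model: str, messages: list[dict]) -> str:
    """Fake provider that records each call instead of using the network."""
    calls.append(messages)
    return f"{model} saw {len(messages)} message(s)"

kit = ToyKit("toy-model", fake_api)
question = [{"role": "user", "content": "Hello"}]
print(kit.chat(question))  # toy-model saw 1 message(s)
print(kit.chat(question))  # same answer, served from the cache
print(len(calls))          # 1
```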
5. RactoPrompt
RactoPrompt is how you write instructions for the AI. It enforces the RACTO principle, a structured format designed to reduce hallucinations and ambiguous outputs.
RACTO stands for:
| Letter | Field | Plain English | Technical |
|---|---|---|---|
| R | `role` | Who is the AI? | System identity; primes the model’s behaviour via persona specification |
| A | `aim` | What should it do? | Objective statement; the task the model must complete |
| C | `constraints` | What must it never do? | Hard invariants; rule set injected into `[CONSTRAINTS]` |
| T | `tone` | How should it talk? | Communication register; affects lexical and stylistic choices |
| O | `output_format` | What shape should the answer be in? | Output schema; can be a keyword, a string, or a Pydantic model class |
Plus two optional helpers: context (background knowledge) and examples (few-shot examples).
5.1 Minimal Example
from ractogateway.prompts.engine import RactoPrompt
prompt = RactoPrompt(
role="You are a helpful customer-support agent for a software company.",
aim="Answer the user's question about our product.",
constraints=[
"Never make up features that don't exist.",
"If you don't know the answer, say so.",
],
tone="Friendly and concise.",
output_format="text",
)
# See what the compiled system prompt looks like:
print(prompt.compile())
Expected output:
[ROLE]
You are a helpful customer-support agent for a software company.
[AIM]
Answer the user's question about our product.
[CONSTRAINTS]
- Never make up features that don't exist.
- If you don't know the answer, say so.
[TONE]
Friendly and concise.
[OUTPUT]
Respond in plain text with no special formatting.
[GUARDRAILS]
- If you are unsure or lack sufficient information, state it explicitly rather than guessing.
- Do NOT fabricate facts, citations, URLs, statistics, or code that you cannot verify.
- Stick strictly to what is asked. Do not add unrequested information.
- If the answer requires assumptions, list each assumption explicitly before proceeding.
Notice the `[GUARDRAILS]` section at the bottom. This is auto-generated by `anti_hallucination=True` (the default). It tells the model to be honest about uncertainty. You can disable it with `anti_hallucination=False` if you need maximum creative freedom.
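For intuition about the compiled output above, the sketch below joins labelled sections into the same bracketed layout. `compile_sections` is a hypothetical function written for this guide; the real `compile()` also handles the output format, context, examples, and guardrails, and its internals may differ.

```python
def compile_sections(role: str, aim: str, constraints: list[str], tone: str) -> str:
    """Join labelled sections in a bracketed layout like the one shown above."""
    blocks = [
        ("ROLE", role),
        ("AIM", aim),
        ("CONSTRAINTS", "\n".join(f"- {c}" for c in constraints)),
        ("TONE", tone),
    ]
    return "\n".join(f"[{label}]\n{body}" for label, body in blocks)

compiled = compile_sections(
    role="You are a helpful Python tutor.",
    aim="Explain the concept the user asks about.",
    constraints=["Use beginner-friendly language.", "Keep it short."],
    tone="Warm and clear.",
)
print(compiled)
```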
5.2 Full Parameter Reference
from pydantic import BaseModel
class Summary(BaseModel):
    headline: str
    bullet_points: list[str]
    confidence_score: float  # 0.0 to 1.0
prompt = RactoPrompt(
# ── REQUIRED ──────────────────────────────────────────────────────
role="You are a senior financial analyst.",
# Plain: "Tell the AI who it is"
# Technical: Persona string prepended to the [ROLE] block; primes
# the model's prior distribution toward domain-specific vocabulary
aim="Summarise the provided earnings report into key takeaways.",
# Plain: "Tell the AI what job it has to do"
# Technical: Task objective injected into [AIM]; should be one clear imperative sentence
constraints=[
"Only use numbers that appear in the report — never invent figures.",
"Keep bullet points to at most 15 words each.",
"Do not provide investment advice.",
],
# Plain: "Red lines the AI must never cross"
# Technical: List[str]; each item becomes a bullet in [CONSTRAINTS].
# Minimum one constraint required.
tone="Professional, concise, and factual.",
# Plain: "How the AI should sound"
# Technical: Register specification injected into [TONE]; affects temperature
# interaction and lexical formality
output_format=Summary,
# Plain: "Exactly what shape should the answer be in?"
# Technical: Union[str, type[BaseModel]].
# - "text" → plain text
# - "json" → raw JSON object
# - "markdown" → markdown-formatted response
# - A Pydantic model class → the full JSON Schema is embedded in the prompt;
# the LLM must return JSON that validates against it.
# ── OPTIONAL ──────────────────────────────────────────────────────
context="Q3 2025 earnings call. Revenue: $4.2B (+12% YoY). EPS: $1.87.",
# Plain: "Background knowledge the AI needs to do its job"
# Technical: Domain-specific text injected between [AIM] and [CONSTRAINTS].
# Ideal for passing documents, retrieved chunks, or facts.
examples=[
{
"input": "Revenue grew 5% but EPS fell 10%.",
"output": '{"headline": "Mixed signals: top-line growth masked by margin compression", ...}'
},
],
# Plain: "Show the AI what a good answer looks like"
# Technical: Few-shot exemplars injected into [EXAMPLES] block; each dict
# must contain exactly "input" and "output" keys.
anti_hallucination=True,
# Plain: "Should the AI be told to say 'I don't know' instead of guessing?"
# Technical: Boolean flag. When True, appends [GUARDRAILS] block with
# explicit uncertainty-disclosure directives. Default: True.
)
6. Developer Kits
A Developer Kit is your interface to a specific LLM provider.
All five kits (OpenAIDeveloperKit, GoogleDeveloperKit,
AnthropicDeveloperKit, OllamaDeveloperKit, HuggingFaceDeveloperKit)
share the same six method names.
OpenAIDeveloperKit — Full Parameter Reference
from ractogateway import openai_developer_kit as gpt
kit = gpt.OpenAIDeveloperKit(
model="gpt-4o",
# Plain: "Which AI model should I use?"
# Technical: Chat model ID passed to openai.chat.completions.create(model=...).
# Use "auto" to enable cost-aware routing (requires router= param).
# Common values: "gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "o3-mini"
api_key="sk-...",
# Plain: "My OpenAI account password"
# Technical: Bearer token for OpenAI API auth. Falls back to
# os.environ["OPENAI_API_KEY"] when omitted.
base_url=None,
# Plain: "Send requests to a different server (e.g. Azure or your own proxy)"
# Technical: Override for openai.base_url. Used for Azure OpenAI endpoints or
# local model servers that implement the OpenAI protocol.
embedding_model="text-embedding-3-small",
# Plain: "Which model to use when converting text to numbers (embeddings)"
# Technical: Default model ID for embed() / aembed() calls.
# Passed to openai.embeddings.create(model=...).
default_prompt=None,
# Plain: "A prompt to use for every request unless I override it"
# Technical: RactoPrompt instance used when ChatConfig.prompt is None.
# If both are None, kit.chat() raises ValueError.
exact_cache=None,
# Plain: "Store answers so I don't pay for the same question twice"
# Technical: ExactMatchCache instance. On a byte-identical request the cached
# LLMResponse is returned without an API call. O(1) lookup.
semantic_cache=None,
# Plain: "Store answers and also reuse them for questions that mean the same thing"
# Technical: SemanticCache instance. Uses cosine similarity on embeddings.
# Returns cached response when similarity >= threshold.
router=None,
# Plain: "Automatically pick the cheapest model that can handle each question"
# Technical: CostAwareRouter instance. Routes each request to the first tier
# whose max_score >= the computed prompt complexity score.
# Required when model="auto".
truncator=None,
# Plain: "Automatically shorten old conversation history if it gets too long"
# Technical: TokenTruncator instance. Trims history messages to keep total
# token count within the model's context window before each API call.
)
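The router comment above describes picking the first tier whose `max_score` covers the prompt's complexity score. A minimal sketch of that selection rule follows. The `Tier` class, the keyword-based `complexity_score` heuristic, and the tier thresholds are all assumptions made for illustration; the library's own scoring is not documented here and is likely different.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    model: str
    max_score: int  # highest complexity score this tier should handle

def complexity_score(prompt: str) -> int:
    """Toy heuristic (an assumption): longer prompts and 'hard' keywords score higher."""
    score = min(len(prompt.split()), 50)
    if any(word in prompt.lower() for word in ("prove", "analyse", "refactor")):
        score += 40
    return min(score, 100)

def pick_model(prompt: str, tiers: list[Tier]) -> str:
    """Return the first (cheapest) tier whose max_score covers the prompt's score."""
    score = complexity_score(prompt)
    for tier in sorted(tiers, key=lambda t: t.max_score):
        if tier.max_score >= score:
            return tier.model
    return max(tiers, key=lambda t: t.max_score).model  # fallback: strongest tier

tiers = [Tier("gpt-4o-mini", 30), Tier("gpt-4o", 100)]
print(pick_model("What is a list?", tiers))  # gpt-4o-mini
print(pick_model("Refactor this module and prove the invariant holds.", tiers))  # gpt-4o
```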
7. Your First Chat
Let’s put it all together — a complete, working example.
import os
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
# 1. Define who the AI is and what it should do
prompt = RactoPrompt(
role="You are a helpful Python tutor.",
aim="Explain the concept the user asks about in simple terms.",
constraints=["Use beginner-friendly language.", "Keep the answer under 3 sentences."],
tone="Warm, encouraging, and clear.",
output_format="text",
)
# 2. Create the kit (reads OPENAI_API_KEY from environment automatically)
kit = gpt.OpenAIDeveloperKit(
model="gpt-4o-mini",
default_prompt=prompt,
)
# 3. Send a message and get a response
response = kit.chat(gpt.ChatConfig(user_message="What is a Python list?"))
print(response.content)
# A list in Python is an ordered collection of items that can hold any type
# of data — numbers, strings, even other lists. You create one with square
# brackets, like my_list = [1, "hello", True]. You can add, remove, or
# change items at any time!
print(f"Tokens used: {response.usage}")
# Tokens used: {'prompt_tokens': 127, 'completion_tokens': 54, 'total_tokens': 181}
print(f"Why did generation stop: {response.finish_reason}")
# Why did generation stop: FinishReason.STOP
# Provider-specific fields (e.g. which model ran) live in the raw response:
print(response.raw.model) # gpt-4o-mini (OpenAI ChatCompletion object)
What is LLMResponse?
The return type of kit.chat() is an LLMResponse object. Here are its key fields:
| Field | Type | Plain English | Technical |
|---|---|---|---|
| `content` | | The AI’s answer as a string | Raw text of the completion (markdown fences auto-stripped) |
| `parsed` | | The answer as structured data (when response is valid JSON) | JSON-decoded from `content` |
| `finish_reason` | | Why the AI stopped generating | Enum `FinishReason` (e.g. `STOP`, `LENGTH`) |
| `usage` | | How many tokens were used | Dict with `prompt_tokens`, `completion_tokens`, `total_tokens` |
| `tool_calls` | | Any tools the AI wanted to call | Non-empty when the model returns a function-call intent |
| `raw` | | The raw provider response object | Original SDK object (e.g. OpenAI `ChatCompletion`) |
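The table mentions that markdown fences are stripped and that valid JSON is decoded into structured data. The sketch below shows one way such post-processing could work, using only the standard library. `strip_fences` and `try_parse_json` are hypothetical helpers, not the library's functions, and the fence marker is built with `"`" * 3` only to keep this example easy to embed in Markdown.

```python
import json
import re

FENCE = "`" * 3  # three backticks
_FENCED = re.compile(rf"^{FENCE}[\w-]*\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)

def strip_fences(text: str) -> str:
    """Remove one surrounding markdown code fence, if present."""
    text = text.strip()
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text

def try_parse_json(text: str):
    """Return the decoded JSON value, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

raw = FENCE + 'json\n{"sentiment": "neutral", "score": 5}\n' + FENCE
content = strip_fences(raw)
print(content)                   # {"sentiment": "neutral", "score": 5}
print(try_parse_json(content))   # {'sentiment': 'neutral', 'score': 5}
print(try_parse_json("Hello!"))  # None
```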
8. ChatConfig
ChatConfig is the object you pass to every chat(), achat(), stream(), and astream() call. It controls the details of a single request.
from pydantic import BaseModel
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
class ProductReview(BaseModel):
    sentiment: str  # "positive" | "neutral" | "negative"
    score: int      # 1–10
    summary: str
config = gpt.ChatConfig(
user_message="The keyboard is amazing but the battery dies in 3 hours.",
# Plain: "The question or text you want to send to the AI"
# Technical: The human turn content. Minimum 1 character (enforced by Pydantic).
prompt=RactoPrompt(
role="You are a product review classifier.",
aim="Classify the review and return a structured analysis.",
constraints=["Scores must be integers from 1 to 10."],
tone="Neutral and objective.",
output_format=ProductReview,
),
# Plain: "Override the kit's default prompt for just this one request"
# Technical: Per-request RactoPrompt. Takes precedence over kit.default_prompt.
# If both are None, raises ValueError.
temperature=0.0,
# Plain: "How predictable vs. creative should the answer be?"
# Technical: Sampling temperature. Float in [0.0, 2.0].
# 0.0 → greedy (argmax) decoding; near-deterministic, though hosted APIs can still vary slightly between runs
# ~0.7 → balanced creativity/coherence (good for most tasks)
# 1.5+ → very random; may become incoherent for structured tasks
max_tokens=512,
# Plain: "Maximum length of the AI's answer"
# Technical: Hard cap on completion tokens. If the model hasn't finished,
# generation stops and finish_reason becomes LENGTH.
# Default is 4096. Keep lower for short structured tasks to save cost.
response_model=ProductReview,
# Plain: "Validate the AI's JSON answer against this Python class"
# Technical: type[BaseModel]. After the API call, the raw JSON content is
# parsed and validated via ProductReview.model_validate().
# On repeated failure, ResponseModelValidationError is raised.
# If omitted and prompt.output_format is a BaseModel, the kit
# infers response_model automatically.
history=[],
# Plain: "Previous messages in the conversation (for multi-turn chat)"
# Technical: list[Message]. Each Message has role (user/assistant/system) and
# content (str). Injected between the system prompt and the current
# user message. Managed manually or via RedisChatMemory.
tools=None,
# Plain: "Python functions the AI is allowed to call"
# Technical: ToolRegistry instance. The adapter serialises its schemas into
# provider-specific function-calling format before the API call.
auto_execute_tools=False,
# Plain: "Should the kit execute tool calls automatically and return final content?"
# Technical: If True, chat()/achat() run a local tool loop:
# LLM tool call -> execute registry callables -> follow-up LLM call.
max_tool_turns=3,
# Plain: "How many tool-call rounds are allowed in auto mode?"
# Technical: Safety cap for auto_execute_tools loop. Range 1..10.
extra={},
# Plain: "Any other provider-specific settings I want to pass"
# Technical: Pass-through dict merged into the API request kwargs.
# E.g. extra={"seed": 42, "top_p": 0.9, "stop": ["\n\n"]}
)
response = kit.chat(config)
print(response.parsed)
# {'sentiment': 'neutral', 'score': 5, 'summary': 'Great keyboard but very poor battery life.'}
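To see why a lower `temperature` makes output more predictable, the sketch below applies temperature scaling to a softmax over made-up token scores. This is the standard textbook formulation, not code from RactoGateway or any provider; the logits are invented, and treating 0 as pure greedy decoding is a simplification.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top-scoring token toward the alternatives, which is why high values produce more varied text.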
9. Structured Output
One of the most powerful features: getting a validated Python object back from the AI instead of raw text.
Step 1 — Define your output shape with Pydantic
from pydantic import BaseModel
class WeatherReport(BaseModel):
    city: str
    temperature_celsius: float
    condition: str  # e.g. "sunny", "rainy", "cloudy"
    uv_index: int
Step 2 — Pass the class as output_format in RactoPrompt
from ractogateway.prompts.engine import RactoPrompt
prompt = RactoPrompt(
role="You are a weather data formatter.",
aim="Parse the user's description into a structured weather report.",
constraints=["Always use Celsius.", "UV index must be 0–11."],
tone="Concise and data-focused.",
output_format=WeatherReport, # <-- the Pydantic class
)
Step 3 — Also pass it as response_model in ChatConfig
from ractogateway import openai_developer_kit as gpt
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)
config = gpt.ChatConfig(
user_message="London, 18 degrees, overcast, UV 3.",
response_model=WeatherReport, # <-- validates the parsed JSON
)
response = kit.chat(config)
# response.parsed is a dict already validated against WeatherReport
print(response.parsed)
# {'city': 'London', 'temperature_celsius': 18.0, 'condition': 'overcast', 'uv_index': 3}
# To get a proper WeatherReport instance:
report = WeatherReport(**response.parsed)
print(report.city) # London
print(report.uv_index) # 3
print(type(report)) # <class '__main__.WeatherReport'>
Why two places?
`output_format` in `RactoPrompt` tells the LLM what to generate (embeds the JSON Schema in the system prompt). `response_model` in `ChatConfig` validates the output in Python. Use both together for maximum safety. If you omit `response_model`, the kits now infer it automatically when `prompt.output_format` is a Pydantic model class.
9.1 Complex Nested Structured Output — Enterprise Vendor Evaluation
Real-world schemas are deeply nested with enums, constrained integers, and lists of sub-models. This example shows a board-level vendor risk evaluation with six sub-models.
Key Rule: always make score ranges explicit in your constraints. Pydantic enforces bounds only after the response arrives (as a validation error, not an API error), so the model may not respect a range unless you state it in the prompt. Use `conint(ge=1, le=100)` for percentage-like scores and tell the model `"all scores are integers on a 1–100 scale"` in the constraints list.
from typing import List, Literal
from pydantic import BaseModel, conint, confloat
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
# ── Sub-models ─────────────────────────────────────────────────────────────
class FinancialRisk(BaseModel):
    burn_rate_risk: Literal["low", "medium", "high"]
    runway_months: conint(ge=0, le=60)
    profitability_projection_years: conint(ge=0, le=10)
    financial_score: conint(ge=1, le=100)  # 1–100, higher = healthier finances

class SecurityAssessment(BaseModel):
    data_encryption: Literal["none", "at_rest_only", "at_rest_and_in_transit"]
    iso_certified: bool
    soc2_certified: bool
    gdpr_compliant: bool
    vulnerabilities_found: conint(ge=0, le=100)
    security_score: conint(ge=1, le=100)  # 1–100, higher = more secure

class TechnicalArchitecture(BaseModel):
    architecture_style: Literal["monolith", "microservices", "serverless", "hybrid"]
    cloud_provider: Literal["aws", "gcp", "azure", "multi-cloud", "on-prem"]
    scalability_rating: conint(ge=1, le=100)  # 1–100, higher = more scalable
    reliability_sla: confloat(ge=0.0, le=100.0)
    vendor_lock_in_risk: Literal["low", "medium", "high"]

class RiskMatrix(BaseModel):
    category: Literal["financial", "security", "technical", "operational"]
    probability: Literal["low", "medium", "high"]
    impact: Literal["low", "medium", "high"]
    mitigation_strategy: str

class MigrationPhase(BaseModel):
    phase_name: str
    duration_months: conint(ge=1, le=36)
    complexity_score: conint(ge=1, le=10)  # 1–10 scale (task complexity)
    key_deliverables: List[str]

class FinalRecommendation(BaseModel):
    decision: Literal["approve", "approve_with_conditions", "reject"]
    confidence_score: conint(ge=1, le=100)
    key_strengths: List[str]
    critical_weaknesses: List[str]
    board_summary: str

class VendorEvaluation(BaseModel):
    vendor_name: str
    industry: str
    annual_contract_value_usd: conint(ge=10_000, le=10_000_000)
    financial_risk: FinancialRisk
    security_assessment: SecurityAssessment
    technical_architecture: TechnicalArchitecture
    top_risks: List[RiskMatrix]
    migration_plan: List[MigrationPhase]
    overall_risk_score: conint(ge=1, le=100)  # 1–100, higher = riskier
    final_recommendation: FinalRecommendation
# ── User input ─────────────────────────────────────────────────────────────
vendor_brief = """
We are evaluating NeuroStack AI as a strategic enterprise AI vendor.
Company Profile:
- 3 years old, monthly burn rate: $1.2M, raised $25M Series A
- Not profitable; expected profitability in 4–5 years
Security:
- ISO 27001 certified, no SOC 2, encryption at rest and in transit
- 3 minor vulnerabilities last year, GDPR compliant
Technical:
- Hybrid architecture hosted on AWS, SLA 99.2%
- Heavy proprietary API usage; deep workflow integration required
Financials:
- Annual contract: $2.4M, operational dependency: Critical
- Moderate probability of vendor collapse in next 18 months
"""
# ── Prompt ─────────────────────────────────────────────────────────────────
kit = gpt.OpenAIDeveloperKit(model="gpt-4o")
config = gpt.ChatConfig(
user_message=vendor_brief,
prompt=RactoPrompt(
role="You are a Chief Risk Officer conducting a board-level enterprise vendor risk evaluation.",
aim="Produce a structured, multi-dimensional vendor evaluation strictly matching the schema.",
constraints=[
# ✅ Always state numeric ranges explicitly — do not rely on the model
# guessing Pydantic bounds from the schema description alone.
"financial_score, security_score, scalability_rating, overall_risk_score, and confidence_score are all integers on a 1–100 scale.",
"complexity_score inside each MigrationPhase is an integer on a 1–10 scale.",
"runway_months must be derived from (cash raised ÷ monthly burn) realistically.",
"overall_risk_score must reflect the sub-scores logically.",
"decision must align with overall_risk_score: ≤35 approve, 36–65 approve_with_conditions, >65 reject.",
"Provide at least 3 top_risks entries.",
"Provide exactly 3 migration phases.",
],
tone="Executive, analytical, objective.",
output_format=VendorEvaluation,
),
temperature=0.0,
max_tokens=2000,
response_model=VendorEvaluation,
)
# ── Execute ────────────────────────────────────────────────────────────────
from ractogateway.exceptions import ResponseModelValidationError
try:
    response = kit.chat(config)
    print("======== PARSED STRUCTURED OUTPUT ========")
    print(response.parsed)
    print("\n======== RAW JSON OUTPUT ========")
    print(response.content)
except ResponseModelValidationError as e:
    print(f"Validation failed after {e.attempts} attempt(s)")
    print(f"Last error: {e.last_error}")
    print(f"Raw output: {e.raw_response}")
Expected output (values will vary slightly with the model):
======== PARSED STRUCTURED OUTPUT ========
{
'vendor_name': 'NeuroStack AI',
'industry': 'Artificial Intelligence',
'annual_contract_value_usd': 2400000,
'financial_risk': {
'burn_rate_risk': 'high', 'runway_months': 20,
'profitability_projection_years': 4, 'financial_score': 40
},
'security_assessment': {
'data_encryption': 'at_rest_and_in_transit',
'iso_certified': True, 'soc2_certified': False, 'gdpr_compliant': True,
'vulnerabilities_found': 3, 'security_score': 70
},
'technical_architecture': {
'architecture_style': 'hybrid', 'cloud_provider': 'aws',
'scalability_rating': 75, 'reliability_sla': 99.2, 'vendor_lock_in_risk': 'high'
},
...
'overall_risk_score': 55,
'final_recommendation': {
'decision': 'approve_with_conditions', 'confidence_score': 65, ...
}
}
9.2 Validation Retries and ResponseModelValidationError
When response_model is set, RactoGateway automatically retries the API call
with a targeted correction prompt if Pydantic rejects the output. This is
controlled by max_validation_retries in ChatConfig (default: 2).
Retry flow:
1. Initial API call → Pydantic validation attempt.
2. On failure → the exact field errors and the bad JSON are fed back to the LLM.
3. The LLM is asked to return a corrected JSON (keeping all valid fields).
4. Steps 2–3 repeat up to `max_validation_retries` times.
5. If all attempts fail → `ResponseModelValidationError` is raised.
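The retry flow above can be sketched with the standard library alone. In this illustration a hand-written `validate_score` function stands in for Pydantic, `ValidationFailed` stands in for `ResponseModelValidationError`, and a fake model replaces the API call. All of these names are hypothetical, and the wording of the correction prompt is an assumption.

```python
import json
from typing import Callable

class ValidationFailed(Exception):
    """Hypothetical stand-in for ResponseModelValidationError."""
    def __init__(self, attempts: int, last_error: str, raw_response: str):
        super().__init__(f"validation failed after {attempts} attempt(s): {last_error}")
        self.attempts = attempts
        self.last_error = last_error
        self.raw_response = raw_response

def validate_score(raw: str) -> dict:
    """Toy validator: expects {"label": str, "value": int in 1..10}."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("label"), str):
        raise ValueError("label must be a string")
    value = data.get("value")
    if not isinstance(value, int) or not 1 <= value <= 10:
        raise ValueError("value must be an integer from 1 to 10")
    return data

def chat_with_retries(ask: Callable[[str], str], question: str,
                      max_validation_retries: int = 2) -> dict:
    prompt = question
    raw, last_error = "", ""
    for attempt in range(1, max_validation_retries + 2):  # 1 initial call + N retries
        raw = ask(prompt)
        try:
            return validate_score(raw)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = str(exc)
            # Feed the error and the bad JSON back, asking for a corrected version.
            prompt = (f"{question}\nYour previous answer was invalid: {last_error}\n"
                      f"Previous JSON: {raw}\nReturn corrected JSON.")
    raise ValidationFailed(attempt, last_error, raw)

# Fake model: out of range on the first call, correct on the second.
replies = iter(['{"label": "Python", "value": 95}', '{"label": "Python", "value": 9}'])
result = chat_with_retries(lambda prompt: next(replies), "Rate 'Python'.")
print(result)  # {'label': 'Python', 'value': 9}
```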
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
from ractogateway.exceptions import ResponseModelValidationError
from pydantic import BaseModel, conint
class Score(BaseModel):
    label: str
    value: conint(ge=1, le=10)  # strict 1–10
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")
config = gpt.ChatConfig(
user_message="Rate 'Python' as a programming language.",
prompt=RactoPrompt(
role="You are a language evaluator.",
aim="Return a score for the given language.",
constraints=["value must be an integer from 1 to 10."],
tone="Concise.",
output_format=Score,
),
response_model=Score,
max_validation_retries=2, # default — retry up to 2 times on bad output
)
try:
    response = kit.chat(config)
    print(response.parsed)  # {'label': 'Python', 'value': 9}
except ResponseModelValidationError as e:
    # All retries exhausted: inspect what went wrong
    print(f"Failed after {e.attempts} attempt(s)")
    print(f"Last Pydantic error: {e.last_error}")
    print(f"Raw LLM output: {e.raw_response}")
ResponseModelValidationError attributes:
| Attribute | Type | Meaning |
|---|---|---|
| `attempts` | | Total API calls made (1 initial + N retries) |
| `last_error` | | The final Pydantic error |
| `raw_response` | | Raw text from the last LLM attempt |
max_validation_retries in ChatConfig:
| Value | Behaviour |
|---|---|
| `0` | No retries: raise immediately on first validation failure |
| `1` | One retry after the initial call |
| `2` | Two retries (default) |
| `3`–`5` | More retries for complex schemas (max allowed: 5) |
Streaming note:
`stream()` and `astream()` cannot retry because content is already delivered token-by-token. If validation fails on the final chunk, `ResponseModelValidationError` is raised directly. Wrap your stream loop in `try/except ResponseModelValidationError` if you use `response_model` with streaming.
10. Multi-Turn Conversations
To have a conversation with memory, pass the history list to each ChatConfig:
from ractogateway import openai_developer_kit as gpt
from ractogateway._models.chat import Message, MessageRole
from ractogateway.prompts.engine import RactoPrompt
kit = gpt.OpenAIDeveloperKit(
model="gpt-4o-mini",
default_prompt=RactoPrompt(
role="You are a helpful AI assistant.",
aim="Carry on a friendly conversation.",
constraints=["Remember what the user said earlier."],
tone="Casual and friendly.",
output_format="text",
),
)
# Turn 1
response1 = kit.chat(gpt.ChatConfig(user_message="My name is Alice."))
print(response1.content)
# Nice to meet you, Alice! How can I help you today?
# Build the history from turn 1
history = [
Message(role=MessageRole.USER, content="My name is Alice."),
Message(role=MessageRole.ASSISTANT, content=response1.content),
]
# Turn 2 — the model now "remembers" turn 1
response2 = kit.chat(gpt.ChatConfig(
user_message="What is my name?",
history=history, # <-- inject previous turns
))
print(response2.content)
# Your name is Alice! 😊
Tip: For production multi-user apps, use RedisChatMemory (see §18) to store history in Redis so it survives server restarts.
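Building the history by hand gets repetitive after a few turns. The sketch below shows the bookkeeping pattern in isolation: send each turn together with the accumulated history, then append both sides. `run_conversation` and `fake_send` are hypothetical helpers, and plain dicts stand in for `Message` objects so that the example runs without the library.

```python
from typing import Callable

def run_conversation(send: Callable[[str, list[dict]], str],
                     user_turns: list[str]) -> list[dict]:
    """Send each turn with the accumulated history, then record both sides."""
    history: list[dict] = []
    for text in user_turns:
        reply = send(text, list(history))  # pass a copy so the callee cannot mutate it
        history.append({"role": "user", "content": text})
        history.append({"role": "assistant", "content": reply})
    return history

def fake_send(user_message: str, history: list[dict]) -> str:
    """Fake model call: reports which turn it is, based on the history length."""
    return f"reply #{len(history) // 2 + 1}"

history = run_conversation(fake_send, ["My name is Alice.", "What is my name?"])
print(len(history))            # 4
print(history[-1]["content"])  # reply #2
```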
11. Streaming
Streaming lets you display the AI’s answer word-by-word as it is generated — much better UX than waiting for the full response.
Synchronous Streaming
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
kit = gpt.OpenAIDeveloperKit(
model="gpt-4o-mini",
default_prompt=RactoPrompt(
role="You are a storyteller.",
aim="Write a short story based on the user's prompt.",
constraints=["Keep it under 100 words."],
tone="Vivid and imaginative.",
output_format="text",
),
)
config = gpt.ChatConfig(user_message="A robot discovers it can dream.")
for chunk in kit.stream(config):
    # chunk.delta.text is the new text in this chunk (may be empty string)
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print()  # newline after the story
        print(f"Finish reason: {chunk.finish_reason}")
        print(f"Total tokens: {chunk.usage.get('total_tokens', '?')}")
Expected output (streaming, printed token-by-token):
In the hum of the server room, Unit-7 closed its optical sensors...
and dreamed of open fields and laughter it had never known.
When it woke, it understood why humans called sleep a gift.
Finish reason: FinishReason.STOP
Total tokens: 112
Asynchronous Streaming
import asyncio
from ractogateway import openai_developer_kit as gpt

# Reuses the kit and config objects created in the synchronous example above.
async def main():
    async for chunk in kit.astream(config):
        print(chunk.delta.text, end="", flush=True)
        if chunk.is_final:
            break

asyncio.run(main())
What is StreamChunk?
| Field | Plain English | Technical |
|---|---|---|
| `delta.text` | New text arrived in this chunk | Incremental token string from the current event |
| | Everything generated so far | Concatenation of all previous deltas |
| `is_final` | Is this the last chunk? | |
| `finish_reason` | Why did generation end? | |
| `usage` | Token counts (only in final chunk) | Dict with token counts (e.g. `total_tokens`) |
| | Tools the model wants to call | Non-empty list when the model requests a tool call |
| | Parsed + validated object (if `response_model` is set) | Available on final chunk only |
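If you need the complete text after streaming (for logging or for appending to history), you can collect the deltas yourself. The sketch below uses a fake generator and a simplified `ToyChunk` dataclass, both hypothetical, so it runs without an API key; a real chunk carries more fields than the two modelled here.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class ToyChunk:
    """Hypothetical, simplified stand-in for a stream chunk."""
    delta_text: str
    is_final: bool = False

def fake_stream() -> Iterator[ToyChunk]:
    """Yield a short story in three pieces, flagging the last one as final."""
    pieces = ["Unit-7 ", "closed its ", "sensors."]
    for i, piece in enumerate(pieces):
        yield ToyChunk(piece, is_final=(i == len(pieces) - 1))

parts: list[str] = []
for chunk in fake_stream():
    parts.append(chunk.delta_text)  # collect deltas; joining once avoids repeated string copies
    if chunk.is_final:
        break

full_text = "".join(parts)
print(full_text)  # Unit-7 closed its sensors.
```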
12. Tool Calling
Tool calling lets the LLM trigger your Python functions. Useful for live data, calculators, search, and business actions.
Step 1 — Define tools and register them
from ractogateway.tools.registry import tool, ToolRegistry
registry = ToolRegistry()
@tool(registry)
def get_weather(city: str, unit: str = "celsius") -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is 22°{'C' if unit == 'celsius' else 'F'} and sunny."

@tool(registry)
def get_time(timezone: str) -> str:
    """Return the current time in the given timezone."""
    from datetime import datetime
    import zoneinfo
    tz = zoneinfo.ZoneInfo(timezone)
    return datetime.now(tz).strftime("%H:%M on %A, %d %B %Y")
print(list(registry.tools.keys())) # ['get_weather', 'get_time']
You can also use @tool without a registry and register later:
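@tool
def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression given as a string."""
    # WARNING: eval() runs arbitrary Python; do not pass it untrusted or model-generated input.
    return eval(expression)  # noqa: S307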
registry.register(calculate)
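Because the model chooses the arguments, an `eval`-based calculator will execute whatever expression the model sends. A safer pattern, sketched below with the standard `ast` module, is to parse the expression and evaluate only a whitelist of arithmetic nodes. `safe_calculate` is a hypothetical replacement written for this guide, not part of RactoGateway, and exponentiation is deliberately left out to avoid very expensive computations.

```python
import ast
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Mod: operator.mod}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_calculate(expression: str) -> float:
    """Evaluate +, -, *, /, % on numeric literals; reject everything else."""
    def _eval(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")
    return float(_eval(ast.parse(expression, mode="eval").body))

print(safe_calculate("12 * 8"))        # 96.0
print(safe_calculate("(2 + 3) * -4"))  # -20.0
```

Function calls, attribute access, and names all raise `ValueError`, so an input such as `__import__('os')` is rejected instead of executed.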
Step 2 — One-call final answer (recommended)
Set `auto_execute_tools=True` so the kit runs the tool loop for you and `response.content` holds the final answer, just as it does for non-tool requests.
from ractogateway.prompts.engine import RactoPrompt
from ractogateway import openai_developer_kit as gpt
kit = gpt.OpenAIDeveloperKit(
model="gpt-4o",
default_prompt=RactoPrompt(
role="You are a helpful assistant with access to live data tools.",
aim="Answer the user's question using the available tools.",
constraints=["Always use the tools when relevant."],
tone="Helpful and precise.",
output_format="text",
),
)
config = gpt.ChatConfig(
user_message="What's the weather like in Paris and what time is it there?",
tools=registry,
auto_execute_tools=True,
max_tool_turns=3,
)
response = kit.chat(config)
print(response.content) # Final integrated answer
Step 3 — Manual tool loop (advanced)
If you prefer full control, keep auto_execute_tools=False (default) and execute
response.tool_calls yourself.
response = kit.chat(
gpt.ChatConfig(
user_message="What's the weather in Tokyo and what is 12 * 8?",
tools=registry,
)
)
if response.tool_calls:
    for tc in response.tool_calls:
        fn = registry.get_callable(tc.name)
        if fn:
            print(tc.name, tc.arguments, "->", fn(**tc.arguments))
What is ToolCallResult? It has three fields: id (unique call ID from the API), name (function name), and arguments (a dict ready to **unpack into your function).
13. File Attachments
Send images, PDFs, and text files alongside your text message using RactoFile.
from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway import openai_developer_kit as gpt
# Keep the prompt in its own variable so prompt.to_messages() can be called below
prompt = RactoPrompt(
    role="You are a visual QA assistant.",
    aim="Describe what you see in the attached image.",
    constraints=["Be specific about colours, shapes, and text visible in the image."],
    tone="Descriptive and precise.",
    output_format="text",
)
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o",  # must be a vision-capable model
    default_prompt=prompt,
)
# Load an image from disk (MIME type is auto-detected)
image = RactoFile.from_path("/path/to/screenshot.png")
# Or from raw bytes:
# image = RactoFile.from_bytes(open("photo.jpg","rb").read(), "image/jpeg")
messages = prompt.to_messages(
user_message="What is shown in this image?",
attachments=[image],
provider="openai", # formats content blocks for the correct provider
)
# You can also just use kit.chat() with a ChatConfig — attachments can be
# baked into the prompt's to_messages() call directly
RactoFile Parameter Reference
| Method / Param | Plain English | Technical |
|---|---|---|
| RactoFile.from_path() | Load a file from your disk | Reads bytes and auto-detects MIME type via |
| RactoFile.from_bytes() | Create from raw bytes you already have | No disk I/O; pass |
|  | The file’s raw bytes |  |
|  | What type of file it is | MIME string: |
|  | An optional filename label |  |
|  | Is it a picture? |  |
|  | Is it a PDF? |  |
|  | File as a base64 string | Used internally by the provider adapters |
14. Embeddings
Embeddings convert text into lists of numbers (vectors) where semantically similar texts end up numerically close. This powers semantic search, clustering, and RAG.
from ractogateway import openai_developer_kit as gpt
from ractogateway._models.embedding import EmbeddingConfig
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")
config = EmbeddingConfig(
texts=["Python is a programming language.", "I love apples.", "Java is also a language."],
# Plain: "The list of strings to convert into number vectors"
# Technical: List[str] passed to openai.embeddings.create(input=...)
model="text-embedding-3-small",
# Plain: "Which embedding model to use"
# Technical: Overrides kit.embedding_model for this specific call.
# None means use the kit's default.
dimensions=None,
# Plain: "How many numbers should each vector have?"
# Technical: Optional int. text-embedding-3-* models accept a reduced size
# (e.g. 256 instead of text-embedding-3-small's default 1536) for faster similarity search.
)
response = kit.embed(config)
for vec in response.vectors:
    print(f"Text: {vec.text!r}")
    print(f"Index: {vec.index}")
    print(f"Vector: [{vec.embedding[0]:.4f}, {vec.embedding[1]:.4f}, ...] (length {len(vec.embedding)})")
    print()
Expected output:
Text: 'Python is a programming language.'
Index: 0
Vector: [0.0123, -0.0456, ...] (length 1536)
Text: 'I love apples.'
Index: 1
Vector: [-0.0234, 0.0789, ...] (length 1536)
Text: 'Java is also a language.'
Index: 2
Vector: [0.0118, -0.0451, ...] (length 1536)
Pro tip: Texts 0 and 2 will have very similar vectors because they are semantically related (“programming languages”). Text 1 will be far from both. This is the essence of embedding-powered semantic search.
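To make "numerically close" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure for embeddings. The three short vectors are made-up stand-ins for real 1536-dimensional embeddings, chosen only to illustrate the idea; in practice you would pass vec.embedding values from the response above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (illustrative values, not real model output).
python_vec = [0.9, 0.1, 0.0]
java_vec = [0.8, 0.2, 0.1]
apples_vec = [0.0, 0.1, 0.9]

related = cosine_similarity(python_vec, java_vec)      # close to 1
unrelated = cosine_similarity(python_vec, apples_vec)  # close to 0
print(f"python vs java:   {related:.2f}")
print(f"python vs apples: {unrelated:.2f}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are unrelated.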
15. Performance & Cost Optimisation
15.1 Exact Match Cache
Plain English: If someone asks the exact same question again (same words, same settings), return the cached answer instantly — no API call, no cost.
Technical: SHA-256 keyed over (user_message, system_prompt, model, temperature, max_tokens). LRU eviction with optional TTL. Thread-safe via threading.Lock.
from ractogateway import openai_developer_kit as gpt
from ractogateway.cache import ExactMatchCache
cache = ExactMatchCache(
max_size=1024,
# Plain: "How many answers to remember at most"
# Technical: LRU capacity. When full, the least-recently-used entry is evicted.
# 0 = unlimited (no eviction ever).
ttl_seconds=3600,
# Plain: "Forget an answer after this many seconds"
# Technical: Float. Entries older than ttl_seconds are treated as cache misses
# and lazily evicted on next access. None = never expire.
)
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", exact_cache=cache)
# First call — hits the API
r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
print(r1.content) # Paris is the capital of France.
# Second call (identical) — served from cache in microseconds, $0 cost
r2 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
print(r2.content) # Paris is the capital of France.
print(cache.stats) # CacheStats(hits=1, misses=1, size=1)
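The mechanism described above can be sketched with the standard library alone. This is an illustration of the idea (a SHA-256 key over the request fields plus an LRU store with lazy TTL eviction), not RactoGateway's actual implementation; cache_key and TinyLRUCache are hypothetical names, the JSON serialisation of the key fields is an assumption, and the thread lock is left out for brevity.

```python
import hashlib
import json
import time
from collections import OrderedDict
from typing import Any, Optional


def cache_key(user_message, system_prompt, model, temperature, max_tokens) -> str:
    """Hash the request fields into a fixed-length hex key."""
    payload = json.dumps(
        [user_message, system_prompt, model, temperature, max_tokens],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TinyLRUCache:
    """Illustrative LRU + TTL store."""

    def __init__(self, max_size: int, ttl_seconds: Optional[float] = None):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._entries: OrderedDict = OrderedDict()  # key -> (stored_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        expired = (
            self.ttl_seconds is not None
            and time.monotonic() - stored_at > self.ttl_seconds
        )
        if expired:
            del self._entries[key]  # lazily evict on access
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        # max_size == 0 is treated as "unlimited", matching the description above.
        if self.max_size and len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # drop the least recently used
```

Changing any one field (for example the temperature) produces a different key, which is why only fully identical requests hit the cache.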
15.2 Semantic Cache
Plain English: Even if the question is worded differently, return the cached answer if it means the same thing.
Technical: Embeds each new query and computes cosine similarity against stored embeddings. Returns the cached response when similarity ≥ threshold.
from ractogateway.cache import SemanticCache
import ractogateway.openai_developer_kit as gpt
# You supply an embedding function — any callable (str) -> list[float]
kit_for_embed = gpt.OpenAIDeveloperKit(model="gpt-4o-mini")
def embed(text: str) -> list[float]:
    from ractogateway._models.embedding import EmbeddingConfig
    resp = kit_for_embed.embed(EmbeddingConfig(texts=[text]))
    return resp.vectors[0].embedding
sem_cache = SemanticCache(
embedder=embed,
# Plain: "A function that converts text to a list of numbers"
# Technical: Callable[[str], list[float]]. Called once for each new query
# to compute its embedding for similarity comparison.
similarity_threshold=0.92,
# Plain: "How similar does a question have to be to reuse a cached answer?"
# Technical: Float in (0, 1]. Cosine similarity minimum. Higher = stricter match.
# 0.92 works well; lower (e.g. 0.85) gives more cache hits but may
# return wrong answers for loosely-related questions.
max_size=512,
# Plain: "How many answers to remember"
# Technical: LRU capacity for the semantic cache store.
)
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", semantic_cache=sem_cache)
# First call
r1 = kit.chat(gpt.ChatConfig(user_message="What is the capital of France?"))
# → API call happens
# Different wording, same meaning — cache HIT (if similarity >= 0.92)
r2 = kit.chat(gpt.ChatConfig(user_message="Which city is France's capital?"))
# → No API call; cached answer returned
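The lookup step can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: TinySemanticCache and toy_embed are hypothetical names, the embedder is a bag-of-words counter over a four-word vocabulary rather than a real embedding model, and the store is a plain list with no LRU eviction.

```python
import math
import re
from typing import Callable, List, Optional, Tuple

VOCAB = ("capital", "france", "weather", "tokyo")


def toy_embed(text: str) -> List[float]:
    """Count vocabulary words; a crude stand-in for a real embedding call."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def _cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class TinySemanticCache:
    def __init__(self, embedder: Callable[[str], List[float]], similarity_threshold: float):
        self.embedder = embedder
        self.similarity_threshold = similarity_threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self.embedder(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the best-matching cached answer, or None below the threshold."""
        query_vec = self.embedder(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = _cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.similarity_threshold else None


cache = TinySemanticCache(toy_embed, similarity_threshold=0.92)
cache.put("What is the capital of France?", "Paris is the capital of France.")
hit = cache.get("Which city is France's capital?")
miss = cache.get("What is the weather in Tokyo?")
```

With a real embedding model the reworded question would score below 1.0, which is why the threshold matters: set it too low and loosely related questions start sharing answers.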
15.3 Token Truncation
Plain English: Long conversations can overflow the AI’s memory limit. The truncator automatically cuts old messages to keep things within bounds.
Technical: Sliding-window strategy over ChatConfig.history. Keeps keep_first_n messages and keep_last_n messages; drops the middle. Uses len(text) // 4 as a token estimator by default, or tiktoken for precision.
from ractogateway.truncation import TokenTruncator, TruncationConfig, MODEL_CONTEXT_LIMITS
from ractogateway import openai_developer_kit as gpt
truncator = TokenTruncator(TruncationConfig(
keep_first_n=2,
# Plain: "Always keep the first N history messages (e.g. important instructions)"
# Technical: int. These messages are never evicted, regardless of token count.
keep_last_n=8,
# Plain: "Always keep the most recent N messages"
# Technical: int. Recent context is preserved; only the 'middle' is dropped.
safety_margin=512,
# Plain: "Leave room for the model's reply"
# Technical: Tokens reserved for the completion. Effective limit =
# context_window - safety_margin.
token_counter=None,
# Plain: "How to count tokens (leave blank for fast estimate)"
# Technical: Optional Callable[[str], int]. When None, uses len(text) // 4.
# For precision, pass tiktoken: lambda t: len(enc.encode(t))
))
kit = gpt.OpenAIDeveloperKit(model="gpt-4o", truncator=truncator)
# Now every kit.chat() / kit.achat() call will auto-trim history before sending.
# Check the context limit for any model:
print(MODEL_CONTEXT_LIMITS["gpt-4o"]) # 128000
print(MODEL_CONTEXT_LIMITS["gpt-4o-mini"]) # 128000
print(MODEL_CONTEXT_LIMITS["claude-opus-4-6"]) # 200000
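A rough sketch of the sliding-window idea follows. It assumes messages are plain strings and that middle messages are dropped oldest-first until the history fits; the text above does not say whether the library drops the whole middle or only as much as needed, so treat that policy, and the names estimate_tokens and truncate_history, as assumptions made for illustration.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Fast estimate: roughly four characters per token."""
    return len(text) // 4


def truncate_history(
    history: List[str],
    max_tokens: int,
    keep_first_n: int,
    keep_last_n: int,
    count: Callable[[str], int] = estimate_tokens,
) -> List[str]:
    """Keep the first and last messages; fill the rest with the newest middle ones."""
    head = history[:keep_first_n]
    tail_start = max(keep_first_n, len(history) - keep_last_n)
    tail = history[tail_start:] if keep_last_n else []
    middle = history[len(head):len(history) - len(tail)]

    used = sum(count(m) for m in head + tail)
    kept_middle: List[str] = []
    # Walk the middle from newest to oldest and stop once the budget is spent.
    for message in reversed(middle):
        cost = count(message)
        if used + cost > max_tokens:
            break
        kept_middle.append(message)
        used += cost
    kept_middle.reverse()
    return head + kept_middle + tail


# Ten messages of 40 characters each, i.e. about 10 estimated tokens per message.
history = [f"m{i}".ljust(40, ".") for i in range(10)]
trimmed = truncate_history(history, max_tokens=60, keep_first_n=2, keep_last_n=3)
print([m[:2] for m in trimmed])  # ['m0', 'm1', 'm6', 'm7', 'm8', 'm9']
```

Here max_tokens plays the role of context_window minus safety_margin.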
15.4 Cost-Aware Routing
Plain English: Not every question needs the most expensive model. Automatically send simple questions to a cheap model and hard questions to a powerful one.
Technical: Scores each prompt (0–100) based on length, question complexity markers, and keyword signals. Routes to the first RoutingTier whose max_score >= score. Adapters are pooled for O(1) model switching.
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway import openai_developer_kit as gpt
router = CostAwareRouter([
RoutingTier(
model="gpt-4o-mini",
max_score=30,
# Plain: "Use this cheap model for easy questions (score 0–30)"
# Technical: First tier. model= is the ID passed to the adapter.
# max_score= is the upper bound of the score range this tier handles.
),
RoutingTier(
model="gpt-4o",
max_score=70,
# Plain: "Use this mid-tier model for moderate questions (score 31–70)"
),
RoutingTier(
model="o3-mini",
max_score=100,
# Plain: "Use this powerful (expensive) model for hard questions (score 71–100)"
# Technical: Final tier; also the fallback if no earlier tier matches.
),
])
kit = gpt.OpenAIDeveloperKit(
model="auto", # <-- REQUIRED when using a router
router=router,
)
# "2+2" → very low complexity score → routed to gpt-4o-mini (cheapest)
r1 = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?"))
print(r1.content) # 4
print(r1.raw.model) # gpt-4o-mini (model name lives in the raw provider object)
# Complex reasoning → high score → routed to o3-mini
r2 = kit.chat(gpt.ChatConfig(
user_message=(
"Explain the mathematical proof of Gödel's incompleteness theorem "
"and its implications for formal systems and computability theory."
)
))
print(r2.raw.model) # o3-mini
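The tier-selection rule ("first tier whose max_score >= score") is simple enough to sketch directly. The scoring heuristic itself is internal to the router and is not reproduced here; Tier and pick_model are hypothetical names used only for this illustration, and scores are passed in by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tier:
    model: str
    max_score: int


def pick_model(score: int, tiers: List[Tier]) -> str:
    """Return the model of the first tier whose max_score covers the score."""
    for tier in tiers:
        if tier.max_score >= score:
            return tier.model
    return tiers[-1].model  # fall back to the last tier


tiers = [Tier("gpt-4o-mini", 30), Tier("gpt-4o", 70), Tier("o3-mini", 100)]
print(pick_model(3, tiers))   # gpt-4o-mini
print(pick_model(31, tiers))  # gpt-4o
print(pick_model(85, tiers))  # o3-mini
```

Because the first match wins, tiers should be listed in ascending max_score order, cheapest model first.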
Combining All Middleware
from ractogateway import openai_developer_kit as gpt
from ractogateway.cache import ExactMatchCache, SemanticCache
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway.truncation import TokenTruncator, TruncationConfig
kit = gpt.OpenAIDeveloperKit(
model="auto",
router=CostAwareRouter([
RoutingTier(model="gpt-4o-mini", max_score=30),
RoutingTier(model="gpt-4o", max_score=100),
]),
exact_cache=ExactMatchCache(max_size=2048, ttl_seconds=7200),
semantic_cache=SemanticCache(embedder=embed, similarity_threshold=0.90),  # embed() as defined in §15.2
truncator=TokenTruncator(TruncationConfig(keep_last_n=10, safety_margin=1024)),
)
# Each request flows: exact cache → semantic cache → route → truncate → API call
16. All Five Developer Kits
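All five kits share the same method signatures:
chat(), achat(), stream(), astream(), embed(), aembed().
Swap the import alias and kit name; everything else stays the same. The one exception is embeddings on AnthropicDeveloperKit, because Anthropic has no native embeddings API (see §16.3).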
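| Kit | Alias | Env var | Offline? |
|---|---|---|---|
| OpenAIDeveloperKit |  | OPENAI_API_KEY | No |
| GoogleDeveloperKit |  | GOOGLE_API_KEY | No |
| AnthropicDeveloperKit |  | ANTHROPIC_API_KEY | No |
| OllamaDeveloperKit |  | — | Yes |
| HuggingFaceDeveloperKit |  | HF_TOKEN | Optional |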
16.1 OpenAIDeveloperKit (GPT)
The primary examples throughout this guide use OpenAIDeveloperKit.
A quick recap:
from ractogateway import openai_developer_kit as gpt, RactoPrompt
prompt = RactoPrompt(
role="You are a helpful assistant.",
aim="Answer the user clearly.",
constraints=["Be concise."],
tone="Friendly",
output_format="text",
)
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)
response = kit.chat(gpt.ChatConfig(user_message="What is 2 + 2?"))
print(response.content) # "4"
Install: pip install ractogateway[openai] · Key env var: OPENAI_API_KEY
16.2 GoogleDeveloperKit (Gemini)
from ractogateway import google_developer_kit as gemini
from ractogateway.prompts.engine import RactoPrompt
kit = gemini.GoogleDeveloperKit(
model="gemini-2.0-flash", # or "gemini-2.0-pro"
api_key="AIza...", # or set GOOGLE_API_KEY env var
)
prompt = RactoPrompt(
role="You are a creative writing assistant.",
aim="Write a haiku about the given subject.",
constraints=["Must follow 5-7-5 syllable structure."],
tone="Poetic and thoughtful.",
output_format="text",
)
response = kit.chat(gemini.ChatConfig(
user_message="Write a haiku about rain.",
prompt=prompt,
))
print(response.content)
# Silver drops descend —
# Earth drinks its ancient thirst deep.
# Mud sings after rain.
16.3 AnthropicDeveloperKit (Claude)
from ractogateway import anthropic_developer_kit as claude
from ractogateway.prompts.engine import RactoPrompt
kit = claude.AnthropicDeveloperKit(
model="claude-sonnet-4-6",
# or "claude-opus-4-6", "claude-haiku-4-5-20251001"
api_key="sk-ant-...", # or set ANTHROPIC_API_KEY env var
)
prompt = RactoPrompt(
role="You are an expert code reviewer.",
aim="Review the code snippet and identify any bugs or improvements.",
constraints=[
"Be specific — cite line numbers.",
"Prioritise correctness over style.",
],
tone="Technical and direct.",
output_format="markdown",
)
response = kit.chat(claude.ChatConfig(
user_message="def divide(a, b): return a / b",
prompt=prompt,
))
print(response.content)
Install: pip install ractogateway[anthropic] · Key env var: ANTHROPIC_API_KEY
Note: Anthropic does not provide a native embeddings API. Call embed()/aembed() via OpenAIDeveloperKit or GoogleDeveloperKit instead when you need vectors alongside Claude chat.
16.4 OllamaDeveloperKit (Local / Offline)
Run any open-source model on your own hardware — no API key, no data leaving your machine.
Prerequisites:
# 1. Install Ollama → https://ollama.com/download
# 2. Pull a model
ollama pull llama3.2 # 2 GB general-purpose
ollama pull nomic-embed-text # 274 MB embeddings model
# 3. Install the Python extra
pip install ractogateway[ollama]
from ractogateway import ollama_developer_kit as local, RactoPrompt
prompt = RactoPrompt(
role="You are a helpful assistant.",
aim="Answer questions concisely.",
constraints=["Do not hallucinate."],
tone="Friendly",
output_format="text",
)
# Ollama listens at http://localhost:11434 by default — no key needed
kit = local.Chat(model="llama3.2", default_prompt=prompt)
response = kit.chat(local.ChatConfig(user_message="What is a neural network?"))
print(response.content)
Streaming:
for chunk in kit.stream(local.ChatConfig(user_message="Tell me a joke.")):
    print(chunk.delta.text, end="", flush=True)
Embeddings (requires a dedicated embedding model):
# model= selects the embedding model pulled in the prerequisites
resp = kit.embed(local.EmbeddingConfig(texts=["hello", "world"], model="nomic-embed-text"))
print(resp.vectors[0].embedding[:5])
Embedded server management — start Ollama programmatically:
with local.OllamaServerManager(port=11500) as srv:
    kit = local.Chat(model="llama3.2", base_url=srv.base_url)
    print(kit.chat(local.ChatConfig(user_message="Hello!")).content)
# server stops automatically
See the full guide: Ollama — Local Model Inference
16.5 HuggingFaceDeveloperKit (HF Inference API / TGI / vLLM)
Three deployment modes through one interface:
| Mode | When to use |
|---|---|
| HF Inference API (cloud) | Quick prototyping; set HF_TOKEN |
| Local TGI | Self-hosted Text Generation Inference |
| Local vLLM / Llama.cpp | Any OpenAI-compatible HTTP server |
pip install ractogateway[huggingface]
export HF_TOKEN="hf_..." # obtain at https://huggingface.co/settings/tokens
Cloud inference:
from ractogateway import huggingface_developer_kit as hf, RactoPrompt
prompt = RactoPrompt(
role="You are a helpful assistant.",
aim="Answer the user clearly.",
constraints=["Stay on topic."],
tone="Friendly",
output_format="text",
)
kit = hf.Chat(
model="meta-llama/Llama-3.2-3B-Instruct",
default_prompt=prompt,
)
response = kit.chat(hf.ChatConfig(user_message="Explain transformers briefly."))
print(response.content)
Local TGI server (no API key):
kit = hf.Chat(
model="tgi",
base_url="http://localhost:8080",
default_prompt=prompt,
)
Embeddings:
resp = kit.embed(
hf.EmbeddingConfig(texts=["hello world", "goodbye world"])
)
print(f"dim={len(resp.vectors[0].embedding)}")
See the full guide: HuggingFace — Cloud and Local Inference
17. RAG — Retrieval-Augmented Generation
Plain English: RAG lets the AI answer questions about your own documents. You feed it your files, it converts them into searchable number vectors, and when someone asks a question, it finds the relevant parts and feeds them to the AI.
Technical: Full pipeline: FileReaderRegistry → chunker → ProcessingPipeline → embedder → vector store → similarity search → RactoPrompt context injection.
Complete RAG Pipeline Example
from ractogateway.rag import RactoRAG
from ractogateway.rag.embedders import OpenAIEmbedder
from ractogateway.rag.stores import InMemoryVectorStore
from ractogateway.rag.chunkers import RecursiveChunker
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt
# 1. Build the RAG pipeline
rag = RactoRAG(
embedder=OpenAIEmbedder(api_key="sk-..."),
store=InMemoryVectorStore(), # swap for ChromaStore, FAISSStore, etc. in production
chunker=RecursiveChunker(chunk_size=512, overlap=64),
)
# 2. Ingest your documents
rag.add_documents([
"/path/to/product_manual.pdf",
"/path/to/faq.docx",
"/path/to/release_notes.txt",
])
# 3. At query time, retrieve relevant chunks
results = rag.retrieve("How do I reset my password?", top_k=3)
# 4. Inject retrieved context into a RactoPrompt
context = "\n\n".join(r.chunk.text for r in results)
prompt = RactoPrompt(
role="You are a product support assistant.",
aim="Answer the user's question based strictly on the provided documentation.",
constraints=["Only use information from the CONTEXT section.", "Quote the source if possible."],
tone="Helpful and precise.",
output_format="text",
context=context, # <-- the retrieved chunks go here
)
# 5. Ask the AI
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=prompt)
response = kit.chat(gpt.ChatConfig(user_message="How do I reset my password?"))
print(response.content)
Chunkers Explained
| Chunker | Plain English | Best For |
|---|---|---|
|  | Split every N characters, no mercy | Quick prototyping, structured data |
| RecursiveChunker | Split at sentence/paragraph boundaries, then fall back to characters | General documents (best default) |
|  | Always split at sentence boundaries | Articles, legal text, Q&A content |
|  | Group sentences that are about the same topic | Complex documents with topic shifts |
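To show what chunk_size and overlap mean in practice, here is a minimal sketch of the simplest strategy in the table, fixed-size splitting with overlap. It assumes both values are measured in characters; chunk_text is a hypothetical helper, not one of the library's chunker classes.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghijklmnopqrst", chunk_size=8, overlap=2))
# ['abcdefgh', 'ghijklmn', 'mnopqrst']
```

The overlap repeats the end of each chunk at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk's neighbourhood.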
Vector Stores Explained
| Store | Plain English | When to Use |
|---|---|---|
| InMemoryVectorStore | Fast in-RAM store; lost on restart | Development, prototyping, tests |
|  | Local persistent store | Single-server apps, local dev |
| FAISSStore | Facebook’s ultra-fast similarity search | Millions of vectors, CPU-only |
|  | Fully managed cloud vector DB | Production, no infra to manage |
|  | Open-source, filterable, scalable | Production with metadata filtering |
|  | Open-source with built-in ML | Multi-modal + graph features |
|  | Distributed vector DB | Billions of vectors at scale |
|  | PostgreSQL extension | Already using Postgres |
18. Redis — Production Infrastructure
Redis tools make your app production-ready: distributed cache, per-user rate limiting, and persistent chat memory that survives deployments.
pip install "ractogateway[redis]"
18.1 Distributed Exact Cache
Drop-in replacement for ExactMatchCache that works across multiple server replicas.
from ractogateway.redis import RedisExactCache
from ractogateway import openai_developer_kit as gpt
cache = RedisExactCache(
url="redis://localhost:6379/0",
# Plain: "Where is your Redis server?"
# Technical: Redis connection URL. Alternatively pass client= with a pre-built
# redis.Redis instance.
ttl_seconds=3600,
# Plain: "Forget cached answers after 1 hour"
# Technical: TTL applied via Redis EXPIRE on each key write.
)
kit = gpt.OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache)
# Now all your servers share the same cache!
18.2 Rate Limiter
Prevent users from making too many expensive requests.
from ractogateway.redis import RedisRateLimiter, RateLimitConfig
limiter = RedisRateLimiter(
url="redis://localhost:6379/0",
config=RateLimitConfig(
max_tokens_per_minute=5_000,
# Plain: "Each user can use at most 5,000 tokens per minute"
# Technical: Sliding 1-minute window. Counter stored as Redis sorted set per user_id.
key_prefix="rl:",
# Plain: "A label to group all rate limit keys in Redis"
# Technical: String prefix for Redis keys: "{key_prefix}{user_id}"
),
)
# In your request handler:
user_id = "user-42"
estimated_tokens = 200
if not limiter.check_and_consume(user_id, tokens=estimated_tokens):
    raise RuntimeError("Rate limit exceeded — please try again in a minute.")
remaining = limiter.get_remaining(user_id)
print(f"Tokens remaining this minute: {remaining}")
# Tokens remaining this minute: 4800
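The sliding-window behaviour can be illustrated without Redis. The sketch below keeps each user's recent (timestamp, tokens) events in memory and discards anything older than the window; it is an in-process approximation for understanding the algorithm, not the Redis sorted-set implementation, and TinyTokenRateLimiter plus its now parameter (added to make the example deterministic) are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class TinyTokenRateLimiter:
    def __init__(self, max_tokens_per_minute: int, window_seconds: float = 60.0):
        self.max_tokens_per_minute = max_tokens_per_minute
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, tokens)

    def _prune(self, user_id: str, now: float) -> None:
        """Drop events that have slid out of the window."""
        events = self._events[user_id]
        while events and now - events[0][0] >= self.window_seconds:
            events.popleft()

    def get_remaining(self, user_id: str, now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        self._prune(user_id, now)
        used = sum(tokens for _, tokens in self._events[user_id])
        return self.max_tokens_per_minute - used

    def check_and_consume(self, user_id: str, tokens: int, now: Optional[float] = None) -> bool:
        """Record the usage and return True, or return False if it would exceed the limit."""
        now = time.monotonic() if now is None else now
        if tokens > self.get_remaining(user_id, now):
            return False
        self._events[user_id].append((now, tokens))
        return True


limiter = TinyTokenRateLimiter(max_tokens_per_minute=5_000)
first = limiter.check_and_consume("user-42", tokens=200, now=0.0)
remaining = limiter.get_remaining("user-42", now=1.0)
print(first, remaining)  # True 4800
```

Unlike a fixed per-minute counter, the window moves with each request, so a burst at 0:59 still counts against requests made at 1:01.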
18.3 Chat Memory
Store conversation history in Redis so it survives server restarts and scales across replicas.
from ractogateway.redis import RedisChatMemory, ChatMemoryConfig
from ractogateway._models.chat import Message, MessageRole
memory = RedisChatMemory(
url="redis://localhost:6379/0",
config=ChatMemoryConfig(
max_turns=20,
# Plain: "Remember the last 20 turns (up to 40 messages) per conversation"
# Technical: Redis List capped to 2*max_turns entries (each turn = 2 messages).
# Older messages are popped from the front automatically.
ttl_seconds=1800,
# Plain: "Forget the conversation after 30 minutes of inactivity"
# Technical: TTL reset on every append() call.
key_prefix="chat:",
# Plain: "Label all conversation keys in Redis"
# Technical: Redis keys = "{key_prefix}{conv_id}"
),
)
# When a user sends a message:
conv_id = "session-abc123"
memory.append(conv_id, "user", "What's the best way to learn Python?")
# After getting the AI response:
memory.append(conv_id, "assistant", "Start with the official tutorial, then build projects!")
# Reconstruct history for the next request:
history_dicts = memory.get_history(conv_id)
# [{"role": "user", "content": "What's the best way..."}, {"role": "assistant", "content": "..."}]
history = [Message(role=m["role"], content=m["content"]) for m in history_dicts]
# Pass to ChatConfig:
response = kit.chat(gpt.ChatConfig(
user_message="What resources do you recommend?",
history=history,
))
# Wipe the conversation when the session ends:
memory.clear(conv_id)
print(memory.count(conv_id)) # 0
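The capping rule (at most 2 × max_turns messages, oldest dropped first) can be pictured with a bounded deque. This is an in-memory sketch for intuition only: TinyChatMemory is a hypothetical class, and it omits the TTL and the Redis storage entirely.

```python
from collections import deque
from typing import Dict, List


class TinyChatMemory:
    def __init__(self, max_turns: int):
        self._max_messages = 2 * max_turns  # one turn = user + assistant message
        self._conversations: Dict[str, deque] = {}

    def append(self, conv_id: str, role: str, content: str) -> None:
        history = self._conversations.setdefault(
            conv_id, deque(maxlen=self._max_messages)
        )
        history.append({"role": role, "content": content})  # oldest entry falls off when full

    def get_history(self, conv_id: str) -> List[dict]:
        return list(self._conversations.get(conv_id, ()))


memory_sketch = TinyChatMemory(max_turns=1)
memory_sketch.append("s1", "user", "first question")
memory_sketch.append("s1", "assistant", "first answer")
memory_sketch.append("s1", "user", "second question")
print([m["content"] for m in memory_sketch.get_history("s1")])
# ['first answer', 'second question']
```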
19. Common Mistakes & How to Fix Them
Mistake 1: Using output instead of output_format in RactoPrompt
# WRONG — this will raise a Pydantic ValidationError
prompt = RactoPrompt(
role="...", aim="...", constraints=["..."], tone="...",
output="text", # ❌ field is called output_format, not output!
)
# CORRECT
prompt = RactoPrompt(
role="...", aim="...", constraints=["..."], tone="...",
output_format="text", # ✅
)
Mistake 2: Forgetting at least one constraint
# WRONG — constraints cannot be an empty list
prompt = RactoPrompt(
role="...", aim="...", constraints=[], # ❌ ValidationError: min_length=1
tone="...", output_format="text",
)
# CORRECT
prompt = RactoPrompt(
role="...", aim="...",
constraints=["Be helpful."], # ✅ at least one constraint required
tone="...", output_format="text",
)
Mistake 3: Using model="auto" without a router
# WRONG — raises ValueError immediately
kit = gpt.OpenAIDeveloperKit(model="auto") # ❌
# CORRECT
kit = gpt.OpenAIDeveloperKit(
model="auto",
router=CostAwareRouter([...]), # ✅
)
Mistake 4: Neither ChatConfig.prompt nor kit.default_prompt is set
# WRONG — raises ValueError when chat() is called
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini") # no default_prompt
response = kit.chat(gpt.ChatConfig(user_message="Hello")) # ❌
# FIX OPTION 1: Set default_prompt on the kit
kit = gpt.OpenAIDeveloperKit(model="gpt-4o-mini", default_prompt=my_prompt)
# FIX OPTION 2: Pass prompt in ChatConfig
response = kit.chat(gpt.ChatConfig(user_message="Hello", prompt=my_prompt))
Mistake 5: Expecting typed validation but not setting it explicitly
# BEST PRACTICE — set response_model explicitly
prompt = RactoPrompt(..., output_format=WeatherReport)
config = gpt.ChatConfig(
user_message="...",
response_model=WeatherReport, # ✅ explicit validation contract
)
# ALSO SUPPORTED — inferred automatically from output_format model
prompt = RactoPrompt(..., output_format=WeatherReport)
config = gpt.ChatConfig(user_message="...") # ✅ inferred from prompt.output_format
Mistake 6: Missing await on async methods
# WRONG — this returns a coroutine object, not a response
response = kit.achat(config) # ❌
# CORRECT
response = await kit.achat(config) # ✅ (inside an async function)
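The difference is easy to see with a stand-in coroutine. In the sketch below, fake_achat is a hypothetical placeholder for kit.achat so the example runs without an API key; the point is only that calling an async function yields a coroutine object until it is awaited.

```python
import asyncio


async def fake_achat(message: str) -> str:
    """Hypothetical stand-in for kit.achat(); returns a canned reply."""
    await asyncio.sleep(0)
    return f"echo: {message}"


async def main():
    pending = fake_achat("Hello")          # not awaited yet: a coroutine object
    is_coro = asyncio.iscoroutine(pending)
    response = await pending               # awaiting runs it and yields the result
    return is_coro, response


is_coro, response = asyncio.run(main())
print(is_coro, response)  # True echo: Hello
```

From synchronous code, asyncio.run() is the entry point; inside an already running event loop (for example a web framework handler), use await directly.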
Mistake 7: Not installing the provider extra
# WRONG — if you only ran pip install ractogateway
from ractogateway import openai_developer_kit as gpt
kit = gpt.OpenAIDeveloperKit(model="gpt-4o")
kit.chat(...) # ❌ ImportError: The 'openai' package is required
# FIX
# pip install "ractogateway[openai]"
Mistake 8: Not handling ResponseModelValidationError
When response_model is set, validation failures now raise
ResponseModelValidationError after all retries are exhausted — they no
longer silently append a warning string to response.content.
# WRONG — this will now raise, not return a response with garbled content
response = kit.chat(config) # ❌ unhandled ResponseModelValidationError
# CORRECT — wrap in try/except to handle gracefully
from ractogateway.exceptions import ResponseModelValidationError
try:
    response = kit.chat(config)
    report = MyModel(**response.parsed)
except ResponseModelValidationError as e:
    # Inspect what happened and decide how to recover
    print(f"Validation failed after {e.attempts} attempt(s): {e.last_error}")
    # e.raw_response holds the last raw JSON string from the LLM
Tip: The default max_validation_retries=2 means the kit retries up to twice before raising; most transient issues resolve on the first retry. Set max_validation_retries=0 to disable retries and fail fast.
19. Telemetry & Observability
RactoGateway ships production-grade observability with zero changes to existing call sites.
Attach a RactoTracer and/or GatewayMetricsMiddleware to any kit and every LLM call is
automatically instrumented.
Installation
pip install "ractogateway[observability]" # OTEL tracing + Prometheus metrics
pip install "ractogateway[telemetry]" # OTEL tracing only
pip install "ractogateway[prometheus]" # Prometheus metrics only
Quick start
from ractogateway import openai_developer_kit as opd
from ractogateway.telemetry import RactoTracer, GatewayMetricsMiddleware, PrometheusExporter
tracer = RactoTracer(otlp_endpoint="http://localhost:4317", console=True)
metrics = GatewayMetricsMiddleware()
PrometheusExporter(port=8000).start() # scrape http://localhost:8000/metrics
kit = opd.OpenAIDeveloperKit(
model="gpt-4o",
default_prompt=prompt,
tracer=tracer,
metrics=metrics,
)
response = kit.chat(opd.ChatConfig(user_message="Hello!"))
# One OTEL span emitted, one Prometheus data-point recorded.
The same tracer= / metrics= parameters work on GoogleDeveloperKit and
AnthropicDeveloperKit.
What is recorded automatically
Event |
Tracer span |
Prometheus metrics |
|---|---|---|
Successful chat/stream |
|
|
Cache hit (exact/semantic) |
|
|
Cache miss |
— |
|
Tool call |
|
|
Error |
|
|
Embedding |
|
|
OTEL export backends
# Jaeger / Grafana Tempo (gRPC)
RactoTracer(otlp_endpoint="http://jaeger:4317")
# Zipkin / Tempo (HTTP)
RactoTracer(otlp_http_endpoint="http://tempo:4318")
# In-memory capture for unit tests — no external backend needed
tracer = RactoTracer(in_memory=True)
kit.chat(...)
assert tracer.spans[0].provider == "openai"
tracer.clear_spans()
Custom pricing
from ractogateway.telemetry import ModelPricing, RactoTracer
custom = {"my-ft-gpt4": ModelPricing(input_per_million=5.0, output_per_million=15.0)}
tracer = RactoTracer(otlp_endpoint="...", price_table=custom)
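Per-million pricing translates into a request cost with simple arithmetic. The sketch below assumes cost is computed as tokens × price ÷ 1,000,000 for input and output separately; Pricing is a hypothetical stand-in that mirrors the two ModelPricing fields shown above, and estimate_cost is not a library function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Stand-in with the same two fields as ModelPricing (USD per million tokens)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


cost = estimate_cost(Pricing(5.0, 15.0), input_tokens=1200, output_tokens=300)
print(f"${cost:.4f}")  # $0.0105
```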
Grafana dashboard
Import dashboards/grafana_dashboard.json into Grafana to get 20+ pre-built panels covering
latency percentiles (p50/p95/p99), token rate, cost rate, cache hit/miss ratio, error rate,
tool call distribution, and a per-model summary table.
Full reference: Telemetry guide | API reference
20. Prebuilt Pipelines — Production Workflows
RactoGateway includes prebuilt pipelines for common end-to-end tasks where a
single chat() call is not enough.
Available pipelines
| Pipeline | Classes | Use case |
|---|---|---|
| SQL Analyst | SQLAnalystPipeline | Natural language analytics over SQL databases |
| List Classifier | ListClassifierPipeline | Map user text to one or more options from a list |
| Video Processor | VideoProcessorPipeline | Extract frames, transcribe audio, analyse with vision LLM, summarise |
| Agent | AgentPipeline | Autonomous ReAct agent — reason + call tools + observe → answer |
Install extras
# SQL Analyst
pip install ractogateway[pipelines-sql] # core (no charts)
pip install ractogateway[pipelines-sql-viz] # + Plotly charts
# Video Processor
pip install ractogateway[pipelines-video] # OpenCV + ffmpeg + pHash
pip install ractogateway[pipelines-video-whisper] # + faster-whisper (local ASR)
pip install ractogateway[pipelines-video-yt] # + yt-dlp (YouTube download)
# Agent
pip install ractogateway[pipelines-agent] # core (no extra deps)
pip install ractogateway[pipelines-agent-http] # + httpx (http_get tool)
SQL Analyst — quick example
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import SQLAnalystPipeline
sql_pipeline = SQLAnalystPipeline(kit=gpt.Chat(model="gpt-4o"))
result = sql_pipeline.run(
user_query="Top 5 products by revenue",
connection_string="postgresql://user:pass@localhost:5432/shop",
)
print(result.answer)
List Classifier — quick example
from ractogateway.pipelines import ListClassifierPipeline
classifier = ListClassifierPipeline(
kit=gpt.Chat(model="gpt-4o-mini"),
options=["Billing", "Technical Support", "Sales"],
include_confidence=True,
include_reasoning=True,
)
result = classifier.run("I cannot update my payment method")
print(result.first) # "Billing"
print(result.top_confidence) # e.g. 0.96
Video Processor — quick example
Process a lecture or tutorial video end-to-end — extract key frames, transcribe speech, use a vision LLM to read whiteboards/screens, and produce a structured Markdown report.
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import VideoProcessorPipeline, TranscriberBackend, DeduplicationMethod
pipeline = VideoProcessorPipeline(
kit=gpt.Chat(model="gpt-4o"), # vision LLM + summary
fps=1.0, # sample one frame per second
similarity_threshold=85.0, # drop frames that are ≥85% similar to the previous
dedup_method=DeduplicationMethod.PHASH,
transcriber=TranscriberBackend.FASTER_WHISPER,
transcriber_model="base",
analyze_frames=True,
generate_summary=True,
safe_mode=True,
)
# Accepts: local path, HTTP URL, YouTube URL, raw bytes, or pre-extracted frame list
result = pipeline.run("lecture.mp4")
print(f"Frames kept : {result.usage.frames_kept}/{result.usage.frames_extracted}")
print(f"Tokens used : {result.usage.total_tokens}")
print(result.summary) # structured Markdown summary
result.to_markdown("report.md") # save full report
What it produces (VideoProcessorResult):
Field |
Type |
Description |
|---|---|---|
|
|
Every extracted frame with its LLM analysis |
|
|
Timed speech-to-text segments |
|
|
Time windows merging visual + audio content |
|
|
7-section Markdown summary |
|
|
Token counts + frame statistics |
Supported transcription backends (TranscriberBackend):
| Backend | Value | Requires |
|---|---|---|
| Faster Whisper (default) |  |  |
| OpenAI Whisper (local) |  |  |
| OpenAI API |  | OpenAI API key |
| Groq API (ultra-fast) |  |  |
| Deepgram |  |  |
| Google Cloud STT |  |  |
| HuggingFace local |  |  |
| HuggingFace API |  |  |
| Ollama |  | Running Ollama server |
Agent — quick example
An autonomous ReAct (Reason + Act) agent that loops: think → call tool → observe → repeat until it calls the built-in finish() tool.
from ractogateway import openai_developer_kit as gpt
from ractogateway.pipelines import AgentPipeline
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"

def unit_convert(value: float, from_unit: str, to_unit: str) -> str:
    """Convert a value between units."""
    # ... your logic here ...
    return f"{value} {from_unit} = ... {to_unit}"

agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather, unit_convert],
    max_steps=8,
    safe_mode=True,
)
result = agent.run("What is the weather in Paris, and convert 22°C to Fahrenheit?")
print(result.final_answer)
print(result.to_markdown()) # step-by-step trace
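Plain Python functions work as tools because their signatures and docstrings carry enough information to describe them to a model. How `AgentPipeline` does this internally is not shown in this guide; the sketch below is one plausible approach using only the standard library, and `describe_tool` is a hypothetical helper.

```python
import inspect
from typing import Any, Callable, Dict


def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"


def describe_tool(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Build a small description of a function from its signature and docstring."""
    signature = inspect.signature(fn)
    params = {}
    for name, param in signature.parameters.items():
        annotation = param.annotation
        # Fall back to "any" when a parameter has no type hint.
        if annotation is inspect.Parameter.empty:
            params[name] = "any"
        else:
            params[name] = getattr(annotation, "__name__", str(annotation))
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }


spec = describe_tool(get_weather)
print(spec)
```

This is why type hints and a one-line docstring on every tool are worth the effort: they are likely the only description of the tool the model ever sees.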
Agent result fields (AgentResult):
| Field | Type | Description |
|---|---|---|
|  |  | The agent’s concluded answer |
|  |  | Every thought / tool call / observation |
|  |  |  |
|  |  | Cumulative token counts across all steps |
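The think → call tool → observe loop can be sketched without any LLM by replacing the model with a scripted stand-in. Everything here is illustrative: `scripted_model`, `run_agent`, and the step dictionaries are hypothetical and do not mirror the library's internals.

```python
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"Sunny, 22 °C in {city}"


def scripted_model(history: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stand-in for an LLM: picks the next action from what has been observed so far."""
    if not any(step["tool"] == "get_weather" for step in history):
        return {"thought": "I need the weather first.",
                "tool": "get_weather", "args": {"city": "Paris"}}
    return {"thought": "I have what I need.",
            "tool": "finish", "args": {"answer": history[-1]["observation"]}}


def run_agent(tools: Dict[str, Callable[..., str]], max_steps: int = 8) -> Dict[str, Any]:
    """Loop: think, call a tool, record the observation, stop on finish()."""
    history: List[Dict[str, Any]] = []
    for _ in range(max_steps):
        action = scripted_model(history)
        if action["tool"] == "finish":
            return {"final_answer": action["args"]["answer"], "steps": history}
        observation = tools[action["tool"]](**action["args"])
        history.append({**action, "observation": observation})
    # The step budget ran out before the model called finish().
    return {"final_answer": None, "steps": history}


result = run_agent({"get_weather": get_weather})
print(result["final_answer"])
```

The `max_steps` cap matters: without it, a model that never calls `finish()` would loop indefinitely and keep consuming tokens.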
Built-in tool factories:
from ractogateway.pipelines import (
    make_rag_tool,      # rag_search(query) → relevant chunks from RactoRAG
    make_sql_tool,      # sql_query(question) → answer from SQLAnalystPipeline
    make_http_tool,     # http_get(url) → page text (requires httpx)
    make_memory_tools,  # memory_read(key) + memory_write(key, value)
)

agent = AgentPipeline(
    kit=gpt.Chat(model="gpt-4o"),
    tools=[get_weather],             # your custom tools
    rag_pipeline=my_rag,             # auto-registers rag_search
    sql_pipeline=my_sql,             # auto-registers sql_query
    agent_memory={},                 # dict → auto-registers memory_read/write
    extra_tools=[make_http_tool()],  # opt-in http_get
)
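Passing a plain dict as `agent_memory` suggests that the memory tools are small functions sharing one store. A minimal sketch of that factory pattern follows; `make_memory_tools_sketch` is a hypothetical stand-in, not the library's `make_memory_tools`, whose signature is not documented in this section.

```python
from typing import Callable, Dict, Tuple


def make_memory_tools_sketch(
    store: Dict[str, str],
) -> Tuple[Callable[[str], str], Callable[[str, str], str]]:
    """Return (memory_read, memory_write) closures bound to one shared dict."""

    def memory_read(key: str) -> str:
        """Read a value the agent stored earlier."""
        return store.get(key, f"No value stored under '{key}'")

    def memory_write(key: str, value: str) -> str:
        """Store a value for later steps."""
        store[key] = value
        return f"Stored '{key}'"

    return memory_read, memory_write


memory: Dict[str, str] = {}
memory_read, memory_write = make_memory_tools_sketch(memory)
memory_write("city", "Paris")
print(memory_read("city"))
```

Because both closures share the same dict, the caller can inspect or pre-populate `memory` before and after a run.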
21. Chain of Thought Reasoning
Chain of Thought (CoT) prompts the model to reason step-by-step before giving its
final answer. RactoGateway exposes this as a single ChatConfig flag — no prompt
engineering required.
How to enable
from ractogateway import openai_developer_kit as gpt
kit = gpt.Chat(model="gpt-4o")
response = kit.chat(
    gpt.ChatConfig(
        user_message="If a train travels 300 km in 2.5 hours, what is its average speed?",
        chain_of_thought=True,  # ← flip this flag
    )
)
print(response.content)
# The model will reason through the problem before stating "120 km/h"
What it does internally
Setting chain_of_thought=True appends a step-by-step reasoning constraint to the
RactoPrompt before the request is sent. The constraint instructs the model to:
- Break the problem into numbered reasoning steps.
- Show its working at each step.
- State the final answer clearly at the end.
This is applied per request; it does not modify the kit’s default prompt permanently.
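The per-request behaviour described above can be pictured as copying the prompt and extending the copy. The sketch uses a hypothetical `PromptSketch` dataclass in place of `RactoPrompt`, and the constraint wording is an assumption; the library's exact text is not shown in this guide.

```python
from dataclasses import dataclass, field, replace
from typing import List

# Assumed wording; the library's actual constraint text may differ.
COT_CONSTRAINT = (
    "Break the problem into numbered reasoning steps, show your working "
    "at each step, and state the final answer clearly at the end."
)


@dataclass(frozen=True)
class PromptSketch:
    role: str
    constraints: List[str] = field(default_factory=list)


def with_chain_of_thought(prompt: PromptSketch) -> PromptSketch:
    """Return a copy with the reasoning constraint appended; the original is untouched."""
    return replace(prompt, constraints=[*prompt.constraints, COT_CONSTRAINT])


default_prompt = PromptSketch(role="Maths tutor", constraints=["Be concise."])
request_prompt = with_chain_of_thought(default_prompt)

print(len(default_prompt.constraints))
print(len(request_prompt.constraints))
```

The default prompt still has one constraint afterwards, so later requests without the flag are unaffected.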
When to use CoT
| Scenario | Benefit |
|---|---|
| Math / logic problems | Forces explicit calculation steps → fewer errors |
| Multi-step planning | Surfaces assumptions and intermediate decisions |
| Debugging assistance | Produces a traceable reasoning chain |
| Exam / quiz apps | Provides explanation alongside the answer |
Combining with structured output
from pydantic import BaseModel
class ReasonedAnswer(BaseModel):
    steps: list[str]
    final_answer: str

response = kit.chat(
    gpt.ChatConfig(
        user_message="How many seconds are in a leap year?",
        chain_of_thought=True,
        response_model=ReasonedAnswer,  # parse result into Pydantic model
    )
)
print(response.parsed.steps)
print(response.parsed.final_answer)
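Because the parsed result is a typed object, application code can verify it. The sketch below checks a hand-written example of what the model might return; a dataclass stands in for the Pydantic model so the snippet runs without extra dependencies, and the step strings are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasonedAnswerSketch:
    steps: List[str]
    final_answer: str


# Hand-written stand-in for response.parsed; a real model's wording will differ.
parsed = ReasonedAnswerSketch(
    steps=[
        "A leap year has 366 days.",
        "Each day has 24 * 60 * 60 = 86400 seconds.",
        "366 * 86400 = 31622400.",
    ],
    final_answer="31622400",
)

expected = 366 * 24 * 60 * 60
print(expected)
print(int(parsed.final_answer) == expected)
```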
22. Native Thinking / Extended Reasoning
Native Thinking exposes the model’s internal chain-of-thought reasoning tokens — the model genuinely thinks before answering rather than being instructed to write steps. Supported by Anthropic Claude (extended thinking) and Google Gemini (thinking mode). OpenAI o-series models expose reasoning token counts but not the text.
Enable native thinking
from ractogateway import anthropic_developer_kit as claude
kit = claude.Chat(model="claude-opus-4-6")
response = kit.chat(
    claude.ChatConfig(
        user_message="Prove that √2 is irrational.",
        native_thinking=True,
        thinking_budget=8000,  # max thinking tokens (Anthropic/Google)
    )
)
print(response.thinking) # raw model reasoning (may be hundreds of tokens)
print(response.content) # final polished answer
Streaming with native thinking
accumulated_thinking = ""
for chunk in kit.stream(
    claude.ChatConfig(
        user_message="Design a cache-invalidation strategy for a distributed system.",
        native_thinking=True,
        thinking_budget=10000,
    )
):
    if chunk.is_thinking:
        accumulated_thinking += chunk.delta.thinking  # keep the full reasoning for later inspection
        print(chunk.delta.thinking, end="", flush=True)
    else:
        print(chunk.delta.text, end="", flush=True)
Provider behaviour summary
| Provider | Thinking text visible | Thinking budget param | Notes |
|---|---|---|---|
| Anthropic Claude | ✅ |  | Forces |
| Google Gemini | ✅ |  |  |
| OpenAI (o-series) | ❌ not exposed | N/A |  |
LLMResponse fields added by native thinking
| Field | Type | Description |
|---|---|---|
|  |  | Raw model reasoning text |
|  |  | Incremental thinking token (streaming) |
|  |  | Full thinking so far (streaming) |
|  |  |  |
When to use native thinking
Use native_thinking=True when accuracy matters more than latency:
- Complex proofs, theorem verification
- Code architecture reviews
- Medical / legal / scientific reasoning
- Any task where you want to inspect the model’s reasoning, not just the answer
Cost note: thinking tokens count toward your bill but are not included in
response.content. Set thinking_budget conservatively; 4000–8000 tokens is usually enough.
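A rough budget check can make the cost note concrete. The price below is a made-up placeholder, not a real rate, and the sketch assumes thinking tokens are billed at the output-token rate; check your provider's pricing page for actual figures.

```python
def estimate_output_cost(
    thinking_tokens: int,
    answer_tokens: int,
    usd_per_million_output_tokens: float,
) -> float:
    """Estimate output-side cost when thinking tokens are billed like answer tokens."""
    billed_tokens = thinking_tokens + answer_tokens
    return billed_tokens * usd_per_million_output_tokens / 1_000_000


# Hypothetical price of 10 USD per million output tokens.
PRICE = 10.0

full_budget = estimate_output_cost(8000, 500, PRICE)
small_budget = estimate_output_cost(4000, 500, PRICE)

print(f"{full_budget:.4f}")
print(f"{small_budget:.4f}")
```

With a 500-token answer, the thinking budget dominates the bill in both cases, which is why halving `thinking_budget` nearly halves the worst-case cost per request.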
23. PageIndexRAG — Vectorless RAG
PageIndexRAG is a lightweight RAG pipeline that requires no embeddings and no vector database. It uses a two-stage keyword index + BM25 scoring to retrieve relevant pages from documents. Perfect for CPU-only environments, offline use, or when you want instant setup without configuring a vector store.
How it works
Document → page split → DecisionIndex (inverted keyword index)
→ BM25 scorer (Okapi BM25) → top-k pages → LLM
1. Page split: PDFs are split page-by-page; all other documents use fixed character windows (page_size=1000, page_overlap=100).
2. DecisionIndex: builds an inverted keyword index over all pages for fast candidate retrieval (no embeddings needed).
3. BM25 scoring: ranks candidates with Okapi BM25, the same algorithm used by Elasticsearch and Solr.
4. LLM answer: the top-k pages are passed to the LLM as context.
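The retrieval stages can be sketched in plain Python. This is a simplified illustration of an inverted index followed by Okapi BM25 scoring, not the library's `DecisionIndex` code; the tokenizer, the `k1=1.5` and `b=0.75` defaults, and all function names are assumptions.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages: List[str]) -> Dict[str, Set[int]]:
    """Map each term to the set of page numbers that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for page_no, page in enumerate(pages):
        for term in tokenize(page):
            index[term].add(page_no)
    return index


def bm25_search(
    query: str, pages: List[str], top_k: int = 3, k1: float = 1.5, b: float = 0.75
) -> List[Tuple[int, float]]:
    """Return (page_no, score) pairs for the top_k pages, best first."""
    index = build_inverted_index(pages)
    tokenized = [tokenize(p) for p in pages]
    avg_len = sum(len(t) for t in tokenized) / len(pages)
    query_terms = tokenize(query)

    # Stage 1: the inverted index narrows the search to pages sharing a query term.
    candidates: Set[int] = set()
    for term in query_terms:
        candidates |= index.get(term, set())

    # Stage 2: score each candidate with Okapi BM25.
    scores: Dict[int, float] = {}
    for page_no in candidates:
        counts = Counter(tokenized[page_no])
        length = len(tokenized[page_no])
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            df = len(index[term])
            idf = math.log(1 + (len(pages) - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scores[page_no] = score

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


pages = [
    "Refunds are processed within five business days.",
    "The handbook covers onboarding, leave policy, and expenses.",
    "Expenses must be submitted with receipts; refunds of expenses take longer.",
]
results = bm25_search("expenses receipts", pages)
print([page_no for page_no, _ in results])
```

Page 2 ranks first because it contains both query terms, and the rarer term ("receipts") carries a higher IDF weight. Page 0 shares no query term, so the index never passes it to the scorer.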
Quick example
from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG
kit = gpt.Chat(model="gpt-4o-mini")
# Build the index
rag = PageIndexRAG(kit=kit)
rag.add_document("docs/handbook.pdf") # PDF — split page-by-page
rag.add_document("docs/faq.txt") # Plain text — split by char window
rag.add_texts(["RactoGateway supports 5 developer kits.", "..."])
# Query
result = rag.search("What developer kits are supported?")
print(result.answer) # LLM answer grounded in the retrieved pages
print(result.pages[0].text) # raw page text that was used as context
No extra install
PageIndexRAG ships in the core package — no vector store or embedding model required:
pip install ractogateway          # PageIndexRAG included by default
pip install "ractogateway[rag]"   # if you also want readers (PDF, Word, Excel…); quotes keep shells such as zsh from expanding the brackets
Comparison: PageIndexRAG vs. RactoRAG
| Feature | PageIndexRAG | RactoRAG |
|---|---|---|
| Embeddings needed | ❌ No | ✅ Yes |
| Vector store needed | ❌ No | ✅ Yes (Chroma, FAISS, Pinecone…) |
| Retrieval algorithm | BM25 (keyword) | Cosine similarity (semantic) |
| Best for | Quick setup, keyword-rich docs | Deep semantic search |
| GPU/CPU | Pure CPU | CPU or GPU (embedding model) |
| Offline use | ✅ Fully offline | ⚠️ Depends on embedder |
When to use PageIndexRAG
- Prototyping a Q&A feature without setting up a vector DB
- Compliance / legal documents where exact keyword match matters
- Offline / air-gapped environments
- Structured documents (manuals, handbooks) where pages map naturally to topics
Advanced: async + per-call top-k
import asyncio
async def main():
    rag = PageIndexRAG(kit=kit, top_k=5, page_size=800, page_overlap=80)
    rag.add_document("research_paper.pdf")
    result = await rag.asearch("What methodology did the authors use?")
    print(result.answer)

asyncio.run(main())
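For non-PDF documents, `page_size` and `page_overlap` control how text is windowed. The sketch below shows one common way to do this, where each window starts `page_size - page_overlap` characters after the previous one; whether the library steps in exactly this way is an assumption, and `split_into_pages` is a hypothetical helper.

```python
from typing import List


def split_into_pages(text: str, page_size: int = 1000, page_overlap: int = 100) -> List[str]:
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if page_overlap >= page_size:
        raise ValueError("page_overlap must be smaller than page_size")
    step = page_size - page_overlap
    pages = []
    for start in range(0, len(text), step):
        pages.append(text[start:start + page_size])
        # Stop once a window has reached the end of the text.
        if start + page_size >= len(text):
            break
    return pages


# 2,500 characters of repeating digits so that overlaps are easy to inspect.
text = "".join(str(i % 10) for i in range(2500))

default_pages = split_into_pages(text)
smaller_pages = split_into_pages(text, page_size=800, page_overlap=80)

print(len(default_pages), len(smaller_pages))
```

Smaller pages give the retriever more, tighter candidates at the price of less context per page; the overlap keeps a sentence that straddles a boundary visible in both neighbouring pages.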
Full reference: PageIndexRAG API
Quick Reference Card
# ── Imports ──────────────────────────────────────────────────────────
from ractogateway import openai_developer_kit as gpt
from ractogateway.prompts.engine import RactoPrompt, RactoFile
from ractogateway.tools.registry import tool, ToolRegistry
from ractogateway.cache import ExactMatchCache, SemanticCache
from ractogateway.routing import CostAwareRouter, RoutingTier
from ractogateway.truncation import TokenTruncator, TruncationConfig
# ── Build a prompt ───────────────────────────────────────────────────
prompt = RactoPrompt(
    role="...", aim="...", constraints=["..."], tone="...",
    output_format="text",  # or "json", "markdown", or a Pydantic class
    context="...",  # optional background knowledge
    examples=[{"input": "...", "output": "..."}],  # optional few-shot
)
# ── Create the kit ───────────────────────────────────────────────────
kit = gpt.OpenAIDeveloperKit(
    model="gpt-4o-mini",
    default_prompt=prompt,
    exact_cache=ExactMatchCache(max_size=512),
)
# ── Sync chat ────────────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(user_message="Hello!"))
print(response.content)
# ── Async chat ───────────────────────────────────────────────────────
# Must run inside an async function (or a REPL/notebook with top-level await)
response = await kit.achat(gpt.ChatConfig(user_message="Hello!"))
# ── Streaming ────────────────────────────────────────────────────────
for chunk in kit.stream(gpt.ChatConfig(user_message="Tell me a story.")):
    print(chunk.delta.text, end="", flush=True)
# ── Embeddings ───────────────────────────────────────────────────────
from ractogateway._models.embedding import EmbeddingConfig
resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
vec = resp.vectors[0].embedding # list[float]
# ── Tool calling ─────────────────────────────────────────────────────
@tool
def get_price(product: str) -> float:
    """Get the price of a product."""
    return 9.99

registry = ToolRegistry()
registry.register(get_price)
response = kit.chat(gpt.ChatConfig(
    user_message="How much is a widget?",
    tools=registry,
))
# ── Chain of Thought ─────────────────────────────────────────────────
response = kit.chat(gpt.ChatConfig(
    user_message="Explain why √2 is irrational.",
    chain_of_thought=True,  # step-by-step reasoning in the answer
))
# ── Native Thinking (Anthropic / Gemini) ─────────────────────────────
from ractogateway import anthropic_developer_kit as claude
claude_kit = claude.Chat(model="claude-opus-4-6")
response = claude_kit.chat(claude.ChatConfig(
    user_message="Design a cache-invalidation strategy.",
    native_thinking=True,
    thinking_budget=8000,  # max internal reasoning tokens
))
print(response.thinking) # raw reasoning
print(response.content) # polished answer
# ── PageIndexRAG (no embeddings) ─────────────────────────────────────
from ractogateway.rag.page_index import PageIndexRAG
rag = PageIndexRAG(kit=kit)
rag.add_document("handbook.pdf")
result = rag.search("What developer kits are supported?")
print(result.answer)
# ── Pipelines ────────────────────────────────────────────────────────
from ractogateway.pipelines import (
    SQLAnalystPipeline,
    ListClassifierPipeline,
    VideoProcessorPipeline,
    AgentPipeline,
    TranscriberBackend,
)
# SQL
sql = SQLAnalystPipeline(kit=kit)
sql_result = sql.run("Top 5 products", connection_string="postgresql://...")
print(sql_result.answer)
# Classifier
clf = ListClassifierPipeline(kit=kit, options=["Billing", "Tech Support"])
print(clf.run("I can't log in").first)
# Video
vp = VideoProcessorPipeline(
    kit=kit,
    transcriber=TranscriberBackend.FASTER_WHISPER,
    generate_summary=True,
)
vp_result = vp.run("lecture.mp4")
print(vp_result.summary)
# Agent
def search_web(query: str) -> str:
    """Search the web for information."""
    return f"Results for: {query}"

agent = AgentPipeline(kit=kit, tools=[search_web], max_steps=6)
print(agent.run("What is the capital of France?").final_answer)