Cache

Models

Shared data models for the caching subsystem.

class ractogateway.cache._models.CacheConfig(**data)[source]

Bases: BaseModel

Configuration for cache instances.

Parameters:

max_size (int) – Maximum number of entries to hold. When full, the least-recently-used entry is evicted (LRU policy). 0 means unlimited.
ttl_seconds (float | None) – Time-to-live in seconds. Entries older than this are treated as misses and evicted lazily. None disables TTL.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

max_size: int

ttl_seconds: float | None

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.cache._models.CacheEntry(**data)[source]

Bases: BaseModel

A single cached LLM response.

response: LLMResponse

created_at: float

hit_count: int

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.cache._models.CacheStats(**data)[source]

Bases: BaseModel

Snapshot of cache performance counters.

hits: int

misses: int

size: int

property total: int

Total requests seen by the cache.

property hit_rate: float

Fraction of requests that were cache hits (0.0 to 1.0).

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
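
The two derived properties follow from the raw counters. The sketch below illustrates that arithmetic with a hypothetical stand-in dataclass (StatsSketch is not the library's Pydantic model). It assumes that every request counts as either a hit or a miss, and that an unused cache reports 0.0; neither is stated by the documentation above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatsSketch:
    """Hypothetical stand-in for CacheStats (not the library class)."""

    hits: int = 0
    misses: int = 0
    size: int = 0

    @property
    def total(self) -> int:
        # Assumption: every lookup is counted as either a hit or a miss.
        return self.hits + self.misses

    @property
    def hit_rate(self) -> float:
        # Assumption: report 0.0 instead of dividing by zero on an unused cache.
        return self.hits / self.total if self.total else 0.0


stats = StatsSketch(hits=30, misses=10, size=25)
print(stats.total, stats.hit_rate)  # 40 0.75
```

A hit rate near 0.0 after sustained traffic usually suggests that requests rarely repeat exactly, which is the situation the semantic cache is meant to address.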

class ractogateway.cache._models.SemanticCacheConfig(**data)[source]

Bases: BaseModel

Configuration for the semantic similarity cache.

Parameters:

threshold (float) – Minimum cosine similarity (0.0-1.0) required to declare a cache hit. Defaults to 0.95 (very strict — avoids false positives).
max_size (int) – Maximum entries before LRU eviction. 0 means unlimited.
ttl_seconds (float | None) – Optional TTL; None disables expiry.

threshold: float

max_size: int

ttl_seconds: float | None

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.cache._models.SemanticCacheEntry(**data)[source]

Bases: BaseModel

One entry in the semantic cache, pairing an embedding with a response.

vector: list[float]

response: LLMResponse

created_at: float

hit_count: int

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(_SemanticCacheEntry__context)[source]

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Return type: None

Exact Match Cache

Exact-match key-value cache with LRU eviction and optional TTL.

Uses collections.OrderedDict for O(1) get / put / evict — a standard least-recently-used (LRU) cache pattern. No external dependencies.

Thread-safety is provided by a threading.Lock so the cache is safe to share across threads without any external synchronisation.
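
The pattern described above (an OrderedDict for recency order, lazy TTL checks on read, and a lock around every operation) can be sketched as follows. This is an illustrative reconstruction, not the ExactMatchCache source: the class name LruTtlSketch and the optional now argument (added so the example runs deterministically) are assumptions.

```python
import threading
import time
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LruTtlSketch:
    """Illustrative LRU + TTL cache; not the ExactMatchCache source."""

    def __init__(self, max_size: int = 1024, ttl_seconds: Optional[float] = None) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._data: OrderedDict = OrderedDict()  # key -> (created_at, value)
        self._lock = threading.Lock()

    def get(self, key: Hashable, now: Optional[float] = None) -> Any:
        now = time.monotonic() if now is None else now
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return None
            created_at, value = item
            if self._ttl is not None and now - created_at > self._ttl:
                del self._data[key]  # lazy eviction: expired entries go on read
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return value

    def put(self, key: Hashable, value: Any, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        with self._lock:
            self._data[key] = (now, value)
            self._data.move_to_end(key)
            # max_size == 0 means unlimited, so never evict in that case.
            if self._max_size and len(self._data) > self._max_size:
                self._data.popitem(last=False)  # drop the least recently used


cache = LruTtlSketch(max_size=2, ttl_seconds=10.0)
cache.put("a", 1, now=0.0)
cache.put("b", 2, now=0.0)
cache.get("a", now=1.0)     # "a" becomes most recently used
cache.put("c", 3, now=1.0)  # over capacity: evicts "b"
```

Both move_to_end and popitem(last=False) run in constant time on an OrderedDict, which is where the O(1) bound comes from.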

class ractogateway.cache.exact_cache.ExactMatchCache(max_size=1024, ttl_seconds=None)[source]

Bases: object

Ultra-low-latency key-value cache for identical LLM requests.

Parameters:

max_size (int) – LRU capacity. 0 = unlimited (no eviction).
ttl_seconds (float | None) – Entries older than ttl_seconds are treated as misses and transparently evicted. None disables expiry.
Example:

from ractogateway.cache import ExactMatchCache
from ractogateway.openai_developer_kit import OpenAIDeveloperKit

cache = ExactMatchCache(max_size=512, ttl_seconds=3600)

# Wire into a kit:
kit = OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache)

get(user_message, system_prompt, model, temperature, max_tokens)[source]

Return a cached response or None on a miss.

O(1) — dictionary lookup + optional move-to-end.

Return type: LLMResponse | None

put(user_message, system_prompt, model, temperature, max_tokens, response)[source]

Store a response. Evicts LRU entry when at capacity.

O(1) amortised — dictionary insert + optional popitem(last=False).

Return type: None

invalidate(user_message, system_prompt, model, temperature, max_tokens)[source]

Remove a specific entry. Returns True if it was present.

Return type: bool
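
get(), put() and invalidate() all identify an entry by the same five request parameters, so changing any one of them produces a different entry. How the library combines them into a key is not documented here. One plausible approach, shown purely as an assumption, hashes a canonical JSON encoding of the five values; make_key and the parameter types are hypothetical.

```python
import hashlib
import json
from typing import Optional


def make_key(
    user_message: str,
    system_prompt: str,
    model: str,
    temperature: float,
    max_tokens: Optional[int],
) -> str:
    """Hypothetical helper: derive one stable key from the five parameters."""
    # A JSON list keeps field boundaries unambiguous, unlike plain concatenation.
    payload = json.dumps(
        [user_message, system_prompt, model, temperature, max_tokens],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = make_key("Hi", "Be brief.", "gpt-4o", 0.0, 256)
k2 = make_key("Hi", "Be brief.", "gpt-4o", 0.0, 256)
k3 = make_key("Hi", "Be brief.", "gpt-4o", 0.7, 256)
print(k1 == k2, k1 == k3)  # True False
```

Identical requests map to the same key, while a different temperature alone is enough to cause a miss.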

clear()[source]

Evict all cached entries and reset counters.

Return type: None

property stats: CacheStats

Return a snapshot of hit/miss/size counters.

Semantic Cache

Semantic similarity cache backed by any embedding function.

Caches LLM responses by semantic meaning rather than exact string match. When a new query arrives, it is embedded and compared (by cosine similarity) against all stored query embeddings. If the best match meets or exceeds the configured threshold, the cached response is returned without a new API call, saving cost and latency.

Embedding protocol: The cache accepts any callable (text: str) -> list[float]. Wire in the kit’s own embed method, a RAG embedder, or any other embedding service:

from ractogateway.cache import SemanticCache

def my_embedder(text: str) -> list[float]:
    # call your embedding API here
    return [0.1, 0.2, 0.3]  # placeholder; a real embedder returns the model's full vector

cache = SemanticCache(embed_fn=my_embedder, similarity_threshold=0.95)

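Complexity: O(n) similarity comparisons per lookup, where n is the number of stored entries; each comparison costs O(d) for embeddings of dimension d. For large caches (more than about 10,000 entries), consider using a proper ANN index (e.g. FAISS) as the embedder backend and reducing max_size.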
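
The linear scan behind that cost can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the SemanticCache implementation: the helper names cosine_similarity and best_match are hypothetical, responses are plain strings rather than LLMResponse objects, and treating a zero vector as matching nothing is a choice made for the example.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def best_match(
    query_vec: Sequence[float],
    entries: Sequence[tuple[Sequence[float], str]],
    threshold: float = 0.95,
) -> Optional[str]:
    """Linear scan over (vector, response) pairs: O(n * d) per lookup."""
    best_score, best_response = -1.0, None
    for vector, response in entries:
        score = cosine_similarity(query_vec, vector)
        if score > best_score:
            best_score, best_response = score, response
    # Only the single best candidate is checked against the threshold.
    return best_response if best_score >= threshold else None


entries = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
print(best_match([0.99, 0.05], entries))  # answer A
print(best_match([0.7, 0.7], entries))    # None
```

The second query sits halfway between both stored vectors (similarity about 0.71 to each), so a strict 0.95 threshold correctly rejects it.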

class ractogateway.cache.semantic_cache.SemanticCache(embed_fn, similarity_threshold=0.95, max_size=512, ttl_seconds=None)[source]

Bases: object

Vector-similarity cache that returns cached answers for semantically similar queries, so a hit requires no new LLM API call.

Parameters:

embed_fn (Callable[[str], list[float]]) – Any callable (text: str) -> list[float]. Called once per new query (cache miss) and once at put() time.
similarity_threshold (float) – Minimum cosine similarity to declare a hit. Default 0.95 is intentionally strict to avoid incorrect responses.
max_size (int) – Maximum number of entries (LRU eviction). 0 = unlimited.
ttl_seconds (float | None) – Optional per-entry TTL. None disables expiry.

Examples

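import openai

import ractogateway.openai_developer_kit as gpt
from ractogateway.cache import SemanticCache

kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # One embeddings request per call; returns the vector for `text`.
    r = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return r.data[0].embedding

cache = SemanticCache(embed_fn=embed, similarity_threshold=0.95)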

get(query)[source]

Embed query and return a cached response if its cosine similarity to the best-matching stored query is at least the threshold.

Returns None on a cache miss (caller should make the real API call and then invoke put()).

Complexity: O(n·d) where n = number of entries, d = embedding dim.

Return type: LLMResponse | None
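
The get-then-put flow described for a miss is the usual cache-aside pattern. The sketch below shows it against any object that exposes get(query) and put(query, response). The names cached_answer, DictCache and fake_llm are hypothetical, and plain strings stand in for LLMResponse so the example runs without an embedder or API key.

```python
from typing import Callable, Optional, Protocol


class ResponseCache(Protocol):
    """The two methods the pattern relies on (same shape as get/put above)."""

    def get(self, query: str) -> Optional[str]: ...
    def put(self, query: str, response: str) -> None: ...


def cached_answer(cache: ResponseCache, query: str, call_llm: Callable[[str], str]) -> str:
    """Hypothetical helper: consult the cache first, fall back to the API."""
    cached = cache.get(query)
    if cached is not None:
        return cached           # hit: no API call made
    response = call_llm(query)  # miss: make the real call once
    cache.put(query, response)  # store it for future similar queries
    return response


class DictCache:
    """Toy exact-match stand-in so the sketch runs without an embedder."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get(self, query: str) -> Optional[str]:
        return self._store.get(query)

    def put(self, query: str, response: str) -> None:
        self._store[query] = response


calls: list[str] = []


def fake_llm(query: str) -> str:
    calls.append(query)  # record each simulated API call
    return f"echo: {query}"


cache = DictCache()
first = cached_answer(cache, "What is LRU?", fake_llm)
second = cached_answer(cache, "What is LRU?", fake_llm)
print(first == second, len(calls))  # True 1
```

The second request is served from the cache, so the simulated API is called only once.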

put(query, response)[source]

Embed query and store response for future similar queries.

Evicts LRU entry when at capacity.

Return type: None

clear()[source]

Remove all entries and reset counters.

Return type: None

property stats: CacheStats

Return a snapshot of hit/miss/size counters.