ractogateway.cache.semantic_cache

Semantic similarity cache backed by any embedding function.

Caches LLM responses by semantic meaning rather than exact string match. When a new query arrives, it is embedded and its cosine similarity is computed against every stored query embedding. If the best match meets or exceeds the configured threshold, the cached response is returned without making a new LLM API call, which saves cost and latency.
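The lookup described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: cosine_similarity and best_match are hypothetical helper names, and the stored entries are modeled as a plain list of (embedding, response) pairs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def best_match(query_vec, entries, threshold=0.95):
    """Linear scan over all entries: return the response whose embedding
    is most similar to query_vec, or None if the best similarity is
    below the threshold."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


# Toy 2-dimensional embeddings, chosen by hand for illustration.
entries = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
hit = best_match([0.99, 0.05], entries)   # close to the first entry
miss = best_match([0.7, 0.7], entries)    # roughly 0.707 to both, below 0.95
```

Because every stored embedding is visited, the scan cost grows linearly with the number of entries.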

Embedding protocol: The cache accepts any callable (text: str) -> list[float]. Wire in the kit’s own embed method, a RAG embedder, or any other embedding service:

from ractogateway.cache import SemanticCache

def my_embedder(text: str) -> list[float]:
    # call your embedding API here
    return [0.1, 0.2, ...]

cache = SemanticCache(embed_fn=my_embedder, similarity_threshold=0.95)

Complexity: O(n) similarity comparisons per lookup, where n is the number of stored entries. For large caches (more than 10k entries), consider using a proper ANN index (e.g. FAISS) as the embedder backend and reducing max_size.

class ractogateway.cache.semantic_cache.SemanticCache(embed_fn, similarity_threshold=0.95, max_size=512, ttl_seconds=None)[source]

Bases: object

Vector-similarity cache that returns cached answers for semantically similar queries, so a cache hit requires no new LLM API call.

Parameters:
  • embed_fn (Callable[[str], list[float]]) – Any callable (text: str) -> list[float]. Called once per new query (cache miss) and once at put() time.

  • similarity_threshold (float) – Minimum cosine similarity to declare a hit. Default 0.95 is intentionally strict to avoid incorrect responses.

  • max_size (int) – Maximum number of entries (LRU eviction). 0 = unlimited.

  • ttl_seconds (float | None) – Optional per-entry TTL. None disables expiry.
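To make the max_size and ttl_seconds semantics concrete, here is a small sketch of LRU eviction combined with an optional per-entry TTL. TinyLRU is a hypothetical stand-in written for this illustration; it keys on exact strings rather than embeddings, and the real class may organize its storage differently.

```python
import time
from collections import OrderedDict


class TinyLRU:
    """Hypothetical stand-in showing LRU eviction plus optional TTL."""

    def __init__(self, max_size=512, ttl_seconds=None, clock=time.monotonic):
        self.max_size = max_size          # 0 means unlimited
        self.ttl_seconds = ttl_seconds    # None disables expiry
        self._clock = clock
        self._data = OrderedDict()        # key -> (stored_at, value)

    def put(self, key, value):
        if key in self._data:
            del self._data[key]  # re-insert so the key becomes most recent
        elif self.max_size and len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # evict the least recently used
        self._data[key] = (self._clock(), value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        expired = (
            self.ttl_seconds is not None
            and self._clock() - stored_at > self.ttl_seconds
        )
        if expired:
            del self._data[key]
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value


lru = TinyLRU(max_size=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")       # "a" becomes the most recently used entry
lru.put("c", 3)    # at capacity, so "b" (least recently used) is evicted
```

The injectable clock argument is only there so expiry can be exercised without sleeping.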

Examples

import ractogateway.openai_developer_kit as gpt
from ractogateway.cache import SemanticCache

kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

def embed(text: str) -> list[float]:
    import openai
    r = openai.OpenAI().embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return r.data[0].embedding

cache = SemanticCache(embed_fn=embed, similarity_threshold=0.95)

get(query)[source]

Embed query and return a cached response if the best cosine similarity ≥ similarity_threshold.

Returns None on a cache miss (caller should make the real API call and then invoke put()).

Complexity: O(n·d) where n = number of entries, d = embedding dim.

Return type:

LLMResponse | None

put(query, response)[source]

Embed query and store response for future similar queries.

Evicts the least recently used (LRU) entry when the cache is at capacity.

Return type:

None
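The get()/put() contract suggests a simple read-through pattern: try the cache, fall back to the real call on a miss, then store the result. The sketch below assumes only the get(query) and put(query, response) signatures documented here; cached_call, ExactMatchCache, and fake_llm are hypothetical names used so the example runs without the library or network access.

```python
from typing import Any, Callable, Optional


def cached_call(cache: Any, query: str, call_llm: Callable[[str], Any]) -> Any:
    """Return a cached response if one exists; otherwise call the model
    and store the result for future similar queries."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.put(query, response)
    return response


class ExactMatchCache:
    """Stand-in with the same get/put shape, for demonstration only.
    It matches exact strings instead of comparing embeddings."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def get(self, query: str) -> Optional[Any]:
        return self._store.get(query)

    def put(self, query: str, response: Any) -> None:
        self._store[query] = response


calls: list[str] = []


def fake_llm(query: str) -> str:
    """Records each invocation so the demo can count real calls."""
    calls.append(query)
    return f"response to {query!r}"


demo_cache = ExactMatchCache()
first = cached_call(demo_cache, "What is a cache?", fake_llm)   # miss: calls fake_llm
second = cached_call(demo_cache, "What is a cache?", fake_llm)  # hit: served from cache
```

With a SemanticCache instance in place of the stand-in, the same helper should also serve rephrased queries whose embeddings are close enough to a stored one.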

clear()[source]

Remove all entries and reset counters.

Return type:

None

property stats: CacheStats

Return a snapshot of hit/miss/size counters.