ractogateway.cache.semantic_cache
Semantic similarity cache backed by any embedding function.
Caches LLM responses by semantic meaning rather than exact string match. When a new query arrives, it is embedded and compared (cosine similarity) against all stored query embeddings. If the best match meets or exceeds the configured threshold, the cached response is returned without a new API call, saving cost and latency.
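The lookup described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `cosine_similarity` and `best_match`, and the list-of-pairs storage layout, are assumptions made for the example.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def best_match(
    query_vec: list[float],
    stored: list[tuple[list[float], str]],
    threshold: float,
) -> Optional[str]:
    """Linear scan over (embedding, response) pairs.

    Returns the response with the highest similarity to query_vec,
    or None if that similarity is below the threshold.
    """
    best_score, best_response = -1.0, None
    for vec, response in stored:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


# Hypothetical two-dimensional "embeddings" for illustration.
stored = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
near_hit = best_match([1.0, 0.01], stored, threshold=0.95)
miss = best_match([1.0, 1.0], stored, threshold=0.95)
```

Each lookup visits every stored entry and does work proportional to the embedding dimension per entry, which is why a strict threshold and a bounded cache size both matter.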
Embedding protocol: The cache accepts any callable
(text: str) -> list[float]. Wire in the kit’s own embed method, a RAG
embedder, or any other embedding service:
    from ractogateway.cache import SemanticCache

    def my_embedder(text: str) -> list[float]:
        # call your embedding API here
        return [0.1, 0.2, ...]

    cache = SemanticCache(embed_fn=my_embedder, similarity_threshold=0.95)
Complexity: O(n) similarity comparisons per lookup, where n = number of stored entries. For large caches (> 10k entries), consider using a proper ANN index (e.g. FAISS) as the embedder backend and reducing max_size.
- class ractogateway.cache.semantic_cache.SemanticCache(embed_fn, similarity_threshold=0.95, max_size=512, ttl_seconds=None)
Bases: object
Vector-similarity cache that returns cached answers for semantically similar queries, costing $0 in API calls.
- Parameters:
  embed_fn (Callable[[str], list[float]]) – Any callable (text: str) -> list[float]. Called once per new query (cache miss) and once at put() time.
  similarity_threshold (float) – Minimum cosine similarity to declare a hit. The default 0.95 is intentionally strict to avoid incorrect responses.
  max_size (int) – Maximum number of entries (LRU eviction). 0 = unlimited.
  ttl_seconds (float | None) – Optional per-entry TTL. None disables expiry.
Examples
    import openai
    import ractogateway.openai_developer_kit as gpt
    from ractogateway.cache import SemanticCache

    kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

    def embed(text: str) -> list[float]:
        # Request an embedding vector for the text from the OpenAI API.
        r = openai.OpenAI().embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return r.data[0].embedding

    cache = SemanticCache(embed_fn=embed, similarity_threshold=0.95)
- get(query)
Embed query and return a cached response if cosine-sim ≥ threshold.
Returns None on a cache miss (the caller should make the real API call and then invoke put()).
Complexity: O(n·d), where n = number of entries and d = embedding dimension.
- Return type:
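The miss-then-put flow described for get() can be wrapped in a small helper. This is a hedged sketch: `cached_call`, the `QueryCache` protocol, and `ExactMatchCache` are hypothetical names introduced here, and the stand-in cache uses exact string keys only so the example runs without the library or an embedding service. It assumes nothing about the cache beyond get(query) returning None on a miss and put(query, response) storing the response.

```python
from typing import Callable, Optional, Protocol


class QueryCache(Protocol):
    """Hypothetical structural type: any object with get() and put()."""

    def get(self, query: str) -> Optional[object]: ...
    def put(self, query: str, response: object) -> None: ...


def cached_call(
    cache: QueryCache, query: str, call_llm: Callable[[str], object]
) -> object:
    """Return a cached response if available; otherwise call and store."""
    response = cache.get(query)
    if response is not None:
        return response  # cache hit: no API call is made
    response = call_llm(query)  # cache miss: make the real call
    cache.put(query, response)  # store it for future similar queries
    return response


class ExactMatchCache:
    """Stand-in for demonstration only: exact keys, no embeddings."""

    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    def get(self, query: str) -> Optional[object]:
        return self._data.get(query)

    def put(self, query: str, response: object) -> None:
        self._data[query] = response


calls: list[str] = []


def fake_llm(query: str) -> str:
    """Records each invocation so the number of real calls is visible."""
    calls.append(query)
    return query.upper()


demo_cache = ExactMatchCache()
first = cached_call(demo_cache, "hello", fake_llm)   # miss: calls fake_llm
second = cached_call(demo_cache, "hello", fake_llm)  # hit: served from cache
```

With a semantic cache in place of the stand-in, the second call would also be served from the cache for a rephrased query whose similarity clears the threshold.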
- put(query, response)
Embed query and store response for future similar queries.
Evicts LRU entry when at capacity.
- Return type:
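The eviction and expiry behaviour named by max_size and ttl_seconds can be illustrated with an ordered dictionary. This sketch is an assumption about one common way to build such a store, not the library's code: `LruTtlStore`, its injectable `clock` parameter, and the choice to refresh recency on every successful read are all hypothetical.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class LruTtlStore:
    """Hypothetical key-value store with LRU eviction and optional TTL."""

    def __init__(
        self,
        max_size: int = 512,
        ttl_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_size = max_size        # 0 means unlimited
        self.ttl_seconds = ttl_seconds  # None disables expiry
        self._clock = clock
        self._entries: "OrderedDict[str, tuple[float, object]]" = OrderedDict()

    def put(self, key: str, value: object) -> None:
        if key in self._entries:
            self._entries.pop(key)  # re-inserted below as most recent
        elif self.max_size > 0 and len(self._entries) >= self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[object]:
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        expired = (
            self.ttl_seconds is not None
            and self._clock() - stored_at > self.ttl_seconds
        )
        if expired:
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def __len__(self) -> int:
        return len(self._entries)


store = LruTtlStore(max_size=2)
store.put("a", 1)
store.put("b", 2)
store.get("a")     # "a" becomes the most recently used entry
store.put("c", 3)  # at capacity, so "b" (least recently used) is evicted
```

Passing the clock in as a parameter keeps expiry testable without sleeping; a real semantic cache would key entries by embedding rather than by string.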
- property stats: CacheStats
Return a snapshot of hit/miss/size counters.