ractogateway.huggingface_developer_kit
HuggingFace Developer Kit — from ractogateway import huggingface_developer_kit as hf.
Short usage:
from ractogateway import huggingface_developer_kit as hf
# Cloud inference (set HF_TOKEN env var)
kit = hf.Chat(model="meta-llama/Llama-3.2-3B-Instruct")
# Local TGI / vLLM server (no token needed)
kit = hf.Chat(model="tgi", base_url="http://localhost:8080")
# Full class name (identical)
kit = hf.HuggingFaceDeveloperKit(model="meta-llama/Llama-3.2-3B-Instruct")
- ractogateway.huggingface_developer_kit.Chat
Short alias: hf.Chat(model="...") is identical to hf.HuggingFaceDeveloperKit(...).
- class ractogateway.huggingface_developer_kit.ChatConfig(**data)[source]
Bases: BaseModel
Validated input for every chat / achat / stream / astream call.
Pass a single ChatConfig to any developer-kit method. Every field has a safe default so you only need to supply what you actually need.
Minimal example:
config = ChatConfig(user_message="Explain Python generators.")
response = kit.chat(config)
Vision / multimodal example:
from ractogateway.prompts.engine import RactoFile
config = ChatConfig(
    user_message="Describe this chart.",
    attachments=[RactoFile.from_path("sales_q4.png")],
)
Structured JSON output example:
class Sentiment(BaseModel):
    label: str
    score: float
config = ChatConfig(
    user_message="I love this library!",
    response_model=Sentiment,
)
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- user_message: str
- prompt: RactoPrompt | None
- temperature: float
- max_tokens: int
- tools: ToolRegistry | None
- auto_execute_tools: bool
- max_tool_turns: int
- max_validation_retries: int
- history: list[Message]
- chain_of_thought: bool
- native_thinking: bool
- thinking_budget: int
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- class ractogateway.huggingface_developer_kit.CostAwareRouter(tiers)[source]
Bases: object
Routes LLM requests to the appropriate model tier based on message complexity, without making any extra API calls.
- Parameters:
tiers (list[RoutingTier]) – Ordered list of RoutingTier objects, sorted ascending by max_score (cheapest first). The last tier’s max_score should be 100 to act as fallback.
- Raises:
ValueError – If tiers is empty or not sorted ascending by max_score.
Example (3-tier OpenAI ladder):
from ractogateway.routing import CostAwareRouter, RoutingTier
router = CostAwareRouter([
    RoutingTier(model="gpt-4o-mini", max_score=30),
    RoutingTier(model="gpt-4o", max_score=70),
    RoutingTier(model="o3-mini", max_score=100),
])
model = router.route("What is 2+2?")  # → "gpt-4o-mini"
model = router.route(
    "Analyze the trade-offs between Redis Cluster and "
    "Cassandra for a write-heavy time-series workload …"
)  # → "o3-mini"
Example (binary routing, 2 tiers):
router = CostAwareRouter([
    RoutingTier(model="claude-haiku-4-5-20251001", max_score=40),
    RoutingTier(model="claude-opus-4-6", max_score=100),
])
- score(text)[source]
Compute a complexity score in [0, 100] for text.
A higher score means a more complex task.
- Return type:
Algorithm
token_pts = min(len(text)//4, SAT) * (MAX_TP / SAT)
kw_pts = min(matches * PPK, MAX_KP)
score = clamp(token_pts + kw_pts, 0, 100)
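The formula above can be sketched as a small standalone function. The values of SAT, MAX_TP, PPK and MAX_KP, as well as the keyword list, are not given in this reference; the ones below are placeholders chosen only so the example runs.

```python
# Placeholder constants: assumptions for illustration, not the library's values.
SAT = 500        # estimated token count at which the length component saturates
MAX_TP = 60.0    # maximum points contributed by length
PPK = 10.0       # points per matched keyword
MAX_KP = 40.0    # maximum points contributed by keywords
KEYWORDS = ("analyze", "trade-off", "compare", "design", "prove")  # hypothetical


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def complexity_score(text: str) -> float:
    """Length points plus keyword points, clamped to [0, 100]."""
    # len(text) // 4 is a rough token estimate (about 4 characters per token).
    token_pts = min(len(text) // 4, SAT) * (MAX_TP / SAT)
    lowered = text.lower()
    matches = sum(1 for keyword in KEYWORDS if keyword in lowered)
    kw_pts = min(matches * PPK, MAX_KP)
    return clamp(token_pts + kw_pts, 0.0, 100.0)
```

With these placeholder constants a short factual question scores close to 0, while a long prompt containing several of the keywords approaches 100.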
- route(text)[source]
Return the model identifier for text.
Walks tiers (cheapest first) and returns the first model whose max_score ≥ complexity_score. Always returns a model because the last tier has max_score == 100 (validated at construction).
Complexity: O(k) where k = number of tiers.
- Return type:
- property tiers: tuple[RoutingTier, ...]
Immutable view of the configured tiers.
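The tier walk described for route() might look roughly like the sketch below. Tier and pick_model are hypothetical stand-ins for RoutingTier and the router's internal lookup; the final fallback line is an assumption that mirrors the documented behaviour of the last tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical stand-in for RoutingTier."""
    model: str
    max_score: float


def pick_model(tiers: list[Tier], score: float) -> str:
    """Return the first (cheapest) model whose max_score covers the score."""
    if not tiers:
        raise ValueError("tiers must not be empty")
    for tier in tiers:  # tiers are assumed to be sorted ascending by max_score
        if tier.max_score >= score:  # the upper bound is inclusive
            return tier.model
    return tiers[-1].model  # last tier acts as the catch-all


ladder = [Tier("gpt-4o-mini", 30), Tier("gpt-4o", 70), Tier("o3-mini", 100)]
```

Because the bound is inclusive, a score of exactly 30 still maps to the cheapest tier in this ladder.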
- class ractogateway.huggingface_developer_kit.EmbeddingConfig(**data)[source]
Bases: BaseModel
Validated input for embed / aembed calls.
Example:
config = EmbeddingConfig(texts=["Hello world", "Goodbye world"])
response = kit.embed(config)
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.huggingface_developer_kit.EmbeddingResponse(**data)[source]
Bases: BaseModel
Unified response from an embedding call.
- vectors: list[EmbeddingVector]
- model: str
- raw: Any
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
- class ractogateway.huggingface_developer_kit.EmbeddingVector(**data)[source]
Bases: BaseModel
A single embedding result.
- index: int
- text: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.huggingface_developer_kit.ExactMatchCache(max_size=1024, ttl_seconds=None)[source]
Bases: object
Ultra-low-latency key-value cache for identical LLM requests.
- Parameters:
max_size (int) – LRU capacity. 0 = unlimited (no eviction).
ttl_seconds (float | None) – Entries older than ttl_seconds are treated as misses and transparently evicted. None disables expiry.
Example:
from ractogateway.cache import ExactMatchCache
cache = ExactMatchCache(max_size=512, ttl_seconds=3600)
# Wire into a kit:
kit = OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache)
- get(user_message, system_prompt, model, temperature, max_tokens)[source]
Return a cached response or None on a miss.
O(1): dictionary lookup + optional move-to-end.
- Return type:
- put(user_message, system_prompt, model, temperature, max_tokens, response)[source]
Store a response. Evicts LRU entry when at capacity.
O(1) amortised — dictionary insert + optional popitem(last=False).
- Return type:
- invalidate(user_message, system_prompt, model, temperature, max_tokens)[source]
Remove a specific entry. Returns True if it was present.
- Return type:
- property stats: CacheStats
Return a snapshot of hit/miss/size counters.
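The LRU and TTL behaviour described for get() and put() can be illustrated with collections.OrderedDict. TinyExactCache is a hypothetical, simplified class written for this example; it is not the library's implementation, and it keys entries on whatever hashable request fields are passed in.

```python
import time
from collections import OrderedDict
from typing import Any, Optional


class TinyExactCache:
    """Illustrative LRU + TTL cache (assumed design, not ractogateway code)."""

    def __init__(self, max_size: int = 1024, ttl_seconds: Optional[float] = None) -> None:
        self._max_size = max_size  # 0 means unlimited
        self._ttl = ttl_seconds    # None disables expiry
        self._store: "OrderedDict[tuple, tuple[float, Any]]" = OrderedDict()

    def get(self, *key_parts: Any) -> Optional[Any]:
        entry = self._store.get(key_parts)
        if entry is None:
            return None
        stored_at, response = entry
        if self._ttl is not None and time.monotonic() - stored_at > self._ttl:
            del self._store[key_parts]  # expired entries count as misses
            return None
        self._store.move_to_end(key_parts)  # mark as most recently used
        return response

    def put(self, *args: Any) -> None:
        *key_parts, response = args  # the last positional argument is the response
        key = tuple(key_parts)
        self._store[key] = (time.monotonic(), response)
        self._store.move_to_end(key)
        if self._max_size and len(self._store) > self._max_size:
            self._store.popitem(last=False)  # evict the least recently used entry


cache = TinyExactCache(max_size=2)
cache.put("a", "sys", "model", 0.0, 10, "A")
cache.put("b", "sys", "model", 0.0, 10, "B")
cache.get("a", "sys", "model", 0.0, 10)       # touching "a" makes "b" the LRU entry
cache.put("c", "sys", "model", 0.0, 10, "C")  # evicts "b"
```

Both operations stay O(1) because OrderedDict supports constant-time move_to_end and popitem.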
- class ractogateway.huggingface_developer_kit.FinishReason(*values)[source]
Why the model stopped generating.
- STOP = 'stop'
- TOOL_CALL = 'tool_call'
- LENGTH = 'length'
- CONTENT_FILTER = 'content_filter'
- ERROR = 'error'
- class ractogateway.huggingface_developer_kit.HuggingFaceDeveloperKit(model='meta-llama/Llama-3.2-3B-Instruct', *, api_key=None, base_url=None, embedding_model='sentence-transformers/all-MiniLM-L6-v2', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)[source]
Bases: object
Complete HuggingFace developer kit: chat, stream, embeddings, and optional performance/cost optimisation middleware.
Works with both the HuggingFace Inference API (cloud) and local deployments (TGI / vLLM / Llama.cpp) via base_url.
- Parameters:
model (str) – HuggingFace model repo ID (e.g. "meta-llama/Llama-3.2-3B-Instruct"). For local servers use any identifier the server expects (e.g. "tgi"). Use "auto" when a CostAwareRouter is provided; the router will select the model per request.
api_key (str | None) – HuggingFace token. Falls back to the HF_TOKEN and then the HUGGINGFACE_TOKEN environment variables.
base_url (str | None) – Custom endpoint URL. When set, requests go to the local/private server instead of the HuggingFace Inference API.
embedding_model (str) – Default model for embedding calls. Defaults to "sentence-transformers/all-MiniLM-L6-v2".
default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.
exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.
semantic_cache (SemanticCache | None) – Optional SemanticCache.
router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".
truncator (TokenTruncator | None) – Optional TokenTruncator.
tracer (RactoTracer | None) – Optional RactoTracer.
metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.
- provider: str = 'huggingface'
- chat(config)[source]
Synchronous chat completion with optional middleware pipeline.
Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.
- Return type:
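The middleware order listed for chat() can be pictured as a plain function that calls each stage in turn. Everything in this sketch is hypothetical: run_pipeline and its hook names are invented for illustration, and recording telemetry on cache hits is an assumption rather than documented behaviour.

```python
from typing import Callable, Optional


def run_pipeline(
    message: str,
    *,
    truncate: Callable[[str], str],
    exact_get: Callable[[str], Optional[str]],
    semantic_get: Callable[[str], Optional[str]],
    route: Callable[[str], str],
    call_api: Callable[[str, str], str],
    write_caches: Callable[[str, str], None],
    record: Callable[[str], None],
) -> str:
    """truncate -> exact cache -> semantic cache -> route -> API -> caches -> telemetry."""
    message = truncate(message)
    cached = exact_get(message)
    if cached is not None:
        record("exact_hit")
        return cached
    cached = semantic_get(message)
    if cached is not None:
        record("semantic_hit")
        return cached
    model = route(message)
    response = call_api(model, message)
    write_caches(message, response)
    record("api_call")
    return response


# Toy wiring: a dict as the exact cache and a fake API that upper-cases text.
store: dict[str, str] = {}
events: list[str] = []
models_called: list[str] = []


def fake_api(model: str, message: str) -> str:
    models_called.append(model)
    return message.upper()


hooks = dict(
    truncate=lambda m: m[:5],
    exact_get=store.get,
    semantic_get=lambda m: None,
    route=lambda m: "small-model",
    call_api=fake_api,
    write_caches=store.__setitem__,
    record=events.append,
)
first = run_pipeline("hello world", **hooks)
second = run_pipeline("hello there", **hooks)  # same text after truncation
```

The second call never reaches the fake API: both messages truncate to the same text, so the exact cache answers it.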
- async achat(config)[source]
Async chat completion with optional middleware pipeline.
- Return type:
- stream(config)[source]
Synchronous streaming: yields StreamChunk objects.
Example:
for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")
- Return type:
Iterator[StreamChunk]
- async astream(config)[source]
Async streaming: yields StreamChunk objects.
- Return type:
AsyncIterator[StreamChunk]
- embed(config)[source]
Synchronous embedding via HuggingFace feature_extraction.
Example:
resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])
- Return type:
EmbeddingResponse
- async aembed(config)[source]
Async embedding via HuggingFace feature_extraction.
- Return type:
EmbeddingResponse
- class ractogateway.huggingface_developer_kit.LLMResponse(**data)[source]
Bases: BaseModel
Unified, provider-agnostic response envelope.
Every adapter’s run() method returns one of these, regardless of whether the underlying provider is OpenAI, Gemini, or Anthropic.
- tool_calls: list[ToolCallResult]
- finish_reason: FinishReason
- raw: Any
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
- class ractogateway.huggingface_developer_kit.Message(**data)[source]
Bases: BaseModel
A single conversation turn.
Used inside ChatConfig.history to provide prior conversation context to the model for multi-turn conversations.
- role: MessageRole
- content: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.huggingface_developer_kit.MessageRole(*values)[source]
Role of a single message in a conversation.
- SYSTEM = 'system'
- USER = 'user'
- ASSISTANT = 'assistant'
- class ractogateway.huggingface_developer_kit.RoutingTier(**data)[source]
Bases: BaseModel
One tier in the cost-aware routing ladder.
The router evaluates a complexity score (0-100) for each incoming message and selects the first tier whose max_score is >= that score. The last tier in the list always acts as the catch-all fallback.
- Parameters:
model (str) – The LLM model identifier to use for requests that fall in this tier (e.g. "gpt-4o-mini", "gemini-2.0-flash", "claude-haiku-4-5-20251001").
max_score (float) – Inclusive upper bound on the complexity score that routes to this model. Range: 0-100. Set to 100 for the last (most powerful) tier so it catches everything.
Examples
tiers = [
    RoutingTier(model="gpt-4o-mini", max_score=30),
    RoutingTier(model="gpt-4o", max_score=70),
    RoutingTier(model="o3-mini", max_score=100),
]
- model: str
- max_score: float
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.huggingface_developer_kit.SemanticCache(embed_fn, similarity_threshold=0.95, max_size=512, ttl_seconds=None)[source]
Bases: object
Vector-similarity cache: returns cached answers for semantically similar queries, costing $0 in API calls.
- Parameters:
embed_fn (Callable[[str], list[float]]) – Any callable (text: str) -> list[float]. Called once per new query (cache miss) and once at put() time.
similarity_threshold (float) – Minimum cosine similarity to declare a hit. Default 0.95 is intentionally strict to avoid incorrect responses.
max_size (int) – Maximum number of entries (LRU eviction). 0 = unlimited.
ttl_seconds (float | None) – Optional per-entry TTL. None disables expiry.
Examples
import ractogateway.openai_developer_kit as gpt
kit = gpt.OpenAIDeveloperKit(model="gpt-4o")
def embed(text: str) -> list[float]:
    import openai
    r = openai.OpenAI().embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return r.data[0].embedding
cache = SemanticCache(embed_fn=embed, similarity_threshold=0.95)
- get(query)[source]
Embed query and return a cached response if cosine-sim ≥ threshold.
Returns None on a cache miss (caller should make the real API call and then invoke put()).
Complexity: O(n·d) where n = number of entries, d = embedding dim.
- Return type:
- put(query, response)[source]
Embed query and store response for future similar queries.
Evicts LRU entry when at capacity.
- Return type:
- property stats: CacheStats
Return a snapshot of hit/miss/size counters.
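The O(n·d) lookup described for get() amounts to a linear scan that compares the query embedding with every stored embedding. The sketch below is a simplified illustration: TinySemanticCache and toy_embed are hypothetical, the letter-count "embedding" exists only so the example runs offline, and LRU eviction and TTL are omitted.

```python
import math
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinySemanticCache:
    """Illustrative similarity cache (assumed design, no eviction or expiry)."""

    def __init__(self, embed_fn: Callable[[str], list[float]],
                 similarity_threshold: float = 0.95) -> None:
        self._embed = embed_fn
        self._threshold = similarity_threshold
        self._entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        vec = self._embed(query)
        best_sim, best_response = -1.0, None
        for stored_vec, response in self._entries:  # n entries, d dims each
            sim = cosine(vec, stored_vec)
            if sim > best_sim:
                best_sim, best_response = sim, response
        return best_response if best_sim >= self._threshold else None

    def put(self, query: str, response: str) -> None:
        self._entries.append((self._embed(query), response))


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: counts of the letters a-z."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]


sem_cache = TinySemanticCache(embed_fn=toy_embed)
sem_cache.put("hello world", "greeting")
```

A real embedding model would replace toy_embed; the threshold then controls how close a paraphrase must be before the cached answer is reused.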
- class ractogateway.huggingface_developer_kit.StreamChunk(**data)[source]
Bases: BaseModel
A single piece of a streaming response.
Consumers iterate over StreamChunk objects; they never touch raw provider events directly.
- delta
The incremental content for this chunk.
- accumulated_text
Running concatenation of all delta.text values so far.
- finish_reason
None for intermediate chunks; set on the final chunk.
- tool_calls
Empty until the final chunk (is_final=True).
- usage
Token counts, populated on the final chunk only.
- is_final
True only for the very last chunk in the stream.
- raw
The underlying provider event (escape-hatch for advanced users).
- delta: StreamDelta
- accumulated_text: str
- accumulated_thinking: str
- is_thinking: bool
- finish_reason: FinishReason | None
- tool_calls: list[ToolCallResult]
- is_final: bool
- raw: Any
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
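The relationship between delta, accumulated_text and is_final can be shown with a small generator. Chunk and to_chunks are hypothetical stand-ins written for this example; the real StreamChunk carries more fields and is produced by the kit, not built by hand.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Chunk:
    """Hypothetical, reduced stand-in for StreamChunk."""
    delta_text: str        # incremental text for this chunk
    accumulated_text: str  # everything emitted so far, including this chunk
    is_final: bool         # True only on the last chunk


def to_chunks(pieces: list[str]) -> Iterator[Chunk]:
    """Turn raw text pieces into chunks with a running accumulated_text."""
    accumulated = ""
    for index, piece in enumerate(pieces):
        accumulated += piece
        yield Chunk(piece, accumulated, is_final=(index == len(pieces) - 1))


chunks = list(to_chunks(["Hel", "lo", "!"]))
```

Joining every delta reproduces the accumulated_text of the final chunk, which is why a consumer can either print deltas as they arrive or read the full text once is_final is set.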
- class ractogateway.huggingface_developer_kit.StreamDelta(**data)[source]
Bases: BaseModel
Incremental content produced by a single streaming event.
- text: str
- thinking: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.huggingface_developer_kit.TokenTruncator(config=None)[source]
Bases: object
Smart conversation-history trimmer.
- Parameters:
config (TruncationConfig | None) – TruncationConfig instance. If omitted, a default config is used (approximate counter, 8 k limit).
Examples
from ractogateway.truncation import TokenTruncator, TruncationConfig
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
truncator = TokenTruncator(
    TruncationConfig(
        token_counter=lambda t: len(enc.encode(t)),
        keep_first_n=2,
        keep_last_n=8,
    )
)
kit = OpenAIDeveloperKit(model="gpt-4o", truncator=truncator)
- truncate(chat_config, model)[source]
Return a copy of chat_config with trimmed history if necessary.
If the total estimated token count (system prompt + history + user_message) fits within the model’s context limit, the original ChatConfig is returned unchanged.
- Parameters:
chat_config (ChatConfig) – The chat configuration to potentially truncate.
model (str) – The resolved model name used to look up the context-window limit.
- Return type:
ChatConfig
- Returns:
ChatConfig – A new ChatConfig instance with (possibly shorter) history. The user_message and all other fields are preserved verbatim.
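One way to trim history while protecting the first and last messages is sketched below. This is an assumed strategy, not the library's algorithm: trim_history is a hypothetical helper that works on plain strings, keeps the head and tail unconditionally, and then re-admits middle messages from newest to oldest while they fit the budget.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Approximate token count: about 4 characters per token."""
    return len(text) // 4


def trim_history(
    history: list[str],
    budget: int,
    keep_first_n: int = 2,
    keep_last_n: int = 6,
    count: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Return history unchanged if it fits, otherwise drop middle messages."""
    if sum(count(m) for m in history) <= budget:
        return list(history)
    if len(history) <= keep_first_n + keep_last_n:
        return list(history)  # every message is protected, nothing to drop
    split = len(history) - keep_last_n
    head, middle, tail = history[:keep_first_n], history[keep_first_n:split], history[split:]
    used = sum(count(m) for m in head + tail)
    kept_middle: list[str] = []
    for message in reversed(middle):  # newest middle messages first
        cost = count(message)
        if used + cost > budget:
            break
        kept_middle.append(message)
        used += cost
    kept_middle.reverse()  # restore chronological order
    return head + kept_middle + tail


# Ten messages of 40 characters each, so roughly 10 tokens per message.
messages = [f"m{i}" + "x" * 38 for i in range(10)]
trimmed = trim_history(messages, budget=50, keep_first_n=1, keep_last_n=2)
```

Swapping approx_tokens for a tokenizer-backed counter changes only the count argument, which mirrors the role token_counter plays in TruncationConfig.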
- class ractogateway.huggingface_developer_kit.ToolCallResult(**data)[source]
Bases: BaseModel
A single tool/function call returned by the model.
- id: str
- name: str
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.huggingface_developer_kit.TruncationConfig(**data)[source]
Bases: BaseModel
Configuration for TokenTruncator.
- Parameters:
max_context_tokens (int | None) – Hard cap on total prompt tokens before calling the API. When None, the truncator looks up the model in MODEL_CONTEXT_LIMITS (falling back to 8192).
keep_first_n (int) – Number of history messages to always preserve from the start of the conversation (anchors context). Defaults to 2.
keep_last_n (int) – Number of history messages to always preserve from the most recent end of the conversation. Defaults to 6.
token_counter (Callable[[str], int]) – Callable (text: str) -> int. Defaults to the built-in approximate counter (len // 4). Swap for tiktoken for exact OpenAI token counts:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
config = TruncationConfig(token_counter=lambda t: len(enc.encode(t)))
safety_margin (int) – Extra token budget reserved beyond the system prompt and user message. Defaults to 512.
- keep_first_n: int
- keep_last_n: int
- safety_margin: int
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
- resolve_limit(model)[source]
Return the effective token limit for model.
Priority: max_context_tokens → MODEL_CONTEXT_LIMITS lookup → _DEFAULT_CONTEXT.
- Return type:
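The priority chain can be expressed as a short lookup. This is a sketch under stated assumptions: the table entry below is invented for the example (the real contents of MODEL_CONTEXT_LIMITS are not shown in this reference), and the function is written standalone rather than as a method.

```python
from typing import Optional

# Hypothetical table entry; only the 8192 fallback is taken from the reference.
MODEL_CONTEXT_LIMITS = {"example-model-128k": 128_000}
_DEFAULT_CONTEXT = 8192


def resolve_limit(model: str, max_context_tokens: Optional[int] = None) -> int:
    """Explicit cap first, then the per-model table, then the default."""
    if max_context_tokens is not None:
        return max_context_tokens
    return MODEL_CONTEXT_LIMITS.get(model, _DEFAULT_CONTEXT)
```

An explicit max_context_tokens always wins, so it is the simplest way to cap prompts for a local server whose model name is not in the table.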