ractogateway.huggingface_developer_kit

HuggingFace Developer Kit — from ractogateway import huggingface_developer_kit as hf.

Short usage:

from ractogateway import huggingface_developer_kit as hf

# Cloud inference (set HF_TOKEN env var)
kit = hf.Chat(model="meta-llama/Llama-3.2-3B-Instruct")

# Local TGI / vLLM server (no token needed)
kit = hf.Chat(model="tgi", base_url="http://localhost:8080")

# Full class name (identical)
kit = hf.HuggingFaceDeveloperKit(model="meta-llama/Llama-3.2-3B-Instruct")

ractogateway.huggingface_developer_kit.Chat: Short alias — hf.Chat(model="...") is identical to hf.HuggingFaceDeveloperKit(...).

class ractogateway.huggingface_developer_kit.ChatConfig(**data)[source]

Bases: BaseModel

Validated input for every chat / achat / stream / astream call.

Pass a single ChatConfig to any developer-kit method. Every field has a safe default so you only need to supply what you actually need.

Minimal example:

config = ChatConfig(user_message="Explain Python generators.")
response = kit.chat(config)

Vision / multimodal example:

from ractogateway.prompts.engine import RactoFile

config = ChatConfig(
    user_message="Describe this chart.",
    attachments=[RactoFile.from_path("sales_q4.png")],
)

Structured JSON output example:

class Sentiment(BaseModel):
    label: str
    score: float

config = ChatConfig(
    user_message="I love this library!",
    response_model=Sentiment,
)

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

user_message: str

prompt: RactoPrompt | None

temperature: float

max_tokens: int

tools: ToolRegistry | None

auto_execute_tools: bool

max_tool_turns: int

response_model: type[BaseModel] | None

max_validation_retries: int

history: list[Message]

attachments: list[RactoFile] | None

chain_of_thought: bool

native_thinking: bool

thinking_budget: int

extra: dict[str, Any]

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.CostAwareRouter(tiers)[source]

Bases: object

Routes LLM requests to the appropriate model tier based on message complexity — without making any extra API calls.

Parameters:

tiers (list[RoutingTier]) – Ordered list of RoutingTier objects, sorted ascending by max_score (cheapest first). The last tier’s max_score should be 100 to act as fallback.

Raises:

ValueError – If tiers is empty or not sorted ascending by max_score.

Example (3-tier OpenAI ladder):

from ractogateway.routing import CostAwareRouter, RoutingTier

router = CostAwareRouter([
    RoutingTier(model="gpt-4o-mini", max_score=30),
    RoutingTier(model="gpt-4o", max_score=70),
    RoutingTier(model="o3-mini", max_score=100),
])
model = router.route("What is 2+2?")  # → "gpt-4o-mini"
model = router.route(
    "Analyze the trade-offs between Redis Cluster and "
    "Cassandra for a write-heavy time-series workload …"
)  # → "o3-mini"

Example (binary routing, 2 tiers):

router = CostAwareRouter([
    RoutingTier(model="claude-haiku-4-5-20251001", max_score=40),
    RoutingTier(model="claude-opus-4-6", max_score=100),
])

score(text)[source]

Compute a complexity score in [0, 100] for text.

A higher score means a more complex task.

Return type:: int

Algorithm

token_pts = min(len(text) // 4, SAT) * (MAX_TP / SAT)
kw_pts = min(matches * PPK, MAX_KP)
score = clamp(token_pts + kw_pts, 0, 100)
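The scoring formula can be tried out in isolation. The following is a minimal sketch, not the library's code: the values of SAT, MAX_TP, PPK and MAX_KP and the keyword list are assumptions chosen for illustration, because this page names the constants without giving their values. The function name complexity_score is also hypothetical.

```python
# Hypothetical constants: the reference names SAT, MAX_TP, PPK and MAX_KP
# but does not state their values.
SAT = 500      # token estimate at which the length component saturates
MAX_TP = 60    # maximum points contributed by length
PPK = 10       # points per keyword match
MAX_KP = 40    # maximum points contributed by keywords
KEYWORDS = ("analyze", "compare", "trade-off", "prove", "design")  # illustrative only


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def complexity_score(text: str) -> int:
    """Length points plus keyword points, clamped to [0, 100]."""
    # len(text) // 4 is a rough token estimate (about four characters per token).
    # Multiplying before dividing is equivalent to * (MAX_TP / SAT).
    token_pts = min(len(text) // 4, SAT) * MAX_TP / SAT
    lowered = text.lower()
    matches = sum(1 for kw in KEYWORDS if kw in lowered)
    kw_pts = min(matches * PPK, MAX_KP)
    return int(clamp(token_pts + kw_pts, 0, 100))


short_score = complexity_score("What is 2+2?")
long_score = complexity_score("Analyze and compare the designs. " + "x" * 4000)
```

With these assumed constants a short factual question scores near the bottom of the range, while a long prompt containing several keywords scores near the top. No network call is involved, which matches the "no extra API calls" claim above.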

route(text)[source]

Return the model identifier for text.

Walks tiers (cheapest first) and returns the first model whose max_score ≥ complexity_score. Always returns a model because the last tier has max_score == 100 (validated at construction).

Complexity: O(k) where k = number of tiers.

Return type:: str
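The tier walk described for route() can be sketched in plain Python. This is an illustrative stand-in under stated assumptions: Tier, validate_tiers and pick_model are hypothetical names, and the function takes an already computed score instead of raw text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tier:
    """Stand-in for RoutingTier: a model name and an inclusive score ceiling."""
    model: str
    max_score: float


def validate_tiers(tiers: List[Tier]) -> Tuple[Tier, ...]:
    """Reject an empty or unsorted ladder, mirroring the documented ValueError."""
    if not tiers:
        raise ValueError("tiers must not be empty")
    ceilings = [t.max_score for t in tiers]
    if ceilings != sorted(ceilings):
        raise ValueError("tiers must be sorted ascending by max_score")
    return tuple(tiers)


def pick_model(tiers: Tuple[Tier, ...], complexity: float) -> str:
    """Return the first tier whose ceiling covers the score; O(k) in the tier count."""
    for tier in tiers:
        if complexity <= tier.max_score:
            return tier.model
    # Only reachable if the last ceiling is below the score; use the last tier.
    return tiers[-1].model


ladder = validate_tiers([Tier("small", 30), Tier("medium", 70), Tier("large", 100)])
choice_at_boundary = pick_model(ladder, 30)   # the ceiling is inclusive
choice_above = pick_model(ladder, 31)
```

Because max_score is an inclusive upper bound, a score exactly equal to a ceiling stays in the cheaper tier.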

property tiers: tuple[RoutingTier, ...]: Immutable view of the configured tiers.

class ractogateway.huggingface_developer_kit.EmbeddingConfig(**data)[source]

Bases: BaseModel

Validated input for embed / aembed calls.

Example:

config = EmbeddingConfig(texts=["Hello world", "Goodbye world"])
response = kit.embed(config)

texts: list[str]

model: str | None

dimensions: int | None

extra: dict[str, Any]

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.EmbeddingResponse(**data)[source]

Bases: BaseModel

Unified response from an embedding call.

vectors: list[EmbeddingVector]

model: str

usage: dict[str, int]

raw: Any

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.EmbeddingVector(**data)[source]

Bases: BaseModel

A single embedding result.

index: int

text: str

embedding: list[float]

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.ExactMatchCache(max_size=1024, ttl_seconds=None)[source]

Bases: object

Ultra-low-latency key-value cache for identical LLM requests.

Parameters:

max_size (int) – LRU capacity. 0 = unlimited (no eviction).
ttl_seconds (float | None) – Entries older than ttl_seconds are treated as misses and transparently evicted. None disables expiry.

Example:

from ractogateway.cache import ExactMatchCache
from ractogateway.openai_developer_kit import OpenAIDeveloperKit

cache = ExactMatchCache(max_size=512, ttl_seconds=3600)

# Wire into a kit:
kit = OpenAIDeveloperKit(model="gpt-4o", exact_cache=cache)

get(user_message, system_prompt, model, temperature, max_tokens)[source]

Return a cached response or None on a miss.

O(1) — dictionary lookup + optional move-to-end.

Return type:: LLMResponse | None

put(user_message, system_prompt, model, temperature, max_tokens, response)[source]

Store a response. Evicts LRU entry when at capacity.

O(1) amortised — dictionary insert + optional popitem(last=False).

Return type:: None
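The O(1) claims for get() and put() follow from an ordered dictionary that supports move-to-end and pop-from-front. The sketch below illustrates that technique with collections.OrderedDict. It is not the library's implementation: TinyExactCache and its injectable clock parameter are hypothetical, and the key is assumed to be a tuple of the five request fields.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TinyExactCache:
    """Illustrative LRU + TTL cache keyed by the full request tuple."""

    def __init__(self, max_size: int = 1024, ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._data = OrderedDict()   # key -> (stored_at, value), oldest first
        self._max_size = max_size    # 0 means unlimited
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._ttl is not None and self._clock() - stored_at > self._ttl:
            del self._data[key]       # expired: evict and report a miss
            return None
        self._data.move_to_end(key)   # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        if self._max_size and len(self._data) > self._max_size:
            self._data.popitem(last=False)  # drop the least recently used entry


now = [0.0]
cache = TinyExactCache(max_size=2, ttl_seconds=60, clock=lambda: now[0])
key_a = ("hello", "system", "model-a", 0.0, 256)
key_b = ("hello", "system", "model-b", 0.0, 256)
key_c = ("bye", "system", "model-a", 0.0, 256)
cache.put(key_a, "A")
cache.put(key_b, "B")
cache.get(key_a)        # touching key_a makes key_b the eviction candidate
cache.put(key_c, "C")   # capacity is 2, so key_b is evicted
```

Keying on every request field means that changing only the temperature or the model produces a different key and therefore a miss.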

invalidate(user_message, system_prompt, model, temperature, max_tokens)[source]

Remove a specific entry. Returns True if it was present.

Return type:: bool

clear()[source]

Evict all cached entries and reset counters.

Return type:: None

property stats: CacheStats: Return a snapshot of hit/miss/size counters.

class ractogateway.huggingface_developer_kit.FinishReason(*values)[source]

Bases: str, Enum

Why the model stopped generating.

STOP = 'stop'

TOOL_CALL = 'tool_call'

LENGTH = 'length'

CONTENT_FILTER = 'content_filter'

ERROR = 'error'
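Because FinishReason derives from both str and Enum, its members compare equal to their plain string values and can be looked up from a string. The sketch below redeclares the enum locally so that it runs without the package installed. The helper needs_follow_up is hypothetical.

```python
from enum import Enum


class FinishReason(str, Enum):
    """Local copy of the documented members, for illustration only."""
    STOP = "stop"
    TOOL_CALL = "tool_call"
    LENGTH = "length"
    CONTENT_FILTER = "content_filter"
    ERROR = "error"


def needs_follow_up(reason: FinishReason) -> bool:
    """True when the output was cut short or the model asked for a tool."""
    return reason in (FinishReason.LENGTH, FinishReason.TOOL_CALL)


from_string = FinishReason("length")            # lookup by value
equals_plain_str = FinishReason.STOP == "stop"  # str mixin allows direct comparison
```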

class ractogateway.huggingface_developer_kit.HuggingFaceDeveloperKit(model='meta-llama/Llama-3.2-3B-Instruct', *, api_key=None, base_url=None, embedding_model='sentence-transformers/all-MiniLM-L6-v2', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)[source]

Bases: object

Complete HuggingFace developer kit — chat, stream, embeddings, and optional performance/cost optimisation middleware.

Works with both the HuggingFace Inference API (cloud) and local deployments (TGI / vLLM / Llama.cpp) via base_url.

Parameters:

model (str) – HuggingFace model repo ID (e.g. "meta-llama/Llama-3.2-3B-Instruct"). For local servers use any identifier the server expects (e.g. "tgi"). Use "auto" when a CostAwareRouter is provided — the router will select the model per-request.
api_key (str | None) – HuggingFace token. Falls back to HF_TOKEN then HUGGINGFACE_TOKEN environment variables.
base_url (str | None) – Custom endpoint URL. When set, requests go to the local/private server instead of the HuggingFace Inference API.
embedding_model (str) – Default model for embedding calls. Defaults to "sentence-transformers/all-MiniLM-L6-v2".
default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.
exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.
semantic_cache (SemanticCache | None) – Optional SemanticCache.
router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".
truncator (TokenTruncator | None) – Optional TokenTruncator.
tracer (RactoTracer | None) – Optional RactoTracer.
metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.
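The api_key fallback order described above (explicit argument, then HF_TOKEN, then HUGGINGFACE_TOKEN) can be expressed as a small helper. This is a sketch of the documented order only. resolve_hf_token and its env parameter are hypothetical names, and the library's own lookup may differ in detail.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(api_key: Optional[str] = None,
                     env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the explicit key if given, else HF_TOKEN, else HUGGINGFACE_TOKEN."""
    if api_key:
        return api_key
    return env.get("HF_TOKEN") or env.get("HUGGINGFACE_TOKEN")


# Passing a plain dict as env keeps the example independent of the real environment.
explicit = resolve_hf_token("hf_explicit", env={"HF_TOKEN": "hf_env"})
primary = resolve_hf_token(env={"HF_TOKEN": "hf_env", "HUGGINGFACE_TOKEN": "hf_old"})
secondary = resolve_hf_token(env={"HUGGINGFACE_TOKEN": "hf_old"})
missing = resolve_hf_token(env={})
```

A result of None corresponds to the tokenless case, which the usage notes at the top of this page describe as acceptable for a local TGI or vLLM server reached through base_url.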

provider: str = 'huggingface'

chat(config)[source]

Synchronous chat completion with optional middleware pipeline.

Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.

Return type:: LLMResponse
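The middleware order listed for chat() can be pictured as a straight-line function with an early return on a cache hit. The sketch below is an assumption-laden illustration: every callable name is hypothetical, and it assumes that a cache hit skips routing, the API call, and telemetry. This page states the order of the steps but not those details.

```python
from typing import Callable, Optional


def run_pipeline(
    message: str,
    *,
    truncate: Callable[[str], str],
    exact_get: Callable[[str], Optional[str]],
    semantic_get: Callable[[str], Optional[str]],
    route: Callable[[str], str],
    call_api: Callable[[str, str], str],
    write_caches: Callable[[str, str], None],
    record: Callable[[str, str], None],
) -> str:
    message = truncate(message)            # 1. truncate
    cached = exact_get(message)            # 2. exact cache
    if cached is None:
        cached = semantic_get(message)     # 3. semantic cache
    if cached is not None:
        return cached                      # assumed: a hit ends the pipeline here
    model = route(message)                 # 4. route model
    response = call_api(model, message)    # 5. API call
    write_caches(message, response)        # 6. write caches
    record(model, response)                # 7. record telemetry
    return response


store = {}
api_calls = []


def fake_api(model: str, message: str) -> str:
    """Stand-in for the provider call; records each invocation."""
    api_calls.append((model, message))
    return f"{model}: {message.upper()}"


def ask(message: str) -> str:
    return run_pipeline(
        message,
        truncate=lambda m: m[:100],
        exact_get=store.get,
        semantic_get=lambda m: None,
        route=lambda m: "small" if len(m) < 20 else "large",
        call_api=fake_api,
        write_caches=store.__setitem__,
        record=lambda model, response: None,
    )


first = ask("hello")
second = ask("hello")   # served from the exact cache, no second API call
```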

async achat(config)[source]

Async chat completion with optional middleware pipeline.

Return type:: LLMResponse

stream(config)[source]

Synchronous streaming — yields StreamChunk objects.

Example:

for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")

Return type:: Iterator[StreamChunk]

async astream(config)[source]

Async streaming — yields StreamChunk objects.

Return type:: AsyncIterator[StreamChunk]

embed(config)[source]

Synchronous embedding via HuggingFace feature_extraction.

Example:

resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])

Return type:: EmbeddingResponse

async aembed(config)[source]

Async embedding via HuggingFace feature_extraction.

Return type:: EmbeddingResponse

class ractogateway.huggingface_developer_kit.LLMResponse(**data)[source]

Bases: BaseModel

Unified, provider-agnostic response envelope.

Every adapter’s run() method returns one of these, regardless of whether the underlying provider is OpenAI, Gemini, or Anthropic.

content: str | None

thinking: str | None

parsed: dict[str, Any] | list[Any] | None

tool_calls: list[ToolCallResult]

finish_reason: FinishReason

usage: dict[str, int]

raw: Any

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.Message(**data)[source]

Bases: BaseModel

A single conversation turn.

Used inside ChatConfig.history to provide prior conversation context to the model for multi-turn conversations.

role: MessageRole

content: str

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.MessageRole(*values)[source]

Bases: str, Enum

Role of a single message in a conversation.

SYSTEM = 'system'

USER = 'user'

ASSISTANT = 'assistant'

class ractogateway.huggingface_developer_kit.RoutingTier(**data)[source]

Bases: BaseModel

One tier in the cost-aware routing ladder.

The router evaluates a complexity score (0-100) for each incoming message and selects the first tier whose max_score is >= that score. The last tier in the list always acts as the catch-all fallback.

Parameters:

model (str) – The LLM model identifier to use for requests that fall in this tier (e.g. "gpt-4o-mini", "gemini-2.0-flash", "claude-haiku-4-5-20251001").
max_score (float) – Inclusive upper bound on the complexity score that routes to this model. Range: 0-100. Set to 100 for the last (most powerful) tier so it catches everything.

Examples

tiers = [
    RoutingTier(model="gpt-4o-mini",  max_score=30),
    RoutingTier(model="gpt-4o",        max_score=70),
    RoutingTier(model="o3-mini",        max_score=100),
]

model: str

max_score: float

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.SemanticCache(embed_fn, similarity_threshold=0.95, max_size=512, ttl_seconds=None)[source]

Bases: object

Vector-similarity cache — returns cached answers for semantically similar queries, costing $0 in API calls.

Parameters:

embed_fn (Callable[[str], list[float]]) – Any callable (text: str) -> list[float]. Called once per new query (cache miss) and once at put() time.
similarity_threshold (float) – Minimum cosine similarity to declare a hit. Default 0.95 is intentionally strict to avoid incorrect responses.
max_size (int) – Maximum number of entries (LRU eviction). 0 = unlimited.
ttl_seconds (float | None) – Optional per-entry TTL. None disables expiry.

Examples

import openai

import ractogateway.openai_developer_kit as gpt
from ractogateway.huggingface_developer_kit import SemanticCache

kit = gpt.OpenAIDeveloperKit(model="gpt-4o")

def embed(text: str) -> list[float]:
    # Return the embedding vector for one piece of text.
    r = openai.OpenAI().embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return r.data[0].embedding

cache = SemanticCache(embed_fn=embed, similarity_threshold=0.95)

get(query)[source]

Embed query and return a cached response if cosine-sim ≥ threshold.

Returns None on a cache miss (caller should make the real API call and then invoke put()).

Complexity: O(n·d) where n = number of entries, d = embedding dim.

Return type:: LLMResponse | None
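The O(n·d) cost of get() comes from comparing the query embedding against every stored embedding. The following sketch shows that linear cosine-similarity scan with a toy embedding. TinySemanticCache and letter_counts are hypothetical, the letter-count "embedding" exists only to make the example self-contained, and LRU eviction and TTL are left out.

```python
import math
import string
from typing import Callable, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def letter_counts(text: str) -> List[float]:
    """Toy embedding: one dimension per ASCII letter. Not a real model."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in string.ascii_lowercase]


class TinySemanticCache:
    def __init__(self, embed_fn: Callable[[str], List[float]],
                 similarity_threshold: float = 0.95) -> None:
        self._embed = embed_fn
        self._threshold = similarity_threshold
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        vector = self._embed(query)
        best_score, best_value = 0.0, None
        for stored_vector, value in self._entries:   # n entries, O(d) work each
            similarity = cosine(vector, stored_vector)
            if similarity > best_score:
                best_score, best_value = similarity, value
        return best_value if best_score >= self._threshold else None

    def put(self, query: str, response: str) -> None:
        self._entries.append((self._embed(query), response))


cache = TinySemanticCache(letter_counts, similarity_threshold=0.95)
cache.put("What is the capital of France", "Paris")
near_hit = cache.get("What's the capital of France?")
miss = cache.get("Explain Python generators")
```

The two France questions differ by a single letter count, so their similarity is about 0.99 and the lookup hits. The unrelated question scores about 0.72 and misses. With a real embedding model the same threshold decides how aggressively paraphrases are treated as duplicates.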

put(query, response)[source]

Embed query and store response for future similar queries.

Evicts LRU entry when at capacity.

Return type:: None

clear()[source]

Remove all entries and reset counters.

Return type:: None

property stats: CacheStats: Return a snapshot of hit/miss/size counters.

class ractogateway.huggingface_developer_kit.StreamChunk(**data)[source]

Bases: BaseModel

A single piece of a streaming response.

Consumers iterate over StreamChunk objects — they never touch raw provider events directly.

delta: The incremental content for this chunk.

accumulated_text: Running concatenation of all delta.text values so far.

finish_reason: None for intermediate chunks; set on the final chunk.

tool_calls: Empty until the final chunk (is_final=True).

usage: Token counts — populated on the final chunk only.

is_final: True only for the very last chunk in the stream.

raw: The underlying provider event (escape-hatch for advanced users).
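The relationship between delta, accumulated_text and is_final can be illustrated with a small generator. This is a stand-alone sketch: Delta, Chunk and fake_stream are hypothetical simplifications that carry only three of the documented fields.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Delta:
    text: str


@dataclass
class Chunk:
    delta: Delta
    accumulated_text: str
    is_final: bool


def fake_stream(pieces: List[str]) -> Iterator[Chunk]:
    """Yield one chunk per piece, carrying the running concatenation."""
    accumulated = ""
    last_index = len(pieces) - 1
    for index, piece in enumerate(pieces):
        accumulated += piece
        yield Chunk(Delta(piece), accumulated, is_final=(index == last_index))


chunks = list(fake_stream(["Gen", "era", "tors"]))
final_text = chunks[-1].accumulated_text
```

A consumer that only needs the complete text can ignore the deltas and read accumulated_text from the chunk where is_final is True.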

delta: StreamDelta

accumulated_text: str

accumulated_thinking: str

is_thinking: bool

finish_reason: FinishReason | None

tool_calls: list[ToolCallResult]

usage: dict[str, int]

is_final: bool

parsed: dict[str, Any] | list[Any] | None

raw: Any

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.StreamDelta(**data)[source]

Bases: BaseModel

Incremental content produced by a single streaming event.

text: str

thinking: str

tool_call_id: str | None

tool_call_name: str | None

tool_call_args_fragment: str | None

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.TokenTruncator(config=None)[source]

Bases: object

Smart conversation-history trimmer.

Parameters: config (TruncationConfig | None) – A TruncationConfig instance. If omitted, a default config is used (approximate token counter, 8,192-token fallback limit).

Examples

import tiktoken

from ractogateway.openai_developer_kit import OpenAIDeveloperKit
from ractogateway.truncation import TokenTruncator, TruncationConfig

# Use tiktoken for exact token counts instead of the approximate counter.
enc = tiktoken.encoding_for_model("gpt-4o")
truncator = TokenTruncator(
    TruncationConfig(
        token_counter=lambda t: len(enc.encode(t)),
        keep_first_n=2,
        keep_last_n=8,
    )
)
kit = OpenAIDeveloperKit(model="gpt-4o", truncator=truncator)

truncate(chat_config, model)[source]

Return a copy of chat_config with trimmed history if necessary.

If the total estimated token count (system prompt + history + user_message) fits within the model’s context limit, the original ChatConfig is returned unchanged.

Parameters:

chat_config (ChatConfig) – The chat configuration to potentially truncate.
model (str) – The resolved model name used to look up the context-window limit.

Return type:

ChatConfig

Returns:

ChatConfig – A new ChatConfig instance with (possibly shorter) history. The user_message and all other fields are preserved verbatim.
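The trimming behaviour can be sketched on a plain list of strings. The sketch rests on assumptions: it keeps the first keep_first_n and last keep_last_n messages and drops the oldest unprotected messages first until the history fits. This page does not specify the drop order, and trim_history and its parameters are hypothetical names. The approximate counter (len // 4) is the documented default.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Approximate token count: about four characters per token."""
    return len(text) // 4


def trim_history(history: List[str], *, limit: int, reserved: int = 0,
                 keep_first_n: int = 2, keep_last_n: int = 6,
                 count: Callable[[str], int] = approx_tokens) -> List[str]:
    """Return history unchanged if it fits, else a shorter copy."""
    budget = limit - reserved   # reserved stands for system prompt, user message, margin
    if sum(count(m) for m in history) <= budget:
        return history          # fits: hand back the original object
    if len(history) <= keep_first_n + keep_last_n:
        return history          # every message is protected, nothing to drop
    split = len(history) - keep_last_n
    head = history[:keep_first_n]
    middle = history[keep_first_n:split]
    tail = history[split:]
    # Assumed policy: drop the oldest unprotected message until the budget is met.
    while middle and sum(count(m) for m in head + middle + tail) > budget:
        middle.pop(0)
    return head + middle + tail


messages = [f"m{i}".ljust(40, ".") for i in range(10)]   # 10 tokens each, 100 total
trimmed = trim_history(messages, limit=70, keep_first_n=2, keep_last_n=3)
untouched = trim_history(messages, limit=200)
```

With a budget of 70 tokens the sketch keeps seven messages: the two anchors at the start, the three most recent, and the two newest of the unprotected middle.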

estimate_tokens(text)[source]

Convenience wrapper around the configured token counter.

Return type:: int

class ractogateway.huggingface_developer_kit.ToolCallResult(**data)[source]

Bases: BaseModel

A single tool/function call returned by the model.

id: str

name: str

arguments: dict[str, Any]

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class ractogateway.huggingface_developer_kit.TruncationConfig(**data)[source]

Bases: BaseModel

Configuration for TokenTruncator.

Parameters:

max_context_tokens (int | None) – Hard cap on total prompt tokens before calling the API. When None, the truncator looks up the model in MODEL_CONTEXT_LIMITS and falls back to 8,192 tokens if the model is not listed.
keep_first_n (int) – Number of history messages to always preserve from the start of the conversation (anchors context). Defaults to 2.
keep_last_n (int) – Number of history messages to always preserve from the most recent end of the conversation. Defaults to 6.
token_counter (Callable[[str], int]) –
Callable (text: str) -> int. Defaults to the built-in approximate counter (len // 4). Swap for tiktoken for exact OpenAI token counts:
```
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
config = TruncationConfig(token_counter=lambda t: len(enc.encode(t)))
```
safety_margin (int) – Extra token budget reserved beyond the system prompt and user message. Defaults to 512.

max_context_tokens: int | None

keep_first_n: int

keep_last_n: int

token_counter: Callable[[str], int]

safety_margin: int

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

resolve_limit(model)[source]

Return the effective token limit for model.

Priority: max_context_tokens → MODEL_CONTEXT_LIMITS lookup → _DEFAULT_CONTEXT.

Return type:: int
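The priority chain amounts to an explicit override followed by a table lookup with a default. In the sketch below the table entry is a made-up placeholder and the function is a free-standing stand-in for the method. Only the 8,192 fallback value is taken from the parameter description above.

```python
from typing import Dict, Optional

# Hypothetical table contents; the real MODEL_CONTEXT_LIMITS is not shown on this page.
MODEL_CONTEXT_LIMITS: Dict[str, int] = {"example-model": 32_000}
DEFAULT_CONTEXT = 8_192


def resolve_limit(model: str, max_context_tokens: Optional[int] = None) -> int:
    """Explicit cap first, then the per-model table, then the default."""
    if max_context_tokens is not None:
        return max_context_tokens
    return MODEL_CONTEXT_LIMITS.get(model, DEFAULT_CONTEXT)


override = resolve_limit("example-model", max_context_tokens=4_000)
from_table = resolve_limit("example-model")
fallback = resolve_limit("unknown-model")
```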

model_post_init(_TruncationConfig__context)[source]

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Return type:: None