ractogateway.huggingface_developer_kit.kit

HuggingFace Developer Kit — production-grade HuggingFace interface.

Usage:

from ractogateway import huggingface_developer_kit as hf

kit = hf.HuggingFaceDeveloperKit(
    model="meta-llama/Llama-3.2-3B-Instruct",
    default_prompt=my_prompt,  # a RactoPrompt instance defined elsewhere
)
response = kit.chat(hf.ChatConfig(user_message="Hello"))

for chunk in kit.stream(hf.ChatConfig(user_message="Hello")):
    print(chunk.delta.text, end="", flush=True)

Set HF_TOKEN in the environment for cloud inference, or pass base_url to point at a self-hosted TGI / vLLM / llama.cpp server.

Local TGI example:

# docker run --rm -p 8080:80 ghcr.io/huggingface/text-generation-inference \
#     --model-id meta-llama/Llama-3.2-3B-Instruct
kit = hf.HuggingFaceDeveloperKit(
    model="tgi",
    base_url="http://localhost:8080",
)

class ractogateway.huggingface_developer_kit.kit.HuggingFaceDeveloperKit(model='meta-llama/Llama-3.2-3B-Instruct', *, api_key=None, base_url=None, embedding_model='sentence-transformers/all-MiniLM-L6-v2', default_prompt=None, exact_cache=None, semantic_cache=None, router=None, truncator=None, tracer=None, metrics=None)

Bases: object

Complete HuggingFace developer kit — chat, stream, embeddings, and optional performance/cost optimisation middleware.

Works with both the HuggingFace Inference API (cloud) and local deployments (TGI / vLLM / llama.cpp) via base_url.

Parameters:

model (str) – HuggingFace model repo ID (e.g. "meta-llama/Llama-3.2-3B-Instruct"). For local servers, use any identifier the server expects (e.g. "tgi"). Use "auto" when a CostAwareRouter is provided; the router then selects the model for each request.
api_key (str | None) – HuggingFace token. If omitted, falls back to the HF_TOKEN environment variable, then to HUGGINGFACE_TOKEN.
base_url (str | None) – Custom endpoint URL. When set, requests go to the local/private server instead of the HuggingFace Inference API.
embedding_model (str) – Default model for embedding calls. Defaults to "sentence-transformers/all-MiniLM-L6-v2".
default_prompt (RactoPrompt | None) – RACTO prompt used when ChatConfig.prompt is None.
exact_cache (ExactMatchCache | None) – Optional ExactMatchCache.
semantic_cache (SemanticCache | None) – Optional SemanticCache.
router (CostAwareRouter | None) – Optional CostAwareRouter. Required when model="auto".
truncator (TokenTruncator | None) – Optional TokenTruncator.
tracer (RactoTracer | None) – Optional RactoTracer.
metrics (GatewayMetricsMiddleware | None) – Optional GatewayMetricsMiddleware.
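
The api_key fallback described above can be sketched as a small standalone function. This is only an illustration of the documented order (explicit argument, then HF_TOKEN, then HUGGINGFACE_TOKEN); resolve_hf_token is a hypothetical helper, not part of the kit, and treating an empty string as "unset" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first available token, or None if nothing is set.

    Hypothetical helper: it mirrors the documented fallback order only.
    """
    env = os.environ if env is None else env
    if api_key:  # an explicit argument wins (assumption: "" counts as unset)
        return api_key
    # Documented fallback order: HF_TOKEN first, then HUGGINGFACE_TOKEN.
    for name in ("HF_TOKEN", "HUGGINGFACE_TOKEN"):
        value = env.get(name)
        if value:
            return value
    return None
```

Passing a plain dict as env keeps the function easy to check without touching the real environment.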

provider: str = 'huggingface'

chat(config)

Synchronous chat completion with optional middleware pipeline.

Middleware order: truncate → exact cache → semantic cache → route model → API call → write caches → record telemetry.

Return type: LLMResponse
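
To make the middleware order concrete, here is a minimal standalone sketch of the seven steps using plain callables and a dict. It is not the kit's implementation: run_chat_pipeline and all of its parameters are hypothetical, and the choice to return early on a cache hit (skipping routing and the API call) is an assumption, since the source only states the order.

```python
from typing import Callable, Dict, List, Optional


def run_chat_pipeline(
    message: str,
    call_api: Callable[[str, str], str],
    *,
    truncate: Optional[Callable[[str], str]] = None,
    exact_cache: Optional[Dict[str, str]] = None,
    semantic_lookup: Optional[Callable[[str], Optional[str]]] = None,
    semantic_store: Optional[Callable[[str, str], None]] = None,
    route: Optional[Callable[[str], str]] = None,
    default_model: str = "default-model",
    telemetry: Optional[List[str]] = None,
) -> str:
    """Hypothetical stand-in for the documented step order."""
    events = telemetry if telemetry is not None else []
    # 1. Truncate the input.
    if truncate is not None:
        message = truncate(message)
    # 2. Exact-match cache lookup (assumed to short-circuit on a hit).
    if exact_cache is not None and message in exact_cache:
        events.append("exact_hit")
        return exact_cache[message]
    # 3. Semantic cache lookup (assumed to short-circuit on a hit).
    if semantic_lookup is not None:
        hit = semantic_lookup(message)
        if hit is not None:
            events.append("semantic_hit")
            return hit
    # 4. Route to a model.
    model = route(message) if route is not None else default_model
    # 5. Call the API.
    answer = call_api(model, message)
    # 6. Write the caches.
    if exact_cache is not None:
        exact_cache[message] = answer
    if semantic_store is not None:
        semantic_store(message, answer)
    # 7. Record telemetry.
    events.append(f"api_call:{model}")
    return answer
```

Every stage is optional here, matching the fact that each middleware argument of the kit defaults to None.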

async achat(config)

Async chat completion with optional middleware pipeline.

Return type: LLMResponse

stream(config)

Synchronous streaming — yields StreamChunk objects.

Example:

for chunk in kit.stream(config):
    print(chunk.delta.text, end="", flush=True)
    if chunk.is_final:
        print(f"\nTokens: {chunk.usage}")

Return type: Iterator[StreamChunk]

async astream(config)

Async streaming — yields StreamChunk objects.

Return type: AsyncIterator[StreamChunk]

embed(config)

Synchronous embedding via HuggingFace feature_extraction.

Example:

resp = kit.embed(EmbeddingConfig(texts=["hello", "world"]))
print(resp.vectors[0].embedding[:5])

Return type: EmbeddingResponse
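
A common next step is comparing two returned vectors. The sketch below computes cosine similarity in pure Python. cosine_similarity is a hypothetical helper, not part of the kit, and it assumes each embedding is a flat sequence of floats, which the slicing in the example above suggests but does not guarantee.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```

With the response from the example, this would be called as cosine_similarity(resp.vectors[0].embedding, resp.vectors[1].embedding).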

async aembed(config)

Async embedding via HuggingFace feature_extraction.

Return type: EmbeddingResponse