ractogateway.adapters.huggingface_kit

HuggingFace Inference adapter — cloud and local TGI/vLLM servers.

Requires the huggingface_hub package:

pip install ractogateway[huggingface]

Set HF_TOKEN (or HUGGINGFACE_TOKEN) in the environment for the HuggingFace Inference API. For local servers (TGI / vLLM) pass base_url and omit the token.

class ractogateway.adapters.huggingface_kit.HuggingFaceLLMKit(model='meta-llama/Llama-3.2-3B-Instruct', *, api_key=None, base_url=None, **kwargs)[source]

Bases: BaseLLMAdapter

Low-level adapter for HuggingFace Inference API and local TGI/vLLM servers.

Uses InferenceClient.chat_completion() (OpenAI-compatible endpoint) so it works with any chat-capable model hosted on HF or self-hosted via TGI / vLLM / Llama.cpp.

Parameters:

model (str) – HuggingFace model repo ID (e.g. "meta-llama/Llama-3.2-3B-Instruct"). For local servers set base_url and this can be any identifier the server understands (often "tgi" for TGI’s default endpoint).
api_key (str | None) – HuggingFace token. Falls back to HF_TOKEN then HUGGINGFACE_TOKEN environment variables.
base_url (str | None) – Custom endpoint URL. When supplied, requests are sent there instead of the HF Inference API (useful for local TGI / vLLM deployments).

provider: str = 'huggingface'

translate_tools(registry)[source]

Convert registry schemas to HuggingFace/OpenAI function-calling format.

Return type:: list[dict[str, Any]]

run(prompt, user_message, *, history=None, tools=None, temperature=0.0, max_tokens=4096, **kwargs)[source]

Execute a chat request synchronously.

Return type:: LLMResponse

async arun(prompt, user_message, *, history=None, tools=None, temperature=0.0, max_tokens=4096, **kwargs)[source]

Execute a chat request asynchronously.

Return type:: LLMResponse