ractogateway.adapters.huggingface_kit
HuggingFace Inference adapter — cloud and local TGI/vLLM servers.
Requires the huggingface_hub package:
pip install ractogateway[huggingface]
Set HF_TOKEN (or HUGGINGFACE_TOKEN) in the environment for the
HuggingFace Inference API. For local servers (TGI / vLLM) pass
base_url and omit the token.
- class ractogateway.adapters.huggingface_kit.HuggingFaceLLMKit(model='meta-llama/Llama-3.2-3B-Instruct', *, api_key=None, base_url=None, **kwargs)[source]
Bases:
BaseLLMAdapterLow-level adapter for HuggingFace Inference API and local TGI/vLLM servers.
Uses
InferenceClient.chat_completion()(OpenAI-compatible endpoint) so it works with any chat-capable model hosted on HF or self-hosted via TGI / vLLM / Llama.cpp.- Parameters:
model (
str) – HuggingFace model repo ID (e.g."meta-llama/Llama-3.2-3B-Instruct"). For local servers setbase_urland this can be any identifier the server understands (often"tgi"for TGI’s default endpoint).api_key (
str|None) – HuggingFace token. Falls back toHF_TOKENthenHUGGINGFACE_TOKENenvironment variables.base_url (
str|None) – Custom endpoint URL. When supplied, requests are sent there instead of the HF Inference API (useful for local TGI / vLLM deployments).
- provider: str = 'huggingface'
- translate_tools(registry)[source]
Convert registry schemas to HuggingFace/OpenAI function-calling format.
- run(prompt, user_message, *, history=None, tools=None, temperature=0.0, max_tokens=4096, **kwargs)[source]
Execute a chat request synchronously.
- Return type:
- async arun(prompt, user_message, *, history=None, tools=None, temperature=0.0, max_tokens=4096, **kwargs)[source]
Execute a chat request asynchronously.
- Return type: