HuggingFace — Cloud and Local Inference
HuggingFaceDeveloperKit supports three deployment modes through a single,
consistent interface:
Mode |
When to use |
|---|---|
HuggingFace Inference API |
Quick prototyping; no server required; set |
Local TGI |
Self-hosted Text Generation Inference; no API key needed |
Local vLLM / Llama.cpp |
Any OpenAI-compatible HTTP server; pass |
Installation
pip install ractogateway[huggingface]
For cloud inference, obtain a token at https://huggingface.co/settings/tokens and set:
export HF_TOKEN="hf_..."
Quick Start — Cloud Inference
from ractogateway import huggingface_developer_kit as hf, RactoPrompt
prompt = RactoPrompt(
role="You are a helpful assistant.",
aim="Answer the user clearly and concisely.",
constraints=["Stay on topic.", "Do not hallucinate."],
tone="Friendly",
output_format="text",
)
# Token read from HF_TOKEN env var automatically
kit = hf.Chat(
model="meta-llama/Llama-3.2-3B-Instruct",
default_prompt=prompt,
)
response = kit.chat(hf.ChatConfig(user_message="What is attention in transformers?"))
print(response.content)
Quick Start — Local TGI Server
# Pull and launch TGI (requires Docker + enough VRAM/RAM)
docker run --rm -p 8080:80 \
ghcr.io/huggingface/text-generation-inference \
--model-id meta-llama/Llama-3.2-3B-Instruct
# No API key; point base_url at the running container
kit = hf.Chat(
model="tgi",
base_url="http://localhost:8080",
default_prompt=prompt,
)
response = kit.chat(hf.ChatConfig(user_message="Explain attention in one paragraph."))
print(response.content)
Quick Start — Local vLLM Server
# Launch vLLM (OpenAI-compatible endpoint)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--port 8000
kit = hf.Chat(
model="meta-llama/Llama-3.2-3B-Instruct",
base_url="http://localhost:8000/v1",
default_prompt=prompt,
)
Constructor Parameters
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
HF repo ID or server label |
|
|
|
Falls back to |
|
|
|
Local server URL (TGI, vLLM, Llama.cpp) |
|
|
|
Default embedding model |
|
|
|
Kit-level default prompt |
|
|
|
In-process exact-match cache |
|
|
|
Cosine-similarity cache |
|
|
|
Required when |
|
|
|
Auto-trim long histories |
|
|
|
OpenTelemetry spans |
|
|
|
Prometheus metrics |
Streaming
for chunk in kit.stream(hf.ChatConfig(user_message="Tell me a short story.")):
print(chunk.delta.text, end="", flush=True)
if chunk.is_final:
print(f"\nTokens: {chunk.usage}")
Async
import asyncio
async def main() -> None:
response = await kit.achat(hf.ChatConfig(user_message="Explain async/await."))
print(response.content)
async for chunk in kit.astream(hf.ChatConfig(user_message="Count to five.")):
print(chunk.delta.text, end="", flush=True)
asyncio.run(main())
Embeddings
HuggingFaceDeveloperKit uses InferenceClient.feature_extraction() for
embeddings. The model must support sentence / feature extraction (e.g.
sentence-transformers/all-MiniLM-L6-v2).
embed_kit = hf.Chat(
model="meta-llama/Llama-3.2-3B-Instruct",
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
default_prompt=prompt,
)
resp = embed_kit.embed(hf.EmbeddingConfig(texts=["hello world", "goodbye world"]))
for vec in resp.vectors:
print(f"[{vec.index}] '{vec.text}' — dim={len(vec.embedding)}")
Vision Models (Image Input)
HuggingFace models that support the OpenAI-compatible image_url content
block format (e.g. llava-hf/llava-1.5-7b-hf,
Qwen/Qwen2-VL-7B-Instruct) accept image attachments via
ChatConfig.attachments:
from ractogateway.prompts.engine import RactoFile
img = RactoFile.from_path("/tmp/chart.png")
kit = hf.Chat(
model="llava-hf/llava-1.5-7b-hf",
default_prompt=prompt,
)
response = kit.chat(
hf.ChatConfig(
user_message="What trend do you see in this chart?",
attachments=[img],
)
)
print(response.content)
For local TGI deployments, enable multimodal support when launching the container:
docker run --rm -p 8080:80 \
ghcr.io/huggingface/text-generation-inference \
--model-id llava-hf/llava-1.5-7b-hf
Tool Calling
Tool calling works on any model that supports function calling through the HuggingFace chat completions API.
from ractogateway import ToolRegistry, tool
@tool
def get_weather(city: str) -> str:
"""Return the current weather for a city."""
return f"Sunny, 22 °C in {city}"
registry = ToolRegistry([get_weather])
response = kit.chat(
hf.ChatConfig(
user_message="What is the weather in London?",
tools=registry,
auto_execute_tools=True,
)
)
print(response.content)
Validated JSON Output
from pydantic import BaseModel
class Sentiment(BaseModel):
label: str # "positive" | "negative" | "neutral"
score: float
typed_prompt = RactoPrompt(
role="You are a sentiment analyser.",
aim="Classify the sentiment of the text.",
constraints=["Return only the JSON object."],
tone="Neutral",
output_format=Sentiment,
)
kit = hf.Chat(
model="meta-llama/Llama-3.2-3B-Instruct",
default_prompt=typed_prompt,
)
response = kit.chat(
hf.ChatConfig(
user_message="I love this library!",
response_model=Sentiment,
)
)
print(response.parsed) # validated Sentiment instance dict
Environment Variables
Variable |
Used for |
|---|---|
|
HuggingFace Inference API authentication (preferred) |
|
Alternative token env var name |
Both variables are checked in order. If neither is set and no api_key is
passed, the client will attempt unauthenticated access (works for some public
models but may be rate-limited).
Recommended Models
Use case |
Model |
|---|---|
General chat |
|
Code generation |
|
Small + fast |
|
Multilingual |
|
Embeddings (cloud) |
|
Embeddings (local TGI) |
Any |
Browse all models at https://huggingface.co/models.