ractogateway.telemetry

RactoGateway Telemetry — OpenTelemetry tracing + Prometheus metrics.

Provides production-grade observability for every LLM call made through any RactoGateway developer kit.

Quick start:

from ractogateway import openai_developer_kit as opd
from ractogateway.telemetry import RactoTracer, GatewayMetricsMiddleware, PrometheusExporter

# --- OTEL tracing ---
tracer = RactoTracer(otlp_endpoint="http://localhost:4317")

# --- Prometheus metrics ---
metrics = GatewayMetricsMiddleware()
PrometheusExporter(port=8000).start()

kit = opd.OpenAIDeveloperKit(
    model="gpt-4o",
    default_prompt=my_prompt,  # placeholder: a prompt object defined elsewhere
    tracer=tracer,
    metrics=metrics,
)
response = kit.chat(opd.ChatConfig(user_message="Hello"))

Install extras:

pip install ractogateway[telemetry]    # OpenTelemetry tracing
pip install ractogateway[prometheus]   # Prometheus metrics
pip install ractogateway[observability]  # both

Public API

RactoTracer: OpenTelemetry tracer. Pass as tracer= to any kit.
GatewayMetricsMiddleware: Prometheus metrics collector. Pass as metrics= to any kit.
PrometheusExporter: HTTP server for Prometheus scraping.
ModelPricing: Per-model input/output pricing (USD per 1M tokens).
SpanRecord: In-memory span record for test assertions.
DEFAULT_COST_TABLE: Built-in pricing table.
compute_cost(): Compute estimated USD cost from token counts.

class ractogateway.telemetry.GatewayMetricsMiddleware(*, price_table=None, registry=None)[source]

Bases: object

Prometheus metrics middleware — pass as metrics= to any developer kit.

A single instance can be shared across multiple kits (different providers) to aggregate metrics in one registry.

Parameters:

price_table (dict[str, ModelPricing] | None) – Override or extend the built-in pricing table used for the ractogateway_cost_usd_total counter.
registry (Any | None) – Custom prometheus_client.CollectorRegistry. Defaults to the global REGISTRY (which also includes default Python metrics). Pass prometheus_client.CollectorRegistry() to get an isolated registry — useful in tests.
Requires: pip install ractogateway[prometheus]

record_request(*, provider, model, operation, status, latency_s, input_tokens=0, output_tokens=0, tool_calls=None)[source]

Record metrics for a completed LLM request.

Parameters:

provider (str) – Provider string ("openai", "google", "anthropic").
model (str) – Model identifier (e.g. "gpt-4o").
operation (str) – "chat", "stream", or "embed".
status (str) – "ok" or "error".
latency_s (float) – Request wall-clock latency in seconds.
input_tokens (int) – Prompt tokens consumed (0 for cache hits or errors).
output_tokens (int) – Completion tokens produced (0 for cache hits or errors).
tool_calls (list[Any] | None) – List of ToolCallResult objects from the response. Used to update ractogateway_tool_calls_total.

Return type:

None
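Note that record_request takes latency_s in seconds, whereas RactoTracer.record_chat_span takes latency_ms in milliseconds. The following is a minimal sketch of measuring one call and deriving both values with the standard library only. The helper name timed_call and the lambda workload are hypothetical and are not part of RactoGateway.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[[], Any]) -> Tuple[Any, float, float]:
    """Run fn once and return (result, latency_s, latency_ms)."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = fn()
    latency_s = time.perf_counter() - start
    return result, latency_s, latency_s * 1000.0


# Hypothetical workload standing in for an LLM call.
result, latency_s, latency_ms = timed_call(lambda: sum(range(1000)))
```

In this sketch latency_s is the value that would be passed as record_request(latency_s=...), and latency_ms the value for record_chat_span(latency_ms=...).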

record_cache_hit(cache_type)[source]

Increment the cache-hits counter.

Parameters: cache_type (str) – "exact" or "semantic".
Return type: None

record_cache_miss(cache_type)[source]

Increment the cache-misses counter.

Parameters: cache_type (str) – "exact" or "semantic".
Return type: None

generate_latest()[source]

Return current metrics in Prometheus text exposition format.

Useful for testing without starting an HTTP server:

text = middleware.generate_latest()
assert "ractogateway_requests_total" in text

Return type: str
Returns: str – UTF-8 decoded Prometheus text format string.
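Because generate_latest() returns plain text, a test can go beyond a substring check and assert on sample values. The sketch below parses text in the Prometheus exposition format using only the standard library. The SAMPLE string is hypothetical: the metric name comes from this page, but the help text, label names, and value are assumptions made for illustration, and the parser assumes no trailing timestamps on sample lines.

```python
from typing import Dict

# Hypothetical exposition text; label names and values are illustrative only.
SAMPLE = """\
# HELP ractogateway_requests_total Hypothetical help text.
# TYPE ractogateway_requests_total counter
ractogateway_requests_total{model="gpt-4o",provider="openai"} 3.0
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' sample key to its float value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        key, _, value = line.rpartition(" ")  # the value is the last token
        samples[key] = float(value)
    return samples


samples = parse_samples(SAMPLE)
```

With the real middleware, the same function could be applied to the string returned by middleware.generate_latest().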

class ractogateway.telemetry.ModelPricing(**data)[source]

Bases: BaseModel

USD cost per 1 million tokens for a specific model.

Parameters:

input_per_million (float) – Price in USD for 1 million input (prompt) tokens.
output_per_million (float) – Price in USD for 1 million output (completion) tokens.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

input_per_million: float

output_per_million: float

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.telemetry.PrometheusExporter(port=8000, *, registry=None)[source]

Bases: object

Start an HTTP server that exposes all registered Prometheus metrics.

The server listens on http://0.0.0.0:<port>/metrics and responds with the standard Prometheus text exposition format. This is designed to be used alongside GatewayMetricsMiddleware but works with any metrics registered in the global registry.

Parameters:

port (int) – HTTP port to listen on. Defaults to 8000.
registry (Any | None) – Custom prometheus_client.CollectorRegistry. When None the global REGISTRY is used.
Example:

from ractogateway.telemetry import GatewayMetricsMiddleware, PrometheusExporter

metrics = GatewayMetricsMiddleware()
exporter = PrometheusExporter(port=9090)
exporter.start()
# Prometheus can now scrape http://localhost:9090/metrics
exporter.stop()

Requires: pip install ractogateway[prometheus]

start()[source]

Start the metrics HTTP server in a background daemon thread.

Calling start() a second time on an already-running exporter is a no-op.

Return type: None

stop()[source]

Shut down the metrics HTTP server.

Safe to call even if the server was never started.

Return type: None
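The start() and stop() semantics described above (background daemon thread, repeated start() as a no-op, stop() safe when never started) can be illustrated with a standard-library sketch. SketchExporter is a hypothetical stand-in, not the real PrometheusExporter implementation. It binds to 127.0.0.1 rather than 0.0.0.0, and port 0 asks the operating system for a free port.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Optional


class SketchExporter:
    """Hypothetical stand-in that mimics the documented start/stop behaviour."""

    def __init__(self, port: int = 0,
                 render: Callable[[], str] = lambda: "demo_metric 1.0\n") -> None:
        self._port = port
        self._render = render
        self._server: Optional[ThreadingHTTPServer] = None
        self._lock = threading.Lock()

    @property
    def port(self) -> int:
        return self._port

    @property
    def is_running(self) -> bool:
        return self._server is not None

    def _make_handler(self):
        render = self._render

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                if self.path != "/metrics":
                    self.send_error(404)
                    return
                body = render().encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "text/plain; charset=utf-8")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, format, *args):
                pass  # keep the sketch quiet

        return Handler

    def start(self) -> None:
        with self._lock:
            if self._server is not None:
                return  # already running: no-op
            self._server = ThreadingHTTPServer(("127.0.0.1", self._port),
                                               self._make_handler())
            self._port = self._server.server_address[1]  # resolve port 0
            threading.Thread(target=self._server.serve_forever,
                             daemon=True).start()

    def stop(self) -> None:
        with self._lock:
            if self._server is None:
                return  # never started: nothing to do
            self._server.shutdown()
            self._server.server_close()
            self._server = None
```

A daemon thread is used so that a forgotten stop() does not keep the interpreter alive at exit.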

property port: int

The configured HTTP port.

property is_running: bool

True when the HTTP server is active.

generate_latest()[source]

Return current metrics in Prometheus text format.

No HTTP server required — useful for testing or embedding the metrics in a custom endpoint:

text = exporter.generate_latest()
assert "process_resident_memory_bytes" in text

Return type: str
Returns: str – UTF-8 decoded Prometheus text exposition string.

class ractogateway.telemetry.RactoTracer(*, service_name='ractogateway', otlp_endpoint=None, otlp_http_endpoint=None, console=False, in_memory=False, custom_exporter=None, price_table=None)[source]

Bases: object

OpenTelemetry tracer — pass as tracer= to any developer kit.

Records one span per LLM call with attributes for latency, token usage, estimated cost, cache-hit type, and tool-call count.

Supports OTLP gRPC (Jaeger / Grafana Tempo), OTLP HTTP, console stdout, in-memory capture (for tests), and any custom opentelemetry.sdk.trace.export.SpanExporter.

Parameters:

service_name (str) – OTEL service.name resource attribute. Defaults to "ractogateway".
otlp_endpoint (str | None) – OTLP gRPC endpoint (e.g. "http://localhost:4317"). Requires pip install ractogateway[telemetry].
otlp_http_endpoint (str | None) – OTLP HTTP endpoint (e.g. "http://localhost:4318"). Requires pip install ractogateway[telemetry].
console (bool) – Also print spans to stdout — convenient during local development.
in_memory (bool) – Capture spans internally in a thread-safe list. Access recorded spans via the spans property. Useful for unit tests — no external backend required.
custom_exporter (Any | None) – Any opentelemetry.sdk.trace.export.SpanExporter instance.
price_table (dict[str, ModelPricing] | None) – Override or extend the built-in DEFAULT_COST_TABLE. Keys are model identifiers; values are ModelPricing objects.
All spans carry the following OTEL attributes:

* llm.provider – "openai" / "google" / "anthropic"
* llm.model – e.g. "gpt-4o"
* llm.operation – "chat" / "stream" / "embed"
* llm.latency_ms – wall-clock time in milliseconds
* llm.input_tokens – prompt tokens consumed
* llm.output_tokens – completion tokens produced
* llm.cost_usd – estimated USD cost (8 decimal places)
* llm.cache_hit – "exact" / "semantic" / "miss"
* llm.tool_calls – number of tool calls in the response
* llm.error_type – exception class name on error (omitted on success)
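The span attributes listed above can be pictured as a flat dictionary. The sketch below builds one with the standard library only. build_span_attributes is a hypothetical helper, not a RactoGateway function. It assumes that cost is rounded to 8 decimal places and that llm.error_type is left out on success, as the attribute descriptions suggest.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, int, float]


def build_span_attributes(
    *,
    provider: str,
    model: str,
    operation: str,
    latency_ms: float,
    input_tokens: int = 0,
    output_tokens: int = 0,
    cost_usd: float = 0.0,
    cache_hit: str = "miss",
    tool_calls: int = 0,
    error_type: Optional[str] = None,
) -> Dict[str, AttrValue]:
    """Return a dict keyed by the llm.* attribute names from this page."""
    attrs: Dict[str, AttrValue] = {
        "llm.provider": provider,
        "llm.model": model,
        "llm.operation": operation,
        "llm.latency_ms": latency_ms,
        "llm.input_tokens": input_tokens,
        "llm.output_tokens": output_tokens,
        "llm.cost_usd": round(cost_usd, 8),  # assumed rounding, per the list
        "llm.cache_hit": cache_hit,
        "llm.tool_calls": tool_calls,
    }
    if error_type is not None:
        attrs["llm.error_type"] = error_type  # only present on error
    return attrs


ok_attrs = build_span_attributes(
    provider="openai", model="gpt-4o", operation="chat", latency_ms=12.5
)
err_attrs = build_span_attributes(
    provider="openai", model="gpt-4o", operation="chat",
    latency_ms=3.0, error_type="TimeoutError",
)
```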

record_chat_span(*, provider, model, latency_ms, input_tokens=0, output_tokens=0, cache_hit='miss', tool_calls=0, status='ok', error_type=None)[source]

Record a completed chat or stream span.

Parameters:

provider (str) – Provider string ("openai", "google", "anthropic").
model (str) – Model identifier (e.g. "gpt-4o").
latency_ms (float) – Total wall-clock latency of the LLM call in milliseconds.
input_tokens (int) – Number of prompt tokens consumed (0 for cache hits).
output_tokens (int) – Number of completion tokens produced (0 for cache hits).
cache_hit (str) – "exact", "semantic", or "miss".
tool_calls (int) – Number of tool calls in the response.
status (str) – "ok" or "error".
error_type (str | None) – Exception class name when status == "error", else None.

Return type:

None

record_embed_span(*, provider, model, latency_ms, input_tokens=0, status='ok', error_type=None)[source]

Record a completed embedding span.

Parameters:

provider (str) – Provider string ("openai" or "google").
model (str) – Embedding model identifier.
latency_ms (float) – Total wall-clock latency in milliseconds.
input_tokens (int) – Number of tokens embedded.
status (str) – "ok" or "error".
error_type (str | None) – Exception class name when status == "error", else None.

Return type:

None

property spans: list[SpanRecord]

Return all captured in-memory spans.

Only populated when in_memory=True. Thread-safe.

Returns: list[SpanRecord] – Snapshot of all recorded spans (newest last).

clear_spans()[source]

Clear all in-memory spans.

Only has effect when in_memory=True.

Return type: None
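The snapshot and clear behaviour described for spans and clear_spans() follows a common pattern: a list guarded by a lock, with reads returning a copy. The sketch below shows that pattern with the standard library only. DemoSpan and InMemorySpanStore are hypothetical names; this is not the RactoTracer implementation, and DemoSpan carries only a few of the SpanRecord fields.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DemoSpan:
    """Hypothetical, reduced stand-in for SpanRecord."""
    name: str
    model: str
    status: str = "ok"
    timestamp: float = field(default_factory=time.time)


class InMemorySpanStore:
    """Lock-guarded span list; reads return a copy (a snapshot)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._spans: List[DemoSpan] = []

    def record(self, span: DemoSpan) -> None:
        with self._lock:
            self._spans.append(span)  # newest last

    @property
    def spans(self) -> List[DemoSpan]:
        with self._lock:
            return list(self._spans)  # copy, so callers cannot mutate the store

    def clear(self) -> None:
        with self._lock:
            self._spans.clear()


store = InMemorySpanStore()
store.record(DemoSpan(name="llm.chat", model="gpt-4o"))
snapshot = store.spans  # taken before clearing
store.clear()
```

Because the snapshot is a copy, it still holds the recorded span after clear() has emptied the store, which is convenient when a test clears state between assertions.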

class ractogateway.telemetry.SpanRecord(**data)[source]

Bases: BaseModel

In-memory span record — captured when RactoTracer(in_memory=True).

Useful for unit tests: inspect .spans after a call and assert on attributes without requiring an external OTEL backend.

Parameters:

name (str) – OTEL span name ("llm.chat" or "llm.embed").
provider (str) – Provider string — "openai", "google", or "anthropic".
model (str) – Model identifier as passed to the kit (e.g. "gpt-4o").
operation (str) – Operation type — "chat", "stream", or "embed".
latency_ms (float) – Total wall-clock latency of the LLM call in milliseconds.
input_tokens (int) – Number of prompt / input tokens consumed.
output_tokens (int) – Number of completion / output tokens produced.
cost_usd (float) – Estimated cost in USD derived from the built-in pricing table, including any price_table overrides passed to RactoTracer.
cache_hit (str) – Which cache served the result: "exact", "semantic", or "miss" when the LLM API was actually called.
tool_calls (int) – Number of tool calls present in the response.
status (str) – "ok" on success, "error" on exception.
error_type (str | None) – Exception class name when status == "error", else None.
timestamp (float) – Unix timestamp (time.time()) when the span was recorded.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

name: str

provider: str

model: str

operation: str

latency_ms: float

input_tokens: int

output_tokens: int

cost_usd: float

cache_hit: str

tool_calls: int

status: str

error_type: str | None

timestamp: float

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

ractogateway.telemetry.compute_cost(model, input_tokens, output_tokens, extra_table=None)[source]

Compute the estimated USD cost for a single LLM call.

Parameters:

model (str) – Model identifier (e.g. "gpt-4o"). If not found in the combined table the function returns 0.0.
input_tokens (int) – Number of prompt tokens consumed.
output_tokens (int) – Number of completion tokens produced.
extra_table (dict[str, ModelPricing] | None) – Optional {model: ModelPricing} dict to override or extend DEFAULT_COST_TABLE. Extra entries win over defaults.

Return type:

float

Returns:

float – Estimated cost in USD, or 0.0 when the model is unknown.
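The arithmetic implied by per-million pricing can be sketched as follows. This is an illustration under stated assumptions, not the library's source: DemoPricing, DEMO_DEFAULTS, and estimate_cost are hypothetical stand-ins for ModelPricing, DEFAULT_COST_TABLE, and compute_cost, and the prices are invented. The sketch mirrors the documented behaviour that extra entries win over defaults and that an unknown model yields 0.0.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DemoPricing:
    """Hypothetical stand-in for ModelPricing (USD per 1M tokens)."""
    input_per_million: float
    output_per_million: float


# Invented prices for illustration only.
DEMO_DEFAULTS: Dict[str, DemoPricing] = {"demo-model": DemoPricing(2.0, 8.0)}


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    extra_table: Optional[Dict[str, DemoPricing]] = None,
) -> float:
    """Estimate USD cost; entries in extra_table override the defaults."""
    table = {**DEMO_DEFAULTS, **(extra_table or {})}
    pricing = table.get(model)
    if pricing is None:
        return 0.0  # unknown model
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


base = estimate_cost("demo-model", 1000, 500)
overridden = estimate_cost(
    "demo-model", 1000, 500, {"demo-model": DemoPricing(1.0, 4.0)}
)
unknown = estimate_cost("no-such-model", 1000, 500)
```

With the invented prices, 1000 input tokens and 500 output tokens cost (1000 × 2.0 + 500 × 8.0) / 1,000,000 = 0.006 USD, and the override halves that to 0.003 USD.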