ractogateway.telemetry

RactoGateway Telemetry — OpenTelemetry tracing + Prometheus metrics.

Provides production-grade observability for every LLM call made through any RactoGateway developer kit.

Quick start:

from ractogateway import openai_developer_kit as opd
from ractogateway.telemetry import RactoTracer, GatewayMetricsMiddleware, PrometheusExporter

# --- OTEL tracing ---
tracer = RactoTracer(otlp_endpoint="http://localhost:4317")

# --- Prometheus metrics ---
metrics = GatewayMetricsMiddleware()
PrometheusExporter(port=8000).start()

kit = opd.OpenAIDeveloperKit(
    model="gpt-4o",
    default_prompt=my_prompt,  # my_prompt: a prompt object you have defined elsewhere
    tracer=tracer,
    metrics=metrics,
)
response = kit.chat(opd.ChatConfig(user_message="Hello"))

Install extras:

pip install ractogateway[telemetry]    # OpenTelemetry tracing
pip install ractogateway[prometheus]   # Prometheus metrics
pip install ractogateway[observability]  # both

Public API

RactoTracer

OpenTelemetry tracer. Pass as tracer= to any kit.

GatewayMetricsMiddleware

Prometheus metrics collector. Pass as metrics= to any kit.

PrometheusExporter

HTTP server for Prometheus scraping.

ModelPricing

Per-model input/output pricing (USD per 1M tokens).

SpanRecord

In-memory span record for test assertions.

DEFAULT_COST_TABLE

Built-in pricing table.

compute_cost()

Compute estimated USD cost from token counts.

class ractogateway.telemetry.GatewayMetricsMiddleware(*, price_table=None, registry=None)[source]

Bases: object

Prometheus metrics middleware — pass as metrics= to any developer kit.

A single instance can be shared across multiple kits (different providers) to aggregate metrics in one registry.

Parameters:
  • price_table (dict[str, ModelPricing] | None) – Override or extend the built-in pricing table used for the ractogateway_cost_usd_total counter.

  • registry (Any | None) – Custom prometheus_client.CollectorRegistry. Defaults to the global REGISTRY (which also includes default Python metrics). Pass prometheus_client.CollectorRegistry() to get an isolated registry — useful in tests.

Requires pip install ractogateway[prometheus].

record_request(*, provider, model, operation, status, latency_s, input_tokens=0, output_tokens=0, tool_calls=None)[source]

Record metrics for a completed LLM request.

Parameters:
  • provider (str) – Provider string ("openai", "google", "anthropic").

  • model (str) – Model identifier (e.g. "gpt-4o").

  • operation (str) – "chat", "stream", or "embed".

  • status (str) – "ok" or "error".

  • latency_s (float) – Request wall-clock latency in seconds.

  • input_tokens (int) – Prompt tokens consumed (0 for cache hits or errors).

  • output_tokens (int) – Completion tokens produced (0 for cache hits or errors).

  • tool_calls (list[Any] | None) – List of ToolCallResult objects from the response. Used to update ractogateway_tool_calls_total.

Return type:

None
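As a rough illustration of where the status and latency_s arguments could come from, the sketch below times a call and reports the outcome. FakeMetrics and timed_call are hypothetical names introduced here; FakeMetrics only mirrors the documented keyword signature so the snippet runs without ractogateway installed. When metrics= is passed to a kit, the kit presumably does this bookkeeping itself.

```python
import time


class FakeMetrics:
    """Hypothetical stand-in mirroring the documented record_request keywords."""

    def __init__(self):
        self.calls = []

    def record_request(self, *, provider, model, operation, status, latency_s,
                       input_tokens=0, output_tokens=0, tool_calls=None):
        self.calls.append({
            "provider": provider, "model": model, "operation": operation,
            "status": status, "latency_s": latency_s,
            "input_tokens": input_tokens, "output_tokens": output_tokens,
            "tool_calls": tool_calls,
        })


def timed_call(metrics, fn, *, provider, model, operation="chat"):
    """Run fn(), then report wall-clock latency in seconds and "ok"/"error"."""
    start = time.perf_counter()
    status = "error"  # assume failure until fn() returns
    try:
        result = fn()
        status = "ok"
        return result
    finally:
        # Runs on both paths, so failed requests are counted too.
        metrics.record_request(
            provider=provider, model=model, operation=operation,
            status=status, latency_s=time.perf_counter() - start,
        )


def failing_call():
    raise RuntimeError("simulated provider failure")


metrics = FakeMetrics()
timed_call(metrics, lambda: "hello", provider="openai", model="gpt-4o")
try:
    timed_call(metrics, failing_call, provider="openai", model="gpt-4o")
except RuntimeError:
    pass  # the exception still propagates after the metric is recorded

print([c["status"] for c in metrics.calls])  # ['ok', 'error']
```

Note that latency_s is in seconds here, whereas the tracer's span methods take latency_ms in milliseconds.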

record_cache_hit(cache_type)[source]

Increment the cache-hits counter.

Parameters:

cache_type (str) – "exact" or "semantic".

Return type:

None

record_cache_miss(cache_type)[source]

Increment the cache-misses counter.

Parameters:

cache_type (str) – "exact" or "semantic".

Return type:

None
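The two cache counters are typically read together as a hit rate per cache_type. The sketch below models them with collections.Counter; CacheCounters and hit_rate are hypothetical names, not part of the library, and a real deployment would more likely compute the ratio in a Prometheus query.

```python
from collections import Counter


class CacheCounters:
    """Hypothetical stand-in for the two cache counters, keyed by cache_type."""

    def __init__(self):
        self.hits = Counter()
        self.misses = Counter()

    def record_cache_hit(self, cache_type):
        self.hits[cache_type] += 1

    def record_cache_miss(self, cache_type):
        self.misses[cache_type] += 1

    def hit_rate(self, cache_type):
        """Fraction of lookups served from the cache; 0.0 when none were recorded."""
        total = self.hits[cache_type] + self.misses[cache_type]
        return self.hits[cache_type] / total if total else 0.0


counters = CacheCounters()
for _ in range(3):
    counters.record_cache_hit("exact")
counters.record_cache_miss("exact")
counters.record_cache_miss("semantic")

print(counters.hit_rate("exact"))     # 0.75
print(counters.hit_rate("semantic"))  # 0.0
```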

generate_latest()[source]

Return current metrics in Prometheus text exposition format.

Useful for testing without starting an HTTP server:

text = middleware.generate_latest()
assert "ractogateway_requests_total" in text

Return type:

str

Returns:

str – UTF-8 decoded Prometheus text format string.
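Tests often need the sample values rather than a substring match. The sketch below reads values out of a text exposition string with a deliberately simplified parser. sample_values is a hypothetical helper, and the SAMPLE text, including its label names, is made up for illustration rather than captured from the library.

```python
def sample_values(text, metric_name):
    """Return the float value of every sample line for metric_name.

    Simplified: assumes label values contain no spaces and that samples
    carry no trailing timestamp.
    """
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            values.append(float(value))
    return values


# Hypothetical output; the label names are illustrative only.
SAMPLE = """\
# HELP ractogateway_requests_total Total LLM requests.
# TYPE ractogateway_requests_total counter
ractogateway_requests_total{provider="openai",status="ok"} 3.0
ractogateway_requests_total{provider="openai",status="error"} 1.0
"""

print(sum(sample_values(SAMPLE, "ractogateway_requests_total")))  # 4.0
```

For anything beyond a quick assertion, prometheus_client ships its own parser, prometheus_client.parser.text_string_to_metric_families, which handles escaping and timestamps.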

class ractogateway.telemetry.ModelPricing(**data)[source]

Bases: BaseModel

USD cost per 1 million tokens for a specific model.

Parameters:
  • input_per_million (float) – Price in USD for 1 million input (prompt) tokens.

  • output_per_million (float) – Price in USD for 1 million output (completion) tokens.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

input_per_million: float
output_per_million: float
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.telemetry.PrometheusExporter(port=8000, *, registry=None)[source]

Bases: object

Start an HTTP server that exposes all registered Prometheus metrics.

The server listens on http://0.0.0.0:<port>/metrics and responds with the standard Prometheus text exposition format. This is designed to be used alongside GatewayMetricsMiddleware but works with any metrics registered in the global registry.

Parameters:
  • port (int) – HTTP port to listen on. Defaults to 8000.

  • registry (Any | None) – Custom prometheus_client.CollectorRegistry. When None the global REGISTRY is used.

Example:

from ractogateway.telemetry import GatewayMetricsMiddleware, PrometheusExporter

metrics = GatewayMetricsMiddleware()
exporter = PrometheusExporter(port=9090)
exporter.start()
# Prometheus can now scrape http://localhost:9090/metrics
exporter.stop()

Requires pip install ractogateway[prometheus].

start()[source]

Start the metrics HTTP server in a background daemon thread.

Calling start() a second time on an already-running exporter is a no-op.

Return type:

None

stop()[source]

Shut down the metrics HTTP server.

Safe to call even if the server was never started.

Return type:

None

property port: int

The configured HTTP port.

property is_running: bool

True when the HTTP server is active.
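The lifecycle described above (background daemon thread, idempotent start(), stop() that is safe before start()) can be sketched with the standard library alone. SketchExporter is a hypothetical stand-in, not the library's implementation: it serves a fixed payload, binds to 127.0.0.1 rather than 0.0.0.0, and uses port 0 so the OS picks a free port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SketchExporter:
    """Hypothetical stand-in: serves a fixed payload at /metrics from a daemon thread."""

    def __init__(self, port=0, payload=b"# no metrics yet\n"):
        self._port = port  # 0 asks the OS for a free port
        self._payload = payload
        self._server = None
        self._lock = threading.Lock()

    def start(self):
        with self._lock:
            if self._server is not None:
                return  # already running: a second start() is a no-op
            payload = self._payload

            class Handler(BaseHTTPRequestHandler):
                def do_GET(self):
                    if self.path != "/metrics":
                        self.send_error(404)
                        return
                    self.send_response(200)
                    self.send_header("Content-Type", "text/plain; version=0.0.4")
                    self.send_header("Content-Length", str(len(payload)))
                    self.end_headers()
                    self.wfile.write(payload)

                def log_message(self, *args):
                    pass  # keep the example quiet

            self._server = ThreadingHTTPServer(("127.0.0.1", self._port), Handler)
            self._port = self._server.server_address[1]
            threading.Thread(target=self._server.serve_forever, daemon=True).start()

    def stop(self):
        with self._lock:
            if self._server is None:
                return  # never started: nothing to do
            self._server.shutdown()
            self._server.server_close()
            self._server = None

    @property
    def port(self):
        return self._port

    @property
    def is_running(self):
        return self._server is not None


exporter = SketchExporter()
exporter.stop()   # safe before start()
exporter.start()
exporter.start()  # no-op
conn = http.client.HTTPConnection("127.0.0.1", exporter.port, timeout=5)
conn.request("GET", "/metrics")
body = conn.getresponse().read()
conn.close()
exporter.stop()
```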

generate_latest()[source]

Return current metrics in Prometheus text format.

No HTTP server required — useful for testing or embedding the metrics in a custom endpoint:

text = exporter.generate_latest()
assert "process_resident_memory_bytes" in text

Return type:

str

Returns:

str – UTF-8 decoded Prometheus text exposition string.

class ractogateway.telemetry.RactoTracer(*, service_name='ractogateway', otlp_endpoint=None, otlp_http_endpoint=None, console=False, in_memory=False, custom_exporter=None, price_table=None)[source]

Bases: object

OpenTelemetry tracer — pass as tracer= to any developer kit.

Records one span per LLM call with attributes for latency, token usage, estimated cost, cache-hit type, and tool-call count.

Supports OTLP gRPC (Jaeger / Grafana Tempo), OTLP HTTP, console stdout, in-memory capture (for tests), and any custom opentelemetry.sdk.trace.export.SpanExporter.

Parameters:
  • service_name (str) – OTEL service.name resource attribute. Defaults to "ractogateway".

  • otlp_endpoint (str | None) – OTLP gRPC endpoint (e.g. "http://localhost:4317"). Requires pip install ractogateway[telemetry].

  • otlp_http_endpoint (str | None) – OTLP HTTP endpoint (e.g. "http://localhost:4318"). Requires pip install ractogateway[telemetry].

  • console (bool) – Also print spans to stdout — convenient during local development.

  • in_memory (bool) – Capture spans internally in a thread-safe list. Access recorded spans via the spans property. Useful for unit tests — no external backend required.

  • custom_exporter (Any | None) – Any opentelemetry.sdk.trace.export.SpanExporter instance.

  • price_table (dict[str, ModelPricing] | None) – Override or extend the built-in DEFAULT_COST_TABLE. Keys are model identifiers; values are ModelPricing objects.

All spans carry the following OTEL attributes:

  • llm.provider – "openai" / "google" / "anthropic"

  • llm.model – e.g. "gpt-4o"

  • llm.operation – "chat" / "stream" / "embed"

  • llm.latency_ms – wall-clock time in milliseconds

  • llm.input_tokens – prompt tokens consumed

  • llm.output_tokens – completion tokens produced

  • llm.cost_usd – estimated USD cost (8 decimal places)

  • llm.cache_hit – "exact" / "semantic" / "miss"

  • llm.tool_calls – number of tool calls in the response

  • llm.error_type – exception class name on error (omitted on success)

record_chat_span(*, provider, model, latency_ms, input_tokens=0, output_tokens=0, cache_hit='miss', tool_calls=0, status='ok', error_type=None)[source]

Record a completed chat or stream span.

Parameters:
  • provider (str) – Provider string ("openai", "google", "anthropic").

  • model (str) – Model identifier (e.g. "gpt-4o").

  • latency_ms (float) – Total wall-clock latency of the LLM call in milliseconds.

  • input_tokens (int) – Number of prompt tokens consumed (0 for cache hits).

  • output_tokens (int) – Number of completion tokens produced (0 for cache hits).

  • cache_hit (str) – "exact", "semantic", or "miss".

  • tool_calls (int) – Number of tool calls in the response.

  • status (str) – "ok" or "error".

  • error_type (str | None) – Exception class name when status == "error", else None.

Return type:

None
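A minimal sketch of how the latency_ms, status, and error_type arguments might be derived around a call. chat_span_kwargs is a hypothetical helper introduced for illustration; it only builds a dict whose keys match the documented keyword names and does not touch the tracer.

```python
import time


def chat_span_kwargs(fn, *, provider, model):
    """Call fn() and build keyword arguments shaped like record_chat_span's."""
    start = time.perf_counter()
    status, error_type = "ok", None
    try:
        fn()
    except Exception as exc:  # recorded, not re-raised, to keep the demo short
        status, error_type = "error", type(exc).__name__
    latency_ms = (time.perf_counter() - start) * 1000.0  # seconds -> milliseconds
    return {
        "provider": provider,
        "model": model,
        "latency_ms": latency_ms,
        "status": status,
        "error_type": error_type,
    }


def times_out():
    raise TimeoutError("simulated timeout")


ok = chat_span_kwargs(lambda: None, provider="openai", model="gpt-4o")
failed = chat_span_kwargs(times_out, provider="openai", model="gpt-4o")
print(ok["status"], ok["error_type"])          # ok None
print(failed["status"], failed["error_type"])  # error TimeoutError
```

Because the keys match the signature above, such a dict could be unpacked as tracer.record_chat_span(**kwargs); the remaining arguments fall back to their documented defaults.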

record_embed_span(*, provider, model, latency_ms, input_tokens=0, status='ok', error_type=None)[source]

Record a completed embedding span.

Parameters:
  • provider (str) – Provider string ("openai" or "google").

  • model (str) – Embedding model identifier.

  • latency_ms (float) – Total wall-clock latency in milliseconds.

  • input_tokens (int) – Number of tokens embedded.

  • status (str) – "ok" or "error".

  • error_type (str | None) – Exception class name when status == "error", else None.

Return type:

None

property spans: list[SpanRecord]

Return all captured in-memory spans.

Only populated when in_memory=True. Thread-safe.

Returns:

list[SpanRecord] – Snapshot of all recorded spans (newest last).

clear_spans()[source]

Clear all in-memory spans.

Only has effect when in_memory=True.

Return type:

None
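One common way to obtain the thread-safe snapshot behaviour described for spans and clear_spans() is a lock-guarded list that hands out copies. The sketch below shows that pattern; InMemorySpanStore is a hypothetical stand-in and may differ from how RactoTracer stores spans internally.

```python
import threading


class InMemorySpanStore:
    """Hypothetical stand-in for the in-memory capture described above."""

    def __init__(self):
        self._lock = threading.Lock()
        self._spans = []

    def add(self, span):
        with self._lock:
            self._spans.append(span)

    @property
    def spans(self):
        with self._lock:
            return list(self._spans)  # copy, so later writes do not affect the caller

    def clear_spans(self):
        with self._lock:
            self._spans.clear()


store = InMemorySpanStore()


def worker(worker_id):
    for i in range(100):
        store.add({"name": "llm.chat", "worker": worker_id, "seq": i})


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

snapshot = store.spans
store.clear_spans()
print(len(snapshot), len(store.spans))  # 400 0
```

Returning a copy means a test can clear the store between cases and still hold on to an earlier snapshot.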

class ractogateway.telemetry.SpanRecord(**data)[source]

Bases: BaseModel

In-memory span record — captured when RactoTracer(in_memory=True).

Useful for unit tests: inspect .spans after a call and assert on attributes without requiring an external OTEL backend.

Parameters:
  • name (str) – OTEL span name ("llm.chat" or "llm.embed").

  • provider (str) – Provider string — "openai", "google", or "anthropic".

  • model (str) – Model identifier as passed to the kit (e.g. "gpt-4o").

  • operation (str) – Operation type — "chat", "stream", or "embed".

  • latency_ms (float) – Total wall-clock latency of the LLM call in milliseconds.

  • input_tokens (int) – Number of prompt / input tokens consumed.

  • output_tokens (int) – Number of completion / output tokens produced.

  • cost_usd (float) – Estimated cost in USD derived from the built-in pricing table.

  • cache_hit (str) – Which cache served the result: "exact", "semantic", or "miss" when the LLM API was actually called.

  • tool_calls (int) – Number of tool calls present in the response.

  • status (str) – "ok" on success, "error" on exception.

  • error_type (str | None) – Exception class name when status == "error", else None.

  • timestamp (float) – Unix timestamp (time.time()) when the span was recorded.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

name: str
provider: str
model: str
operation: str
latency_ms: float
input_tokens: int
output_tokens: int
cost_usd: float
cache_hit: str
tool_calls: int
status: str
error_type: str | None
timestamp: float
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
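The kind of assertion these records enable in a unit test can be sketched as follows. Span is a hypothetical dataclass carrying only a subset of the SpanRecord fields so the snippet runs without ractogateway, and every value in it is illustrative rather than measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in carrying a subset of the SpanRecord fields."""
    name: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    status: str
    error_type: Optional[str] = None


# Illustrative values only, not real measurements or real prices.
spans = [
    Span("llm.chat", "gpt-4o", 120, 30, 0.0005, "ok"),
    Span("llm.chat", "gpt-4o", 0, 0, 0.0, "error", "TimeoutError"),
    Span("llm.embed", "embedding-model", 64, 0, 0.000002, "ok"),
]

errors = [s for s in spans if s.status == "error"]
total_tokens = sum(s.input_tokens + s.output_tokens for s in spans)
total_cost = round(sum(s.cost_usd for s in spans), 8)

assert [s.error_type for s in errors] == ["TimeoutError"]
assert total_tokens == 214
```

With the real class, the same expressions would presumably run over tracer.spans after constructing RactoTracer(in_memory=True).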

ractogateway.telemetry.compute_cost(model, input_tokens, output_tokens, extra_table=None)[source]

Compute the estimated USD cost for a single LLM call.

Parameters:
  • model (str) – Model identifier (e.g. "gpt-4o"). If not found in the combined table the function returns 0.0.

  • input_tokens (int) – Number of prompt tokens consumed.

  • output_tokens (int) – Number of completion tokens produced.

  • extra_table (dict[str, ModelPricing] | None) – Optional {model: ModelPricing} dict to override or extend DEFAULT_COST_TABLE. Extra entries win over defaults.

Return type:

float

Returns:

float – Estimated cost in USD, or 0.0 when the model is unknown.
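The documented behaviour (per-million pricing, extra entries overriding defaults, 0.0 for unknown models) suggests arithmetic along the following lines. This is a sketch under that assumption, not the library's source: Pricing, DEFAULTS, and sketch_compute_cost are hypothetical names, the prices are invented, and any rounding the real function applies is not reproduced.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Pricing:
    """Stand-in for ModelPricing: USD per 1 million tokens."""
    input_per_million: float
    output_per_million: float


# Hypothetical prices for illustration; not the library's DEFAULT_COST_TABLE.
DEFAULTS: Dict[str, Pricing] = {"example-model": Pricing(2.0, 8.0)}


def sketch_compute_cost(model: str, input_tokens: int, output_tokens: int,
                        extra_table: Optional[Dict[str, Pricing]] = None) -> float:
    table = {**DEFAULTS, **(extra_table or {})}  # extra entries win over defaults
    pricing = table.get(model)
    if pricing is None:
        return 0.0  # unknown model
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000


override = {"example-model": Pricing(1.0, 1.0)}
print(sketch_compute_cost("example-model", 1_000, 500))            # 0.006
print(sketch_compute_cost("unknown-model", 1_000, 500))            # 0.0
print(sketch_compute_cost("example-model", 1_000, 500, override))  # 0.0015
```

In the first call, 1,000 input tokens at $2.00 per million cost $0.002 and 500 output tokens at $8.00 per million cost $0.004, for $0.006 in total.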