# API Reference: Video Processor Pipeline

Module: `ractogateway.pipelines.video_processor`

```bash
pip install "ractogateway[pipelines-video]"          # core
pip install "ractogateway[pipelines-video-whisper]"  # + faster-whisper
pip install "ractogateway[pipelines-video-yt]"       # + YouTube via yt-dlp
pip install "ractogateway[pipelines-video-full]"     # all of the above
```

---

## VideoProcessorPipeline

```python
class VideoProcessorPipeline
```

Five-stage pipeline that turns a raw video into structured knowledge: frame extraction → deduplication → audio transcription → vision LLM analysis → summary. Optionally stores everything in a RactoRAG vector store for Q&A.

### Constructor

```python
VideoProcessorPipeline(
    kit,
    *,
    analysis_kit=None,
    summary_kit=None,
    # Transcription
    transcriber=TranscriberBackend.FASTER_WHISPER,
    transcriber_model="base",
    transcriber_api_key=None,
    transcriber_base_url=None,
    # Frame extraction
    fps=1.0,
    similarity_threshold=90.0,
    dedup_method=DeduplicationMethod.PHASH,
    max_frames=None,
    frame_format="JPEG",
    # Vision analysis
    frame_analysis_mode=FrameAnalysisMode.INDIVIDUAL,
    grid_size=4,
    batch_size=10,
    max_workers=4,
    max_process_workers=4,
    language=None,
    # Feature flags
    transcribe_audio=True,
    analyze_frames=True,
    generate_summary=True,
    # Processing mode (documented in the parameter table below)
    processing_mode=VideoProcessingMode.ACTIVE,
    focus_time_seconds=None,
    window_seconds=5.0,
    # Integrations
    rag_pipeline=None,
    # Safety & observability
    safe_mode=False,
    tracer=None,
    metrics=None,
    rate_limiter=None,
    user_id="default",
)
```

**Parameters**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `kit` | `Any` | required | Main developer kit; used for summary and fallback frame analysis |
| `analysis_kit` | `Any \| None` | `None` | Separate kit for frame-by-frame vision analysis (falls back to `kit`) |
| `summary_kit` | `Any \| None` | `None` | Separate kit for summary generation (falls back to `kit`) |
| `transcriber` | `TranscriberBackend` | `FASTER_WHISPER` | Audio transcription backend |
| `transcriber_model` | `str` | `"base"` | Model name or size; interpretation is backend-specific |
| `transcriber_api_key` | `str \| None` | `None` | API key for cloud backends (or read from env var) |
| `transcriber_base_url` | `str \| None` | `None` | Base URL for self-hosted endpoints (Ollama etc.) |
| `fps` | `float` | `1.0` | Frames to sample per second of video |
| `similarity_threshold` | `float` | `90.0` | Discard a frame when its similarity to the previous kept frame is ≥ this % |
| `dedup_method` | `DeduplicationMethod` | `PHASH` | `PHASH` (fast) or `SSIM` (more accurate) |
| `max_frames` | `int \| None` | `None` | Hard cap on kept frames |
| `frame_format` | `str` | `"JPEG"` | `"JPEG"` or `"PNG"` |
| `frame_analysis_mode` | `FrameAnalysisMode` | `INDIVIDUAL` | `INDIVIDUAL` or `GRID` (collage) |
| `grid_size` | `int` | `4` | Frames per grid collage (GRID mode only) |
| `batch_size` | `int` | `10` | Concurrent LLM calls per analysis batch |
| `max_workers` | `int` | `4` | Thread-pool size for concurrent LLM calls |
| `max_process_workers` | `int` | `4` | Process-pool size for CPU-bound extraction |
| `language` | `str \| None` | `None` | BCP-47 code (e.g. `"en"`); `None` = auto-detect |
| `transcribe_audio` | `bool` | `True` | Extract and transcribe the audio track |
| `analyze_frames` | `bool` | `True` | Pass kept frames to the vision LLM |
| `generate_summary` | `bool` | `True` | Generate a comprehensive Markdown summary |
| `processing_mode` | `VideoProcessingMode` | `ACTIVE` | `ACTIVE` = full video; `PASSIVE` = focused time window |
| `focus_time_seconds` | `float \| None` | `None` | Center timestamp for passive mode (seconds or parseable string) |
| `window_seconds` | `float` | `5.0` | Passive-mode half-window: processes `focus ± window` seconds |
| `rag_pipeline` | `Any \| None` | `None` | `RactoRAG` instance for optional storage |
| `safe_mode` | `bool` | `False` | Catch all exceptions; return them in `result.error` |
| `tracer` | `Any \| None` | `None` | `RactoTracer` for OTEL tracing |
| `metrics` | `Any \| None` | `None` | `GatewayMetricsMiddleware` for Prometheus |
| `rate_limiter` | `Any \| None` | `None` | Duck-typed rate limiter |
| `user_id` | `str` | `"default"` | Default user identifier for rate limiter |

---

### Methods

#### run

```python
def run(
    source: str | Path | bytes | list,
    *,
    fps=None,
    similarity_threshold=None,
    dedup_method=None,
    max_frames=None,
    analyze_frames=None,
    frame_analysis_mode=None,
    grid_size=None,
    batch_size=None,
    transcribe_audio=None,
    language=None,
    generate_summary=None,
    processing_mode=None,
    focus_time_seconds=None,
    window_seconds=None,
    store_in_rag=False,
    user_id=None,
) -> VideoProcessorResult
```

Process a video synchronously. All keyword arguments override the constructor defaults for this call only.
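
The "override for this call only" rule can be pictured with a small standalone sketch. It assumes, as the `None` defaults in the `run()` signature suggest, that `None` means "inherit the constructor value". The `resolve_overrides` helper and the `constructor_defaults` dictionary are hypothetical names used only for illustration; they are not part of `ractogateway`.

```python
from __future__ import annotations

from typing import Any


def resolve_overrides(defaults: dict[str, Any], **overrides: Any) -> dict[str, Any]:
    """Return per-call settings: an override wins unless it is None."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
    resolved = dict(defaults)  # copy, so the stored defaults stay untouched
    for name, value in overrides.items():
        if value is not None:  # assumption: None means "use the constructor default"
            resolved[name] = value
    return resolved


# Hypothetical constructor defaults, using values from the parameter table.
constructor_defaults = {"fps": 1.0, "similarity_threshold": 90.0, "max_frames": None}

call_settings = resolve_overrides(constructor_defaults, fps=0.5, max_frames=None)
print(call_settings)
# {'fps': 0.5, 'similarity_threshold': 90.0, 'max_frames': None}
```

One consequence of treating `None` as "inherit" is that a per-call argument cannot reset a non-`None` default back to `None`. Whether the real pipeline behaves this way is not stated in this reference.
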

**Accepted `source` types**

| Type | Example | Behaviour |
|------|---------|-----------|
| `str` (local path) | `"lecture.mp4"` | Opens with OpenCV directly |
| `Path` (local path) | `Path("lecture.mp4")` | Opens with OpenCV directly |
| `str` (HTTP/HTTPS URL) | `"https://cdn.example.com/v.mp4"` | Downloaded via `httpx` to temp file |
| `str` (YouTube URL) | `"https://youtu.be/abc"` | Downloaded via `yt-dlp` to temp file |
| `bytes` | `video_bytes` | Written to a temp file, then processed |
| `list[str \| Path]` | `["f0.jpg", "f1.jpg"]` | Pre-extracted frames; extraction step skipped |

#### arun

```python
async def arun(source, **kwargs) -> VideoProcessorResult
```

Async variant of `run()`. CPU-bound steps run in a `ThreadPoolExecutor`; LLM calls use `asyncio.gather` for concurrency.

#### answer_question

```python
def answer_question(
    source: str | Path | bytes | list,
    *,
    question: str,
    processing_mode: VideoProcessingMode | str = VideoProcessingMode.ACTIVE,
    focus_time: float | int | str | None = None,
    window_seconds: float = 5.0,
    max_context_chars: int = 40_000,
    **run_kwargs,
) -> VideoProcessorResult
```

Process a video, then answer a user question from the extracted timeline context. Combines passive-mode windowed processing with a QA LLM call. The `answer` and `question` fields are populated on the returned result.

`focus_time` accepts all formats supported by `parse_timestamp` (`130`, `"02:10"`, `"2 mins 10 sec"`).

#### aanswer_question

```python
async def aanswer_question(source, *, question, ...) -> VideoProcessorResult
```

Async variant of `answer_question`.

#### parse_timestamp

```python
@staticmethod
def parse_timestamp(value: float | int | str) -> float
```

Parse a human-readable timestamp into seconds.
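
As a rough illustration of this kind of parsing, the sketch below handles plain numbers, colon-separated clock strings, and unit-suffixed strings such as `"2 mins 10 sec"`. It is a standalone approximation, not the library's implementation: the function name `parse_timestamp_sketch`, the regular expressions, and the accepted unit spellings are all assumptions.

```python
from __future__ import annotations

import re

# Seconds per unit, keyed by the first letter of the unit (assumed spellings:
# "h", "hr", "hours", "m", "min", "mins", "s", "sec", ...).
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*(h|m|s)[a-z]*", re.IGNORECASE)


def parse_timestamp_sketch(value: float | int | str) -> float:
    """Illustrative parser that converts a timestamp to seconds."""
    if isinstance(value, (int, float)):
        seconds = float(value)
    else:
        text = value.strip()
        if re.fullmatch(r"\d+(?:\.\d+)?", text):
            # Bare number given as a string, e.g. "130".
            seconds = float(text)
        elif re.fullmatch(r"\d+(?::\d+){1,2}", text):
            # MM:SS or HH:MM:SS; fold left to right in base 60.
            seconds = 0.0
            for part in text.split(":"):
                seconds = seconds * 60 + int(part)
        else:
            # Unit-suffixed tokens such as "1h 3m" or "2 mins 10 sec".
            tokens = _TOKEN.findall(text)
            leftover = _TOKEN.sub("", text).strip()
            if not tokens or leftover:
                raise ValueError(f"unrecognised timestamp: {value!r}")
            seconds = sum(float(n) * _UNIT_SECONDS[u.lower()] for n, u in tokens)
    if seconds < 0:
        raise ValueError("timestamp must be non-negative")
    return seconds


print(parse_timestamp_sketch("1:02:10"))  # 3730.0
```

The formats the real method accepts are the ones documented for `parse_timestamp` itself; the sketch may accept or reject edge cases differently.
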

Accepted formats:

| Input | Parsed as |
|-------|-----------|
| `130` / `130.0` | `130.0` seconds |
| `"02:10"` | `130.0` seconds (MM:SS) |
| `"1:02:10"` | `3730.0` seconds (HH:MM:SS) |
| `"2 mins 10 sec"` | `130.0` seconds |
| `"1h 3m"` | `3780.0` seconds |
| `"90s"` | `90.0` seconds |

Raises `ValueError` for negative values or unrecognised formats.

---

## AsyncVideoProcessorPipeline

```python
class AsyncVideoProcessorPipeline
```

Async-only variant of `VideoProcessorPipeline`. Exposes a single `async run()` method, which makes it suitable for FastAPI endpoints. Constructor parameters are identical to `VideoProcessorPipeline`.

### Methods

#### run

```python
async def run(source, **kwargs) -> VideoProcessorResult
```

---

## Models

### VideoProcessorResult

```python
class VideoProcessorResult(BaseModel)
```

Full output of a pipeline run.

| Field | Type | Description |
|-------|------|-------------|
| `video_path` | `str` | Source identifier (path, URL, or `""`) |
| `frames` | `list[FrameEntry]` | All extracted frames (kept and discarded) |
| `transcript` | `list[TranscriptSegment]` | Audio transcript with timestamps |
| `sections` | `list[VideoSection]` | Merged visual + audio sections |
| `summary` | `str \| None` | Comprehensive Markdown summary |
| `rag_stored` | `bool` | `True` if content was pushed to RactoRAG |
| `rag_chunk_count` | `int` | Number of chunks stored |
| `usage` | `VideoProcessorUsage` | Token and frame accounting |
| `error` | `str \| None` | Error message when `safe_mode=True` |
| `processing_mode` | `VideoProcessingMode` | Whether this run used `ACTIVE` or `PASSIVE` mode |
| `window_start_seconds` | `float \| None` | Passive-mode window start in source-video seconds |
| `window_end_seconds` | `float \| None` | Passive-mode window end in source-video seconds |
| `question` | `str \| None` | User question (set by `answer_question`) |
| `answer` | `str \| None` | LLM answer to `question` (set by `answer_question`) |

**Methods**

- `get_transcript_text() -> str`: Full transcript as a single string
- `get_all_visual_content() -> str`: All frame analyses in timestamp order
- `to_json(path=None, *, indent=2) -> str | None`: JSON export (image bytes excluded)
- `to_markdown(path=None) -> str | None`: Structured Markdown report

---

### FrameEntry

```python
class FrameEntry(BaseModel)
```

| Field | Type | Description |
|-------|------|-------------|
| `frame_id` | `int` | Zero-based sequential frame ID |
| `timestamp` | `float` | Position in the video in seconds |
| `similarity_to_prev` | `float \| None` | Similarity % to previous kept frame |
| `kept` | `bool` | `False` if discarded by deduplication |
| `analysis` | `str \| None` | LLM-generated content description |
| `image_data` | `bytes \| None` | Raw image bytes (kept frames only) |
| `image_format` | `str` | `"JPEG"` or `"PNG"` |

---

### TranscriptSegment

```python
class TranscriptSegment(BaseModel)
```

| Field | Type | Description |
|-------|------|-------------|
| `start` | `float` | Segment start time (seconds) |
| `end` | `float` | Segment end time (seconds) |
| `text` | `str` | Transcribed text |
| `frame_ids` | `list[int]` | Kept frame IDs within this time window |

---

### VideoSection

```python
class VideoSection(BaseModel)
```

Merged time section combining visual analyses + audio transcript.

| Field | Type | Description |
|-------|------|-------------|
| `timestamp_start` | `float` | Section start (seconds) |
| `timestamp_end` | `float` | Section end (seconds) |
| `frame_ids` | `list[int]` | Frame IDs in this section |
| `visual_content` | `str` | Combined LLM frame analyses |
| `audio_content` | `str` | Transcript text for this window |

---

### VideoProcessorUsage

```python
class VideoProcessorUsage(BaseModel)
```

| Field | Type | Description |
|-------|------|-------------|
| `frames_extracted` | `int` | Total frames sampled from video |
| `frames_kept` | `int` | Frames that passed deduplication |
| `frames_discarded` | `int` | Frames removed by deduplication |
| `analysis_input_tokens` | `int` | Prompt tokens for frame analysis |
| `analysis_output_tokens` | `int` | Completion tokens for frame analysis |
| `summary_input_tokens` | `int` | Prompt tokens for summary |
| `summary_output_tokens` | `int` | Completion tokens for summary |
| `audio_duration_seconds` | `float` | Duration of extracted audio |

**Properties**

- `total_analysis_tokens`: `analysis_input_tokens + analysis_output_tokens`
- `total_summary_tokens`: `summary_input_tokens + summary_output_tokens`
- `total_tokens`: Sum of all tokens

---

### VideoConfig

```python
class VideoConfig(BaseModel)
```

Convenience model bundling all pipeline hyperparameters (mirrors constructor params).
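
The derived token totals listed under `VideoProcessorUsage` are simple sums, which the following stand-in makes concrete. It uses a plain dataclass rather than the Pydantic `BaseModel` the real class is built on. The name `UsageSketch` is hypothetical, and reading `total_tokens` as analysis tokens plus summary tokens is an assumption based on the "sum of all tokens" description.

```python
from dataclasses import dataclass


@dataclass
class UsageSketch:
    """Illustrative stand-in for the token fields of VideoProcessorUsage."""

    analysis_input_tokens: int = 0
    analysis_output_tokens: int = 0
    summary_input_tokens: int = 0
    summary_output_tokens: int = 0

    @property
    def total_analysis_tokens(self) -> int:
        return self.analysis_input_tokens + self.analysis_output_tokens

    @property
    def total_summary_tokens(self) -> int:
        return self.summary_input_tokens + self.summary_output_tokens

    @property
    def total_tokens(self) -> int:
        # Assumption: "all tokens" means analysis plus summary tokens.
        return self.total_analysis_tokens + self.total_summary_tokens


usage = UsageSketch(
    analysis_input_tokens=1200,
    analysis_output_tokens=300,
    summary_input_tokens=500,
    summary_output_tokens=150,
)
print(usage.total_analysis_tokens, usage.total_summary_tokens, usage.total_tokens)
# 1500 650 2150
```
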

---

## Enums

### TranscriberBackend

```python
class TranscriberBackend(str, Enum)
```

| Value | Backend | Extra required |
|-------|---------|----------------|
| `"faster-whisper"` | faster-whisper lib (default) | `pipelines-video-whisper` |
| `"openai-whisper"` | openai-whisper lib | `pipelines-video-openai-whisper` |
| `"huggingface-local"` | HF transformers ASR | `transformers torch` |
| `"openai-api"` | OpenAI Whisper API | `openai` (core) |
| `"google-api"` | Google Cloud Speech-to-Text v2 | `google-cloud-speech` |
| `"huggingface-api"` | HuggingFace Inference API | `huggingface_hub` |
| `"groq-api"` | Groq Whisper (ultra-fast) | `groq` |
| `"deepgram-api"` | Deepgram Nova | `deepgram-sdk` |
| `"ollama"` | Self-hosted Ollama | `ollama` (core) |

**`transcriber_model` values by backend**

| Backend | Valid model values |
|---------|--------------------|
| `faster-whisper` / `openai-whisper` | `"tiny"` `"base"` `"small"` `"medium"` `"large"` `"large-v2"` `"large-v3"` |
| `huggingface-local` / `huggingface-api` | Any HF model ID, e.g. `"openai/whisper-large-v3"` |
| `openai-api` | `"whisper-1"` |
| `google-api` | `"long"` `"short"` `"latest_long"` |
| `groq-api` | `"whisper-large-v3"` `"whisper-large-v3-turbo"` `"distil-whisper-large-v3-en"` |
| `deepgram-api` | `"nova-3"` `"nova-2"` `"enhanced"` `"base"` |
| `ollama` | Model name on your Ollama server, e.g. `"whisper"` |

---

### VideoProcessingMode

```python
class VideoProcessingMode(str, Enum)
```

| Value | Behaviour |
|-------|-----------|
| `"active"` | Process the full video (default) |
| `"passive"` | Process only a focused time window (`focus ± window_seconds`) |

---

### DeduplicationMethod

```python
class DeduplicationMethod(str, Enum)
```

| Value | Algorithm | Speed | Accuracy |
|-------|-----------|-------|----------|
| `"phash"` | Perceptual hash | Fast | Good for most videos |
| `"ssim"` | Structural Similarity | Slower | Higher accuracy |

---

### FrameAnalysisMode

```python
class FrameAnalysisMode(str, Enum)
```

| Value | Behaviour | Best for |
|-------|-----------|----------|
| `"individual"` | One LLM call per frame | High accuracy, whiteboard extraction |
| `"grid"` | Stitch N frames into a collage → one call | Cost reduction, dense videos |

---

## Exceptions

### VideoRateLimitExceededError

```python
class VideoRateLimitExceededError(RuntimeError)
```

Raised (or captured in `result.error` when `safe_mode=True`) when the `rate_limiter` denies a pipeline request.

---

## See also

- [Video Processor Guide](../guide/pipelines/video_processor.md)
- [Pipelines overview](../guide/pipelines.md)
- [RAG guide](../guide/rag.md)
- [Telemetry guide](../guide/telemetry.md)