API Reference — Video Processor Pipeline
Module: ractogateway.pipelines.video_processor
pip install "ractogateway[pipelines-video]" # core
pip install "ractogateway[pipelines-video-whisper]" # + faster-whisper
pip install "ractogateway[pipelines-video-yt]" # + YouTube via yt-dlp
pip install "ractogateway[pipelines-video-full]" # all of the above
VideoProcessorPipeline
class VideoProcessorPipeline
Five-stage pipeline that turns a raw video into structured knowledge: frame extraction → deduplication → audio transcription → vision LLM analysis → summary. Optionally stores everything in a RactoRAG vector store for Q&A.
Constructor
VideoProcessorPipeline(
    kit,
    *,
    analysis_kit=None,
    summary_kit=None,
    # Transcription
    transcriber=TranscriberBackend.FASTER_WHISPER,
    transcriber_model="base",
    transcriber_api_key=None,
    transcriber_base_url=None,
    # Frame extraction
    fps=1.0,
    similarity_threshold=90.0,
    dedup_method=DeduplicationMethod.PHASH,
    max_frames=None,
    frame_format="JPEG",
    # Vision analysis
    frame_analysis_mode=FrameAnalysisMode.INDIVIDUAL,
    grid_size=4,
    batch_size=10,
    max_workers=4,
    max_process_workers=4,
    language=None,
    # Feature flags
    transcribe_audio=True,
    analyze_frames=True,
    generate_summary=True,
    # Integrations
    rag_pipeline=None,
    # Safety & observability
    safe_mode=False,
    tracer=None,
    metrics=None,
    rate_limiter=None,
    user_id="default",
)
Parameters
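| Parameter | Type | Default | Description |
|---|---|---|---|
| kit | | required | Main developer kit, used for summary and fallback frame analysis |
| analysis_kit | | None | Separate kit for frame-by-frame vision analysis (falls back to kit) |
| summary_kit | | None | Separate kit for summary generation (falls back to kit) |
| transcriber | | TranscriberBackend.FASTER_WHISPER | Audio transcription backend |
| transcriber_model | | "base" | Model name / size; interpretation is backend-specific |
| transcriber_api_key | | None | API key for cloud backends (or read from env var) |
| transcriber_base_url | | None | Base URL for self-hosted endpoints (Ollama etc.) |
| fps | | 1.0 | Frames to sample per second of video |
| similarity_threshold | | 90.0 | Discard a frame when its similarity to the previous kept frame is ≥ this % |
| dedup_method | | DeduplicationMethod.PHASH | See DeduplicationMethod |
| max_frames | | None | Hard cap on kept frames |
| frame_format | | "JPEG" | |
| frame_analysis_mode | | FrameAnalysisMode.INDIVIDUAL | See FrameAnalysisMode |
| grid_size | | 4 | Frames per grid collage (GRID mode only) |
| batch_size | | 10 | Concurrent LLM calls per analysis batch |
| max_workers | | 4 | Thread-pool size for concurrent LLM calls |
| max_process_workers | | 4 | Process-pool size for CPU-bound extraction |
| language | | None | BCP-47 language code |
| transcribe_audio | | True | Extract and transcribe the audio track |
| analyze_frames | | True | Pass kept frames to the vision LLM |
| generate_summary | | True | Generate a comprehensive Markdown summary |
| processing_mode | | | See VideoProcessingMode |
| focus_time_seconds | | | Center timestamp for passive mode (seconds or parseable string) |
| window_seconds | | | Passive-mode half-window: processes focus_time_seconds ± window_seconds |
| rag_pipeline | | None | |
| safe_mode | | False | Catch all exceptions; return them in result.error |
| tracer | | None | |
| metrics | | None | |
| rate_limiter | | None | Duck-typed rate limiter |
| user_id | | "default" | Default user identifier for rate limiter |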
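The effect of similarity_threshold on deduplication can be illustrated with a toy example. The sketch below is not the library's implementation: it assumes frames are tiny grayscale grids, substitutes a simple average hash for a real perceptual hash, and the helper names `average_hash`, `similarity_percent`, and `deduplicate` are hypothetical.

```python
from typing import List, Optional, Sequence

Frame = Sequence[Sequence[int]]  # toy grayscale frame: rows of 0-255 values


def average_hash(frame: Frame) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def similarity_percent(a: List[int], b: List[int]) -> float:
    """Percentage of matching hash bits (100.0 means identical hashes)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def deduplicate(frames: Sequence[Frame], threshold: float = 90.0) -> List[int]:
    """Return the indices of kept frames.

    A frame is discarded when its similarity to the previous *kept*
    frame is greater than or equal to the threshold.
    """
    kept: List[int] = []
    last_hash: Optional[List[int]] = None
    for idx, frame in enumerate(frames):
        current = average_hash(frame)
        if last_hash is not None and similarity_percent(current, last_hash) >= threshold:
            continue  # near-duplicate of the last kept frame
        kept.append(idx)
        last_hash = current
    return kept


# Two nearly identical frames followed by a visibly different one.
dark_left = [[0, 0, 255, 255]] * 4
dark_left_noisy = [[5, 0, 250, 255]] * 4
dark_top = [[0, 0, 0, 0], [0, 0, 0, 0], [255, 255, 255, 255], [255, 255, 255, 255]]

kept = deduplicate([dark_left, dark_left_noisy, dark_top])
print(kept)
```

Lowering the threshold discards more frames, because a frame then needs to differ more from the last kept one to survive.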
Methods
run
def run(
    source: str | Path | bytes | list,
    *,
    fps=None,
    similarity_threshold=None,
    dedup_method=None,
    max_frames=None,
    analyze_frames=None,
    frame_analysis_mode=None,
    grid_size=None,
    batch_size=None,
    transcribe_audio=None,
    language=None,
    generate_summary=None,
    processing_mode=None,
    focus_time_seconds=None,
    window_seconds=None,
    store_in_rag=False,
    user_id=None,
) -> VideoProcessorResult
Process a video synchronously. All keyword arguments override the constructor defaults for this call only.
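The per-call override behaviour can be pictured as a merge of keyword arguments over stored defaults. The following is a minimal sketch, assuming that `None` means "use the constructor default", which the `None` defaults in the signature suggest but this reference does not state outright. `resolve_options` is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict


def resolve_options(defaults: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Merge per-call overrides into constructor defaults without mutating them."""
    resolved = dict(defaults)  # copy, so the stored defaults stay untouched
    for name, value in overrides.items():
        if name not in defaults:
            raise TypeError(f"unexpected option: {name!r}")
        if value is not None:  # None is treated as "not overridden"
            resolved[name] = value
    return resolved


constructor_defaults = {"fps": 1.0, "similarity_threshold": 90.0, "max_frames": None}

# Only fps is overridden for this call; the other options fall through.
per_call = resolve_options(constructor_defaults, fps=2.0, similarity_threshold=None)
print(per_call)
```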
Accepted source types
| Type | Example | Behaviour |
|---|---|---|
| str | | Opens with OpenCV directly |
| Path | | Opens with OpenCV directly |
| str (URL) | | Downloaded before processing |
| str (YouTube URL) | | Downloaded via yt-dlp |
| bytes | | Written to a temp file, then processed |
| list | | Pre-extracted frames; extraction step skipped |
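One way to picture the dispatch on source type is a small classifier. This is a sketch under assumptions: the category labels and the `classify_source` helper are hypothetical, and URL detection by `http://` / `https://` prefix is a guess at how remote sources might be recognised.

```python
from pathlib import Path
from typing import Union

Source = Union[str, Path, bytes, list]


def classify_source(source: Source) -> str:
    """Map a source value to a coarse handling category."""
    if isinstance(source, bytes):
        return "bytes"        # would be written to a temp file first
    if isinstance(source, list):
        return "frames"       # pre-extracted frames, extraction skipped
    if isinstance(source, Path):
        return "local_path"   # opened directly
    if isinstance(source, str):
        if source.startswith(("http://", "https://")):
            return "url"      # would be downloaded first
        return "local_path"
    raise TypeError(f"unsupported source type: {type(source).__name__}")


print(classify_source("lecture.mp4"))
print(classify_source(Path("lecture.mp4")))
print(classify_source("https://example.com/lecture.mp4"))
print(classify_source(b"\x00\x00"))
print(classify_source([]))
```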
arun
async def arun(source, **kwargs) -> VideoProcessorResult
Async variant of run(). CPU-bound steps run in a ThreadPoolExecutor;
LLM calls use asyncio.gather for concurrency.
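The concurrency pattern described for arun can be sketched with the standard library alone. The functions `extract_frames` and `analyze_frame` below are stand-ins invented for illustration; only `ThreadPoolExecutor`, `loop.run_in_executor`, and `asyncio.gather` are real APIs.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List


def extract_frames(n: int) -> List[int]:
    """Stand-in for CPU-bound frame extraction (runs in a worker thread)."""
    return list(range(n))


async def analyze_frame(frame_id: int) -> str:
    """Stand-in for an awaitable vision LLM call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"frame {frame_id}: analyzed"


async def process(n_frames: int) -> List[str]:
    loop = asyncio.get_running_loop()
    # Blocking work is pushed to a thread so the event loop stays responsive.
    with ThreadPoolExecutor(max_workers=4) as pool:
        frames = await loop.run_in_executor(pool, extract_frames, n_frames)
    # The per-frame calls are awaited concurrently; results keep input order.
    return await asyncio.gather(*(analyze_frame(f) for f in frames))


results = asyncio.run(process(3))
print(results)
```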
answer_question
def answer_question(
    source: str | Path | bytes | list,
    *,
    question: str,
    processing_mode: VideoProcessingMode | str = VideoProcessingMode.ACTIVE,
    focus_time: float | int | str | None = None,
    window_seconds: float = 5.0,
    max_context_chars: int = 40_000,
    **run_kwargs,
) -> VideoProcessorResult
Process a video then answer a user question from the extracted timeline
context. Combines passive-mode windowed processing with a QA LLM call.
The answer and question fields are populated on the returned result.
focus_time accepts all formats supported by parse_timestamp
(130, "02:10", "2 mins 10 sec").
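For passive-mode processing, the focus time and half-window together define the time span that gets processed. A minimal sketch of that arithmetic follows; clamping to the start and end of the video is an assumption, and `passive_window` is a hypothetical helper rather than a library function.

```python
from typing import Optional, Tuple


def passive_window(
    focus_seconds: float,
    window_seconds: float = 5.0,
    duration_seconds: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (start, end) of the window centred on focus_seconds."""
    if focus_seconds < 0 or window_seconds < 0:
        raise ValueError("focus and window must be non-negative")
    start = max(0.0, float(focus_seconds) - window_seconds)
    end = float(focus_seconds) + window_seconds
    if duration_seconds is not None:
        end = min(end, float(duration_seconds))  # do not run past the video
    return start, end


print(passive_window(130))          # focus at 2 min 10 s, default half-window
print(passive_window(2, 5.0, 60))   # window start clamped at the video start
```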
aanswer_question
async def aanswer_question(source, *, question, ...) -> VideoProcessorResult
Async variant of answer_question.
parse_timestamp
@staticmethod
def parse_timestamp(value: float | int | str) -> float
Parse a human-readable timestamp into seconds. Accepted formats include:
| Input | Parsed as |
|---|---|
| 130 | 130.0 |
| "02:10" | 130.0 |
| "2 mins 10 sec" | 130.0 |
Raises ValueError for negative values or unrecognised formats.
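To make the accepted shapes concrete, here is a rough stand-alone parser covering plain numbers, colon-separated clock strings, and unit phrases. It is an illustrative sketch, not the library's code: the function name `parse_timestamp_sketch`, the unit vocabulary, and the exact regular expressions are assumptions.

```python
import re
from typing import Union

# Assumed unit vocabulary; the real parser may accept more or fewer spellings.
_UNIT_SECONDS = {
    "h": 3600.0, "hr": 3600.0, "hrs": 3600.0, "hour": 3600.0, "hours": 3600.0,
    "m": 60.0, "min": 60.0, "mins": 60.0, "minute": 60.0, "minutes": 60.0,
    "s": 1.0, "sec": 1.0, "secs": 1.0, "second": 1.0, "seconds": 1.0,
}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+)")


def parse_timestamp_sketch(value: Union[float, int, str]) -> float:
    """Convert a number, 'MM:SS' / 'HH:MM:SS', or a unit phrase to seconds."""
    if isinstance(value, (int, float)):
        if value < 0:
            raise ValueError("timestamp must be non-negative")
        return float(value)

    text = value.strip().lower()
    if re.fullmatch(r"\d+(\.\d+)?", text):           # "130"
        return float(text)
    if re.fullmatch(r"\d+(:\d+){1,2}(\.\d+)?", text):  # "02:10", "1:02:10"
        seconds = 0.0
        for part in text.split(":"):
            seconds = seconds * 60 + float(part)
        return seconds

    tokens = _TOKEN.findall(text)                    # "2 mins 10 sec"
    leftover = _TOKEN.sub("", text).strip()
    if tokens and not leftover:
        total = 0.0
        for number, unit in tokens:
            if unit not in _UNIT_SECONDS:
                raise ValueError(f"unknown unit: {unit!r}")
            total += float(number) * _UNIT_SECONDS[unit]
        return total
    raise ValueError(f"unrecognised timestamp: {value!r}")


print(parse_timestamp_sketch(130))
print(parse_timestamp_sketch("02:10"))
print(parse_timestamp_sketch("2 mins 10 sec"))
```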
AsyncVideoProcessorPipeline
class AsyncVideoProcessorPipeline
Async-only variant of VideoProcessorPipeline. Exposes a single
async run() method — suitable for FastAPI endpoints.
Constructor parameters are identical to VideoProcessorPipeline.
Methods
run
async def run(source, **kwargs) -> VideoProcessorResult
Models
VideoProcessorResult
class VideoProcessorResult(BaseModel)
Full output of a pipeline run.
| Field | Type | Description |
|---|---|---|
| | | Source identifier, such as a path or URL |
| | | All extracted frames (kept and discarded) |
| | | Audio transcript with timestamps |
| | | Merged visual + audio sections |
| | | Comprehensive Markdown summary |
| | | Number of chunks stored |
| | | Token and frame accounting |
| error | | Error message when safe_mode=True |
| | | Whether this run used |
| | | Passive-mode window start in source-video seconds |
| | | Passive-mode window end in source-video seconds |
| question | | User question (set by answer_question) |
| answer | | LLM answer to question |
Methods
- get_transcript_text() -> str: Full transcript as a single string
- get_all_visual_content() -> str: All frame analyses in timestamp order
- to_json(path=None, *, indent=2) -> str | None: JSON export (image bytes excluded)
- to_markdown(path=None) -> str | None: Structured Markdown report
FrameEntry
class FrameEntry(BaseModel)
| Field | Type | Description |
|---|---|---|
| | | Zero-based sequential frame ID |
| | | Position in the video in seconds |
| | | Similarity % to previous kept frame |
| | | LLM-generated content description |
| | | Raw image bytes (kept frames only) |
TranscriptSegment
class TranscriptSegment(BaseModel)
| Field | Type | Description |
|---|---|---|
| | | Segment start time (seconds) |
| | | Segment end time (seconds) |
| | | Transcribed text |
| | | Kept frame IDs within this time window |
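Linking kept frames to transcript segments by time can be sketched as follows. The dataclasses and field names (`frame_id`, `timestamp`, `kept`, `start`, `end`, `frame_ids`) are hypothetical stand-ins, and the half-open interval `start <= t < end` is an assumption about how boundary frames are assigned.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameSketch:
    frame_id: int
    timestamp: float  # seconds into the video
    kept: bool        # survived deduplication


@dataclass
class SegmentSketch:
    start: float
    end: float
    text: str
    frame_ids: List[int] = field(default_factory=list)


def attach_frames(segments: List[SegmentSketch], frames: List[FrameSketch]) -> None:
    """Fill each segment with the IDs of kept frames inside its time window."""
    for seg in segments:
        seg.frame_ids = [
            f.frame_id
            for f in frames
            if f.kept and seg.start <= f.timestamp < seg.end
        ]


frames = [
    FrameSketch(0, 0.0, True),
    FrameSketch(1, 1.0, False),  # discarded duplicate, never attached
    FrameSketch(2, 2.0, True),
    FrameSketch(3, 4.0, True),
]
segments = [SegmentSketch(0.0, 3.0, "Welcome."), SegmentSketch(3.0, 6.0, "First slide.")]
attach_frames(segments, frames)
print([s.frame_ids for s in segments])
```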
VideoSection
class VideoSection(BaseModel)
Merged time section combining visual analyses + audio transcript.
| Field | Type | Description |
|---|---|---|
| | | Section start (seconds) |
| | | Section end (seconds) |
| | | Frame IDs in this section |
| | | Combined LLM frame analyses |
| | | Transcript text for this window |
VideoProcessorUsage
class VideoProcessorUsage(BaseModel)
| Field | Type | Description |
|---|---|---|
| | | Total frames sampled from video |
| | | Frames that passed deduplication |
| | | Frames removed by deduplication |
| | | Prompt tokens for frame analysis |
| | | Completion tokens for frame analysis |
| | | Prompt tokens for summary |
| | | Completion tokens for summary |
| | | Duration of extracted audio |
Properties
- total_analysis_tokens: analysis_input + analysis_output
- total_summary_tokens: summary_input + summary_output
- total_tokens: Sum of all tokens
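The derived totals can be modelled as read-only properties over the four token counters. The sketch below uses a plain dataclass instead of the library's Pydantic model, and the field names are taken from the shorthand in the property formulas, so the real attribute names may differ.

```python
from dataclasses import dataclass


@dataclass
class UsageSketch:
    # Hypothetical field names, mirroring the shorthand used in the formulas.
    analysis_input: int = 0
    analysis_output: int = 0
    summary_input: int = 0
    summary_output: int = 0

    @property
    def total_analysis_tokens(self) -> int:
        return self.analysis_input + self.analysis_output

    @property
    def total_summary_tokens(self) -> int:
        return self.summary_input + self.summary_output

    @property
    def total_tokens(self) -> int:
        return self.total_analysis_tokens + self.total_summary_tokens


usage = UsageSketch(
    analysis_input=1200, analysis_output=300, summary_input=500, summary_output=150
)
print(usage.total_analysis_tokens, usage.total_summary_tokens, usage.total_tokens)
```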
VideoConfig
class VideoConfig(BaseModel)
Convenience model bundling all pipeline hyperparameters (mirrors constructor params).
Enums
TranscriberBackend
class TranscriberBackend(str, Enum)
| Value | Backend | Extra required |
|---|---|---|
| FASTER_WHISPER | faster-whisper lib (default) | pipelines-video-whisper |
| | openai-whisper lib | |
| | HF transformers ASR | |
| | OpenAI Whisper API | |
| | Google Cloud Speech-to-Text v2 | |
| | HuggingFace Inference API | |
| | Groq Whisper (ultra-fast) | |
| | Deepgram Nova | |
| | Self-hosted Ollama | |
transcriber_model values by backend
| Backend | Valid model values |
|---|---|
| HuggingFace | Any HF model ID |
| Ollama | Model name on your Ollama server |
VideoProcessingMode
class VideoProcessingMode(str, Enum)
| Value | Behaviour |
|---|---|
| ACTIVE | Process the full video (default) |
| PASSIVE | Process only a focused time window around focus_time_seconds |
DeduplicationMethod
class DeduplicationMethod(str, Enum)
| Value | Algorithm | Speed | Accuracy |
|---|---|---|---|
| PHASH | Perceptual hash | Fast | Good for most videos |
| | Structural Similarity | Slower | Higher accuracy |
FrameAnalysisMode
class FrameAnalysisMode(str, Enum)
| Value | Behaviour | Best for |
|---|---|---|
| INDIVIDUAL | One LLM call per frame | High accuracy, whiteboard extraction |
| GRID | Stitch N frames into a collage → one call | Cost reduction, dense videos |
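The cost difference between the two modes comes down to how many LLM calls a given number of kept frames produces. A small sketch of that arithmetic follows; `chunk_frames` and `llm_call_count` are hypothetical helpers, the lowercase mode strings are placeholders, and the assumption is that the last collage may hold fewer than grid_size frames.

```python
import math
from typing import List, Sequence


def chunk_frames(frame_ids: Sequence[int], grid_size: int = 4) -> List[List[int]]:
    """Group frame IDs into consecutive collages of at most grid_size frames."""
    if grid_size < 1:
        raise ValueError("grid_size must be at least 1")
    return [
        list(frame_ids[i:i + grid_size])
        for i in range(0, len(frame_ids), grid_size)
    ]


def llm_call_count(n_frames: int, mode: str, grid_size: int = 4) -> int:
    """Number of vision calls needed for n_frames kept frames."""
    if mode == "individual":
        return n_frames                       # one call per frame
    if mode == "grid":
        return math.ceil(n_frames / grid_size)  # one call per collage
    raise ValueError(f"unknown mode: {mode!r}")


collages = chunk_frames(range(10), grid_size=4)
print(collages)
print(llm_call_count(10, "individual"), llm_call_count(10, "grid"))
```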
Exceptions
VideoRateLimitExceededError
class VideoRateLimitExceededError(RuntimeError)
Raised (or captured in result.error when safe_mode=True) when the
rate_limiter denies a pipeline request.
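The interplay between a duck-typed rate limiter and safe_mode can be sketched as below. Everything here is illustrative: the `allow(user_id)` method is an assumed limiter interface (the reference does not document the expected methods), `RateLimitExceeded` is a local stand-in for VideoRateLimitExceededError, and `ResultSketch` carries only the two fields needed for the demonstration.

```python
from dataclasses import dataclass
from typing import Any, Optional


class RateLimitExceeded(RuntimeError):
    """Local stand-in for VideoRateLimitExceededError."""


@dataclass
class ResultSketch:
    summary: str = ""
    error: Optional[str] = None


class DenyAll:
    """Toy limiter: any object with an allow(user_id) method would do."""

    def allow(self, user_id: str) -> bool:
        return False


def guarded_run(limiter: Any, user_id: str, safe_mode: bool) -> ResultSketch:
    try:
        if limiter is not None and not limiter.allow(user_id):
            raise RateLimitExceeded(f"rate limit exceeded for user {user_id!r}")
        return ResultSketch(summary="ok")  # placeholder for the real pipeline work
    except Exception as exc:
        if safe_mode:
            # safe mode: report the failure on the result instead of raising
            return ResultSketch(error=str(exc))
        raise


safe_result = guarded_run(DenyAll(), "default", safe_mode=True)
print(safe_result.error)
```

With `safe_mode=False` the same call raises the exception, so callers can choose between checking `error` on the result and using ordinary exception handling.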