API Reference — PageIndexRAG

Vectorless RAG pipeline: keyword index + BM25 scoring. No embeddings, no vector store required.

Pipeline

class ractogateway.rag.page_index.pipeline.PageIndexRAG(llm_kit=None, *, processors=None, reader_registry=None, context_template="Use the following retrieved page excerpts to answer the user's question.\nIf the excerpts do not contain enough information, say so clearly.\n\n--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\nQuestion: {question}", default_prompt=None, page_size=1000, page_overlap=100, k1=1.5, b=0.75, top_keywords=20, ocr_backend=None, ocr_fallback=True, min_ocr_confidence=0.0)[source]

Bases: object

Vectorless RAG pipeline that indexes documents at the page level.

Parameters:
  • llm_kit (Any) – Any RactoGateway developer kit (OpenAI, Anthropic, Google, Ollama, HuggingFace). Required only for query() / aquery(). Pass None to use the pipeline in retrieve-only mode.

  • processors (Sequence[BaseProcessor] | None) – Text processors applied to each page before indexing. Defaults to [TextCleaner()].

  • reader_registry (FileReaderRegistry | None) – File reader registry used to load non-PDF documents. Defaults to a FileReaderRegistry with all built-in readers registered.

  • context_template (str) – Template string with single-brace {context} and {question} placeholders (str.format style) used when building the LLM prompt.

  • default_prompt (RactoPrompt | None) – RactoPrompt used for generation. Defaults to a built-in factual Q&A prompt.

  • page_size (int) – Maximum character length of each page window for non-PDF files (default 1000).

  • page_overlap (int) – Character overlap between consecutive windows (default 100).

  • k1 (float) – BM25 term-frequency saturation parameter (default 1.5).

  • b (float) – BM25 length-normalisation parameter (default 0.75).

  • top_keywords (int) – Number of top TF-weighted keywords to extract per page for the decision index (default 20).
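The interaction of page_size and page_overlap can be illustrated with a minimal sketch. It assumes each window starts page_size - page_overlap characters after the previous one; the function name split_into_windows is hypothetical, and the library's actual splitter may differ (for example, by respecting word boundaries).

```python
def split_into_windows(text: str, page_size: int = 1000, page_overlap: int = 100) -> list[str]:
    """Split text into windows of at most page_size characters.

    Consecutive windows share page_overlap characters (assumed behaviour).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if not 0 <= page_overlap < page_size:
        raise ValueError("page_overlap must satisfy 0 <= page_overlap < page_size")

    step = page_size - page_overlap  # distance between window starts
    windows: list[str] = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + page_size])
        if start + page_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


sample = "".join(chr(97 + i % 26) for i in range(2500))
pages = split_into_windows(sample)
print([len(p) for p in pages])
```

With the defaults, a 2500-character input yields three windows, and the last 100 characters of each window reappear at the start of the next.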

retrieve(query, top_k=5)[source]

Retrieve the most relevant pages for query.

Uses two-stage retrieval: decision index (candidate selection) → BM25 scoring (ranking).

Parameters:
  • query (str) – Natural-language question or keyword string.

  • top_k (int) – Maximum number of results to return.

Return type:

list[PageIndexResult]

Returns:

list[PageIndexResult] – Pages ranked by BM25 score (most relevant first).
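A small helper for inspecting retrieve() output might look like the sketch below. It relies only on the PageIndexResult and PageEntry fields documented on this page (rank, score, matched_terms, entry.source, entry.page_number); the helper name summarise_results is hypothetical, and SimpleNamespace objects stand in for real results so the snippet runs without the library.

```python
from types import SimpleNamespace


def summarise_results(results) -> list[str]:
    """Format ranked results as one human-readable line per page."""
    lines = []
    for r in results:
        # page_number may be None (typed int | None in PageEntry).
        page = r.entry.page_number if r.entry.page_number is not None else "-"
        terms = ", ".join(r.matched_terms)
        lines.append(f"#{r.rank} score={r.score:.3f} {r.entry.source} p.{page} [{terms}]")
    return lines


# Stand-in for the list returned by rag.retrieve("bm25 index", top_k=5).
demo_results = [
    SimpleNamespace(
        rank=1,
        score=2.5,
        matched_terms=["bm25", "index"],
        entry=SimpleNamespace(source="guide.pdf", page_number=3),
    ),
]
print("\n".join(summarise_results(demo_results)))
```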

async aretrieve(query, top_k=5)[source]

Async variant of retrieve().

Return type:

list[PageIndexResult]

ingest(path, **metadata)[source]

Read a file and add its pages to the index.

PDFs are split page-by-page; all other file types are split into fixed-size character windows.

Parameters:
  • path (str) – Absolute or relative path to the file.

  • **metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

Returns:

list[PageEntry] – All page entries created from this file.

async aingest(path, **metadata)[source]

Async variant of ingest().

Return type:

list[PageEntry]

ingest_text(text, source='manual', **metadata)[source]

Index raw text directly (no file I/O).

Parameters:
  • text (str) – Plain text to index.

  • source (str) – Descriptive label stored in PageEntry.source.

  • **metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

async aingest_text(text, source='manual', **metadata)[source]

Async variant of ingest_text().

Return type:

list[PageEntry]

ingest_dir(directory, pattern='**/*', *, on_progress=None, **metadata)[source]

Ingest all files matching pattern inside directory.

Files that cannot be read are logged and skipped; the rest are indexed normally.

Parameters:
  • directory (str) – Root directory to search.

  • pattern (str) – Glob pattern relative to directory (default "**/*").

  • on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file is processed (or skipped). Useful for progress bars.

  • **metadata (Any) – Forwarded to every ingest() call.

Return type:

list[PageEntry]
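An on_progress callback only needs to accept (done, total). The sketch below builds one that records a percentage line per file; make_progress_logger is a hypothetical helper, not part of the library.

```python
from typing import Callable


def make_progress_logger(lines: list[str]) -> Callable[[int, int], None]:
    """Return a (done, total) callback that appends a status line to `lines`."""

    def on_progress(done: int, total: int) -> None:
        pct = 100 * done // total if total else 100  # guard against total == 0
        lines.append(f"{done}/{total} files ({pct}%)")

    return on_progress


log: list[str] = []
callback = make_progress_logger(log)
callback(1, 4)
callback(4, 4)
print(log)
```

Such a callback could then be passed as on_progress=callback to ingest_dir().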

async aingest_dir(directory, pattern='**/*', *, max_concurrent=4, on_progress=None, **metadata)[source]

Async parallel variant of ingest_dir().

Parameters:
  • directory (str) – Root directory to search.

  • pattern (str) – Glob pattern relative to directory (default "**/*").

  • max_concurrent (int) – Maximum number of files ingested concurrently (default 4).

  • on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file finishes (thread-safe; called from the event loop).

  • **metadata (Any) – Forwarded to every aingest() call.

Return type:

list[PageEntry]
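The effect of max_concurrent can be pictured as a semaphore around each file's ingestion. The following is a generic asyncio sketch of that pattern, not the library's implementation; run_bounded and worker are hypothetical names.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent=4, on_progress=None):
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    total = len(items)
    done = 0

    async def guarded(item):
        nonlocal done
        async with semaphore:  # blocks while max_concurrent workers are active
            result = await worker(item)
        done += 1
        if on_progress is not None:
            on_progress(done, total)  # runs on the event loop thread
        return result

    # gather preserves the order of `items` in the returned list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def _fake_ingest(path: str) -> str:
    await asyncio.sleep(0.01)  # stands in for file I/O and indexing
    return path.upper()


print(asyncio.run(run_bounded(["a.txt", "b.txt", "c.txt"], _fake_ingest, max_concurrent=2)))
```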

add_document(path, **metadata)[source]

Alias for ingest().

Return type:

list[PageEntry]

add_texts(texts, source='manual', **metadata)[source]

Ingest a list of text strings.

Return type:

list[PageEntry]

search(query, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Alias for query().

Return type:

PageIndexResponse

query(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Retrieve relevant pages and generate an answer with the LLM kit.

Parameters:
  • question (str) – Natural-language question to answer.

  • top_k (int) – Number of pages to retrieve.

  • prompt (RactoPrompt | None) – Override the pipeline's default_prompt for this call.

  • temperature (float) – Sampling temperature for generation.

  • max_tokens (int) – Maximum generation tokens.

Return type:

PageIndexResponse

Returns:

PageIndexResponse – Contains the generated answer, ranked sources, and the context string that was supplied to the model.

Raises:

ValueError – If no llm_kit was provided and generation is requested.
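How the context string and question end up in the prompt can be sketched with the default context_template from the constructor signature. This assumes the placeholders are filled with str.format and that excerpts are joined with blank lines; both are assumptions, and build_prompt is a hypothetical helper.

```python
DEFAULT_TEMPLATE = (
    "Use the following retrieved page excerpts to answer the user's question.\n"
    "If the excerpts do not contain enough information, say so clearly.\n\n"
    "--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
    "Question: {question}"
)


def build_prompt(excerpts: list[str], question: str, template: str = DEFAULT_TEMPLATE) -> str:
    """Fill the {context} and {question} placeholders of a template."""
    context = "\n\n".join(excerpts)  # assumed separator between pages
    return template.format(context=context, question=question)


print(build_prompt(["Page 1 text.", "Page 2 text."], "What does page 2 say?"))
```

A custom template passed as context_template would need to keep both placeholders and escape any other literal braces as {{ and }} if str.format is indeed what fills it.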

async aquery(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Async variant of query().

Return type:

PageIndexResponse

remove_document(doc_id)[source]

Remove all pages belonging to doc_id from the index.

Parameters:

doc_id (str) – The doc_id value from any PageEntry returned during ingestion.

Return type:

int

Returns:

int – Number of page entries removed.

clear()[source]

Remove all indexed entries and reset the pipeline to empty state.

Return type:

None

save(path)[source]

Serialise the full index to a JSON file.

The saved file contains all PageEntry records, BM25 term weights, and deduplication hashes. Reload with load().

Parameters:

path (str) – Destination file path (will be created or overwritten).

Return type:

None

classmethod load(path, **kwargs)[source]

Load a previously saved index from path.

Parameters:
  • path (str) – JSON file written by save().

  • **kwargs (Any) – Forwarded to the constructor (e.g. llm_kit=kit).

Return type:

PageIndexRAG

Returns:

PageIndexRAG – A new instance with the index fully restored.

property entry_count: int

Total number of indexed page entries.

property document_count: int

Number of distinct documents ingested.

Models

Pydantic models for the PageIndexRAG pipeline.

class ractogateway.rag.page_index._models.PageEntry(**data)[source]

Bases: BaseModel

A single page (or fixed-size window) extracted from a document.

Produced by PageIndexRAG during ingestion and stored in the in-process index.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

entry_id: str
page_number: int | None
content: str
source: str
section_title: str | None
keywords: list[str]
doc_id: str
char_count: int
extra: dict[str, Any]
ocr_applied: bool
ocr_confidence: float | None
content_hash: str | None
property text: str

Alias for content.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.rag.page_index._models.PageIndexResult(**data)[source]

Bases: BaseModel

A single retrieved page together with its BM25 relevance score.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

entry: PageEntry
score: float
rank: int
matched_terms: list[str]
property content: str

Alias for entry.content.

property text: str

Alias for entry.content.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.rag.page_index._models.PageIndexResponse(**data)[source]

Bases: BaseModel

Full response from PageIndexRAG.query() / PageIndexRAG.aquery().

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

answer: LLMResponse | None
sources: list[PageIndexResult]
query: str
context_used: str
property results: list[PageIndexResult]

Alias for sources.

property pages: list[PageIndexResult]

Alias for sources.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

BM25 Engine

Pure-Python BM25 index and decision-tree inverted index.

No external dependencies required — everything is implemented with the Python standard library.

Two components work together for two-stage retrieval:

  1. _DecisionIndex — an inverted keyword index that maps content terms to page entry IDs. Given a tokenised query it returns the union of candidate entry IDs in O(|query terms|) time. This is the “decision tree” routing layer.

  2. BM25Index — Okapi BM25 (k1=1.5, b=0.75) that scores the candidates returned by the decision index. Only candidates are scored, so the full corpus is never re-ranked on every query.
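The two stages described above can be sketched in a few dozen lines of standard-library Python. This is an illustrative toy, not the module's code: TinyTwoStageIndex, its tokeniser, and the IDF variant are assumptions, and it indexes every token rather than only the top keywords per page.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens (assumed tokenisation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyTwoStageIndex:
    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.postings: dict[str, set[str]] = defaultdict(set)  # term -> entry IDs
        self.term_freqs: dict[str, Counter] = {}                # entry ID -> term counts
        self.lengths: dict[str, int] = {}                       # entry ID -> token count

    def add(self, entry_id: str, text: str) -> None:
        tokens = tokenize(text)
        self.term_freqs[entry_id] = Counter(tokens)
        self.lengths[entry_id] = len(tokens)
        for term in set(tokens):
            self.postings[term].add(entry_id)

    def candidates(self, query: str) -> set[str]:
        """Stage 1: union of posting lists, one lookup per query term."""
        ids: set[str] = set()
        for term in set(tokenize(query)):
            ids |= self.postings.get(term, set())
        return ids

    def score(self, query: str, candidate_ids: set[str]) -> list[tuple[str, float, list[str]]]:
        """Stage 2: BM25 over the candidates only."""
        n = len(self.lengths)
        avgdl = sum(self.lengths.values()) / n if n else 0.0
        query_terms = list(dict.fromkeys(tokenize(query)))  # de-duplicate, keep order
        results = []
        for entry_id in candidate_ids:
            tf = self.term_freqs[entry_id]
            total, matched = 0.0, []
            for term in query_terms:
                f = tf.get(term, 0)
                if f == 0:
                    continue
                df = len(self.postings[term])
                idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # one common IDF variant
                norm = f + self.k1 * (1 - self.b + self.b * self.lengths[entry_id] / avgdl)
                total += idf * f * (self.k1 + 1) / norm
                matched.append(term)
            results.append((entry_id, total, matched))
        # Descending score; ties broken by entry ID for a stable order.
        return sorted(results, key=lambda r: (-r[1], r[0]))


idx = TinyTwoStageIndex()
idx.add("p1", "BM25 ranks pages by term frequency")
idx.add("p2", "Vector stores hold embeddings")
idx.add("p3", "Pages about BM25 and BM25 tuning")
for entry_id, score, terms in idx.score("bm25 pages", idx.candidates("bm25 pages")):
    print(entry_id, round(score, 3), terms)
```

Because p2 shares no term with the query, it is never scored; only p1 and p3 reach the BM25 stage.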

ractogateway.rag.page_index._bm25.extract_keywords(text, top_n=20)[source]

Return the top-n most frequent content tokens from text.

Return type:

list[str]
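A frequency-based keyword extractor of this kind can be approximated as follows. The tokenisation rule, the stop-word list, and the alphabetical tie-break are assumptions made for the sketch; top_keywords_sketch is a hypothetical name and may not match extract_keywords() exactly.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the real list is not documented here.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
})


def top_keywords_sketch(text: str, top_n: int = 20) -> list[str]:
    """Return the top_n most frequent non-stop-word tokens in text."""
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if t not in STOPWORDS and len(t) > 1
    ]
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


print(top_keywords_sketch(
    "The index maps terms to pages. The index is an inverted index of terms.",
    top_n=2,
))
```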

class ractogateway.rag.page_index._bm25.BM25Index(k1=1.5, b=0.75)[source]

Bases: object

Okapi BM25 scorer over a corpus of PageEntry texts.

Parameters:
  • k1 (float) – Term-frequency saturation parameter (default 1.5).

  • b (float) – Length normalisation parameter (default 0.75).
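For orientation, the textbook Okapi BM25 score shows where k1 and b enter. The IDF variant used by BM25Index is not stated in this reference, so IDF(t) is left abstract here.

```latex
\operatorname{score}(q, d) \;=\; \sum_{t \in q} \operatorname{IDF}(t)\,
\frac{f(t, d)\,(k_1 + 1)}
     {f(t, d) + k_1\left(1 - b + b\,\dfrac{|d|}{\mathrm{avgdl}}\right)}
```

Here f(t, d) is the number of occurrences of term t in page d, |d| is the page length in tokens, and avgdl is the mean page length in the index. For a page of average length the denominator reduces to f(t, d) + k1, so with k1 = 1.5 one occurrence contributes a factor of 2.5 / 2.5 = 1.0 and three occurrences contribute 7.5 / 4.5 ≈ 1.67; the factor can never exceed k1 + 1 = 2.5. Setting b = 0 removes length normalisation entirely.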

add(entry_id, text)[source]

Tokenise text and add the entry to the index.

Return type:

None

remove(entry_id)[source]

Remove entry_id from the index.

Return type:

None

clear()[source]

Remove all entries from the index.

Return type:

None

score(query, candidate_ids=None)[source]

Score candidates against query and return ranked results.

Parameters:
  • query (str) – Raw query string.

  • candidate_ids (set[str] | None) – Subset of entry IDs to score. When None the entire corpus is scored (full-scan fallback).

Return type:

list[tuple[str, float, list[str]]]

Returns:

list of (entry_id, bm25_score, matched_terms) – Sorted descending by score, ties broken by entry_id for stability.

property entry_count: int

OCR Backends

OCR backends for PageIndexRAG.

Each backend converts raw image bytes (PNG/JPEG) into extracted text. All backends follow the same interface so they are interchangeable.

Available backends

  • TesseractOcrBackend — free, offline, pytesseract wrapper

  • EasyOcrBackend — deep-learning, 80+ languages, offline

  • GoogleVisionBackend — Google Cloud Vision API

  • GoogleDocumentAIBackend — Google Document AI (tables, forms)

  • AWSTextractBackend — AWS Textract (forms, tables, key-value)

  • AzureDocumentIntelligenceBackend — Azure Form Recognizer v4

Quick start:

from ractogateway.rag.page_index import PageIndexRAG
from ractogateway.rag.page_index._ocr import TesseractOcrBackend

# `kit` is any RactoGateway developer kit instance created beforehand.
rag = PageIndexRAG(llm_kit=kit, ocr_backend=TesseractOcrBackend())
rag.ingest("scanned_report.pdf")   # OCR fallback auto-triggered

class ractogateway.rag.page_index._ocr.BaseOcrBackend[source]

Bases: ABC

Abstract base class for OCR backends.

Implementors must provide extract_text(). An async default is provided via aextract_text() that offloads the synchronous call to a thread-pool executor.

abstractmethod extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

async aextract_text(image_bytes, mime_type='image/png')[source]

Async variant of extract_text() (thread-pool offload).

Return type:

str
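The contract described for BaseOcrBackend (one abstract synchronous method plus an async default that offloads to a thread pool) can be sketched as below. SketchOcrBackend is a stand-in defined here so the example runs without the library, and EchoBackend is a hypothetical custom backend; a real one would subclass BaseOcrBackend instead.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchOcrBackend(ABC):
    """Stand-in mirroring the documented BaseOcrBackend interface."""

    @abstractmethod
    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        """Convert image bytes to plain text."""

    async def aextract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        # Offload the blocking call to the default thread-pool executor.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.extract_text, image_bytes, mime_type)


class EchoBackend(SketchOcrBackend):
    """Toy backend that reports what it received instead of running OCR."""

    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        return f"{mime_type}:{len(image_bytes)} bytes"


print(asyncio.run(EchoBackend().aextract_text(b"\x89PNG")))
```

Only extract_text() has to be implemented; the async path comes for free from the base class.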

class ractogateway.rag.page_index._ocr.TesseractOcrBackend(lang='eng', config='', confidence_threshold=40.0)[source]

Bases: BaseOcrBackend

OCR via Tesseract.

Requires pytesseract and a working Tesseract installation. Install with:

pip install "ractogateway[rag-ocr-tesseract]"
# Also install Tesseract binary: https://github.com/UB-Mannheim/tesseract/wiki

Parameters:
  • lang (str) – Tesseract language string, e.g. "eng" (default), "eng+deu".

  • config (str) – Extra Tesseract config flags, e.g. "--psm 6".

  • confidence_threshold (float) – Pages where the mean word confidence is below this value (0–100) are flagged in the returned metadata; text is still returned. Set to 0 to disable flagging.

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

extract_with_confidence(image_bytes)[source]

Return (text, mean_confidence) for confidence-aware ingestion.

Return type:

tuple[str, float]
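A mean word confidence on the 0-100 scale can be derived from per-box confidences such as those Tesseract reports, where non-word boxes typically carry -1. The sketch below shows one way to compute and threshold that mean; both helper names are hypothetical and the exact aggregation used by extract_with_confidence() is an assumption.

```python
def mean_word_confidence(confidences) -> float:
    """Average the confidences of real words, skipping negative placeholders."""
    valid = [float(c) for c in confidences if float(c) >= 0]
    return sum(valid) / len(valid) if valid else 0.0


def is_low_confidence(mean_conf: float, threshold: float = 40.0) -> bool:
    """True when a page falls below the configured confidence threshold."""
    return mean_conf < threshold


# Values may arrive as strings or numbers depending on the OCR wrapper version.
raw = ["-1", "90", "70.0", -1]
mean_conf = mean_word_confidence(raw)
print(mean_conf, is_low_confidence(mean_conf))
```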

class ractogateway.rag.page_index._ocr.EasyOcrBackend(languages=None, gpu=False)[source]

Bases: BaseOcrBackend

OCR via EasyOCR.

Deep-learning model; no cloud API required. Supports 80+ languages. Install with:

pip install "ractogateway[rag-ocr-easy]"

Parameters:
  • languages (list[str] | None) – List of language codes, e.g. ["en"] (default) or ["en", "de"].

  • gpu (bool) – Use CUDA GPU if available (default False for broad compatibility).

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index._ocr.GoogleVisionBackend(credentials_path=None)[source]

Bases: BaseOcrBackend

OCR via Google Cloud Vision API (DOCUMENT_TEXT_DETECTION).

Install with:

pip install "ractogateway[rag-ocr-google]"

Parameters:

credentials_path (str | None) – Path to a service-account JSON key file. If None the SDK uses Application Default Credentials (GOOGLE_APPLICATION_CREDENTIALS).

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index._ocr.GoogleDocumentAIBackend(project_id, processor_id, location='us', credentials_path=None)[source]

Bases: BaseOcrBackend

OCR via Google Document AI.

Best for structured documents: tables, forms, invoices, contracts. Install with:

pip install "ractogateway[rag-ocr-google]"

Parameters:
  • project_id (str) – GCP project ID.

  • processor_id (str) – Document AI processor ID (e.g. an OCR or Form Parser processor).

  • location (str) – Processor region, usually "us" or "eu" (default "us").

  • credentials_path (str | None) – Optional path to a service-account JSON key.

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index._ocr.AWSTextractBackend(region_name='us-east-1', aws_access_key_id=None, aws_secret_access_key=None)[source]

Bases: BaseOcrBackend

OCR via AWS Textract.

Best for forms and tables; uses DetectDocumentText for plain text. Install with:

pip install "ractogateway[rag-ocr-aws]"

Parameters:
  • region_name (str) – AWS region (default "us-east-1").

  • aws_access_key_id, aws_secret_access_key (str | None) – Optional explicit credentials; if omitted, boto3 uses the standard credential chain (env vars, ~/.aws/credentials, IAM role, etc.).

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index._ocr.AzureDocumentIntelligenceBackend(endpoint, api_key, model_id='prebuilt-read')[source]

Bases: BaseOcrBackend

OCR via Azure Document Intelligence.

Previously called Azure Form Recognizer. The prebuilt-read model is used by default; swap in prebuilt-document for richer extraction. Install with:

pip install "ractogateway[rag-ocr-azure]"

Parameters:
  • endpoint (str) – Azure resource endpoint URL.

  • api_key (str) – Azure resource API key.

  • model_id (str) – Document Intelligence model to use (default "prebuilt-read").

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.