ractogateway.rag.page_index

PageIndexRAG — vectorless, page-level BM25 retrieval.

Quick start:

from ractogateway.rag.page_index import PageIndexRAG

rag = PageIndexRAG(llm_kit=kit)
rag.ingest("report.pdf")
rag.ingest("notes.txt")

# Retrieve only
results = rag.retrieve("revenue growth", top_k=5)

# Full RAG: retrieve + generate
response = rag.query("What were the Q3 revenue figures?")
print(response.answer.content)

class ractogateway.rag.page_index.PageIndexRAG(llm_kit=None, *, processors=None, reader_registry=None, context_template="Use the following retrieved page excerpts to answer the user's question.\nIf the excerpts do not contain enough information, say so clearly.\n\n--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\nQuestion: {question}", default_prompt=None, page_size=1000, page_overlap=100, k1=1.5, b=0.75, top_keywords=20, ocr_backend=None, ocr_fallback=True, min_ocr_confidence=0.0)[source]

Bases: object

Vectorless RAG pipeline that indexes documents at the page level.

Parameters:
  • llm_kit (Any) – Any RactoGateway developer kit (OpenAI, Anthropic, Google, Ollama, HuggingFace). Required only for query() / aquery(). Pass None to use the pipeline in retrieve-only mode.

  • processors (Sequence[BaseProcessor] | None) – Text processors applied to each page before indexing. Defaults to [TextCleaner()].

  • reader_registry (FileReaderRegistry | None) – File reader registry used to load non-PDF documents. Defaults to a FileReaderRegistry with all built-in readers registered.

  • context_template (str) – Template string with {context} and {question} placeholders, used when building the LLM prompt.

  • default_prompt (RactoPrompt | None) – RactoPrompt used for generation. Defaults to a built-in factual Q&A prompt.

  • page_size (int) – Maximum character length of each page window for non-PDF files (default 1000).

  • page_overlap (int) – Character overlap between consecutive windows (default 100).

  • k1 (float) – BM25 term-frequency saturation parameter (default 1.5).

  • b (float) – BM25 length-normalisation parameter (default 0.75).

  • top_keywords (int) – Number of top TF-weighted keywords to extract per page for the decision index (default 20).
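To illustrate how page_size and page_overlap interact, here is a minimal sketch of fixed-size character windowing. The function name split_into_windows and its exact boundary handling are assumptions made for illustration; the library's own splitter may differ.

```python
def split_into_windows(text: str, page_size: int = 1000, page_overlap: int = 100) -> list[str]:
    """Split text into windows of at most page_size characters.

    Consecutive windows share page_overlap characters.
    Hypothetical helper, not part of ractogateway.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if not 0 <= page_overlap < page_size:
        raise ValueError("page_overlap must be in [0, page_size)")

    step = page_size - page_overlap  # distance between window starts
    windows: list[str] = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + page_size])
        if start + page_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return windows


# 250 characters, windows of 100 with 20 characters shared between neighbours.
windows = split_into_windows("abcdefghij" * 25, page_size=100, page_overlap=20)
print([len(w) for w in windows])
```

A larger overlap lowers the chance that a sentence relevant to a query is cut in half at a window boundary, at the cost of indexing more duplicated text.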

retrieve(query, top_k=5)[source]

Retrieve the most relevant pages for query.

Uses two-stage retrieval: decision index (candidate selection) → BM25 scoring (ranking).

Parameters:
  • query (str) – Natural-language question or keyword string.

  • top_k (int) – Maximum number of results to return.

Return type:

list[PageIndexResult]

Returns:

list[PageIndexResult] – Pages ranked by BM25 score (most relevant first).
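The two-stage flow can be sketched as follows. This is an illustrative toy under stated assumptions, not the library's implementation: the tokenizer, the keyword-based candidate filter, and the specific BM25 IDF variant are all choices made here, and the names tokenize and bm25_rank are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(pages, query, top_k=5, k1=1.5, b=0.75, top_keywords=20):
    """Return (page_index, score) pairs, best first."""
    docs = [tokenize(p) for p in pages]
    tfs = [Counter(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n, 1)

    # Stage 1: decision index mapping each top keyword to the pages that carry it.
    keyword_index: dict[str, set[int]] = {}
    for i, tf in enumerate(tfs):
        for term, _count in tf.most_common(top_keywords):
            keyword_index.setdefault(term, set()).add(i)

    q_terms = set(tokenize(query))
    candidates: set[int] = set()
    for term in q_terms:
        candidates |= keyword_index.get(term, set())

    # Stage 2: BM25 scoring of the candidates only.
    df = Counter(term for tf in tfs for term in tf)  # document frequency
    scored = []
    for i in candidates:
        norm = 1 - b + b * len(docs[i]) / avg_len  # length normalisation
        score = 0.0
        for term in q_terms:
            f = tfs[i].get(term, 0)
            if f == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scored.append((i, score))

    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


pages = [
    "Revenue growth was strong in Q3, with revenue up ten percent.",
    "The hiring plan for engineering teams.",
    "Q3 costs and revenue outlook.",
]
ranked = bm25_rank(pages, "revenue growth")
print(ranked)
```

Raising k1 lets repeated occurrences of a term keep adding to the score for longer before saturating; raising b penalises long pages more strongly.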

async aretrieve(query, top_k=5)[source]

Async variant of retrieve().

Return type:

list[PageIndexResult]

ingest(path, **metadata)[source]

Read a file and add its pages to the index.

PDFs are split page-by-page; all other file types are split into fixed-size character windows.

Parameters:
  • path (str) – Absolute or relative path to the file.

  • **metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

Returns:

list[PageEntry] – All page entries created from this file.

async aingest(path, **metadata)[source]

Async variant of ingest().

Return type:

list[PageEntry]

ingest_text(text, source='manual', **metadata)[source]

Index raw text directly (no file I/O).

Parameters:
  • text (str) – Plain text to index.

  • source (str) – Descriptive label stored in PageEntry.source.

  • **metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

async aingest_text(text, source='manual', **metadata)[source]

Async variant of ingest_text().

Return type:

list[PageEntry]

ingest_dir(directory, pattern='**/*', *, on_progress=None, **metadata)[source]

Ingest all files matching pattern inside directory.

Files that cannot be read are logged and skipped; the rest are indexed normally.

Parameters:
  • directory (str) – Root directory to search.

  • pattern (str) – Glob pattern relative to directory (default "**/*").

  • on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file is processed (or skipped). Useful for progress bars.

  • **metadata (Any) – Forwarded to every ingest() call.

Return type:

list[PageEntry]
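The skip-and-report behaviour described for ingest_dir() can be pictured with the stand-alone sketch below. It uses only the standard library; read_all is a hypothetical helper that reads text files instead of indexing them, and the real method may log and count differently.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def read_all(
    directory: str,
    pattern: str = "**/*",
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> dict[str, str]:
    """Read every file matching pattern; skip unreadable ones but still report progress."""
    files = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    total = len(files)
    loaded: dict[str, str] = {}
    for done, path in enumerate(files, start=1):
        try:
            loaded[path.name] = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            print(f"skipping {path.name}: {exc}")  # stand-in for logging
        if on_progress is not None:
            on_progress(done, total)  # called for processed and skipped files alike
    return loaded


progress: list[tuple[int, int]] = []
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("alpha", encoding="utf-8")
    Path(tmp, "b.txt").write_text("beta", encoding="utf-8")
    Path(tmp, "bad.bin").write_bytes(b"\xff\xfe\x00")  # not valid UTF-8
    loaded = read_all(tmp, on_progress=lambda done, total: progress.append((done, total)))

print(sorted(loaded), progress)
```

A callback with this (done, total) shape plugs directly into most progress-bar libraries.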

async aingest_dir(directory, pattern='**/*', *, max_concurrent=4, on_progress=None, **metadata)[source]

Async parallel variant of ingest_dir().

Parameters:
  • directory (str) – Root directory to search.

  • pattern (str) – Glob pattern relative to directory (default "**/*").

  • max_concurrent (int) – Maximum number of files ingested concurrently (default 4).

  • on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file finishes (thread-safe; called from the event loop).

  • **metadata (Any) – Forwarded to every aingest() call.

Return type:

list[PageEntry]
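One common way to cap concurrency the way max_concurrent does is an asyncio.Semaphore around each unit of work. The sketch below assumes that approach; gather_bounded and slow_upper are hypothetical names, and nothing here is confirmed to match the library's internals.

```python
import asyncio
import time


async def gather_bounded(items, worker, max_concurrent=4, on_progress=None):
    """Run worker(item) in threads, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    total, done = len(items), 0
    active = peak = 0  # bookkeeping to observe the concurrency bound

    async def run_one(item):
        nonlocal done, active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            try:
                result = await asyncio.to_thread(worker, item)
            finally:
                active -= 1
        done += 1
        if on_progress is not None:
            on_progress(done, total)  # runs on the event loop thread
        return result

    results = await asyncio.gather(*(run_one(item) for item in items))
    return list(results), peak


def slow_upper(text: str) -> str:
    time.sleep(0.01)  # stand-in for blocking file I/O and parsing
    return text.upper()


calls: list[tuple[int, int]] = []
results, peak = asyncio.run(
    gather_bounded(
        ["a", "b", "c", "d", "e", "f"],
        slow_upper,
        max_concurrent=2,
        on_progress=lambda done, total: calls.append((done, total)),
    )
)
print(results, peak)
```

Because the callback is invoked from the event loop rather than from worker threads, it can update shared state without extra locking.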

add_document(path, **metadata)[source]

Alias for ingest().

Return type:

list[PageEntry]

add_texts(texts, source='manual', **metadata)[source]

Ingest a list of text strings.

Return type:

list[PageEntry]

search(query, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Alias for query().

Return type:

PageIndexResponse

query(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Retrieve relevant pages and generate an answer with the LLM kit.

Parameters:
  • question (str) – Natural-language question to answer.

  • top_k (int) – Number of pages to retrieve.

  • prompt (RactoPrompt | None) – Overrides the pipeline's default_prompt for this call.

  • temperature (float) – Sampling temperature for generation.

  • max_tokens (int) – Maximum generation tokens.

Return type:

PageIndexResponse

Returns:

PageIndexResponse – Contains the generated answer, ranked sources, and the context string that was supplied to the model.

Raises:

ValueError – If no llm_kit was provided and generation is requested.
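As a rough picture of how the context string reaches the model, the sketch below fills the default context_template with retrieved excerpts. It assumes the placeholders are substituted with str.format and that excerpts are numbered and joined with blank lines; both are illustrative guesses, and build_prompt is a hypothetical helper.

```python
CONTEXT_TEMPLATE = (
    "Use the following retrieved page excerpts to answer the user's question.\n"
    "If the excerpts do not contain enough information, say so clearly.\n\n"
    "--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
    "Question: {question}"
)


def build_prompt(excerpts: list[str], question: str, template: str = CONTEXT_TEMPLATE) -> str:
    """Join excerpts into one context block and substitute both placeholders."""
    # Assumed layout: one numbered excerpt per paragraph.
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(excerpts, start=1))
    return template.format(context=context, question=question)


prompt = build_prompt(
    ["Q3 revenue was 4.2M.", "Q2 revenue was 3.9M."],
    "What were the Q3 revenue figures?",
)
print(prompt)
```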

async aquery(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Async variant of query().

Return type:

PageIndexResponse

remove_document(doc_id)[source]

Remove all pages belonging to doc_id from the index.

Parameters:

doc_id (str) – The doc_id value from any PageEntry returned during ingestion.

Return type:

int

Returns:

int – Number of page entries removed.

clear()[source]

Remove all indexed entries and reset the pipeline to empty state.

Return type:

None

save(path)[source]

Serialise the full index to a JSON file.

The saved file contains all PageEntry records, BM25 term weights, and deduplication hashes. Reload with load().

Parameters:

path (str) – Destination file path (will be created or overwritten).

Return type:

None

classmethod load(path, **kwargs)[source]

Load a previously saved index from path.

Parameters:
  • path (str) – JSON file written by save().

  • **kwargs (Any) – Forwarded to the constructor (e.g. llm_kit=kit).

Return type:

PageIndexRAG

Returns:

PageIndexRAG – A new instance with the index fully restored.
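The save/load round trip and the role of the deduplication hashes can be illustrated with a toy index. The JSON layout, the use of SHA-256, and the ToyIndex class are assumptions for this sketch; the actual file format written by save() is not specified here.

```python
import hashlib
import json
import os
import tempfile


class ToyIndex:
    """Hypothetical stand-in showing JSON persistence with content-hash deduplication."""

    def __init__(self) -> None:
        self.entries: list[dict] = []
        self.hashes: set[str] = set()

    def add(self, content: str, source: str = "manual") -> bool:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.hashes:
            return False  # identical content was already indexed
        self.hashes.add(digest)
        self.entries.append({"content": content, "source": source, "content_hash": digest})
        return True

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"entries": self.entries}, fh)

    @classmethod
    def load(cls, path: str) -> "ToyIndex":
        with open(path, encoding="utf-8") as fh:
            payload = json.load(fh)
        index = cls()
        index.entries = payload["entries"]
        # Rebuild the hash set so deduplication keeps working after a reload.
        index.hashes = {entry["content_hash"] for entry in index.entries}
        return index


index = ToyIndex()
index.add("page one")
index.add("page two")
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "index.json")
    index.save(target)
    restored = ToyIndex.load(target)

print(len(restored.entries), restored.add("page one"))
```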

property entry_count: int

Total number of indexed page entries.

property document_count: int

Number of distinct documents ingested.

class ractogateway.rag.page_index.PageEntry(**data)[source]

Bases: BaseModel

A single page (or fixed-size window) extracted from a document.

Produced by PageIndexRAG during ingestion and stored in the in-process index.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

entry_id: str
page_number: int | None
content: str
source: str
section_title: str | None
keywords: list[str]
doc_id: str
char_count: int
extra: dict[str, Any]
ocr_applied: bool
ocr_confidence: float | None
content_hash: str | None
property text: str

Alias for content.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.rag.page_index.PageIndexResult(**data)[source]

Bases: BaseModel

A single retrieved page together with its BM25 relevance score.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

entry: PageEntry
score: float
rank: int
matched_terms: list[str]
property content: str

Alias for entry.content.

property text: str

Alias for entry.content.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.rag.page_index.PageIndexResponse(**data)[source]

Bases: BaseModel

Full response from PageIndexRAG.query() / PageIndexRAG.aquery().

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

answer: LLMResponse | None
sources: list[PageIndexResult]
query: str
context_used: str
property results: list[PageIndexResult]

Alias for sources.

property pages: list[PageIndexResult]

Alias for sources.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ractogateway.rag.page_index.BaseOcrBackend[source]

Bases: ABC

Abstract base class for OCR backends.

Implementors must provide extract_text(). An async default is provided via aextract_text() that offloads the synchronous call to a thread-pool executor.

abstractmethod extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

async aextract_text(image_bytes, mime_type='image/png')[source]

Async variant of extract_text() (thread-pool offload).

Return type:

str
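The contract described above, one required synchronous method plus an inherited async wrapper, can be sketched with stand-in classes. OcrBackendSketch mirrors the documented shape but is not the library class, FakeBackend is a hypothetical implementor, and the use of run_in_executor with the default executor is an assumption about how the offload might be done.

```python
import asyncio
from abc import ABC, abstractmethod


class OcrBackendSketch(ABC):
    """Stand-in for an OCR backend base class (illustrative only)."""

    @abstractmethod
    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        """Convert image_bytes to plain text."""

    async def aextract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        # Offload the blocking call to the default thread-pool executor.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.extract_text, image_bytes, mime_type)


class FakeBackend(OcrBackendSketch):
    """Hypothetical backend that only reports what it received."""

    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        return f"{mime_type}:{len(image_bytes)} bytes"


text = asyncio.run(FakeBackend().aextract_text(b"\x89PNG"))
print(text)
```

With this design a custom backend only needs to implement the synchronous method to become usable from async ingestion paths.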

class ractogateway.rag.page_index.TesseractOcrBackend(lang='eng', config='', confidence_threshold=40.0)[source]

Bases: BaseOcrBackend

OCR via Tesseract.

Requires pytesseract and a working Tesseract installation. Install with:

pip install ractogateway[rag-ocr-tesseract]
# Also install Tesseract binary: https://github.com/UB-Mannheim/tesseract/wiki

Parameters:
  • lang (str) – Tesseract language string, e.g. "eng" (default), "eng+deu".

  • config (str) – Extra Tesseract config flags, e.g. "--psm 6".

  • confidence_threshold (float) – Pages whose mean word confidence is below this value (0–100) are flagged in the returned metadata; the text is still returned. Set to 0 to disable flagging.

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

extract_with_confidence(image_bytes)[source]

Return (text, mean_confidence) for confidence-aware ingestion.

Return type:

tuple[str, float]
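A mean word confidence of the kind returned here could be computed along the following lines. This sketch assumes a list of per-box confidences in which Tesseract marks non-word boxes with -1; the helper names are hypothetical and the backend's actual aggregation may differ.

```python
def mean_word_confidence(confidences: list[float]) -> float:
    """Average the confidences of recognised words, ignoring non-word boxes (-1)."""
    valid = [c for c in confidences if c >= 0]
    return sum(valid) / len(valid) if valid else 0.0


def is_low_confidence(confidences: list[float], confidence_threshold: float = 40.0) -> bool:
    """Flag a page whose mean word confidence falls below the threshold."""
    return mean_word_confidence(confidences) < confidence_threshold


clean_page = [96.0, 88.0, -1, 12.0, -1]
noisy_page = [-1, 10.0, 20.0]
print(mean_word_confidence(clean_page), is_low_confidence(noisy_page))
```

With a threshold of 0 no page can fall below it, so nothing is flagged.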

class ractogateway.rag.page_index.EasyOcrBackend(languages=None, gpu=False)[source]

Bases: BaseOcrBackend

OCR via EasyOCR.

Deep-learning model; no cloud API required. Supports 80+ languages. Install with:

pip install ractogateway[rag-ocr-easy]

Parameters:
  • languages (list[str] | None) – List of language codes, e.g. ["en"] (default) or ["en", "de"].

  • gpu (bool) – Use CUDA GPU if available (default False for broad compatibility).

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index.GoogleVisionBackend(credentials_path=None)[source]

Bases: BaseOcrBackend

OCR via Google Cloud Vision API (DOCUMENT_TEXT_DETECTION).

Install with:

pip install ractogateway[rag-ocr-google]

Parameters:

credentials_path (str | None) – Path to a service-account JSON key file. If None the SDK uses Application Default Credentials (GOOGLE_APPLICATION_CREDENTIALS).

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index.GoogleDocumentAIBackend(project_id, processor_id, location='us', credentials_path=None)[source]

Bases: BaseOcrBackend

OCR via Google Document AI.

Best for structured documents: tables, forms, invoices, contracts. Install with:

pip install ractogateway[rag-ocr-google]

Parameters:
  • project_id (str) – GCP project ID.

  • processor_id (str) – Document AI processor ID (e.g. an OCR or Form Parser processor).

  • location (str) – Processor region, usually "us" or "eu" (default "us").

  • credentials_path (str | None) – Optional path to a service-account JSON key.

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index.AWSTextractBackend(region_name='us-east-1', aws_access_key_id=None, aws_secret_access_key=None)[source]

Bases: BaseOcrBackend

OCR via AWS Textract.

Best for forms and tables; uses DetectDocumentText for plain text. Install with:

pip install ractogateway[rag-ocr-aws]

Parameters:
  • region_name (str) – AWS region (default "us-east-1").

  • aws_access_key_id, aws_secret_access_key (str | None) – Optional explicit credentials; if omitted, boto3 uses the standard credential chain (env vars, ~/.aws/credentials, IAM role, etc.).

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index.AzureDocumentIntelligenceBackend(endpoint, api_key, model_id='prebuilt-read')[source]

Bases: BaseOcrBackend

OCR via Azure Document Intelligence.

Previously called Azure Form Recognizer. The prebuilt-read model is used by default; swap in prebuilt-document for richer extraction. Install with:

pip install ractogateway[rag-ocr-azure]

Parameters:
  • endpoint (str) – Azure resource endpoint URL.

  • api_key (str) – Azure resource API key.

  • model_id (str) – Document Intelligence model to use (default "prebuilt-read").

extract_text(image_bytes, mime_type='image/png')[source]

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.