API Reference — PageIndexRAG
Vectorless RAG pipeline: keyword index + BM25 scoring. No embeddings, no vector store required.
Pipeline
- class ractogateway.rag.page_index.pipeline.PageIndexRAG(llm_kit=None, *, processors=None, reader_registry=None, context_template="Use the following retrieved page excerpts to answer the user's question.\\nIf the excerpts do not contain enough information, say so clearly.\\n\\n--- CONTEXT ---\\n{context}\\n--- END CONTEXT ---\\n\\nQuestion: {question}", default_prompt=None, page_size=1000, page_overlap=100, k1=1.5, b=0.75, top_keywords=20, ocr_backend=None, ocr_fallback=True, min_ocr_confidence=0.0)[source]
Bases: object
Vectorless RAG pipeline that indexes documents at the page level.
- Parameters:
llm_kit (Any) – Any RactoGateway developer kit (OpenAI, Anthropic, Google, Ollama, HuggingFace). Required only for query() / aquery(). Pass None to use the pipeline in retrieve-only mode.
processors (Sequence[BaseProcessor] | None) – Text processors applied to each page before indexing. Defaults to [TextCleaner()].
reader_registry (FileReaderRegistry | None) – File reader registry used to load non-PDF documents. Defaults to a FileReaderRegistry with all built-in readers registered.
context_template (str) – Jinja-style template with {context} and {question} placeholders used when building the LLM prompt.
default_prompt (RactoPrompt | None) – RactoPrompt used for generation. Defaults to a built-in factual Q&A prompt.
page_size (int) – Maximum character length of each page window for non-PDF files (default 1000).
page_overlap (int) – Character overlap between consecutive windows (default 100).
k1 (float) – BM25 term-frequency saturation parameter (default 1.5).
b (float) – BM25 length-normalisation parameter (default 0.75).
top_keywords (int) – Number of top TF-weighted keywords to extract per page for the decision index (default 20).
- retrieve(query, top_k=5)[source]
Retrieve the most relevant pages for query.
Uses two-stage retrieval: decision index (candidate selection) → BM25 scoring (ranking).
- Returns:
list[PageIndexResult] – Pages ranked by BM25 score (most relevant first).
- async aretrieve(query, top_k=5)[source]
Async variant of retrieve().
- ingest(path, **metadata)[source]
Read a file and add its pages to the index.
PDFs are split page-by-page; all other file types are split into fixed-size character windows.
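The fixed-size windowing for non-PDF files can be pictured with a minimal sketch. It assumes each window starts page_size - page_overlap characters after the previous one; the function name split_into_windows is hypothetical and the library's actual splitting logic may differ.

```python
def split_into_windows(text: str, page_size: int = 1000, page_overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size character windows (illustrative only)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if not 0 <= page_overlap < page_size:
        raise ValueError("page_overlap must be in [0, page_size)")

    # Assumption: consecutive windows share exactly page_overlap characters.
    step = page_size - page_overlap
    windows: list[str] = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + page_size])
        # Stop once a window reaches the end of the text, so no
        # trailing window consists only of already-covered overlap.
        if start + page_size >= len(text):
            break
    return windows


if __name__ == "__main__":
    pages = split_into_windows("a" * 2500)
    print([len(p) for p in pages])
```

With the defaults, a larger page_overlap reduces the chance that a sentence relevant to a query is cut in half at a window boundary, at the cost of indexing more duplicated text.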
- ingest_text(text, source='manual', **metadata)[source]
Index raw text directly (no file I/O).
- async aingest_text(text, source='manual', **metadata)[source]
Async variant of ingest_text().
- ingest_dir(directory, pattern='**/*', *, on_progress=None, **metadata)[source]
Ingest all files matching pattern inside directory.
Files that cannot be read are logged and skipped; the rest are indexed normally.
- async aingest_dir(directory, pattern='**/*', *, max_concurrent=4, on_progress=None, **metadata)[source]
Async parallel variant of ingest_dir().
- Parameters:
directory (str) – Root directory to search.
pattern (str) – Glob pattern relative to directory (default "**/*").
max_concurrent (int) – Maximum number of files ingested concurrently (default 4).
on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file finishes (thread-safe; called from the event loop).
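One common way to obtain bounded concurrency with a progress callback is an asyncio.Semaphore around a thread-offloaded worker. The sketch below illustrates that pattern only; ingest_all and ingest_one are hypothetical names, and nothing here is taken from the library's implementation.

```python
import asyncio
from typing import Any, Callable, Optional, Sequence


async def ingest_all(
    paths: Sequence[str],
    ingest_one: Callable[[str], Any],
    max_concurrent: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> list:
    """Run ingest_one over paths with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    total = len(paths)
    done = 0

    async def worker(path: str) -> Any:
        nonlocal done
        async with semaphore:
            # The blocking work runs in a thread so the event loop stays responsive.
            result = await asyncio.to_thread(ingest_one, path)
        # Back on the event loop: updating the counter and calling the
        # callback here needs no extra locking.
        done += 1
        if on_progress is not None:
            on_progress(done, total)
        return result

    # gather() returns results in the order of the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


if __name__ == "__main__":
    out = asyncio.run(
        ingest_all(["a.txt", "b.txt"], str.upper, on_progress=lambda d, t: print(d, "/", t))
    )
    print(out)
```

Calling the callback from the event loop rather than from worker threads is what lets a simple counter stand in for a lock in this sketch.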
- add_texts(texts, source='manual', **metadata)[source]
Ingest a list of text strings.
- search(query, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]
Alias for query().
- query(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]
Retrieve relevant pages and generate an answer with the LLM kit.
- Returns:
PageIndexResponse – Contains the generated answer, ranked sources, and the context string that was supplied to the model.
- Raises:
ValueError – If no llm_kit was provided and generation is requested.
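To make the relationship between retrieved pages, context_template, and the final prompt concrete, here is a small sketch. It assumes the {context} and {question} placeholders are filled with str.format and that pages are joined with numbered prefixes; both are illustrative assumptions, and build_prompt is a hypothetical helper rather than a library function.

```python
# Default template text as shown in the PageIndexRAG signature.
CONTEXT_TEMPLATE = (
    "Use the following retrieved page excerpts to answer the user's question.\n"
    "If the excerpts do not contain enough information, say so clearly.\n\n"
    "--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
    "Question: {question}"
)


def build_prompt(pages: list[str], question: str, template: str = CONTEXT_TEMPLATE) -> str:
    """Join page excerpts into one context string and fill the template."""
    # Assumption: excerpts are numbered in rank order and separated by blank lines.
    context = "\n\n".join(f"[{i}] {page}" for i, page in enumerate(pages, start=1))
    return template.format(context=context, question=question)


if __name__ == "__main__":
    print(build_prompt(["BM25 ranks pages.", "OCR handles scans."], "How are pages ranked?"))
```

In this sketch the joined context string plays the role of the context_used field on PageIndexResponse.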
- async aquery(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]
Async variant of query().
- remove_document(doc_id)[source]
Remove all pages belonging to doc_id from the index.
- save(path)[source]
Serialise the full index to a JSON file.
The saved file contains all PageEntry records, BM25 term weights, and deduplication hashes. Reload with load().
- classmethod load(path, **kwargs)[source]
Load a previously saved index from path.
- Returns:
PageIndexRAG – A new instance with the index fully restored.
- property entry_count: int
Total number of indexed page entries.
- property document_count: int
Number of distinct documents ingested.
Models
Pydantic models for the PageIndexRAG pipeline.
- class ractogateway.rag.page_index._models.PageEntry(**data)[source]
Bases: BaseModel
A single page (or fixed-size window) extracted from a document.
Produced by PageIndexRAG during ingestion and stored in the in-process index.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- entry_id: str
- content: str
- source: str
- doc_id: str
- char_count: int
- ocr_applied: bool
- property text: str
Alias for content.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class ractogateway.rag.page_index._models.PageIndexResult(**data)[source]
Bases: BaseModel
A single retrieved page together with its BM25 relevance score.
- entry: PageEntry
- score: float
- rank: int
- property content: str
Alias for entry.content.
- property text: str
Alias for entry.content.
- model_config: ClassVar[ConfigDict] = {}
- class ractogateway.rag.page_index._models.PageIndexResponse(**data)[source]
Bases: BaseModel
Full response from PageIndexRAG.query() / PageIndexRAG.aquery().
- answer: LLMResponse | None
- sources: list[PageIndexResult]
- query: str
- context_used: str
- property results: list[PageIndexResult]
Alias for sources.
- property pages: list[PageIndexResult]
Alias for sources.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
BM25 Engine
Pure-Python BM25 index and decision-tree inverted index.
No external dependencies required — everything is implemented with the Python standard library.
Two components work together for two-stage retrieval:
- _DecisionIndex: an inverted keyword index that maps content terms to page entry IDs. Given a tokenised query it returns the union of candidate entry IDs in O(|query terms|) time. This is the "decision tree" routing layer.
- BM25Index: Okapi BM25 (k1=1.5, b=0.75) that scores the candidates returned by the decision index. Only candidates are scored, so the full corpus is never re-ranked on every query.
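The candidate-selection stage can be illustrated with a generic inverted index. This is a sketch of the general technique, not the private _DecisionIndex class; the KeywordIndex name and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


class KeywordIndex:
    """Toy inverted index: keyword -> set of entry IDs containing it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, entry_id: str, keywords: Iterable[str]) -> None:
        # Register the entry under each of its keywords.
        for term in keywords:
            self._postings[term].add(entry_id)

    def candidates(self, query_terms: Iterable[str]) -> set[str]:
        # One dictionary lookup per query term; the result is the union
        # of all postings, so unrelated pages are never touched.
        found: set[str] = set()
        for term in query_terms:
            found |= self._postings.get(term, set())
        return found


if __name__ == "__main__":
    index = KeywordIndex()
    index.add("doc1-p1", ["bm25", "ranking"])
    index.add("doc1-p2", ["ocr", "ranking"])
    print(sorted(index.candidates(["ranking"])))
```

Only the entry IDs returned by candidates() would then be handed to the BM25 scorer, which is what keeps the second stage from scanning the whole corpus.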
- ractogateway.rag.page_index._bm25.extract_keywords(text, top_n=20)[source]
Return the top-n most frequent content tokens from text.
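As a rough illustration of frequency-based keyword extraction, the sketch below tokenises on lowercase alphanumerics and drops a small stop-word list. The tokeniser, the stop-word list, and the top_keywords name are assumptions made for the example; the real extract_keywords may define "content tokens" differently.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the library's.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}
)


def top_keywords(text: str, top_n: int = 20) -> list[str]:
    """Return the top_n most frequent non-stop-word tokens in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 1)
    # most_common() orders by count; equal counts keep first-seen order.
    return [term for term, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    print(top_keywords("The index and the index of BM25 index scoring BM25", top_n=2))
```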
- class ractogateway.rag.page_index._bm25.BM25Index(k1=1.5, b=0.75)[source]
Bases: object
Okapi BM25 scorer over a corpus of PageEntry texts.
- score(query, candidate_ids=None)[source]
Score candidates against query and return ranked results.
- Returns:
list of (entry_id, bm25_score, matched_terms) – Sorted descending by score, ties broken by entry_id for stability.
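For readers unfamiliar with Okapi BM25, the sketch below scores a set of candidates and returns tuples in the shape described above. It assumes the commonly used non-negative IDF variant, ln(1 + (N - n + 0.5) / (n + 0.5)), and pre-tokenised documents; bm25_rank is a hypothetical function and the library's exact formula and tokenisation may differ.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def bm25_rank(
    query_terms: list[str],
    docs: dict[str, list[str]],
    candidate_ids: Optional[Iterable[str]] = None,
    k1: float = 1.5,
    b: float = 0.75,
) -> list[tuple[str, float, list[str]]]:
    """Return (entry_id, score, matched_terms), best first, ties by entry_id."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many entries each term occurs.
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    ids = list(docs) if candidate_ids is None else [i for i in candidate_ids if i in docs]
    results = []
    for entry_id in ids:
        tokens = docs[entry_id]
        tf = Counter(tokens)
        # Length normalisation: longer-than-average entries are penalised.
        norm = k1 * (1 - b + b * len(tokens) / avg_len) if avg_len else k1
        score, matched = 0.0, []
        for term in dict.fromkeys(query_terms):  # de-duplicate, keep order
            freq = tf.get(term, 0)
            if freq == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * freq * (k1 + 1) / (freq + norm)
            matched.append(term)
        if matched:
            results.append((entry_id, score, matched))

    results.sort(key=lambda r: (-r[1], r[0]))
    return results


if __name__ == "__main__":
    corpus = {"p1": ["bm25", "index", "bm25"], "p2": ["ocr", "index"], "p3": ["ocr"]}
    for row in bm25_rank(["bm25", "index"], corpus):
        print(row)
```

Sorting on the pair (-score, entry_id) is one simple way to get the stable tie-breaking described in the return value.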
- property entry_count: int
OCR Backends
OCR backends for PageIndexRAG.
Each backend converts raw image bytes (PNG/JPEG) into extracted text. All backends follow the same interface so they are interchangeable.
Available backends
- TesseractOcrBackend: free, offline, pytesseract wrapper
- EasyOcrBackend: deep-learning, 80+ languages, offline
- GoogleVisionBackend: Google Cloud Vision API
- GoogleDocumentAIBackend: Google Document AI (tables, forms)
- AWSTextractBackend: AWS Textract (forms, tables, key-value)
- AzureDocumentIntelligenceBackend: Azure Form Recognizer v4
Quick start:
from ractogateway.rag.page_index import PageIndexRAG
from ractogateway.rag.page_index._ocr import TesseractOcrBackend

# `kit` is a RactoGateway developer kit created beforehand (needed only for query() / aquery()).
rag = PageIndexRAG(llm_kit=kit, ocr_backend=TesseractOcrBackend())
rag.ingest("scanned_report.pdf")  # OCR fallback auto-triggered
- class ractogateway.rag.page_index._ocr.BaseOcrBackend[source]
Bases: ABC
Abstract base class for OCR backends.
Implementors must provide extract_text(). An async default is provided via aextract_text() that offloads the synchronous call to a thread-pool executor.
- abstractmethod extract_text(image_bytes, mime_type='image/png')[source]
Convert image_bytes to plain text.
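The "sync method plus executor-backed async default" pattern can be sketched as follows. OcrBackendSketch and EchoBackend are hypothetical stand-ins written for this example; they mirror the documented interface but are not the library's classes, and the real aextract_text() may be implemented differently.

```python
import asyncio
from abc import ABC, abstractmethod


class OcrBackendSketch(ABC):
    """Illustrative base class: subclasses implement only the sync method."""

    @abstractmethod
    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        """Convert image_bytes to plain text."""

    async def aextract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        # Offload the blocking call to the default thread-pool executor
        # so the event loop is not blocked while OCR runs.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.extract_text, image_bytes, mime_type)


class EchoBackend(OcrBackendSketch):
    """Fake backend that reports what it received instead of doing OCR."""

    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        return f"{len(image_bytes)} bytes of {mime_type}"


if __name__ == "__main__":
    print(asyncio.run(EchoBackend().aextract_text(b"\x89PNG")))
```

A custom backend written against this shape gets async support without any extra code, which is the convenience the base class description points to.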
- class ractogateway.rag.page_index._ocr.TesseractOcrBackend(lang='eng', config='', confidence_threshold=40.0)[source]
Bases: BaseOcrBackend
OCR via Tesseract.
Requires pytesseract and a working Tesseract installation. Install with:
pip install ractogateway[rag-ocr-tesseract]
# Also install Tesseract binary: https://github.com/UB-Mannheim/tesseract/wiki
- Parameters:
lang (str) – Tesseract language string, e.g. "eng" (default), "eng+deu".
config (str) – Extra Tesseract config flags, e.g. "--psm 6".
confidence_threshold (float) – Pages where the mean word confidence is below this value (0–100) are flagged in the returned metadata; text is still returned. Set to 0 to disable filtering.
- extract_text(image_bytes, mime_type='image/png')[source]
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.EasyOcrBackend(languages=None, gpu=False)[source]
Bases: BaseOcrBackend
OCR via EasyOCR.
Deep-learning model; no cloud API required. Supports 80+ languages. Install with:
pip install ractogateway[rag-ocr-easy]
- extract_text(image_bytes, mime_type='image/png')[source]
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.GoogleVisionBackend(credentials_path=None)[source]
Bases: BaseOcrBackend
OCR via Google Cloud Vision API (DOCUMENT_TEXT_DETECTION). Install with:
pip install ractogateway[rag-ocr-google]
- Parameters:
credentials_path (str | None) – Path to a service-account JSON key file. If None, the SDK uses Application Default Credentials (GOOGLE_APPLICATION_CREDENTIALS).
- extract_text(image_bytes, mime_type='image/png')[source]
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.GoogleDocumentAIBackend(project_id, processor_id, location='us', credentials_path=None)[source]
Bases: BaseOcrBackend
OCR via Google Document AI.
Best for structured documents: tables, forms, invoices, contracts. Install with:
pip install ractogateway[rag-ocr-google]
- extract_text(image_bytes, mime_type='image/png')[source]
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.AWSTextractBackend(region_name='us-east-1', aws_access_key_id=None, aws_secret_access_key=None)[source]
Bases: BaseOcrBackend
OCR via AWS Textract.
Best for forms and tables; uses DetectDocumentText for plain text. Install with:
pip install ractogateway[rag-ocr-aws]
- extract_text(image_bytes, mime_type='image/png')[source]
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.AzureDocumentIntelligenceBackend(endpoint, api_key, model_id='prebuilt-read')[source]
Bases: BaseOcrBackend
OCR via Azure Document Intelligence.
Previously called Azure Form Recognizer. The prebuilt-read model is used by default; swap in prebuilt-document for richer extraction. Install with:
pip install ractogateway[rag-ocr-azure]
- extract_text(image_bytes, mime_type='image/png')[source]
Convert image_bytes to plain text.