ractogateway.rag.page_index.pipeline

PageIndexRAG: vectorless RAG using BM25 and a decision index (an inverted keyword index).

Unlike RactoRAG, this pipeline requires no embedding model and no vector store. It indexes documents at the page level and retrieves using a two-stage approach:

1. Decision index (inverted keyword index): narrows the full corpus to candidate pages that share at least one token with the query.
2. BM25 scoring: ranks the candidates with Okapi BM25 for accurate relevance ordering.

This makes it ideal for keyword-rich corpora (legal, technical, financial documents) where exact term matching matters more than semantic similarity.
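The two stages described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (tokenize, build_index, two_stage_retrieve) are hypothetical, the inverted index covers every token rather than only each page's top keywords, and the IDF term uses one common BM25 variant, which may differ from the one PageIndexRAG uses.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Return (inverted index, per-page term counts) for a list of page strings."""
    inverted = defaultdict(set)
    term_counts = []
    for page_id, text in enumerate(pages):
        counts = Counter(tokenize(text))
        term_counts.append(counts)
        for token in counts:
            inverted[token].add(page_id)
    return inverted, term_counts


def two_stage_retrieve(query, pages, top_k=5, k1=1.5, b=0.75):
    """Hypothetical sketch: candidate selection, then BM25 ranking."""
    inverted, term_counts = build_index(pages)
    query_tokens = set(tokenize(query))

    # Stage 1: keep only pages that share at least one token with the query.
    candidates = set()
    for token in query_tokens:
        candidates |= inverted.get(token, set())

    # Stage 2: score each candidate with BM25.
    n_pages = len(pages)
    lengths = [sum(counts.values()) for counts in term_counts]
    avg_len = sum(lengths) / n_pages if n_pages else 0.0
    scored = []
    for page_id in candidates:
        score = 0.0
        for token in query_tokens:
            tf = term_counts[page_id].get(token, 0)
            if tf == 0:
                continue
            df = len(inverted[token])
            idf = math.log((n_pages - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf + k1 * (1.0 - b + b * lengths[page_id] / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scored.append((score, page_id))

    # Highest score first; ties broken by page id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(page_id, score) for score, page_id in scored[:top_k]]
```

Because stage 1 discards every page with no query token, a query whose terms never appear in the corpus returns an empty list, which is the expected trade-off of keyword-based retrieval.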

Quick start:

from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG

# 1. Setup
kit = gpt.Chat(model="gpt-4o-mini")
rag = PageIndexRAG(llm_kit=kit)

# 2. Ingest
rag.ingest("report.pdf")

# 3. Query
response = rag.query("What were the Q3 revenue figures?")
print(response.answer.content)

# Retrieve without LLM
results = rag.retrieve("revenue", top_k=5)
for r in results:
    print(r.rank, r.score, r.entry.source, r.entry.page_number)

class ractogateway.rag.page_index.pipeline.PageIndexRAG(llm_kit=None, *, processors=None, reader_registry=None, context_template="Use the following retrieved page excerpts to answer the user's question.\nIf the excerpts do not contain enough information, say so clearly.\n\n--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\nQuestion: {question}", default_prompt=None, page_size=1000, page_overlap=100, k1=1.5, b=0.75, top_keywords=20, ocr_backend=None, ocr_fallback=True, min_ocr_confidence=0.0)

Bases: object

Vectorless RAG pipeline that indexes documents at the page level.

Parameters:

llm_kit (Any) – Any RactoGateway developer kit (OpenAI, Anthropic, Google, Ollama, HuggingFace). Required only for query() / aquery(). Pass None to use the pipeline in retrieve-only mode.
processors (Sequence[BaseProcessor] | None) – Text processors applied to each page before indexing. Defaults to [TextCleaner()].
reader_registry (FileReaderRegistry | None) – File reader registry used to load non-PDF documents. Defaults to a FileReaderRegistry with all built-in readers registered.
context_template (str) – Template string with {context} and {question} placeholders, used when building the LLM prompt.
default_prompt (RactoPrompt | None) – RactoPrompt used for generation. Defaults to a built-in factual Q&A prompt.
page_size (int) – Maximum character length of each page window for non-PDF files (default 1000).
page_overlap (int) – Character overlap between consecutive windows (default 100).
k1 (float) – BM25 term-frequency saturation parameter (default 1.5).
b (float) – BM25 length-normalisation parameter (default 0.75).
top_keywords (int) – Number of top TF-weighted keywords to extract per page for the decision index (default 20).
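To make page_size, page_overlap and top_keywords concrete, here is a rough sketch of character windowing and keyword extraction. The function names are hypothetical, and raw term frequency stands in for whatever TF weighting the pipeline applies, so treat it as an illustration of the parameters rather than the actual indexing code.

```python
import re
from collections import Counter


def split_into_pages(text, page_size=1000, page_overlap=100):
    """Split text into windows of at most page_size characters.

    Consecutive windows share page_overlap characters (assumed behaviour).
    """
    if page_overlap >= page_size:
        raise ValueError("page_overlap must be smaller than page_size")
    step = page_size - page_overlap
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + page_size])
        if start + page_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return pages


def top_keywords_for_page(page, top_keywords=20):
    """Return the most frequent tokens on a page (raw TF, an assumption)."""
    counts = Counter(re.findall(r"[a-z0-9]+", page.lower()))
    return [token for token, _ in counts.most_common(top_keywords)]
```

With the defaults, a 2,500-character text yields three windows starting at offsets 0, 900 and 1,800, so a sentence that straddles a window boundary still appears whole in at least one window, provided it is shorter than the overlap.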

retrieve(query, top_k=5)

Retrieve the most relevant pages for query.

Uses two-stage retrieval: decision index (candidate selection) → BM25 scoring (ranking).

Parameters:

query (str) – Natural-language question or keyword string.
top_k (int) – Maximum number of results to return.

Return type:

list[PageIndexResult]

Returns:

list[PageIndexResult] – Pages ranked by BM25 score (most relevant first).

async aretrieve(query, top_k=5)

Async variant of retrieve().

Return type:

list[PageIndexResult]

ingest(path, **metadata)

Read a file and add its pages to the index.

PDFs are split page-by-page; all other file types are split into fixed-size character windows.

Parameters:

path (str) – Absolute or relative path to the file.
**metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

Returns:

list[PageEntry] – All page entries created from this file.

async aingest(path, **metadata)

Async variant of ingest().

Return type:

list[PageEntry]

ingest_text(text, source='manual', **metadata)

Index raw text directly (no file I/O).

Parameters:

text (str) – Plain text to index.
source (str) – Descriptive label stored in PageEntry.source.
**metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

async aingest_text(text, source='manual', **metadata)

Async variant of ingest_text().

Return type:

list[PageEntry]

ingest_dir(directory, pattern='**/*', *, on_progress=None, **metadata)

Ingest all files matching pattern inside directory.

Files that cannot be read are logged and skipped; the rest are indexed normally.

Parameters:

directory (str) – Root directory to search.
pattern (str) – Glob pattern relative to directory (default "**/*").
on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file is processed (or skipped). Useful for progress bars.
**metadata (Any) – Forwarded to every ingest() call.

Return type:

list[PageEntry]
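The skip-on-failure and on_progress behaviour described for ingest_dir() can be approximated as follows. This is a sketch under assumptions: ingest_dir_sketch and the ingest_one callable are hypothetical stand-ins, and the real method may order files or log failures differently.

```python
from pathlib import Path


def ingest_dir_sketch(directory, ingest_one, pattern="**/*", on_progress=None):
    """Walk a directory, ingest each matching file, and skip unreadable ones.

    ingest_one is a hypothetical callable taking a Path and returning a list
    of entries; it stands in for a per-file ingest call.
    """
    # Path.glob also yields directories for "**/*", so keep regular files only.
    files = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    entries = []
    for done, path in enumerate(files, start=1):
        try:
            entries.extend(ingest_one(path))
        except Exception as exc:
            # Report and continue rather than aborting the whole run.
            print(f"skipping {path}: {exc}")
        if on_progress is not None:
            on_progress(done, len(files))
    return entries
```

A callback such as lambda done, total: print(f"{done}/{total}") is enough to drive a simple progress display, since it fires once per file whether the file was indexed or skipped.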

async aingest_dir(directory, pattern='**/*', *, max_concurrent=4, on_progress=None, **metadata)

Async parallel variant of ingest_dir().

Parameters:

directory (str) – Root directory to search.
pattern (str) – Glob pattern relative to directory (default "**/*").
max_concurrent (int) – Maximum number of files ingested concurrently (default 4).
on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file finishes (thread-safe; called from the event loop).
**metadata (Any) – Forwarded to every aingest() call.

Return type:

list[PageEntry]
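One common way to cap concurrency at max_concurrent is an asyncio.Semaphore around each per-file task. The sketch below assumes that approach; aingest_dir_sketch and aingest_one are hypothetical names, and the library's internals may differ.

```python
import asyncio


async def aingest_dir_sketch(paths, aingest_one, max_concurrent=4, on_progress=None):
    """Ingest paths concurrently, with at most max_concurrent in flight.

    aingest_one is a hypothetical coroutine function that takes a path and
    returns a list of entries.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    total = len(paths)
    done = 0

    async def worker(path):
        nonlocal done
        async with semaphore:
            try:
                result = await aingest_one(path)
            except Exception:
                result = []  # skip files that fail to ingest
        # Runs on the event loop, so the counter needs no extra locking.
        done += 1
        if on_progress is not None:
            on_progress(done, total)
        return result

    per_file = await asyncio.gather(*(worker(p) for p in paths))
    # gather preserves input order, so entries come back grouped by file.
    return [entry for entries in per_file for entry in entries]
```

Raising max_concurrent helps most when ingestion is I/O-bound; for CPU-heavy parsing, a higher limit is unlikely to give a proportional speed-up.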

add_document(path, **metadata)

Alias for ingest().

Return type:

list[PageEntry]

add_texts(texts, source='manual', **metadata)

Ingest a list of text strings.

Return type:

list[PageEntry]

search(query, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)

Alias for query().

Return type:

PageIndexResponse

query(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)

Retrieve relevant pages and generate an answer with the LLM kit.

Parameters:

question (str) – Natural-language question to answer.
top_k (int) – Number of pages to retrieve.
prompt (RactoPrompt | None) – Overrides the pipeline's default_prompt for this call.
temperature (float) – Sampling temperature for generation.
max_tokens (int) – Maximum generation tokens.

Return type:

PageIndexResponse

Returns:

PageIndexResponse – Contains the generated answer, ranked sources, and the context string that was supplied to the model.

Raises:

ValueError – If no llm_kit was provided and generation is requested.
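As an illustration of how the context string might be assembled before generation, the sketch below fills the default context_template with retrieved excerpts. It assumes the placeholders are substituted with str.format and that each excerpt is labelled with its source and page number; both are assumptions, and build_prompt is a hypothetical helper rather than part of the API.

```python
# Default template text, copied from the constructor signature.
CONTEXT_TEMPLATE = (
    "Use the following retrieved page excerpts to answer the user's question.\n"
    "If the excerpts do not contain enough information, say so clearly.\n\n"
    "--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
    "Question: {question}"
)


def build_prompt(question, pages, template=CONTEXT_TEMPLATE):
    """Fill the template with excerpts given as (source, page_number, text).

    The "[source p.N]" label format is an assumption made for this sketch.
    """
    context = "\n\n".join(
        f"[{source} p.{page_number}] {text}" for source, page_number, text in pages
    )
    return template.format(context=context, question=question)
```

A custom context_template needs to keep both {context} and {question}; under the str.format assumption, any other literal braces in the template would have to be doubled.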

async aquery(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)

Async variant of query().

Return type:

PageIndexResponse

remove_document(doc_id)

Remove all pages belonging to doc_id from the index.

Parameters:

doc_id (str) – The doc_id value from any PageEntry returned during ingestion.

Return type:

int

Returns:

int – Number of page entries removed.

clear()

Remove all indexed entries and reset the pipeline to empty state.

Return type:

None

save(path)

Serialise the full index to a JSON file.

The saved file contains all PageEntry records, BM25 term weights, and deduplication hashes. Reload with load().

Parameters:

path (str) – Destination file path (will be created or overwritten).

Return type:

None

classmethod load(path, **kwargs)

Load a previously saved index from path.

Parameters:

path (str) – JSON file written by save().
**kwargs (Any) – Forwarded to the constructor (e.g. llm_kit=kit).

Return type:

PageIndexRAG

Returns:

PageIndexRAG – A new instance with the index fully restored.
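The save()/load() pairing follows a familiar pattern: state is written as JSON, then a classmethod constructs a fresh instance from constructor kwargs and restores the state into it. The toy class below sketches that pattern only. TinyIndex and its JSON layout are hypothetical; the real file also holds BM25 term weights and deduplication hashes, and its schema is not documented here.

```python
import json


class TinyIndex:
    """Hypothetical stand-in that shows the save/load round trip."""

    def __init__(self, llm_kit=None):
        self.llm_kit = llm_kit  # not serialised; supplied again at load time
        self.entries = []

    def save(self, path):
        # Create or overwrite the destination file.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"entries": self.entries}, fh)

    @classmethod
    def load(cls, path, **kwargs):
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        instance = cls(**kwargs)  # kwargs are forwarded to the constructor
        instance.entries = data["entries"]
        return instance
```

This is why the documented load() forwards **kwargs: objects such as an LLM kit cannot be stored in JSON, so they are passed again when the index is restored.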

property entry_count: int

Total number of indexed page entries.

property document_count: int

Number of distinct documents ingested.