ractogateway.rag.page_index.pipeline

PageIndexRAG: vectorless RAG using BM25 and a decision index (an inverted keyword index).

Unlike RactoRAG, this pipeline requires no embedding model and no vector store. It indexes documents at the page level and retrieves using a two-stage approach:

  1. Decision index (inverted keyword index) — narrows the full corpus to candidate pages that share at least one token with the query.

  2. BM25 scoring — ranks the candidates with Okapi BM25 for accurate relevance ordering.

This makes it ideal for keyword-rich corpora (legal, technical, financial documents) where exact term matching matters more than semantic similarity.
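
The two stages described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the tokenizer, the `build_index` and `retrieve` helpers, and the IDF variant (the non-negative "+1" form) are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Return per-page term counts and an inverted index (token -> page ids)."""
    term_counts = [Counter(tokenize(p)) for p in pages]
    inverted = defaultdict(set)
    for page_id, counts in enumerate(term_counts):
        for token in counts:
            inverted[token].add(page_id)
    return term_counts, inverted


def retrieve(query, pages, top_k=5, k1=1.5, b=0.75):
    """Two-stage retrieval: candidate selection, then BM25 ranking."""
    term_counts, inverted = build_index(pages)
    n_pages = len(pages)
    lengths = [sum(c.values()) for c in term_counts]
    avg_len = sum(lengths) / n_pages if n_pages else 0.0
    q_tokens = set(tokenize(query))

    # Stage 1: keep only pages that share at least one token with the query.
    candidates = set()
    for token in q_tokens:
        candidates |= inverted.get(token, set())

    # Stage 2: score the candidates with Okapi BM25.
    scored = []
    for page_id in candidates:
        score = 0.0
        for token in q_tokens:
            tf = term_counts[page_id][token]
            if tf == 0:
                continue
            df = len(inverted[token])
            # Non-negative IDF variant; the library's exact formula may differ.
            idf = math.log((n_pages - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * lengths[page_id] / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scored.append((score, page_id))

    # Highest score first; ties broken by page id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


pages = [
    "Q3 revenue rose while costs fell.",
    "The board met to discuss hiring plans.",
    "Revenue and revenue growth by region.",
]
results = retrieve("revenue", pages)
```

Here the second page never reaches the scoring stage because it shares no token with the query, which is the point of the candidate-selection step.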

Quick start:

from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG

# 1. Setup
kit = gpt.Chat(model="gpt-4o-mini")
rag = PageIndexRAG(llm_kit=kit)

# 2. Ingest
rag.ingest("report.pdf")

# 3. Query
response = rag.query("What were the Q3 revenue figures?")
print(response.answer.content)

# Retrieve without LLM
results = rag.retrieve("revenue", top_k=5)
for r in results:
    print(r.rank, r.score, r.entry.source, r.entry.page_number)

class ractogateway.rag.page_index.pipeline.PageIndexRAG(llm_kit=None, *, processors=None, reader_registry=None, context_template="Use the following retrieved page excerpts to answer the user's question.\nIf the excerpts do not contain enough information, say so clearly.\n\n--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\nQuestion: {question}", default_prompt=None, page_size=1000, page_overlap=100, k1=1.5, b=0.75, top_keywords=20, ocr_backend=None, ocr_fallback=True, min_ocr_confidence=0.0)[source]

Bases: object

Vectorless RAG pipeline that indexes documents at the page level.

Parameters:
  • llm_kit (Any) – Any RactoGateway developer kit (OpenAI, Anthropic, Google, Ollama, HuggingFace). Required only for query() / aquery(). Pass None to use the pipeline in retrieve-only mode.

  • processors (Sequence[BaseProcessor] | None) – Text processors applied to each page before indexing. Defaults to [TextCleaner()].

  • reader_registry (FileReaderRegistry | None) – File reader registry used to load non-PDF documents. Defaults to a FileReaderRegistry with all built-in readers registered.

  • context_template (str) – Template string with {context} and {question} placeholders (single-brace, str.format style) used when building the LLM prompt.

  • default_prompt (RactoPrompt | None) – RactoPrompt used for generation. Defaults to a built-in factual Q&A prompt.

  • page_size (int) – Maximum character length of each page window for non-PDF files (default 1000).

  • page_overlap (int) – Character overlap between consecutive windows (default 100).

  • k1 (float) – BM25 term-frequency saturation parameter (default 1.5).

  • b (float) – BM25 length-normalisation parameter (default 0.75).

  • top_keywords (int) – Number of top TF-weighted keywords to extract per page for the decision index (default 20).
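
To make the top_keywords parameter concrete, the following sketch picks the most frequent tokens of a page. It is an assumption-laden illustration: the `top_tf_keywords` helper and its tokenizer are hypothetical, and the library may additionally filter stop words or weight terms differently.

```python
import re
from collections import Counter


def top_tf_keywords(text, top_keywords=20):
    """Return the `top_keywords` most frequent tokens of a page."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:top_keywords]]


page = "revenue revenue revenue margin margin outlook"
keywords = top_tf_keywords(page, top_keywords=2)
```

With a limit of 2, the token "outlook" is dropped from this page's keyword list.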

retrieve(query, top_k=5)[source]

Retrieve the most relevant pages for query.

Uses two-stage retrieval: decision index (candidate selection) → BM25 scoring (ranking).

Parameters:
  • query (str) – Natural-language question or keyword string.

  • top_k (int) – Maximum number of results to return.

Return type:

list[PageIndexResult]

Returns:

list[PageIndexResult] – Pages ranked by BM25 score (most relevant first).

async aretrieve(query, top_k=5)[source]

Async variant of retrieve().

Return type:

list[PageIndexResult]

ingest(path, **metadata)[source]

Read a file and add its pages to the index.

PDFs are split page-by-page; all other file types are split into fixed-size character windows.

Parameters:
  • path (str) – Absolute or relative path to the file.

  • **metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

Returns:

list[PageEntry] – All page entries created from this file.
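
The fixed-size character windows used for non-PDF files can be pictured with the sketch below, which mirrors the constructor's page_size and page_overlap defaults. The `split_windows` helper is hypothetical; the library's handling of the final window or of word boundaries may differ.

```python
def split_windows(text, page_size=1000, page_overlap=100):
    """Split text into windows of at most page_size characters.

    Consecutive windows share page_overlap characters.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if not 0 <= page_overlap < page_size:
        raise ValueError("page_overlap must be in [0, page_size)")

    step = page_size - page_overlap
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + page_size])
        # Stop once the current window already reaches the end of the text.
        if start + page_size >= len(text):
            break
        start += step
    return windows


# 2500 characters with a repeating digit pattern so overlaps are visible.
text = "".join(str(i % 10) for i in range(2500))
windows = split_windows(text)
```

With the defaults, each window starts 900 characters after the previous one, so the last 100 characters of one window reappear at the start of the next.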

async aingest(path, **metadata)[source]

Async variant of ingest().

Return type:

list[PageEntry]

ingest_text(text, source='manual', **metadata)[source]

Index raw text directly (no file I/O).

Parameters:
  • text (str) – Plain text to index.

  • source (str) – Descriptive label stored in PageEntry.source.

  • **metadata (Any) – Arbitrary key/value pairs stored in PageEntry.extra.

Return type:

list[PageEntry]

async aingest_text(text, source='manual', **metadata)[source]

Async variant of ingest_text().

Return type:

list[PageEntry]

ingest_dir(directory, pattern='**/*', *, on_progress=None, **metadata)[source]

Ingest all files matching pattern inside directory.

Files that cannot be read are logged and skipped; the rest are indexed normally.

Parameters:
  • directory (str) – Root directory to search.

  • pattern (str) – Glob pattern relative to directory (default "**/*").

  • on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file is processed (or skipped). Useful for progress bars.

  • **metadata (Any) – Forwarded to every ingest() call.

Return type:

list[PageEntry]
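
The glob, skip-on-error, and progress-callback behaviour described for ingest_dir() can be sketched with the standard library alone. The `walk_and_read` helper is hypothetical and only reads UTF-8 text; it stands in for the per-file ingest() call and is not the library's code.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def walk_and_read(directory, pattern="**/*", on_progress=None):
    """Read every file matching pattern; log and skip unreadable ones."""
    files = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    total = len(files)
    loaded = {}
    for done, path in enumerate(files, start=1):
        try:
            loaded[path.name] = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            logger.warning("Skipping %s: %s", path, exc)
        # The callback fires for processed and skipped files alike.
        if on_progress is not None:
            on_progress(done, total)
    return loaded


progress = []
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (root / "bad.bin").write_bytes(b"\xff\xfe\x00")  # not valid UTF-8
    loaded = walk_and_read(root, on_progress=lambda d, t: progress.append((d, t)))
```

The unreadable file still advances the progress counter, so a progress bar driven by the callback reaches its total even when some files are skipped.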

async aingest_dir(directory, pattern='**/*', *, max_concurrent=4, on_progress=None, **metadata)[source]

Async parallel variant of ingest_dir().

Parameters:
  • directory (str) – Root directory to search.

  • pattern (str) – Glob pattern relative to directory (default "**/*").

  • max_concurrent (int) – Maximum number of files ingested concurrently (default 4).

  • on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file finishes (thread-safe; called from the event loop).

  • **metadata (Any) – Forwarded to every aingest() call.

Return type:

list[PageEntry]
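
One common way to bound concurrency in the manner of max_concurrent is an asyncio.Semaphore around each task. The sketch below assumes that pattern; the `process_all` helper and the use of asyncio.to_thread for the blocking work are illustrative choices, not a description of how aingest_dir() is implemented.

```python
import asyncio


async def process_all(items, worker, max_concurrent=4, on_progress=None):
    """Run a blocking worker over items, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    total = len(items)
    done = 0

    async def run_one(item):
        nonlocal done
        async with semaphore:
            # Run the blocking worker in a thread so the event loop stays free.
            result = await asyncio.to_thread(worker, item)
        # Back on the event loop: no await between the update and the
        # callback, so the counter is updated without races.
        done += 1
        if on_progress is not None:
            on_progress(done, total)
        return result

    # gather() returns results in the order of the input items.
    return await asyncio.gather(*(run_one(item) for item in items))


calls = []
lengths = asyncio.run(
    process_all(
        ["a", "bb", "ccc"],
        len,
        max_concurrent=2,
        on_progress=lambda d, t: calls.append((d, t)),
    )
)
```

Because the callback runs on the event loop rather than in a worker thread, it can update shared state such as a progress bar without extra locking.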

add_document(path, **metadata)[source]

Alias for ingest().

Return type:

list[PageEntry]

add_texts(texts, source='manual', **metadata)[source]

Ingest a list of text strings.

Return type:

list[PageEntry]

search(query, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Alias for query().

Return type:

PageIndexResponse

query(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Retrieve relevant pages and generate an answer with the LLM kit.

Parameters:
  • question (str) – Natural-language question to answer.

  • top_k (int) – Number of pages to retrieve.

  • prompt (RactoPrompt | None) – Overrides the pipeline's default_prompt for this call.

  • temperature (float) – Sampling temperature for generation.

  • max_tokens (int) – Maximum generation tokens.

Return type:

PageIndexResponse

Returns:

PageIndexResponse – Contains the generated answer, ranked sources, and the context string that was supplied to the model.

Raises:

ValueError – If no llm_kit was provided and generation is requested.
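
As a rough picture of how retrieved pages could be turned into the context string, the sketch below fills the default context_template with Python's str.format. The `build_prompt` helper and the "[source p.N]" header placed before each excerpt are hypothetical; the library's actual context layout is not documented here.

```python
# The default template from the constructor signature.
CONTEXT_TEMPLATE = (
    "Use the following retrieved page excerpts to answer the user's question.\n"
    "If the excerpts do not contain enough information, say so clearly.\n\n"
    "--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
    "Question: {question}"
)


def build_prompt(pages, question, template=CONTEXT_TEMPLATE):
    """pages is a list of (source, page_number, text) tuples."""
    # Hypothetical excerpt layout: a header line followed by the page text.
    context = "\n\n".join(
        f"[{source} p.{page_number}]\n{text}"
        for source, page_number, text in pages
    )
    return template.format(context=context, question=question)


prompt = build_prompt(
    [("report.pdf", 3, "Q3 revenue was flat.")],
    "What were the Q3 revenue figures?",
)
```

A custom template passed as context_template needs to keep both placeholders, otherwise either the excerpts or the question would be missing from the prompt.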

async aquery(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]

Async variant of query().

Return type:

PageIndexResponse

remove_document(doc_id)[source]

Remove all pages belonging to doc_id from the index.

Parameters:

doc_id (str) – The doc_id value from any PageEntry returned during ingestion.

Return type:

int

Returns:

int – Number of page entries removed.
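
Removing a document means dropping its pages and any index postings that point to them. The sketch below shows one way this could work on a toy in-memory index; the `remove_document` function, the entry dictionaries, and the "tokens" field are hypothetical stand-ins for the library's internal structures.

```python
def remove_document(entries, inverted, doc_id):
    """Delete all pages of doc_id and return how many were removed.

    entries:  page_id -> {"doc_id": str, "tokens": list[str]}
    inverted: token -> set of page ids
    """
    doomed = [pid for pid, entry in entries.items() if entry["doc_id"] == doc_id]
    for pid in doomed:
        for token in set(entries[pid]["tokens"]):
            postings = inverted.get(token)
            if postings is None:
                continue
            postings.discard(pid)
            # Drop tokens that no longer point to any page.
            if not postings:
                del inverted[token]
        del entries[pid]
    return len(doomed)


entries = {
    0: {"doc_id": "d1", "tokens": ["revenue", "q3"]},
    1: {"doc_id": "d1", "tokens": ["outlook"]},
    2: {"doc_id": "d2", "tokens": ["revenue"]},
}
inverted = {"revenue": {0, 2}, "q3": {0}, "outlook": {1}}
removed = remove_document(entries, inverted, "d1")
```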

clear()[source]

Remove all indexed entries and reset the pipeline to empty state.

Return type:

None

save(path)[source]

Serialise the full index to a JSON file.

The saved file contains all PageEntry records, BM25 term weights, and deduplication hashes. Reload with load().

Parameters:

path (str) – Destination file path (will be created or overwritten).

Return type:

None

classmethod load(path, **kwargs)[source]

Load a previously saved index from path.

Parameters:
  • path (str) – JSON file written by save().

  • **kwargs (Any) – Forwarded to the constructor (e.g. llm_kit=kit).

Return type:

PageIndexRAG

Returns:

PageIndexRAG – A new instance with the index fully restored.
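
The save() and load() pair amounts to a JSON round trip. The sketch below illustrates the idea with a hypothetical file layout: the `save_index` and `load_index` helpers, the "entries" and "hashes" keys, the "text" field, and the choice of SHA-256 for deduplication hashes are all assumptions, and BM25 term weights are left out for brevity.

```python
import hashlib
import json
import os
import tempfile


def content_hash(text):
    """Hash page text so duplicate content can be detected (assumed scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def save_index(entries, path):
    """Write entries and their deduplication hashes to a JSON file."""
    payload = {
        "entries": entries,
        "hashes": sorted({content_hash(e["text"]) for e in entries}),
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)


def load_index(path):
    """Read back the entries and hashes written by save_index()."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    return payload["entries"], set(payload["hashes"])


entries = [
    {"doc_id": "d1", "source": "report.pdf", "page_number": 1, "text": "Q3 revenue"},
    {"doc_id": "d1", "source": "report.pdf", "page_number": 2, "text": "Outlook"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    save_index(entries, path)
    restored, hashes = load_index(path)
```

With the real API, the equivalent round trip is a call to save(path) followed by PageIndexRAG.load(path, llm_kit=kit), as described above.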

property entry_count: int

Total number of indexed page entries.

property document_count: int

Number of distinct documents ingested.