ractogateway.rag.page_index.pipeline
PageIndexRAG — vectorless RAG using BM25 and a decision-tree index.
Unlike RactoRAG, this pipeline requires
no embedding model and no vector store. It indexes documents at the
page level and retrieves using a two-stage approach:
1. Decision index (inverted keyword index): narrows the full corpus to candidate pages that share at least one token with the query.
2. BM25 scoring: ranks the candidates with Okapi BM25 for accurate relevance ordering.
This makes it ideal for keyword-rich corpora (legal, technical, financial documents) where exact term matching matters more than semantic similarity.
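The two stages described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names (`tokenize`, `build_index`, `retrieve`) are hypothetical, the tokenizer is a simple regex, and the inverted index here covers every token on a page rather than a selected set of keywords.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Return per-page token counts and an inverted index (token -> page ids)."""
    counts = [Counter(tokenize(page)) for page in pages]
    inverted = defaultdict(set)
    for page_id, page_counts in enumerate(counts):
        for token in page_counts:
            inverted[token].add(page_id)
    return counts, inverted


def retrieve(query, pages, top_k=5, k1=1.5, b=0.75):
    """Hypothetical two-stage retrieval: candidate selection, then BM25 ranking."""
    counts, inverted = build_index(pages)
    n = len(pages)
    lengths = [sum(c.values()) for c in counts]
    avg_len = sum(lengths) / n if n else 0.0
    query_tokens = set(tokenize(query))

    # Stage 1: keep only pages that share at least one token with the query.
    candidates = set()
    for token in query_tokens:
        candidates |= inverted.get(token, set())

    # Stage 2: score the candidates with Okapi BM25.
    scored = []
    for page_id in candidates:
        score = 0.0
        for token in query_tokens:
            tf = counts[page_id][token]
            if tf == 0:
                continue
            df = len(inverted[token])
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            norm = k1 * (1 - b + b * lengths[page_id] / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scored.append((score, page_id))

    # Highest score first; ties are broken by page id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(page_id, score) for score, page_id in scored[:top_k]]
```

Because stage 1 discards every page with no query token, BM25 only runs over a small candidate set, which is what keeps the approach cheap on large corpora.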
Quick start:
from ractogateway import openai_developer_kit as gpt
from ractogateway.rag.page_index import PageIndexRAG
# 1. Setup
kit = gpt.Chat(model="gpt-4o-mini")
rag = PageIndexRAG(llm_kit=kit)
# 2. Ingest
rag.ingest("report.pdf")
# 3. Query
response = rag.query("What were the Q3 revenue figures?")
print(response.answer.content)
# Retrieve without LLM
results = rag.retrieve("revenue", top_k=5)
for r in results:
    print(r.rank, r.score, r.entry.source, r.entry.page_number)
- class ractogateway.rag.page_index.pipeline.PageIndexRAG(llm_kit=None, *, processors=None, reader_registry=None, context_template="Use the following retrieved page excerpts to answer the user's question.\\nIf the excerpts do not contain enough information, say so clearly.\\n\\n--- CONTEXT ---\\n{context}\\n--- END CONTEXT ---\\n\\nQuestion: {question}", default_prompt=None, page_size=1000, page_overlap=100, k1=1.5, b=0.75, top_keywords=20, ocr_backend=None, ocr_fallback=True, min_ocr_confidence=0.0)[source]
Bases: object
Vectorless RAG pipeline that indexes documents at the page level.
- Parameters:
  - llm_kit (Any) – Any RactoGateway developer kit (OpenAI, Anthropic, Google, Ollama, HuggingFace). Required only for query()/aquery(). Pass None to use the pipeline in retrieve-only mode.
  - processors (Sequence[BaseProcessor] | None) – Text processors applied to each page before indexing. Defaults to [TextCleaner()].
  - reader_registry (FileReaderRegistry | None) – File reader registry used to load non-PDF documents. Defaults to a FileReaderRegistry with all built-in readers registered.
  - context_template (str) – Jinja-style template with {context} and {question} placeholders used when building the LLM prompt.
  - default_prompt (RactoPrompt | None) – RactoPrompt used for generation. Defaults to a built-in factual Q&A prompt.
  - page_size (int) – Maximum character length of each page window for non-PDF files (default 1000).
  - page_overlap (int) – Character overlap between consecutive windows (default 100).
  - k1 (float) – BM25 term-frequency saturation parameter (default 1.5).
  - b (float) – BM25 length-normalisation parameter (default 0.75).
  - top_keywords (int) – Number of top TF-weighted keywords to extract per page for the decision index (default 20).
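As an illustration of what top_keywords controls, the sketch below picks the most frequent tokens on a page. It is only an assumption about how TF-weighted keyword extraction might look: the helper name `top_keywords`, the regex tokenizer, and the small stop-word list are all hypothetical and not taken from the library.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "in", "to", "is", "was"})


def top_keywords(page_text, n=20):
    """Return up to n tokens with the highest term frequency on the page."""
    tokens = [
        token
        for token in re.findall(r"[a-z0-9]+", page_text.lower())
        if token not in STOP_WORDS
    ]
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:n]]
```

A smaller n keeps the decision index compact but makes it more likely that a page is missed when the query uses one of its rarer terms.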
- retrieve(query, top_k=5)[source]
Retrieve the most relevant pages for query.
Uses two-stage retrieval: decision index (candidate selection) → BM25 scoring (ranking).
- Returns:
list[PageIndexResult] – Pages ranked by BM25 score (most relevant first).
- async aretrieve(query, top_k=5)[source]
Async variant of retrieve().
- ingest(path, **metadata)[source]
Read a file and add its pages to the index.
PDFs are split page-by-page; all other file types are split into fixed-size character windows.
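The fixed-size windowing used for non-PDF files can be pictured as a sliding window over the text. The following is a minimal sketch under the assumption that each window starts page_size - page_overlap characters after the previous one; the helper name `split_into_windows` is hypothetical and the library's exact boundary handling may differ.

```python
def split_into_windows(text, page_size=1000, page_overlap=100):
    """Split text into windows of at most page_size characters.

    Consecutive windows share page_overlap characters.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if not 0 <= page_overlap < page_size:
        raise ValueError("page_overlap must be in [0, page_size)")

    step = page_size - page_overlap
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + page_size])
        # Stop once the current window reaches the end of the text.
        if start + page_size >= len(text):
            break
        start += step
    return windows
```

With the defaults, a 2 500-character file would yield three windows starting at offsets 0, 900, and 1800.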
- ingest_text(text, source='manual', **metadata)[source]
Index raw text directly (no file I/O).
- async aingest_text(text, source='manual', **metadata)[source]
Async variant of ingest_text().
- ingest_dir(directory, pattern='**/*', *, on_progress=None, **metadata)[source]
Ingest all files matching pattern inside directory.
Files that cannot be read are logged and skipped; the rest are indexed normally.
- async aingest_dir(directory, pattern='**/*', *, max_concurrent=4, on_progress=None, **metadata)[source]
Async parallel variant of ingest_dir().
- Parameters:
  - directory (str) – Root directory to search.
  - pattern (str) – Glob pattern relative to directory (default "**/*").
  - max_concurrent (int) – Maximum number of files ingested concurrently (default 4).
  - on_progress (Callable[[int, int], None] | None) – Optional callback (done, total) -> None called after each file finishes (thread-safe; called from the event loop).
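The interplay of max_concurrent and on_progress can be sketched with a semaphore-bounded asyncio.gather. This is an illustrative pattern only, not the library's code: `ingest_all` and its `ingest_one` argument are hypothetical names, and here a failed file is recorded as an exception object rather than logged.

```python
import asyncio


async def ingest_all(paths, ingest_one, max_concurrent=4, on_progress=None):
    """Run ingest_one(path) for every path, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    total = len(paths)
    done = 0
    results = {}

    async def worker(path):
        nonlocal done
        async with semaphore:
            try:
                # Run the blocking ingest call in a worker thread.
                results[path] = await asyncio.to_thread(ingest_one, path)
            except Exception as exc:
                # Unreadable files are skipped; keep the error for inspection.
                results[path] = exc
        # The callback runs on the event loop, one call per finished file.
        done += 1
        if on_progress is not None:
            on_progress(done, total)

    await asyncio.gather(*(worker(path) for path in paths))
    return results
```

Since the counter is only updated from coroutines on the event loop, the callback sees a strictly increasing done value without any extra locking.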
- add_texts(texts, source='manual', **metadata)[source]
Ingest a list of text strings.
- search(query, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]
Alias for query().
- query(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]
Retrieve relevant pages and generate an answer with the LLM kit.
- Returns:
PageIndexResponse – Contains the generated answer, ranked sources, and the context string that was supplied to the model.
- Raises:
ValueError – If no llm_kit was provided and generation is requested.
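To show how the context string and the question might be combined before generation, the sketch below fills the default context_template from the class signature. It assumes the placeholders are substituted with str.format and that the escaped \\n sequences in the signature stand for real newlines; the helper `build_prompt` and the per-excerpt label format are hypothetical.

```python
# Default template text as shown in the class signature.
CONTEXT_TEMPLATE = (
    "Use the following retrieved page excerpts to answer the user's question.\n"
    "If the excerpts do not contain enough information, say so clearly.\n\n"
    "--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
    "Question: {question}"
)


def build_prompt(question, excerpts, template=CONTEXT_TEMPLATE):
    """Join (source, page_number, text) excerpts and fill the template."""
    context = "\n\n".join(
        f"[{source}, page {page_number}]\n{text}"
        for source, page_number, text in excerpts
    )
    # Only the template is parsed for placeholders, so braces inside the
    # excerpt text are inserted literally.
    return template.format(context=context, question=question)
```

A custom context_template would need to keep both {context} and {question} for a substitution like this to work.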
- async aquery(question, *, top_k=5, prompt=None, temperature=0.0, max_tokens=2048)[source]
Async variant of query().
- remove_document(doc_id)[source]
Remove all pages belonging to doc_id from the index.
- save(path)[source]
Serialise the full index to a JSON file.
The saved file contains all PageEntry records, BM25 term weights, and deduplication hashes. Reload with load().
- classmethod load(path, **kwargs)[source]
Load a previously saved index from path.
- Returns:
PageIndexRAG – A new instance with the index fully restored.
- property entry_count: int
Total number of indexed page entries.
- property document_count: int
Number of distinct documents ingested.