RAG — Retrieval-Augmented Generation
RactoGateway ships two complementary RAG pipelines:
| Pipeline | Requires embeddings | Requires vector store | Best for |
|---|---|---|---|
| RactoRAG | Yes | Yes | Semantic / conceptual queries |
| PageIndexRAG | No | No | Keyword-rich exact-term queries, cost-sensitive setups |
RactoRAG
RactoRAG provides a full pipeline: read → chunk → process → embed → store → retrieve.
PageIndexRAG — Vectorless BM25 RAG
PageIndexRAG indexes documents at the page level and retrieves using a two-stage
decision-tree approach — no embedding API calls, no external vector store required.
How it works
1. Decision index (routing): Each page’s top-N TF-weighted keywords are stored in an inverted index (term → page IDs). A query is tokenised and the index returns the union of matching page IDs in O(|query terms|) time.
2. BM25 scoring: Only the candidate pages from step 1 are scored with Okapi BM25 (k1=1.5, b=0.75), giving accurate relevance ordering without scanning the full corpus.
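The two stages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not PageIndexRAG's actual implementation: the tokeniser, the use of raw term frequency to pick each page's keywords, the IDF variant, and the names `build_index` and `retrieve` are all hypothetical. Only the k1 and b values come from the text above.

```python
import math
import re
from collections import Counter, defaultdict

K1, B = 1.5, 0.75  # BM25 parameters quoted in the text above


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (an assumed tokeniser)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str], top_n: int = 3):
    """Return per-page term counts, the keyword -> page IDs routing index, and document frequencies."""
    term_counts = {pid: Counter(tokenize(text)) for pid, text in pages.items()}
    routing = defaultdict(set)
    doc_freq = Counter()
    for pid, counts in term_counts.items():
        doc_freq.update(counts.keys())  # each page counts once per distinct term
        for term, _ in counts.most_common(top_n):  # top-N terms by raw TF (assumption)
            routing[term].add(pid)
    return term_counts, routing, doc_freq


def retrieve(query: str, term_counts, routing, doc_freq, top_k: int = 5):
    """Stage 1: union of routed pages. Stage 2: BM25 over those candidates only."""
    terms = tokenize(query)
    candidates = set().union(*(routing.get(t, set()) for t in terms))
    n_pages = len(term_counts)
    avg_len = sum(sum(c.values()) for c in term_counts.values()) / n_pages
    scored = []
    for pid in candidates:
        counts = term_counts[pid]
        length = sum(counts.values())
        score = 0.0
        for t in terms:
            tf = counts[t]
            if tf == 0:
                continue
            df = doc_freq[t]
            idf = math.log(1 + (n_pages - df + 0.5) / (df + 0.5))
            score += idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * length / avg_len))
        scored.append((pid, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


pages = {
    "p1": "revenue revenue revenue apac growth",
    "p2": "hiring hiring plan headcount",
    "p3": "apac apac revenue outlook",
}
term_counts, routing, doc_freq = build_index(pages)
ranked = retrieve("apac revenue", term_counts, routing, doc_freq)
print(ranked)  # only p1 and p3 are scored; p2 never reaches the BM25 stage
```

Routing costs one dictionary lookup per query term, so the BM25 loop only touches pages that share at least one indexed keyword with the query.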
Quick start
from ractogateway.rag.page_index import PageIndexRAG
from ractogateway import openai_developer_kit as gpt
kit = gpt.Chat(model="gpt-4o", default_prompt=my_prompt)
rag = PageIndexRAG(llm_kit=kit)
rag.ingest("report.pdf") # page-by-page via pypdf
rag.ingest("notes.txt") # sliding-window (1 000 chars, 100 overlap)
rag.ingest_text("raw text...", source="memo")
# Retrieve-only (no LLM needed)
results = rag.retrieve("Q3 revenue APAC", top_k=5)
for r in results:
    print(r.rank, r.score, r.entry.source, r.entry.page_number, r.matched_terms)
# Full RAG: retrieve + generate
response = rag.query("What were the Q3 APAC revenue figures?")
print(response.answer.content)
# Async (await must run inside a coroutine)
import asyncio

async def main():
    await rag.aingest("big_report.pdf")
    results = await rag.aretrieve("revenue", top_k=3)
    response = await rag.aquery("Summarise findings.")

asyncio.run(main())
Page splitting strategy
| File type | Strategy |
|---|---|
| PDF | Page-by-page via pypdf |
| All others | Sliding character windows (1 000 chars, 100 overlap) |
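For non-PDF files, the sliding-window split can be pictured with the sketch below. It is an illustration only: the function name `sliding_windows` and its parameter names are hypothetical, and the 1 000 / 100 defaults are taken from the quick-start comment rather than from the library's source.

```python
def sliding_windows(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` chars, each sharing `overlap` chars with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


# Digits make the shared region between neighbouring windows easy to check.
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_windows(text)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one window, at the cost of indexing some characters twice.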
OCR support for scanned and handwritten PDFs
When a PDF page contains no embedded text (scanned document, handwritten notes,
image-only PDF), pypdf returns an empty string. Pass an ocr_backend to
automatically fall back to OCR for those pages:
from ractogateway.rag.page_index import PageIndexRAG, TesseractOcrBackend
rag = PageIndexRAG(
    llm_kit=kit,
    ocr_backend=TesseractOcrBackend(lang="eng"),
    ocr_fallback=True,        # default: only OCR pages with no embedded text
    min_ocr_confidence=40.0,  # skip pages where mean word confidence < 40 (0–100)
)
rag.ingest("scanned_contract.pdf") # digital pages use pypdf, blank pages use OCR
When ocr_fallback=True (default) only empty pages trigger OCR — digital pages are
never sent to the OCR backend, keeping costs low. Set ocr_fallback=False to force
OCR on every page regardless of embedded text.
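The per-page decision described above can be sketched as follows. This is a hypothetical model of the behaviour, not the library's code: `extract_page`, the `OcrFn` callable shape, and `fake_ocr` are invented for illustration, the 40.0 threshold is copied from the example rather than a confirmed default, and skipping an empty page when no backend is configured is an assumption.

```python
from typing import Callable, Optional

# Hypothetical OCR callable: takes page image bytes, returns (text, mean word confidence 0-100).
OcrFn = Callable[[bytes], tuple[str, float]]


def extract_page(
    embedded_text: str,
    page_image: bytes,
    ocr: Optional[OcrFn] = None,
    ocr_fallback: bool = True,
    min_ocr_confidence: float = 40.0,
) -> Optional[tuple[str, bool, Optional[float]]]:
    """Return (text, ocr_applied, ocr_confidence), or None when the page is skipped."""
    has_text = bool(embedded_text.strip())
    if ocr is None or (ocr_fallback and has_text):
        # No backend, or fallback mode with a usable text layer: keep the embedded text.
        return (embedded_text, False, None) if has_text else None
    text, confidence = ocr(page_image)
    if confidence < min_ocr_confidence:
        return None  # low-confidence OCR output is dropped
    return text, True, confidence


def fake_ocr(image: bytes) -> tuple[str, float]:
    """Stand-in backend that always 'reads' the same text."""
    return "handwritten note", 87.4


print(extract_page("Digital text", b"", ocr=fake_ocr))  # text layer kept, OCR not called
print(extract_page("", b"", ocr=fake_ocr))              # empty page falls back to OCR
print(extract_page("", b"", ocr=fake_ocr, min_ocr_confidence=90.0))  # below threshold
```

With `ocr_fallback=False` the first branch is skipped whenever a backend is present, so every page goes through OCR even if it has a text layer.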
OCR metadata is stored on every PageEntry:
entries = rag.ingest("scanned.pdf")
for e in entries:
    print(e.ocr_applied, e.ocr_confidence)  # True 87.4
Available OCR backends
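| Backend | Extra | Notes |
|---|---|---|
| TesseractOcrBackend | rag-ocr-tesseract | Free, offline; requires Tesseract binary |
| | rag-ocr-easy | EasyOCR; deep-learning, 80+ languages, offline |
| | rag-ocr-google | Google Cloud Vision |
| GoogleDocumentAIBackend | rag-ocr-google | Google Document AI; best for tables / forms |
| AWSTextractBackend | rag-ocr-aws | AWS Textract; great for key-value pairs |
| AzureDocumentIntelligenceBackend | rag-ocr-azure | Azure Form Recognizer v4 |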
pip install ractogateway[rag-ocr-tesseract] # Tesseract (free, offline)
pip install ractogateway[rag-ocr-easy] # EasyOCR (deep-learning, offline)
pip install ractogateway[rag-ocr-google] # Google Vision + Document AI
pip install ractogateway[rag-ocr-aws] # AWS Textract
pip install ractogateway[rag-ocr-azure] # Azure Document Intelligence
Tesseract (free, offline, no API key):
from ractogateway.rag.page_index import TesseractOcrBackend
# Multi-language
backend = TesseractOcrBackend(lang="eng+deu", config="--psm 6")
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend, min_ocr_confidence=50.0)
Google Document AI (best for structured docs — invoices, contracts, forms):
from ractogateway.rag.page_index import GoogleDocumentAIBackend
backend = GoogleDocumentAIBackend(
    project_id="my-gcp-project",
    processor_id="abc123def456",  # OCR or Form Parser processor
    location="us",
)
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend)
AWS Textract:
from ractogateway.rag.page_index import AWSTextractBackend
backend = AWSTextractBackend(region_name="us-east-1")
# Credentials from env / ~/.aws/credentials / IAM role
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend)
Azure Document Intelligence:
from ractogateway.rag.page_index import AzureDocumentIntelligenceBackend
backend = AzureDocumentIntelligenceBackend(
    endpoint="https://my-resource.cognitiveservices.azure.com/",
    api_key="...",
    model_id="prebuilt-read",  # or "prebuilt-document" for richer extraction
)
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend)
Document deduplication
Re-ingesting the same file is a no-op — PageIndexRAG computes a SHA-256
hash of the raw file bytes on every ingest() call and returns the cached
entries immediately if the file was already indexed:
entries_1 = rag.ingest("report.pdf") # indexed — 12 pages
entries_2 = rag.ingest("report.pdf") # no-op — returns same 12 entries instantly
assert entries_1 == entries_2
This prevents duplicate pages from inflating BM25 scores when the same file is ingested from different paths or re-processed in a pipeline restart.
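Content-hash deduplication of this kind can be sketched with the standard library alone. The `DedupIndex` class and its form-feed page splitting are hypothetical stand-ins; only the idea of hashing the raw bytes with SHA-256 and returning cached entries comes from the description above.

```python
import hashlib


class DedupIndex:
    """Toy index that skips files whose raw bytes were already ingested."""

    def __init__(self) -> None:
        self._by_hash: dict[str, list[str]] = {}

    def ingest_bytes(self, data: bytes) -> list[str]:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_hash:
            return self._by_hash[digest]  # cache hit: nothing is re-indexed
        entries = data.decode("utf-8").split("\f")  # pretend form feeds separate pages
        self._by_hash[digest] = entries
        return entries


index = DedupIndex()
first = index.ingest_bytes(b"page one\fpage two")
second = index.ingest_bytes(b"page one\fpage two")  # same bytes, e.g. read from another path
print(first is second, len(index._by_hash))  # True 1
```

Because the key is derived from file content rather than the path, a copy of the same file under a different name maps to the same digest.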
Index persistence — save and load
Persist the full index (entries, BM25 weights, dedup hashes) to a JSON file and reload it across process restarts:
# Build and save
rag = PageIndexRAG(llm_kit=kit, ocr_backend=TesseractOcrBackend())
rag.ingest("report.pdf")
rag.save("./my_index.json")
# Reload in a new process — no re-ingestion needed
rag2 = PageIndexRAG.load("./my_index.json", llm_kit=kit)
response = rag2.query("What are the key findings?")
print(response.answer.content)
The saved JSON is human-readable and portable. Any ocr_backend configured at
save time must be re-supplied to load() if you intend to ingest new documents
after loading; it is not serialised.
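One way to see why the OCR backend is not part of the file: only plain data survives a JSON round trip, while a live backend object (with its client handles or credentials) does not. The sketch below uses an invented schema; the keys `entries`, `doc_hashes`, and `bm25` are assumptions for illustration and do not describe the real file layout.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical index state made only of JSON-friendly values.
state = {
    "entries": [{"doc_id": "doc-1", "page_number": 1, "text": "Q3 revenue grew in APAC"}],
    "doc_hashes": {"doc-1": "sha256-hex-digest-placeholder"},
    "bm25": {"k1": 1.5, "b": 0.75},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_index.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")  # save
    restored = json.loads(path.read_text(encoding="utf-8"))         # load

print(restored == state)  # True
```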
Removing documents
Remove all pages of a specific document from the index:
entries = rag.ingest("old_report.pdf")
doc_id = entries[0].doc_id
removed = rag.remove_document(doc_id)
print(f"Removed {removed} pages")
remove_document also cleans up the BM25 term-frequency state and the
dedup hash, so the file can be re-ingested fresh afterwards.
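The bookkeeping that removal implies can be sketched as below. The in-memory layout (a list of page dicts, a document-frequency counter, a hash map keyed by document ID) is hypothetical and chosen only to show which pieces of state must be rolled back together.

```python
from collections import Counter

# Hypothetical in-memory state: page records, document frequencies, and dedup hashes.
pages = [
    {"doc_id": "A", "terms": ["revenue", "apac"]},
    {"doc_id": "A", "terms": ["revenue", "outlook"]},
    {"doc_id": "B", "terms": ["hiring", "revenue"]},
]
doc_freq = Counter(term for page in pages for term in set(page["terms"]))
doc_hashes = {"A": "hash-of-a", "B": "hash-of-b"}


def remove_document(doc_id: str) -> int:
    """Drop every page of `doc_id`, then undo its contribution to the shared state."""
    removed = [page for page in pages if page["doc_id"] == doc_id]
    pages[:] = [page for page in pages if page["doc_id"] != doc_id]
    for page in removed:
        for term in set(page["terms"]):
            doc_freq[term] -= 1
            if doc_freq[term] == 0:
                del doc_freq[term]  # no remaining page contains this term
    doc_hashes.pop(doc_id, None)  # allows the same file to be ingested again
    return len(removed)


removed_count = remove_document("A")
print(removed_count, dict(doc_freq), doc_hashes)
```

If the document frequencies were left untouched, terms from the deleted pages would keep depressing IDF values for the pages that remain.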
Parallel ingest for large corpora
Async parallel (aingest_dir) — up to max_concurrent files at once:
import asyncio
async def main():
    entries = await rag.aingest_dir(
        "./docs/",
        pattern="**/*.pdf",
        max_concurrent=8,
        on_progress=lambda done, total: print(f"{done}/{total}"),
    )
    print(f"Indexed {len(entries)} pages")

asyncio.run(main())
Sync with progress callback (ingest_dir):
def show_progress(done: int, total: int) -> None:
    print(f"[{done}/{total}] ingesting...")

entries = rag.ingest_dir("./docs/", on_progress=show_progress)
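A bounded-concurrency ingest with a progress callback is commonly built on `asyncio.Semaphore`. The sketch below shows that pattern in isolation; `ingest_one` and `ingest_many` are hypothetical stand-ins, and nothing here is claimed to match how `aingest_dir` is implemented internally.

```python
import asyncio
from typing import Callable, Optional


async def ingest_one(path: str) -> list[str]:
    """Stand-in for a per-file ingest; returns fake page entries."""
    await asyncio.sleep(0.01)
    return [f"{path}#page1", f"{path}#page2"]


async def ingest_many(
    paths: list[str],
    max_concurrent: int = 8,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> list[str]:
    """Ingest files concurrently, never running more than `max_concurrent` at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    done = 0

    async def worker(path: str) -> list[str]:
        nonlocal done
        async with semaphore:  # at most max_concurrent workers pass this point together
            entries = await ingest_one(path)
        done += 1
        if on_progress is not None:
            on_progress(done, len(paths))
        return entries

    per_file = await asyncio.gather(*(worker(p) for p in paths))
    return [entry for entries in per_file for entry in entries]


progress: list[tuple[int, int]] = []
entries = asyncio.run(
    ingest_many(
        ["a.pdf", "b.pdf", "c.pdf"],
        max_concurrent=2,
        on_progress=lambda done, total: progress.append((done, total)),
    )
)
print(len(entries), progress[-1])  # 6 (3, 3)
```

`asyncio.gather` returns results in the order the paths were given, so the flattened entry list stays deterministic even though files finish in arbitrary order.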
Constructor parameters
| Parameter | Default | Description |
|---|---|---|
| llm_kit | | Kit for generation |
| | | Text cleaning pipeline |
| | Built-in | File reader registry |
| | Built-in | |
| | Built-in RAG prompt | Generation system prompt |
| | 1 000 | Max chars per window (non-PDF) |
| | 100 | Char overlap between windows |
| | 1.5 | BM25 term-frequency saturation |
| | 0.75 | BM25 length normalisation |
| | | Keywords per page in decision index |
| ocr_backend | | OCR backend for scanned / handwritten PDFs |
| ocr_fallback | True | Only OCR pages with no embedded text |
| min_ocr_confidence | | Drop OCR pages below this confidence (0–100) |
Result models
PageEntry — one indexed page:
| Field | Type | Description |
|---|---|---|
| | | Auto UUID |
| page_number | | 1-based PDF page |
| | | Post-processed page text |
| source | | File path or label |
| | | First Markdown heading on the page |
| | | Top-N TF terms used by decision index |
| doc_id | | Parent document UUID |
| | | |
| ocr_applied | | |
| ocr_confidence | | Mean word confidence (0–100) |
PageIndexResult — one retrieved page:
| Field | Type | Description |
|---|---|---|
| entry | PageEntry | The retrieved page |
| score | | BM25 relevance score |
| rank | | 1-based rank |
| matched_terms | | Query tokens that hit this page |
PageIndexResponse — full query response:
| Field | Type | Description |
|---|---|---|
| answer | | Generated answer |
| | | Retrieved pages |
| | | Original question |
| | | Context injected into LLM |