# RAG — Retrieval-Augmented Generation

RactoGateway ships two complementary RAG pipelines:

| Pipeline | Requires embeddings | Requires vector store | Best for |
| --- | :---: | :---: | --- |
| `RactoRAG` | Yes | Yes | Semantic / conceptual queries |
| `PageIndexRAG` | **No** | **No** | Keyword-rich exact-term queries, cost-sensitive setups |

---

## RactoRAG

`RactoRAG` provides a full pipeline: read → chunk → process → embed → store → retrieve.

---

## PageIndexRAG — Vectorless BM25 RAG

`PageIndexRAG` indexes documents at the **page level** and retrieves in two stages: a
keyword index first routes the query to candidate pages, then BM25 ranks those candidates.
It makes no embedding API calls and needs no external vector store.

### How it works

1. **Decision index (routing):** Each page's top-N TF-weighted keywords are stored in an
   inverted index (`term → page IDs`). A query is tokenised and the index returns the union
   of matching page IDs, at a cost of one index lookup per query term plus the size of the
   matched posting lists.
2. **BM25 scoring:** Only the candidate pages from step 1 are scored with Okapi BM25
   (k1=1.5, b=0.75), giving accurate relevance ordering without scanning the full corpus.
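
The two stages above can be sketched in plain Python. This is a minimal illustration, not `PageIndexRAG`'s actual code: the tokeniser, the `pages` sample data, the `retrieve` helper, and the particular IDF variant are all assumptions, and no stop-word filtering is applied.

```python
import math
import re
from collections import Counter, defaultdict

K1, B = 1.5, 0.75      # BM25 parameters quoted above
TOP_KEYWORDS = 20      # keywords kept per page in the routing index


def tokenize(text: str) -> list[str]:
    # Hypothetical tokeniser: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


# Toy corpus standing in for indexed pages.
pages = {
    "p1": "Q3 revenue in APAC grew while APAC costs fell",
    "p2": "Headcount and hiring plan for engineering",
    "p3": "Revenue summary for EMEA and the Americas",
}

tokens = {pid: tokenize(text) for pid, text in pages.items()}
tf = {pid: Counter(toks) for pid, toks in tokens.items()}
avg_len = sum(len(t) for t in tokens.values()) / len(tokens)

# Stage 1 data structure: term -> IDs of pages whose top keywords include it.
inverted: dict[str, set[str]] = defaultdict(set)
for pid, counts in tf.items():
    for term, _ in counts.most_common(TOP_KEYWORDS):
        inverted[term].add(pid)


def retrieve(query: str, top_k: int = 5) -> list[tuple[str, float]]:
    q_terms = tokenize(query)
    # Stage 1 (routing): one lookup per query term, then a set union.
    candidates = set().union(*(inverted.get(t, set()) for t in q_terms))
    n_pages = len(pages)
    scored = []
    # Stage 2 (ranking): BM25 is computed for candidate pages only.
    for pid in candidates:
        score = 0.0
        for term in q_terms:
            f = tf[pid][term]
            if f == 0:
                continue
            df = sum(1 for counts in tf.values() if term in counts)
            idf = math.log(1 + (n_pages - df + 0.5) / (df + 0.5))
            norm = f + K1 * (1 - B + B * len(tokens[pid]) / avg_len)
            score += idf * f * (K1 + 1) / norm
        scored.append((pid, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


for pid, score in retrieve("Q3 revenue APAC"):
    print(pid, round(score, 3))
```

Page `p2` shares no term with the query, so it is never scored; that skipped work is the point of the routing stage.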

### Quick start

```python
from ractogateway.rag.page_index import PageIndexRAG
from ractogateway import openai_developer_kit as gpt

# `my_prompt` is your own prompt object, defined elsewhere
kit = gpt.Chat(model="gpt-4o", default_prompt=my_prompt)
rag = PageIndexRAG(llm_kit=kit)

rag.ingest("report.pdf")           # page-by-page via pypdf
rag.ingest("notes.txt")            # sliding-window (1 000 chars, 100 overlap)
rag.ingest_text("raw text...", source="memo")

# Retrieve-only (no LLM needed)
results = rag.retrieve("Q3 revenue APAC", top_k=5)
for r in results:
    print(r.rank, r.score, r.entry.source, r.entry.page_number, r.matched_terms)

# Full RAG: retrieve + generate
response = rag.query("What were the Q3 APAC revenue figures?")
print(response.answer.content)

# Async variants (call these from inside a coroutine, e.g. one started with asyncio.run)
await rag.aingest("big_report.pdf")
results = await rag.aretrieve("revenue", top_k=3)
response = await rag.aquery("Summarise findings.")
```

### Page splitting strategy

| File type | Strategy |
| --- | --- |
| PDF (`.pdf`) | `pypdf` — one `PageEntry` per PDF page |
| All others | Sliding character windows (`page_size=1000`, `page_overlap=100`) |
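
For non-PDF files, the sliding-window split can be pictured with the sketch below. It assumes each window starts `page_size - page_overlap` characters after the previous one; the function name `sliding_windows` is illustrative, and how the library handles word boundaries or a short trailing window is not specified here.

```python
def sliding_windows(text: str, page_size: int = 1000, page_overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    if page_overlap >= page_size:
        raise ValueError("page_overlap must be smaller than page_size")
    if not text:
        return []
    step = page_size - page_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - page_overlap, 1)
    return [text[start:start + page_size] for start in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(2500))
windows = sliding_windows(sample)
print([len(w) for w in windows])
```

With the defaults, consecutive windows share 100 characters, so a sentence cut at a window boundary still appears whole in at least one window as long as it is shorter than the overlap.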

### OCR support for scanned and handwritten PDFs

When a PDF page contains no embedded text (scanned document, handwritten notes,
image-only PDF), `pypdf` returns an empty string. Pass an `ocr_backend` to
automatically fall back to OCR for those pages:

```python
from ractogateway.rag.page_index import PageIndexRAG, TesseractOcrBackend

rag = PageIndexRAG(
    llm_kit=kit,
    ocr_backend=TesseractOcrBackend(lang="eng"),
    ocr_fallback=True,          # default: only OCR pages with no embedded text
    min_ocr_confidence=40.0,    # skip pages where mean word confidence < 40 (0–100)
)
rag.ingest("scanned_contract.pdf")   # pages with embedded text use pypdf; the rest fall back to OCR
```

When `ocr_fallback=True` (default) only empty pages trigger OCR — digital pages are
never sent to the OCR backend, keeping costs low. Set `ocr_fallback=False` to force
OCR on every page regardless of embedded text.
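
The interaction between `ocr_fallback` and `min_ocr_confidence` can be summarised as a small decision function. This is a sketch of the behaviour described above, not the library's code: `extract_page`, `OcrFn`, and `fake_ocr` are hypothetical names, and returning `None` for skipped pages is an assumption.

```python
from typing import Callable, Optional, Tuple

# Hypothetical OCR callable: page index -> (recognised text, mean confidence 0-100).
OcrFn = Callable[[int], Tuple[str, float]]


def extract_page(
    embedded_text: str,
    page_index: int,
    ocr: Optional[OcrFn] = None,
    ocr_fallback: bool = True,
    min_ocr_confidence: float = 0.0,
) -> Optional[Tuple[str, bool]]:
    """Return (text, ocr_applied), or None when the page is skipped."""
    has_text = bool(embedded_text.strip())
    if ocr is None:
        # No backend configured: only embedded text is usable.
        return (embedded_text, False) if has_text else None
    if ocr_fallback and has_text:
        # Digital page: never sent to the OCR backend.
        return (embedded_text, False)
    text, confidence = ocr(page_index)
    if confidence < min_ocr_confidence:
        return None  # assumption: low-confidence OCR pages are dropped
    return (text, True)


def fake_ocr(page_index: int) -> Tuple[str, float]:
    # Stand-in backend: page 1 is legible, every other page is noisy.
    return ("scanned text", 87.4) if page_index == 1 else ("???", 12.0)


print(extract_page("digital text", 0, fake_ocr))
print(extract_page("", 1, fake_ocr, min_ocr_confidence=40.0))
print(extract_page("", 2, fake_ocr, min_ocr_confidence=40.0))
```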

OCR metadata is stored on every `PageEntry`:

```python
entries = rag.ingest("scanned.pdf")
for e in entries:
    print(e.ocr_applied, e.ocr_confidence)   # True  87.4
```

#### Available OCR backends

| Backend | Extra | Notes |
| --- | --- | --- |
| `TesseractOcrBackend` | `rag-ocr-tesseract` | Free, offline; requires Tesseract binary |
| `EasyOcrBackend` | `rag-ocr-easy` | Deep-learning, 80+ languages, offline |
| `GoogleVisionBackend` | `rag-ocr-google` | Google Cloud Vision `DOCUMENT_TEXT_DETECTION` |
| `GoogleDocumentAIBackend` | `rag-ocr-google` | Google Document AI; best for tables / forms |
| `AWSTextractBackend` | `rag-ocr-aws` | AWS Textract; great for key-value pairs |
| `AzureDocumentIntelligenceBackend` | `rag-ocr-azure` | Azure AI Document Intelligence (formerly Form Recognizer) |

```bash
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "ractogateway[rag-ocr-tesseract]"   # Tesseract (free, offline)
pip install "ractogateway[rag-ocr-easy]"        # EasyOCR (deep-learning, offline)
pip install "ractogateway[rag-ocr-google]"      # Google Vision + Document AI
pip install "ractogateway[rag-ocr-aws]"         # AWS Textract
pip install "ractogateway[rag-ocr-azure]"       # Azure Document Intelligence
```

**Tesseract** (free, offline, no API key):

```python
from ractogateway.rag.page_index import PageIndexRAG, TesseractOcrBackend

# Multi-language
backend = TesseractOcrBackend(lang="eng+deu", config="--psm 6")
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend, min_ocr_confidence=50.0)
```

**Google Document AI** (best for structured docs — invoices, contracts, forms):

```python
from ractogateway.rag.page_index import GoogleDocumentAIBackend, PageIndexRAG

backend = GoogleDocumentAIBackend(
    project_id="my-gcp-project",
    processor_id="abc123def456",   # OCR or Form Parser processor
    location="us",
)
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend)
```

**AWS Textract**:

```python
from ractogateway.rag.page_index import AWSTextractBackend, PageIndexRAG

backend = AWSTextractBackend(region_name="us-east-1")
# Credentials from env / ~/.aws/credentials / IAM role
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend)
```

**Azure Document Intelligence**:

```python
from ractogateway.rag.page_index import AzureDocumentIntelligenceBackend, PageIndexRAG

backend = AzureDocumentIntelligenceBackend(
    endpoint="https://my-resource.cognitiveservices.azure.com/",
    api_key="...",
    model_id="prebuilt-read",   # or "prebuilt-document" for richer extraction
)
rag = PageIndexRAG(llm_kit=kit, ocr_backend=backend)
```

---

### Document deduplication

Re-ingesting the same file is a **no-op** — `PageIndexRAG` computes a SHA-256
hash of the raw file bytes on every `ingest()` call and returns the cached
entries immediately if the file was already indexed:

```python
entries_1 = rag.ingest("report.pdf")   # indexed — 12 pages
entries_2 = rag.ingest("report.pdf")   # no-op — returns same 12 entries instantly
assert entries_1 == entries_2
```

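Because the hash covers the file's contents rather than its path, the same file is
also detected when it is ingested from a different path or re-processed after a
pipeline restart. This keeps duplicate pages from skewing BM25 term statistics and
crowding the top-ranked results.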
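
Content-hash deduplication of this kind takes only a few lines with the standard library. The sketch below is an illustration under stated assumptions: `DedupIndex` and its `parse_calls` counter are hypothetical, and the page extraction is replaced by a stand-in.

```python
import hashlib
import tempfile
from pathlib import Path


class DedupIndex:
    """Toy index that skips files whose bytes were already ingested."""

    def __init__(self) -> None:
        self._entries_by_hash: dict[str, list[str]] = {}
        self.parse_calls = 0  # how many files were actually parsed

    def ingest(self, path: str) -> list[str]:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        cached = self._entries_by_hash.get(digest)
        if cached is not None:
            return cached  # same bytes seen before: nothing is re-parsed
        self.parse_calls += 1
        # Stand-in for real page extraction.
        entries = [f"{digest[:8]}:page-{n}" for n in (1, 2, 3)]
        self._entries_by_hash[digest] = entries
        return entries


# The same bytes under two different paths are parsed only once.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.txt"
    copy = Path(tmp) / "report_copy.txt"
    original.write_bytes(b"quarterly numbers")
    copy.write_bytes(b"quarterly numbers")

    index = DedupIndex()
    first = index.ingest(str(original))
    second = index.ingest(str(copy))
    print(first == second, index.parse_calls)
```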

---

### Index persistence — save and load

Persist the full index (entries, BM25 weights, dedup hashes) to a JSON file
and reload it across process restarts:

```python
# Build and save
rag = PageIndexRAG(llm_kit=kit, ocr_backend=TesseractOcrBackend())
rag.ingest("report.pdf")
rag.save("./my_index.json")

# Reload in a new process — no re-ingestion needed
rag2 = PageIndexRAG.load("./my_index.json", llm_kit=kit)
response = rag2.query("What are the key findings?")
print(response.answer.content)
```

The saved JSON is human-readable and portable. The `ocr_backend` is not
serialised: if you intend to ingest new documents after loading, pass the
backend to `load()` again.

---

### Removing documents

Remove all pages of a specific document from the index:

```python
entries = rag.ingest("old_report.pdf")
doc_id = entries[0].doc_id

removed = rag.remove_document(doc_id)
print(f"Removed {removed} pages")
```

`remove_document` also cleans up the BM25 term-frequency state and the
dedup hash, so the file can be re-ingested fresh afterwards.
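
What that clean-up involves can be shown on a toy index. The `TinyIndex` class below is hypothetical and much simpler than the real index; it only illustrates that page records, document-frequency counts, and the dedup hash all have to be updated together.

```python
from collections import Counter


class TinyIndex:
    """Toy index tracking pages, document frequencies, and dedup hashes."""

    def __init__(self) -> None:
        self.pages: dict[str, tuple[str, set[str]]] = {}  # page_id -> (doc_id, terms)
        self.doc_freq: Counter = Counter()                # term -> pages containing it
        self.hashes: dict[str, str] = {}                  # file hash -> doc_id

    def add_page(self, page_id: str, doc_id: str, text: str, file_hash: str) -> None:
        terms = set(text.lower().split())
        self.pages[page_id] = (doc_id, terms)
        self.doc_freq.update(terms)
        self.hashes[file_hash] = doc_id

    def remove_document(self, doc_id: str) -> int:
        victims = [pid for pid, (owner, _) in self.pages.items() if owner == doc_id]
        for pid in victims:
            _, terms = self.pages.pop(pid)
            self.doc_freq.subtract(terms)
        self.doc_freq = +self.doc_freq  # unary plus drops terms whose count fell to zero
        # Forget the dedup hash so the same file can be ingested again.
        self.hashes = {h: d for h, d in self.hashes.items() if d != doc_id}
        return len(victims)


index = TinyIndex()
index.add_page("a1", "docA", "revenue apac", "hashA")
index.add_page("a2", "docA", "revenue costs", "hashA")
index.add_page("b1", "docB", "revenue emea", "hashB")
print(index.remove_document("docA"), dict(index.doc_freq))
```

If the document-frequency counts were left stale, IDF values for the remaining pages would still reflect the removed document.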

---

### Parallel ingest for large corpora

**Async parallel** (`aingest_dir`) — up to `max_concurrent` files at once:

```python
import asyncio

async def main():
    entries = await rag.aingest_dir(
        "./docs/",
        pattern="**/*.pdf",
        max_concurrent=8,
        on_progress=lambda done, total: print(f"{done}/{total}"),
    )
    print(f"Indexed {len(entries)} pages")

asyncio.run(main())
```

**Sync with progress callback** (`ingest_dir`):

```python
def show_progress(done: int, total: int) -> None:
    print(f"[{done}/{total}] ingesting...")

entries = rag.ingest_dir("./docs/", on_progress=show_progress)
```

---

### Constructor parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `llm_kit` | `None` | Kit for generation; `None` = retrieve-only mode |
| `processors` | `[TextCleaner()]` | Text cleaning pipeline |
| `reader_registry` | Built-in | File reader registry |
| `context_template` | Built-in | Template with `{context}` and `{question}` placeholders |
| `default_prompt` | Built-in RAG prompt | Generation system prompt |
| `page_size` | `1000` | Max chars per window (non-PDF) |
| `page_overlap` | `100` | Char overlap between windows |
| `k1` | `1.5` | BM25 term-frequency saturation |
| `b` | `0.75` | BM25 length normalisation |
| `top_keywords` | `20` | Keywords per page in decision index |
| `ocr_backend` | `None` | OCR backend for scanned / handwritten PDFs |
| `ocr_fallback` | `True` | Only OCR pages with no embedded text |
| `min_ocr_confidence` | `0.0` | Drop OCR pages below this confidence (0–100) |
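
To get a feel for `k1` and `b` before tuning them, the per-term factor of the standard Okapi BM25 formula can be evaluated directly. The helper `tf_component` and its `length_ratio` argument (page length divided by average page length) are illustrative names, not part of the library.

```python
def tf_component(f: int, k1: float = 1.5, b: float = 0.75, length_ratio: float = 1.0) -> float:
    """BM25 term-frequency factor for a term occurring f times on a page."""
    return f * (k1 + 1) / (f + k1 * (1 - b + b * length_ratio))


# Saturation: repeated occurrences add less and less, bounded above by k1 + 1.
for f in (1, 2, 5, 50):
    print(f, round(tf_component(f), 3))

# Length normalisation: with b > 0, a page twice the average length scores lower
# for the same term count; with b = 0, page length is ignored.
print(tf_component(1, length_ratio=2.0) < tf_component(1, length_ratio=1.0))
print(tf_component(1, b=0.0, length_ratio=2.0) == tf_component(1, b=0.0, length_ratio=1.0))
```

Raising `k1` lets repeated terms keep contributing for longer. Lowering `b` reduces the penalty on long pages, which may suit corpora where PDF pages vary widely in length.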

### Result models

**`PageEntry`** — one indexed page:

| Field | Type | Description |
| --- | --- | --- |
| `entry_id` | `str` | Auto UUID |
| `page_number` | `int \| None` | 1-based PDF page; `None` for sliding-window pages |
| `content` | `str` | Post-processed page text |
| `source` | `str` | File path or label |
| `section_title` | `str \| None` | First Markdown heading on the page |
| `keywords` | `list[str]` | Top-N TF terms used by decision index |
| `doc_id` | `str` | Parent document UUID |
| `char_count` | `int` | `len(content)` |
| `ocr_applied` | `bool` | `True` when text was produced by OCR |
| `ocr_confidence` | `float \| None` | Mean word confidence (0–100); `None` if unavailable |

**`PageIndexResult`** — one retrieved page:

| Field | Type | Description |
| --- | --- | --- |
| `entry` | `PageEntry` | The retrieved page |
| `score` | `float` | BM25 relevance score |
| `rank` | `int` | 1-based rank |
| `matched_terms` | `list[str]` | Query tokens that hit this page |

**`PageIndexResponse`** — full query response:

| Field | Type | Description |
| --- | --- | --- |
| `answer` | `LLMResponse \| None` | Generated answer (`None` if no kit) |
| `sources` | `list[PageIndexResult]` | Retrieved pages |
| `query` | `str` | Original question |
| `context_used` | `str` | Context injected into LLM |