ractogateway.rag.page_index._ocr
OCR backends for PageIndexRAG.
Each backend converts raw image bytes (PNG/JPEG) into extracted text. All backends follow the same interface so they are interchangeable.
Available backends
- TesseractOcrBackend: free, offline, pytesseract wrapper
- EasyOcrBackend: deep-learning, 80+ languages, offline
- GoogleVisionBackend: Google Cloud Vision API
- GoogleDocumentAIBackend: Google Document AI (tables, forms)
- AWSTextractBackend: AWS Textract (forms, tables, key-value)
- AzureDocumentIntelligenceBackend: Azure Form Recognizer v4
Quick start:
from ractogateway.rag.page_index import PageIndexRAG
from ractogateway.rag.page_index._ocr import TesseractOcrBackend
rag = PageIndexRAG(llm_kit=kit, ocr_backend=TesseractOcrBackend())
rag.ingest("scanned_report.pdf") # OCR fallback auto-triggered
- class ractogateway.rag.page_index._ocr.BaseOcrBackend
Bases: ABC
Abstract base class for OCR backends.
Implementors must provide extract_text(). A default async variant, aextract_text(), offloads the synchronous call to a thread-pool executor.
- abstractmethod extract_text(image_bytes, mime_type='image/png')
Convert image_bytes to plain text.
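The interface described above can be sketched as follows. This is a minimal stand-in, not the library's source: the names `SketchOcrBackend` and `FixedTextBackend` are hypothetical, and using `asyncio.to_thread` for the thread-pool offload is an assumption about how `aextract_text()` might be implemented.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchOcrBackend(ABC):
    """Hypothetical stand-in mirroring the documented BaseOcrBackend interface."""

    @abstractmethod
    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        """Convert image_bytes to plain text."""

    async def aextract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        # Assumption: the blocking call is offloaded to the default thread pool.
        return await asyncio.to_thread(self.extract_text, image_bytes, mime_type)


class FixedTextBackend(SketchOcrBackend):
    """Toy backend that ignores the image and returns a fixed string."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        return self._text


def run_async_demo() -> str:
    """Call the async variant from synchronous code."""
    backend = FixedTextBackend("hello")
    return asyncio.run(backend.aextract_text(b"\x89PNG"))
```

Because a subclass only has to implement the synchronous method, any backend written this way would be usable from both synchronous and asynchronous callers.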
- class ractogateway.rag.page_index._ocr.TesseractOcrBackend(lang='eng', config='', confidence_threshold=40.0)
Bases: BaseOcrBackend
OCR via Tesseract.
Requires pytesseract and a working Tesseract installation. Install with:
pip install ractogateway[rag-ocr-tesseract]
# Also install Tesseract binary: https://github.com/UB-Mannheim/tesseract/wiki
- Parameters:
lang (str) – Tesseract language string, e.g. "eng" (default), "eng+deu".
config (str) – Extra Tesseract config flags, e.g. "--psm 6".
confidence_threshold (float) – Pages whose mean word confidence is below this value (0–100) are flagged in the returned metadata; the text is still returned. Set to 0 to disable flagging.
- extract_text(image_bytes, mime_type='image/png')
Convert image_bytes to plain text.
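The confidence_threshold rule can be illustrated with a small sketch. It assumes the backend averages per-word confidences such as those reported in the conf column of pytesseract.image_to_data, where -1 marks boxes that contain no word. The function names are hypothetical, and treating a page with no recognised words as confidence 0 is a choice made for this example only.

```python
from statistics import mean
from typing import Iterable


def mean_word_confidence(confidences: Iterable[float]) -> float:
    """Average the confidences of recognised words, skipping -1 entries."""
    words = [c for c in confidences if c >= 0]
    # Assumption: a page with no recognised words counts as confidence 0.
    return mean(words) if words else 0.0


def is_low_confidence(confidences: Iterable[float], threshold: float = 40.0) -> bool:
    """Return True when the mean word confidence falls below threshold."""
    if threshold <= 0:
        return False  # a threshold of 0 disables the check
    return mean_word_confidence(confidences) < threshold
```

A page flagged this way would still have its text returned; the flag only signals that the result may need review or a different backend.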
- class ractogateway.rag.page_index._ocr.EasyOcrBackend(languages=None, gpu=False)
Bases: BaseOcrBackend
OCR via EasyOCR.
Deep-learning model; no cloud API required. Supports 80+ languages. Install with:
pip install ractogateway[rag-ocr-easy]
- Parameters:
- extract_text(image_bytes, mime_type='image/png')
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.GoogleVisionBackend(credentials_path=None)
Bases: BaseOcrBackend
OCR via Google Cloud Vision API (DOCUMENT_TEXT_DETECTION). Install with:
pip install ractogateway[rag-ocr-google]
- Parameters:
credentials_path (str | None) – Path to a service-account JSON key file. If None, the SDK uses Application Default Credentials (GOOGLE_APPLICATION_CREDENTIALS).
- extract_text(image_bytes, mime_type='image/png')
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.GoogleDocumentAIBackend(project_id, processor_id, location='us', credentials_path=None)
Bases: BaseOcrBackend
OCR via Google Document AI.
Best for structured documents: tables, forms, invoices, contracts. Install with:
pip install ractogateway[rag-ocr-google]
- Parameters:
- extract_text(image_bytes, mime_type='image/png')
Convert image_bytes to plain text.
- class ractogateway.rag.page_index._ocr.AWSTextractBackend(region_name='us-east-1', aws_access_key_id=None, aws_secret_access_key=None)
Bases: BaseOcrBackend
OCR via AWS Textract.
Best for forms and tables; uses DetectDocumentText for plain text. Install with:
pip install ractogateway[rag-ocr-aws]
- Parameters:
- extract_text(image_bytes, mime_type='image/png')
Convert image_bytes to plain text.
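As a rough illustration of how plain text can be recovered from a DetectDocumentText call, the sketch below joins the LINE blocks of a Textract response. It is not the backend's actual code: both function names are hypothetical, and the client argument is assumed to be a boto3 Textract client (or any object exposing the same detect_document_text method).

```python
from typing import Any, Dict


def lines_from_textract(response: Dict[str, Any]) -> str:
    """Join the text of every LINE block in a DetectDocumentText response."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)


def extract_text_with_textract(client: Any, image_bytes: bytes) -> str:
    """Send raw image bytes to DetectDocumentText and return plain text."""
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return lines_from_textract(response)
```

With boto3 installed, a client could be created with `boto3.client("textract", region_name="us-east-1")` and passed as the first argument. WORD blocks are skipped here because each LINE block already carries the text of the words on that line.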
- class ractogateway.rag.page_index._ocr.AzureDocumentIntelligenceBackend(endpoint, api_key, model_id='prebuilt-read')
Bases: BaseOcrBackend
OCR via Azure Document Intelligence.
Previously called Azure Form Recognizer. The prebuilt-read model is used by default; swap in prebuilt-document for richer extraction. Install with:
pip install ractogateway[rag-ocr-azure]
- Parameters:
- extract_text(image_bytes, mime_type='image/png')
Convert image_bytes to plain text.