ractogateway.rag.page_index._ocr

OCR backends for PageIndexRAG.

Each backend converts raw image bytes (PNG/JPEG) into extracted text. All backends follow the same interface so they are interchangeable.

Available backends

  • TesseractOcrBackend — free, offline, pytesseract wrapper

  • EasyOcrBackend — deep-learning, 80+ languages, offline

  • GoogleVisionBackend — Google Cloud Vision API

  • GoogleDocumentAIBackend — Google Document AI (tables, forms)

  • AWSTextractBackend — AWS Textract (forms, tables, key-value)

  • AzureDocumentIntelligenceBackend — Azure Form Recognizer v4

Quick start:

from ractogateway.rag.page_index import PageIndexRAG
from ractogateway.rag.page_index._ocr import TesseractOcrBackend

# `kit` is assumed to be an LLM kit instance that was configured elsewhere.
rag = PageIndexRAG(llm_kit=kit, ocr_backend=TesseractOcrBackend())
rag.ingest("scanned_report.pdf")   # OCR fallback is triggered automatically

class ractogateway.rag.page_index._ocr.BaseOcrBackend

Bases: ABC

Abstract base class for OCR backends.

Implementors must provide extract_text(). aextract_text() supplies a default async implementation that offloads the synchronous call to a thread-pool executor.

abstractmethod extract_text(image_bytes, mime_type='image/png')

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.
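Since mime_type is only a hint, callers that receive bytes of unknown origin may want to derive it before calling a cloud backend. The following is a minimal sketch, not part of ractogateway: sniff_mime_type is a hypothetical helper that inspects the well-known PNG, JPEG, and TIFF magic bytes and falls back to the documented default.

```python
def sniff_mime_type(image_bytes: bytes, default: str = "image/png") -> str:
    """Guess a MIME type from the leading magic bytes of an image.

    Hypothetical helper for illustration; it is not a ractogateway API.
    """
    if image_bytes.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if image_bytes.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # TIFF files start with a little-endian ("II") or big-endian ("MM") marker.
    if image_bytes.startswith((b"II*\x00", b"MM\x00*")):
        return "image/tiff"
    # Unknown format: fall back to the documented default hint.
    return default
```

The result could then be passed along as backend.extract_text(data, mime_type=sniff_mime_type(data)).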

async aextract_text(image_bytes, mime_type='image/png')

Async variant of extract_text() (thread-pool offload).

Return type:

str
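To illustrate the contract described above, here is a self-contained sketch of a custom backend. It does not import ractogateway: SketchOcrBackend is a stand-in that mirrors the documented interface, its use of run_in_executor is an assumption about how the thread-pool offload might look, and AsciiOnlyBackend is a toy that performs no real OCR.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchOcrBackend(ABC):
    """Stand-in mirroring the documented BaseOcrBackend interface (assumption)."""

    @abstractmethod
    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        ...

    async def aextract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        # Run the blocking call in the default thread-pool executor so the
        # event loop stays free while recognition is in progress.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.extract_text, image_bytes, mime_type)


class AsciiOnlyBackend(SketchOcrBackend):
    """Toy backend: keeps printable ASCII bytes instead of running OCR."""

    def extract_text(self, image_bytes: bytes, mime_type: str = "image/png") -> str:
        return "".join(chr(b) for b in image_bytes if 32 <= b < 127).strip()


async def main() -> str:
    backend = AsciiOnlyBackend()
    return await backend.aextract_text(b"\x00hello world\x00")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A real backend would subclass BaseOcrBackend instead and only needs to implement extract_text(); the async variant is inherited.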

class ractogateway.rag.page_index._ocr.TesseractOcrBackend(lang='eng', config='', confidence_threshold=40.0)

Bases: BaseOcrBackend

OCR via Tesseract.

Requires pytesseract and a working Tesseract installation. Install with:

pip install ractogateway[rag-ocr-tesseract]
# Also install the Tesseract binary: https://github.com/UB-Mannheim/tesseract/wiki

Parameters:
  • lang (str) – Tesseract language string, e.g. "eng" (default), "eng+deu".

  • config (str) – Extra Tesseract config flags, e.g. "--psm 6".

  • confidence_threshold (float) – Mean word confidence (0–100) below which a page is flagged in the returned metadata; the text is still returned. Set to 0 to disable the check.

extract_text(image_bytes, mime_type='image/png')

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

extract_with_confidence(image_bytes)

Return (text, mean_confidence) for confidence-aware ingestion.

Return type:

tuple[str, float]
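A mean word confidence can be derived from Tesseract's per-word output. The sketch below shows one way to do so, assuming the input has the shape returned by pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT); mean_word_confidence is a hypothetical helper and is not necessarily how extract_with_confidence() is implemented. Joining words with single spaces is a simplification that discards line breaks.

```python
from statistics import fmean


def mean_word_confidence(data: dict) -> tuple[str, float]:
    """Join recognised words and average their confidences.

    `data` is assumed to hold parallel "text" and "conf" lists, where conf
    is -1 for rows that are not words (blocks, paragraphs, lines).
    """
    words, confs = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if text.strip() and conf >= 0:
            words.append(text.strip())
            confs.append(conf)
    mean = fmean(confs) if confs else 0.0
    return " ".join(words), mean


# Hand-written sample in the assumed shape; not real Tesseract output.
sample = {"text": ["", "Invoice", "2024", " "], "conf": ["-1", "96", "88", "-1"]}
text, confidence = mean_word_confidence(sample)
needs_review = confidence < 40.0  # 40.0 is the documented default threshold
```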

class ractogateway.rag.page_index._ocr.EasyOcrBackend(languages=None, gpu=False)

Bases: BaseOcrBackend

OCR via EasyOCR.

Deep-learning model; no cloud API required. Supports 80+ languages. Install with:

pip install ractogateway[rag-ocr-easy]

Parameters:
  • languages (list[str] | None) – List of language codes, e.g. ["en"] (default) or ["en", "de"].

  • gpu (bool) – Use CUDA GPU if available (default False for broad compatibility).

extract_text(image_bytes, mime_type='image/png')

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index._ocr.GoogleVisionBackend(credentials_path=None)

Bases: BaseOcrBackend

OCR via Google Cloud Vision API (DOCUMENT_TEXT_DETECTION).

Install with:

pip install ractogateway[rag-ocr-google]

Parameters:

credentials_path (str | None) – Path to a service-account JSON key file. If None, the SDK falls back to Application Default Credentials (for example, the key file named by the GOOGLE_APPLICATION_CREDENTIALS environment variable).

extract_text(image_bytes, mime_type='image/png')

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.

class ractogateway.rag.page_index._ocr.GoogleDocumentAIBackend(project_id, processor_id, location='us', credentials_path=None)

Bases: BaseOcrBackend

OCR via Google Document AI.

Best for structured documents: tables, forms, invoices, contracts. Install with:

pip install ractogateway[rag-ocr-google]

Parameters:
  • project_id (str) – GCP project ID.

  • processor_id (str) – Document AI processor ID (e.g. an OCR or Form Parser processor).

  • location (str) – Processor region, usually "us" or "eu" (default "us").

  • credentials_path (str | None) – Optional path to a service-account JSON key.

extract_text(image_bytes, mime_type='image/png')

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.
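The three constructor arguments correspond to the way Document AI addresses a processor. The sketch below shows how they combine into the full resource name and the regional API endpoint that Google's documentation describes. Both function names are hypothetical, and whether GoogleDocumentAIBackend builds these strings in the same way is an assumption.

```python
def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Build the fully qualified Document AI processor resource name."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def api_endpoint(location: str) -> str:
    """Build the regional Document AI endpoint host for a processor region."""
    return f"{location}-documentai.googleapis.com"


# Placeholder identifiers for illustration only.
name = processor_name("my-project", "eu", "abc123")
endpoint = api_endpoint("eu")
```

A processor created in "eu" has to be called through the "eu" endpoint, so a location that does not match the processor's region is a common cause of not-found errors.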

class ractogateway.rag.page_index._ocr.AWSTextractBackend(region_name='us-east-1', aws_access_key_id=None, aws_secret_access_key=None)

Bases: BaseOcrBackend

OCR via AWS Textract.

Best for forms and tables; uses DetectDocumentText for plain text. Install with:

pip install ractogateway[rag-ocr-aws]

Parameters:
  • region_name (str) – AWS region (default "us-east-1").

  • aws_access_key_id, aws_secret_access_key (str | None) – Optional explicit credentials; if omitted, boto3 uses the standard credential chain (env vars, ~/.aws/credentials, IAM role, etc.).

extract_text(image_bytes, mime_type='image/png')

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.
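The credential fallback described in the parameters can be sketched as follows: explicit keys are forwarded only when both are supplied, otherwise they are left out so that boto3 resolves credentials through its standard chain. textract_client_kwargs is a hypothetical helper, and that AWSTextractBackend behaves exactly this way is an assumption.

```python
from typing import Optional


def textract_client_kwargs(
    region_name: str = "us-east-1",
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
) -> dict:
    """Build keyword arguments for boto3.client("textract", ...)."""
    kwargs = {"region_name": region_name}
    # Forward explicit keys only when both are present; otherwise omit them
    # so boto3 falls back to env vars, ~/.aws/credentials, or an IAM role.
    if aws_access_key_id and aws_secret_access_key:
        kwargs["aws_access_key_id"] = aws_access_key_id
        kwargs["aws_secret_access_key"] = aws_secret_access_key
    return kwargs


# Usage (requires boto3):
#   client = boto3.client("textract", **textract_client_kwargs())
```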

class ractogateway.rag.page_index._ocr.AzureDocumentIntelligenceBackend(endpoint, api_key, model_id='prebuilt-read')

Bases: BaseOcrBackend

OCR via Azure Document Intelligence.

Previously called Azure Form Recognizer. The prebuilt-read model is used by default; swap in prebuilt-document for richer extraction. Install with:

pip install ractogateway[rag-ocr-azure]

Parameters:
  • endpoint (str) – Azure resource endpoint URL.

  • api_key (str) – Azure resource API key.

  • model_id (str) – Document Intelligence model to use (default "prebuilt-read").

extract_text(image_bytes, mime_type='image/png')

Convert image_bytes to plain text.

Parameters:
  • image_bytes (bytes) – Raw image data (PNG, JPEG, TIFF, …).

  • mime_type (str) – MIME type hint used by cloud APIs (default "image/png").

Return type:

str

Returns:

str – Extracted text, or an empty string if nothing was recognised.