ractogateway.rag.readers

RAG file readers.

class ractogateway.rag.readers.BaseReader[source]

Bases: ABC

Read content from a file path, raw bytes, or a binary buffer.

Concrete subclasses must implement _read_path() and may override _read_bytes() to support bytes/buffer input. The public read() method handles all type coercion automatically.

abstract property supported_extensions: frozenset[str]

Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.

read(source)[source]

Load source and return its content as a Document.

Parameters:

source (str | Path | bytes | BinaryIO) – The input to load, in one of three forms:

  • str or Path – File path read from disk. Both absolute and relative paths are accepted.

  • bytes – Raw file bytes. Document.source is set to "<bytes>".

  • binary file-like object – Any object with a .read() -> bytes method, e.g. io.BytesIO, an open binary file handle, or a network stream. Document.source is set to "<buffer>".

Return type:

Document
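The type coercion described for read() can be pictured with a small stand-alone sketch. It does not import ractogateway: SketchDocument, SketchReader and UpperCaseReader are hypothetical stand-ins, and the signatures of _read_path() and _read_bytes() shown here are assumptions, since the reference above does not list them.

```python
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import BinaryIO, Union


@dataclass
class SketchDocument:
    """Hypothetical stand-in for the library's Document."""
    content: str
    source: str


class SketchReader(ABC):
    """Hypothetical stand-in showing the dispatch pattern, not BaseReader itself."""

    def read(self, source: Union[str, Path, bytes, BinaryIO]) -> SketchDocument:
        if isinstance(source, (str, Path)):
            # Paths go to the hook every subclass must implement.
            return self._read_path(Path(source))
        if isinstance(source, (bytes, bytearray)):
            return self._read_bytes(bytes(source), "<bytes>")
        # Anything else is assumed to be a binary file-like object.
        return self._read_bytes(source.read(), "<buffer>")

    @abstractmethod
    def _read_path(self, path: Path) -> SketchDocument:
        ...

    def _read_bytes(self, data: bytes, source: str) -> SketchDocument:
        # Optional hook: subclasses override it to accept bytes and buffers.
        raise NotImplementedError("this reader does not accept bytes or buffers")


class UpperCaseReader(SketchReader):
    """Toy subclass that upper-cases UTF-8 text."""

    def _read_path(self, path: Path) -> SketchDocument:
        return self._read_bytes(path.read_bytes(), str(path))

    def _read_bytes(self, data: bytes, source: str) -> SketchDocument:
        return SketchDocument(data.decode("utf-8").upper(), source)


doc = UpperCaseReader().read(io.BytesIO(b"hello"))
```

In this sketch the subclass only deals with a Path or with bytes; the public read() decides which hook to call and which source label to attach.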

class ractogateway.rag.readers.FileReaderRegistry(readers=None)[source]

Bases: object

Registry that maps file extensions to BaseReader instances.

By default all built-in readers are registered. You can add custom readers with register().

Example:

    registry = FileReaderRegistry()
    doc = registry.read("report.pdf")

register(reader)[source]

Add reader to the registry for all its supported extensions.

Return type:

None

get_reader(path)[source]

Return the reader for path’s extension.

Raises:

ValueError – If no reader supports the file’s extension.

Return type:

BaseReader

read(path)[source]

Convenience method: look up the reader for path’s extension and return the resulting Document.

Return type:

Document

property supported_extensions: frozenset[str]

All extensions currently registered.
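The extension-to-reader mapping that FileReaderRegistry describes can be sketched without the library. SketchRegistry and MarkdownStub below are hypothetical; the case-insensitive lookup and the "last registration wins" behaviour are assumptions of this sketch, not documented behaviour of the real class.

```python
from pathlib import Path


class SketchRegistry:
    """Hypothetical stand-in for an extension-to-reader mapping."""

    def __init__(self):
        self._readers = {}

    def register(self, reader):
        # One entry per extension; in this sketch a later
        # registration replaces an earlier one.
        for ext in reader.supported_extensions:
            self._readers[ext.lower()] = reader

    def get_reader(self, path):
        ext = Path(path).suffix.lower()
        try:
            return self._readers[ext]
        except KeyError:
            raise ValueError(f"no reader registered for {ext!r}") from None

    @property
    def supported_extensions(self):
        return frozenset(self._readers)


class MarkdownStub:
    """Hypothetical reader exposing only what the registry needs."""
    supported_extensions = frozenset({".md", ".markdown"})


registry = SketchRegistry()
registry.register(MarkdownStub())
reader = registry.get_reader("notes/README.MD")
```

Keying the dictionary by extension keeps get_reader() to a single lookup, and an unknown extension is reported as a ValueError, matching the error documented for get_reader() above.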

class ractogateway.rag.readers.HtmlReader[source]

Bases: BaseReader

Extract visible text from HTML files using the stdlib HTML parser.

No external dependencies required.

Accepts a file path (str / Path), raw bytes, or any binary file-like object with a .read() method.

property supported_extensions: frozenset[str]

Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.
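As a rough illustration of the kind of visible-text extraction HtmlReader describes, the sketch below uses the standard library's html.parser module. VisibleTextParser and html_to_text are hypothetical names, and skipping only script and style content is an assumption; the library's own rules may differ.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<body><h1>Hi</h1><script>var x = 1;</script><p>Some <b>bold</b> text</p></body>"
extracted = html_to_text(sample)
```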

class ractogateway.rag.readers.ImageReader(include_exif=True)[source]

Bases: BaseReader

Extract metadata from image files and represent them as text Documents.

The resulting Document.content is a human-readable summary of image properties (size, mode, format, EXIF tags). Pass the image to a vision LLM separately using RactoFile for actual visual understanding.

Accepts a file path (str / Path), raw bytes, or any binary file-like object with a .read() method.

Parameters:

include_exif (bool) – Whether to extract and include EXIF metadata in the content.

property supported_extensions: frozenset[str]

Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.

class ractogateway.rag.readers.PdfReader(extract_images=False)[source]

Bases: BaseReader

Extract text from PDF files using pypdf.

Accepts a file path (str / Path), raw bytes, or any binary file-like object with a .read() method.

Parameters:

extract_images (bool) – Reserved for future use — image extraction is not yet supported.

property supported_extensions: frozenset[str]

Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.

class ractogateway.rag.readers.SpreadsheetReader(max_rows=None, include_header=True)[source]

Bases: BaseReader

Read CSV and Excel spreadsheets into plain text.

Each row is rendered as a tab-separated line; an optional header row is prepended. Multiple sheets in an XLSX workbook are separated by a --- Sheet: <name> --- divider.

Accepts a file path (str / Path), raw bytes, or any binary file-like object with a .read() method. When bytes/buffer are provided, XLSX format is detected via the ZIP magic header (PK\x03\x04); everything else is treated as CSV/TSV.

Parameters:
  • max_rows (int | None) – Maximum number of rows to read per sheet (None = all).

  • include_header (bool) – Whether to repeat the header row at the start of each sheet section.

property supported_extensions: frozenset[str]

Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.
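Two behaviours described for SpreadsheetReader, format detection via the ZIP magic header and tab-separated rendering, can be sketched with the standard library alone. looks_like_xlsx and csv_bytes_to_text are hypothetical helpers; treating the first row as the header and counting max_rows over data rows only are assumptions of this sketch.

```python
import csv
import io

XLSX_MAGIC = b"PK\x03\x04"  # ZIP local-file header; XLSX files are ZIP archives


def looks_like_xlsx(data: bytes) -> bool:
    """Return True when the bytes start with the ZIP magic header."""
    return data.startswith(XLSX_MAGIC)


def csv_bytes_to_text(data: bytes, max_rows=None, include_header=True) -> str:
    """Render CSV bytes as tab-separated lines.

    Assumption: the first row is the header and max_rows limits data rows only.
    """
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    if max_rows is not None:
        body = body[:max_rows]
    lines = ([header] if include_header else []) + body
    return "\n".join("\t".join(row) for row in lines)


raw = b"name,qty\napple,3\npear,5\n"
rendered = csv_bytes_to_text(raw, max_rows=1)
```

Checking the first four bytes is cheap, which makes it a practical way to pick a parser when no file name is available, as is the case for bytes and buffer input.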

class ractogateway.rag.readers.TextReader(encoding='utf-8')[source]

Bases: BaseReader

Read any UTF-8 (or latin-1 fallback) plain-text file.

No external dependencies required.

Accepts a file path (str / Path), raw bytes, or any binary file-like object with a .read() method.

Parameters:

encoding (str) – Primary encoding to try. Falls back to "latin-1" on error.

property supported_extensions: frozenset[str]

Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.
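The encoding fallback described for TextReader amounts to a try/except around bytes.decode. decode_with_fallback is a hypothetical helper written for illustration, not the library's code.

```python
def decode_with_fallback(data: bytes, encoding: str = "utf-8") -> str:
    """Decode with the primary encoding, falling back to latin-1 on error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail.
        return data.decode("latin-1")


utf8_text = decode_with_fallback("café".encode("utf-8"))
latin1_text = decode_with_fallback("café".encode("latin-1"))
```

Because latin-1 never raises, the fallback always yields a string, but bytes in some other encoding will be decoded to the wrong characters rather than rejected.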

class ractogateway.rag.readers.WordReader[source]

Bases: BaseReader

Extract text from Microsoft Word (.docx) files using python-docx.

Accepts a file path (str / Path), raw bytes, or any binary file-like object with a .read() method.

property supported_extensions: frozenset[str]

Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.