ractogateway.rag.readers
RAG file readers.
- class ractogateway.rag.readers.BaseReader
Bases: ABC
Read content from a file path, raw bytes, or a binary buffer.
Concrete subclasses must implement _read_path() and may override _read_bytes() to support bytes/buffer input. The public read() method handles all type coercion automatically.
- abstract property supported_extensions: frozenset[str]
Lower-case extensions (with dot) this reader handles, e.g. {".pdf"}.
- read(source)
Load source and return its content as a Document.
- Parameters:
source (str | Path | bytes | BinaryIO) –
  - str or Path: File path read from disk. Both absolute and relative paths are accepted.
  - bytes: Raw file bytes. Document.source is set to "<bytes>".
  - binary file-like object: Any object with a .read() -> bytes method, e.g. io.BytesIO, an open binary file handle, or a network stream. Document.source is set to "<buffer>".
- Return type:
Document
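
The type coercion performed by read() can be pictured with a small stand-alone sketch. This is not the library's implementation: SketchReader, UpperTextReader, the Document stand-in, and the exact _read_bytes(data, source) signature are all assumptions made for illustration. Only the dispatch rules (paths go to _read_path(), bytes are labelled "<bytes>", buffers are labelled "<buffer>") come from the description above.

```python
from __future__ import annotations

import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import BinaryIO


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    content: str
    source: str


class SketchReader(ABC):
    """Illustrative re-creation of the documented contract, not the real BaseReader."""

    @property
    @abstractmethod
    def supported_extensions(self) -> frozenset[str]:
        ...

    @abstractmethod
    def _read_path(self, path: Path) -> Document:
        ...

    def _read_bytes(self, data: bytes, source: str) -> Document:
        # Assumed default: readers that do not override this accept paths only.
        raise NotImplementedError("this reader only supports file paths")

    def read(self, source: str | Path | bytes | BinaryIO) -> Document:
        # Dispatch on the input type so subclasses never have to.
        if isinstance(source, (str, Path)):
            return self._read_path(Path(source))
        if isinstance(source, (bytes, bytearray)):
            return self._read_bytes(bytes(source), "<bytes>")
        if hasattr(source, "read"):
            return self._read_bytes(source.read(), "<buffer>")
        raise TypeError(f"unsupported source type: {type(source).__name__}")


class UpperTextReader(SketchReader):
    """Toy subclass: decodes UTF-8 text and upper-cases it."""

    @property
    def supported_extensions(self) -> frozenset[str]:
        return frozenset({".txt"})

    def _read_path(self, path: Path) -> Document:
        return Document(path.read_text(encoding="utf-8").upper(), str(path))

    def _read_bytes(self, data: bytes, source: str) -> Document:
        return Document(data.decode("utf-8").upper(), source)


reader = UpperTextReader()
from_bytes = reader.read(b"hello")
from_buffer = reader.read(io.BytesIO(b"hello"))
```

Keeping the dispatch in one public method means a subclass only has to say how to parse a path and, optionally, a byte string; buffers are drained once and reuse the bytes code path.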
- class ractogateway.rag.readers.FileReaderRegistry(readers=None)
Bases: object
Registry that maps file extensions to BaseReader instances.
By default all built-in readers are registered. You can add custom readers with register().
Example:
    registry = FileReaderRegistry()
    doc = registry.read("report.pdf")
- register(reader)
Add reader to the registry for all its supported extensions.
- Return type:
- get_reader(path)
Return the reader for path’s extension.
- Raises:
ValueError – If no reader supports the file’s extension.
- Return type:
BaseReader
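
The lookup behaviour of register() and get_reader() can be sketched as a plain dictionary keyed by extension. SketchRegistry and EchoReader are hypothetical names, and two details are assumptions rather than documented behaviour: lookups are case-insensitive, and a later registration replaces an earlier one for the same extension.

```python
from pathlib import Path


class EchoReader:
    """Hypothetical reader stub exposing only what the registry needs."""

    def __init__(self, *extensions: str) -> None:
        self.supported_extensions = frozenset(extensions)


class SketchRegistry:
    """Illustrative extension-to-reader map, not the library's FileReaderRegistry."""

    def __init__(self, readers=None) -> None:
        self._by_ext = {}
        for reader in readers or []:
            self.register(reader)

    def register(self, reader) -> None:
        # Assumption: the most recent registration wins for a given extension.
        for ext in reader.supported_extensions:
            self._by_ext[ext.lower()] = reader

    def get_reader(self, path):
        # Assumption: extensions are compared case-insensitively.
        ext = Path(path).suffix.lower()
        try:
            return self._by_ext[ext]
        except KeyError:
            raise ValueError(f"no reader registered for extension {ext!r}") from None


pdf_reader = EchoReader(".pdf")
sheet_reader = EchoReader(".csv", ".xlsx")
registry = SketchRegistry([pdf_reader, sheet_reader])
```

With this shape, registering a custom reader for an extension that already has a built-in reader is a one-line override, and an unknown extension fails early with ValueError instead of silently returning nothing.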
- class ractogateway.rag.readers.HtmlReader
Bases: BaseReader
Extract visible text from HTML files using the stdlib HTML parser.
No external dependencies required.
Accepts a file path (str/Path), raw bytes, or any binary file-like object with a .read() method.
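
As a rough picture of what extracting visible text with the standard library can look like, the sketch below subclasses html.parser.HTMLParser and drops the contents of script and style elements. VisibleTextParser and visible_text are hypothetical names; which tags HtmlReader actually treats as invisible, and how it joins the text, is not stated here and is assumed.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    _HIDDEN = {"script", "style"}  # assumed set of non-visible elements

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self._HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._hidden_depth:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # One line per text node is an arbitrary choice for this sketch.
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello &amp; welcome</p>"
    "<script>var x = 1;</script></body></html>"
)
text = visible_text(sample)
```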
- class ractogateway.rag.readers.ImageReader(include_exif=True)
Bases: BaseReader
Extract metadata from image files and represent them as text Documents.
The resulting Document.content is a human-readable summary of image properties (size, mode, format, EXIF tags). Pass the image to a vision LLM separately using RactoFile for actual visual understanding.
Accepts a file path (str/Path), raw bytes, or any binary file-like object with a .read() method.
- Parameters:
include_exif (bool) – Whether to extract and include EXIF metadata in the content.
- class ractogateway.rag.readers.PdfReader(extract_images=False)
Bases: BaseReader
Extract text from PDF files using pypdf.
Accepts a file path (str/Path), raw bytes, or any binary file-like object with a .read() method.
- Parameters:
extract_images (bool) – Reserved for future use; image extraction is not yet supported.
- class ractogateway.rag.readers.SpreadsheetReader(max_rows=None, include_header=True)
Bases: BaseReader
Read CSV and Excel spreadsheets into plain text.
Each row is rendered as a tab-separated line; an optional header row is prepended. Multiple sheets in an XLSX workbook are separated by a --- Sheet: <name> --- divider.
Accepts a file path (str/Path), raw bytes, or any binary file-like object with a .read() method. When bytes/buffer are provided, XLSX format is detected via the ZIP magic header (PK\x03\x04); everything else is treated as CSV/TSV.
- Parameters:
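
The byte-level format detection and the tab-separated rendering described for SpreadsheetReader can be sketched with the standard library alone. looks_like_xlsx and render_csv are hypothetical helpers; the sketch assumes UTF-8 input, treats the first CSV row as the header, and interprets max_rows as a cap on data rows, none of which is confirmed by the text above. XLSX parsing itself is left out.

```python
import csv
import io

XLSX_MAGIC = b"PK\x03\x04"  # ZIP local-file header; XLSX files are ZIP archives


def looks_like_xlsx(data: bytes) -> bool:
    """Return True when the payload starts with the ZIP magic header."""
    return data.startswith(XLSX_MAGIC)


def render_csv(data: bytes, max_rows=None, include_header=True) -> str:
    """Render CSV bytes as tab-separated lines (UTF-8 assumed)."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))
    if not rows:
        return ""
    # Assumption: the first row is the header and max_rows limits data rows only.
    header, body = rows[0], rows[1:]
    if max_rows is not None:
        body = body[:max_rows]
    lines = [header] if include_header else []
    lines.extend(body)
    return "\n".join("\t".join(row) for row in lines)


csv_bytes = b"name,qty\r\nwidget,3\r\ngadget,5\r\n"
rendered = render_csv(csv_bytes, max_rows=1)
```

Sniffing the first four bytes is cheap and needs no file name, which is why it suits bytes and buffer input. It cannot tell an XLSX workbook from any other ZIP archive, so a real reader would still have to handle a failed workbook parse.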
- class ractogateway.rag.readers.TextReader(encoding='utf-8')
Bases: BaseReader
Read any UTF-8 (or latin-1 fallback) plain-text file.
No external dependencies required.
Accepts a file path (str/Path), raw bytes, or any binary file-like object with a .read() method.
- Parameters:
encoding (str) – Primary encoding to try. Falls back to "latin-1" on error.
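
The fallback behaviour amounts to a try/except around the primary decode. decode_text is a hypothetical helper written for illustration; the library's internal function names are not shown in this reference.

```python
def decode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode with the primary encoding, falling back to latin-1 on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail.
        return data.decode("latin-1")


utf8_text = decode_text("caf\u00e9".encode("utf-8"))
fallback_text = decode_text("caf\u00e9".encode("latin-1"))
```

Because latin-1 accepts any byte sequence, the fallback never raises. The trade-off is that a file in some other encoding decodes without error but may come out as mojibake.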
- class ractogateway.rag.readers.WordReader
Bases: BaseReader
Extract text from Microsoft Word (.docx) files using python-docx.
Accepts a file path (str/Path), raw bytes, or any binary file-like object with a .read() method.