ractogateway.rag.processors.cleaner
Text cleaning processor — no extra dependencies.
- class ractogateway.rag.processors.cleaner.TextCleaner(normalize_unicode=True, strip_html=True, strip_control_chars=True, collapse_whitespace=True, collapse_blank_lines=True)
Bases: BaseProcessor
Normalise text for embedding and retrieval.
Steps applied (all optional via constructor flags):
Unicode normalisation (NFC)
Strip residual HTML tags
Remove control characters
Collapse multiple spaces to one
Collapse runs of blank lines to at most two newlines
Strip leading/trailing whitespace
- Parameters:
normalize_unicode (bool) – Apply unicodedata.normalize("NFC", text).
strip_html (bool) – Remove <tag> patterns.
strip_control_chars (bool) – Remove non-printable control characters.
collapse_whitespace (bool) – Collapse sequences of spaces/tabs to a single space.
collapse_blank_lines (bool) – Collapse 3+ consecutive newlines to 2.
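The documented steps and flags can be approximated with the standard library alone. The sketch below is illustrative and is not the TextCleaner source: the function name clean_text_sketch, the regular expressions, and the exact set of control characters removed (tab, newline and carriage return are kept here) are all assumptions. It also assumes the final strip of leading/trailing whitespace always runs, since no constructor flag is documented for it.

```python
import re
import unicodedata

# Assumed patterns; the real TextCleaner may use different ones.
_TAG_RE = re.compile(r"<[^>]+>")                            # anything shaped like <tag>
_CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps \t, \n, \r
_SPACES_RE = re.compile(r"[ \t]+")                          # runs of spaces/tabs
_BLANK_LINES_RE = re.compile(r"\n{3,}")                     # 3+ consecutive newlines


def clean_text_sketch(
    text,
    normalize_unicode=True,
    strip_html=True,
    strip_control_chars=True,
    collapse_whitespace=True,
    collapse_blank_lines=True,
):
    """Hypothetical stand-in that mirrors the documented cleaning steps."""
    if normalize_unicode:
        text = unicodedata.normalize("NFC", text)
    if strip_html:
        text = _TAG_RE.sub("", text)
    if strip_control_chars:
        text = _CONTROL_RE.sub("", text)
    if collapse_whitespace:
        text = _SPACES_RE.sub(" ", text)
    if collapse_blank_lines:
        text = _BLANK_LINES_RE.sub("\n\n", text)
    # Assumption: the final strip is unconditional.
    return text.strip()


if __name__ == "__main__":
    raw = "<p>Cafe\u0301  menu</p>\x00\n\n\n\nNext\tline  "
    print(repr(clean_text_sketch(raw)))
```

Running the steps in the documented order matters in this sketch: tags are removed before whitespace is collapsed, so gaps left behind by stripped markup are tidied in the same pass.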