ractogateway.rag.processors.cleaner

Text cleaning processor — no extra dependencies.

class ractogateway.rag.processors.cleaner.TextCleaner(normalize_unicode=True, strip_html=True, strip_control_chars=True, collapse_whitespace=True, collapse_blank_lines=True)[source]

Bases: BaseProcessor

Normalise text for embedding and retrieval.

Steps applied, in order (steps 1 to 5 are each optional via a constructor flag; step 6 has no corresponding flag in the signature):

  1. Unicode normalisation (NFC)

  2. Strip residual HTML tags

  3. Remove control characters

  4. Collapse runs of spaces and tabs to a single space

  5. Collapse runs of blank lines to at most two newlines

  6. Strip leading/trailing whitespace

Parameters:
  • normalize_unicode (bool) – Apply unicodedata.normalize("NFC", text).

  • strip_html (bool) – Remove residual HTML tags (<tag> patterns).

  • strip_control_chars (bool) – Remove non-printable control characters.

  • collapse_whitespace (bool) – Collapse sequences of spaces/tabs to a single space.

  • collapse_blank_lines (bool) – Collapse 3+ consecutive newlines to 2.
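The steps and flags above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: the function name clean_text_sketch and all four regular expressions are assumptions made for illustration, and the actual patterns used by TextCleaner may differ (for example in which control characters are removed).

```python
import re
import unicodedata

# Hypothetical patterns; the library's actual regexes may differ.
_HTML_TAG = re.compile(r"<[^>]+>")
# C0 control characters except tab, newline and carriage return, plus DEL.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SPACES = re.compile(r"[ \t]+")
_BLANK_LINES = re.compile(r"\n{3,}")


def clean_text_sketch(
    text: str,
    *,
    normalize_unicode: bool = True,
    strip_html: bool = True,
    strip_control_chars: bool = True,
    collapse_whitespace: bool = True,
    collapse_blank_lines: bool = True,
) -> str:
    """Apply the six documented steps in order (illustrative only)."""
    if normalize_unicode:
        text = unicodedata.normalize("NFC", text)   # step 1
    if strip_html:
        text = _HTML_TAG.sub("", text)              # step 2
    if strip_control_chars:
        text = _CONTROL.sub("", text)               # step 3
    if collapse_whitespace:
        text = _SPACES.sub(" ", text)               # step 4
    if collapse_blank_lines:
        text = _BLANK_LINES.sub("\n\n", text)       # step 5
    return text.strip()                             # step 6, always applied


raw = "<p>Cafe\u0301  menu</p>\n\n\n\n<b>Open\x00</b>\tdaily  "
cleaned = clean_text_sketch(raw)
print(repr(cleaned))
```

Note that a naive tag pattern like the one assumed here also removes non-HTML text that happens to sit between angle brackets, such as the middle of "a < b and c > d".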

process(text)[source]

Process text and return the transformed string.

Parameters:

text (str) – Input text (chunk content or raw document content).

Return type:

str

Returns:

str – Processed text. Must be a non-empty string when input is non-empty.
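Because process takes a str and returns a str, objects with this method can plausibly be applied one after another. The sketch below illustrates that calling convention only: the Processor protocol, the run_processors helper, and the two stand-in classes are hypothetical names introduced here, not part of ractogateway, and the library may provide its own chaining mechanism.

```python
from typing import Iterable, Protocol


class Processor(Protocol):
    """Hypothetical structural type: anything with process(str) -> str."""

    def process(self, text: str) -> str: ...


def run_processors(text: str, processors: Iterable[Processor]) -> str:
    """Apply each processor in order, feeding each output to the next."""
    for processor in processors:
        text = processor.process(text)
    return text


class StripProcessor:
    """Stand-in processor used only for this illustration."""

    def process(self, text: str) -> str:
        return text.strip()


class LowercaseProcessor:
    """Second stand-in processor, also illustrative only."""

    def process(self, text: str) -> str:
        return text.lower()


result = run_processors("  Hello World  ", [StripProcessor(), LowercaseProcessor()])
print(result)
```

A TextCleaner instance would take the place of the stand-in classes here, since it exposes the same process(text) method.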