ractogateway.rag.processors

RAG text processors.

class ractogateway.rag.processors.BaseProcessor

Bases: ABC

Transform a text string and return the processed result.

Processors are applied to chunk content before embedding. They can normalise whitespace, lemmatize tokens, remove stop words, etc.

Chain multiple processors with ProcessingPipeline.

abstractmethod process(text)

Process text and return the transformed string.

Parameters:

text (str) – Input text (chunk content or raw document content).

Return type:

str

Returns:

str – Processed text. Must be a non-empty string when input is non-empty.
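A minimal sketch of a custom processor, written against only the contract described above. The `BaseProcessor` class defined here is a local stand-in so the snippet runs without the library installed, and `LowercaseProcessor` is a hypothetical name, not part of the package.

```python
from abc import ABC, abstractmethod


class BaseProcessor(ABC):
    """Local stand-in that mirrors the documented interface (assumption)."""

    @abstractmethod
    def process(self, text: str) -> str:
        """Process text and return the transformed string."""


class LowercaseProcessor(BaseProcessor):
    """Hypothetical processor that lowercases its input."""

    def process(self, text: str) -> str:
        # Lowercasing keeps every character position occupied, so a
        # non-empty input always yields a non-empty output.
        return text.lower()


result = LowercaseProcessor().process("Hello RAG")
print(result)
```

In real code the subclass would presumably inherit from `ractogateway.rag.processors.BaseProcessor` instead of the stand-in.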

class ractogateway.rag.processors.Lemmatizer(use_pos_tagging=True)

Bases: BaseProcessor

Reduce words to their base (lemma) form using NLTK WordNet.

Parameters:

use_pos_tagging (bool) – If True, tag each token's part of speech before lemmatizing. This is slightly slower but improves lemmatization accuracy.
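To illustrate why part-of-speech information matters, here is a toy sketch that does not use NLTK or this class at all. The `LEMMAS` table and the `lemmatize` helper are hypothetical; they only mimic the idea that the same surface form can map to different lemmas depending on its tag, with noun as the fallback tag.

```python
# Toy lemma table keyed by (word, part of speech). Purely illustrative.
LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)


# Without a tag, every token is treated as a noun.
untagged = [lemmatize(w) for w in ["saw", "meeting"]]

# With tags, verbs resolve to their base form.
tagged = [lemmatize(w, p) for w, p in [("saw", "v"), ("meeting", "v")]]

print(untagged, tagged)
```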

process(text)

Process text and return the transformed string.

Parameters:

text (str) – Input text (chunk content or raw document content).

Return type:

str

Returns:

str – Processed text. Must be a non-empty string when input is non-empty.

class ractogateway.rag.processors.ProcessingPipeline(processors)

Bases: BaseProcessor

Apply a sequence of BaseProcessor objects to text.

Example:

from ractogateway.rag.processors import Lemmatizer, ProcessingPipeline, TextCleaner

pipeline = ProcessingPipeline([TextCleaner(), Lemmatizer()])
processed = pipeline.process("  Hello,   worlds!  ")

Parameters:

processors (list[BaseProcessor]) – Ordered list of processors to apply. Each processor receives the output of the previous one.
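Because each processor receives the previous processor's output, the order of the list changes the result. The sketch below shows this with local stand-ins; `SketchPipeline`, `StripProcessor`, and `TruncateProcessor` are hypothetical names used only for illustration, and the loop is an assumption about how such chaining could work, not the library's actual implementation.

```python
class StripProcessor:
    """Hypothetical processor: trims leading and trailing whitespace."""

    def process(self, text: str) -> str:
        return text.strip()


class TruncateProcessor:
    """Hypothetical processor: keeps only the first `limit` characters."""

    def __init__(self, limit: int) -> None:
        self.limit = limit

    def process(self, text: str) -> str:
        return text[: self.limit]


class SketchPipeline:
    """Stand-in pipeline: feeds each processor the previous one's output."""

    def __init__(self, processors) -> None:
        self.processors = list(processors)

    def process(self, text: str) -> str:
        for processor in self.processors:
            text = processor.process(text)
        return text


raw = "  Hello world"
# Strip first, then truncate to five characters.
forward = SketchPipeline([StripProcessor(), TruncateProcessor(5)]).process(raw)
# Truncate first (two of the five characters are spaces), then strip.
reverse = SketchPipeline([TruncateProcessor(5), StripProcessor()]).process(raw)
print(repr(forward), repr(reverse))
```

The same reasoning suggests placing cleaning steps before token-level steps such as lemmatization, as the example above does with `TextCleaner()` before `Lemmatizer()`.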

process(text)

Process text and return the transformed string.

Parameters:

text (str) – Input text (chunk content or raw document content).

Return type:

str

Returns:

str – Processed text. Must be a non-empty string when input is non-empty.

class ractogateway.rag.processors.TextCleaner(normalize_unicode=True, strip_html=True, strip_control_chars=True, collapse_whitespace=True, collapse_blank_lines=True)

Bases: BaseProcessor

Normalise text for embedding and retrieval.

Steps applied (all optional via constructor flags):

  1. Unicode normalisation (NFC)

  2. Strip residual HTML tags

  3. Remove control characters

  4. Collapse runs of spaces and tabs to a single space

  5. Collapse runs of blank lines to at most two newlines

  6. Strip leading/trailing whitespace

Parameters:

  • normalize_unicode (bool) – Apply unicodedata.normalize("NFC", text).

  • strip_html (bool) – Remove <tag> patterns.

  • strip_control_chars (bool) – Remove non-printable control characters.

  • collapse_whitespace (bool) – Collapse sequences of spaces/tabs to a single space.

  • collapse_blank_lines (bool) – Collapse 3+ consecutive newlines to 2.
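The six steps listed above can be approximated with the standard library alone. This is a rough sketch under stated assumptions, not the class's actual implementation: `clean_text` is a hypothetical helper, the tag pattern is deliberately naive, and the choice to keep newlines and tabs during control-character removal is an assumption made so that later steps have something to collapse.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Apply the six documented cleaning steps in order (illustrative only)."""
    # 1. Unicode normalisation (NFC).
    text = unicodedata.normalize("NFC", text)
    # 2. Strip anything that looks like <tag> (naive pattern, assumption).
    text = re.sub(r"<[^>]+>", "", text)
    # 3. Remove control characters (category Cc), keeping newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # 4. Collapse runs of spaces and tabs to a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # 5. Collapse 3+ consecutive newlines to 2.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 6. Strip leading and trailing whitespace.
    return text.strip()


sample = "<p>Cafe\u0301   is\x00 open</p>\n\n\n\nBye  "
cleaned = clean_text(sample)
print(repr(cleaned))
```

In the sample, the decomposed "e" plus combining accent is composed into a single "é", the tags and the NUL character are dropped, and the four newlines shrink to two.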

process(text)

Process text and return the transformed string.

Parameters:

text (str) – Input text (chunk content or raw document content).

Return type:

str

Returns:

str – Processed text. Must be a non-empty string when input is non-empty.