ractogateway.rag.processors
RAG text processors.
- class ractogateway.rag.processors.BaseProcessor
Bases: ABC
Transform a text string and return the processed result.
Processors are applied to chunk content before embedding. They can normalise whitespace, lemmatize tokens, remove stop words, etc.
Chain multiple processors with ProcessingPipeline.
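A custom processor might look like the sketch below. It does not import ractogateway: the `BaseProcessor` class here is a local stand-in, and the abstract method name `process(text) -> str` is an assumption inferred from the `pipeline.process(...)` call in the ProcessingPipeline example. `Lowercaser` is a hypothetical processor used only for illustration.

```python
from abc import ABC, abstractmethod


class BaseProcessor(ABC):
    """Local stand-in for the library's base class (assumed interface)."""

    @abstractmethod
    def process(self, text: str) -> str:
        """Transform a text string and return the processed result."""


class Lowercaser(BaseProcessor):
    """Hypothetical custom processor: lower-cases chunk text."""

    def process(self, text: str) -> str:
        return text.lower()


# The abstract base cannot be instantiated; concrete subclasses can.
result = Lowercaser().process("Hello, World!")
print(result)
```

If the real base class exposes a different method name, only the method signature in the subclass would need to change.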
- class ractogateway.rag.processors.Lemmatizer(use_pos_tagging=True)
Bases: BaseProcessor
Reduce words to their base (lemma) form using NLTK WordNet.
- Parameters:
use_pos_tagging (bool) – If True, use POS tagging to improve lemmatization accuracy. This is slightly slower but usually yields more accurate lemmas.
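To illustrate why POS tagging helps, here is a minimal sketch of POS-aware lemmatization with NLTK. It is not the library's implementation: the helper names `penn_to_wordnet` and `lemmatize` are hypothetical, tokenization is a plain whitespace split, and running `lemmatize` assumes NLTK is installed with its WordNet and POS tagger data downloaded.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also WordNetLemmatizer's default


def lemmatize(text: str, use_pos_tagging: bool = True) -> str:
    """Lemmatize whitespace-separated tokens (hypothetical helper)."""
    # Imported lazily so the tag mapping above works without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tokens = text.split()
    if use_pos_tagging:
        # Tag each token, then pass the mapped POS to the lemmatizer.
        tagged = nltk.pos_tag(tokens)
        lemmas = [lemmatizer.lemmatize(tok, penn_to_wordnet(tag)) for tok, tag in tagged]
    else:
        # Without a POS hint every token is treated as a noun.
        lemmas = [lemmatizer.lemmatize(tok) for tok in tokens]
    return " ".join(lemmas)
```

Without a POS hint the WordNet lemmatizer treats every token as a noun, so verb forms such as "running" are typically left unchanged; supplying the verb tag lets it reduce them to the base form.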
- class ractogateway.rag.processors.ProcessingPipeline(processors)
Bases: BaseProcessor
Apply a sequence of BaseProcessor objects to text.
Example:
    from ractogateway.rag.processors import Lemmatizer, ProcessingPipeline, TextCleaner
    pipeline = ProcessingPipeline([TextCleaner(), Lemmatizer()])
    processed = pipeline.process(" Hello, worlds! ")
- Parameters:
processors (list[BaseProcessor]) – Ordered list of processors to apply. Each processor receives the output of the previous one.
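Because each processor receives the output of the previous one, the order of the list matters. The sketch below shows that chaining behaviour with local stand-ins; `Pipeline`, `Strip`, and `Exclaim` are hypothetical names, not part of ractogateway, and the `process` method name is assumed from the example above.

```python
from typing import Protocol, Sequence


class Processor(Protocol):
    """Anything with a process(text) -> str method (assumed interface)."""

    def process(self, text: str) -> str: ...


class Pipeline:
    """Stand-in pipeline: feeds each processor the previous one's output."""

    def __init__(self, processors: Sequence[Processor]) -> None:
        self.processors = list(processors)

    def process(self, text: str) -> str:
        for processor in self.processors:
            text = processor.process(text)
        return text


class Strip:
    """Hypothetical processor: trims leading and trailing whitespace."""

    def process(self, text: str) -> str:
        return text.strip()


class Exclaim:
    """Hypothetical processor: appends an exclamation mark."""

    def process(self, text: str) -> str:
        return text + "!"


# Same processors, different order, different result.
strip_first = Pipeline([Strip(), Exclaim()]).process("  hi  ")
exclaim_first = Pipeline([Exclaim(), Strip()]).process("  hi  ")
print(repr(strip_first), repr(exclaim_first))
```

In practice this suggests putting cleaning steps before token-level steps such as lemmatization, so later processors see normalised input.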
- class ractogateway.rag.processors.TextCleaner(normalize_unicode=True, strip_html=True, strip_control_chars=True, collapse_whitespace=True, collapse_blank_lines=True)
Bases: BaseProcessor
Normalise text for embedding and retrieval.
Steps applied (all optional via constructor flags):
1. Unicode normalisation (NFC)
2. Strip residual HTML tags
3. Remove control characters
4. Collapse multiple spaces to one
5. Collapse runs of blank lines to at most two newlines
6. Strip leading/trailing whitespace
- Parameters:
normalize_unicode (bool) – Apply unicodedata.normalize("NFC", text).
strip_html (bool) – Remove <tag> patterns.
strip_control_chars (bool) – Remove non-printable control characters.
collapse_whitespace (bool) – Collapse sequences of spaces/tabs to a single space.
collapse_blank_lines (bool) – Collapse 3+ consecutive newlines to 2.
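The cleaning steps listed for TextCleaner can be approximated with the standard library alone. The function below is a sketch under assumptions, not the library's code: `clean_text` is a hypothetical name, the HTML pattern `<[^>]+>` is one plausible reading of "<tag> patterns", and the control-character step is assumed to keep newlines and tabs so that the later collapsing steps still have something to act on.

```python
import re
import unicodedata


def clean_text(
    text: str,
    normalize_unicode: bool = True,
    strip_html: bool = True,
    strip_control_chars: bool = True,
    collapse_whitespace: bool = True,
    collapse_blank_lines: bool = True,
) -> str:
    """Hypothetical re-creation of the documented cleaning steps."""
    if normalize_unicode:
        # Compose characters, e.g. "e" + combining acute becomes a single code point.
        text = unicodedata.normalize("NFC", text)
    if strip_html:
        # Assumed pattern: anything between "<" and the next ">".
        text = re.sub(r"<[^>]+>", "", text)
    if strip_control_chars:
        # Drop Unicode category Cc characters, keeping newline and tab (assumption).
        text = "".join(
            ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
        )
    if collapse_whitespace:
        # Runs of spaces/tabs become one space; newlines are left alone.
        text = re.sub(r"[ \t]+", " ", text)
    if collapse_blank_lines:
        # Three or more consecutive newlines become exactly two.
        text = re.sub(r"\n{3,}", "\n\n", text)
    # The final strip is applied unconditionally in this sketch.
    return text.strip()


cleaned = clean_text("  <b>Hello</b>,   world!\x00\n\n\n\nBye  ")
print(repr(cleaned))
```

Each flag guards exactly one step, so a step can be switched off independently, for example `strip_html=False` when angle brackets are meaningful in the source text.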