ractogateway.rag.chunkers
RAG text chunkers.
- class ractogateway.rag.chunkers.BaseChunker
Bases: ABC
Split a Document into a list of Chunk objects.
Each chunk preserves provenance (doc_id, chunk_index, start_char, end_char) in its ChunkMetadata.
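The provenance fields listed above can be pictured with a small stand-alone sketch. Only the four field names (doc_id, chunk_index, start_char, end_char) come from this page. The class names SketchChunkMetadata, SketchChunk, SketchChunker and WholeDocChunker, the chunk() method signature, and the convention that end_char is exclusive are all assumptions made for illustration and are not the library's actual definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SketchChunkMetadata:
    # Field names taken from the documentation; types are assumed.
    doc_id: str
    chunk_index: int
    start_char: int
    end_char: int  # assumed exclusive, like a Python slice bound


@dataclass(frozen=True)
class SketchChunk:
    # Hypothetical container pairing chunk text with its provenance.
    text: str
    metadata: SketchChunkMetadata


class SketchChunker(ABC):
    """Hypothetical abstract base: subclasses decide where to cut."""

    @abstractmethod
    def chunk(self, doc_id: str, text: str) -> List[SketchChunk]:
        ...


class WholeDocChunker(SketchChunker):
    """Trivial subclass that returns the whole text as one chunk."""

    def chunk(self, doc_id: str, text: str) -> List[SketchChunk]:
        meta = SketchChunkMetadata(doc_id, 0, 0, len(text))
        return [SketchChunk(text, meta)]
```

Under these assumptions, slicing the source text with start_char and end_char recovers the chunk text, which is what makes the offsets useful for citing or highlighting retrieved passages.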
- class ractogateway.rag.chunkers.FixedChunker(chunk_size=512, overlap=50)
Bases: BaseChunker
Split text into fixed-size character windows with overlap.
- Parameters:
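As a rough illustration of fixed-size windows with overlap, the sketch below slides a window of chunk_size characters forward by chunk_size - overlap characters each step. It is a stand-alone approximation, not FixedChunker's implementation: the function name fixed_windows, the tuple return shape, and the treatment of the final short window are assumptions.

```python
def fixed_windows(text, chunk_size=512, overlap=50):
    """Return (start_char, end_char, chunk_text) tuples for overlapping windows.

    Hypothetical helper; defaults mirror the documented signature only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    windows = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        windows.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
        start += step
    return windows
```

For example, fixed_windows("abcdefghij", chunk_size=4, overlap=1) yields the windows "abcd", "defg" and "ghij", each sharing one character with its neighbour.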
- class ractogateway.rag.chunkers.RecursiveChunker(chunk_size=512, overlap=50, separators=None)
Bases: BaseChunker
Split text recursively using a priority list of separators.
- Parameters:
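The idea of a priority list of separators can be sketched as follows: try the coarsest separator first, pack the resulting pieces greedily up to chunk_size, and fall back to the next separator only for pieces that are still too long. This is a simplified stand-alone sketch, not RecursiveChunker's code. The function name recursive_split and the default separator list are assumptions (this page does not state what separators=None resolves to), and overlap is left out for brevity.

```python
def recursive_split(text, chunk_size=512, separators=None):
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper; the default separators below are an assumption.
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " "]
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next one in priority order.
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the same chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

With chunk_size=8, the text "aaa bbb\n\nccc ddd eee" is first cut at the blank line, and only the second paragraph is split further at spaces.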
- class ractogateway.rag.chunkers.SemanticChunker(embedder, threshold=0.5, min_chunk_size=2, language='english')
Bases: BaseChunker
Split documents where the semantic similarity between adjacent sentences drops below a threshold.
- Parameters:
embedder (BaseEmbedder) – Any BaseEmbedder instance.
threshold (float) – Cosine similarity below which a split is inserted (default: 0.5).
min_chunk_size (int) – Minimum number of sentences per chunk (prevents ultra-fine splits).
language (str) – NLTK sentence tokenizer language.
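The threshold rule described above can be sketched without any embedding model by passing in precomputed sentence vectors. This is an illustrative approximation rather than SemanticChunker's implementation: the function names cosine and semantic_groups are hypothetical, the vectors stand in for whatever a BaseEmbedder would return, and the way min_chunk_size is enforced here (a split is skipped while the current group is too small) is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_groups(sentences, vectors, threshold=0.5, min_chunk_size=2):
    """Group sentences, splitting where adjacent similarity drops below threshold.

    Hypothetical helper; vectors[i] is assumed to be the embedding of sentences[i].
    """
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        similarity = cosine(vectors[i - 1], vectors[i])
        # Split only if the topic shifts AND the current group is large enough.
        if similarity < threshold and len(groups[-1]) >= min_chunk_size:
            groups.append([sentences[i]])
        else:
            groups[-1].append(sentences[i])
    return groups
```

In this simplified version the final group can still end up shorter than min_chunk_size; a fuller implementation might merge it into the previous group.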
- class ractogateway.rag.chunkers.SentenceChunker(sentences_per_chunk=5, overlap_sentences=1, language='english')
Bases: BaseChunker
Split text into groups of sentences using NLTK.
- Parameters:
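Grouping sentences with overlap can be sketched on an already tokenized list, which keeps the example free of NLTK data downloads. The sketch below is a stand-alone illustration, not SentenceChunker's code: the function name group_sentences is hypothetical, and the input is assumed to be the list of sentences that a tokenizer such as NLTK's would produce.

```python
def group_sentences(sentences, sentences_per_chunk=5, overlap_sentences=1):
    """Return lists of sentences, each sharing overlap_sentences with the previous.

    Hypothetical helper; defaults mirror the documented signature only.
    """
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    if not 0 <= overlap_sentences < sentences_per_chunk:
        raise ValueError("overlap_sentences must be smaller than sentences_per_chunk")

    step = sentences_per_chunk - overlap_sentences
    groups = []
    for start in range(0, len(sentences), step):
        groups.append(sentences[start:start + sentences_per_chunk])
        if start + sentences_per_chunk >= len(sentences):
            break  # this group already contains the final sentence
    return groups
```

With seven sentences, sentences_per_chunk=3 and overlap_sentences=1, the groups cover sentences 1-3, 3-5 and 5-7, so each boundary sentence appears in two neighbouring chunks.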