ractogateway.rag.chunkers.semantic_chunker
Semantic chunker — splits at embedding-space boundaries.
Uses cosine similarity between adjacent sentence embeddings to detect
topic shifts. Requires a BaseEmbedder and NLTK sent_tokenize.
Install with: pip install ractogateway[rag-nlp]
- class ractogateway.rag.chunkers.semantic_chunker.SemanticChunker(embedder, threshold=0.5, min_chunk_size=2, language='english')
Bases: BaseChunker
Split documents where the semantic similarity between adjacent sentences drops below a threshold.
- Parameters:
embedder (BaseEmbedder) – Any BaseEmbedder instance.
threshold (float) – Cosine similarity below which a split is inserted (default: 0.5).
min_chunk_size (int) – Minimum number of sentences per chunk (prevents ultra-fine splits).
language (str) – NLTK sentence tokenizer language.
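
The interplay of threshold and min_chunk_size can be illustrated with a minimal, standalone sketch. It is not SemanticChunker's actual implementation and does not call ractogateway: the names cosine_similarity, semantic_split, and toy_embed are hypothetical, the toy embedder stands in for a real BaseEmbedder, and the sentences are assumed to be already tokenized (the library uses NLTK sent_tokenize for that step). How the real class treats a short trailing chunk is also an assumption here.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_split(
    sentences: List[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.5,
    min_chunk_size: int = 2,
) -> List[List[str]]:
    """Hypothetical helper: group sentences, splitting where similarity drops.

    A split is placed before sentence i when the similarity between
    sentences i-1 and i is below `threshold` AND the current chunk already
    holds at least `min_chunk_size` sentences. In this sketch the final
    chunk may still end up shorter than `min_chunk_size`.
    """
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks: List[List[str]] = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(vectors[i - 1], vectors[i])
        if similarity < threshold and len(current) >= min_chunk_size:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks


def toy_embed(sentence: str) -> Vector:
    """Toy 2-d embedding (assumption): counts of 'cat*' and 'stock*' words."""
    words = sentence.lower().split()
    return [
        float(sum(w.startswith("cat") for w in words)),
        float(sum(w.startswith("stock") for w in words)),
    ]


sentences = [
    "Cats sleep a lot.",
    "My cat purrs loudly.",
    "Stocks fell today.",
    "Stock prices may recover.",
]
chunks = semantic_split(sentences, toy_embed, threshold=0.5, min_chunk_size=2)
```

With these inputs the adjacent similarities are 1.0, 0.0, and 1.0, so the only split falls between the second and third sentences. Raising min_chunk_size to 3 suppresses that split, because the first chunk holds only two sentences when the drop occurs.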