ractogateway.rag.chunkers.semantic_chunker

Semantic chunker — splits at embedding-space boundaries.

Uses cosine similarity between adjacent sentence embeddings to detect topic shifts. Requires a BaseEmbedder instance and NLTK's sent_tokenize.

Install with: pip install ractogateway[rag-nlp]

class ractogateway.rag.chunkers.semantic_chunker.SemanticChunker(embedder, threshold=0.5, min_chunk_size=2, language='english')

Bases: BaseChunker

Split documents where the semantic similarity between adjacent sentences drops below a threshold.

Parameters:
  • embedder (BaseEmbedder) – Any BaseEmbedder instance.

  • threshold (float) – Cosine similarity below which a split is inserted (default: 0.5).

  • min_chunk_size (int) – Minimum number of sentences per chunk; prevents ultra-fine splits (default: 2).

  • language (str) – Language passed to the NLTK sentence tokenizer (default: 'english').
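
The interaction between threshold and min_chunk_size can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: toy_embed, cosine, and semantic_split are hypothetical names, the bag-of-words vector stands in for a real BaseEmbedder, and the sentences are assumed to be already tokenized (the role sent_tokenize plays in the real class).

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedder: a bag-of-words count vector."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_split(sentences, threshold=0.5, min_chunk_size=2):
    """Group sentences, splitting where adjacent similarity < threshold.

    Assumption: a split is suppressed while the current chunk holds fewer
    than min_chunk_size sentences. The final chunk may still end up shorter.
    """
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        similarity = cosine(vectors[i - 1], vectors[i])
        if similarity < threshold and len(chunks[-1]) >= min_chunk_size:
            chunks.append([sentences[i]])  # topic shift: start a new chunk
        else:
            chunks[-1].append(sentences[i])  # same topic, or chunk too short
    return chunks


sentences = [
    "Cats chase mice.",
    "Cats chase birds.",
    "Stocks fell sharply today.",
    "Stocks fell again today.",
]
chunks = semantic_split(sentences)  # two chunks of two sentences each
```

Here the similarity between the second and third sentences is 0, which falls below the default threshold of 0.5, so the split lands between the two topics. Raising min_chunk_size to 3 suppresses that split, because the first chunk holds only two sentences when the drop occurs.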

chunk(document)

Split document into chunks.

Parameters:
  • document (Document) – The fully loaded document to split.

Return type: list[Chunk]

Returns: Ordered list of non-overlapping (or slightly overlapping) chunks.