ractogateway.rag.chunkers

RAG text chunkers.

class ractogateway.rag.chunkers.BaseChunker

Bases: ABC

Split a Document into a list of Chunk objects.

Each chunk preserves provenance (doc_id, chunk_index, start_char, end_char) in its ChunkMetadata.

abstractmethod chunk(document)

Split document into chunks.

Parameters:

document (Document) – The fully-loaded document to split.

Return type:

list[Chunk]

Returns:

list[Chunk] – Ordered list of non-overlapping (or slightly overlapping) chunks.
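
As a rough illustration of the contract above, the following sketch defines local stand-ins for Document, Chunk and ChunkMetadata and a toy subclass. The field names doc_id, chunk_index, start_char and end_char come from the description above; the text attribute, the dataclass layout and ParagraphChunker itself are assumptions made for this example, not the library's actual definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical stand-in: only the fields this sketch needs.
    doc_id: str
    text: str


@dataclass
class ChunkMetadata:
    # Provenance fields named in the BaseChunker description.
    doc_id: str
    chunk_index: int
    start_char: int
    end_char: int


@dataclass
class Chunk:
    text: str
    metadata: ChunkMetadata


class BaseChunker(ABC):
    @abstractmethod
    def chunk(self, document: Document) -> list[Chunk]:
        """Split document into an ordered list of chunks."""


class ParagraphChunker(BaseChunker):
    """Illustrative subclass: one chunk per blank-line-separated paragraph."""

    def chunk(self, document: Document) -> list[Chunk]:
        chunks: list[Chunk] = []
        pos = 0
        for para in document.text.split("\n\n"):
            start, end = pos, pos + len(para)
            pos = end + 2  # skip the "\n\n" separator
            if not para.strip():
                continue
            meta = ChunkMetadata(document.doc_id, len(chunks), start, end)
            chunks.append(Chunk(para, meta))
        return chunks


doc = Document(doc_id="doc-1", text="First paragraph.\n\nSecond paragraph.")
chunks = ParagraphChunker().chunk(doc)
```

With offsets recorded this way, slicing the source text with start_char and end_char recovers each chunk's text, which is what makes the provenance useful for citation or highlighting.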

class ractogateway.rag.chunkers.FixedChunker(chunk_size=512, overlap=50)

Bases: BaseChunker

Split text into fixed-size character windows with overlap.

Parameters:
  • chunk_size (int) – Maximum number of characters per chunk.

  • overlap (int) – Number of characters to repeat at the start of the next chunk. Must be less than chunk_size.

chunk(document)

Split document into chunks.

Parameters:

document (Document) – The fully-loaded document to split.

Return type:

list[Chunk]

Returns:

list[Chunk] – Ordered list of non-overlapping (or slightly overlapping) chunks.
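
The interaction between chunk_size and overlap can be sketched with a plain sliding window. This is a standalone illustration, not FixedChunker's source: the helper name fixed_windows and the (start, end, text) tuple format are hypothetical, and handling of the final short window is an assumption.

```python
def fixed_windows(
    text: str, chunk_size: int = 512, overlap: int = 50
) -> list[tuple[int, int, str]]:
    """Return (start_char, end_char, text) windows over text."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and less than chunk_size")

    step = chunk_size - overlap  # each window starts this far after the last
    windows: list[tuple[int, int, str]] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        windows.append((start, end, text[start:end]))
        if end == len(text):
            break  # the final window already reaches the end of the text
        start += step
    return windows


windows = fixed_windows("abcdefghij", chunk_size=4, overlap=1)
# -> [(0, 4, 'abcd'), (3, 7, 'defg'), (6, 10, 'ghij')]
```

Because each window advances by chunk_size - overlap characters, an overlap equal to or larger than chunk_size would never make progress, which is why the parameter must be less than chunk_size.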

class ractogateway.rag.chunkers.RecursiveChunker(chunk_size=512, overlap=50, separators=None)

Bases: BaseChunker

Split text recursively using a priority list of separators.

Parameters:
  • chunk_size (int) – Maximum number of characters per chunk.

  • overlap (int) – Number of characters of overlap between consecutive chunks.

  • separators (list[str] | None) – Ordered list of separator strings to try. The first separator that produces pieces within chunk_size is used.

chunk(document)

Split document into chunks.

Parameters:

document (Document) – The fully-loaded document to split.

Return type:

list[Chunk]

Returns:

list[Chunk] – Ordered list of non-overlapping (or slightly overlapping) chunks.
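
One common way to implement separator-priority splitting is sketched below. It is a simplified reading of the description above, not RecursiveChunker's source: the separator list is an assumed example (the actual defaults used when separators=None are not documented here), the function name recursive_split is hypothetical, and overlap is left out to keep the recursion easy to follow.

```python
# Assumed example separators, coarsest first; not the library's defaults.
EXAMPLE_SEPARATORS = ["\n\n", "\n", ". ", " "]


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Return pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    pieces: list[str] = []
    current = ""
    for part in text.split(sep):
        # Greedily pack consecutive parts while they still fit.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            pieces.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with finer separators.
            pieces.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        pieces.append(current)
    return pieces


text = "Alpha beta.\n\nGamma delta epsilon zeta.\n\nEta."
pieces = recursive_split(text, chunk_size=20, separators=EXAMPLE_SEPARATORS)
# -> ['Alpha beta.', 'Gamma delta epsilon', 'zeta.', 'Eta.']
```

Ordering separators from coarse to fine keeps paragraphs and sentences intact where possible and only breaks on spaces or raw character positions when nothing larger fits.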

class ractogateway.rag.chunkers.SemanticChunker(embedder, threshold=0.5, min_chunk_size=2, language='english')

Bases: BaseChunker

Split documents where the semantic similarity between adjacent sentences drops below a threshold.

Parameters:
  • embedder (BaseEmbedder) – Any BaseEmbedder instance, used to embed sentences for the similarity comparison.

  • threshold (float) – Cosine similarity below which a split is inserted (default: 0.5).

  • min_chunk_size (int) – Minimum number of sentences per chunk (prevents ultra-fine splits).

  • language (str) – Language for the NLTK sentence tokenizer (default: "english").

chunk(document)

Split document into chunks.

Parameters:

document (Document) – The fully-loaded document to split.

Return type:

list[Chunk]

Returns:

list[Chunk] – Ordered list of non-overlapping (or slightly overlapping) chunks.

class ractogateway.rag.chunkers.SentenceChunker(sentences_per_chunk=5, overlap_sentences=1, language='english')

Bases: BaseChunker

Split text into groups of sentences using NLTK.

Parameters:
  • sentences_per_chunk (int) – Number of sentences per chunk.

  • overlap_sentences (int) – Number of sentences to repeat at the start of the next chunk.

  • language (str) – Language for the NLTK sentence tokenizer (default: "english").

chunk(document)

Split document into chunks.

Parameters:

document (Document) – The fully-loaded document to split.

Return type:

list[Chunk]

Returns:

list[Chunk] – Ordered list of non-overlapping (or slightly overlapping) chunks.
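
Sentence grouping with overlap can be sketched as follows. To stay self-contained, the example uses a naive regular expression in place of the NLTK sentence tokenizer; the helper names are hypothetical, and the requirement that overlap_sentences be smaller than sentences_per_chunk is an assumption of this sketch rather than a documented constraint.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive stand-in for an NLTK sentence tokenizer: split after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_windows(
    sentences: list[str], sentences_per_chunk: int = 5, overlap_sentences: int = 1
) -> list[list[str]]:
    """Group sentences into windows that share overlap_sentences sentences."""
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    if not 0 <= overlap_sentences < sentences_per_chunk:
        raise ValueError("overlap_sentences must be less than sentences_per_chunk")

    step = sentences_per_chunk - overlap_sentences
    groups: list[list[str]] = []
    for start in range(0, len(sentences), step):
        groups.append(sentences[start:start + sentences_per_chunk])
        if start + sentences_per_chunk >= len(sentences):
            break  # this window already covers the last sentence
    return groups


sents = split_sentences("One. Two. Three. Four. Five.")
groups = sentence_windows(sents, sentences_per_chunk=3, overlap_sentences=1)
# -> [['One.', 'Two.', 'Three.'], ['Three.', 'Four.', 'Five.']]
```

Repeating the last sentence of one chunk at the start of the next keeps some context across chunk boundaries, at the cost of a small amount of duplicated text in the index.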