ractogateway.rag.page_index._bm25

Pure-Python BM25 index and decision-tree inverted index.

No external dependencies are required; everything is implemented with the Python standard library.

Two components work together for two-stage retrieval:

  1. _DecisionIndex: an inverted keyword index that maps content terms to page entry IDs. Given a tokenised query, it returns the union of candidate entry IDs using one lookup per query term, i.e. O(|query terms|) lookups. This is the “decision tree” routing layer.

  2. BM25Index: an Okapi BM25 scorer (k1=1.5, b=0.75 by default) that ranks the candidates returned by the decision index. Only those candidates are scored, so a query does not re-rank the full corpus.
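
The routing stage described above can be sketched as a plain dictionary of posting sets. This is a minimal illustration only: TinyDecisionIndex, its add() and candidates() methods, and the sample entry IDs are hypothetical names chosen for the example, not the module's actual _DecisionIndex interface.

```python
from collections import defaultdict


class TinyDecisionIndex:
    """Hypothetical stand-in for an inverted index: term -> set of entry IDs."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, entry_id: str, terms: list[str]) -> None:
        # Record the entry under every term it contains.
        for term in terms:
            self._postings[term].add(entry_id)

    def candidates(self, query_terms: list[str]) -> set[str]:
        # One dictionary lookup per query term; the result is the union
        # of the posting sets that were found. Unknown terms add nothing.
        found: set[str] = set()
        for term in query_terms:
            found |= self._postings.get(term, set())
        return found


index = TinyDecisionIndex()
index.add("p1", ["bm25", "ranking"])
index.add("p2", ["inverted", "index"])
index.add("p3", ["ranking", "index"])

# Only entries sharing at least one term with the query become candidates.
candidate_ids = index.candidates(["ranking", "missing"])
```

The resulting candidate set is what a second-stage scorer would receive, so entries that share no term with the query are never scored.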

ractogateway.rag.page_index._bm25.extract_keywords(text, top_n=20)

Return the top-n most frequent content tokens from text.

Return type:

list[str]
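
A frequency-based keyword extractor of this kind might look like the sketch below. The tokenisation regex, the stopword list, and the name extract_keywords_sketch are assumptions made for illustration; the module's real tokeniser and stopword handling are not documented on this page.

```python
import re
from collections import Counter

# Hypothetical stopword list; the module's actual list is not shown here.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it"})


def extract_keywords_sketch(text: str, top_n: int = 20) -> list[str]:
    # Lowercase and split into alphanumeric tokens (assumed tokenisation).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Keep only "content" tokens: not a stopword, longer than one character.
    content = [t for t in tokens if t not in STOPWORDS and len(t) > 1]
    # most_common() orders by count; equal counts keep first-seen order.
    return [term for term, _ in Counter(content).most_common(top_n)]


keywords = extract_keywords_sketch(
    "The index of the index is the BM25 index and BM25 ranking", top_n=2
)
```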

class ractogateway.rag.page_index._bm25.BM25Index(k1=1.5, b=0.75)

Bases: object

Okapi BM25 scorer over a corpus of PageEntry texts.

Parameters:
  • k1 (float) – Term-frequency saturation parameter (default 1.5).

  • b (float) – Length normalisation parameter (default 0.75).
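
For reference, the standard Okapi BM25 formula shows where k1 and b enter. Here f(t, d) is the frequency of term t in entry d, |d| is the entry length in tokens, and avgdl is the mean entry length over the corpus. The exact IDF variant used by this class is not stated on this page, so IDF(t) is left abstract.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
```

A larger k1 lets repeated occurrences of a term keep raising the score for longer before it saturates. Setting b=0 disables length normalisation, while b=1 normalises fully by entry length.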

add(entry_id, text)

Tokenise text and add the entry to the index.

Return type:

None

remove(entry_id)

Remove entry_id from the index.

Return type:

None

clear()

Remove all entries from the index.

Return type:

None

score(query, candidate_ids=None)

Score candidates against query and return ranked results.

Parameters:
  • query (str) – Raw query string.

  • candidate_ids (set[str] | None) – Subset of entry IDs to score. When None the entire corpus is scored (full-scan fallback).

Return type:

list[tuple[str, float, list[str]]]

Returns:

list of (entry_id, bm25_score, matched_terms) – Sorted descending by score, ties broken by entry_id for stability.
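
The return shape and ordering described above can be illustrated with a small standalone scorer. This is a sketch under stated assumptions, not the class's implementation: bm25_rank is a hypothetical function, entries are passed in as pre-tokenised lists, the IDF variant log(1 + (N - df + 0.5) / (df + 0.5)) is assumed, and skipping entries with no matched term is also an assumption.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, candidate_ids=None, k1=1.5, b=0.75):
    """Rank docs (entry_id -> token list) against query_terms.

    Returns (entry_id, score, matched_terms) tuples. The IDF variant and
    the skipping of non-matching entries are assumptions of this sketch.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Average entry length; fall back to 1.0 if every entry is empty.
    avgdl = sum(len(tokens) for tokens in docs.values()) / n_docs or 1.0
    # Document frequency: number of entries containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    # None means full scan; otherwise score only the known candidates.
    if candidate_ids is None:
        ids = list(docs)
    else:
        ids = [i for i in candidate_ids if i in docs]

    results = []
    for entry_id in ids:
        tf = Counter(docs[entry_id])
        norm = k1 * (1 - b + b * len(docs[entry_id]) / avgdl)
        score, matched = 0.0, []
        for term in dict.fromkeys(query_terms):  # de-duplicate, keep order
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
            matched.append(term)
        if matched:
            results.append((entry_id, score, matched))
    # Descending score, then ascending entry_id so ties are deterministic.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


docs = {
    "b": ["bm25", "ranking"],
    "a": ["bm25", "ranking"],
    "c": ["inverted", "index", "bm25"],
}
ranked = bm25_rank(["ranking"], docs, candidate_ids={"a", "b", "c"})
```

Entries "a" and "b" have identical text and therefore identical scores, so the secondary sort on entry_id decides their order; "c" does not contain the query term and is left out in this sketch.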

property entry_count: int

Number of entries currently in the index.