ractogateway.rag.page_index._bm25

Pure-Python BM25 index and decision-tree inverted index.

No external dependencies required — everything is implemented with the Python standard library.

Two components work together for two-stage retrieval:

_DecisionIndex: an inverted keyword index that maps content terms to page entry IDs. Given a tokenised query, it returns the union of candidate entry IDs using one index lookup per query term. This is the “decision tree” routing layer.
BM25Index: an Okapi BM25 scorer (defaults k1=1.5, b=0.75) that ranks the candidates returned by the decision index. Only those candidates are scored, so the full corpus is not re-ranked on every query.
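The routing stage described above can be sketched as a small inverted index. This is a minimal illustration, not the module's code: the class name KeywordIndex, the candidates method, and the regex tokeniser are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokeniser)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class KeywordIndex:
    """Hypothetical stand-in for the routing layer: term -> set of entry IDs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, entry_id, text):
        # Record the entry once under each distinct term it contains.
        for term in set(tokenize(text)):
            self._postings[term].add(entry_id)

    def candidates(self, query):
        """Union the posting sets of the query terms, one lookup per term."""
        found = set()
        for term in tokenize(query):
            # .get avoids creating empty postings for unknown terms.
            found |= self._postings.get(term, set())
        return found


index = KeywordIndex()
index.add("p1", "BM25 ranks pages by term frequency")
index.add("p2", "Inverted indexes map terms to pages")
index.add("p3", "Cooking pasta at home")

# Only entries sharing at least one term with the query become candidates.
candidates = index.candidates("bm25 pages")
```

The resulting set is what a second-stage scorer would receive, so entries such as "p3" that share no term with the query are never scored.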

ractogateway.rag.page_index._bm25.extract_keywords(text, top_n=20)

Return the top-n most frequent content tokens from text.
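As a rough sketch of frequency-based keyword extraction, the example below counts non-stopword tokens and keeps the most frequent ones. The function name top_keywords, the stopword list, the tokeniser, and the alphabetical tie-break are assumptions for illustration; the module's own choices are not documented here.

```python
import re
from collections import Counter

# Assumed stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def top_keywords(text, top_n=20):
    """Return up to top_n non-stopword tokens, most frequent first."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


keywords = top_keywords("the index scores the index and the query", top_n=2)
```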

class ractogateway.rag.page_index._bm25.BM25Index(k1=1.5, b=0.75)

Okapi BM25 scorer over a corpus of PageEntry texts.
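For reference, the standard Okapi BM25 score of an entry D for a query Q is shown below. Here f(q, D) is the frequency of term q in D, |D| is the entry length in tokens, avgdl is the average entry length in the corpus, N is the number of entries, and n(q) is the number of entries containing q. The IDF form shown is one common variant; which variant this module uses is not stated here, so treat it as an assumption.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
```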

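Parameters:

k1 – Term-frequency saturation parameter. Defaults to 1.5.
b – Document-length normalisation parameter. Defaults to 0.75.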

add(entry_id, text)

Tokenise text and add the entry to the index.

remove(entry_id)

Remove entry_id from the index.

score(query, candidate_ids=None)

Score candidates against query and return ranked results.

Parameters:

query (str) – Raw query string.
candidate_ids (set[str] | None) – Subset of entry IDs to score. When None the entire corpus is scored (full-scan fallback).

Return type:

list[tuple[str, float, list[str]]]

Returns:

list of (entry_id, bm25_score, matched_terms) – Sorted descending by score, ties broken by entry_id for stability.
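To make the scoring contract concrete, here is a self-contained sketch that mirrors the documented add, remove, and score signatures. It is not the library's implementation: the class name TinyBM25, the tokeniser, the IDF variant, and the decision to omit entries that match no query term are all assumptions of this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Assumed tokeniser: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Illustrative BM25 scorer; not the library's implementation."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = {}  # entry_id -> Counter of term frequencies

    def add(self, entry_id, text):
        self.docs[entry_id] = Counter(tokenize(text))

    def remove(self, entry_id):
        self.docs.pop(entry_id, None)

    def score(self, query, candidate_ids=None):
        n_docs = len(self.docs)
        if n_docs == 0:
            return []
        avgdl = sum(sum(tf.values()) for tf in self.docs.values()) / n_docs
        # None means full scan; otherwise score only known candidates.
        ids = set(self.docs) if candidate_ids is None else set(candidate_ids) & set(self.docs)
        terms = sorted(set(tokenize(query)))
        results = []
        for entry_id in ids:
            tf = self.docs[entry_id]
            dl = sum(tf.values())
            total, matched = 0.0, []
            for term in terms:
                if tf[term] == 0:
                    continue
                df = sum(1 for d in self.docs.values() if term in d)
                idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
                norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / avgdl)
                total += idf * tf[term] * (self.k1 + 1) / norm
                matched.append(term)
            if matched:  # assumption: entries with no matching term are omitted
                results.append((entry_id, total, matched))
        # Descending score, ties broken by entry_id for a stable order.
        results.sort(key=lambda r: (-r[1], r[0]))
        return results


bm25 = TinyBM25()
bm25.add("p1", "bm25 ranking of pages")
bm25.add("p2", "bm25 bm25 scoring")
bm25.add("p3", "unrelated cooking notes")
ranked = bm25.score("bm25 ranking", candidate_ids={"p1", "p2"})
```

In this sketch "p1" ranks first because it matches both query terms, and "p3" is never scored because it is outside the candidate set. Corpus statistics such as avgdl and document frequency are still computed over all entries, which is one reasonable design but not necessarily the module's.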