Retrievers¶
Dense, sparse, and hybrid retrievers expose Gold chunks through vector, lexical, and fused search APIs. They feed GraphRAG rerankers, benchmarks, and the app UIs (React frontend in production, Streamlit for legacy testing).
Inputs & Outputs¶
Inputs: Gold buckets (gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet), environment keys GEMINI_KEY, optional OPENAI_KEY (for GraphRAG rerankers), CT_DATA_ROOT.
Outputs:
- Dense (Chroma) persistence: vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/ containing chroma.sqlite3, index shards, collection_metadata.json.
- BM25 persistence: vectordb_bm25/<dataset>/<date>/index.json, bm25_text.pkl, optionally bm25_summary.pkl, token cache.
- Hybrid state: optional latency logs (BENCH_LATENCY_LOG, HYB_LATENCY_LOG), fused hit lists returned to callers (not persisted by default).
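The persisted indexes follow a <root>/<dataset>/<date>/ layout. A minimal sketch of resolving such a directory, falling back to the newest build when no date is given. It assumes date folders are named YYYY-MM-DD; resolve_index_dir is a hypothetical helper for illustration, not part of the pipeline.

```python
from pathlib import Path
from typing import Optional


def resolve_index_dir(root: Path, dataset: str, date: Optional[str] = None) -> Path:
    """Return <root>/<dataset>/<date>, picking the newest date folder if date is None.

    Hypothetical helper: assumes date folders are named YYYY-MM-DD.
    """
    dataset_dir = root / dataset
    if date is None:
        # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
        dates = sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())
        if not dates:
            raise FileNotFoundError(f"no dated builds under {dataset_dir}")
        date = dates[-1]
    return dataset_dir / date
```

For example, resolve_index_dir(Path("vectordb_bm25"), "fixed_size") would return the most recent dated folder under vectordb_bm25/fixed_size/.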
Parameters¶
Dense retriever (python -m dense)¶
- build options: --dataset, --date (default latest), --persist (path override), --batch (default 128), --limit (debug), --use-summary / --use-text.
- Embedding model/environment: DENSE_EMBED_MODEL (text-embedding-004 default), DENSE_PERSIST_DIR root, COMM_PROGRESS_RICH toggles progress UI.
- Query options: --top-k (default 10), --hydrate (adds Gold text/metadata), --json.
- Python API: DenseRetriever(collection, top_k=50) with .search(query, top_k) returning list of {id, score, doc, metadata}.
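To make the hit shape concrete, here is a rough sketch of cosine-similarity ranking that returns records shaped like {id, score, doc, metadata}. It is an illustration only: the actual retriever queries the Chroma collection, and top_k_hits, cosine, and the in-memory docs list with a "vector" field are hypothetical names introduced here.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_hits(query_vec: Sequence[float], docs: List[Dict], top_k: int = 10) -> List[Dict]:
    """Rank in-memory docs by cosine similarity to the query vector.

    Hypothetical stand-in for a vector-store query; each doc is assumed
    to carry "id", "vector", "doc", and optionally "metadata".
    """
    hits = [
        {
            "id": d["id"],
            "score": cosine(query_vec, d["vector"]),
            "doc": d["doc"],
            "metadata": d.get("metadata", {}),
        }
        for d in docs
    ]
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:top_k]
```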
BM25 retriever (python -m bm25)¶
- build options: --use-summary (default False = full text), --k1 1.2, --b 0.75, --sharding {mono|lang}, --limit, --persist.
- Tokenization: Snowball stemming for supported languages, optional jieba for CJK; env BM25_PERSIST_DIR for root path.
- Query options: --top-k (default 10), --use-summary / --use-text choose index file, --route (auto|all|<lang> for language shards), --hydrate, --json.
- Python API: load_bm25(dataset, date, persist_dir, mode) returns object with .search(query, top_k, route).
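For readers tuning --k1 and --b, the sketch below shows how the two parameters enter an Okapi BM25 score: k1 controls term-frequency saturation, and b controls document-length normalization. It uses the non-negative log(1 + ...) IDF variant and pre-tokenized input. This is a generic textbook formulation, not the project's implementation, and bm25_scores is a hypothetical name.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query_tokens: Sequence[str],
    docs: List[Sequence[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: b=0 ignores length, b=1 normalizes fully.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```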
Hybrid & Graph-aware retrieval¶
- Module rag.retrieval.hybrid orchestrates BM25 + dense seeds, optional graph expansion, and reciprocal rank fusion (RRF).
- Key knobs:
  - Seed sizes: seed_k (default 120), hyb_seed_k CLI flag in benchmarks.
  - Fusion weights: HYB_W_BM25 (1.0), HYB_W_DENSE (1.5), HYB_W_GRAPH (1.0), HYB_RRF_K (60).
  - Graph expansion: requires Neo4j + community ingest tag; expand_ratio (2.0) × seeds, expand_limit (800), k_comms (communities to expand).
  - Re-ranking: --rerank-mode dense (default) uses prebuilt dense vectors; summary embeds chunk summaries ad hoc; none disables rerank.
- Integration points: python -m communities retrieve exposes fused results; rag.bench.cli run --adapter hybrid automates evaluation; both the production React frontend (via API) and legacy Streamlit app use the same hybrid pipeline.
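The fusion weights and HYB_RRF_K are easier to reason about with the formula in view. The sketch below implements standard weighted reciprocal rank fusion, where each source contributes weight / (k + rank) per chunk with 1-based ranks. It is assumed, not confirmed, that the module's fusion follows this form; weighted_rrf is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_rrf(
    rank_map: Dict[str, Dict[str, int]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse per-source rankings into one list, best score first.

    rank_map maps source name -> {chunk_id: 1-based rank}.
    Sources without an explicit weight default to 1.0.
    """
    fused = defaultdict(float)
    for source, ranks in rank_map.items():
        weight = weights.get(source, 1.0)
        for chunk_id, rank in ranks.items():
            # A larger k flattens the gap between top and lower ranks.
            fused[chunk_id] += weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

With weights of 1.0 for BM25 and 1.5 for dense, a chunk ranked first by dense alone outscores a chunk ranked first by BM25 alone, and a chunk found by both sources outscores either.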
Step-by-step Tasks¶
1. Build dense index¶
- ⬜ Command: Bash
- ✅ Output: vectordb_dense/gemini_text-embedding-004/fixed_size/2025-09-14/ with chroma.sqlite3 and metadata JSON.
- ⬜ Python check:

      from pathlib import Path
      from rag.dense.loader import DenseRetriever, load_chroma

      coll = load_chroma(
          Path("vectordb_dense/gemini_text-embedding-004/fixed_size/2025-09-14"),
          "fixed_size_2025-09-14",
      )
      retr = DenseRetriever(coll, top_k=5)
      hits = retr.search("green hydrogen electrolyzer")
      assert hits

- ⬜ Troubleshooting: If embeddings fail, confirm GEMINI_KEY and install chromadb, sentence-transformers (included via pip install cleantech-pipeline[chunk]).
2. Build BM25 index¶
- ⬜ Command:

      python -m bm25 build --dataset fixed_size --date 2025-09-14 --use-text --k1 1.2 --b 0.75 --sharding lang

- ✅ Output: vectordb_bm25/fixed_size/2025-09-14/index.json summarizing shards; .pkl files for text/summary indexes.
- ⬜ Python check:

      from rag.bm25.retriever import load_bm25

      bm25 = load_bm25(dataset="fixed_size", date="2025-09-14", mode="text")
      hits = bm25.search("solid-state battery", top_k=5, route="auto")
      assert hits

- ⬜ Troubleshooting: Warnings about a missing jieba are safe to ignore; for multilingual corpora, run pip install jieba. If the punkt tokenizer is missing, run python -m nltk.downloader punkt.
3. Hybrid retrieval sanity¶
- ⬜ Graph-free fusion (BM25 + dense only):

      from rag.retrieval.hybrid import _seed_hits, _rrf_fuse

      bm25_hits, dense_hits = _seed_hits(
          "direct air capture financing",
          dataset="fixed_size",
          date="2025-09-14",
          seed_k=60,
      )
      rank_map = {
          "bm25": {cid: i + 1 for i, (cid, _) in enumerate(bm25_hits[:60])},
          "dense": {cid: i + 1 for i, (cid, _) in enumerate(dense_hits[:60])},
      }
      fused = _rrf_fuse(rank_map, {"bm25": 1.0, "dense": 1.5})

- ⬜ Graph expansion via CLI:

      python -m communities retrieve "direct air capture" --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --k-comms 24 --top-k 50 --rerank --hydrate

- ✅ Output: Ranked chunks with doc IDs, score field reflecting fused + rerank scores; hydrated text snippet displayed.
- ⬜ Troubleshooting:
  - Graph expansion requires NEO4J_* env and dense index (for rerank). Set --no-rerank if dense index missing.
  - High latency? Capture with HYB_LATENCY_LOG=bench_out/fixed_size/2025-09-14/evals/latency_stages.jsonl and inspect via python -m rag.bench.cli evaluate --stage-log ....
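For a quick look at a captured latency log without the benchmark CLI, a sketch like the following can summarize timings per stage. The record schema is an assumption: each JSONL line is taken to be an object with a "stage" name and an "ms" duration, which may not match the real log fields. summarize_stage_latency is a hypothetical helper.

```python
import json
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable


def summarize_stage_latency(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Group JSONL latency records by stage and report count, median, and max.

    Assumed (unverified) schema: {"stage": <str>, "ms": <number>} per line.
    """
    by_stage = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        by_stage[record["stage"]].append(float(record["ms"]))
    return {
        stage: {
            "count": len(values),
            "median_ms": median(values),
            "max_ms": max(values),
        }
        for stage, values in by_stage.items()
    }
```

Passing an open file handle works directly, since iterating a text file yields one line at a time.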
Validation & Quality Gates¶
- Dense: python -m dense query should return non-empty hits for canonical queries ("hydrogen electrolyzer", "floating offshore wind"). Expect cosine scores ≥0.55 for the top match.
- BM25: python -m bm25 query --use-text "floating offshore wind" returns lexical hits with BM25 scores >5.
- Hybrid: ensure the fused list contains at least 30% overlap with Gold doc IDs for benchmark QA (rag.bench.cli evaluate >0.45 map).
- Hydration: when --hydrate is set, verify metadata_full contains Gold metadata including entities and chunk_summary.
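The numeric gates above can be expressed as a small check. This sketch reads "overlap" as the fraction of Gold doc IDs that appear in the fused list, which is an interpretation rather than something the pipeline confirms; gold_overlap and check_gates are hypothetical names.

```python
from typing import Dict, Iterable


def gold_overlap(fused_doc_ids: Iterable[str], gold_doc_ids: Iterable[str]) -> float:
    """Fraction of Gold doc IDs present in the fused result list."""
    gold = set(gold_doc_ids)
    if not gold:
        return 0.0
    return len(gold & set(fused_doc_ids)) / len(gold)


def check_gates(dense_top_score: float, bm25_top_score: float, overlap: float) -> Dict[str, bool]:
    """Apply the documented thresholds: cosine >= 0.55, BM25 > 5, overlap >= 30%."""
    return {
        "dense": dense_top_score >= 0.55,
        "bm25": bm25_top_score > 5.0,
        "hybrid": overlap >= 0.30,
    }
```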
Reproducibility¶
- Dense embeddings are deterministic given the same chunks_enriched.parquet and model version; keep persist directories under version control when promoting to prod.
- BM25 indexes include index.json with build parameters (k1, b, mode, sharding); archive it alongside the release.
- Hybrid retrieval uses deterministic fusion if seeds/hits are identical; record weight/env overrides (HYB_*) in run logs.
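One lightweight way to record the HYB_* overrides is to snapshot them from the environment at the start of a run. The sketch below is a suggestion under stated assumptions: the record layout, the run_id field, and the helper names hyb_overrides and run_log_record are all hypothetical.

```python
import json
import os
from typing import Dict, Mapping, Optional


def hyb_overrides(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return all HYB_* variables from the given environment, sorted by name."""
    env = os.environ if env is None else env
    return {key: value for key, value in sorted(env.items()) if key.startswith("HYB_")}


def run_log_record(run_id: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Serialize a run-log entry with the active HYB_* overrides as JSON."""
    return json.dumps({"run_id": run_id, "hyb_env": hyb_overrides(env)}, sort_keys=True)
```

Writing one such line per run next to the results makes it possible to tell later which fusion weights produced a given ranking.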
Related notebooks¶
- 04_subsample_unified_chunked — chunk distribution before indexing.
- 05_metadata_extraction — enriched metadata used for rerank diagnostics.
See also¶
- RAG overview and GraphRAG.
- Evaluation workflows: RAG benchmarks.
- Demo usage: App.