Retrievers¶

Dense, sparse, and hybrid retrievers expose Gold chunks through vector, lexical, and fused search APIs. They feed GraphRAG rerankers, benchmarks, and the app UIs (React frontend in production, Streamlit for legacy testing).


Inputs & Outputs¶

Inputs: Gold buckets (gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet), environment keys GEMINI_KEY, optional OPENAI_KEY (for GraphRAG rerankers), CT_DATA_ROOT.

Outputs:

Dense (Chroma) persistence: vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/ containing chroma.sqlite3, index shards, collection_metadata.json.
BM25 persistence: vectordb_bm25/<dataset>/<date>/index.json, bm25_text.pkl, optionally bm25_summary.pkl, token cache.
Hybrid state: optional latency logs (BENCH_LATENCY_LOG, HYB_LATENCY_LOG), fused hit lists returned to callers (not persisted by default).
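The directory layout above can be resolved programmatically. The following is a minimal sketch, assuming all three trees share one root (for example the value of CT_DATA_ROOT) and that date directories are ISO formatted; the helper names `latest_date` and `index_paths` are hypothetical and not part of the pipeline.

```python
from pathlib import Path
from typing import Dict, Optional

# Directory name copied from the dense persistence layout described above.
DENSE_MODEL_DIR = "gemini_text-embedding-004"


def latest_date(dataset_dir: Path) -> Optional[str]:
    """Return the newest <date> subdirectory name, or None if none exists.

    Assumes ISO dates (YYYY-MM-DD), so lexicographic order is chronological.
    """
    if not dataset_dir.is_dir():
        return None
    dates = sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())
    return dates[-1] if dates else None


def index_paths(root: Path, dataset: str, date: str) -> Dict[str, Path]:
    """Build the Gold, dense, and BM25 locations for one dataset/date pair."""
    return {
        "gold": root / "gold_subsample_chunk" / dataset / date / "chunks_enriched.parquet",
        "dense": root / "vectordb_dense" / DENSE_MODEL_DIR / dataset / date,
        "bm25": root / "vectordb_bm25" / dataset / date,
    }
```

If DENSE_PERSIST_DIR or BM25_PERSIST_DIR point elsewhere, the corresponding entries would need to be built from those roots instead.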

Parameters¶

Dense retriever (`python -m dense`)¶

build options: --dataset, --date (default latest), --persist (path override), --batch (default 128), --limit (debug), --use-summary/--use-text.
Embedding model/environment: DENSE_EMBED_MODEL (text-embedding-004 default), DENSE_PERSIST_DIR root, COMM_PROGRESS_RICH toggles progress UI.
Query options: --top-k (default 10), --hydrate (adds Gold text/metadata), --json.
Python API: DenseRetriever(collection, top_k=50) with .search(query, top_k) returning list of {id, score, doc, metadata}.
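Conceptually, dense search ranks chunks by similarity between the query embedding and stored chunk embeddings. The sketch below illustrates that idea in pure Python and returns hits in the {id, score, doc, metadata} shape listed above. It is not the DenseRetriever implementation, which delegates to Chroma; `cosine`, `top_k_hits`, and the `embedding` record field are names assumed here for illustration.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_hits(
    query_vec: Sequence[float],
    records: List[Dict[str, Any]],
    top_k: int = 10,
) -> List[Dict[str, Any]]:
    """Score every record against the query and keep the best top_k."""
    hits = [
        {
            "id": rec["id"],
            "score": cosine(query_vec, rec["embedding"]),
            "doc": rec["doc"],
            "metadata": rec.get("metadata", {}),
        }
        for rec in records
    ]
    # Highest similarity first.
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:top_k]
```

A brute-force scan like this is only practical for small collections; a vector store replaces it with an approximate nearest-neighbor index.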

BM25 retriever (`python -m bm25`)¶

build options: --use-summary (default False = full text), --k1 1.2, --b 0.75, --sharding {mono|lang}, --limit, --persist.
Tokenization: Snowball stemming for supported languages, optional jieba for CJK; env BM25_PERSIST_DIR for root path.
Query options: --top-k (default 10), --use-summary/--use-text choose index file, --route (auto|all|<lang> for language shards), --hydrate, --json.
Python API: load_bm25(dataset, date, persist_dir, mode) returns object with .search(query, top_k, route).
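To make the roles of --k1 and --b concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It assumes the Lucene-style IDF, ln(1 + (N - n + 0.5) / (n + 0.5)); the persisted index may use a different variant, and `bm25_scores` is a hypothetical name rather than part of the bm25 module.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query_tokens: Sequence[str],
    docs: List[Sequence[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalization; k1 controls term-frequency saturation.
            denom = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / denom
        scores.append(score)
    return scores
```

Raising k1 lets repeated terms keep adding to the score for longer, while b = 0 disables length normalization entirely.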

Hybrid & Graph-aware retrieval¶

Module rag.retrieval.hybrid orchestrates BM25 + dense seeds, optional graph expansion, and reciprocal rank fusion (RRF).
Key knobs:
Seed sizes: seed_k (default 120), hyb_seed_k CLI flag in benchmarks.
Fusion weights: HYB_W_BM25 (1.0), HYB_W_DENSE (1.5), HYB_W_GRAPH (1.0), HYB_RRF_K (60).
Graph expansion: requires Neo4j + community ingest tag; expand_ratio (2.0) × seeds, expand_limit (800), k_comms (communities to expand).
Re-ranking: --rerank-mode dense (default) uses prebuilt dense vectors; summary embeds chunk summaries ad hoc; none disables rerank.
Integration points: python -m communities retrieve exposes fused results; rag.bench.cli run --adapter hybrid automates evaluation; both the production React frontend (via API) and legacy Streamlit app use the same hybrid pipeline.
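The fusion step can be illustrated with a small weighted reciprocal rank fusion sketch: each source contributes weight / (k + rank) for every chunk it returned, and the contributions are summed. This is an illustration under stated assumptions, not the code in rag.retrieval.hybrid; the function name `weighted_rrf` and the tie-break on chunk ID are choices made here.

```python
from typing import Dict, List, Tuple


def weighted_rrf(
    rank_map: Dict[str, Dict[str, int]],
    weights: Dict[str, float],
    rrf_k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse per-source rankings into one list sorted by fused score.

    rank_map maps source name -> {chunk_id: 1-based rank}.
    Sources missing from weights default to a weight of 1.0.
    """
    fused: Dict[str, float] = {}
    for source, ranks in rank_map.items():
        weight = weights.get(source, 1.0)
        for chunk_id, rank in ranks.items():
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight / (rrf_k + rank)
    # Sort by descending score; break ties by chunk ID for stable output.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    ranks = {
        "bm25": {"a": 1, "b": 2},
        "dense": {"b": 1, "c": 2},
    }
    for chunk_id, score in weighted_rrf(ranks, {"bm25": 1.0, "dense": 1.5}):
        print(chunk_id, round(score, 5))
```

Because only ranks enter the formula, BM25 and cosine scores never need to be put on a common scale; the weights and rrf_k are the only tuning knobs.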

Step-by-step Tasks¶

1. Build dense index¶

⬜ Command:

Bash
python -m dense build --dataset fixed_size --date 2025-09-14 --use-summary --batch 128

✅ Output: vectordb_dense/gemini_text-embedding-004/fixed_size/2025-09-14/ with chroma.sqlite3 and metadata JSON.
⬜ Python check:

Python
from pathlib import Path
from rag.dense.loader import DenseRetriever, load_chroma

coll = load_chroma(
    Path("vectordb_dense/gemini_text-embedding-004/fixed_size/2025-09-14"),
    "fixed_size_2025-09-14",
)
retr = DenseRetriever(coll, top_k=5)
hits = retr.search("green hydrogen electrolyzer")
assert hits

⬜ Troubleshooting: If embeddings fail, confirm GEMINI_KEY and install chromadb, sentence-transformers (included via pip install cleantech-pipeline[chunk]).

2. Build BM25 index¶

⬜ Command:

Bash
python -m bm25 build --dataset fixed_size --date 2025-09-14 --use-text --k1 1.2 --b 0.75 --sharding lang

✅ Output: vectordb_bm25/fixed_size/2025-09-14/index.json summarizing shards; .pkl files for text/summary indexes.
⬜ Python check:

Python
from rag.bm25.retriever import load_bm25

bm25 = load_bm25(dataset="fixed_size", date="2025-09-14", mode="text")
hits = bm25.search("solid-state battery", top_k=5, route="auto")
assert hits

⬜ Troubleshooting: Warnings about a missing jieba package are safe to ignore; for multilingual corpora that include CJK text, run pip install jieba. If the punkt tokenizer is missing, run python -m nltk.downloader punkt.

3. Hybrid retrieval sanity¶

⬜ Graph-free fusion (BM25 + dense only):

Python
from rag.retrieval.hybrid import _seed_hits, _rrf_fuse

bm25_hits, dense_hits = _seed_hits(
    "direct air capture financing",
    dataset="fixed_size",
    date="2025-09-14",
    seed_k=60,
)
rank_map = {
    "bm25": {cid: i + 1 for i, (cid, _) in enumerate(bm25_hits[:60])},
    "dense": {cid: i + 1 for i, (cid, _) in enumerate(dense_hits[:60])},
}
fused = _rrf_fuse(rank_map, {"bm25": 1.0, "dense": 1.5})

⬜ Graph expansion via CLI:

Bash
python -m communities retrieve "direct air capture" --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --k-comms 24 --top-k 50 --rerank --hydrate

✅ Output: Ranked chunks with doc IDs, score field reflecting fused + rerank scores; hydrated text snippet displayed.
⬜ Troubleshooting:
Graph expansion requires the NEO4J_* environment variables and a dense index (for rerank). Pass --no-rerank if the dense index is missing.
High latency? Capture with HYB_LATENCY_LOG=bench_out/fixed_size/2025-09-14/evals/latency_stages.jsonl and inspect via python -m rag.bench.cli evaluate --stage-log ....
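For a quick look at a captured latency log without the benchmark CLI, a sketch like the following can aggregate per-stage timings. The log schema is not documented here, so the field names `stage` and `ms` are assumptions, as is the helper name `summarize_latency`; adjust the keys to match the actual JSONL records.

```python
import json
import statistics
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Union


def summarize_latency(
    log_path: Union[str, Path],
    stage_key: str = "stage",  # assumed field name
    ms_key: str = "ms",        # assumed field name
) -> Dict[str, Dict[str, float]]:
    """Aggregate count, median, and max latency per stage from a JSONL file."""
    per_stage: Dict[str, List[float]] = defaultdict(list)
    with Path(log_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Skip records that lack either assumed field.
            if stage_key in record and ms_key in record:
                per_stage[record[stage_key]].append(float(record[ms_key]))
    return {
        stage: {
            "count": len(values),
            "median_ms": statistics.median(values),
            "max_ms": max(values),
        }
        for stage, values in per_stage.items()
    }
```

Comparing the median against the maximum per stage helps separate a consistently slow stage from occasional outliers.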

Validation & Quality Gates¶

Dense: python -m dense query should return non-empty hits for canonical queries ("hydrogen electrolyzer", "floating offshore wind"). Expect cosine scores ≥0.55 for top match.
BM25: python -m bm25 query --use-text "floating offshore wind" returns lexical hits with BM25 scores >5.
Hybrid: ensure the fused list overlaps at least 30% with the Gold doc IDs for benchmark QA (rag.bench.cli evaluate should report MAP > 0.45).
Hydration: when --hydrate is set, verify that metadata_full contains the Gold metadata, including entities and chunk_summary.
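The hybrid overlap gate can be checked with a few lines of Python. This sketch assumes that "overlap" means the fraction of Gold doc IDs that appear anywhere in the fused list; the function names `overlap_ratio` and `passes_hybrid_gate` are hypothetical.

```python
from typing import Iterable


def overlap_ratio(fused_doc_ids: Iterable[str], gold_doc_ids: Iterable[str]) -> float:
    """Fraction of Gold doc IDs present in the fused result list."""
    gold = set(gold_doc_ids)
    if not gold:
        return 0.0
    return len(gold & set(fused_doc_ids)) / len(gold)


def passes_hybrid_gate(
    fused_doc_ids: Iterable[str],
    gold_doc_ids: Iterable[str],
    min_overlap: float = 0.30,
) -> bool:
    """True when the overlap meets the 30% threshold from the gate above."""
    return overlap_ratio(fused_doc_ids, gold_doc_ids) >= min_overlap
```

If the gate is instead defined relative to the fused list length, swap the denominator accordingly.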

Reproducibility¶

Dense embeddings are deterministic given the same chunks_enriched.parquet and model version; keep persist directories under version control when promoting to prod.
BM25 indexes include index.json with the build parameters (k1, b, mode, sharding); archive it alongside each release.
Hybrid fusion is deterministic when the seeds and hits are identical; record weight and environment overrides (HYB_*) in run logs.
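Recording the HYB_* overrides can be as simple as snapshotting the environment at the start of a run. A minimal sketch, with hypothetical helper names and a log record shape (`hyb_env`) chosen for illustration:

```python
import json
import os
from typing import Dict, Mapping, Optional


def hyb_overrides(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect every HYB_* variable from the environment, sorted by name."""
    env = os.environ if environ is None else environ
    return {key: env[key] for key in sorted(env) if key.startswith("HYB_")}


def run_log_line(environ: Optional[Mapping[str, str]] = None) -> str:
    """Serialize the overrides as one JSON line suitable for a run log."""
    return json.dumps({"hyb_env": hyb_overrides(environ)}, sort_keys=True)


if __name__ == "__main__":
    print(run_log_line())
```

Variables that are unset simply do not appear, so an empty `hyb_env` object indicates the run used the defaults.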

04_subsample_unified_chunked — chunk distribution before indexing.
05_metadata_extraction — enriched metadata used for rerank diagnostics.