RAG
The Retrieval-Augmented Generation stack consumes Gold chunks to build graph, dense, and sparse indices. It powers GraphRAG exploration, hybrid retrievers, automated benchmarks, and the interactive apps (production React frontend + legacy Streamlit fallback).
Inputs & Outputs
- Inputs: Gold chunks (`chunks_enriched.parquet` for `fixed_size` and `semantic`), provider keys (`GEMINI_KEY`, `OPENAI_KEY`), Neo4j credentials (`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`).
- Primary outputs:
    - Graph CSV + Neo4j ingest: `gold_subsample_chunk/<dataset>/<date>/gr_nodes.csv`, `gr_edges.csv`, populated `Chunk`, `Community`, `Entity` nodes.
    - Dense store: `vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/` (Chroma persistence).
    - BM25 store: `vectordb_bm25/<dataset>/<date>/` (`bm25_text.pkl`, `bm25_summary.pkl`, manifest `index.json`).
    - Benchmark artifacts: `bench_out/<dataset>/<date>/qa/*.parquet`, `bench_out/<dataset>/<date>/evals/*.json`, leaderboard CSV, latency logs.
    - App configuration/state: frontend session cookies and optional `.streamlit/secrets.toml` for the legacy fallback app.
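The layout above can be checked mechanically before a run. The following is a minimal sketch, not part of the project code: the function names are hypothetical, and it assumes all stores live under one common root directory, which the list above does not state.

```python
from pathlib import Path


def expected_artifacts(root: Path, dataset: str, date: str) -> dict[str, Path]:
    """Map a short label to each artifact path named in the outputs list.

    Assumption: every store sits under the same `root`; adjust if the
    dense or BM25 stores are persisted elsewhere.
    """
    bucket = Path(dataset) / date
    gold = root / "gold_subsample_chunk" / bucket
    bm25 = root / "vectordb_bm25" / bucket
    return {
        "gold_chunks": gold / "chunks_enriched.parquet",
        "graph_nodes": gold / "gr_nodes.csv",
        "graph_edges": gold / "gr_edges.csv",
        "dense_store": root / "vectordb_dense" / "gemini_text-embedding-004" / bucket,
        "bm25_text": bm25 / "bm25_text.pkl",
        "bm25_summary": bm25 / "bm25_summary.pkl",
        "bm25_manifest": bm25 / "index.json",
    }


def missing_artifacts(paths: dict[str, Path]) -> list[str]:
    """Return the labels whose path does not exist on disk, sorted by label."""
    return sorted(label for label, path in paths.items() if not path.exists())


if __name__ == "__main__":
    paths = expected_artifacts(Path("cleantech_data"), "fixed_size", "2025-09-14")
    for label in missing_artifacts(paths):
        print(f"missing: {label} -> {paths[label]}")
```

Running the script prints one line per missing artifact, which makes it easy to see which build step still has to run.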
Parameters
- Shared config: `CT_DATA_ROOT`, dataset selectors (`fixed_size`, `semantic`), run date (defaults to latest bucket), top-k defaults (`communities.retrieve --top-k 100`, dense/BM25 `--top-k 10`).
- Graph: `graphbuild csv/ingest` options for `--dataset`, `--date`, `--ingest-tag`, `--delete-tag`, `--batch-size`. Communities CLI exposes `--levels`, `--min-weight`, `--min-size`, `--k-comms`, `--rerank-mode`.
- Dense retriever: `dense build --batch 128 --use-summary/--use-text --persist`, environment `DENSE_PERSIST_DIR`, embed model via `DENSE_EMBED_MODEL` (default `text-embedding-004`).
- BM25 retriever: `bm25 build --k1 1.2 --b 0.75 --use-summary/--use-text --sharding {mono|lang}`, optional `BM25_PERSIST_DIR`.
- Hybrid retrieval: weights `HYB_W_BM25` (default 1.0), `HYB_W_DENSE` (1.5), `HYB_W_GRAPH` (1.0), reciprocal rank fusion `HYB_RRF_K` (60), seed budget `HYB_SEED_K` (120), expansion ratio/limit (`2.0`, `800`).
- Benchmarks: adapters (`bm25`, `dense`, `graphrag`, `hybrid`), `--top-k`, optional `BENCH_LATENCY_LOG`, `HYB_LATENCY_LOG` for latency captures.
- Apps: production frontend calls API routes (`/api/*`), while legacy Streamlit sidebar controls map to `communities.retrieve` parameters. Environment keys like `COMM_INGEST_TAG`, `NEO4J_DATABASE_FIXED_SIZE`, and `DENSE_PERSIST_DIR` pre-fill defaults.
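To make the hybrid weights and the RRF constant concrete, here is a minimal sketch of weighted reciprocal rank fusion. It is an illustration under assumptions, not the project's retriever: the function names are hypothetical, and the exact way the weights enter the fusion (each source adds `weight / (k + rank)`) is assumed, since the list above only names the parameters and their defaults.

```python
import os
from collections import defaultdict


def weights_from_env(env=None):
    """Read the per-source weights, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "bm25": float(env.get("HYB_W_BM25", 1.0)),
        "dense": float(env.get("HYB_W_DENSE", 1.5)),
        "graph": float(env.get("HYB_W_GRAPH", 1.0)),
    }


def weighted_rrf(rankings, weights, k=60.0):
    """Fuse best-first ranked lists of document ids.

    rankings: {source_name: [doc_id, ...]}, each list ordered best-first.
    weights:  {source_name: weight}; sources without a weight contribute 0.
    Returns [(doc_id, fused_score)] sorted by descending score, ties by id.
    """
    scores = defaultdict(float)
    for source, doc_ids in rankings.items():
        weight = weights.get(source, 0.0)
        for rank, doc_id in enumerate(doc_ids, start=1):
            # Assumed fusion rule: weighted reciprocal of (k + rank).
            scores[doc_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    rankings = {"bm25": ["a", "b"], "dense": ["b", "c"]}
    k = float(os.environ.get("HYB_RRF_K", 60))
    for doc_id, score in weighted_rrf(rankings, weights_from_env(), k=k):
        print(doc_id, round(score, 6))
```

With the default weights, a document ranked by both sources outranks one that a single source puts first, and the larger dense weight breaks near-ties in favor of the dense retriever.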
Step-by-step Tasks
- Prep Gold artifacts
    - ⬜ Verify `cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet` and (optionally) semantic bucket exist ([gold.md](gold.md)).
- Build indices
    - ⬜ Graph CSV + ingest:
    - ⬜ Dense index:
    - ⬜ BM25 index:
- Run communities & retrieval sanity
    - ⬜ Community detection + summaries:
    - ⬜ Retrieval spot checks:
- Benchmark & apps
    - Quality run:
    - Launch production frontend (local):
    - Launch legacy Streamlit fallback:
- Troubleshooting
    - Use `python -m communities progress --dataset fixed_size` to inspect graph state.
    - Missing embeddings → set `GEMINI_KEY` (dense) or `OPENAI_KEY` for community summaries.
    - Neo4j connection issues → confirm `NEO4J_DATABASE[_FIXED_SIZE]`, `NEO4J_PASSWORD`, and bolt URI.
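Most of the troubleshooting items above come down to a missing or malformed environment variable, so a preflight check can catch them before any build step starts. The sketch below is hypothetical helper code: the variable names come from this page, while the grouping by feature and the function names are illustrative assumptions.

```python
import os
from urllib.parse import urlparse

# Variable names are taken from this page; the feature grouping is illustrative.
REQUIRED_ENV = {
    "dense embeddings": ["GEMINI_KEY"],
    "community summaries": ["OPENAI_KEY"],
    "neo4j": ["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"],
}

# URI schemes accepted by the official Neo4j drivers.
NEO4J_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}


def missing_env(env=None):
    """Return {feature: [unset or empty variable names]} for incomplete features."""
    env = os.environ if env is None else env
    report = {}
    for feature, names in REQUIRED_ENV.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            report[feature] = absent
    return report


def neo4j_uri_looks_valid(uri):
    """Check only the URI scheme; this does not open a connection."""
    return urlparse(uri).scheme in NEO4J_SCHEMES


if __name__ == "__main__":
    for feature, names in missing_env().items():
        print(f"{feature}: set {', '.join(names)}")
    uri = os.environ.get("NEO4J_URI", "")
    if uri and not neo4j_uri_looks_valid(uri):
        print(f"NEO4J_URI has an unexpected scheme: {uri}")
```

The check is deliberately offline: it validates configuration only and leaves actual connectivity to the CLI commands.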
Validation & Quality Gates
- Graph build: `gr_nodes.csv`/`gr_edges.csv` non-empty; in Neo4j Browser `MATCH (c:Chunk) RETURN count(c)` matches row count of Gold parquet.
- Communities: ensure coverage ≥95% (see `python -m communities communities` output) and `Community.summaryEmbedding` dimension matches vector index.
- Dense/BM25: run CLI query; require ≥3 hits per cleantech-themed query (e.g., “electrolyzer financing 2024”).
- Hybrid retrieval: inspect fused hits for `score` monotonicity; latency logs (if enabled) should show `seed` stage p95 < 250 ms.
- Benchmarks: `rag.bench.cli evaluate` prints `map`, `ndcg_cut_10`, `recip_rank`, `P_5`, `P_10`, `recall_100`; track improvements in `bench_out/.../evals/leaderboard.csv`.
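Several of these gates are simple numeric checks that can be scripted. The sketch below shows one way to express them, under stated assumptions: "score monotonicity" is read as fused scores never increasing down the hit list, coverage is read as the fraction of chunks assigned to a community, and p95 uses the nearest-rank method. The function names are hypothetical and the latency log format is not assumed; the functions take plain lists of numbers.

```python
import math


def is_non_increasing(scores):
    """True if each score is >= the next one (ties allowed)."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def seed_latency_ok(latencies_ms, budget_ms=250.0):
    """Gate from the list above: seed-stage p95 must stay below the budget."""
    return percentile_nearest_rank(latencies_ms, 95) < budget_ms


def coverage_ok(assigned_chunks, total_chunks, threshold=0.95):
    """Gate from the list above: at least 95% of chunks have a community."""
    if total_chunks <= 0:
        return False
    return assigned_chunks / total_chunks >= threshold


def enough_hits(hits, minimum=3):
    """Gate from the list above: at least three hits per query."""
    return len(hits) >= minimum
```

Keeping each gate as a small pure function makes it straightforward to call them from a notebook or a CI job once the real scores and latencies have been loaded.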
Reproducibility
- Persist ingest tags per run (e.g., `comm_fixed_C1_g1_2`) to reuse communities.
- Dense and BM25 stores reside under `vectordb_dense/` and `vectordb_bm25/` with deterministic folder names (`<dataset>/<date>`).
- Benchmark runs log configs inside run JSON; keep QA generation seed stable by reusing the saved parquet/qrels pair.
- Production frontend and legacy Streamlit app both read environment variables at startup.
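Because store folders are keyed by `<dataset>/<date>`, a run is easiest to reproduce when the date bucket is resolved once and then pinned. The following sketch shows one possible way to pick the latest bucket; it assumes bucket folders are named as ISO dates (as in `2025-09-14`), and both function names are hypothetical.

```python
from datetime import date
from pathlib import Path
from typing import Iterable, Optional


def latest_date(names: Iterable[str]) -> Optional[str]:
    """Return the name that parses to the most recent ISO date, or None.

    Names that are not valid ISO dates (temp folders, notes) are ignored.
    """
    dated = []
    for name in names:
        try:
            dated.append((date.fromisoformat(name), name))
        except ValueError:
            continue
    return max(dated)[1] if dated else None


def latest_bucket(dataset_dir: Path) -> Optional[str]:
    """Resolve the latest date bucket among the subdirectories of dataset_dir."""
    if not dataset_dir.is_dir():
        return None
    return latest_date(p.name for p in dataset_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    bucket = latest_bucket(Path("cleantech_data/gold_subsample_chunk/fixed_size"))
    print("pin this run date:", bucket)
```

Recording the resolved date alongside the ingest tag means a later rerun reads the same Gold bucket even after newer buckets appear.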
Related notebooks
- 05_metadata_extraction: chunk enrichment context reused by retrievers.
- 06_graphrag_gamma_selection: guidance for selecting Leiden `gamma` per community level.
See also
- Gold pipeline for upstream lineage.
- Graph details: GraphRAG build & communities.
- Retriever specifics: Dense & BM25.
- Benchmarks, Frontend (React), and App runbooks.
- Reference CLI options in `cli.md` and `reference.md`.