RAG

The Retrieval-Augmented Generation stack consumes Gold chunks to build graph, dense, and sparse indices. It powers GraphRAG exploration, hybrid retrievers, automated benchmarks, and the interactive apps (production React frontend + legacy Streamlit fallback).

Inputs & Outputs

Inputs: Gold chunks (chunks_enriched.parquet for fixed_size and semantic), provider keys (GEMINI_KEY, OPENAI_KEY), Neo4j credentials (NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD).
Primary outputs:
Graph CSV + Neo4j ingest: gold_subsample_chunk/<dataset>/<date>/gr_nodes.csv, gr_edges.csv, populated Chunk, Community, Entity nodes.
Dense store: vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/ (Chroma persistence).
BM25 store: vectordb_bm25/<dataset>/<date>/ (bm25_text.pkl, bm25_summary.pkl, manifest index.json).
Benchmark artifacts: bench_out/<dataset>/<date>/qa/*.parquet, bench_out/<dataset>/<date>/evals/*.json, leaderboard CSV, latency logs.
App configuration/state: frontend session cookies and optional .streamlit/secrets.toml for the legacy fallback app.

Parameters

Shared config: CT_DATA_ROOT, dataset selectors (fixed_size, semantic), run date (defaults to latest bucket), top-k defaults (communities.retrieve --top-k 100, dense/BM25 --top-k 10).
Graph: graphbuild csv/ingest take --dataset, --date, --ingest-tag, --delete-tag, and --batch-size. The communities CLI exposes --levels, --min-weight, --min-size, --k-comms, and --rerank-mode.
Dense retriever: dense build --batch 128 --use-summary/--use-text --persist, environment DENSE_PERSIST_DIR, embed model via DENSE_EMBED_MODEL (default text-embedding-004).
BM25 retriever: bm25 build --k1 1.2 --b 0.75 --use-summary/--use-text --sharding {mono|lang}, optional BM25_PERSIST_DIR.
Hybrid retrieval: weights HYB_W_BM25 (default 1.0), HYB_W_DENSE (1.5), HYB_W_GRAPH (1.0), reciprocal rank fusion constant HYB_RRF_K (60), seed budget HYB_SEED_K (120), graph expansion ratio and limit (--hyb-expand-ratio 2.0, --hyb-expand-limit 800).
Benchmarks: adapters (bm25, dense, graphrag, hybrid), --top-k, optional BENCH_LATENCY_LOG, HYB_LATENCY_LOG for latency captures.
Apps: production frontend calls API routes (/api/*), while legacy Streamlit sidebar controls map to communities.retrieve parameters. Environment keys like COMM_INGEST_TAG, NEO4J_DATABASE_FIXED_SIZE, and DENSE_PERSIST_DIR pre-fill defaults.
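
The hybrid weights and HYB_RRF_K above combine per-retriever rankings into one list. The following is a minimal sketch of weighted reciprocal rank fusion using those defaults. It assumes each retriever returns an ordered list of chunk ids; the function name `weighted_rrf` and the exact formula are illustrative assumptions, not taken from the project's hybrid adapter.

```python
from collections import defaultdict

# Defaults mirror the documented HYB_W_* and HYB_RRF_K values.
# The fusion formula is the standard weighted RRF, assumed here
# rather than confirmed against the project's implementation.
DEFAULT_WEIGHTS = {"bm25": 1.0, "dense": 1.5, "graph": 1.0}
DEFAULT_RRF_K = 60


def weighted_rrf(rankings, weights=DEFAULT_WEIGHTS, rrf_k=DEFAULT_RRF_K):
    """Fuse ranked id lists: score(d) = sum over retrievers of w / (rrf_k + rank)."""
    scores = defaultdict(float)
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += weight / (rrf_k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk ids, for illustration only.
fused = weighted_rrf({
    "bm25": ["c1", "c2", "c3"],
    "dense": ["c2", "c1", "c4"],
    "graph": ["c2", "c5"],
})
for chunk_id, score in fused:
    print(f"{chunk_id}\t{score:.5f}")
```

With a large constant such as 60, the reciprocal terms are close in size, so a chunk that several retrievers agree on tends to outrank one that a single retriever places first.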

Step-by-step Tasks

Prep Gold artifacts
⬜ Verify that cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet exists and, optionally, that the matching semantic bucket exists too ([gold.md](gold.md)).
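
The Gold check in this step can be scripted. The sketch below assumes CT_DATA_ROOT points at the cleantech_data directory and that date buckets are YYYY-MM-DD folders, so the newest bucket is the lexicographic maximum. The helper names `latest_bucket` and `gold_parquet` are hypothetical.

```python
import os
import re
from pathlib import Path
from typing import Optional

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def latest_bucket(dataset_dir: Path) -> Optional[str]:
    """Return the newest YYYY-MM-DD subdirectory name, or None if there is none."""
    if not dataset_dir.is_dir():
        return None
    dates = [p.name for p in dataset_dir.iterdir()
             if p.is_dir() and DATE_RE.match(p.name)]
    # ISO dates sort correctly as plain strings.
    return max(dates, default=None)


def gold_parquet(root: Path, dataset: str, date: Optional[str] = None) -> Optional[Path]:
    """Build the expected chunks_enriched.parquet path (layout assumed from this page)."""
    dataset_dir = root / "gold_subsample_chunk" / dataset
    date = date or latest_bucket(dataset_dir)
    if date is None:
        return None
    return dataset_dir / date / "chunks_enriched.parquet"


if __name__ == "__main__":
    # Assumption: CT_DATA_ROOT is the data root; fall back to ./cleantech_data.
    root = Path(os.environ.get("CT_DATA_ROOT", "cleantech_data"))
    for dataset in ("fixed_size", "semantic"):
        path = gold_parquet(root, dataset)
        ok = path is not None and path.is_file()
        print(f"{dataset}: {'ok' if ok else 'MISSING'} {path or ''}")
```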
Build indices

⬜ Graph CSV + ingest:

Bash
python -m graphbuild csv --dataset fixed_size --date 2025-09-14
python -m graphbuild ingest --dataset fixed_size --date 2025-09-14 --ingest-tag fixed_2025_09_14

⬜ Dense index:

Bash
python -m dense build --dataset fixed_size --date 2025-09-14 --use-summary

⬜ BM25 index:

Bash
python -m bm25 build --dataset fixed_size --date 2025-09-14 --use-text

Run communities & retrieval sanity

⬜ Community detection + summaries:

Bash
python -m communities communities --dataset fixed_size --ingest-tag comm_fixed_C1_g1_2 --levels "C1:1.2" --min-weight 1 --min-size 8
python -m communities summaries --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --refresh

⬜ Retrieval spot checks:

Bash
python -m dense query --dataset fixed_size "solid-state battery"
python -m bm25 query --dataset fixed_size --use-text "solid-state battery"
python -m communities retrieve "solid-state battery" --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --k-comms 24 --top-k 50 --rerank

Benchmark & apps

Quality run:

Bash
python -m rag.bench.cli generate --dataset fixed_size --date 2025-09-14 --n 75
python -m rag.bench.cli run --adapter hybrid --dataset fixed_size --date 2025-09-14 --qa-file bench_out/fixed_size/2025-09-14/qa/qa.parquet --ingest-tag comm_fixed_C1_g1_2 --level C1 --top-k 100 --hyb-graph-expand --hyb-expand-ratio 2.0 --hyb-expand-limit 800
python -m rag.bench.cli evaluate --qrels bench_out/fixed_size/2025-09-14/qa/qrels.parquet --run bench_out/fixed_size/2025-09-14/evals/hybrid_run.json

Launch production frontend (local):

Bash
cd frontend
pnpm dev

Launch legacy Streamlit fallback:

Bash
streamlit run src/rag/demo/app.py

Troubleshooting
Use python -m communities progress --dataset fixed_size to inspect graph state.
Missing embeddings → set GEMINI_KEY (dense index) or OPENAI_KEY (community summaries).
Neo4j connection issues → confirm NEO4J_DATABASE[_FIXED_SIZE], NEO4J_PASSWORD, and the Bolt URI in NEO4J_URI.
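
Most of the failures above trace back to unset environment variables, so a preflight check before building indices can save a round trip. This is a minimal sketch: the variable names come from this page, while the grouping by component and the function name `missing_env` are assumptions made for illustration.

```python
import os

# Variable names are taken from this page; the grouping is illustrative.
REQUIRED = {
    "dense index": ["GEMINI_KEY"],
    "community summaries": ["OPENAI_KEY"],
    "neo4j": ["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"],
}


def missing_env(env=None):
    """Map each component to the variables that are unset or empty."""
    env = os.environ if env is None else env
    report = {}
    for component, names in REQUIRED.items():
        missing = [name for name in names if not env.get(name)]
        if missing:
            report[component] = missing
    return report


if __name__ == "__main__":
    problems = missing_env()
    for component, names in problems.items():
        print(f"{component}: missing {', '.join(names)}")
    if not problems:
        print("all required variables are set")
```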

Validation & Quality Gates

Graph build: gr_nodes.csv and gr_edges.csv are non-empty; in Neo4j Browser, MATCH (c:Chunk) RETURN count(c) should equal the row count of the Gold parquet.
Communities: ensure coverage ≥95% (see the output of python -m communities communities) and that the Community.summaryEmbedding dimension matches the vector index dimension.
Dense/BM25: run CLI query; require ≥3 hits per cleantech-themed query (e.g., “electrolyzer financing 2024”).
Hybrid retrieval: fused scores should be non-increasing down the ranked list; latency logs (if enabled) should show seed stage p95 < 250 ms.
Benchmarks: rag.bench.cli evaluate prints map, ndcg_cut_10, recip_rank, P_5, P_10, recall_100; track improvements in bench_out/.../evals/leaderboard.csv.
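
The hybrid gate above can be checked mechanically once scores and seed-stage latencies are available as plain lists. The sketch below assumes exactly that; the format of the latency log is not described on this page, so parsing it is left out. It uses a nearest-rank p95, which is one common definition and may differ from the one the project reports.

```python
def is_non_increasing(scores):
    """True when each score is at least as large as the one after it."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("no latency samples")
    # Integer ceiling of 0.95 * n avoids floating-point edge cases.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def hybrid_gate(scores, seed_latencies_ms, budget_ms=250.0):
    """Combine both checks; 250 ms is the seed-stage budget stated above."""
    return is_non_increasing(scores) and p95(seed_latencies_ms) < budget_ms


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(hybrid_gate([0.057, 0.041, 0.024], [120.0, 180.0, 95.0, 210.0]))
```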

Reproducibility

Persist ingest tags per run (e.g., comm_fixed_C1_g1_2) to reuse communities.
Dense and BM25 stores reside under vectordb_dense/ and vectordb_bm25/ with deterministic folder names (<dataset>/<date>).
Benchmark runs log their configs inside the run JSON; keep the QA set stable across runs by reusing the saved parquet/qrels pair instead of regenerating it.
Production frontend and legacy Streamlit app both read environment variables at startup.
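
Because settings are read from the environment at startup, resolving them once into a typed object and saving that object alongside each run makes results easier to reproduce. This is a sketch under assumptions: the HYB_* names and defaults come from the Parameters section, while the `HybridConfig` class is hypothetical and not part of the codebase.

```python
import json
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HybridConfig:
    # Defaults copied from the Parameters section of this page.
    w_bm25: float = 1.0
    w_dense: float = 1.5
    w_graph: float = 1.0
    rrf_k: int = 60
    seed_k: int = 120

    @classmethod
    def from_env(cls, env=None):
        """Resolve settings from the environment, falling back to defaults."""
        env = os.environ if env is None else env
        defaults = cls()
        return cls(
            w_bm25=float(env.get("HYB_W_BM25", defaults.w_bm25)),
            w_dense=float(env.get("HYB_W_DENSE", defaults.w_dense)),
            w_graph=float(env.get("HYB_W_GRAPH", defaults.w_graph)),
            rrf_k=int(env.get("HYB_RRF_K", defaults.rrf_k)),
            seed_k=int(env.get("HYB_SEED_K", defaults.seed_k)),
        )


if __name__ == "__main__":
    config = HybridConfig.from_env()
    # Writing the resolved values next to a run records what was actually used.
    print(json.dumps(asdict(config), indent=2))
```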

05_metadata_extraction — chunk enrichment context reused by retrievers.
06_graphrag_gamma_selection — guidance for selecting Leiden gamma per community level.