
RAG

The Retrieval-Augmented Generation (RAG) stack consumes Gold chunks to build graph, dense, and sparse (BM25) indices. These indices power GraphRAG exploration, hybrid retrievers, automated benchmarks, and the interactive apps (a production React frontend and a legacy Streamlit fallback).

Inputs & Outputs

  • Inputs: Gold chunks (chunks_enriched.parquet for fixed_size and semantic), provider keys (GEMINI_KEY, OPENAI_KEY), Neo4j credentials (NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD).
  • Primary outputs:
      • Graph CSV + Neo4j ingest: gold_subsample_chunk/<dataset>/<date>/gr_nodes.csv, gr_edges.csv, populated Chunk, Community, Entity nodes.
      • Dense store: vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/ (Chroma persistence).
      • BM25 store: vectordb_bm25/<dataset>/<date>/ (bm25_text.pkl, bm25_summary.pkl, manifest index.json).
      • Benchmark artifacts: bench_out/<dataset>/<date>/qa/*.parquet, bench_out/<dataset>/<date>/evals/*.json, leaderboard CSV, latency logs.
      • App configuration/state: frontend session cookies and optional .streamlit/secrets.toml for the legacy fallback app.
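
The output layout above can be turned into a quick existence check. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes every store lives under a single root directory (the source does not say where vectordb_dense/ and vectordb_bm25/ sit relative to the Gold data).

```python
from pathlib import Path


def expected_artifacts(root: Path, dataset: str, date: str) -> dict[str, Path]:
    """Map a label to each output path named in the list above.

    Assumption: all stores share one root directory.
    """
    gold = root / "gold_subsample_chunk" / dataset / date
    bm25 = root / "vectordb_bm25" / dataset / date
    return {
        "gold_chunks": gold / "chunks_enriched.parquet",
        "graph_nodes": gold / "gr_nodes.csv",
        "graph_edges": gold / "gr_edges.csv",
        "dense_store": root / "vectordb_dense" / "gemini_text-embedding-004" / dataset / date,
        "bm25_text": bm25 / "bm25_text.pkl",
        "bm25_summary": bm25 / "bm25_summary.pkl",
        "bm25_manifest": bm25 / "index.json",
    }


def missing_artifacts(root: Path, dataset: str, date: str) -> list[str]:
    """Return the labels whose paths do not exist on disk."""
    artifacts = expected_artifacts(root, dataset, date)
    return [label for label, path in artifacts.items() if not path.exists()]


if __name__ == "__main__":
    # Example: report what is still missing for one dataset/date bucket.
    for label in missing_artifacts(Path("cleantech_data"), "fixed_size", "2025-09-14"):
        print(f"missing: {label}")
```

Depending on which of --use-text and --use-summary was passed to the BM25 build, one of the two pickle files may legitimately be absent.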

Parameters

  • Shared config: CT_DATA_ROOT, dataset selectors (fixed_size, semantic), run date (defaults to latest bucket), top-k defaults (communities.retrieve --top-k 100, dense/BM25 --top-k 10).
  • Graph: graphbuild csv/ingest options for --dataset, --date, --ingest-tag, --delete-tag, --batch-size. Communities CLI exposes --levels, --min-weight, --min-size, --k-comms, --rerank-mode.
  • Dense retriever: dense build --batch 128 --use-summary/--use-text --persist, environment DENSE_PERSIST_DIR, embed model via DENSE_EMBED_MODEL (default text-embedding-004).
  • BM25 retriever: bm25 build --k1 1.2 --b 0.75 --use-summary/--use-text --sharding {mono|lang}, optional BM25_PERSIST_DIR.
  • Hybrid retrieval: per-source weights HYB_W_BM25 (default 1.0), HYB_W_DENSE (1.5), HYB_W_GRAPH (1.0); reciprocal rank fusion constant HYB_RRF_K (60); seed budget HYB_SEED_K (120); graph expansion ratio and limit (--hyb-expand-ratio 2.0, --hyb-expand-limit 800).
  • Benchmarks: adapters (bm25, dense, graphrag, hybrid), --top-k, optional BENCH_LATENCY_LOG, HYB_LATENCY_LOG for latency captures.
  • Apps: production frontend calls API routes (/api/*), while legacy Streamlit sidebar controls map to communities.retrieve parameters. Environment keys like COMM_INGEST_TAG, NEO4J_DATABASE_FIXED_SIZE, and DENSE_PERSIST_DIR pre-fill defaults.
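
To make the hybrid weights and the RRF constant concrete, here is a minimal sketch of weighted reciprocal rank fusion. It assumes each source contributes weight / (k + rank) per hit, which is the common form of weighted RRF; the project's actual fusion code may differ, and the function name is hypothetical.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists from several retrievers.

    Each hit adds weight / (k + rank) to its document's score, with rank
    starting at 1. Sources without a weight contribute nothing.
    """
    scores: dict[str, float] = defaultdict(float)
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 0.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Weights mirror the defaults listed above; the chunk IDs are made up.
    fused = weighted_rrf(
        rankings={
            "bm25": ["c1", "c2", "c3"],
            "dense": ["c2", "c1", "c4"],
            "graph": ["c3", "c2"],
        },
        weights={"bm25": 1.0, "dense": 1.5, "graph": 1.0},
        k=60,
    )
    for doc_id, score in fused:
        print(doc_id, round(score, 5))
```

A larger k flattens the difference between top and lower ranks, so the per-source weights matter relatively more than rank position.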

Step-by-step Tasks

  1. Prep Gold artifacts
     ⬜ Verify cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet and (optionally) semantic bucket exist ([gold.md](gold.md)).
  2. Build indices
     ⬜ Graph CSV + ingest:
       Bash
       python -m graphbuild csv --dataset fixed_size --date 2025-09-14
       python -m graphbuild ingest --dataset fixed_size --date 2025-09-14 --ingest-tag fixed_2025_09_14

     ⬜ Dense index:
       Bash
       python -m dense build --dataset fixed_size --date 2025-09-14 --use-summary

     ⬜ BM25 index:
       Bash
       python -m bm25 build --dataset fixed_size --date 2025-09-14 --use-text

  3. Run communities & retrieval sanity
     ⬜ Community detection + summaries:
       Bash
       python -m communities communities --dataset fixed_size --ingest-tag comm_fixed_C1_g1_2 --levels "C1:1.2" --min-weight 1 --min-size 8
       python -m communities summaries --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --refresh

     ⬜ Retrieval spot checks:
       Bash
       python -m dense query --dataset fixed_size "solid-state battery"
       python -m bm25 query --dataset fixed_size --use-text "solid-state battery"
       python -m communities retrieve "solid-state battery" --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --k-comms 24 --top-k 50 --rerank

  4. Benchmark & apps
     • Quality run:
       Bash
       python -m rag.bench.cli generate --dataset fixed_size --date 2025-09-14 --n 75
       python -m rag.bench.cli run --adapter hybrid --dataset fixed_size --date 2025-09-14 --qa-file bench_out/fixed_size/2025-09-14/qa/qa.parquet --ingest-tag comm_fixed_C1_g1_2 --level C1 --top-k 100 --hyb-graph-expand --hyb-expand-ratio 2.0 --hyb-expand-limit 800
       python -m rag.bench.cli evaluate --qrels bench_out/fixed_size/2025-09-14/qa/qrels.parquet --run bench_out/fixed_size/2025-09-14/evals/hybrid_run.json

     • Launch production frontend (local):
       Bash
       cd frontend
       pnpm dev

     • Launch legacy Streamlit fallback:
       Bash
       streamlit run src/rag/demo/app.py

  5. Troubleshooting
     • Use python -m communities progress --dataset fixed_size to inspect graph state.
     • Missing embeddings → set GEMINI_KEY (dense) or OPENAI_KEY for community summaries.
     • Neo4j connection issues → confirm NEO4J_DATABASE[_FIXED_SIZE], NEO4J_PASSWORD, and bolt URI.
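
Most of the troubleshooting items above come down to missing environment variables, so a preflight check before building indices can save a failed run. This is a minimal sketch: the variable names come from the Inputs list, while the grouping by stage and the helper name are assumptions.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Variable names are taken from the Inputs list; the stage grouping is an assumption.
REQUIRED_ENV = {
    "dense": ["GEMINI_KEY"],
    "summaries": ["OPENAI_KEY"],
    "neo4j": ["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"],
}


def missing_env(env: Optional[Mapping[str, str]] = None) -> dict[str, list[str]]:
    """Return, per stage, the variables that are unset or empty."""
    env = os.environ if env is None else env
    report: dict[str, list[str]] = {}
    for stage, names in REQUIRED_ENV.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            report[stage] = absent
    return report


if __name__ == "__main__":
    for stage, names in missing_env().items():
        print(f"{stage}: set {', '.join(names)}")
```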

Validation & Quality Gates

  • Graph build: gr_nodes.csv/gr_edges.csv non-empty; in Neo4j Browser MATCH (c:Chunk) RETURN count(c) matches row count of Gold parquet.
  • Communities: ensure coverage ≥95% (see python -m communities communities output) and Community.summaryEmbedding dimension matches vector index.
  • Dense/BM25: run CLI query; require ≥3 hits per cleantech-themed query (e.g., “electrolyzer financing 2024”).
  • Hybrid retrieval: check that fused scores are non-increasing down the ranked list; latency logs (if enabled) should show a seed-stage p95 below 250 ms.
  • Benchmarks: rag.bench.cli evaluate prints map, ndcg_cut_10, recip_rank, P_5, P_10, recall_100; track improvements in bench_out/.../evals/leaderboard.csv.
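
Two of the gates above, score monotonicity and the p95 latency bound, are easy to automate. The sketch below assumes the fused scores and the seed-stage latencies have already been loaded into plain lists (the latency log format is not specified here), and it uses the nearest-rank percentile definition, which may differ from what the project's tooling reports.

```python
from typing import Sequence


def is_non_increasing(scores: Sequence[float]) -> bool:
    """True when every score is >= the one that follows it."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), which avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def passes_hybrid_gates(
    scores: Sequence[float],
    seed_latencies_ms: Sequence[float],
    p95_budget_ms: float = 250.0,
) -> bool:
    """Both gates: ordered fused scores and seed-stage p95 under budget."""
    return is_non_increasing(scores) and p95(seed_latencies_ms) < p95_budget_ms
```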

Reproducibility

  • Persist ingest tags per run (e.g., comm_fixed_C1_g1_2) to reuse communities.
  • Dense and BM25 stores reside under vectordb_dense/ and vectordb_bm25/ with deterministic folder names (<dataset>/<date>).
  • Benchmark runs log their configs inside the run JSON; keep the QA set stable across runs by reusing the saved QA parquet/qrels pair instead of regenerating it.
  • Production frontend and legacy Streamlit app both read environment variables at startup.
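
Because stores are keyed by <dataset>/<date>, a run that relies on the "latest bucket" default can silently pick up a newer bucket later. The sketch below shows one way a latest bucket could be resolved, assuming buckets are directories named with ISO dates; the helper is hypothetical and the project's own resolution logic may differ. Passing --date explicitly avoids the ambiguity.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def latest_bucket(dataset_dir: Path) -> Optional[str]:
    """Return the name of the newest ISO-dated subdirectory, or None.

    Assumption: date buckets are directories such as 2025-09-14.
    Entries that are not directories or not valid dates are ignored.
    """
    dated = []
    for child in dataset_dir.iterdir():
        if not child.is_dir():
            continue
        try:
            dated.append((date.fromisoformat(child.name), child.name))
        except ValueError:
            continue
    # Compare parsed dates rather than raw strings.
    return max(dated)[1] if dated else None
```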

See also