CLI

This page is the engineer‑oriented CLI reference for the MT project. It covers all packages end‑to‑end — Bronze, Silver, Unify, Gold (subsample → chunk → enrich → language patch), RAG build & retrieval (graph, communities, dense & BM25), and benchmarks.

The layout for each CLI:

  • Purpose → what it does and when to use it
  • Invocation → how to run it
  • Options → grouped by category; sensible defaults called out
  • Outputs → what gets written and where
  • Examples → copy‑paste snippets

Data root: defaults to ./cleantech_data. Override globally with CLEANTECH_DATA_DIR or per command via --download-dir, --bronze-dir, --silver-dir, etc.
Kaggle: export KAGGLE_USERNAME & KAGGLE_KEY (or ~/.kaggle/kaggle.json).
LLM providers: set GEMINI_KEY and/or OPENAI_KEY.
Neo4j: set NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD (and database name if applicable).


Bronze — cleantech-fetch

Purpose: Fetch Kaggle datasets and OpenAlex Works/Topics into date‑bucketed Bronze folders, writing a per‑run manifest and optional JSONL mirrors for uniform downstream ingestion.

Invocation

Bash
cleantech-fetch [OPTIONS]

Options

General

  • --download-dir PATH — root for all artifacts (default: ./cleantech_data). Creates bronze/... as needed.

Kaggle datasets

  • --kaggle-no-mirror — skip building raw.jsonl.gz mirror; keep only original ZIP.
  • --kaggle-keep-extracted — preserve the temporary extracted/ CSV/JSON files.

OpenAlex Works

  • --start YYYY-MM-DD, --end YYYY-MM-DD — publication date window (defaults: current year).
  • --openalex-per-page 200 — items per page (max 200).
  • --openalex-pages 0 — number of result pages (0 = no limit).
  • --openalex-search "..." — full‑text search query.
  • --openalex-oa-only — restrict to open‑access works.
  • --openalex-mailto you@example.com — include a mailto for polite API usage.

OpenAlex Topics (topics‑only mode)

  • --openalex-only-topics — fetch Topics only (skips Works).
  • --openalex-topics-per-page 200, --openalex-topics-pages 0, --openalex-topics-search "..." — topics paging/search.
  • --openalex-topics-keep-extracted — keep a plain topics.jsonl under extracted/.
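To make the OpenAlex options concrete, here is a minimal sketch of the kind of request URL the fetcher composes from these flags. This is illustrative only, not the project's actual code; the `build_works_url` helper is hypothetical, though the OpenAlex query parameters (`filter`, `per-page`, `cursor`, `search`, `mailto`) are the documented public API.

```python
from urllib.parse import urlencode

def build_works_url(start, end, search=None, oa_only=False,
                    per_page=200, mailto=None, cursor="*"):
    """Compose an OpenAlex /works request URL (illustrative helper)."""
    filters = [f"from_publication_date:{start}", f"to_publication_date:{end}"]
    if oa_only:
        filters.append("open_access.is_oa:true")
    params = {"filter": ",".join(filters), "per-page": per_page, "cursor": cursor}
    if search:
        params["search"] = search
    if mailto:
        params["mailto"] = mailto  # polite-pool identification
    return "https://api.openalex.org/works?" + urlencode(params)

url = build_works_url("2025-01-01", "2025-12-31",
                      search="renewable OR hydrogen", oa_only=True,
                      mailto="you@example.com")
```

Cursor paging (start at `cursor=*`, then follow `meta.next_cursor` in each response) is how `--openalex-pages 0` can walk the full result set.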

Outputs

Text Only
cleantech_data/
  bronze/
    kaggle/<slug>/<YYYY-MM-DD>/
      original.zip
      raw.jsonl.gz                # if mirror enabled
      raw_manifest.jsonl
      extracted/                  # if --kaggle-keep-extracted
    openalex/works/<YYYY-MM-DD>/
      raw.jsonl.gz
      raw_manifest.jsonl
    openalex/topics/<YYYY-MM-DD>/
      topics.jsonl.gz
      raw_manifest.jsonl
      extracted/topics.jsonl      # if --openalex-topics-keep-extracted

Examples

Bash
# Works (open access only) with query; cap to 3 pages
cleantech-fetch --openalex-oa-only --openalex-search "renewable OR hydrogen" --openalex-pages 3

# Topics only; keep a plain JSONL copy
cleantech-fetch --openalex-only-topics --openalex-topics-keep-extracted

# Kaggle only; keep extracted CSVs but skip JSONL mirror
cleantech-fetch --kaggle-keep-extracted --kaggle-no-mirror

Silver — ctclean

Purpose: Canonicalize Bronze snapshots into analysis‑ready Silver tables per dataset (Media, Patents, OpenAlex Topics). Parquet first; CSV.gz fallback if needed.

Invocation

Bash
ctclean <subcommand> [OPTIONS]
# subcommands: media | patents | openalex | all

Subcommands & Options

ctclean media

  • --n-rows N — process only first N rows (smoke test).
  • --bronze-dir DIR — override media Bronze location.
  • --silver-dir DIR — override output path.
  • --include-listings / --no-include-listings — keep or drop listing/archive pages (default: drop).

Outputs

Text Only
cleantech_data/silver/media/<YYYY-MM-DD>/
  media_canonical.*              # unique articles
  media_dupe_links.*             # duplicates → canonical
  media_excluded_listings.*
  media_excluded_non_articles.*
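The canonical/dupe-link split above can be sketched as follows. This is a minimal illustration, assuming duplicates are detected by a normalized-URL key; the real pipeline's dedup logic and column names may differ, and `split_canonical` is a hypothetical helper.

```python
from urllib.parse import urlsplit

def normalize_url(url: str) -> str:
    """Crude canonical key: lowercase host, drop query/fragment and trailing slash."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"

def split_canonical(articles):
    """Partition rows into canonical articles and dupe -> canonical links."""
    canonical, dupe_links = {}, []
    for row in articles:
        key = normalize_url(row["url"])
        if key in canonical:
            dupe_links.append({"dupe_id": row["id"],
                               "canonical_id": canonical[key]["id"]})
        else:
            canonical[key] = row
    return list(canonical.values()), dupe_links

rows = [
    {"id": "a1", "url": "https://Example.com/story/"},
    {"id": "a2", "url": "https://example.com/story?utm=x"},
    {"id": "a3", "url": "https://example.com/other"},
]
canon, links = split_canonical(rows)
# canon keeps a1 and a3; a2 is recorded as a duplicate of a1
```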

ctclean patents

  • --n-rows N, --bronze-dir DIR, --silver-dir DIR

Outputs

Text Only
cleantech_data/silver/patents/<YYYY-MM-DD>/
  patent_canonical.*
  patent_dupe_links.*
  patents_normalized.*

ctclean openalex

  • --n-rows N, --bronze-dir DIR, --silver-dir DIR

Outputs

Text Only
cleantech_data/silver/openalex/<YYYY-MM-DD>/
  topics_canonical.*
  topic_keywords_m2m.*
  topic_siblings_m2m.*
  domains_ref.*  fields_ref.*  subfields_ref.*

ctclean all

  • Runs media → patents → openalex in sequence, honoring the same flags.

Examples

Bash
# Run all three
ctclean all

# Media only; keep listing pages and write to a custom silver root
ctclean media --include-listings --silver-dir ./cleantech_data_custom/silver

Validation & Unify — ctunify

Purpose: Validate latest Silver buckets and merge into a single Unified Silver file for Gold processing.

Invocation

Bash
ctunify run [--output PATH]

Behavior & Outputs

  • Auto‑discovers latest silver/media, silver/patents, silver/openalex buckets.
  • Default output: cleantech_data/silver/unified/unified_docs.parquet (override with --output).
  • Non‑destructive: inputs remain unchanged.

Examples

Bash
ctunify run                                 # writes default unified path
ctunify run --output ./.../unified_2025-10-27.parquet

Gold — Subsample → Chunk → Enrich → (Language Patch)

Subsample — ctsubsample

Purpose: Create a stratified sample of the unified corpus for faster Gold iteration. Media & patent rows are sampled; topic rows are always kept and appended.

Invocation

Bash
ctsubsample [OPTIONS]

Options

  • I/O: --input PATH (default: latest silver/unified/unified_docs.*), --outdir PATH (default: new folder under silver_subsample/).
  • Size: exactly one of --n INT or --frac FLOAT (e.g., --frac 0.25).
  • Strata: --by "doc_type,lang" (default). Optional --min-per-stratum, --cap-per-stratum.
  • Repro/format: --seed 42 (default), --dedupe (by doc_id pre‑sample), --write-csv (write CSV.gz alongside Parquet).

Outputs

Text Only
cleantech_data/silver_subsample/<YYYY-MM-DD>/
  unified_docs_subsample.parquet
  unified_docs_subsample.csv.gz   # only if --write-csv
  manifest.json                   # parameters + TV distance by strata
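The stratified sampling and the "TV distance by strata" quality check recorded in the manifest can be sketched like this. A minimal illustration only: the function names are hypothetical and the real tool adds `--min-per-stratum` / `--cap-per-stratum` handling.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, by, frac, seed=42):
    """Sample `frac` of rows within each stratum defined by the `by` columns."""
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[col] for col in by)].append(row)
    rng = random.Random(seed)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * frac))  # keep every stratum represented
        sample.extend(rng.sample(group, k))
    return sample

def tv_distance(pop, sub, by):
    """Total variation distance between stratum distributions (0 = identical)."""
    def dist(rows):
        counts = Counter(tuple(r[col] for col in by) for r in rows)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    p, q = dist(pop), dist(sub)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))

pop = ([{"doc_type": "media", "lang": "en"}] * 80
       + [{"doc_type": "patent", "lang": "de"}] * 20)
sub = stratified_sample(pop, by=("doc_type", "lang"), frac=0.25)
# per-stratum sampling keeps the doc_type/lang mix intact, so TV distance is ~0
```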

Examples

Bash
ctsubsample --frac 0.05 --by doc_type,lang --seed 42
ctsubsample --n 5000 --dedupe --outdir ./cleantech_data/silver_subsample/small_run

Chunk — ctchunk

Purpose: Split documents into smaller chunks for retrieval & enrichment. Two strategies, plus a convenience mode:

  • fixed: fixed token windows with overlap
  • semantic: similarity‑aware boundaries using embeddings
  • both: run both strategies in one invocation

Invocation

Bash
ctchunk <fixed|semantic|both> [OPTIONS]
# or: python -m ctchunk.cli <fixed|semantic|both> [OPTIONS]

Common Options

  • --input PATH (default: latest silver_subsample/unified_docs_subsample.*)
  • --outdir PATH (default: silver_subsample_chunk/<mode>/<YYYY-MM-DD>/)
  • --doc-types "media,patent" (filter; default: all types present)
  • --n-rows N (debug subset)

Mode: fixed

  • --max-tokens 512 — max tokens per chunk
  • --overlap 64 — tokens overlapped with previous chunk
  • --prepend-title / --no-prepend-title — include title in chunk text (default: on)
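The interaction of --max-tokens and --overlap can be sketched as a sliding window; each step advances by max_tokens − overlap, so consecutive chunks share their boundary tokens. A toy sketch over integer "tokens" (the real tool uses a proper tokenizer):

```python
def fixed_chunks(tokens, max_tokens=512, overlap=64):
    """Slide a window of max_tokens, stepping by max_tokens - overlap."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks

tokens = list(range(1000))
chunks = fixed_chunks(tokens, max_tokens=512, overlap=64)
# 3 chunks; each chunk's first 64 tokens repeat the previous chunk's last 64
```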

Outputs

Text Only
.../fixed_size/<YYYY-MM-DD>/
  chunks.jsonl
  chunks.parquet
  manifest.json

Mode: semantic

  • --sim-threshold 0.75 — merge when adjacent similarity ≥ threshold
  • --min-tokens 50 — minimum chunk size (merge small fragments)
  • --embedding-model sentence-transformers/distiluse-base-multilingual-cased-v2
  • --prepend-title / --no-prepend-title
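The --sim-threshold boundary rule can be sketched as: merge adjacent sentences while their embedding similarity stays at or above the threshold, and cut a chunk boundary when it drops below. A toy sketch with hand-made 2-d vectors; the real tool embeds with the configured sentence-transformers model and also enforces --min-tokens.

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunks(sentences, embeddings, sim_threshold=0.75):
    """Cut a chunk boundary wherever adjacent similarity falls below threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cos(embeddings[i - 1], embeddings[i]) >= sim_threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks

sents = ["Hydrogen storage is improving.", "New tanks hold more H2.",
         "Solar cells are cheaper."]
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chunks = semantic_chunks(sents, embs, sim_threshold=0.75)
# the first two sentences merge; the topic shift starts a new chunk
```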

Outputs

Text Only
.../semantic/<YYYY-MM-DD>/
  chunks.jsonl
  chunks.parquet
  manifest.json

Examples

Bash
# Fixed-size chunking
ctchunk fixed --max-tokens 512 --overlap 64 --doc-types media,patent

# Semantic chunking with stricter boundary detection
ctchunk semantic --sim-threshold 0.8 --min-tokens 80

Enrich — python -m ctenrichement.cli

Purpose: LLM‑powered enrichment of chunks with summaries, entities, facts, and topic joins → writes Gold artifacts.

Invocation

Bash
python -m ctenrichement.cli <all|media|patent|topics|progress> [OPTIONS]

Options

  • Dataset selection: --dataset fixed_size or --dataset semantic (default: both; run twice internally).
  • Provider & model: --provider auto|openai|gemini, --model-id <name>; --summary-provider, --summary-model-id (advanced). Env: LEX_SUMMARY_SENTENCES=3 (default).
  • Throughput: --rps 8.0, --max-workers 12; env: LEX_MAX_RETRIES=5, LEX_MIN_EXTRACTIONS=1, LEX_MIN_SPAN_INTEGRITY=0.4.
  • Resume: --resume none|failed|skip-completed (uses .lex_cache state).
  • With topics: --with-topics — also build topic assets/joins when enriching media/patent.
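The --rps throttle can be pictured as request pacing: each call is assigned the next free time slot, spaced 1/rps apart. A minimal sketch with an injectable clock (the `RateLimiter` class is hypothetical; the real client also retries per LEX_MAX_RETRIES):

```python
import threading

class RateLimiter:
    """Pacing for --rps: acquire() returns the time at which the caller
    may proceed, spacing requests 1/rps seconds apart."""
    def __init__(self, rps: float):
        self.interval = 1.0 / rps
        self.next_slot = 0.0
        self.lock = threading.Lock()  # safe under --max-workers threads

    def acquire(self, now: float) -> float:
        with self.lock:
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.interval
            return slot

limiter = RateLimiter(rps=8.0)
slots = [limiter.acquire(now=0.0) for _ in range(4)]
# four calls at t=0 are spread 125 ms apart: 0.0, 0.125, 0.25, 0.375
```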

Outputs

Text Only
cleantech_data/gold_subsample_chunk/<dataset>/<YYYY-MM-DD>/
  chunks_enriched.parquet
  grounded_extractions.jsonl
  visualization_interactive.html
  completed_ids.json
  failed_chunks.json
  failed_details.json
  manifest.json
  topics/                    # if --with-topics
  topic_chunk_join/          # if --with-topics

Examples

Bash
# Enrich both datasets with OpenAI; skip completed
python -m ctenrichement.cli all --provider openai --rps 5 --max-workers 8 --resume skip-completed

# Media only on semantic chunks with topic joins
python -m ctenrichement.cli media --dataset semantic --with-topics

(Optional) Language Patch — python -m patch.language.lang_overwrite

Purpose: Re‑detect / overwrite metadata.lang for chunks using a lightweight LLM prompt; writes patched Gold + audit report.

Invocation

Bash
python -m patch.language.lang_overwrite --dataset <fixed_size|semantic> --date YYYY-MM-DD [OPTIONS]

Options

  • --from-text — detect from raw chunk text (default prefers summary if present)
  • --overwrite / --no-overwrite — replace existing metadata.lang (default: no)
  • --write-mode copy|inplace — write chunks_enriched_langpatched.* or edit in place (with .bak)
  • --limit N — process only first N rows
  • Provider/model: --provider auto|openai|gemini, --openai-model <id>, --gemini-model <id>

Outputs

Text Only
.../gold_subsample_chunk/<dataset>/<YYYY-MM-DD>/
  chunks_enriched_langpatched.parquet   # if write-mode=copy
  audit/lang_patch_report.csv

Example

Bash
python -m patch.language.lang_overwrite --dataset fixed_size --date 2025-10-27 --write-mode copy --overwrite

RAG Build & Retrieval

Graph Build — python -m graphbuild

Purpose: Export Gold chunks to graph CSVs and ingest them into Neo4j under a chosen ingest tag for GraphRAG.

Export CSV

Bash
python -m graphbuild csv --dataset <fixed_size|semantic> --date YYYY-MM-DD [--outdir PATH]
Writes
Text Only
gr_nodes.csv
gr_edges.csv
manifest.json

Ingest Neo4j

Bash
python -m graphbuild ingest --dataset <fixed_size|semantic> --date YYYY-MM-DD --ingest-tag <tag>
  • Uses NEO4J_* env vars for connection/auth.
  • Creates labeled nodes/edges (e.g., Chunk, Community, Entity) under the tag.
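The node/edge CSV export can be sketched as flattening enriched chunks into two tables. Illustrative only: the real gr_nodes.csv / gr_edges.csv schema may use different columns, and `to_graph_csvs` is a hypothetical helper.

```python
import csv
import io

def to_graph_csvs(chunks):
    """Emit node and edge rows for chunks and their extracted entities
    (illustrative schema, not the project's exact columns)."""
    nodes, edges = [], []
    for ch in chunks:
        nodes.append({"id": ch["chunk_id"], "label": "Chunk"})
        for ent in ch.get("entities", []):
            nodes.append({"id": ent, "label": "Entity"})
            edges.append({"src": ch["chunk_id"], "dst": ent, "type": "MENTIONS"})
    # de-duplicate entity nodes that appear in several chunks
    seen, unique_nodes = set(), []
    for n in nodes:
        if n["id"] not in seen:
            seen.add(n["id"])
            unique_nodes.append(n)
    return unique_nodes, edges

def write_csv(rows, fieldnames):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

chunks = [{"chunk_id": "c1", "entities": ["electrolyzer"]},
          {"chunk_id": "c2", "entities": ["electrolyzer", "PEM"]}]
nodes, edges = to_graph_csvs(chunks)
nodes_csv = write_csv(nodes, ["id", "label"])
```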

Example

Bash
python -m graphbuild csv --dataset fixed_size --date 2025-10-27
python -m graphbuild ingest --dataset fixed_size --date 2025-10-27 --ingest-tag demo_oct27

Communities — python -m communities

Purpose: Build community structure (Leiden), summarize and embed communities, and perform GraphRAG retrieval.

Subcommands

Bash
python -m communities communities   [--dataset <...>] [--levels "C1:1.2,C2:1.0"] [--ingest-tag <tag>]
python -m communities summaries     [--dataset <...>] [--level C1] [--estimate-only] [--refresh]
python -m communities ensure-index  [--dataset <...>] [--level C1]
python -m communities search        [--dataset <...>] [--level C1] --query "..." [--top-k 10]
python -m communities retrieve      [--dataset <...>] [--level C1] --query "..." [--k-comms 4] [--top-k 12] [--rerank-mode dense|summary]
python -m communities cleanup       --ingest-tag <tag>
python -m communities progress

Common options

  • --dataset <fixed_size|semantic>
  • --level C0|C1|C2|C3
  • --ingest-tag <tag> (for operations tied to a tagged ingestion)

Notes

  • communities: runs Leiden; use --levels "C1:1.2" to set the resolution (e.g., level C1 at 1.2).
  • summaries: LLM summarizes + embeds communities; add --estimate-only for dry‑run or --refresh to recompute.
  • retrieve: GraphRAG retrieval across selected communities; supports optional re‑ranking.

Example

Bash
python -m communities communities --dataset fixed_size --levels "C1:1.2"
python -m communities summaries --dataset fixed_size --level C1
python -m communities retrieve --dataset fixed_size --level C1 --query "solid-state hydrogen storage" --k-comms 4 --top-k 12

Hybrid retrieval (triple retriever) — python -m rag.retrieval.triple_retriever

Purpose: Fuse BM25 + dense seeds, optionally expand via Neo4j, and optionally apply cross-encoder reranking.

Stage A (IDs only)

Bash
python -m rag.retrieval.triple_retriever ids "query text" \
  --dataset fixed_size \
  --date 2025-09-14 \
  --injest-tag comm_fixed_C1_g1_2 \
  --level C1 \
  --top-k 50

Options (defaults)

  • --dataset (required)
  • --date (default: latest GOLD bucket)
  • --injest-tag (typo in CLI; default: None, falls back to COMM_INGEST_TAG)
  • --level (default: C1)
  • --top-k (default: 50)
  • --seed-k (default: 120)
  • --expand-ratio (default: 2.0)
  • --expand-limit (default: 800)
  • --bench-root (default: None)
  • --dense-date (default: None)
  • --dense-persist (default: None)
  • --rerank-mode (default: dense)
  • --graph-required/--no-graph-required (default: False)
  • --max-per-doc (default: 0)
  • --out (default: None; .jsonl | .json | .csv)

Stage B (rerank overlay)

Bash
python -m rag.retrieval.triple_retriever rerank "query text" \
  --dataset fixed_size \
  --date 2025-09-14 \
  --ingest-tag comm_fixed_C1_g1_2 \
  --level C1 \
  --fetch-top-n 120 \
  --top-k 20

Options (defaults)

  • --dataset (required)
  • --date (default: latest GOLD bucket)
  • --ingest-tag (default: None, falls back to COMM_INGEST_TAG)
  • --level (default: C1)
  • --top-k (default: 20)
  • --fetch-top-n (default: 120)
  • --seed-k (default: 120)
  • --expand-ratio (default: 2.0)
  • --expand-limit (default: 800)
  • --bench-root (default: None)
  • --dense-date (default: None)
  • --dense-persist (default: None)
  • --rerank-mode (default: dense)
  • --rerank-spec (default: hf:BAAI/bge-reranker-base)
  • --prefer (default: summary)
  • --include-doc/--no-include-doc (default: False)
  • --attach-metadata/--no-attach-metadata (default: True)
  • --graph-required/--no-graph-required (default: False)
  • --alpha (default: 0.7)
  • --sort-by (default: final)
  • --max-per-doc (default: 0)
  • --diversify (default: none)
  • --mmr-lambda (default: 0.5)
  • --min-occurrence / --min-occurance (default: 0)
  • --fusion (default: None; text | summary)
  • --coverage/--no-coverage (default: False)
  • --out (default: None; .jsonl | .json | .csv)
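How --alpha, --max-per-doc, and --sort-by final plausibly interact can be sketched as a score blend followed by a per-document cap. This is an assumption about the scoring shape (final = alpha · rerank + (1 − alpha) · fused), not the project's confirmed formula, and `blend_and_cap` is a hypothetical helper.

```python
def blend_and_cap(candidates, alpha=0.7, max_per_doc=0, top_k=20):
    """Assumed scoring: final = alpha * rerank + (1 - alpha) * fused,
    then an optional per-document cap before truncating to top_k."""
    for c in candidates:
        c["final"] = alpha * c["rerank"] + (1 - alpha) * c["fused"]
    ranked = sorted(candidates, key=lambda c: c["final"], reverse=True)
    out, per_doc = [], {}
    for c in ranked:
        n = per_doc.get(c["doc_id"], 0)
        if max_per_doc and n >= max_per_doc:
            continue  # already have enough chunks from this document
        per_doc[c["doc_id"]] = n + 1
        out.append(c)
        if len(out) == top_k:
            break
    return out

cands = [
    {"chunk_id": "c1", "doc_id": "d1", "rerank": 0.9, "fused": 0.2},
    {"chunk_id": "c2", "doc_id": "d1", "rerank": 0.8, "fused": 0.9},
    {"chunk_id": "c3", "doc_id": "d2", "rerank": 0.5, "fused": 0.8},
]
top = blend_and_cap(cands, alpha=0.7, max_per_doc=1, top_k=2)
# c2 wins on blended score; c1 is capped out (same doc), so c3 takes the second slot
```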

See Retrieval params reference for how CLI defaults compare to app defaults.

Dense Retriever — python -m dense

Purpose: Build/query a dense vector index (Chroma) over Gold chunks.

Build

Bash
python -m dense build --dataset <fixed_size|semantic> --date YYYY-MM-DD [--persist PATH] [--batch 128] [--limit N] [--use-summary|--use-text]
  • Env: DENSE_EMBED_MODEL (default text-embedding-004), DENSE_PERSIST_DIR

Query

Bash
python -m dense query --top-k 10 [--hydrate] [--json]
  • Uses the latest (or --persist) collection; --hydrate attaches chunk text/metadata

Writes

Text Only
vectordb_dense/<model>/<dataset>/<YYYY-MM-DD>/   # or --persist path

Example

Bash
python -m dense build --dataset fixed_size --date 2025-10-27 --use-summary
python -m dense query --top-k 10 --hydrate --json --query "electrolyzer efficiency roadmap"

BM25 Retriever — python -m bm25

Purpose: Build/query a lightweight lexical (BM25) index, optionally sharded by language.

Build

Bash
python -m bm25 build [--use-summary] [--k1 1.2] [--b 0.75] [--sharding mono|lang] [--limit N] [--persist PATH]
  • Env: BM25_PERSIST_DIR

Query

Bash
python -m bm25 query --top-k 10 [--use-summary|--use-text] [--route auto|all|<lang>] [--hydrate] [--json] --query "..."

Writes

Text Only
vectordb_bm25/<dataset>/<YYYY-MM-DD>/index.json   # plus shards if sharded

Example

Bash
python -m bm25 build --use-text
python -m bm25 query --top-k 10 --route auto --query "perovskite tandem solar"
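The --k1 and --b knobs above are the classic Okapi BM25 parameters (term-frequency saturation and length normalization). A self-contained sketch of the scoring the index computes, over toy tokenized documents:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each tokenized doc against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

docs = [["perovskite", "tandem", "solar"],
        ["solar", "panel", "costs"],
        ["hydrogen", "storage"]]
scores = bm25_scores(["perovskite", "solar"], docs)
# doc 0 matches both terms, doc 1 only "solar", doc 2 nothing
```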

Benchmarks — python -m rag.bench.cli

Purpose: Generate QA sets, run retrieval adapters (BM25, Dense, GraphRAG, Hybrid), and evaluate with TREC metrics. Includes latency summaries and weight tuning for hybrid fusion.

Commands

Bash
# 1) Generate QA
python -m rag.bench.cli generate --dataset <fixed_size|semantic> --date YYYY-MM-DD [--n 50] [--use-text|--use-summary] [--provider auto|openai|gemini|none] [--out-dir PATH]

# 2) Run an adapter
python -m rag.bench.cli run --adapter {bm25|dense|graphrag|hybrid} --dataset <...> --date YYYY-MM-DD --qa-file <path> --top-k 10 --out <path>
# adapter-specific flags:
#   BM25: --bm25-use-text|--bm25-use-summary, --bm25-route <auto|all|lang>
#   GraphRAG: --ingest-tag <tag> --level C1 --k-comms 4 --rerank-mode {dense|summary} [--dense-date <YYYY-MM-DD>] [--dense-persist <path>]
#   Hybrid: --hyb-rrf-k 60 --hyb-w-bm25 0.5 --hyb-w-dense 0.5 --hyb-w-graph 0.0 --hyb-seed-k 50 --graph-expand/--no-graph-expand --hyb-expand-ratio 0.2 --hyb-expand-limit 20

# 3) Evaluate
python -m rag.bench.cli evaluate --qrels <path> --run <path> [--run-id <id>] [--leaderboard <path>] [--e2e-log <path>] [--stage-log <path>]

# 4) Tune hybrid
python -m rag.bench.cli tune-hybrid --dataset <...> --date YYYY-MM-DD --qa-file <path> --out <path>

# 5) Latency & leaderboard utils
python -m rag.bench.cli latency --e2e-log <path> --stage-log <path>
python -m rag.bench.cli leaderboard [--leaderboard <path>]
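The hybrid flags (--hyb-rrf-k plus per-retriever weights) suggest weighted Reciprocal Rank Fusion. A minimal sketch of that fusion rule, as an assumption about the adapter's behavior rather than its confirmed implementation:

```python
def rrf_fuse(runs, weights, k=60):
    """Weighted RRF: score(d) = sum over runs of w_run / (k + rank_run(d))."""
    scores = {}
    for name, ranking in runs.items():
        w = weights.get(name, 0.0)
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

runs = {
    "bm25":  ["c1", "c2", "c3"],
    "dense": ["c2", "c4", "c1"],
}
fused = rrf_fuse(runs, weights={"bm25": 0.5, "dense": 0.5}, k=60)
# c2 ranks first: it appears near the top of both lists
```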

Outputs

Text Only
bench_out/<dataset>/<YYYY-MM-DD>/
  qa/...
  evals/
    runs/*.run
    qrels/*.qrels
    metrics/*.json         # e.g., map, ndcg_cut_10, recip_rank, P_5, P_10, recall_100
    hybrid_tuned.json      # from tune-hybrid
    latency/*.jsonl        # optional timing logs
    leaderboard.csv
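Two of the reported metrics, recip_rank and P_k, are simple to state exactly. A sketch of their definitions over a ranked run and a relevant-ID set (the full suite, map and ndcg_cut, comes from the TREC evaluation tooling):

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

run = ["c3", "c1", "c9", "c2", "c7"]
qrels = {"c1", "c2"}
mrr = reciprocal_rank(run, qrels)   # first relevant hit at rank 2
p5 = precision_at_k(run, qrels, 5)  # 2 relevant in the top 5
```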

Example

Bash
# Build QA
python -m rag.bench.cli generate --dataset fixed_size --date 2025-10-27 --n 50 --use-summary --provider openai

# Run & evaluate Dense
python -m rag.bench.cli run --adapter dense --dataset fixed_size --date 2025-10-27 --qa-file bench_out/fixed_size/2025-10-27/qa/qa.json --top-k 10 --out bench_out/fixed_size/2025-10-27/evals/runs/dense.run
python -m rag.bench.cli evaluate --qrels bench_out/fixed_size/2025-10-27/qa/qrels.txt --run bench_out/fixed_size/2025-10-27/evals/runs/dense.run --leaderboard bench_out/fixed_size/2025-10-27/evals/leaderboard.csv

End‑to‑End (quick path)

Bash
# Bronze
cleantech-fetch --openalex-pages 3 --openalex-per-page 200 --openalex-mailto you@example.com

# Silver
ctclean media
ctclean patents
ctclean openalex
ctunify run

# Gold
ctsubsample --frac 0.25 --by doc_type,lang --seed 42
ctchunk fixed --max-tokens 512 --overlap 64 --doc-types media,patent
python -m ctenrichement.cli all --dataset fixed_size --provider auto --rps 8 --max-workers 12 --resume none --with-topics

# Graph + Retrievers
python -m graphbuild csv --dataset fixed_size --date <date>
python -m graphbuild ingest --dataset fixed_size --date <date> --ingest-tag <tag>
python -m dense build --dataset fixed_size --date <date> --use-summary
python -m bm25  build --use-text

# Bench
python -m rag.bench.cli generate --dataset fixed_size --date <date> --n 50 --use-summary --provider openai
python -m rag.bench.cli run --adapter hybrid --dataset fixed_size --date <date> --qa-file bench_out/fixed_size/<date>/qa/qa.json --top-k 10 --out bench_out/fixed_size/<date>/evals/runs/hybrid.run
python -m rag.bench.cli evaluate --qrels bench_out/fixed_size/<date>/qa/qrels.txt --run bench_out/fixed_size/<date>/evals/runs/hybrid.run

End‑to‑End Study Replication (BM25, Dense, Hybrid, Hybrid+GraphRAG) — with Latency

This section provides a step‑by‑step runbook for reproducing a study using BM25, Dense, Hybrid, and Hybrid+Graph variants with latency logging.
The commands below use the bench module path (python -m bench ...). If your environment exposes the full path, replace bench with rag.bench.cli (e.g., python -m rag.bench.cli run ...).
Where paths or dates differ, substitute your <DATASET> and <DATE> accordingly.

0) One‑time prep (only if not done yet)

Activate env & make sources importable (PowerShell):

PowerShell
. .\.venv\Scripts\Activate.ps1
$env:PYTHONPATH = "$PWD\src\rag"

Activate env (bash/zsh):

Bash
source .venv/bin/activate
export PYTHONPATH="$PWD/src/rag"

.env (root of repo):

INI
# Data root (either of these is honored depending on component)
CLEANTECH_DATA_DIR=...        # preferred name used across the pipeline
CT_DATA_ROOT=...              # alias used by some bench utilities

# LLMs
GEMINI_KEY=your_gemini_api_key

# Retrievers (optional explicit persist paths)
DENSE_PERSIST_DIR=...
BM25_PERSIST_DIR=...

# GraphRAG / Hybrid+Graph
NEO4J_URI=neo4j://127.0.0.1:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
NEO4J_DATABASE_FIXED_SIZE=graph-fixed-size

Build indexes (once per dataset/date):

Bash
python -m bm25 build  --dataset fixed_size --date 2025-09-14 --use-text
python -m dense build --dataset fixed_size --date 2025-09-14

(Optional) Ensure community vector index ONLINE:

Bash
python -m communities ensure-index --dataset fixed_size --level C1

0b) Latency logging (set once per session/date)

PowerShell:

PowerShell
# where your evals live
$evals = "bench_out\fixed_size\2025-09-14\evals"
mkdir $evals -Force | Out-Null

# end-to-end (all adapters) + Hybrid stage-level logs
$env:BENCH_LATENCY_LOG = "$evals\latency_e2e.jsonl"
$env:HYB_LATENCY_LOG   = "$evals\latency_stages.jsonl"

# optional: start fresh logs
#Remove-Item $env:BENCH_LATENCY_LOG -ErrorAction SilentlyContinue
#Remove-Item $env:HYB_LATENCY_LOG   -ErrorAction SilentlyContinue

bash/zsh:

Bash
evals="bench_out/fixed_size/2025-09-14/evals"
mkdir -p "$evals"

export BENCH_LATENCY_LOG="$evals/latency_e2e.jsonl"
export HYB_LATENCY_LOG="$evals/latency_stages.jsonl"

# optional: start fresh logs
# rm -f "$BENCH_LATENCY_LOG" "$HYB_LATENCY_LOG"

With these set, bench runs will log latency automatically.
bench evaluate will also fold latency percentiles into the printed output and leaderboard rows.
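The percentile folding can be sketched as a nearest-rank computation over the per-query JSONL records. Illustrative only; the record schema here (a `ms` field per line) is an assumption, not the logger's confirmed format.

```python
import json
import math

def latency_percentiles(jsonl_lines, field="ms"):
    """p50/p95 via the nearest-rank method over per-query latency records."""
    values = sorted(json.loads(line)[field] for line in jsonl_lines if line.strip())
    def pct(p):
        idx = max(0, math.ceil(p / 100 * len(values)) - 1)
        return values[idx]
    return {"p50": pct(50), "p95": pct(95), "n": len(values)}

lines = [json.dumps({"qid": i, "ms": ms})
         for i, ms in enumerate([12, 40, 25, 31, 18, 90, 22, 27, 35, 20])]
stats = latency_percentiles(lines)
# 10 samples: median is the 5th-smallest (25 ms), p95 is the largest (90 ms)
```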


1) Generate QA & qrels (Parquet‑first, text‑first)

Bash
python -m bench generate \
  --dataset fixed_size \
  --date 2025-09-14 \
  --n 50 \
  --use-text \
  --provider auto
Outputs
  • bench_out/fixed_size/2025-09-14/qa.parquet
  • bench_out/fixed_size/2025-09-14/qrels.parquet
  • bench_out/fixed_size/2025-09-14/build_meta.json
    (+ qa.jsonl, qrels.json if RAG_BENCH_WRITE_JSON=1)

2) Build baseline runs (BM25, Dense)

BM25 (text)

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet \
  --adapter bm25 --dataset fixed_size --date 2025-09-14 \
  --top-k 100 --bm25-use-text

Dense (Chroma)

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet \
  --adapter dense --dataset fixed_size --date 2025-09-14 \
  --top-k 100

Writes

  • bench_out/fixed_size/2025-09-14/evals/bm25_run.json
  • bench_out/fixed_size/2025-09-14/evals/dense_run.json

3) Hybrid (BM25 ⊕ Dense, no graph) — tune & re‑run

Tune weights to a metric (e.g., recall_100):

Bash
python -m bench tune-hybrid \
  --qa-file bench_out/fixed_size/2025-09-14/qa.parquet \
  --dataset fixed_size --date 2025-09-14 \
  --top-k 100 --metric recall_100

Load tuned weights (PowerShell):

PowerShell
$best = Get-Content -Raw bench_out\fixed_size\2025-09-14\evals\hybrid_tuned.json | ConvertFrom-Json

Run Hybrid with tuned weights (no graph expansion):

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet \
  --adapter hybrid --dataset fixed_size --date 2025-09-14 \
  --top-k 100 --bm25-use-text --no-graph-expand \
  --hyb-w-bm25 $best.w_bm25 --hyb-w-dense $best.w_dense --hyb-rrf-k $best.rrf_k \
  --hyb-seed-k 120

Writes

  • bench_out/fixed_size/2025-09-14/evals/hybrid_run.json

3b) Hybrid + GraphRAG expansion (with tuned weights)

Requires a communities DB for --ingest-tag / --level (e.g., comm_fixed_C1_g1_2 at level C1). Ensure community vector index is ONLINE if searching communities.

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet \
  --adapter hybrid --dataset fixed_size --date 2025-09-14 \
  --graph-expand \
  --ingest-tag comm_fixed_C1_g1_2 --level C1 \
  --hyb-expand-ratio 2.0 --hyb-expand-limit 800 \
  --rerank-mode dense \
  --hyb-w-graph 1.0 \
  --hyb-w-bm25 $best.w_bm25 --hyb-w-dense $best.w_dense --hyb-rrf-k $best.rrf_k \
  --hyb-seed-k 120 --top-k 100 \
  --out bench_out/fixed_size/2025-09-14/evals/hybrid_graph_run.json

Writes

  • bench_out/fixed_size/2025-09-14/evals/hybrid_graph_run.json

4) Evaluate & log (chunk‑level + doc‑level + latency)

BM25

Bash
python -m bench evaluate \
  --qrels bench_out/fixed_size/2025-09-14/qrels.parquet \
  --run   bench_out/fixed_size/2025-09-14/evals/bm25_run.json \
  --run-id BM25_text_2025-09-14

Dense

Bash
python -m bench evaluate \
  --qrels bench_out/fixed_size/2025-09-14/qrels.parquet \
  --run   bench_out/fixed_size/2025-09-14/evals/dense_run.json \
  --run-id Dense_2025-09-14

Hybrid (tuned)

Bash
python -m bench evaluate \
  --qrels bench_out/fixed_size/2025-09-14/qrels.parquet \
  --run   bench_out/fixed_size/2025-09-14/evals/hybrid_run.json \
  --run-id Hybrid_tuned_2025-09-14

Hybrid + GraphRAG (tuned)

Bash
python -m bench evaluate \
  --qrels bench_out/fixed_size/2025-09-14/qrels.parquet \
  --run   bench_out/fixed_size/2025-09-14/evals/hybrid_graph_run.json \
  --run-id Hybrid_tuned_graph_2025-09-14

Each call logs:

  • chunk‑level → bench_out/fixed_size/2025-09-14/evals/leaderboard.csv
  • doc‑level → bench_out/fixed_size/2025-09-14/evals/leaderboard_doc.csv

…and prints latency (end‑to‑end for all adapters; stage‑level for Hybrid). If you prefer explicit paths, pass --e2e-log and/or --stage-log to bench evaluate.

5) Leaderboard (view)

Chunk‑level leaderboard (auto‑picks latest evals/leaderboard.csv):

Bash
python -m bench leaderboard --sort-by recall_100

Doc‑level leaderboard:

Bash
python -m bench leaderboard   --leaderboard bench_out/fixed_size/2025-09-14/evals/leaderboard_doc.csv   --sort-by recall_100

(Optional) One‑shot latency summaries

Bash
python -m bench latency --end-to-end "$BENCH_LATENCY_LOG"
python -m bench latency --stages     "$HYB_LATENCY_LOG"

Notes & tips

  • Graph fix applied: Hybrid graph expansion uses Neo4j 5 / GDS 2‑safe Cypher (COUNT { pattern }), avoiding deprecation errors.
  • GraphRAG prereqs: communities exist for --ingest-tag / --level; ensure vector index ONLINE if you search communities (python -m communities ensure-index --dataset fixed_size --level C1).
  • Doc‑level metrics: bench evaluate logs both chunk‑ and doc‑level automatically.
  • Latency logging model:
      • BENCH_LATENCY_LOG: end‑to‑end per‑query for all adapters.
      • HYB_LATENCY_LOG: stage‑level (seed, graph_expand, rerank, total) for Hybrid only.
  • Evaluator filters by the run’s QIDs — you can reuse the same JSONL logs across runs; to keep per‑run logs, point the env vars to different files before each run.

Environment quick reference

  • Data root: CLEANTECH_DATA_DIR (default ./cleantech_data) or CT_DATA_ROOT (bench alias)
  • Kaggle: KAGGLE_USERNAME, KAGGLE_KEY (or ~/.kaggle/kaggle.json)
  • LLM: GEMINI_KEY, OPENAI_KEY
  • Neo4j: NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD, NEO4J_DATABASE, optional NEO4J_DATABASE_FIXED_SIZE
  • Dense retriever: DENSE_EMBED_MODEL, DENSE_PERSIST_DIR
  • BM25 retriever: BM25_PERSIST_DIR
  • Bench logs: BENCH_LATENCY_LOG, HYB_LATENCY_LOG
  • Enrichment knobs: LEX_SUMMARY_SENTENCES, LEX_MAX_RETRIES, LEX_MIN_EXTRACTIONS, LEX_MIN_SPAN_INTEGRITY