CLI¶

This page is the engineer‑oriented CLI reference for the MT project. It covers all packages end‑to‑end — Bronze, Silver, Unify, Gold (subsample → chunk → enrich → language patch), RAG build & retrieval (graph, communities, dense & BM25), and benchmarks.

The layout for each CLI:

Purpose → what it does and when to use it
Invocation → how to run it
Options → grouped by category; sensible defaults called out
Outputs → what gets written and where
Examples → copy‑paste snippets

Data root: defaults to ./cleantech_data. Override globally with CLEANTECH_DATA_DIR or per command via --download-dir, --bronze-dir, --silver-dir, etc.
Kaggle: export KAGGLE_USERNAME & KAGGLE_KEY (or ~/.kaggle/kaggle.json).
LLM providers: set GEMINI_KEY and/or OPENAI_KEY.
Neo4j: set NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD (and database name if applicable).

Bronze — `cleantech-fetch`¶

Purpose: Fetch Kaggle datasets and OpenAlex Works/Topics into date‑bucketed Bronze folders, writing a per‑run manifest and optional JSONL mirrors for uniform downstream ingestion.

Invocation¶

Bash
cleantech-fetch [OPTIONS]

Options¶

General

--download-dir PATH — root for all artifacts (default: ./cleantech_data). Creates bronze/... as needed.

Kaggle datasets

--kaggle-no-mirror — skip building raw.jsonl.gz mirror; keep only original ZIP.
--kaggle-keep-extracted — preserve the temporary extracted/ CSV/JSON files.

OpenAlex Works

--start YYYY-MM-DD, --end YYYY-MM-DD — publication date window (defaults: current year).
--openalex-per-page 200 — items per page (max 200).
--openalex-pages 0 — number of result pages (0 = no limit).
--openalex-search "..." — full‑text search query.
--openalex-oa-only — restrict to open‑access works.
--openalex-mailto you@example.com — include a mailto for polite API usage.

OpenAlex Topics (topics‑only mode)

--openalex-only-topics — fetch Topics only (skips Works).
--openalex-topics-per-page 200, --openalex-topics-pages 0, --openalex-topics-search "..." — topics paging/search.
--openalex-topics-keep-extracted — keep a plain topics.jsonl under extracted/.

Outputs¶

Text Only

cleantech_data/
  bronze/
    kaggle/<slug>/<YYYY-MM-DD>/
      original.zip
      raw.jsonl.gz                # if mirror enabled
      raw_manifest.jsonl
      extracted/                  # if --kaggle-keep-extracted
    openalex/works/<YYYY-MM-DD>/
      raw.jsonl.gz
      raw_manifest.jsonl
    openalex/topics/<YYYY-MM-DD>/
      topics.jsonl.gz
      raw_manifest.jsonl
      extracted/topics.jsonl      # if --openalex-topics-keep-extracted

Examples¶

Bash
# Works (open access only) with query; cap to 3 pages
cleantech-fetch --openalex-oa-only --openalex-search "renewable OR hydrogen" --openalex-pages 3

# Topics only; keep a plain JSONL copy
cleantech-fetch --openalex-only-topics --openalex-topics-keep-extracted

# Kaggle only; keep extracted CSVs but skip JSONL mirror
cleantech-fetch --kaggle-keep-extracted --kaggle-no-mirror

Silver — `ctclean`¶

Purpose: Canonicalize Bronze snapshots into analysis‑ready Silver tables per dataset (Media, Patents, OpenAlex Topics). Parquet first; CSV.gz fallback if needed.

Invocation¶

Bash
ctclean <subcommand> [OPTIONS]
# subcommands: media | patents | openalex | all

Subcommands & Options¶

ctclean media

--n-rows N — process only first N rows (smoke test).
--bronze-dir DIR — override media Bronze location.
--silver-dir DIR — override output path.
--include-listings / --no-include-listings — keep (default off) or drop listing/archive pages.

Outputs

Text Only
1 2 3 4 5	`cleantech_data/silver/media/<YYYY-MM-DD>/ media_canonical.* # unique articles media_dupe_links.* # duplicates → canonical media_excluded_listings.* media_excluded_non_articles.*`

ctclean patents

--n-rows N, --bronze-dir DIR, --silver-dir DIR

Outputs

Text Only
1 2 3 4	`cleantech_data/silver/patents/<YYYY-MM-DD>/ patent_canonical.* patent_dupe_links.* patents_normalized.*`

ctclean openalex

--n-rows N, --bronze-dir DIR, --silver-dir DIR

Outputs

Text Only
1 2 3 4 5	`cleantech_data/silver/openalex/<YYYY-MM-DD>/ topics_canonical.* topic_keywords_m2m.* topic_siblings_m2m.* domains_ref.* fields_ref.* subfields_ref.*`

ctclean all

Runs media → patents → openalex in sequence, honoring the same flags.

Examples¶

Bash
# Run all three
ctclean all

# Media only; keep listing pages and write to a custom silver root
ctclean media --include-listings --silver-dir ./cleantech_data_custom/silver

Validation & Unify — `ctunify`¶

Purpose: Validate latest Silver buckets and merge into a single Unified Silver file for Gold processing.

Invocation¶

Bash
ctunify run [--output PATH]

Behavior & Outputs¶

Auto‑discovers latest silver/media, silver/patents, silver/openalex buckets.
Default output: cleantech_data/silver/unified/unified_docs.parquet (override with --output).
Non‑destructive: inputs remain unchanged.

Examples¶

Bash
ctunify run                                 # writes default unified path
ctunify run --output ./.../unified_2025-10-27.parquet

Gold — Subsample → Chunk → Enrich → (Language Patch)¶

Subsample — `ctsubsample`¶

Purpose: Create a stratified sample of the unified corpus for faster Gold iteration. Media & patent rows are sampled; topic rows are always kept and appended.

Invocation¶

Bash
ctsubsample [OPTIONS]

Options¶

I/O: --input PATH (default: latest silver/unified/unified_docs.*), --outdir PATH (default: new folder under silver_subsample/).
Size: exactly one of --n INT or --frac FLOAT (e.g., --frac 0.25).
Strata: --by "doc_type,lang" (default). Optional --min-per-stratum, --cap-per-stratum.
Repro/format: --seed 42 (default), --dedupe (by doc_id pre‑sample), --write-csv (write CSV.gz alongside Parquet).

Outputs¶

Text Only
1 2 3 4	`cleantech_data/silver_subsample/<YYYY-MM-DD>/ unified_docs_subsample.parquet unified_docs_subsample.csv.gz # only if --write-csv manifest.json # parameters + TV distance by strata`

Examples¶

Bash
ctsubsample --frac 0.05 --by doc_type,lang --seed 42
ctsubsample --n 5000 --dedupe --outdir ./cleantech_data/silver_subsample/small_run

Chunk — `ctchunk`¶

Purpose: Split documents into smaller chunks for retrieval & enrichment. Two strategies:

fixed: fixed token windows with overlap
semantic: similarity‑aware boundaries using embeddings
both: convenience to run both

Invocation¶

Bash
ctchunk <fixed|semantic|both> [OPTIONS]
# or: python -m ctchunk.cli <fixed|semantic|both> [OPTIONS]

Common Options¶

--input PATH (default: latest silver_subsample/unified_docs_subsample.*)
--outdir PATH (default: silver_subsample_chunk/<mode>/<YYYY-MM-DD>/)
--doc-types "media,patent" (filter; default: all types present)
--n-rows N (debug subset)

Mode: `fixed`¶

--max-tokens 512 — max tokens per chunk
--overlap 64 — tokens overlapped with previous chunk
--prepend-title / --no-prepend-title — include title in chunk text (default: on)

Outputs

Text Only
1 2 3 4	`.../fixed_size/<YYYY-MM-DD>/ chunks.jsonl chunks.parquet manifest.json`

Mode: `semantic`¶

--sim-threshold 0.75 — merge when adjacent similarity ≥ threshold
--min-tokens 50 — minimum chunk size (merge small fragments)
--embedding-model sentence-transformers/distiluse-base-multilingual-cased-v2
--prepend-title / --no-prepend-title

Outputs

Text Only
1 2 3 4	`.../semantic/<YYYY-MM-DD>/ chunks.jsonl chunks.parquet manifest.json`

Examples¶

Bash
# Fixed-size chunking
ctchunk fixed --max-tokens 512 --overlap 64 --doc-types media,patent

# Semantic chunking with stricter boundary detection
ctchunk semantic --sim-threshold 0.8 --min-tokens 80

Enrich — `python -m ctenrichement.cli`¶

Purpose: LLM‑powered enrichment of chunks with summaries, entities, facts, and topic joins → writes Gold artifacts.

Invocation¶

Bash
python -m ctenrichement.cli <all|media|patent|topics|progress> [OPTIONS]

Options¶

Dataset selection: --dataset fixed_size or --dataset semantic (default: both; run twice internally).
Provider & model: --provider auto|openai|gemini, --model-id <name>; --summary-provider, --summary-model-id (advanced). Env: LEX_SUMMARY_SENTENCES=3 (default).
Throughput: --rps 8.0, --max-workers 12; env: LEX_MAX_RETRIES=5, LEX_MIN_EXTRACTIONS=1, LEX_MIN_SPAN_INTEGRITY=0.4.
Resume: --resume none|failed|skip-completed (uses .lex_cache state).
With topics: --with-topics — also build topic assets/joins when enriching media/patent.

Outputs¶

Text Only

cleantech_data/gold_subsample_chunk/<dataset>/<YYYY-MM-DD>/
  chunks_enriched.parquet
  grounded_extractions.jsonl
  visualization_interactive.html
  completed_ids.json
  failed_chunks.json
  failed_details.json
  manifest.json
  topics/                    # if --with-topics
  topic_chunk_join/          # if --with-topics

Examples¶

Bash
# Enrich both datasets with OpenAI; skip completed
python -m ctenrichement.cli all --provider openai --rps 5 --max-workers 8 --resume skip-completed

# Media only on semantic chunks with topic joins
python -m ctenrichement.cli media --dataset semantic --with-topics

(Optional) Language Patch — `python -m patch.language.lang_overwrite`¶

Purpose: Re‑detect / overwrite metadata.lang for chunks using a lightweight LLM prompt; writes patched Gold + audit report.

Invocation¶

Bash
python -m patch.language.lang_overwrite --dataset <fixed_size|semantic> --date YYYY-MM-DD [OPTIONS]

Options¶

--from-text — detect from raw chunk text (default prefers summary if present)
--overwrite / --no-overwrite — replace existing metadata.lang (default: no)
--write-mode copy|inplace — write chunks_enriched_langpatched.* or edit in place (with .bak)
--limit N — process only first N rows
Provider/model: --provider auto|openai|gemini, --openai-model <id>, --gemini-model <id>

Outputs¶

Text Only
1 2 3	`.../gold_subsample_chunk/<dataset>/<YYYY-MM-DD>/ chunks_enriched_langpatched.parquet # if write-mode=copy audit/lang_patch_report.csv`

Example¶

Bash
python -m patch.language.lang_overwrite --dataset fixed_size --date 2025-10-27 --write-mode copy --overwrite

RAG Build & Retrieval¶

Graph Build — `python -m graphbuild`¶

Purpose: Export Gold chunks to graph CSVs and ingest them into Neo4j under a chosen ingest tag for GraphRAG.

Export CSV¶

Bash
python -m graphbuild csv --dataset <fixed_size|semantic> --date YYYY-MM-DD [--outdir PATH]

Writes

Text Only
1 2 3	`gr_nodes.csv gr_edges.csv manifest.json`

Ingest Neo4j¶

Bash
python -m graphbuild ingest --dataset <fixed_size|semantic> --date YYYY-MM-DD --ingest-tag <tag>

- Uses NEO4J_* env vars for connection/auth. - Creates labeled nodes/edges (e.g., Chunk, Community, Entity) under the tag.

Example

Bash
python -m graphbuild csv --dataset fixed_size --date 2025-10-27
python -m graphbuild ingest --dataset fixed_size --date 2025-10-27 --ingest-tag demo_oct27

Communities — `python -m communities`¶

Purpose: Build community structure (Leiden), summarize/embedd communities, and perform GraphRAG retrieval.

Subcommands¶

Bash
python -m communities communities   [--dataset <...>] [--levels "C1:1.2,C2:1.0"] [--ingest-tag <tag>]
python -m communities summaries     [--dataset <...>] [--level C1] [--estimate-only] [--refresh]
python -m communities ensure-index  [--dataset <...>] [--level C1]
python -m communities search        [--dataset <...>] [--level C1] --query "..." [--top-k 10]
python -m communities retrieve      [--dataset <...>] [--level C1] --query "..." [--k-comms 4] [--top-k 12] [--rerank-mode dense|summary]
python -m communities cleanup       --ingest-tag <tag>
python -m communities progress

Common options

--dataset <fixed_size|semantic>
--level C0|C1|C2|C3
--ingest-tag <tag> (for operations tied to a tagged ingestion)

Notes

communities: runs Leiden; use --levels "C1:1.2" to set the resolution (e.g., level C1 at 1.2).
summaries: LLM summarizes + embeds communities; add --estimate-only for dry‑run or --refresh to recompute.
retrieve: GraphRAG retrieval across selected communities; supports optional re‑ranking.

Example

Bash
python -m communities communities --dataset fixed_size --levels "C1:1.2"
python -m communities summaries --dataset fixed_size --level C1
python -m communities retrieve --dataset fixed_size --level C1 --query "solid-state hydrogen storage" --k-comms 4 --top-k 12

Hybrid retrieval (triple retriever) — `python -m rag.retrieval.triple_retriever`¶

Purpose: Fuse BM25 + dense seeds, optionally expand via Neo4j, and optionally apply cross-encoder reranking.

Stage A (IDs only)¶

Bash
python -m rag.retrieval.triple_retriever ids "query text" \
  --dataset fixed_size \
  --date 2025-09-14 \
  --injest-tag comm_fixed_C1_g1_2 \
  --level C1 \
  --top-k 50

Options (defaults)

--dataset (required)
--date (default: latest GOLD bucket)
--injest-tag (typo in CLI; default: None, falls back to COMM_INGEST_TAG)
--level (default: C1)
--top-k (default: 50)
--seed-k (default: 120)
--expand-ratio (default: 2.0)
--expand-limit (default: 800)
--bench-root (default: None)
--dense-date (default: None)
--dense-persist (default: None)
--rerank-mode (default: dense)
--graph-required/--no-graph-required (default: False)
--max-per-doc (default: 0)
--out (default: None; .jsonl | .json | .csv)

Stage B (rerank overlay)¶

Bash
python -m rag.retrieval.triple_retriever rerank "query text" \
  --dataset fixed_size \
  --date 2025-09-14 \
  --ingest-tag comm_fixed_C1_g1_2 \
  --level C1 \
  --fetch-top-n 120 \
  --top-k 20

Options (defaults)

--dataset (required)
--date (default: latest GOLD bucket)
--ingest-tag (default: None, falls back to COMM_INGEST_TAG)
--level (default: C1)
--top-k (default: 20)
--fetch-top-n (default: 120)
--seed-k (default: 120)
--expand-ratio (default: 2.0)
--expand-limit (default: 800)
--bench-root (default: None)
--dense-date (default: None)
--dense-persist (default: None)
--rerank-mode (default: dense)
--rerank-spec (default: hf:BAAI/bge-reranker-base)
--prefer (default: summary)
--include-doc/--no-include-doc (default: False)
--attach-metadata/--no-attach-metadata (default: True)
--graph-required/--no-graph-required (default: False)
--alpha (default: 0.7)
--sort-by (default: final)
--max-per-doc (default: 0)
--diversify (default: none)
--mmr-lambda (default: 0.5)
--min-occurrence / --min-occurance (default: 0)
--fusion (default: None; text | summary)
--coverage/--no-coverage (default: False)
--out (default: None; .jsonl | .json | .csv)

See Retrieval params reference for how CLI defaults compare to app defaults.

Dense Retriever — `python -m dense`¶

Purpose: Build/query a dense vector index (Chroma) over Gold chunks.

Build¶

Bash
python -m dense build --dataset <fixed_size|semantic> --date YYYY-MM-DD [--persist PATH] [--batch 128] [--limit N] [--use-summary|--use-text]

- Env: DENSE_EMBED_MODEL (default text-embedding-004), DENSE_PERSIST_DIR

Query¶

Bash
python -m dense query --top-k 10 [--hydrate] [--json]

- Uses the latest (or --persist) collection; --hydrate attaches chunk text/metadata

Writes

Text Only
1	`vectordb_dense/<model>/<dataset>/<YYYY-MM-DD>/ # or --persist path`

Example

Bash
python -m dense build --dataset fixed_size --date 2025-10-27 --use-summary
python -m dense query --top-k 10 --hydrate --json --query "electrolyzer efficiency roadmap"

BM25 Retriever — `python -m bm25`¶

Purpose: Build/query a lightweight lexical (BM25) index, optionally sharded by language.

Build¶

Bash
python -m bm25 build [--use-summary] [--k1 1.2] [--b 0.75] [--sharding mono|lang] [--limit N] [--persist PATH]

- Env: BM25_PERSIST_DIR

Query¶

Bash
python -m bm25 query --top-k 10 [--use-summary|--use-text] [--route auto|all|<lang>] [--hydrate] [--json] --query "..."

Writes

Text Only
1	`vectordb_bm25/<dataset>/<YYYY-MM-DD>/index.json # plus shards if sharded`

Example

Bash
python -m bm25 build --use-text
python -m bm25 query --top-k 10 --route auto --query "perovskite tandem solar"

Benchmarks — `python -m rag.bench.cli`¶

Purpose: Generate QA sets, run retrieval adapters (BM25, Dense, GraphRAG, Hybrid), and evaluate with TREC metrics. Includes latency summaries and weight tuning for hybrid fusion.

Commands¶

Bash
# 1) Generate QA
python -m rag.bench.cli generate --dataset <fixed_size|semantic> --date YYYY-MM-DD [--n 50] [--use-text|--use-summary] [--provider auto|openai|gemini|none] [--out-dir PATH]

# 2) Run an adapter
python -m rag.bench.cli run --adapter {bm25|dense|graphrag|hybrid} --dataset <...> --date YYYY-MM-DD --qa-file <path> --top-k 10 --out <path>
# adapter-specific flags:
#   BM25: --bm25-use-text|--bm25-use-summary, --bm25-route <auto|all|lang>
#   GraphRAG: --ingest-tag <tag> --level C1 --k-comms 4 --rerank-mode {dense|summary} [--dense-date <YYYY-MM-DD>] [--dense-persist <path>]
#   Hybrid: --hyb-rrf-k 60 --hyb-w-bm25 0.5 --hyb-w-dense 0.5 --hyb-w-graph 0.0 --hyb-seed-k 50 --graph-expand/--no-graph-expand --hyb-expand-ratio 0.2 --hyb-expand-limit 20

# 3) Evaluate
python -m rag.bench.cli evaluate --qrels <path> --run <path> [--run-id <id>] [--leaderboard <path>] [--e2e-log <path>] [--stage-log <path>]

# 4) Tune hybrid
python -m rag.bench.cli tune-hybrid --dataset <...> --date YYYY-MM-DD --qa-file <path> --out <path>

# 5) Latency & leaderboard utils
python -m rag.bench.cli latency --e2e-log <path> --stage-log <path>
python -m rag.bench.cli leaderboard [--leaderboard <path>]

Outputs¶

Text Only

bench_out/<dataset>/<YYYY-MM-DD>/
  qa/...
  evals/
    runs/*.run
    qrels/*.qrels
    metrics/*.json         # e.g., map, ndcg_cut_10, recip_rank, P_5, P_10, recall_100
    hybrid_tuned.json      # from tune-hybrid
    latency/*.jsonl        # optional timing logs
    leaderboard.csv

Example¶

Bash
# Build QA
python -m rag.bench.cli generate --dataset fixed_size --date 2025-10-27 --n 50 --use-summary --provider openai

# Run & evaluate Dense
python -m rag.bench.cli run --adapter dense --dataset fixed_size --date 2025-10-27 --qa-file bench_out/fixed_size/2025-10-27/qa/qa.json --top-k 10 --out bench_out/fixed_size/2025-10-27/evals/runs/dense.run
python -m rag.bench.cli evaluate --qrels bench_out/fixed_size/2025-10-27/qa/qrels.txt --run bench_out/fixed_size/2025-10-27/evals/runs/dense.run --leaderboard bench_out/fixed_size/2025-10-27/evals/leaderboard.csv

End‑to‑End (quick path)¶

Bash
# Bronze
cleantech-fetch --openalex-pages 3 --openalex-per-page 200 --openalex-mailto you@example.com

# Silver
ctclean media
ctclean patents
ctclean openalex
ctunify run

# Gold
ctsubsample --frac 0.25 --by doc_type,lang --seed 42
ctchunk fixed --max-tokens 512 --overlap 64 --doc-types media,patent
python -m ctenrichement.cli all --dataset fixed_size --provider auto --rps 8 --max-workers 12 --resume none --with-topics

# Graph + Retrievers
python -m graphbuild csv --dataset fixed_size --date <date>
python -m graphbuild ingest --dataset fixed_size --date <date> --ingest-tag <tag>
python -m dense build --dataset fixed_size --date <date> --use-summary
python -m bm25  build --use-text

# Bench
python -m rag.bench.cli generate --dataset fixed_size --date <date> --n 50 --use-summary --provider openai
python -m rag.bench.cli run --adapter hybrid --dataset fixed_size --date <date> --qa-file bench_out/fixed_size/<date>/qa/qa.json --top-k 10 --out bench_out/fixed_size/<date>/evals/runs/hybrid.run
python -m rag.bench.cli evaluate --qrels bench_out/fixed_size/<date>/qa/qrels.txt --run bench_out/fixed_size/<date>/evals/runs/hybrid.run

End‑to‑End Study Replication (BM25, Dense, Hybrid, Hybrid+GraphRAG) — with Latency¶

This section provides a step‑by‑step runbook for reproducing a study using BM25, Dense, Hybrid, and Hybrid+Graph variants with latency logging.
The commands below use the bench module path (python -m bench ...). If your environment exposes the full path, replace bench with rag.bench.cli (e.g., python -m rag.bench.cli run ...).
Where paths or dates differ, substitute your <DATASET> and <DATE> accordingly.

0) One‑time prep (only if not done yet)¶

Activate env & make sources importable (PowerShell):

PowerShell
. .\.venv\Scripts\Activate.ps1
$env:PYTHONPATH = "$PWD\src
ag"

Activate env (bash/zsh):

Bash
source .venv/bin/activate
export PYTHONPATH="$PWD/src/rag"

.env (root of repo):

INI
# Data root (either of these is honored depending on component)
CLEANTECH_DATA_DIR=...        # preferred name used across the pipeline
CT_DATA_ROOT=...              # alias used by some bench utilities

# LLMs
GEMINI_KEY=your_gemini_api_key

# Retrievers (optional explicit persist paths)
DENSE_PERSIST_DIR=...
BM25_PERSIST_DIR=...

# GraphRAG / Hybrid+Graph
NEO4J_URI=neo4j://127.0.0.1:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
NEO4J_DATABASE_FIXED_SIZE=graph-fixed-size

Build indexes (once per dataset/date):

Bash
python -m bm25 build  --dataset fixed_size --date 2025-09-14 --use-text
python -m dense build --dataset fixed_size --date 2025-09-14

(Optional) Ensure community vector index ONLINE:

Bash
python -m communities ensure-index --dataset fixed_size --level C1

0b) Latency logging (set once per session/date)¶

PowerShell:

PowerShell
# where your evals live
$evals = "bench_out\fixed_size\5-09-14\evals"
mkdir $evals -Force | Out-Null

# end-to-end (all adapters) + Hybrid stage-level logs
$env:BENCH_LATENCY_LOG = "$evals\latency_e2e.jsonl"
$env:HYB_LATENCY_LOG   = "$evals\latency_stages.jsonl"

# optional: start fresh logs
#Remove-Item $env:BENCH_LATENCY_LOG -ErrorAction SilentlyContinue
#Remove-Item $env:HYB_LATENCY_LOG   -ErrorAction SilentlyContinue

bash/zsh:

Bash
evals="bench_out/fixed_size/2025-09-14/evals"
mkdir -p "$evals"

export BENCH_LATENCY_LOG="$evals/latency_e2e.jsonl"
export HYB_LATENCY_LOG="$evals/latency_stages.jsonl"

# optional: start fresh logs
# rm -f "$BENCH_LATENCY_LOG" "$HYB_LATENCY_LOG"

With these set, bench runs will log latency automatically.
bench evaluate will also fold latency percentiles into the printed output and leaderboard rows.

1) Generate QA & qrels (Parquet‑first, text‑first)¶

Bash
python -m bench generate   --dataset fixed_size   --date 2025-09-14   --n 50   --use-text   --provider auto

Outputs

bench_out/fixed_size/2025-09-14/qa.parquet
bench_out/fixed_size/2025-09-14/qrels.parquet
bench_out/fixed_size/2025-09-14/build_meta.json
(+ qa.jsonl, qrels.json if RAG_BENCH_WRITE_JSON=1)

2) Build baseline runs (BM25, Dense)¶

BM25 (text)

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet   --adapter bm25 --dataset fixed_size --date 2025-09-14   --top-k 100 --bm25-use-text

Dense (Chroma)

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet   --adapter dense --dataset fixed_size --date 2025-09-14   --top-k 100

Writes

bench_out/fixed_size/2025-09-14/evals/bm25_run.json
bench_out/fixed_size/2025-09-14/evals/dense_run.json

3) Hybrid (BM25 ⊕ Dense, no graph) — tune & re‑run¶

Tune weights to a metric (e.g., recall_100):

Bash
python -m bench tune-hybrid   --qa-file bench_out/fixed_size/2025-09-14/qa.parquet   --dataset fixed_size --date 2025-09-14   --top-k 100 --metric recall_100

Load tuned weights (PowerShell):

PowerShell
$best = Get-Content -Raw bench_out\fixed_size\5-09-14\evals\hybrid_tuned.json | ConvertFrom-Json

Run Hybrid with tuned weights (no graph expansion):

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet   --adapter hybrid --dataset fixed_size --date 2025-09-14   --top-k 100 --bm25-use-text --no-graph-expand   --hyb-w-bm25 $best.w_bm25 --hyb-w-dense $best.w_dense --hyb-rrf-k $best.rrf_k   --hyb-seed-k 120

Writes

bench_out/fixed_size/2025-09-14/evals/hybrid_run.json

3b) Hybrid + GraphRAG expansion (with tuned weights)¶

Requires a communities DB for --ingest-tag / --level (e.g., comm_fixed_C1_g1_2 at level C1). Ensure community vector index is ONLINE if searching communities.

Bash
python -m bench run --qa-file bench_out/fixed_size/2025-09-14/qa.parquet   --adapter hybrid --dataset fixed_size --date 2025-09-14   --graph-expand   --ingest-tag comm_fixed_C1_g1_2 --level C1   --hyb-expand-ratio 2.0 --hyb-expand-limit 800   --rerank-mode dense   --hyb-w-graph 1.0   --hyb-w-bm25 $best.w_bm25 --hyb-w-dense $best.w_dense --hyb-rrf-k $best.rrf_k   --hyb-seed-k 120 --top-k 100   --out bench_out/fixed_size/2025-09-14/evals/hybrid_graph_run.json

Writes

bench_out/fixed_size/2025-09-14/evals/hybrid_graph_run.json

4) Evaluate & log (chunk‑level + doc‑level + latency)¶

BM25

Bash
python -m bench evaluate   --qrels bench_out/fixed_size/2025-09-14/qrels.parquet   --run   bench_out/fixed_size/2025-09-14/evals/bm25_run.json   --run-id BM25_text_2025-09-14

Dense

Bash
python -m bench evaluate   --qrels bench_out/fixed_size/2025-09-14/qrels.parquet   --run   bench_out/fixed_size/2025-09-14/evals/dense_run.json   --run-id Dense_2025-09-14

Hybrid (tuned)

Bash
python -m bench evaluate   --qrels bench_out/fixed_size/2025-09-14/qrels.parquet   --run   bench_out/fixed_size/2025-09-14/evals/hybrid_run.json   --run-id Hybrid_tuned_2025-09-14

Hybrid + GraphRAG (tuned)

Bash
python -m bench evaluate   --qrels bench_out/fixed_size/2025-09-14/qrels.parquet   --run   bench_out/fixed_size/2025-09-14/evals/hybrid_graph_run.json   --run-id Hybrid_tuned_graph_2025-09-14

Each call logs:

chunk‑level → bench_out/fixed_size/2025-09-14/evals/leaderboard.csv
doc‑level → bench_out/fixed_size/2025-09-14/evals/leaderboard_doc.csv
…and prints latency (end‑to‑end for all adapters; stage‑level for Hybrid).
If you prefer explicit paths, pass --e2e-log and/or --stage-log to bench evaluate.

5) Leaderboard (view)¶

Chunk‑level leaderboard (auto‑picks latest evals/leaderboard.csv):

Bash
python -m bench leaderboard --sort-by recall_100

Doc‑level leaderboard:

Bash
python -m bench leaderboard   --leaderboard bench_out/fixed_size/2025-09-14/evals/leaderboard_doc.csv   --sort-by recall_100

(Optional) One‑shot latency summaries

Bash
python -m bench latency --end-to-end "$BENCH_LATENCY_LOG"
python -m bench latency --stages     "$HYB_LATENCY_LOG"

Notes & tips

Graph fix applied: Hybrid graph expansion uses Neo4j 5 / GDS 2‑safe Cypher (COUNT { pattern }), avoiding deprecation errors.
GraphRAG prereqs: communities exist for --ingest-tag / --level; ensure vector index ONLINE if you search communities (python -m communities ensure-index --dataset fixed_size --level C1).
Doc‑level metrics: bench evaluate logs both chunk‑ and doc‑level automatically.
Latency logging model:
BENCH_LATENCY_LOG: end‑to‑end per‑query for all adapters.
HYB_LATENCY_LOG: stage‑level (seed, graph_expand, rerank, total) for Hybrid only.
Evaluator filters by the run’s QIDs — you can reuse the same JSONL logs across runs; to keep per‑run logs, point the env vars to different files before each run.

Environment quick reference¶

Data root: CLEANTECH_DATA_DIR (default ./cleantech_data) or CT_DATA_ROOT (bench alias)
Kaggle: KAGGLE_USERNAME, KAGGLE_KEY (or ~/.kaggle/kaggle.json)
LLM: GEMINI_KEY, OPENAI_KEY
Neo4j: NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD, NEO4J_DATABASE, optional NEO4J_DATABASE_FIXED_SIZE
Dense retriever: DENSE_EMBED_MODEL, DENSE_PERSIST_DIR
BM25 retriever: BM25_PERSIST_DIR
Bench logs: BENCH_LATENCY_LOG, HYB_LATENCY_LOG
Enrichment knobs: LEX_SUMMARY_SENTENCES, LEX_MAX_RETRIES, LEX_MIN_EXTRACTIONS, LEX_MIN_SPAN_INTEGRITY

CLI¶

Bronze — cleantech-fetch¶

Invocation¶

Options¶

Outputs¶

Examples¶

Silver — ctclean¶

Invocation¶

Subcommands & Options¶

Examples¶

Validation & Unify — ctunify¶

Invocation¶

Behavior & Outputs¶

Examples¶

Gold — Subsample → Chunk → Enrich → (Language Patch)¶

Subsample — ctsubsample¶

Invocation¶

Options¶

Outputs¶

Examples¶

Chunk — ctchunk¶

Invocation¶

Common Options¶

Mode: fixed¶

Mode: semantic¶

Examples¶

Enrich — python -m ctenrichement.cli¶

Invocation¶

Options¶

Outputs¶

Examples¶

(Optional) Language Patch — python -m patch.language.lang_overwrite¶

Invocation¶

Options¶

Outputs¶

Example¶

RAG Build & Retrieval¶

Graph Build — python -m graphbuild¶

Export CSV¶

Ingest Neo4j¶

Communities — python -m communities¶

Subcommands¶

Hybrid retrieval (triple retriever) — python -m rag.retrieval.triple_retriever¶

Stage A (IDs only)¶

Stage B (rerank overlay)¶

Dense Retriever — python -m dense¶

Build¶

Query¶

BM25 Retriever — python -m bm25¶

Build¶

Query¶

Benchmarks — python -m rag.bench.cli¶

Commands¶

Outputs¶

Example¶

End‑to‑End (quick path)¶

End‑to‑End Study Replication (BM25, Dense, Hybrid, Hybrid+GraphRAG) — with Latency¶

0) One‑time prep (only if not done yet)¶

0b) Latency logging (set once per session/date)¶

1) Generate QA & qrels (Parquet‑first, text‑first)¶

2) Build baseline runs (BM25, Dense)¶

3) Hybrid (BM25 ⊕ Dense, no graph) — tune & re‑run¶

3b) Hybrid + GraphRAG expansion (with tuned weights)¶

4) Evaluate & log (chunk‑level + doc‑level + latency)¶

5) Leaderboard (view)¶

Environment quick reference¶

Bronze — `cleantech-fetch`¶

Silver — `ctclean`¶

Validation & Unify — `ctunify`¶

Subsample — `ctsubsample`¶

Chunk — `ctchunk`¶

Mode: `fixed`¶

Mode: `semantic`¶

Enrich — `python -m ctenrichement.cli`¶

(Optional) Language Patch — `python -m patch.language.lang_overwrite`¶

Graph Build — `python -m graphbuild`¶

Communities — `python -m communities`¶

Hybrid retrieval (triple retriever) — `python -m rag.retrieval.triple_retriever`¶

Dense Retriever — `python -m dense`¶

BM25 Retriever — `python -m bm25`¶

Benchmarks — `python -m rag.bench.cli`¶