CLI¶
This page is the engineer‑oriented CLI reference for the MT project. It covers all packages end‑to‑end — Bronze, Silver, Unify, Gold (subsample → chunk → enrich → language patch), RAG build & retrieval (graph, communities, dense & BM25), and benchmarks.
The layout for each CLI:
- Purpose → what it does and when to use it
- Invocation → how to run it
- Options → grouped by category; sensible defaults called out
- Outputs → what gets written and where
- Examples → copy‑paste snippets
- Data root: defaults to `./cleantech_data`. Override globally with `CLEANTECH_DATA_DIR` or per command via `--download-dir`, `--bronze-dir`, `--silver-dir`, etc.
- Kaggle: export `KAGGLE_USERNAME` & `KAGGLE_KEY` (or use `~/.kaggle/kaggle.json`).
- LLM providers: set `GEMINI_KEY` and/or `OPENAI_KEY`.
- Neo4j: set `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD` (and the database name if applicable).
Bronze — cleantech-fetch¶
Purpose: Fetch Kaggle datasets and OpenAlex Works/Topics into date‑bucketed Bronze folders, writing a per‑run manifest and optional JSONL mirrors for uniform downstream ingestion.
Invocation¶
Options¶
General
- `--download-dir PATH` — root for all artifacts (default: `./cleantech_data`). Creates `bronze/...` as needed.
Kaggle datasets
- `--kaggle-no-mirror` — skip building the `raw.jsonl.gz` mirror; keep only the original ZIP.
- `--kaggle-keep-extracted` — preserve the temporary `extracted/` CSV/JSON files.
OpenAlex Works
- `--start YYYY-MM-DD`, `--end YYYY-MM-DD` — publication date window (defaults: current year).
- `--openalex-per-page 200` — items per page (max 200).
- `--openalex-pages 0` — number of result pages (0 = no limit).
- `--openalex-search "..."` — full-text search query.
- `--openalex-oa-only` — restrict to open-access works.
- `--openalex-mailto you@example.com` — include a mailto for polite API usage.
OpenAlex Topics (topics‑only mode)
- `--openalex-only-topics` — fetch Topics only (skips Works).
- `--openalex-topics-per-page 200`, `--openalex-topics-pages 0`, `--openalex-topics-search "..."` — topics paging/search.
- `--openalex-topics-keep-extracted` — keep a plain `topics.jsonl` under `extracted/`.
Outputs¶
Examples¶
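The original example snippets were not preserved in this export; the sketch below is an illustrative invocation assembled only from the flags documented above (dates and email are placeholders):

```shell
# Fetch OpenAlex Works for a Q1 window, open-access only, with a polite mailto
cleantech-fetch \
  --download-dir ./cleantech_data \
  --start 2025-01-01 --end 2025-03-31 \
  --openalex-per-page 200 --openalex-pages 0 \
  --openalex-oa-only \
  --openalex-mailto you@example.com

# Topics-only mode: skip Works, keep a plain topics.jsonl under extracted/
cleantech-fetch --openalex-only-topics --openalex-topics-keep-extracted
```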
Silver — ctclean¶
Purpose: Canonicalize Bronze snapshots into analysis‑ready Silver tables per dataset (Media, Patents, OpenAlex Topics). Parquet first; CSV.gz fallback if needed.
Invocation¶
Subcommands & Options¶
ctclean media
- `--n-rows N` — process only the first N rows (smoke test).
- `--bronze-dir DIR` — override the media Bronze location.
- `--silver-dir DIR` — override the output path.
- `--include-listings / --no-include-listings` — keep or drop listing/archive pages (default: drop).
Outputs
ctclean patents
- `--n-rows N`, `--bronze-dir DIR`, `--silver-dir DIR`
Outputs
ctclean openalex
- `--n-rows N`, `--bronze-dir DIR`, `--silver-dir DIR`
Outputs
ctclean all
- Runs media → patents → openalex in sequence, honoring the same flags.
Examples¶
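A plausible pair of invocations, using only the flags documented above:

```shell
# Smoke test: canonicalize only the first 500 media rows
ctclean media --n-rows 500

# Full pipeline: media -> patents -> openalex with default locations
ctclean all
```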
Validation & Unify — ctunify¶
Purpose: Validate latest Silver buckets and merge into a single Unified Silver file for Gold processing.
Invocation¶
Behavior & Outputs¶
- Auto-discovers the latest `silver/media`, `silver/patents`, `silver/openalex` buckets.
- Default output: `cleantech_data/silver/unified/unified_docs.parquet` (override with `--output`).
- Non-destructive: inputs remain unchanged.
Examples¶
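An illustrative run, assuming the defaults suffice (`--output` is documented above):

```shell
# Validate the latest Silver buckets and merge into Unified Silver
ctunify

# Same, with an explicit output path
ctunify --output ./cleantech_data/silver/unified/unified_docs.parquet
```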
Gold — Subsample → Chunk → Enrich → (Language Patch)¶
Subsample — ctsubsample¶
Purpose: Create a stratified sample of the unified corpus for faster Gold iteration. Media & patent rows are sampled; topic rows are always kept and appended.
Invocation¶
Options¶
- I/O: `--input PATH` (default: latest `silver/unified/unified_docs.*`), `--outdir PATH` (default: new folder under `silver_subsample/`).
- Size: exactly one of `--n INT` or `--frac FLOAT` (e.g., `--frac 0.25`).
- Strata: `--by "doc_type,lang"` (default). Optional `--min-per-stratum`, `--cap-per-stratum`.
- Repro/format: `--seed 42` (default), `--dedupe` (by `doc_id` pre-sample), `--write-csv` (write CSV.gz alongside Parquet).
Outputs¶
Examples¶
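Illustrative invocations built from the options above (sizes are placeholders):

```shell
# 25% stratified sample by doc_type and lang, deduped, reproducible
ctsubsample --frac 0.25 --by "doc_type,lang" --seed 42 --dedupe

# Fixed-size sample with a per-stratum floor and a CSV.gz mirror
ctsubsample --n 50000 --min-per-stratum 100 --write-csv
```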
Chunk — ctchunk¶
Purpose: Split documents into smaller chunks for retrieval & enrichment. Two strategies, plus a convenience mode:
- `fixed`: fixed token windows with overlap
- `semantic`: similarity-aware boundaries using embeddings
- `both`: convenience mode that runs both
Invocation¶
Common Options¶
- `--input PATH` (default: latest `silver_subsample/unified_docs_subsample.*`)
- `--outdir PATH` (default: `silver_subsample_chunk/<mode>/<YYYY-MM-DD>/`)
- `--doc-types "media,patent"` (filter; default: all types present)
- `--n-rows N` (debug subset)
Mode: fixed¶
- `--max-tokens 512` — max tokens per chunk
- `--overlap 64` — tokens overlapped with the previous chunk
- `--prepend-title / --no-prepend-title` — include the title in chunk text (default: on)
Outputs
Mode: semantic¶
- `--sim-threshold 0.75` — merge when adjacent similarity ≥ threshold
- `--min-tokens 50` — minimum chunk size (merge small fragments)
- `--embedding-model sentence-transformers/distiluse-base-multilingual-cased-v2`
- `--prepend-title / --no-prepend-title`
Outputs
Examples¶
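Illustrative invocations; whether the mode is passed as a subcommand or a flag is not shown on this page, so the spelling below is an assumption (confirm with `ctchunk --help`):

```shell
# Fixed token windows with overlap
ctchunk fixed --max-tokens 512 --overlap 64

# Semantic chunking on media documents only
ctchunk semantic --sim-threshold 0.75 --min-tokens 50 --doc-types "media"
```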
Enrich — python -m ctenrichement.cli¶
Purpose: LLM‑powered enrichment of chunks with summaries, entities, facts, and topic joins → writes Gold artifacts.
Invocation¶
Options¶
- Dataset selection: `--dataset fixed_size` or `--dataset semantic` (default: both; runs twice internally).
- Provider & model: `--provider auto|openai|gemini`, `--model-id <name>`; `--summary-provider`, `--summary-model-id` (advanced). Env: `LEX_SUMMARY_SENTENCES=3` (default).
- Throughput: `--rps 8.0`, `--max-workers 12`; env: `LEX_MAX_RETRIES=5`, `LEX_MIN_EXTRACTIONS=1`, `LEX_MIN_SPAN_INTEGRITY=0.4`.
- Resume: `--resume none|failed|skip-completed` (uses `.lex_cache` state).
- With topics: `--with-topics` — also build topic assets/joins when enriching media/patent.
Outputs¶
Examples¶
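Illustrative runs using only the options documented above:

```shell
# Enrich fixed-size chunks via Gemini, resuming previously failed items
python -m ctenrichement.cli --dataset fixed_size --provider gemini \
  --rps 8.0 --max-workers 12 --resume failed

# Enrich both datasets and also build topic assets/joins
python -m ctenrichement.cli --with-topics
```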
(Optional) Language Patch — python -m patch.language.lang_overwrite¶
Purpose: Re-detect / overwrite `metadata.lang` for chunks using a lightweight LLM prompt; writes patched Gold + an audit report.
Invocation¶
Options¶
- `--from-text` — detect from raw chunk text (the default prefers the summary if present)
- `--overwrite / --no-overwrite` — replace existing `metadata.lang` (default: no)
- `--write-mode copy|inplace` — write `chunks_enriched_langpatched.*` or edit in place (with `.bak`)
- `--limit N` — process only the first N rows
- Provider/model: `--provider auto|openai|gemini`, `--openai-model <id>`, `--gemini-model <id>`
Outputs¶
Example¶
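An illustrative run built from the flags above (the row cap is a placeholder):

```shell
# Re-detect language from raw chunk text, write a patched copy, first 1000 rows
python -m patch.language.lang_overwrite \
  --from-text --overwrite --write-mode copy --limit 1000
```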
RAG Build & Retrieval¶
Graph Build — python -m graphbuild¶
Purpose: Export Gold chunks to graph CSVs and ingest them into Neo4j under a chosen ingest tag for GraphRAG.
Export CSV¶
Ingest Neo4j¶
- Uses `NEO4J_*` env vars for connection/auth.
- Creates labeled nodes/edges (e.g., `Chunk`, `Community`, `Entity`) under the tag.
Example
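The original example was lost in this export; a sketch with hypothetical subcommand names (`export-csv`, `ingest`) and an illustrative tag, which you should confirm with `python -m graphbuild --help`:

```shell
# Hypothetical subcommand names and flags; confirm with --help
python -m graphbuild export-csv --dataset fixed_size
python -m graphbuild ingest --dataset fixed_size --ingest-tag comm_fixed_C1_g1_2
```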
Communities — python -m communities¶
Purpose: Build community structure (Leiden), summarize/embed communities, and perform GraphRAG retrieval.
Subcommands¶
Common options
- `--dataset <fixed_size|semantic>`
- `--level C0|C1|C2|C3`
- `--ingest-tag <tag>` (for operations tied to a tagged ingestion)
Notes
- `communities`: runs Leiden; use `--levels "C1:1.2"` to set the resolution (e.g., level C1 at 1.2).
- `summaries`: LLM summarizes + embeds communities; add `--estimate-only` for a dry run or `--refresh` to recompute.
- `retrieve`: GraphRAG retrieval across selected communities; supports optional re-ranking.
Example
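An illustrative sequence using the subcommands named in the notes above:

```shell
# Run Leiden at level C1 with resolution 1.2
python -m communities communities --dataset fixed_size --levels "C1:1.2"

# Dry-run cost estimate for community summarization
python -m communities summaries --dataset fixed_size --level C1 --estimate-only
```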
Hybrid retrieval (triple retriever) — python -m rag.retrieval.triple_retriever¶
Purpose: Fuse BM25 + dense seeds, optionally expand via Neo4j, and optionally apply cross-encoder reranking.
Stage A (IDs only)¶
Options (defaults)
- `--dataset` (required)
- `--date` (default: latest GOLD bucket)
- `--injest-tag` (typo in the CLI; default: None, falls back to `COMM_INGEST_TAG`)
- `--level` (default: C1)
- `--top-k` (default: 50)
- `--seed-k` (default: 120)
- `--expand-ratio` (default: 2.0)
- `--expand-limit` (default: 800)
- `--bench-root` (default: None)
- `--dense-date` (default: None)
- `--dense-persist` (default: None)
- `--rerank-mode` (default: dense)
- `--graph-required / --no-graph-required` (default: False)
- `--max-per-doc` (default: 0)
- `--out` (default: None; `.jsonl` | `.json` | `.csv`)
Stage B (rerank overlay)¶
Options (defaults)
- `--dataset` (required)
- `--date` (default: latest GOLD bucket)
- `--ingest-tag` (default: None, falls back to `COMM_INGEST_TAG`)
- `--level` (default: C1)
- `--top-k` (default: 20)
- `--fetch-top-n` (default: 120)
- `--seed-k` (default: 120)
- `--expand-ratio` (default: 2.0)
- `--expand-limit` (default: 800)
- `--bench-root` (default: None)
- `--dense-date` (default: None)
- `--dense-persist` (default: None)
- `--rerank-mode` (default: dense)
- `--rerank-spec` (default: `hf:BAAI/bge-reranker-base`)
- `--prefer` (default: summary)
- `--include-doc / --no-include-doc` (default: False)
- `--attach-metadata / --no-attach-metadata` (default: True)
- `--graph-required / --no-graph-required` (default: False)
- `--alpha` (default: 0.7)
- `--sort-by` (default: final)
- `--max-per-doc` (default: 0)
- `--diversify` (default: none)
- `--mmr-lambda` (default: 0.5)
- `--min-occurrence / --min-occurance` (default: 0)
- `--fusion` (default: None; text | summary)
- `--coverage / --no-coverage` (default: False)
- `--out` (default: None; `.jsonl` | `.json` | `.csv`)
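A hedged Stage B sketch built only from the options listed above; how Stage A versus Stage B is selected is not shown on this page, so check the module's `--help`:

```shell
# Stage B rerank overlay (flag spellings taken from the options list above)
python -m rag.retrieval.triple_retriever \
  --dataset fixed_size --level C1 \
  --top-k 20 --fetch-top-n 120 \
  --rerank-mode dense --rerank-spec hf:BAAI/bge-reranker-base \
  --alpha 0.7 --sort-by final \
  --out hybrid_top20.jsonl
```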
See Retrieval params reference for how CLI defaults compare to app defaults.
Dense Retriever — python -m dense¶
Purpose: Build/query a dense vector index (Chroma) over Gold chunks.
Build¶
Env: `DENSE_EMBED_MODEL` (default `text-embedding-004`), `DENSE_PERSIST_DIR`.
Query¶
Queries the persisted (`--persist`) collection; `--hydrate` attaches chunk text/metadata.
Writes
Example
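The example block was lost; a sketch with hypothetical `build`/`query` subcommand names (confirm with `python -m dense --help`; `--hydrate` is documented above, the query string is a placeholder):

```shell
# Hypothetical subcommand names; confirm with --help
python -m dense build --dataset fixed_size
python -m dense query --dataset fixed_size --hydrate "grid-scale battery storage"
```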
BM25 Retriever — python -m bm25¶
Purpose: Build/query a lightweight lexical (BM25) index, optionally sharded by language.
Build¶
Env: `BM25_PERSIST_DIR`.
Query¶
Writes
Example
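A sketch mirroring the Dense retriever, with hypothetical `build`/`query` subcommand names and a placeholder query; confirm with `python -m bm25 --help`:

```shell
# Hypothetical subcommand names; confirm with --help
python -m bm25 build --dataset fixed_size
python -m bm25 query --dataset fixed_size "offshore wind permitting"
```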
Benchmarks — python -m rag.bench.cli¶
Purpose: Generate QA sets, run retrieval adapters (BM25, Dense, GraphRAG, Hybrid), and evaluate with TREC metrics. Includes latency summaries and weight tuning for hybrid fusion.
Commands¶
Outputs¶
Example¶
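The example was not preserved; a minimal sketch of the run-then-evaluate loop. The `run` and `evaluate` subcommands appear elsewhere on this page, but the flag spellings here are assumptions:

```shell
# Flag spellings beyond the subcommand names are assumptions
python -m rag.bench.cli run --dataset fixed_size
python -m rag.bench.cli evaluate --dataset fixed_size
```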
End‑to‑End (quick path)¶
End‑to‑End Study Replication (BM25, Dense, Hybrid, Hybrid+GraphRAG) — with Latency¶
This section provides a step‑by‑step runbook for reproducing a study using BM25, Dense, Hybrid, and Hybrid+Graph variants with latency logging.
The commands below use the `bench` module path (`python -m bench ...`). If your environment exposes the full path, replace `bench` with `rag.bench.cli` (e.g., `python -m rag.bench.cli run ...`).
Where paths or dates differ, substitute your `<DATASET>` and `<DATE>` accordingly.
0) One‑time prep (only if not done yet)¶
Activate env & make sources importable (PowerShell):
Activate env (bash/zsh):
.env (root of repo):
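The original `.env` listing was not preserved; a sketch assembled from the environment quick reference at the bottom of this page (all values are placeholders):

```shell
# Placeholder values; fill in your own credentials
CLEANTECH_DATA_DIR=./cleantech_data
KAGGLE_USERNAME=your_user
KAGGLE_KEY=your_key
GEMINI_KEY=your_gemini_key
OPENAI_KEY=your_openai_key
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=changeme
```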
Build indexes (once per dataset/date):
(Optional) Ensure community vector index ONLINE:
0b) Latency logging (set once per session/date)¶
PowerShell:
bash/zsh:
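The command block was lost; an illustrative bash/zsh setup (the file names under the evals folder are assumptions, any writable paths work):

```shell
# Point both latency logs at the current dataset/date bucket
export BENCH_LATENCY_LOG=bench_out/fixed_size/2025-09-14/evals/latency_e2e.jsonl
export HYB_LATENCY_LOG=bench_out/fixed_size/2025-09-14/evals/latency_stages.jsonl
```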
With these set, bench runs will log latency automatically.
`bench evaluate` will also fold latency percentiles into the printed output and leaderboard rows.
1) Generate QA & qrels (Parquet‑first, text‑first)¶
- `bench_out/fixed_size/2025-09-14/qa.parquet`
- `bench_out/fixed_size/2025-09-14/qrels.parquet`
- `bench_out/fixed_size/2025-09-14/build_meta.json`
- (+ `qa.jsonl`, `qrels.json` if `RAG_BENCH_WRITE_JSON=1`)
2) Build baseline runs (BM25, Dense)¶
BM25 (text)
Dense (Chroma)
Writes
- `bench_out/fixed_size/2025-09-14/evals/bm25_run.json`
- `bench_out/fixed_size/2025-09-14/evals/dense_run.json`
3) Hybrid (BM25 ⊕ Dense, no graph) — tune & re‑run¶
Tune weights to a metric (e.g., recall_100):
Load tuned weights (PowerShell):
Run Hybrid with tuned weights (no graph expansion):
Writes `bench_out/fixed_size/2025-09-14/evals/hybrid_run.json`.
3b) Hybrid + GraphRAG expansion (with tuned weights)¶
Requires a communities DB for `--ingest-tag`/`--level` (e.g., `comm_fixed_C1_g1_2` at level `C1`). Ensure the community vector index is ONLINE if searching communities.
Writes `bench_out/fixed_size/2025-09-14/evals/hybrid_graph_run.json`.
4) Evaluate & log (chunk‑level + doc‑level + latency)¶
BM25
Dense
Hybrid (tuned)
Hybrid + GraphRAG (tuned)
- chunk-level → `bench_out/fixed_size/2025-09-14/evals/leaderboard.csv`
- doc-level → `bench_out/fixed_size/2025-09-14/evals/leaderboard_doc.csv`
…and prints latency (end‑to‑end for all adapters; stage‑level for Hybrid).
If you prefer explicit paths, pass `--e2e-log` and/or `--stage-log` to `bench evaluate`.
5) Leaderboard (view)¶
Chunk-level leaderboard (auto-picks the latest `evals/leaderboard.csv`):
Doc‑level leaderboard:
(Optional) One‑shot latency summaries
Notes & tips
- Graph fix applied: Hybrid graph expansion uses Neo4j 5 / GDS 2-safe Cypher (`COUNT { pattern }`), avoiding deprecation errors.
- GraphRAG prereqs: communities must exist for `--ingest-tag`/`--level`; ensure the vector index is ONLINE if you search communities (`python -m communities ensure-index --dataset fixed_size --level C1`).
- Doc-level metrics: `bench evaluate` logs both chunk- and doc-level automatically.
- Latency logging model: `BENCH_LATENCY_LOG` captures end-to-end per-query latency for all adapters; `HYB_LATENCY_LOG` captures stage-level latency (seed, graph_expand, rerank, total) for Hybrid only.
- The evaluator filters by the run's QIDs — you can reuse the same JSONL logs across runs; to keep per-run logs, point the env vars to different files before each run.
Environment quick reference¶
- Data root: `CLEANTECH_DATA_DIR` (default `./cleantech_data`) or `CT_DATA_ROOT` (bench alias)
- Kaggle: `KAGGLE_USERNAME`, `KAGGLE_KEY` (or `~/.kaggle/kaggle.json`)
- LLM: `GEMINI_KEY`, `OPENAI_KEY`
- Neo4j: `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`, `NEO4J_DATABASE`, optional `NEO4J_DATABASE_FIXED_SIZE`
- Dense retriever: `DENSE_EMBED_MODEL`, `DENSE_PERSIST_DIR`
- BM25 retriever: `BM25_PERSIST_DIR`
- Bench logs: `BENCH_LATENCY_LOG`, `HYB_LATENCY_LOG`
- Enrichment knobs: `LEX_SUMMARY_SENTENCES`, `LEX_MAX_RETRIES`, `LEX_MIN_EXTRACTIONS`, `LEX_MIN_SPAN_INTEGRITY`