FAQ

Where does data get stored?
Default root is ./cleantech_data. Override globally with CLEANTECH_DATA_DIR or use command flags (--download-dir, --bronze-dir, --silver-dir). See the CLI.

How are Kaggle credentials provided?
Set the env vars KAGGLE_USERNAME and KAGGLE_KEY, or place credentials in ~/.kaggle/kaggle.json. The Bronze fetch mirrors upstream archives to immutable snapshots with per‑run manifests.
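For reference, kaggle.json is just a JSON object with username and key fields. A minimal sketch of writing one (the helper name and paths are illustrative, not part of this project):

```python
import json
from pathlib import Path

def write_kaggle_credentials(username: str, key: str, config_dir: Path) -> Path:
    """Write a kaggle.json with owner-only permissions, as the Kaggle CLI expects."""
    config_dir.mkdir(parents=True, exist_ok=True)
    cred_path = config_dir / "kaggle.json"
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # Kaggle warns if this file is readable by other users
    return cred_path
```

Env vars take precedence for most tooling, so the file is only consulted when KAGGLE_USERNAME/KAGGLE_KEY are unset.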

How do I stay within OpenAlex rate limits?
Provide --openalex-mailto you@example.com, keep --openalex-per-page modest, optionally restrict --openalex-search, and use --openalex-only-topics when you just need the taxonomy.
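OpenAlex's polite pool only requires a mailto parameter on each request. A hypothetical URL builder, assuming direct calls to api.openalex.org rather than the fetch command (the function name is illustrative):

```python
from urllib.parse import urlencode

def openalex_works_url(mailto: str, per_page: int = 25, search: str = "") -> str:
    """Build a /works request URL; mailto routes you to OpenAlex's faster polite pool."""
    params = {"mailto": mailto, "per-page": per_page}
    if search:
        params["search"] = search  # narrower searches mean fewer pages to fetch
    return "https://api.openalex.org/works?" + urlencode(params)
```

Keeping per-page modest and the search narrow reduces both the number of requests and the size of each response.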

Which taxonomy do you use and why?
OpenAlex Topics (domain→field→subfield→topic) is the only taxonomy represented as nodes/edges in the graph. Patent CPC codes stay as per‑chunk metadata for filtering/faceting. This maximizes transparency and keeps traversal simple.

What are the core metrics and acceptance budgets?
We track @K metrics at pool/final checkpoints: Recall@K (pool ≥ 0.90, final ≥ 0.60), nDCG@K (final ≥ 0.68), Precision@K (final ≥ 0.25), Coverage@K across OpenAlex subfields (≥ 0.60), Explainability rate (≥ 2 distinct citations per answer, ≥ 70%), and E2E p95 latency (≤ 2.5 s cached / ≤ 5 s cold). These are targets to validate, not results.
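For concreteness, the @K metrics above can be sketched as follows (illustrative helpers, not the project's evaluation code):

```python
import math

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant items that appear in the top-K."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-K that is relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def ndcg_at_k(retrieved: list, gains: dict, k: int) -> float:
    """Discounted gain of the top-K, normalized by the ideal ordering."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

The pool checkpoint is judged on Recall@K only; the final checkpoint is where the nDCG, Precision, Coverage, and Explainability budgets apply.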

How do I interpret Coverage@K?
It’s the fraction of distinct OpenAlex subfields represented in the final top‑K, normalized by min(K, number of subfields in the candidate pool)—a direct measure of topical breadth after diversity control.
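A minimal sketch of that definition (names are illustrative):

```python
def coverage_at_k(topk_subfields: list, pool_subfields: set, k: int) -> float:
    """Distinct subfields in the final top-K, normalized by min(K, subfields in the pool)."""
    distinct = set(topk_subfields[:k]) & pool_subfields
    denom = min(k, len(pool_subfields))
    return len(distinct) / denom if denom else 0.0
```

The min(K, …) denominator keeps the metric fair when the candidate pool spans fewer subfields than K, so a maximally diverse top-K always scores 1.0.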

Fixed vs Semantic chunking — which should I use?
Start with fixed (--max-tokens 512 --overlap 64) for a predictable baseline. On a subset, semantic chunking increased volume by ~2.7× (e.g., 8,538 → 23,482 chunks), which affects indexing and latency. Choose one mode per run and keep the other fixed for sensitivity checks.
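The fixed mode described above amounts to a sliding token window. A rough sketch, assuming pre-tokenized input (this is not the project's actual chunker):

```python
def fixed_chunks(tokens: list, max_tokens: int = 512, overlap: int = 64) -> list:
    """Slide a fixed-size window with overlap; defaults mirror --max-tokens/--overlap."""
    assert 0 <= overlap < max_tokens, "overlap must be smaller than the window"
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # last window already reached the end of the document
    return chunks
```

Because chunk boundaries depend only on counts, the number of chunks per document is predictable, which is what makes fixed mode a stable baseline for indexing and latency comparisons.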

Can I fetch topics without works?
Yes: cleantech-fetch --openalex-only-topics (add --openalex-topics-keep-extracted to leave a plain JSONL copy).

Graph/communities prerequisites?
Ingest GraphRAG CSVs to Neo4j, then build communities (Leiden). If a vector index is needed for community search, run python -m communities ensure-index --dataset <...> --level C1.