# References
Curated references and upstream resources used by this project, grouped by topic. Brief notes explain how each item is used in practice.
## Data sources & taxonomies

- Kaggle — https://www.kaggle.com · API/CLI: https://github.com/Kaggle/kaggle-api
  Used in Bronze (`cleantech-fetch`) to mirror upstream datasets into immutable snapshots.
- OpenAlex Topics — https://docs.openalex.org/api-entities/topics
  Transparent hierarchy (domain → field → subfield → topic) for alignment, coverage, and explainability; OpenAlex Works may be used only for augment‑on‑demand.
- CPC (Cooperative Patent Classification) — EPO overview: https://www.epo.org/en/searching-for-patents/helpful-resources/first-time-here/classification/cpc
  Patent technology vocabulary; kept as per‑chunk metadata for filtering/faceting (e.g., Y02E).
- WIPO Patent Landscape Guidelines — https://doi.org/10.34667/tind.28858
  Anchors for patent‑analytics practice and reporting.
## Pipeline & CLI

- Typer — https://typer.tiangolo.com · CLI for `ctclean`, `ctsubsample`, `ctchunk`, `ctenrichement`, etc.
- Pandas — https://pandas.pydata.org · Tabular processing across Bronze/Silver/Gold.
- Rich — https://rich.readthedocs.io/ · Progress and logging (optional).
## Storage & serialization

- Apache Parquet (PyArrow) — https://arrow.apache.org/docs/python/
  Default columnar format; `ctclean.io.safe_write` handles mixed dtypes and falls back to CSV.gz if needed.
- fastparquet — https://fastparquet.readthedocs.io/ · Alternate engine.
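The write-with-fallback pattern can be sketched as follows. This is a minimal illustration only — the real `ctclean.io.safe_write` in this repo likely differs in signature and error handling:

```python
import gzip
from pathlib import Path

import pandas as pd


def safe_write(df: pd.DataFrame, dest: Path) -> Path:
    """Write a DataFrame as Parquet; fall back to CSV.gz on failure.

    Mixed-dtype object columns can make Parquet serialization raise,
    so any serialization error triggers the CSV.gz fallback.
    """
    dest = Path(dest)
    try:
        out = dest.with_suffix(".parquet")
        df.to_parquet(out)  # requires pyarrow or fastparquet
        return out
    except Exception:
        out = dest.with_suffix(".csv.gz")
        with gzip.open(out, "wt", newline="") as fh:
            df.to_csv(fh, index=False)
        return out
```

The caller inspects the returned path's suffix to learn which format was actually written.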
## Retrieval, ranking & RAG

- BM25 (Robertson & Zaragoza) — overview: https://www.nowpublishers.com/article/Details/INR-019
  Sparse lexical baseline used by `python -m bm25`.
- Sentence‑BERT (bi‑encoder) — https://arxiv.org/abs/1908.10084 · docs: https://www.sbert.net/
  Dense retrieval and semantic chunking.
- Reciprocal Rank Fusion (RRF) — https://doi.org/10.1145/1571941.1572114
  Robust fusion of BM25 and dense ranks in the Hybrid adapter.
- Cross‑encoder re‑ranking — https://www.sbert.net/examples/applications/cross-encoder/README.html
  Improves early precision before composition.
- Maximal Marginal Relevance (MMR) — https://dl.acm.org/doi/10.1145/290941.291025
  Reduces redundancy and improves topical breadth.
- RAG (survey) — https://arxiv.org/abs/2409.14924 · Primer on RAG methods and evaluation.
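For orientation, RRF as cited above reduces to a few lines: each document scores the sum of `1 / (k + rank)` over the input rankings. The sketch below is illustrative, not the Hybrid adapter's actual implementation; `k = 60` is the constant suggested in the RRF paper:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists with Reciprocal Rank Fusion.

    Each ranking contributes 1 / (k + rank) per document (rank is
    1-based); documents are returned by descending fused score.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF needs no score normalization between the sparse and dense runs — the main reason it is a robust default for hybrid retrieval.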
## Graph & communities

- Neo4j Cypher — LOAD CSV — https://neo4j.com/docs/cypher-manual/current/clauses/load-csv/
  Used to ingest GraphRAG CSV exports.
- GraphRAG (patterns & repo) — https://github.com/microsoft/graphrag
  Query‑focused summarization and graph‑assisted retrieval ideas for exploration.
- Leiden communities — https://www.nature.com/articles/s41598-019-41695-z
  Community detection used for cluster labels & diversity analysis.
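An ingest of such an export might look like the following Cypher fragment. The file name and column names (`id`, `title`, `community`) are purely illustrative — the actual GraphRAG export schema differs:

```cypher
// Sketch only: ingest a hypothetical nodes.csv export.
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
MERGE (n:Entity {id: row.id})
SET n.title = row.title,
    n.community = toInteger(row.community);
```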
## Vector & lexical stores

- ChromaDB — https://docs.trychroma.com · Dense vector store for `python -m dense`.
- rank‑bm25 (PyPI) — https://pypi.org/project/rank-bm25/ · Sparse retriever used by `python -m bm25`.
- NLTK — https://www.nltk.org/ · Stopwords/tokenization utilities (optional).
## Evaluation

- pytrec_eval — https://github.com/cvangysel/pytrec_eval · TREC metrics for the benchmark runner.
- Introduction to Information Retrieval — https://nlp.stanford.edu/IR-book/ · Textbook reference for nDCG, MAP, etc.
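As a refresher on the nDCG metric listed above (the benchmark runner itself uses pytrec_eval), a minimal pure-Python version following the textbook definition:

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: list[float]) -> float:
    """Normalize DCG by the ideal (descending-relevance) ordering of the same list."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Here `relevances` are the graded judgments of the returned documents in ranked order; a perfect ranking scores exactly 1.0.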
## LLM providers (optional)

- Google Gemini — https://ai.google.dev/ · Embeddings/summaries (`GEMINI_KEY`).
- OpenAI — https://platform.openai.com/docs · Optional summaries/rerankers (`OPENAI_KEY`).
## Documentation stack
- MkDocs — https://www.mkdocs.org · Material for MkDocs — https://squidfunk.github.io/mkdocs-material/
- mkdocstrings (Python) — https://mkdocstrings.github.io/python/ · mkdocs‑jupyter — https://github.com/danielfrg/mkdocs-jupyter
- mkdocs-video — https://github.com/soulless-viewer/mkdocs-video
## Study & internal docs

- GRAINS preliminary study (Gerber, 2025) — internal PDF in this repo.
  Source for the acceptance budgets (Recall/nDCG/Precision/Coverage, explainability ≥ 2 citations, p95 latency envelopes), the single‑adapter serving path, and the evaluation protocol used across these docs.