# References
Curated references and upstream resources used by this project, grouped by topic. Brief notes explain how each item is used in practice.
## Data sources & taxonomies

- Kaggle — https://www.kaggle.com · API/CLI: https://github.com/Kaggle/kaggle-api
  Used in Bronze (`cleantech-fetch`) to mirror upstream datasets into immutable snapshots.
- OpenAlex Topics — https://docs.openalex.org/api-entities/topics
  Transparent hierarchy (domain → field → subfield → topic) for alignment, coverage, and explainability; OpenAlex Works may be used only for augment‑on‑demand.
- CPC (Cooperative Patent Classification) — EPO overview: https://www.epo.org/en/searching-for-patents/helpful-resources/first-time-here/classification/cpc
  Patent technology vocabulary; kept as per‑chunk metadata for filtering/faceting (e.g., Y02E).
- WIPO Patent Landscape Guidelines — https://doi.org/10.34667/tind.28858
  Anchors for patent‑analytics practice and reporting.
## Pipeline & CLI

- Typer — https://typer.tiangolo.com · CLI for `ctclean`, `ctsubsample`, `ctchunk`, `ctenrichement`, etc.
- Pandas — https://pandas.pydata.org · Tabular processing across Bronze/Silver/Gold.
- Rich — https://rich.readthedocs.io/ · Progress and logging (optional).
## Storage & serialization

- Apache Parquet (PyArrow) — https://arrow.apache.org/docs/python/
  Default columnar format; `ctclean.io.safe_write` handles mixed dtypes and falls back to CSV.gz if needed.
- fastparquet — https://fastparquet.readthedocs.io/ · Alternate engine.
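The write-with-fallback pattern can be sketched as follows. This is a minimal illustration only — the real `ctclean.io.safe_write` in this repo likely differs in signature and error handling:

```python
import gzip
from pathlib import Path

import pandas as pd


def safe_write(df: pd.DataFrame, dest: Path) -> Path:
    """Write a DataFrame as Parquet; fall back to CSV.gz on failure.

    Mixed-dtype object columns can make Parquet serialization raise,
    so any serialization error triggers the CSV.gz fallback.
    """
    dest = Path(dest)
    try:
        out = dest.with_suffix(".parquet")
        df.to_parquet(out)  # requires pyarrow or fastparquet
        return out
    except Exception:
        out = dest.with_suffix(".csv.gz")
        with gzip.open(out, "wt", newline="") as fh:
            df.to_csv(fh, index=False)
        return out
```

The caller inspects the returned path's suffix to learn which format was actually written.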
## Retrieval, ranking & RAG

- BM25 (Robertson & Zaragoza) — overview: https://www.nowpublishers.com/article/Details/INR-019
  Sparse lexical baseline used by `python -m bm25`.
- Sentence‑BERT (bi‑encoder) — https://arxiv.org/abs/1908.10084 · docs: https://www.sbert.net/
  Dense retrieval and semantic chunking.
- Reciprocal Rank Fusion (RRF) — https://doi.org/10.1145/1571941.1572114
  Robust fusion of BM25 and dense ranks in the Hybrid adapter.
- Cross‑encoder re‑ranking — https://www.sbert.net/examples/applications/cross-encoder/README.html
  Improves early precision before composition.
- Maximal Marginal Relevance (MMR) — https://dl.acm.org/doi/10.1145/290941.291025
  Reduces redundancy and improves topical breadth.
- RAG (survey) — https://arxiv.org/abs/2409.14924 · Primer on RAG methods and evaluation.
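For orientation, RRF as cited above reduces to a few lines: each document scores the sum of `1 / (k + rank)` over the input rankings. The sketch below is illustrative, not the Hybrid adapter's actual implementation; `k = 60` is the constant suggested in the RRF paper:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists with Reciprocal Rank Fusion.

    Each ranking contributes 1 / (k + rank) per document (rank is
    1-based); documents are returned by descending fused score.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF needs no score normalization between the sparse and dense runs — the main reason it is a robust default for hybrid retrieval.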
## Graph & communities

- Neo4j Cypher — LOAD CSV — https://neo4j.com/docs/cypher-manual/current/clauses/load-csv/
  Used to ingest GraphRAG CSV exports.
- GraphRAG (patterns & repo) — https://github.com/microsoft/graphrag
  Query‑focused summarization and graph‑assisted retrieval ideas for exploration.
- Leiden communities — https://www.nature.com/articles/s41598-019-41695-z
  Community detection used for cluster labels & diversity analysis.
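An ingest of such an export might look like the following Cypher fragment. The file name and column names (`id`, `title`, `community`) are purely illustrative — the actual GraphRAG export schema differs:

```cypher
// Sketch only: ingest a hypothetical nodes.csv export.
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
MERGE (n:Entity {id: row.id})
SET n.title = row.title,
    n.community = toInteger(row.community);
```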
## Vector & lexical stores

- ChromaDB — https://docs.trychroma.com · Dense vector store for `python -m dense`.
- rank‑bm25 (PyPI) — https://pypi.org/project/rank-bm25/ · Sparse retriever used by `python -m bm25`.
- NLTK — https://www.nltk.org/ · Stopwords/tokenization utilities (optional).
## Evaluation

- pytrec_eval — https://github.com/cvangysel/pytrec_eval · TREC metrics for the benchmark runner.
- Introduction to Information Retrieval — https://nlp.stanford.edu/IR-book/ · Textbook reference for nDCG, MAP, etc.
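As a refresher on the nDCG metric listed above (the benchmark runner itself uses pytrec_eval), a minimal pure-Python version following the textbook definition:

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: list[float]) -> float:
    """Normalize DCG by the ideal (descending-relevance) ordering of the same list."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Here `relevances` are the graded judgments of the returned documents in ranked order; a perfect ranking scores exactly 1.0.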
## LLM providers (optional)

- Google Gemini — https://ai.google.dev/ · Embeddings/summaries (`GEMINI_KEY`).
- OpenAI — https://platform.openai.com/docs · Optional summaries/rerankers (`OPENAI_KEY`).
## Documentation stack
- MkDocs — https://www.mkdocs.org · Material for MkDocs — https://squidfunk.github.io/mkdocs-material/
- mkdocstrings (Python) — https://mkdocstrings.github.io/python/ · mkdocs‑jupyter — https://github.com/danielfrg/mkdocs-jupyter
- mkdocs-video — https://github.com/soulless-viewer/mkdocs-video
## Study & internal docs

- GRAINS preliminary study (Gerber, 2025) — internal PDF in this repo.
  Source for the acceptance budgets (Recall/nDCG/Precision/Coverage, explainability ≥ 2 citations, p95 latency envelopes), the single‑adapter serving path, and the evaluation protocol used across these docs.