# Architecture

## Overview
The Cleantech data platform spans from the Bronze ingestion layer through Silver normalization and Unified publishing into the Gold subsample that seeds the RAG build. Kaggle media and patent datasets, together with OpenAlex Works and Topics, land in Bronze, progress through `ctclean` into canonical tables, are consolidated by `ctunify`, and finally become enriched Gold chunks that power the graph, dense, and sparse retrieval tracks. Community detection merges those indices to support the GraphRAG and hybrid adapters that drive internal benchmarks and the public demo.
## Flow
- Kaggle media/patent datasets and OpenAlex Works/Topics feed the Bronze layer.
- Bronze (`cleantech-fetch`) writes original archives, JSONL mirrors, and `raw_manifest.jsonl` with `run_id`, parameters, `sha256`, `records`, and `git_commit` in date-bucketed directories.
- Silver (`ctclean`) runs the `media`, `patents`, and `openalex` pipelines to produce canonical and normalized tables.
- Unified (`ctunify`) reads the latest Silver buckets and outputs `silver/unified/unified_docs.parquet` (`doc_id`, `doc_type`, `title`, `text`, `date`, `lang`, `url`, ...).
- Gold (`gold_subsample_chunk`) samples documents, chunks them, and enriches metadata into `gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet`, ready for RAG builds.
- The RAG build fans out into graph/Neo4j, dense/Chroma, and sparse/BM25 indices, then aggregates communities and adapters for benchmarks and the demo.
## Packages & entrypoints
- Bronze – `cleantech_pipeline`
    - `cleantech-fetch` → `cleantech_pipeline.__main__:main`
    - Sources: `KaggleDataset`, `OpenAlexDataset`, `OpenAlexTopics`.
- Silver – `ctclean`
    - `ctclean` → `ctclean.cli:main` with `media`, `patents`, `openalex`, `all`.
- Validation/Unify – `validation_unification`
    - `ctunify` → `validation_unification.cli:main` (`unify()`).
- Gold – `gold_subsample_chunk`
    - `python -m gold.subsample` → select representative docs.
    - `python -m gold.chunk` → chunk canonical text.
    - `python -m gold.enrich` → attach entities, taxonomies, and layout metadata.
- RAG build – `graphbuild`, `dense`, `bm25`, `communities`, `rag`
    - `python -m graphbuild csv/ingest` → CSV exports and Neo4j loads.
    - `python -m dense build --use-summary` → Gemini embeddings in Chroma.
    - `python -m bm25 build --use-text` → lexical indices.
    - `python -m communities communities/summaries` → community hierarchy with summaries and embeddings.
    - `python -m rag.bench.cli` / `streamlit run src/rag/demo/app.py` → benchmarks and legacy Streamlit demo adapters.
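Taken together, a full rebuild chains these entrypoints in layer order. The sequence below is a sketch assembled from the commands listed above; the ordering and `&&` chaining are assumptions, not a tested runbook:

```shell
# Bronze: fetch archives, JSONL mirrors, and raw_manifest.jsonl
cleantech-fetch

# Silver: canonical/normalized tables for all three sources
ctclean all

# Unified: silver/unified/unified_docs.parquet
ctunify

# Gold: subsample, chunk, and enrich
python -m gold.subsample && python -m gold.chunk && python -m gold.enrich

# RAG build: graph, dense, sparse, then communities
python -m graphbuild csv && python -m graphbuild ingest
python -m dense build --use-summary
python -m bm25 build --use-text
python -m communities communities && python -m communities summaries
```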
## Canonicalization (high level)
- Media: union by `url_key`; exact `content_sha1` (≥ 60 words); `(domain, title_fp)` gated by length ratio ≥ 0.90 and date span ≤ 7 days; prefer the non-listing URL.
- Patents: pick one canonical record per `publication_number` by `(abstract_len desc, englishness desc, publication_date asc, title_len desc)`, plus links for non-canonical members.
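The per-publication selection rule can be sketched as a Python sort key: negate the descending fields so a plain ascending comparison picks the winner. Records and field names here are hypothetical; the actual `ctclean` implementation may differ:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical patent records; field names mirror the rule above.
records = [
    {"publication_number": "EP123A", "abstract_len": 120, "englishness": 0.9,
     "publication_date": "2021-03-01", "title_len": 40},
    {"publication_number": "EP123A", "abstract_len": 200, "englishness": 0.8,
     "publication_date": "2021-04-01", "title_len": 35},
]

def canonical_key(rec):
    # abstract_len desc, englishness desc, publication_date asc, title_len desc.
    return (-rec["abstract_len"], -rec["englishness"],
            rec["publication_date"], -rec["title_len"])

records.sort(key=itemgetter("publication_number"))
canonical = {
    pub: min(group, key=canonical_key)
    for pub, group in groupby(records, key=itemgetter("publication_number"))
}
```

Here the second record wins on abstract length, so its fields become the canonical row for `EP123A`; the losing member would be kept only as a link.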
## Manifests
Every Bronze fetch appends a JSON line with `run_id`, parameters, file `sha256`, record counts, and `git_commit`, enabling end-to-end provenance.
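Appending such a provenance line can be sketched as follows. The function name and exact schema are assumptions based on the fields named above, not the real `cleantech-fetch` code:

```python
import hashlib
import json
import subprocess
from pathlib import Path

def append_manifest(archive: Path, manifest: Path, run_id: str, params: dict) -> None:
    """Append one provenance line for a fetched archive (sketch)."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip()
    except OSError:
        commit = ""  # git unavailable; record an empty commit
    entry = {
        "run_id": run_id,
        "parameters": params,
        "sha256": hashlib.sha256(archive.read_bytes()).hexdigest(),
        "records": sum(1 for _ in archive.open()),  # assumes a JSONL mirror
        "git_commit": commit,
    }
    with manifest.open("a") as fh:  # append: one JSON line per fetch
        fh.write(json.dumps(entry) + "\n")
```

Appending (rather than rewriting) keeps every historical run in `raw_manifest.jsonl`, which is what makes the provenance end-to-end.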
## Gold & RAG pipeline details

### Gold inputs
- Source: `gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet`
- Purpose: canonical chunk text plus metadata shared across all build paths.
### Graph / Neo4j path
- Build: `python -m graphbuild csv/ingest` (CSV exports, then Neo4j loads).
- Outputs:
    - `gr_nodes.csv`, `gr_edges.csv`
    - Neo4j nodes (`Chunk`, `Community`, `Entity`) tagged by `<tag>`
- Feeds: community detection, GraphRAG adapter, hybrid retrieval seed sets.
### Dense / Chroma path
- Build:

    ```bash
    python -m dense build --use-summary
    ```

- Outputs:
    - Chroma persistence in `vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/`
    - Summary embeddings for community rerankers
- Feeds: community enrichment, dense and hybrid adapters.
### Sparse / BM25 path
- Build:

    ```bash
    python -m bm25 build --use-text
    ```

- Outputs:
    - BM25 models in `vectordb_bm25/<dataset>/<date>/`
    - Lexical manifests (`index.json`)
- Feeds: hybrid adapter scoring, benchmark baselines.
### Communities & adapters
- Communities: `python -m communities communities/summaries`
    - Produces levels `C0`–`C3` with summaries and embeddings persisted to Neo4j.
- Hybrid adapter: combines dense, BM25, and community signals.
- GraphRAG adapter: traverses Neo4j communities with stored embeddings.
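One way to picture the hybrid adapter's fusion step: min-max-normalize each track's scores so they are comparable, then mix them with fixed weights. The weights, normalization scheme, and chunk ids below are illustrative assumptions, not the adapter's actual logic:

```python
def _minmax(scores: dict[str, float]) -> dict[str, float]:
    # Rescale one track's raw scores into [0, 1] so tracks are comparable.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero for single-entry tracks
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(dense, bm25, community, weights=(0.5, 0.3, 0.2)):
    """Weighted fusion of per-chunk scores from the three retrieval tracks."""
    tracks = [_minmax(t) for t in (dense, bm25, community)]
    docs = set().union(*tracks)
    return {
        doc: sum(w * t.get(doc, 0.0) for w, t in zip(weights, tracks))
        for doc in docs
    }

# Hypothetical chunk ids and raw scores from each index.
fused = hybrid_scores({"c1": 0.9, "c2": 0.4}, {"c2": 7.1, "c3": 3.0}, {"c1": 0.8})
top = max(fused, key=fused.get)
```

A chunk surfaced by several tracks accumulates weight from each, so agreement between dense, BM25, and community signals pushes it up the ranking.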
## Benchmarks & apps
- Benchmarks:

    ```bash
    python -m rag.bench.cli
    ```

- Production app (React frontend + API):
    - The deployed UI is the React frontend backed by the API service.
- Legacy test app (Streamlit):

    ```bash
    streamlit run src/rag/demo/app.py
    ```

    - Behavior: see app.md for defaults and `AUTO_PIPELINE_FROM_CHAT` behavior (`=1` locks defaults; deployment uses `=1`).
    - LLM pipeline: see llm_pipeline.md, including the diagram, stage memory, and follow-up chat.
    - Mode: `RAG_MODE=local` runs the local CLI (`rag.retrieval.triple_retriever`); `RAG_MODE=api` calls the API.
    - Purpose: validate retrieval quality and expose the pipeline through the interactive interface.
See deployment.md for the full production runbook.