Architecture

Overview

The Cleantech data platform runs from the Bronze ingestion layer through Silver normalization and Unified publishing into the Gold subsample that seeds the RAG build. Kaggle media and patent datasets, together with OpenAlex Works and Topics, land in Bronze, progress through ctclean into canonical tables, are consolidated by ctunify, and finally become enriched Gold chunks that power the graph, dense, and sparse retrieval tracks. Community detection then merges those indices to support the GraphRAG and hybrid adapters behind the internal benchmarks and the public demo.

Pipeline architecture

Flow

  • Kaggle media/patent datasets and OpenAlex Works/Topics feed the Bronze layer.
  • Bronze (cleantech-fetch) writes original archives, JSONL mirrors, and raw_manifest.jsonl with run_id, parameters, sha256, records, and git_commit in date-bucketed directories.
  • Silver (ctclean) runs media, patents, and openalex pipelines to produce canonical and normalized tables.
  • Unified (ctunify) reads the latest Silver buckets and outputs silver/unified/unified_docs.parquet (doc_id, doc_type, title, text, date, lang, url, ...).
  • Gold (gold_subsample_chunk) samples documents, chunks them, and enriches metadata into gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet ready for RAG builds.
  • RAG build fans out into graph/Neo4j, dense/Chroma, and sparse/BM25 indices, then aggregates communities and adapters for benchmarks and the demo.

Packages & entrypoints

  • Bronze – cleantech_pipeline
  • cleantech-fetch → cleantech_pipeline.__main__:main
  • Sources: KaggleDataset, OpenAlexDataset, OpenAlexTopics.
  • Silver – ctclean
  • ctclean → ctclean.cli:main with subcommands media, patents, openalex, all.
  • Validation/Unify – validation_unification
  • ctunify → validation_unification.cli:main (unify()).
  • Gold – gold_subsample_chunk
  • python -m gold.subsample → select representative docs.
  • python -m gold.chunk → chunk canonical text.
  • python -m gold.enrich → attach entities, taxonomies, and layout metadata.
  • RAG build – graphbuild, dense, bm25, communities, rag
  • python -m graphbuild csv / ingest → CSV exports and Neo4j loads.
  • python -m dense build --use-summary → Gemini embeddings in Chroma.
  • python -m bm25 build --use-text → lexical indices.
  • python -m communities communities / summaries → community hierarchy with summaries and embeddings.
  • python -m rag.bench.cli / streamlit run src/rag/demo/app.py → benchmarks and legacy Streamlit demo adapters.

Canonicalization (high level)

  • Media: union duplicates by url_key; merge exact content_sha1 matches (documents ≥ 60 words); merge (domain, title_fp) matches gated by a length ratio ≥ 0.90 and a date span ≤ 7 days; prefer the non-listing URL as canonical.
  • Patents: pick one canonical record per publication_number by (abstract_len desc, englishness desc, publication_date asc, title_len desc); non-canonical members keep links to the canonical record.
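
As a sketch, the patent pick reduces to a single sort key. Field names such as abstract_len and englishness are taken from the bullet above as illustrative assumptions; the real implementation lives in ctclean:

```python
def canonical_key(rec: dict) -> tuple:
    """Sort key for (abstract_len desc, englishness desc,
    publication_date asc, title_len desc)."""
    return (
        -rec["abstract_len"],     # longer abstract wins
        -rec["englishness"],      # more-English text wins
        rec["publication_date"],  # earlier publication wins (ISO date string)
        -rec["title_len"],        # longer title breaks remaining ties
    )

def pick_canonical(group: list[dict]) -> dict:
    """Pick the canonical record within one publication_number group."""
    return min(group, key=canonical_key)
```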

Manifests

Every Bronze fetch appends a JSON line with run_id, parameters, file SHA256, record counts, and git_commit, enabling end-to-end provenance.
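
A minimal sketch of one manifest append, assuming the field names listed above; the exact schema written by cleantech-fetch may differ:

```python
import hashlib
import json
import uuid
from pathlib import Path

def append_manifest(manifest: Path, archive: Path, records: int,
                    parameters: dict, git_commit: str) -> dict:
    """Append one provenance line to raw_manifest.jsonl for a fetched archive."""
    entry = {
        "run_id": uuid.uuid4().hex,
        "parameters": parameters,
        "sha256": hashlib.sha256(archive.read_bytes()).hexdigest(),
        "records": records,
        "git_commit": git_commit,
    }
    # JSONL: one self-contained JSON object per line, append-only.
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```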

Gold & RAG pipeline details

Gold inputs

  • Source: gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet
  • Purpose: canonical chunk text plus metadata shared across all build paths.
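
Resolving the latest date bucket can be sketched as below, assuming buckets are named with ISO dates so lexicographic order matches chronological order:

```python
from pathlib import Path

def latest_chunks(root: Path, dataset: str) -> Path:
    """Resolve <root>/<dataset>/<date>/chunks_enriched.parquet for the
    most recent date bucket (ISO date names sort chronologically)."""
    buckets = sorted(p for p in (root / dataset).iterdir() if p.is_dir())
    if not buckets:
        raise FileNotFoundError(f"no date buckets under {root / dataset}")
    return buckets[-1] / "chunks_enriched.parquet"
```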

Graph / Neo4j path

  • Build:
    Bash
    python -m graphbuild csv --dataset <dataset> --date <date>
    python -m graphbuild ingest --dataset <dataset> --date <date> --ingest-tag <tag>
    
  • Outputs:
  • gr_nodes.csv, gr_edges.csv
  • Neo4j nodes (Chunk, Community, Entity) tagged by <tag>
  • Feeds: community detection, GraphRAG adapter, hybrid retrieval seed sets.
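
A sketch of the CSV export step; the column sets used here (id, label, ingest_tag / src, dst, type) are illustrative assumptions, not the exact graphbuild schema:

```python
import csv
from pathlib import Path

def write_graph_csvs(out_dir: Path, nodes: list[dict], edges: list[dict]) -> None:
    """Write gr_nodes.csv and gr_edges.csv ready for a bulk Neo4j load."""
    with (out_dir / "gr_nodes.csv").open("w", newline="", encoding="utf-8") as fh:
        w = csv.DictWriter(fh, fieldnames=["id", "label", "ingest_tag"])
        w.writeheader()
        w.writerows(nodes)
    with (out_dir / "gr_edges.csv").open("w", newline="", encoding="utf-8") as fh:
        w = csv.DictWriter(fh, fieldnames=["src", "dst", "type"])
        w.writeheader()
        w.writerows(edges)
```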

Dense / Chroma path

  • Build:
    Bash
    python -m dense build --dataset <dataset> --date <date> --use-summary
    
  • Outputs:
  • Chroma persistence in vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/
  • Summary embeddings for community rerankers
  • Feeds: community enrichment, dense and hybrid adapters.
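
At its core the dense track is nearest-neighbour search over embeddings. A pure-Python stand-in for what the Chroma collection does at scale, with toy vectors in place of Gemini embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dense_search(query_vec: list[float], index: dict, k: int = 3) -> list:
    """index maps chunk_id -> embedding; return the top-k ids by similarity."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]
```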

Sparse / BM25 path

  • Build:
    Bash
    python -m bm25 build --dataset <dataset> --date <date> --use-text
    
  • Outputs:
  • BM25 models in vectordb_bm25/<dataset>/<date>/
  • Lexical manifests (index.json)
  • Feeds: hybrid adapter scoring, benchmark baselines.
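
The lexical track scores chunks with BM25. A self-contained sketch of the classic Okapi formula over tokenized documents; the exact variant and parameters used by the bm25 build may differ:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency within this doc
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```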

Communities & adapters

  • Communities:
    Bash
    python -m communities communities --dataset <dataset> --ingest-tag <tag> --levels "C1:1.2" --min-size 8
    python -m communities summaries --dataset <dataset> --level C1 --ingest-tag <tag> --refresh
    
  • Produces levels C0–C3 with summaries and embeddings persisted to Neo4j.
  • Hybrid adapter: combines dense, BM25, and community signals.
  • GraphRAG adapter: traverses Neo4j communities with stored embeddings.
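
One simple way to merge the three signal sources is reciprocal rank fusion; this is a stand-in for whatever weighting the hybrid adapter actually applies:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists (e.g. dense, BM25, community) with
    reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```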

Benchmarks & apps

  • Benchmarks:
    Bash
    python -m rag.bench.cli run --adapter hybrid --dataset <dataset> --date <date> --ingest-tag <tag> --level C1 --top-k 100
    
  • Production app (React frontend + API):
  • Deployed UI is the React frontend backed by the API service.
  • Legacy test app (Streamlit):
    Bash
    streamlit run src/rag/demo/app.py
    
  • Behavior: see app.md for defaults and for AUTO_PIPELINE_FROM_CHAT (setting it to 1 locks the defaults; the deployment runs with it set to 1).
  • LLM pipeline: see llm_pipeline.md, including the pipeline diagram, stage memory, and follow-up chat.
  • Mode: RAG_MODE=local runs the local CLI (rag.retrieval.triple_retriever), RAG_MODE=api calls the API.
  • Purpose: validate retrieval quality and expose the pipeline through the interactive interface.
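
The RAG_MODE switch can be sketched as a small dispatcher; the mode names come from the bullets above, while the default of local is an assumption (check app.md for the real precedence):

```python
import os

def resolve_backend(env=os.environ) -> str:
    """Pick the retrieval backend from RAG_MODE: 'local' runs the in-process
    CLI path, 'api' calls the HTTP service."""
    mode = env.get("RAG_MODE", "local")
    if mode == "local":
        return "rag.retrieval.triple_retriever"
    if mode == "api":
        return "api"
    raise ValueError(f"unknown RAG_MODE: {mode!r}")
```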

See deployment.md for the full production runbook.