Architecture

Overview

The Cleantech data platform runs from the Bronze ingestion layer through Silver normalization and Unified publishing into the Gold subsample that seeds the RAG build. Kaggle media and patent datasets, together with OpenAlex Works and Topics, land in Bronze, progress through ctclean into canonical tables, are consolidated by ctunify, and finally become enriched Gold chunks that power the graph, dense, and sparse retrieval tracks. Community detection then merges those indices to support the GraphRAG and hybrid adapters behind the internal benchmarks and the public demo.

Pipeline architecture

Flow

  • Kaggle media/patent datasets and OpenAlex Works/Topics feed the Bronze layer.
  • Bronze (cleantech-fetch) writes original archives, JSONL mirrors, and raw_manifest.jsonl with run_id, parameters, sha256, records, and git_commit in date-bucketed directories.
  • Silver (ctclean) runs media, patents, and openalex pipelines to produce canonical and normalized tables.
  • Unified (ctunify) reads the latest Silver buckets and outputs silver/unified/unified_docs.parquet (doc_id, doc_type, title, text, date, lang, url, ...).
  • Gold (gold_subsample_chunk) samples documents, chunks them, and enriches metadata into gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet ready for RAG builds.
  • RAG build fans out into graph/Neo4j, dense/Chroma, and sparse/BM25 indices, then aggregates communities and adapters for benchmarks and the demo.

Packages & entrypoints

  • Bronze – cleantech_pipeline
  • cleantech-fetch → cleantech_pipeline.__main__:main
  • Sources: KaggleDataset, OpenAlexDataset, OpenAlexTopics.
  • Silver – ctclean
  • ctclean → ctclean.cli:main with subcommands media, patents, openalex, all.
  • Validation/Unify – validation_unification
  • ctunify → validation_unification.cli:main (unify()).
  • Gold – gold_subsample_chunk
  • python -m gold.subsample → select representative docs.
  • python -m gold.chunk → chunk canonical text.
  • python -m gold.enrich → attach entities, taxonomies, and layout metadata.
  • RAG build – graphbuild, dense, bm25, communities, rag
  • python -m graphbuild csv / ingest → CSV exports and Neo4j loads.
  • python -m dense build --use-summary → Gemini embeddings in Chroma.
  • python -m bm25 build --use-text → lexical indices.
  • python -m communities communities / summaries → community hierarchy with summaries and embeddings.
  • python -m rag.bench.cli / streamlit run src/rag/demo/app.py → benchmarks and legacy Streamlit demo adapters.

Canonicalization (high level)

  • Media: union duplicates by url_key; merge exact content_sha1 matches (documents ≥ 60 words); merge (domain, title_fp) matches gated by a length ratio ≥ 0.90 and a date span ≤ 7 days; prefer the non-listing URL as canonical.
  • Patents: pick one canonical record per publication_number by (abstract_len desc, englishness desc, publication_date asc, title_len desc); non-canonical members keep links to the canonical record.
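
As a sketch, the patent pick reduces to a single sort key. Field names such as abstract_len and englishness are taken from the bullet above as illustrative assumptions; the real implementation lives in ctclean:

```python
def canonical_key(rec: dict) -> tuple:
    """Sort key for (abstract_len desc, englishness desc,
    publication_date asc, title_len desc)."""
    return (
        -rec["abstract_len"],     # longer abstract wins
        -rec["englishness"],      # more-English text wins
        rec["publication_date"],  # earlier publication wins (ISO date string)
        -rec["title_len"],        # longer title breaks remaining ties
    )

def pick_canonical(group: list[dict]) -> dict:
    """Pick the canonical record within one publication_number group."""
    return min(group, key=canonical_key)
```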

Manifests

Every Bronze fetch appends a JSON line with run_id, parameters, file SHA256, record counts, and git_commit, enabling end-to-end provenance.
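
A minimal sketch of one manifest append, assuming the field names listed above; the exact schema written by cleantech-fetch may differ:

```python
import hashlib
import json
import uuid
from pathlib import Path

def append_manifest(manifest: Path, archive: Path, records: int,
                    parameters: dict, git_commit: str) -> dict:
    """Append one provenance line to raw_manifest.jsonl for a fetched archive."""
    entry = {
        "run_id": uuid.uuid4().hex,
        "parameters": parameters,
        "sha256": hashlib.sha256(archive.read_bytes()).hexdigest(),
        "records": records,
        "git_commit": git_commit,
    }
    # JSONL: one self-contained JSON object per line, append-only.
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```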

Gold & RAG pipeline details

Gold inputs

  • Source: gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet
  • Purpose: canonical chunk text plus metadata shared across all build paths.
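
Resolving the latest date bucket can be sketched as below, assuming buckets are named with ISO dates so lexicographic order matches chronological order:

```python
from pathlib import Path

def latest_chunks(root: Path, dataset: str) -> Path:
    """Resolve <root>/<dataset>/<date>/chunks_enriched.parquet for the
    most recent date bucket (ISO date names sort chronologically)."""
    buckets = sorted(p for p in (root / dataset).iterdir() if p.is_dir())
    if not buckets:
        raise FileNotFoundError(f"no date buckets under {root / dataset}")
    return buckets[-1] / "chunks_enriched.parquet"
```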

Graph / Neo4j path

  • Build:
    Bash
    python -m graphbuild csv --dataset <dataset> --date <date>
    python -m graphbuild ingest --dataset <dataset> --date <date> --ingest-tag <tag>
    
  • Outputs:
  • gr_nodes.csv, gr_edges.csv
  • Neo4j nodes (Chunk, Community, Entity) tagged by <tag>
  • Feeds: community detection, GraphRAG adapter, hybrid retrieval seed sets.
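
A sketch of the CSV export step; the column sets used here (id, label, ingest_tag / src, dst, type) are illustrative assumptions, not the exact graphbuild schema:

```python
import csv
from pathlib import Path

def write_graph_csvs(out_dir: Path, nodes: list[dict], edges: list[dict]) -> None:
    """Write gr_nodes.csv and gr_edges.csv ready for a bulk Neo4j load."""
    with (out_dir / "gr_nodes.csv").open("w", newline="", encoding="utf-8") as fh:
        w = csv.DictWriter(fh, fieldnames=["id", "label", "ingest_tag"])
        w.writeheader()
        w.writerows(nodes)
    with (out_dir / "gr_edges.csv").open("w", newline="", encoding="utf-8") as fh:
        w = csv.DictWriter(fh, fieldnames=["src", "dst", "type"])
        w.writeheader()
        w.writerows(edges)
```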

Dense / Chroma path

  • Build:
    Bash
    python -m dense build --dataset <dataset> --date <date> --use-summary
    
  • Outputs:
  • Chroma persistence in vectordb_dense/gemini_text-embedding-004/<dataset>/<date>/
  • Summary embeddings for community rerankers
  • Feeds: community enrichment, dense and hybrid adapters.
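
At its core the dense track is nearest-neighbour search over embeddings. A pure-Python stand-in for what the Chroma collection does at scale, with toy vectors in place of Gemini embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dense_search(query_vec: list[float], index: dict, k: int = 3) -> list:
    """index maps chunk_id -> embedding; return the top-k ids by similarity."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]
```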

Sparse / BM25 path

  • Build:
    Bash
    python -m bm25 build --dataset <dataset> --date <date> --use-text
    
  • Outputs:
  • BM25 models in vectordb_bm25/<dataset>/<date>/
  • Lexical manifests (index.json)
  • Feeds: hybrid adapter scoring, benchmark baselines.
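
The lexical track scores chunks with BM25. A self-contained sketch of the classic Okapi formula over tokenized documents; the exact variant and parameters used by the bm25 build may differ:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency within this doc
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```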

Communities & adapters

  • Communities:
    Bash
    python -m communities communities --dataset <dataset> --ingest-tag <tag> --levels "C1:1.2" --min-size 8
    python -m communities summaries --dataset <dataset> --level C1 --ingest-tag <tag> --refresh
    
  • Produces levels C0–C3 with summaries and embeddings persisted to Neo4j.
  • Hybrid adapter: combines dense, BM25, and community signals.
  • GraphRAG adapter: traverses Neo4j communities with stored embeddings.
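
One simple way to merge the three signal sources is reciprocal rank fusion; this is a stand-in for whatever weighting the hybrid adapter actually applies:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists (e.g. dense, BM25, community) with
    reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```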

Benchmarks & apps

  • Benchmarks:
    Bash
    python -m rag.bench.cli run --adapter hybrid --dataset <dataset> --date <date> --ingest-tag <tag> --level C1 --top-k 100
    
  • Production app (React frontend + API):
  • Deployed UI is the React frontend backed by the API service.
  • Legacy test app (Streamlit):
    Bash
    streamlit run src/rag/demo/app.py
    
  • Behavior: see app.md for defaults and for AUTO_PIPELINE_FROM_CHAT (setting it to 1 locks the defaults; the deployment runs with it set to 1).
  • LLM pipeline: see llm_pipeline.md, including the pipeline diagram, stage memory, and follow-up chat.
  • Mode: RAG_MODE=local runs the local CLI (rag.retrieval.triple_retriever), RAG_MODE=api calls the API.
  • Purpose: validate retrieval quality and expose the pipeline through the interactive interface.
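
The RAG_MODE switch can be sketched as a small dispatcher; the mode names come from the bullets above, while the default of local is an assumption (check app.md for the real precedence):

```python
import os

def resolve_backend(env=os.environ) -> str:
    """Pick the retrieval backend from RAG_MODE: 'local' runs the in-process
    CLI path, 'api' calls the HTTP service."""
    mode = env.get("RAG_MODE", "local")
    if mode == "local":
        return "rag.retrieval.triple_retriever"
    if mode == "api":
        return "api"
    raise ValueError(f"unknown RAG_MODE: {mode!r}")
```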

See deployment.md for the full production runbook.