Bronze

Bronze pipeline

The Bronze layer downloads raw data and writes date‑bucketed artifacts with a manifest per run.

Kaggle datasets (KaggleDataset)

  • Datasets:
    • jannalipenkova/cleantech-media-dataset
    • prakharbhandari20/cleantech-google-patent-dataset
  • Output (example):
    cleantech_data/bronze/kaggle/<slug>/<YYYY-MM-DD>/
      original.zip
      raw.jsonl.gz          # optional mirror (CSV/JSON normalized to 1 object/line)
      raw_manifest.jsonl    # one JSON per run (params, sha256, records, git commit)
      extracted/            # temp for mirroring (deleted unless --kaggle-keep-extracted)
    
  • Mirror rules:
    • CSV rows are converted to dicts; JSON, JSONL, and NDJSON inputs are also supported.
    • A _doc_id is computed for each record: patents prefer publication_number; otherwise a hash of url/title/date is used.
  • Manifest fields (all sources): run_id, input parameters, sha256 checksum, records count, git_commit.
  • Useful CLI flags (via cleantech-fetch):
    • --kaggle-no-mirror to skip raw.jsonl.gz.
    • --kaggle-keep-extracted to keep extracted/.
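
As a rough illustration of the mirror rules listed above, the sketch below turns CSV rows into dicts, attaches a _doc_id, and writes one JSON object per line to a gzip file. The function names (compute_doc_id, mirror_csv), the choice of SHA-256, and the "|" separator for the fallback hash are assumptions made for this example; the actual KaggleDataset code may differ.

```python
import csv
import gzip
import hashlib
import io
import json


def compute_doc_id(record: dict) -> str:
    """Hypothetical _doc_id rule: publication_number if present, else a hash."""
    pub = record.get("publication_number")
    if pub:
        return str(pub)
    # Fallback: stable hash over url/title/date.
    # The hash function and the "|" separator are assumptions.
    key = "|".join(str(record.get(k, "")) for k in ("url", "title", "date"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def mirror_csv(csv_text: str, out_path: str) -> int:
    """Write one JSON object per CSV row to a gzip file; return the row count."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for row in csv.DictReader(io.StringIO(csv_text)):
            row["_doc_id"] = compute_doc_id(row)
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The returned count corresponds to the records value that a manifest entry would carry.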

OpenAlex Works (OpenAlexDataset)

  • Cursor‑based /works fetch with retries and optional filters.
  • Key params:
    • Date range: from_publication_date:YYYY-MM-DD,to_publication_date:YYYY-MM-DD
    • --openalex-oa-only (is_oa:true/false), --openalex-search "renewable OR hydrogen"
    • --openalex-pages (0 = unlimited), --openalex-per-page (≤ 200), --openalex-mailto (recommended).
  • Output:
    cleantech_data/bronze/openalex/<YYYY-MM-DD>/raw.jsonl.gz
    cleantech_data/bronze/openalex/<YYYY-MM-DD>/raw_manifest.jsonl
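
The cursor-based fetch described above can be sketched roughly as follows. This is a minimal illustration, not the OpenAlexDataset implementation: build_params, with_retries, http_get, and iter_works are hypothetical names, and the retry policy (three attempts, exponential backoff) is an assumption. It relies on the public OpenAlex conventions of starting with cursor=* and following meta.next_cursor until it is empty.

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://api.openalex.org/works"


def build_params(date_from, date_to, oa_only=None, search=None,
                 per_page=200, mailto=None):
    """Assemble query parameters; the filter string mirrors the date-range form above."""
    filters = [f"from_publication_date:{date_from}",
               f"to_publication_date:{date_to}"]
    if oa_only is not None:
        filters.append(f"is_oa:{str(oa_only).lower()}")
    params = {"filter": ",".join(filters),
              "per-page": str(min(per_page, 200)),  # page size is capped at 200
              "cursor": "*"}                        # "*" requests the first page
    if search:
        params["search"] = search
    if mailto:
        params["mailto"] = mailto
    return params


def http_get(params):
    """Fetch one page as parsed JSON."""
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def with_retries(get, attempts=3, backoff=1.0):
    """Wrap a page getter with simple exponential backoff (assumed policy)."""
    def wrapped(params):
        for i in range(attempts):
            try:
                return get(params)
            except OSError:  # urllib's URLError/HTTPError derive from OSError
                if i == attempts - 1:
                    raise
                time.sleep(backoff * 2 ** i)
    return wrapped


def iter_works(params, max_pages=0, get=http_get):
    """Yield works page by page; max_pages=0 means no page limit."""
    params = dict(params)  # do not mutate the caller's dict
    page = 0
    while max_pages == 0 or page < max_pages:
        data = get(params)
        yield from data.get("results", [])
        page += 1
        cursor = data.get("meta", {}).get("next_cursor")
        if not cursor:
            break
        params["cursor"] = cursor
```

Passing the page getter as an argument keeps the pagination loop testable without network access, for example iter_works(params, max_pages=3, get=with_retries(http_get)).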
    

OpenAlex Topics (OpenAlexTopics)

  • Cursor‑based /topics fetch. Output:
    cleantech_data/bronze/openalex/topics/<YYYY-MM-DD>/
      topics.jsonl.gz
      raw_manifest.jsonl
      extracted/topics.jsonl   # optional plain mirror when --openalex-topics-keep-extracted
    
  • Useful CLI flags:
    • --openalex-only-topics to skip Works.
    • --openalex-topics-pages / --openalex-topics-per-page / --openalex-topics-search.
    • --openalex-topics-keep-extracted to keep plain JSONL.
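
Every source above writes a raw_manifest.jsonl with one JSON object per run. The sketch below shows one way such an entry could be appended. The key names (params, sha256, records) follow the comment in the Kaggle output listing, but the helper names, the use of uuid4 for run_id, and the git lookup are assumptions for illustration only.

```python
import hashlib
import json
import subprocess
import uuid


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts are not loaded into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def current_git_commit():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def append_manifest(manifest_path, artifact_path, params, records):
    """Append one manifest line describing a run (hypothetical layout)."""
    entry = {
        "run_id": uuid.uuid4().hex,          # assumed run_id scheme
        "params": params,                    # input parameters for the run
        "sha256": sha256_file(artifact_path),
        "records": records,                  # number of records written
        "git_commit": current_git_commit(),
    }
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

Opening the manifest in append mode means repeated runs in the same date bucket add lines instead of overwriting earlier ones.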

Examples

    cleantech-fetch --openalex-pages 3 --openalex-per-page 200 --openalex-mailto you@example.com
    cleantech-fetch --kaggle-keep-extracted --kaggle-no-mirror