Bronze
The Bronze layer downloads raw data and writes date‑bucketed artifacts with a manifest per run.
Kaggle datasets (KaggleDataset)
- Datasets:
    - `jannalipenkova/cleantech-media-dataset`
    - `prakharbhandari20/cleantech-google-patent-dataset`
- Output (example):
- Mirror rules:
    - CSV rows → dicts; JSON/JSONL/NDJSON supported.
    - `_doc_id` computed (patents prefer `publication_number`; else hash of url/title/date).
- Manifest fields (all sources):
    - `run_id`, input parameters, `sha256` checksum, `records` count, `git_commit`.
- Useful CLI flags (via `cleantech-fetch`):
    - `--kaggle-no-mirror` to skip `raw.jsonl.gz`.
    - `--kaggle-keep-extracted` to keep `extracted/`.
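The mirror rules and manifest fields above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names (`compute_doc_id`, `mirror_records`, `build_manifest`), the use of SHA-256 for the fallback ID, the `|` separator, and the choice to checksum the compressed file are all assumptions made for illustration.

```python
import gzip
import hashlib
import json


def compute_doc_id(record: dict) -> str:
    """Prefer publication_number (patents); otherwise hash url/title/date."""
    pub = record.get("publication_number")
    if pub:
        return str(pub)
    # Assumption: SHA-256 over the three fields joined with "|".
    key = "|".join(str(record.get(k, "")) for k in ("url", "title", "date"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def mirror_records(records, path):
    """Write records as gzipped JSONL; return (sha256 of the file, record count)."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for rec in records:
            row = {**rec, "_doc_id": compute_doc_id(rec)}
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    # Assumption: the checksum covers the compressed bytes on disk.
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return digest, count


def build_manifest(run_id, params, sha256, records, git_commit):
    """Assemble the manifest fields listed above into one dict."""
    return {
        "run_id": run_id,
        "params": params,
        "sha256": sha256,
        "records": records,
        "git_commit": git_commit,
    }
```

Hashing stable fields keeps `_doc_id` deterministic across runs, so re-fetching the same article yields the same identifier even when no natural key exists.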
OpenAlex Works (OpenAlexDataset)
- Cursor‑based `/works` fetch with retries and optional filters.
- Key params:
    - Date range: `from_publication_date:YYYY-MM-DD`, `to_publication_date:YYYY-MM-DD`
    - `--openalex-oa-only` (`is_oa:true/false`), `--openalex-search "renewable OR hydrogen"`
    - `--openalex-pages` (0 = unlimited), `--openalex-per-page` (≤ 200), `--openalex-mailto` (recommended).
- Output:
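As an illustration of how the parameters in this section map onto an OpenAlex request, here is a minimal sketch of the filter construction and cursor loop. The helper names `build_works_params` and `iter_works` are hypothetical, and the HTTP call is injected as a `fetch_page` callable, so retries are left to the caller. The query keys (`filter`, `search`, `per-page`, `cursor`, `mailto`) and the `meta.next_cursor` response field follow the public OpenAlex API.

```python
from typing import Callable, Iterator, Optional


def build_works_params(
    from_date: Optional[str] = None,
    to_date: Optional[str] = None,
    oa_only: Optional[bool] = None,
    search: Optional[str] = None,
    per_page: int = 200,
    mailto: Optional[str] = None,
) -> dict:
    """Translate CLI-style options into OpenAlex query parameters."""
    if not 1 <= per_page <= 200:
        raise ValueError("per_page must be between 1 and 200")
    filters = []
    if from_date:
        filters.append(f"from_publication_date:{from_date}")
    if to_date:
        filters.append(f"to_publication_date:{to_date}")
    if oa_only is not None:
        filters.append(f"is_oa:{str(oa_only).lower()}")
    params = {"per-page": per_page, "cursor": "*"}  # "*" starts a cursor walk
    if filters:
        params["filter"] = ",".join(filters)
    if search:
        params["search"] = search
    if mailto:
        params["mailto"] = mailto
    return params


def iter_works(
    fetch_page: Callable[[dict], dict],
    params: dict,
    max_pages: int = 0,
) -> Iterator[dict]:
    """Yield works page by page; max_pages=0 means unlimited."""
    cursor = params.get("cursor", "*")
    pages = 0
    while cursor and (max_pages == 0 or pages < max_pages):
        page = fetch_page({**params, "cursor": cursor})
        yield from page.get("results", [])
        # A null next_cursor signals the end of the result set.
        cursor = page.get("meta", {}).get("next_cursor")
        pages += 1
```

In practice `fetch_page` would wrap a GET against `https://api.openalex.org/works` with retry and backoff; keeping it injectable lets the paging logic be exercised without network access.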
OpenAlex Topics (OpenAlexTopics)
- Cursor‑based `/topics` endpoint:
- Useful CLI flags:
    - `--openalex-only-topics` to skip Works.
    - `--openalex-topics-pages` / `--openalex-topics-per-page` / `--openalex-topics-search`.
    - `--openalex-topics-keep-extracted` to keep plain JSONL.