# Silver
The Silver layer (package `ctclean`) transforms Bronze data into canonical, analysis-ready artifacts.
## Media pipeline (`ctclean media`)
Steps:

- `prepare`: URL normalization (`url_clean`, `domain`, `url_key`), title/content cleaning, per-domain stoplines, language detection, counts/hashes (`content_sha1`, `title_fp`), robust date parsing.
- `attach_non_article_reason`: domain/path/title rules and listing detection.
- Quality gate: drop rows with `content_clean` < 200 chars.
- `canonicalize` groups duplicates by:
    - A) same `url_key`
    - B) same `content_sha1` (≥ 60 words)
    - C) same `(domain, title_fp)`, gated by length ratio ≥ 0.90 and date span ≤ 7 days; prefer a non-listing canonical.
- Ranking: earliest `date` → longest `content_clean` → longest `title_clean`.
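The dedup keys and the ranking above can be sketched as follows. The helpers `url_key_of`, `content_sha1`, and `pick_canonical` are illustrative names under assumed normalization rules, not the actual `ctclean` API:

```python
import hashlib
from urllib.parse import urlsplit


def url_key_of(url: str) -> str:
    """Illustrative URL key: lowercase host without 'www.', path without
    trailing slash, query string dropped."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    return f"{host}{parts.path.rstrip('/')}"


def content_sha1(text: str) -> str:
    """SHA-1 of whitespace-normalized, lowercased content, for exact-duplicate
    matching (normalization details are an assumption)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def pick_canonical(rows: list[dict]) -> dict:
    """Rank duplicates: earliest date, then longest content, then longest title."""
    return min(
        rows,
        key=lambda r: (r["date"], -len(r["content_clean"]), -len(r["title_clean"])),
    )
```

Because all three ranking criteria are folded into one sort key, the earliest-dated row always wins, and content/title length only break date ties.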
Outputs:
`media_canonical.*`, `media_dupe_links.*`, `media_excluded_listings.*`, `media_excluded_non_articles.*`
## Patent pipeline (`ctclean patents`)
Steps:
- `prepare`: minimal cleaning, `abstract_len` ≥ 40, robust `publication_date_dt`, language detection on title/abstract.
- `canonicalize`: pick one row per `publication_number` by `(abstract_len, englishness, earliest date, title_len)`.
- `normalize`: aggregate to one row per publication, with lists for `cpc_codes`, `inventors`, etc.
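The per-publication selection can be sketched as a single multi-key pick. This is an assumption-laden sketch: the `lang == "en"` check stands in for the real englishness score, and field names mirror the doc rather than the actual schema:

```python
from itertools import groupby
from operator import itemgetter


def canonicalize_patents(rows: list[dict]) -> list[dict]:
    """Keep one row per publication_number, preferring longer abstracts,
    English text, earlier publication dates, then longer titles."""
    rows = sorted(rows, key=itemgetter("publication_number"))
    picked = []
    for _, group in groupby(rows, key=itemgetter("publication_number")):
        picked.append(min(group, key=lambda r: (
            -r["abstract_len"],        # longest abstract first
            r["lang"] != "en",         # English before other languages
            r["publication_date_dt"],  # earliest ISO date wins remaining ties
            -len(r["title"]),          # longest title breaks the last tie
        )))
    return picked
```

Sorting by `publication_number` first lets `groupby` see each publication's rows as one contiguous run, so the pick is a simple `min` per group.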
Outputs:
`patent_canonical.*`, `patent_dupe_links.*`, `patents_normalized.*`
## OpenAlex topics (`ctclean openalex`)
Processing:
- Canonical topics table (`topics_canonical.*`) plus M2M tables `topic_keywords_m2m.*` and `topic_siblings_m2m.*`; reference tables `domains_ref.*`, `fields_ref.*`, `subfields_ref.*`.
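An M2M table of this shape can be derived by exploding a list column. This pandas sketch assumes a `keywords` list column on the canonical topics table; the column names are illustrative:

```python
import pandas as pd

# Toy canonical topics with a list-valued keywords column (illustrative data).
topics = pd.DataFrame({
    "topic_id": ["T1", "T2"],
    "keywords": [["solar", "pv"], ["wind"]],
})

# One (topic_id, keyword) row per pair -- the shape of a *_m2m table.
topic_keywords_m2m = (
    topics[["topic_id", "keywords"]]
    .explode("keywords")
    .rename(columns={"keywords": "keyword"})
    .reset_index(drop=True)
)
```

The same explode-and-rename pattern would yield `topic_siblings_m2m.*` from a list of sibling topic IDs.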
## Writing & types
- All writers go through `ctclean.io.safe_write`: Parquet first; on schema/type issues it sanitizes object columns and retries; final fallback is CSV.gz.
## Validation & Unify
Run:
Writes `cleantech_data/silver/unified/unified_docs.parquet` with columns:

`doc_id`, `doc_type`, `title`, `text`, `date`, `lang`, `source`, `url`, `country`, `cpc_codes`
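One way to picture the unification step: each pipeline's canonical rows are mapped into this shared column set. The mapping below is an illustrative sketch, not the actual unify code; the `media:` ID prefix and the null fills are assumptions:

```python
UNIFIED_COLS = ["doc_id", "doc_type", "title", "text", "date",
                "lang", "source", "url", "country", "cpc_codes"]


def media_to_unified(row: dict) -> dict:
    """Illustrative mapping from a media_canonical row to the unified schema.
    Patent rows would map analogously (abstract -> text, cpc_codes kept)."""
    return {
        "doc_id": f"media:{row['url_key']}",
        "doc_type": "media",
        "title": row["title_clean"],
        "text": row["content_clean"],
        "date": row["date"],
        "lang": row["lang"],
        "source": row["domain"],
        "url": row["url_clean"],
        "country": None,     # not available for media rows (assumption)
        "cpc_codes": None,   # patent-only column
    }
```

Columns a source lacks are filled with nulls so every `doc_type` shares one schema in the output Parquet.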
## See also
- Notebook: `02_Cleantech_Silver_Validation_and_Unify`