
Silver pipeline

The Silver layer (package ctclean) transforms Bronze data into canonical, analysis‑ready artifacts.

Media pipeline (ctclean media)

Steps:

  1. prepare: URL normalization (url_clean, domain, url_key), title/content cleaning, per‑domain stopline removal, language detection, counts/hashes (content_sha1, title_fp), and robust date parsing.
  2. attach_non_article_reason: domain/path/title rules and listing detection.
  3. Quality gate: drop rows whose content_clean is shorter than 200 characters.
  4. canonicalize:
    • A) same url_key
    • B) same content_sha1 (≥ 60 words)
    • C) (domain, title_fp) gated by length‑ratio ≥ 0.90 and date span ≤ 7 days; prefer non‑listing canonical.
    • Ranking: earliest date → longest content_clean → longest title_clean.
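The dedupe ranking above (earliest date, then longest content_clean, then longest title_clean) can be sketched as a sort plus drop-duplicates over a grouping key. This is a minimal illustration, not the actual ctclean code; the function name and DataFrame layout are assumed:

```python
import pandas as pd

def canonicalize(df: pd.DataFrame, key: str) -> pd.DataFrame:
    # Rank rows so the preferred canonical comes first within each key group:
    # earliest date, then longest content_clean, then longest title_clean.
    ranked = df.assign(
        _clen=df["content_clean"].str.len(),
        _tlen=df["title_clean"].str.len(),
    ).sort_values(["date", "_clen", "_tlen"], ascending=[True, False, False])
    # Keep the top-ranked row per key; the rest would become dupe links.
    return ranked.drop_duplicates(subset=key, keep="first").drop(columns=["_clen", "_tlen"])

docs = pd.DataFrame({
    "url_key": ["example.com/a", "example.com/a", "example.com/b"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-01", "2024-01-03"]),
    "content_clean": ["longer article body", "short one", "b" * 250],
    "title_clean": ["Title A1", "Title A2", "Title B"],
})

# Rule A: one canonical row per url_key.
canonical = canonicalize(docs, "url_key")
```

Rules B and C would follow the same pattern with content_sha1 or (domain, title_fp) as the key, plus their extra gates (word count, length ratio, date span).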

Outputs:

  • media_canonical.*, media_dupe_links.*, media_excluded_listings.*, media_excluded_non_articles.*

Patent pipeline (ctclean patents)

Steps:

  1. prepare: minimal clean, abstract_len ≥ 40, robust publication_date_dt, language detection on title/abstract.
  2. canonicalize: pick one row per publication_number, ranked by (abstract_len, englishness, earliest date, title_len).
  3. normalize: aggregate to one row per publication with lists for cpc_codes, inventors, etc.
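The normalize step (one row per publication, with multi-valued fields collapsed to lists) is essentially a group-by aggregation. A minimal sketch with assumed column names, not the real ctclean implementation:

```python
import pandas as pd

rows = pd.DataFrame({
    "publication_number": ["US1", "US1", "US2"],
    "cpc_codes": ["Y02E 10/50", "Y02E 10/52", "H01M 4/00"],
    "inventors": ["Ada", "Ben", "Cy"],
})

# Aggregate to one row per publication; repeated fields become lists.
normalized = rows.groupby("publication_number", as_index=False).agg(
    {"cpc_codes": list, "inventors": list}
)
```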

Outputs:

  • patent_canonical.*, patent_dupe_links.*, patents_normalized.*

OpenAlex topics (ctclean openalex)

Processing:

  • Canonical topics table: topics_canonical.*
  • M2M tables: topic_keywords_m2m.*, topic_siblings_m2m.*
  • Reference tables: domains_ref.*, fields_ref.*, subfields_ref.*
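Downstream, the M2M tables resolve against the canonical topics table with an ordinary join. A small sketch with invented sample data and assumed column names:

```python
import pandas as pd

topics = pd.DataFrame({"topic_id": ["T1", "T2"],
                       "display_name": ["Solar energy", "Wind energy"]})
keywords_m2m = pd.DataFrame({"topic_id": ["T1", "T1", "T2"],
                             "keyword": ["photovoltaics", "solar cells", "turbines"]})

# One row per (topic, keyword) pair via the M2M table.
topic_keywords = topics.merge(keywords_m2m, on="topic_id", how="left")
```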

Writing & types

  • All writers go through ctclean.io.safe_write: Parquet is attempted first; on schema/type issues the writer sanitizes object columns and retries; the final fallback is CSV.gz.
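The write strategy can be sketched as a three-stage fallback. This is an illustration of the strategy, not the actual ctclean.io.safe_write signature; casting object columns to str is one assumed form of sanitization:

```python
import pandas as pd
from pathlib import Path

def safe_write(df: pd.DataFrame, base: Path) -> Path:
    """Try Parquet; on failure, cast object columns to str and retry;
    fall back to CSV.gz. Returns the path actually written."""
    target = base.with_suffix(".parquet")
    try:
        df.to_parquet(target)
        return target
    except Exception:
        pass
    try:
        # Sanitize: object columns often hold mixed/unserializable values.
        sanitized = df.copy()
        for col in sanitized.select_dtypes(include="object").columns:
            sanitized[col] = sanitized[col].astype(str)
        sanitized.to_parquet(target)
        return target
    except Exception:
        target = base.with_suffix(".csv.gz")
        df.to_csv(target, index=False, compression="gzip")
        return target
```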

Validation & Unify

Run:

ctunify run

This consolidates the latest Silver buckets into cleantech_data/silver/unified/unified_docs.parquet with columns:
  • doc_id
  • doc_type
  • title
  • text
  • date
  • lang
  • source
  • url
  • country
  • cpc_codes
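Conceptually, unification maps each bucket's columns onto this shared schema and fills the rest with nulls. A minimal sketch; the column mapping and function name are illustrative, not the ctunify internals:

```python
import pandas as pd

UNIFIED_COLS = ["doc_id", "doc_type", "title", "text", "date",
                "lang", "source", "url", "country", "cpc_codes"]

def to_unified(df: pd.DataFrame, doc_type: str, mapping: dict) -> pd.DataFrame:
    """Rename bucket-specific columns to the unified schema,
    tag the doc_type, and fill missing columns with NA."""
    out = df.rename(columns=mapping).assign(doc_type=doc_type)
    for col in UNIFIED_COLS:
        if col not in out.columns:
            out[col] = pd.NA
    return out[UNIFIED_COLS]

media = pd.DataFrame({"doc_id": ["m1"], "title_clean": ["A title"],
                      "content_clean": ["body text"], "url_clean": ["https://x"]})
unified = to_unified(media, "media",
                     {"title_clean": "title", "content_clean": "text",
                      "url_clean": "url"})
```

The unified frames from each bucket would then be concatenated and written to unified_docs.parquet.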

See also