# Silver
The Silver layer (package `ctclean`) transforms Bronze data into canonical, analysis-ready artifacts.
## Media pipeline (`ctclean media`)
Steps:

- `prepare`: URL normalization (`url_clean`, `domain`, `url_key`), title/content cleaning, per-domain stoplines, language detection, counts/hashes (`content_sha1`, `title_fp`), robust date parsing.
- `attach_non_article_reason`: domain/path/title rules and listing detection.
- Quality gate: drop rows with `content_clean` < 200 chars.
- `canonicalize` groups duplicates by:
    - A) same `url_key`
    - B) same `content_sha1` (≥ 60 words)
    - C) same `(domain, title_fp)`, gated by length ratio ≥ 0.90 and date span ≤ 7 days; prefer a non-listing canonical.
- Ranking: earliest `date` → longest `content_clean` → longest `title_clean`.
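The dedup keys and the ranking above can be sketched as follows. The helpers `url_key_of`, `content_sha1`, and `pick_canonical` are illustrative names under assumed normalization rules, not the actual `ctclean` API:

```python
import hashlib
from urllib.parse import urlsplit


def url_key_of(url: str) -> str:
    """Illustrative URL key: lowercase host without 'www.', path without
    trailing slash, query string dropped."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    return f"{host}{parts.path.rstrip('/')}"


def content_sha1(text: str) -> str:
    """SHA-1 of whitespace-normalized, lowercased content, for exact-duplicate
    matching (normalization details are an assumption)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def pick_canonical(rows: list[dict]) -> dict:
    """Rank duplicates: earliest date, then longest content, then longest title."""
    return min(
        rows,
        key=lambda r: (r["date"], -len(r["content_clean"]), -len(r["title_clean"])),
    )
```

Because all three ranking criteria are folded into one sort key, the earliest-dated row always wins, and content/title length only break date ties.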
Outputs:
`media_canonical.*`, `media_dupe_links.*`, `media_excluded_listings.*`, `media_excluded_non_articles.*`
## Patent pipeline (`ctclean patents`)
Steps:
- `prepare`: minimal cleaning, `abstract_len` ≥ 40, robust `publication_date_dt`, language detection on title/abstract.
- `canonicalize`: pick one row per `publication_number` by `(abstract_len, englishness, earliest date, title_len)`.
- `normalize`: aggregate to one row per publication, with lists for `cpc_codes`, `inventors`, etc.
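The per-publication selection can be sketched as a single multi-key pick. This is an assumption-laden sketch: the `lang == "en"` check stands in for the real englishness score, and field names mirror the doc rather than the actual schema:

```python
from itertools import groupby
from operator import itemgetter


def canonicalize_patents(rows: list[dict]) -> list[dict]:
    """Keep one row per publication_number, preferring longer abstracts,
    English text, earlier publication dates, then longer titles."""
    rows = sorted(rows, key=itemgetter("publication_number"))
    picked = []
    for _, group in groupby(rows, key=itemgetter("publication_number")):
        picked.append(min(group, key=lambda r: (
            -r["abstract_len"],        # longest abstract first
            r["lang"] != "en",         # English before other languages
            r["publication_date_dt"],  # earliest ISO date wins remaining ties
            -len(r["title"]),          # longest title breaks the last tie
        )))
    return picked
```

Sorting by `publication_number` first lets `groupby` see each publication's rows as one contiguous run, so the pick is a simple `min` per group.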
Outputs:
`patent_canonical.*`, `patent_dupe_links.*`, `patents_normalized.*`
## OpenAlex topics (`ctclean openalex`)
Processing:
- Canonical topics table (`topics_canonical.*`) plus M2M tables `topic_keywords_m2m.*` and `topic_siblings_m2m.*`; reference tables `domains_ref.*`, `fields_ref.*`, `subfields_ref.*`.
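An M2M table of this shape can be derived by exploding a list column. This pandas sketch assumes a `keywords` list column on the canonical topics table; the column names are illustrative:

```python
import pandas as pd

# Toy canonical topics with a list-valued keywords column (illustrative data).
topics = pd.DataFrame({
    "topic_id": ["T1", "T2"],
    "keywords": [["solar", "pv"], ["wind"]],
})

# One (topic_id, keyword) row per pair -- the shape of a *_m2m table.
topic_keywords_m2m = (
    topics[["topic_id", "keywords"]]
    .explode("keywords")
    .rename(columns={"keywords": "keyword"})
    .reset_index(drop=True)
)
```

The same explode-and-rename pattern would yield `topic_siblings_m2m.*` from a list of sibling topic IDs.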
## Writing & types
- All writers go through `ctclean.io.safe_write`: Parquet first; on schema/type issues it sanitizes object columns and retries; final fallback is CSV.gz.
## Validation & Unify
Run:
Writes `cleantech_data/silver/unified/unified_docs.parquet` with columns:

`doc_id`, `doc_type`, `title`, `text`, `date`, `lang`, `source`, `url`, `country`, `cpc_codes`
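One way to picture the unification step: each pipeline's canonical rows are mapped into this shared column set. The mapping below is an illustrative sketch, not the actual unify code; the `media:` ID prefix and the null fills are assumptions:

```python
UNIFIED_COLS = ["doc_id", "doc_type", "title", "text", "date",
                "lang", "source", "url", "country", "cpc_codes"]


def media_to_unified(row: dict) -> dict:
    """Illustrative mapping from a media_canonical row to the unified schema.
    Patent rows would map analogously (abstract -> text, cpc_codes kept)."""
    return {
        "doc_id": f"media:{row['url_key']}",
        "doc_type": "media",
        "title": row["title_clean"],
        "text": row["content_clean"],
        "date": row["date"],
        "lang": row["lang"],
        "source": row["domain"],
        "url": row["url_clean"],
        "country": None,     # not available for media rows (assumption)
        "cpc_codes": None,   # patent-only column
    }
```

Columns a source lacks are filled with nulls so every `doc_type` shares one schema in the output Parquet.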
## See also
- Notebook: `02_Cleantech_Silver_Validation_and_Unify`