ctclean: Silver cleaning pipeline
This package turns Bronze snapshots into clean, typed, de‑duplicated Silver tables for:
- Media (Kaggle: cleantech-media-dataset)
- Patents (Kaggle: cleantech-google-patent-dataset)
- OpenAlex Topics
It mirrors the latest Bronze date bucket on disk and writes the Silver outputs to a matching Silver bucket.
If Parquet engines are missing, files are written as *.csv.gz automatically.
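The fallback described above can be sketched as follows. This is a minimal illustration, not the package's `safe_write`: the function name `write_rows` and the row-of-dicts input are assumptions, and the sketch only probes for `pyarrow` rather than both engines.

```python
import csv
import gzip
from pathlib import Path


def write_rows(rows: list, stem: Path) -> Path:
    """Write rows (a list of dicts) to <stem>.parquet, else <stem>.csv.gz.

    Hypothetical helper: Parquet is used only if pyarrow can be imported.
    """
    stem.parent.mkdir(parents=True, exist_ok=True)
    columns = list(rows[0]) if rows else []
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        # No Parquet engine available: fall back to gzip-compressed CSV.
        out = stem.with_name(stem.name + ".csv.gz")
        with gzip.open(out, "wt", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns)
            writer.writeheader()
            writer.writerows(rows)
        return out
    # Parquet engine available: build a column-oriented table and write it.
    out = stem.with_name(stem.name + ".parquet")
    table = pa.table({col: [row[col] for row in rows] for col in columns})
    pq.write_table(table, out)
    return out
```

Returning the path that was actually written lets the caller log whether the Parquet or the CSV.gz branch ran.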
End‑to‑end: pull Bronze data with cleantech_pipeline, then run one of the
ctclean commands below. Each command reads the newest Bronze bucket and writes
its Silver counterpart. When finished, use ctunify run to validate and merge
the latest Silver buckets into cleantech_data/silver/unified/unified_docs.parquet.
What each command produces
python -m ctclean media
Outputs
- `media_canonical`: one canonical row per article
- `media_dupe_links`: mapping of merged duplicates → canonical
- (audit) `media_excluded_listings.*`, `media_excluded_non_articles.*`
Key normalisations
- URL canonicalisation (drops tracking params), domain extraction
- Content cleaning with per-domain stoplines (footer/boilerplate removal), HTML → text conversion, whitespace normalisation
- Quality gate on content length (200+ chars)
- De-duplication by:
  - exact `url_key`
  - exact `content_sha1` (with length/word guards)
  - gated title fingerprint within domain (length ratio ≥ 0.9, date span ≤ 7 days)
- Listings / non-articles behaviour:
  - Non-articles (about/privacy/store/events, etc.) are always dropped before canonicalization.
  - Listings are dropped by default, but you can keep them with `--include-listings` (this matches the notebook, which flags listings but doesn't drop them pre-merge).
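To make the de-duplication keys above concrete, here is a rough sketch. It is not the code in `ctclean.fingerprints`: the helper names, the list of tracking parameters, the normalisation details, and the choice to compare content lengths in the ratio gate are all assumptions. The length/word guards on the content hash are omitted.

```python
import hashlib
import re
from datetime import date
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the real pipeline may drop a different set.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def make_url_key(url: str) -> str:
    """Lowercase scheme/host, drop tracking params, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
        and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def make_content_sha1(text: str) -> str:
    """SHA-1 of the text after collapsing whitespace and lowercasing."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def make_title_fingerprint(title: str) -> str:
    """Lowercased alphanumeric tokens joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def title_match_allowed(len_a: int, len_b: int, date_a: date, date_b: date) -> bool:
    """Gate from the text: length ratio >= 0.9 and date span <= 7 days."""
    if min(len_a, len_b) == 0:
        return False
    ratio = min(len_a, len_b) / max(len_a, len_b)
    return ratio >= 0.9 and abs((date_a - date_b).days) <= 7
```

Under these assumptions, `https://Example.com/news/a/?utm_source=x&id=5#top` and `https://example.com/news/a?id=5&fbclid=abc` share the key `https://example.com/news/a?id=5`, so they would merge at the first de-duplication step.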
python -m ctclean patents
Outputs
- `patent_canonical`: one canonical row per `publication_number`
- `patent_dupe_links`: member → canonical mappings
- `patents_normalized`: one row per publication with aggregated CPC codes & inventors
Key normalisations
- Title/abstract cleaning, strict date parsing (`YYYYMMDD`, 10/13-digit epoch)
- Canonical selection favours longer abstracts and mild "Englishness"
- Aggregation of CPC and inventors across duplicate rows
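The strict date parsing mentioned above could look roughly like this. The function name `parse_patent_date` is hypothetical. Reading epochs as UTC and returning `None` for anything unrecognised are assumptions, not confirmed behaviour of `ctclean.patent_pipeline`.

```python
from datetime import date, datetime, timezone
from typing import Optional


def parse_patent_date(value) -> Optional[date]:
    """Parse YYYYMMDD or a 10/13-digit epoch; return None for anything else."""
    text = str(value).strip()
    if not (text.isascii() and text.isdigit()):
        return None
    if len(text) == 8:
        try:
            return datetime.strptime(text, "%Y%m%d").date()
        except ValueError:
            return None  # digits only, but not a real calendar date
    if len(text) in (10, 13):
        # 13 digits are treated as milliseconds, 10 digits as seconds.
        seconds = int(text) / 1000 if len(text) == 13 else int(text)
        return datetime.fromtimestamp(seconds, tz=timezone.utc).date()
    return None
```

Being strict here means that a formatted string such as `2023-01-15` is rejected rather than guessed at, which keeps ambiguous values out of the typed Silver column.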
python -m ctclean openalex
Outputs
- `topics_canonical`
- `topic_keywords_m2m`
- `topic_siblings_m2m`
- `domains_ref`, `fields_ref`, `subfields_ref`
Input discovery (automatic newest bucket + file picking)
For each dataset the CLI selects the newest date bucket (folder YYYY‑MM‑DD) under Bronze.
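Newest-bucket selection can be sketched in a few lines. This is an illustration only: `newest_bucket` is a hypothetical stand-in for the package's own discovery logic, and it assumes bucket folders are named exactly `YYYY-MM-DD`.

```python
import re
from pathlib import Path
from typing import Optional

BUCKET_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def newest_bucket(dataset_root: Path) -> Optional[Path]:
    """Return the newest YYYY-MM-DD subfolder, or None if there is none."""
    buckets = [
        p for p in dataset_root.iterdir()
        if p.is_dir() and BUCKET_RE.match(p.name)
    ]
    # ISO-style dates sort chronologically when compared as plain strings.
    return max(buckets, key=lambda p: p.name, default=None)
```

Because the folder names are zero-padded ISO dates, no date parsing is needed to find the most recent one.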
Media CSV (under …/bronze/kaggle/cleantech-media-dataset/<date>/extracted)
Preference:
- `cleantech_media_dataset_v3_*.csv`
- any `*media*.csv`
- avoids `rag_evaluation`, `evaluation`, `sample`
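The preference order above can be read as "try patterns in order, after filtering out unwanted names". The sketch below follows that reading. The function name `pick_media_csv`, the treatment of "avoids" as a hard exclusion, and the tie-break rule are assumptions rather than the behaviour of `file_select.py`.

```python
from fnmatch import fnmatch
from typing import Iterable, Optional

PREFERRED_PATTERNS = ("cleantech_media_dataset_v3_*.csv", "*media*.csv")
AVOID_TOKENS = ("rag_evaluation", "evaluation", "sample")


def pick_media_csv(filenames: Iterable[str]) -> Optional[str]:
    """Return the best candidate file name, or None if nothing matches."""
    # Drop any name containing an avoided token (assumed hard exclusion).
    candidates = [
        name for name in sorted(filenames)
        if not any(token in name.lower() for token in AVOID_TOKENS)
    ]
    # Try each pattern in preference order; the first one with a hit wins.
    for pattern in PREFERRED_PATTERNS:
        matches = [name for name in candidates if fnmatch(name.lower(), pattern)]
        if matches:
            return matches[-1]  # assumption: last in sorted order wins ties
    return None
```

The same shape would apply to the patent file picker with its own patterns and avoided tokens.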
Patent JSON (under …/bronze/kaggle/cleantech-google-patent-dataset/<date>/extracted)
Preference:
- `*updated*.json` / `*.jsonl` (also supports `.gz`)
- `cleantech_22-24*.json*`
- avoids `bq-results`, `tmp`, `sample`
OpenAlex Topics (under …/bronze/openalex/topics/<date>/extracted)
- `topics.jsonl` (or `topics.jsonl.gz` if present)
The chosen file is echoed in the terminal.
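How to run (Windows, from your repo root)
In PowerShell, run the command for the dataset you need:
- `python -m ctclean media`
- `python -m ctclean patents`
- `python -m ctclean openalex`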
Outputs are written to a Silver bucket that mirrors the Bronze date bucket used as input.
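Useful switches
- Keep listing/archive pages in media canonicalization (notebook parity): `python -m ctclean media --include-listings`
- Use a specific Bronze bucket: set `KAGGLE_BRONZE_DIR` or `OPENALEX_TOPICS_DIR` as described under Environment overrides.
- Write Silver to a custom folder: the switch for this is listed in the built-in help.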
Environment overrides
- `KAGGLE_BRONZE_DIR` → points `media` and `patents` to a specific Kaggle bucket
- `OPENALEX_TOPICS_DIR` → points `openalex` to a specific topics bucket
If these are unset, the newest bucket is picked automatically.
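The override-then-fallback behaviour can be sketched like this. `resolve_bucket` is a hypothetical helper, not part of `ctclean.paths`. Treating a blank variable as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_bucket(
    env_var: str,
    newest: Callable[[], Optional[Path]],
    environ: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Use the directory named in env_var if set, else the newest bucket."""
    override = environ.get(env_var, "").strip()
    if override:
        return Path(override)
    # Unset or blank: defer to automatic newest-bucket discovery.
    return newest()
```

Passing the discovery step in as a callable keeps the automatic scan from running at all when an override is present.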
CLI --help
View available switches and processing notes via the built-in help, for example by passing `--help` to `python -m ctclean`.
Modules overview
- `ctclean.text_clean` – HTML strip, domain-tail removal (e.g. Energy-XPRT ads), deduped paragraphs, listing detection
- `ctclean.fingerprints` – `content_sha1`, `title_fingerprint`, normalized `url_key`
- `ctclean.media_pipeline` – `prepare`, `attach_non_article_reason`, `quality_gate`, `canonicalize`
- `ctclean.patent_pipeline` – `prepare`, `canonicalize`, `normalize`
- `ctclean.openalex_pipeline` – canonical topics + M2M tables and refs
- `ctclean.io` – `safe_write` with Parquet-first and CSV.gz fallback
- `ctclean.paths` – data root helpers and latest bucket discovery
Notes & troubleshooting
- Parquet vs CSV.GZ: if neither `pyarrow` nor `fastparquet` is installed, the writer falls back to `*.csv.gz`.
- Echoed selection: media/patents log which file they chose; adjust the heuristics in `file_select.py` if naming changes.
- Paths module: newest-bucket logic lives in `ctclean/paths.py` (`latest_bucket`).