
ctclean — Silver cleaning pipeline

This package turns Bronze snapshots into clean, typed, de‑duplicated Silver tables for:

  • Media (Kaggle: cleantech-media-dataset)
  • Patents (Kaggle: cleantech-google-patent-dataset)
  • OpenAlex Topics

It mirrors the latest Bronze date bucket on disk and writes the Silver outputs under:

Text Only
1
2
3
4
cleantech_data/silver/
  ├─ media/<YYYY-MM-DD>/
  ├─ patents/<YYYY-MM-DD>/
  └─ openalex/<YYYY-MM-DD>/

If Parquet engines are missing, files are written as *.csv.gz automatically.
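The fallback behaviour can be sketched as a small helper (a hypothetical `write_table`; the real writer is `safe_write` in `ctclean.io`):

```python
import pandas as pd
from pathlib import Path

def write_table(df: pd.DataFrame, out_base: Path) -> Path:
    """Write Parquet if an engine (pyarrow/fastparquet) is available,
    otherwise fall back to gzipped CSV."""
    try:
        path = out_base.with_suffix(".parquet")
        df.to_parquet(path, index=False)  # raises ImportError without an engine
        return path
    except ImportError:
        path = out_base.with_suffix(".csv.gz")
        df.to_csv(path, index=False, compression="gzip")
        return path
```

Downstream readers only need to check the returned path's extension to know which format was written.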

End‑to‑end: pull Bronze data with cleantech_pipeline, then run one of the ctclean commands below. Each command reads the newest Bronze bucket and writes its Silver counterpart. When finished, use ctunify run to validate and merge the latest Silver buckets into cleantech_data/silver/unified/unified_docs.parquet.


What each command produces

python -m ctclean media

Outputs

  • media_canonical — one canonical row per article
  • media_dupe_links — mapping of merged duplicates → canonical
  • (audit) media_excluded_listings.*, media_excluded_non_articles.*

Key normalisations

  • URL canonicalisation (drops tracking params), domain extraction
  • Content cleaning with per‑domain stoplines (footer/boilerplate removal), HTML → text, whitespace
  • Quality gate on content length (200+ chars)
  • De‑duplication by:
      • exact url_key
      • exact content_sha1 (with length/word guards)
      • gated title fingerprint within domain (length ratio ≥ 0.9, date span ≤ 7 days)
  • Listings / non‑articles behaviour:
      • Non‑articles (about/privacy/store/events, etc.) are always dropped before canonicalization.
      • Listings are dropped by default; keep them with --include-listings (this matches the notebook, which flags listings but doesn’t drop them pre‑merge).
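The first two de‑duplication keys can be sketched as follows (illustrative helpers; the real implementations live in ctclean.fingerprints, and the exact tracking‑parameter list is an assumption):

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Common tracking parameters dropped during URL canonicalisation (illustrative list)
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

def url_key(url: str) -> str:
    """Canonicalise a URL: lowercase scheme/host, drop tracking params and fragment."""
    parts = urlsplit(url.strip())
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in TRACKING_PARAMS])
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), query, ""))

def content_sha1(text: str) -> str:
    """SHA-1 over whitespace-normalised, lowercased content for exact-dupe detection."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()
```

Two URLs that differ only in tracking parameters, or two bodies that differ only in whitespace, then collapse to the same key.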

python -m ctclean patents

Outputs

  • patent_canonical — one canonical row per publication_number
  • patent_dupe_links — member → canonical mappings
  • patents_normalized — one row per publication with aggregated CPC codes & inventors

Key normalisations

  • Title/abstract cleaning, strict date parsing (YYYYMMDD, 10/13‑digit epoch)
  • Canonical selection favours longer abstracts and mild “Englishness”
  • Aggregation of CPC and inventors across duplicate rows

python -m ctclean openalex

Outputs

  • topics_canonical
  • topic_keywords_m2m
  • topic_siblings_m2m
  • domains_ref, fields_ref, subfields_ref

Input discovery (automatic newest bucket + file picking)

For each dataset, the CLI selects the newest date bucket (a folder named YYYY‑MM‑DD) under Bronze.

Media CSV (under …/bronze/kaggle/cleantech-media-dataset/<date>/extracted)

Preference:

  1. cleantech_media_dataset_v3_*.csv
  2. any *media*.csv
  3. avoids rag_evaluation, evaluation, sample

Patent JSON (under …/bronze/kaggle/cleantech-google-patent-dataset/<date>/extracted)

Preference:

  1. *updated*.json / *.jsonl (also supports .gz)
  2. cleantech_22-24*.json*
  3. avoids bq-results, tmp, sample

OpenAlex Topics (under …/bronze/openalex/topics/<date>/extracted)

  • topics.jsonl (or topics.jsonl.gz if present)

The chosen file is echoed in the terminal.


How to run (Windows, from your repo root)

PowerShell
$env:PYTHONPATH="C:\Users\gerbe\PycharmProjects\MT\src\pipeline"

# Run individually
python -m ctclean media
python -m ctclean patents
python -m ctclean openalex

# Or run all three
python -m ctclean all

Outputs will be written under:

Text Only
C:\Users\gerbe\PycharmProjects\MT\cleantech_data\silver\<dataset>\<YYYY-MM-DD>\

Useful switches

  • Keep listing/archive pages in media canonicalization (notebook‑parity):

    PowerShell
    python -m ctclean media --include-listings
    
  • Use a specific Bronze bucket:

    PowerShell
    python -m ctclean media   --bronze_dir "C:\...\cleantech_data\bronze\kaggle\cleantech-media-dataset\2025-08-09"
    python -m ctclean patents --bronze_dir "C:\...\cleantech_data\bronze\kaggle\cleantech-google-patent-dataset\2025-08-09"
    python -m ctclean openalex --bronze_dir "C:\...\cleantech_data\bronze\openalex\topics\2025-08-09"
    
  • Write Silver to a custom folder:

    PowerShell
    python -m ctclean patents --silver_dir "C:\...\cleantech_data\silver\patents\2025-08-09"
    

Environment overrides

  • KAGGLE_BRONZE_DIR → points media and patents to a specific Kaggle bucket
  • OPENALEX_TOPICS_DIR → points openalex to a specific topics bucket

If these are unset, the newest bucket is picked automatically.
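The resolution order can be sketched as (a hypothetical `bronze_dir_for`; the package's actual helper names may differ):

```python
import os
from pathlib import Path

def bronze_dir_for(env_var: str, dataset_root: Path) -> Path:
    """An environment override wins; otherwise pick the newest YYYY-MM-DD bucket."""
    override = os.environ.get(env_var)
    if override:
        return Path(override)
    # ISO dates sort lexicographically, so max() by name is the newest bucket
    return max((p for p in dataset_root.iterdir() if p.is_dir()),
               key=lambda p: p.name)
```

For example, setting `KAGGLE_BRONZE_DIR` pins media/patents to one bucket while leaving OpenAlex on automatic discovery.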

CLI --help

View available switches and processing notes via the built‑in help. Examples:

PowerShell
python -m ctclean media --help

Usage: python -m ctclean media [OPTIONS]

Notebook sequence: 1) prepare() 2) non-article suppression 3) quality gate (min 200 chars) 4) canonicalize

Options:
  --n-rows INTEGER                                      Read only top N rows (for debugging)
  --bronze-dir PATH
  --silver-dir PATH
  --include-listings / --no-include-listings             Keep listing/archive pages in canonicalization (the notebook keeps them).
  --help                                                Show this message and exit.

python -m ctclean patents --help

Usage: python -m ctclean patents [OPTIONS]

Options:
  --n-rows INTEGER        Read only top N rows (for debugging)
  --bronze-dir PATH
  --silver-dir PATH
  --help                  Show this message and exit.

python -m ctclean openalex --help

Usage: python -m ctclean openalex [OPTIONS]

Options:
  --n-rows INTEGER        Read only top N rows (for debugging)
  --bronze-dir PATH
  --silver-dir PATH
  --help                  Show this message and exit.

Modules overview

  • ctclean.text_clean – HTML strip, domain-tail removal (e.g. Energy‑XPRT ads), deduped paragraphs, listing detection
  • ctclean.fingerprints – content_sha1, title_fingerprint, normalized url_key
  • ctclean.media_pipeline – prepare, attach_non_article_reason, quality_gate, canonicalize
  • ctclean.patent_pipeline – prepare, canonicalize, normalize
  • ctclean.openalex_pipeline – canonical topics + M2M tables and refs
  • ctclean.io – safe_write with Parquet-first and CSV.gz fallback
  • ctclean.paths – data root helpers and latest bucket discovery

Notes & troubleshooting

  • Parquet vs CSV.GZ: if pyarrow/fastparquet isn’t installed, the writer falls back to *.csv.gz.
  • Echoed selection: media/patents log which file they chose; adjust the heuristics in file_select.py if naming changes.
  • Paths module: newest‑bucket logic lives in ctclean/paths.py (latest_bucket).