ctclean — Silver cleaning pipeline¶

This package turns Bronze snapshots into clean, typed, de‑duplicated Silver tables for:

Media (Kaggle: cleantech-media-dataset)
Patents (Kaggle: cleantech-google-patent-dataset)
OpenAlex Topics

It mirrors the latest Bronze date bucket on disk and writes the Silver outputs under:

cleantech_data/silver/
├─ media/<YYYY-MM-DD>/
├─ patents/<YYYY-MM-DD>/
└─ openalex/<YYYY-MM-DD>/

If Parquet engines are missing, files are written as *.csv.gz automatically.
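
The fallback can be pictured with the sketch below. It is not the actual `ctclean.io.safe_write`; `write_table` is a hypothetical helper, and it assumes rows arrive as a list of dicts and that only `pyarrow` is probed (the real writer may also accept `fastparquet`).

```python
import csv
import gzip
from pathlib import Path
from typing import Any, Dict, List


def write_table(rows: List[Dict[str, Any]], base_path: Path) -> Path:
    """Write rows as Parquet when pyarrow is importable, else as gzip-compressed CSV.

    Hypothetical helper for illustration; returns the path actually written.
    """
    base_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        # No Parquet engine available: fall back to <name>.csv.gz
        out = base_path.with_name(base_path.name + ".csv.gz")
        fieldnames = list(rows[0].keys()) if rows else []
        with gzip.open(out, "wt", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        return out
    out = base_path.with_name(base_path.name + ".parquet")
    pq.write_table(pa.Table.from_pylist(rows), str(out))
    return out
```

Returning the written path lets a caller log which format was used without inspecting the environment.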

End‑to‑end: pull Bronze data with cleantech_pipeline, then run one of the ctclean commands below. Each command reads the newest Bronze bucket and writes its Silver counterpart. When finished, use ctunify run to validate and merge the latest Silver buckets into cleantech_data/silver/unified/unified_docs.parquet.

What each command produces¶

`python -m ctclean media`¶

Outputs

media_canonical — one canonical row per article
media_dupe_links — mapping of merged duplicates → canonical
(audit) media_excluded_listings.*, media_excluded_non_articles.*

Key normalisations

URL canonicalisation (drops tracking params), domain extraction
Content cleaning with per‑domain stoplines (footer/boilerplate removal), HTML → text, whitespace
Quality gate on content length (200+ chars)
De‑duplication by:
  exact url_key
  exact content_sha1 (with length/word guards)
  gated title fingerprint within domain (length ratio ≥ 0.9, date span ≤ 7 days)

Listings / non‑articles behaviour

Non‑articles (about/privacy/store/events, etc.) are always dropped before canonicalization.
Listings are dropped by default, but you can keep them with --include-listings (this matches the notebook which flags listings but doesn’t drop them pre‑merge).
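
The normalisation and de-duplication keys above could look roughly like the following sketch. It is an illustration only, not the code in `ctclean.fingerprints`: the tracking-parameter list, the whitespace and case normalisation before hashing, and the reading of "length ratio" as a ratio of content lengths are all assumptions, and the function names other than `url_key` and `content_sha1` are hypothetical.

```python
import hashlib
import re
from datetime import date
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the real list in ctclean may differ.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}
MIN_CONTENT_CHARS = 200  # quality gate threshold stated above


def url_key(url: str) -> str:
    """Lower-case scheme and host, drop tracking params, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
        and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def content_sha1(text: str) -> str:
    """SHA-1 of whitespace-collapsed, lower-cased content (assumed normalisation)."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def passes_quality_gate(text: str) -> bool:
    """Keep only articles with at least MIN_CONTENT_CHARS characters."""
    return len(text.strip()) >= MIN_CONTENT_CHARS


def title_match_allowed(len_a: int, len_b: int, date_a: date, date_b: date) -> bool:
    """Gate for title-fingerprint merges: length ratio >= 0.9 and dates within 7 days."""
    if min(len_a, len_b) == 0:
        return False
    ratio = min(len_a, len_b) / max(len_a, len_b)
    return ratio >= 0.9 and abs((date_a - date_b).days) <= 7
```

The gate on title matches is what keeps two different articles that happen to share a headline within one domain from being merged.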

`python -m ctclean patents`¶

Outputs

patent_canonical — one canonical row per publication_number
patent_dupe_links — member → canonical mappings
patents_normalized — one row per publication with aggregated CPC codes & inventors

Key normalisations

Title/abstract cleaning, strict date parsing (YYYYMMDD, 10/13‑digit epoch)
Canonical selection favours longer abstracts and mild “Englishness”
Aggregation of CPC and inventors across duplicate rows
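
As a rough illustration of the strict date parsing, the sketch below accepts only the three shapes named above and returns `None` for everything else. `parse_patent_date` is a hypothetical name, and interpreting epochs as UTC is an assumption; the actual logic in `ctclean.patent_pipeline` may differ.

```python
from datetime import date, datetime, timezone
from typing import Optional


def parse_patent_date(value) -> Optional[date]:
    """Parse YYYYMMDD or a 10/13-digit epoch; anything else yields None."""
    text = str(value).strip()
    if not (text.isascii() and text.isdigit()):
        return None
    if len(text) == 8:
        try:
            return datetime.strptime(text, "%Y%m%d").date()
        except ValueError:
            # Eight digits that do not form a valid calendar date
            return None
    if len(text) == 10:
        # Epoch seconds, read as UTC (assumption)
        return datetime.fromtimestamp(int(text), tz=timezone.utc).date()
    if len(text) == 13:
        # Epoch milliseconds, read as UTC (assumption)
        return datetime.fromtimestamp(int(text) / 1000, tz=timezone.utc).date()
    return None
```

Rejecting unknown shapes instead of guessing keeps malformed values visible as missing dates rather than silently wrong ones.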

`python -m ctclean openalex`¶

Outputs

topics_canonical
topic_keywords_m2m
topic_siblings_m2m
domains_ref, fields_ref, subfields_ref

Input discovery (automatic newest bucket + file picking)¶

For each dataset the CLI selects the newest date bucket (folder YYYY‑MM‑DD) under Bronze.
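
A minimal sketch of newest-bucket selection, assuming buckets are direct subfolders named exactly YYYY-MM-DD. `newest_bucket` is a hypothetical stand-in, not the `latest_bucket` function in `ctclean/paths.py`.

```python
import re
from pathlib import Path
from typing import Optional

DATE_BUCKET = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def newest_bucket(dataset_root: Path) -> Optional[Path]:
    """Return the subfolder with the greatest YYYY-MM-DD name, or None if there is none."""
    buckets = [
        p for p in dataset_root.iterdir()
        if p.is_dir() and DATE_BUCKET.match(p.name)
    ]
    # ISO dates sort chronologically as plain strings.
    return max(buckets, key=lambda p: p.name, default=None)
```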

Media CSV (under `…/bronze/kaggle/cleantech-media-dataset/<date>/extracted`)¶

Preference:

cleantech_media_dataset_v3_*.csv
any *media*.csv
avoids rag_evaluation, evaluation, sample

Patent JSON (under `…/bronze/kaggle/cleantech-google-patent-dataset/<date>/extracted`)¶

Preference:

*updated*.json / *.jsonl (also supports .gz)
cleantech_22-24*.json*
avoids bq-results, tmp, sample

OpenAlex Topics (under `…/bronze/openalex/topics/<date>/extracted`)¶

topics.jsonl (or topics.jsonl.gz if present)

The chosen file is echoed in the terminal.
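
The preference rules can be sketched as an ordered pattern search with an exclusion list. This is an illustration under assumptions, not the heuristics in file_select.py: `pick_file` is a hypothetical name, matching is done on lower-cased file names, and ties within a pattern are broken alphabetically.

```python
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Optional, Sequence

# Patterns and exclusions for the media CSV, taken from the list above.
MEDIA_PREFERRED = ("cleantech_media_dataset_v3_*.csv", "*media*.csv")
MEDIA_AVOID = ("rag_evaluation", "evaluation", "sample")


def pick_file(
    candidates: Sequence[Path],
    preferred: Sequence[str],
    avoid: Sequence[str],
) -> Optional[Path]:
    """Return a file matching the earliest preferred pattern, skipping avoided names."""
    usable = [
        p for p in candidates
        if not any(token in p.name.lower() for token in avoid)
    ]
    for pattern in preferred:
        matches = sorted(p for p in usable if fnmatchcase(p.name.lower(), pattern))
        if matches:
            return matches[0]  # alphabetical tie-break (assumption)
    return None
```

Filtering out the avoided names first means an evaluation or sample file can never win, even if it matches a preferred pattern.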

How to run (Windows, from your repo root)¶

PowerShell
$env:PYTHONPATH="C:\Users\gerbe\PycharmProjects\MT\src\pipeline"

# Run individually
python -m ctclean media
python -m ctclean patents
python -m ctclean openalex

# Or run all three
python -m ctclean all

Outputs will be written under:

`C:\Users\gerbe\PycharmProjects\MT\cleantech_data\silver\<dataset>\<YYYY-MM-DD>\`

Useful switches¶

Keep listing/archive pages in media canonicalization (notebook‑parity):
PowerShell
python -m ctclean media --include-listings

Use a specific Bronze bucket:

PowerShell
python -m ctclean media   --bronze-dir "C:\...\cleantech_data\bronze\kaggle\cleantech-media-dataset\2025-08-09"
python -m ctclean patents --bronze-dir "C:\...\cleantech_data\bronze\kaggle\cleantech-google-patent-dataset\2025-08-09"
python -m ctclean openalex --bronze-dir "C:\...\cleantech_data\bronze\openalex\topics\2025-08-09"

Write Silver to a custom folder:

PowerShell
python -m ctclean patents --silver-dir "C:\...\cleantech_data\silver\patents\2025-08-09"

Environment overrides¶

KAGGLE_BRONZE_DIR → points media and patents to a specific Kaggle bucket
OPENALEX_TOPICS_DIR → points openalex to a specific topics bucket

If these are unset, the newest bucket is picked automatically.
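
The override-then-discover behaviour might be sketched as follows. `resolve_bucket` and its `discover` callback are hypothetical names used for illustration, and treating an empty variable as unset is an assumption; how the overrides interact with the command-line switches is not covered here.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_bucket(
    env_var: str,
    discover: Callable[[], Optional[Path]],
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Use the directory named by env_var when set and non-empty, else run discovery."""
    override = env.get(env_var, "").strip()
    if override:
        return Path(override)
    return discover()
```

Passing the environment as a mapping keeps the function easy to exercise without mutating the real process environment.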

CLI `--help`¶

View available switches and processing notes via the built‑in help. Examples:

PowerShell
python -m ctclean media --help

Usage: python -m ctclean media [OPTIONS]

Notebook sequence: 1) prepare() 2) non-article suppression 3) quality gate (min 200 chars) 4) canonicalize

Options:
  --n-rows INTEGER                                      Read only top N rows (for debugging)
  --bronze-dir PATH
  --silver-dir PATH
  --include-listings / --no-include-listings             Keep listing/archive pages in canonicalization (the notebook keeps them).
  --help                                                Show this message and exit.

python -m ctclean patents --help

Usage: python -m ctclean patents [OPTIONS]

Options:
  --n-rows INTEGER        Read only top N rows (for debugging)
  --bronze-dir PATH
  --silver-dir PATH
  --help                  Show this message and exit.

python -m ctclean openalex --help

Usage: python -m ctclean openalex [OPTIONS]

Options:
  --n-rows INTEGER        Read only top N rows (for debugging)
  --bronze-dir PATH
  --silver-dir PATH
  --help                  Show this message and exit.

Modules overview¶

ctclean.text_clean – HTML strip, domain-tail removal (e.g. Energy‑XPRT ads), deduped paragraphs, listing detection
ctclean.fingerprints – content_sha1, title_fingerprint, normalized url_key
ctclean.media_pipeline – prepare, attach_non_article_reason, quality_gate, canonicalize
ctclean.patent_pipeline – prepare, canonicalize, normalize
ctclean.openalex_pipeline – canonical topics + M2M tables and refs
ctclean.io – safe_write with Parquet-first and CSV.gz fallback
ctclean.paths – data root helpers and latest bucket discovery

Notes & troubleshooting¶

Parquet vs CSV.GZ: if pyarrow/fastparquet isn’t installed, the writer falls back to *.csv.gz.
Echoed selection: media/patents log which file they chose; adjust the heuristics in file_select.py if naming changes.
Paths module: newest‑bucket logic lives in ctclean/paths.py (latest_bucket).
