
Gold

The Gold layer turns the unified Silver corpus into enriched, retrieval-ready chunks. It subsamples ctunify output, slices each document into fixed windows, and annotates every chunk with LLM-derived metadata.

(Pipeline diagram: Gold)

Inputs & Outputs

  • Inputs
      • cleantech_data/silver/unified/unified_docs.parquet — produced by ctunify.
      • Optional environment overrides via CT_DATA_ROOT / DATA_ROOT.
  • Intermediate outputs
      • Subsample: cleantech_data/silver_subsample/unified_docs_subsample.parquet + manifest.json (by, seed, allocation, metrics.tv_doc_type, metrics.tv_lang, ...).
      • Chunking: cleantech_data/silver_subsample_chunk/fixed_size/<YYYY-MM-DD>/chunks.parquet and chunks.jsonl.
  • Gold artifacts (per dataset = fixed_size | semantic)
      • cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/chunks_enriched.parquet
      • cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/grounded_extractions.jsonl
      • cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/visualization_interactive.html
      • Progress state: completed_ids.json, failed_chunks.json, failed_details.json, manifest.json.
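The CT_DATA_ROOT / DATA_ROOT overrides presumably resolve in order before falling back to the repo-local cleantech_data/ directory. A minimal sketch of that resolution (the helper name `data_root` is hypothetical, not part of the package):

```python
import os
from pathlib import Path

def data_root() -> Path:
    """Hypothetical sketch: resolve the pipeline data root by checking
    CT_DATA_ROOT, then DATA_ROOT, then the default cleantech_data/."""
    for var in ("CT_DATA_ROOT", "DATA_ROOT"):
        value = os.environ.get(var)
        if value:
            return Path(value)
    return Path("cleantech_data")

# All paths in this page hang off this root, e.g. the ctunify output:
print(data_root() / "silver" / "unified" / "unified_docs.parquet")
```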

Schema (chunks_enriched.parquet)

column    dtype          notes
id        string         chunk identifier (<doc_id>_<chunk_type>_chunk_<index>).
text      string         chunk body (title optionally prepended).
metadata  string (JSON)  JSON object preserving provenance and enrichment (see below).

Example metadata payload (abridged):

JSON
{
  "doc_id": "media_2024-02-19_000123",
  "doc_type": "media",
  "chunk_type": "fixed",
  "chunk_index": 0,
  "token_count": 486,
  "date": "2024-02-19T00:00:00",
  "lang": "en",
  "source": "cleantech_media",
  "entities": {
    "person": ["Jane Doe"],
    "org": ["HydrogenCo"],
    "location": ["Hamburg"],
    "technology": ["green hydrogen"],
    "technology_with_subfields": [
      {"term": "green hydrogen", "subfield_id": "6204", "subfield_name": "Hydrogen technologies", "score": 0.82}
    ]
  },
  "event_dates": [{"label": "2024", "year": 2024, "month": null, "day": null}],
  "numeric_facts": [{"name": "capacity_MW", "value": 140.0, "unit": "MW"}],
  "role_annotations": [{"person": "Jane Doe", "role": "CTO"}],
  "chunk_summary": "HydrogenCo commissions a 140 MW electrolyzer in Hamburg…",
  "chunk_summary_source": "llm",
  "chunk_summary_provider": "gemini",
  "validation": {
    "passed": true,
    "min_extractions": 1,
    "min_span_integrity_pct": 40.0,
    "span_integrity_pct": 78.6,
    "extraction_provider_used": "gemini",
    "extraction_model_used": "gemini-2.5-flash-lite"
  }
}
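Because metadata is stored as a JSON string per row, downstream code has to parse it before filtering or aggregating. A minimal sketch (not the pipeline's own code; `flatten_metadata` is a hypothetical helper) that flattens a few fields from the payload shown above into ordinary columns:

```python
import json
import pandas as pd

def flatten_metadata(gold: pd.DataFrame) -> pd.DataFrame:
    """Parse the JSON metadata column once and expose selected fields
    (doc_type, validation.passed, entities.org) as plain columns."""
    meta = gold["metadata"].apply(json.loads)
    out = gold.copy()
    out["doc_type"] = meta.apply(lambda m: m["doc_type"])
    out["passed"] = meta.apply(lambda m: m["validation"]["passed"])
    out["orgs"] = meta.apply(lambda m: m["entities"].get("org", []))
    return out

# Toy row shaped like the example payload above; in practice you would
# pd.read_parquet(...) the chunks_enriched.parquet file instead.
df = pd.DataFrame({
    "id": ["media_2024-02-19_000123_fixed_chunk_0"],
    "text": ["HydrogenCo commissions a 140 MW electrolyzer..."],
    "metadata": [json.dumps({
        "doc_type": "media",
        "entities": {"org": ["HydrogenCo"]},
        "validation": {"passed": True},
    })],
})
print(flatten_metadata(df)[["doc_type", "passed", "orgs"]])
```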

Parameters

Subsample (ctsubsample)

  • --input / --outdir default to silver/unified/ and silver_subsample/ (auto-detects parquet/csv).
  • --by (default doc_type,lang) selects the strata columns.
  • Provide exactly one of --n or --frac; optional --min-per-stratum, --cap-per-stratum.
  • --seed (default 42) for reproducible RNG.
  • --dedupe removes duplicate doc_id before sampling; --write-csv writes .csv.gz alongside parquet.

Fixed-size chunking (ctchunk fixed)

  • --max-tokens (default 512) and --overlap (default 64) operate on tokenizer token counts. Tokenizer auto-load order: mosaicml/mpt-7b-storywriter → bert-base-uncased → whitespace fallback.
  • --prepend-title (default on) prefixes the chunk text with the original title.
  • --doc-types filters to media, patent, topic, etc. The input auto-resolves to the subsample output.
  • Output directories bucket by run date: cleantech_data/silver_subsample_chunk/fixed_size/<YYYY-MM-DD>/.
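The windowing itself is the standard sliding pattern: step by max_tokens − overlap so consecutive chunks share overlap tokens. A sketch under that assumption (not the package's own code):

```python
def fixed_windows(tokens: list, max_tokens: int = 512, overlap: int = 64) -> list:
    """Assumed fixed-size chunking: slide a max_tokens window over the
    token sequence, stepping by (max_tokens - overlap)."""
    if overlap >= max_tokens:
        # Mirrors the CLI behavior described below: overlap >= max tokens aborts.
        raise ValueError("overlap must be smaller than max-tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

# A ~1200-token article with the defaults yields 3 chunks.
print(len(fixed_windows(list(range(1200)), 512, 64)))  # → 3
```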

Enrichment (python -m ctenrichement.cli ...)

  • Commands: all, media, patent, topics (Typer CLI). Default dataset list = both fixed_size and semantic.
  • Providers: autodetect between gemini / openai via GEMINI_KEY / OPENAI_KEY; override with --provider, --model-id.
  • Summary options: --summary-provider, --summary-model-id, LEX_SUMMARY toggle, LEX_SUMMARY_SENTENCES (default 3).
  • Rate limits: --rps (8.0), --max-workers (12), LEX_MAX_RETRIES (5), LEX_MIN_EXTRACTIONS (1), LEX_MIN_SPAN_INTEGRITY (40).
  • Resume controls: --resume {none|failed|skip-completed}; caches stored under LEX_CACHE_DIR (default .lex_cache).
  • Optional OpenAlex topic join via python -m ctenrichement.cli all --with-topics (writes topics/ and topic_chunk_join/).
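The three --resume modes can be read as a filter over chunk ids against the progress state files listed earlier. A sketch of that assumed behavior (the helper `pending_chunks` is hypothetical; only the file names come from this page):

```python
import json
from pathlib import Path

def pending_chunks(chunk_ids: list, run_dir: Path,
                   resume: str = "skip-completed") -> list:
    """Assumed --resume semantics: 'none' redoes everything,
    'skip-completed' drops ids already in completed_ids.json, and
    'failed' retries only ids recorded in failed_chunks.json."""
    def load(name: str) -> set:
        path = run_dir / name
        return set(json.loads(path.read_text())) if path.exists() else set()

    if resume == "none":
        return list(chunk_ids)
    if resume == "skip-completed":
        done = load("completed_ids.json")
        return [c for c in chunk_ids if c not in done]
    if resume == "failed":
        failed = load("failed_chunks.json")
        return [c for c in chunk_ids if c in failed]
    raise ValueError(f"unknown resume mode: {resume}")
```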

Step-by-step Tasks

1. Subsample unified docs

  • Preconditions: ctunify completed; cleantech_data/silver/unified/unified_docs.parquet exists.
  • CLI:
    Bash
    ctsubsample --frac 0.25 --by doc_type,lang --seed 42
    
  • Python:
    Python
    from subsample.cli import run as subsample_run
    subsample_run(frac=0.25, by="doc_type,lang", seed=42)
    
  • Expected artifacts: silver_subsample/unified_docs_subsample.parquet, silver_subsample/manifest.json with metrics.tv_lang ≤ 0.1 for well-balanced samples.
  • Troubleshooting: ensure doc_type column present; set CT_DATA_ROOT if running outside repo layout.

2. Fixed-size chunking

  • Preconditions: Subsample parquet written; no GPU required (tokenizers run on CPU).
  • CLI:
    Bash
    ctchunk fixed --max-tokens 512 --overlap 64 --doc-types media,patent
    
  • Python:
    Python
    from ctchunk.cli import fixed as chunk_fixed
    chunk_fixed(max_tokens=512, overlap=64, doc_types="media,patent")
    
  • Expected artifacts: silver_subsample_chunk/fixed_size/<today>/chunks.parquet and chunks.jsonl (~3–6 chunks per medium-length article).
  • Troubleshooting: the tokenizer download requires transformers; install with pip install cleantech-pipeline[chunk]. If --overlap ≥ --max-tokens, Typer aborts.

3. Enrichment & Gold write

  • Preconditions: GEMINI_KEY or OPENAI_KEY; optional LANGEXTRACT installed (pip install langextract).
  • CLI (all doc types, both datasets):
    Bash
    python -m ctenrichement.cli all --dataset fixed_size \
      --provider auto --rps 8 --max-workers 12 --resume none --with-topics
    
    Run again with --dataset semantic if semantic chunks are present.
  • Python:
    Python
    from ctenrichement import cli as lex_cli
    lex_cli.all(dataset="fixed_size", provider="auto", with_topics=True)
    
  • Expected artifacts: gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet (or latest run date), JSONL extractions, interactive HTML viewer, updated completed_ids.json.
  • Troubleshooting:
  • Missing provider key → RuntimeError: No provider keys found; set GEMINI_KEY or OPENAI_KEY.
  • Out-of-memory: use --max-workers 4 and LEX_RPS=4.
  • Tokenizer mismatch: delete .cache/huggingface and rerun ctchunk.
  • Resume stuck: call python -m ctenrichement.cli progress --dataset fixed_size to inspect counts.

Validation & Quality Gates

  • Subsample metrics: inspect manifest.json for tv_doc_type, tv_lang, tv_date_month (expect ≤ 0.15). Re-run with different --seed if needed.
  • Chunking sanity:
    Python
    import pandas as pd
    df = pd.read_parquet("cleantech_data/silver_subsample_chunk/fixed_size/2025-09-14/chunks.parquet")
    assert df.id.is_unique
    assert df.text.str.len().ge(200).mean() > 0.8
    
  • Enrichment checks:
    Python
    import json, pandas as pd
    gold = pd.read_parquet("cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet")
    meta = gold.metadata.apply(json.loads)
    assert {"doc_id", "entities", "validation"} <= set(meta.iloc[0])
    assert meta.apply(lambda m: m["validation"]["passed"]).mean() > 0.9
    assert meta.apply(lambda m: len(m["entities"]["technology"])).mean() >= 1
    
  • Use python -m ctenrichement.cli progress --dataset fixed_size to verify completed == total_target.
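The tv_* metrics above are presumably total variation distances between the full corpus and the subsample over a categorical column: half the L1 distance between the two normalized value-count distributions. A sketch under that assumption:

```python
import pandas as pd

def tv_distance(full: pd.Series, sample: pd.Series) -> float:
    """Total variation distance between two categorical distributions,
    computed as half the L1 distance between normalized value counts.
    Assumed to match the manifest's tv_doc_type / tv_lang metrics."""
    p = full.value_counts(normalize=True)
    q = sample.value_counts(normalize=True)
    idx = p.index.union(q.index)
    return 0.5 * (p.reindex(idx, fill_value=0)
                  - q.reindex(idx, fill_value=0)).abs().sum()

# 80/20 corpus vs. a 90/10 sample: TV = 0.5 * (0.1 + 0.1) = 0.1,
# right at the tv_lang threshold quoted for well-balanced samples.
full = pd.Series(["media"] * 80 + ["patent"] * 20)
sample = pd.Series(["media"] * 18 + ["patent"] * 2)
print(round(tv_distance(full, sample), 3))  # → 0.1
```

A value of 0 means the sample reproduces the corpus proportions exactly; 1 means the distributions are disjoint, so re-running with a different --seed when a tv_* metric exceeds the 0.15 gate is a cheap fix.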

Reproducibility

  • RNG seeds: ctsubsample --seed and deterministic iteration order ensure repeatable sampling.
  • Chunks deterministic given identical tokenizer/model versions.
  • Enrichment caches (.lex_cache) persist extraction/summary responses; delete to force re-run.
  • Manifest & state files live next to each dataset bucket; commit manifest.json to capture parameters.
  • Logs emitted via Typer; redirect > logs/ctenrich_$(date +%F).log for long runs.

See also