
Gold

The Gold layer turns the unified Silver corpus into enriched, retrieval-ready chunks. It subsamples ctunify output, slices each document into fixed windows, and annotates every chunk with LLM-derived metadata.

(Pipeline diagram: Gold)

Inputs & Outputs

  • Inputs
      • cleantech_data/silver/unified/unified_docs.parquet — produced by ctunify.
      • Optional environment overrides via CT_DATA_ROOT / DATA_ROOT.
  • Intermediate outputs
      • Subsample: cleantech_data/silver_subsample/unified_docs_subsample.parquet + manifest.json (by, seed, allocation, metrics.tv_doc_type, metrics.tv_lang, ...).
      • Chunking: cleantech_data/silver_subsample_chunk/fixed_size/<YYYY-MM-DD>/chunks.parquet and chunks.jsonl.
  • Gold artifacts (per dataset = fixed_size | semantic)
      • cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/chunks_enriched.parquet
      • cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/grounded_extractions.jsonl
      • cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/visualization_interactive.html
      • Progress state: completed_ids.json, failed_chunks.json, failed_details.json, manifest.json.
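The CT_DATA_ROOT / DATA_ROOT overrides presumably resolve in order before falling back to the repo-local cleantech_data/ directory. A minimal sketch of that resolution (the helper name `data_root` is hypothetical, not part of the package):

```python
import os
from pathlib import Path

def data_root() -> Path:
    """Hypothetical sketch: resolve the pipeline data root by checking
    CT_DATA_ROOT, then DATA_ROOT, then the default cleantech_data/."""
    for var in ("CT_DATA_ROOT", "DATA_ROOT"):
        value = os.environ.get(var)
        if value:
            return Path(value)
    return Path("cleantech_data")

# All paths in this page hang off this root, e.g. the ctunify output:
print(data_root() / "silver" / "unified" / "unified_docs.parquet")
```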

Schema (chunks_enriched.parquet)

column    dtype          notes
id        string         chunk identifier (<doc_id>_<chunk_type>_chunk_<index>).
text      string         chunk body (title optionally prepended).
metadata  string (JSON)  JSON object preserving provenance and enrichment (see below).

Example metadata payload (abridged):

JSON
{
  "doc_id": "media_2024-02-19_000123",
  "doc_type": "media",
  "chunk_type": "fixed",
  "chunk_index": 0,
  "token_count": 486,
  "date": "2024-02-19T00:00:00",
  "lang": "en",
  "source": "cleantech_media",
  "entities": {
    "person": ["Jane Doe"],
    "org": ["HydrogenCo"],
    "location": ["Hamburg"],
    "technology": ["green hydrogen"],
    "technology_with_subfields": [
      {"term": "green hydrogen", "subfield_id": "6204", "subfield_name": "Hydrogen technologies", "score": 0.82}
    ]
  },
  "event_dates": [{"label": "2024", "year": 2024, "month": null, "day": null}],
  "numeric_facts": [{"name": "capacity_MW", "value": 140.0, "unit": "MW"}],
  "role_annotations": [{"person": "Jane Doe", "role": "CTO"}],
  "chunk_summary": "HydrogenCo commissions a 140 MW electrolyzer in Hamburg…",
  "chunk_summary_source": "llm",
  "chunk_summary_provider": "gemini",
  "validation": {
    "passed": true,
    "min_extractions": 1,
    "min_span_integrity_pct": 40.0,
    "span_integrity_pct": 78.6,
    "extraction_provider_used": "gemini",
    "extraction_model_used": "gemini-2.5-flash-lite"
  }
}
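Because metadata is stored as a JSON string per row, downstream code has to parse it before filtering or aggregating. A minimal sketch (not the pipeline's own code; `flatten_metadata` is a hypothetical helper) that flattens a few fields from the payload shown above into ordinary columns:

```python
import json
import pandas as pd

def flatten_metadata(gold: pd.DataFrame) -> pd.DataFrame:
    """Parse the JSON metadata column once and expose selected fields
    (doc_type, validation.passed, entities.org) as plain columns."""
    meta = gold["metadata"].apply(json.loads)
    out = gold.copy()
    out["doc_type"] = meta.apply(lambda m: m["doc_type"])
    out["passed"] = meta.apply(lambda m: m["validation"]["passed"])
    out["orgs"] = meta.apply(lambda m: m["entities"].get("org", []))
    return out

# Toy row shaped like the example payload above; in practice you would
# pd.read_parquet(...) the chunks_enriched.parquet file instead.
df = pd.DataFrame({
    "id": ["media_2024-02-19_000123_fixed_chunk_0"],
    "text": ["HydrogenCo commissions a 140 MW electrolyzer..."],
    "metadata": [json.dumps({
        "doc_type": "media",
        "entities": {"org": ["HydrogenCo"]},
        "validation": {"passed": True},
    })],
})
print(flatten_metadata(df)[["doc_type", "passed", "orgs"]])
```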

Parameters

Subsample (ctsubsample)

  • --input / --outdir default to silver/unified/ and silver_subsample/ (auto-detects parquet/csv).
  • --by (default doc_type,lang) selects the strata columns.
  • Provide exactly one of --n or --frac; optional --min-per-stratum, --cap-per-stratum.
  • --seed (default 42) for reproducible RNG.
  • --dedupe removes duplicate doc_id before sampling; --write-csv writes .csv.gz alongside parquet.

Fixed-size chunking (ctchunk fixed)

  • --max-tokens (default 512) and --overlap (default 64) operate on tokenizer token counts. Tokenizer auto-load order: mosaicml/mpt-7b-storywriter → bert-base-uncased → whitespace fallback.
  • --prepend-title (default on) prefixes the chunk text with the original title.
  • --doc-types filters to media, patent, topic, etc. The input auto-resolves to the subsample output.
  • Output directories bucket by run date: cleantech_data/silver_subsample_chunk/fixed_size/<YYYY-MM-DD>/.
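The windowing itself is the standard sliding pattern: step by max_tokens − overlap so consecutive chunks share overlap tokens. A sketch under that assumption (not the package's own code):

```python
def fixed_windows(tokens: list, max_tokens: int = 512, overlap: int = 64) -> list:
    """Assumed fixed-size chunking: slide a max_tokens window over the
    token sequence, stepping by (max_tokens - overlap)."""
    if overlap >= max_tokens:
        # Mirrors the CLI behavior described below: overlap >= max tokens aborts.
        raise ValueError("overlap must be smaller than max-tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

# A ~1200-token article with the defaults yields 3 chunks.
print(len(fixed_windows(list(range(1200)), 512, 64)))  # → 3
```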

Enrichment (python -m ctenrichement.cli ...)

  • Commands: all, media, patent, topics (Typer CLI). Default dataset list = both fixed_size and semantic.
  • Providers: autodetect between gemini / openai via GEMINI_KEY / OPENAI_KEY; override with --provider, --model-id.
  • Summary options: --summary-provider, --summary-model-id, LEX_SUMMARY toggle, LEX_SUMMARY_SENTENCES (default 3).
  • Rate limits: --rps (8.0), --max-workers (12), LEX_MAX_RETRIES (5), LEX_MIN_EXTRACTIONS (1), LEX_MIN_SPAN_INTEGRITY (40).
  • Resume controls: --resume {none|failed|skip-completed}; caches stored under LEX_CACHE_DIR (default .lex_cache).
  • Optional OpenAlex topic join via python -m ctenrichement.cli all --with-topics (writes topics/ and topic_chunk_join/).
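The three --resume modes can be read as a filter over chunk ids against the progress state files listed earlier. A sketch of that assumed behavior (the helper `pending_chunks` is hypothetical; only the file names come from this page):

```python
import json
from pathlib import Path

def pending_chunks(chunk_ids: list, run_dir: Path,
                   resume: str = "skip-completed") -> list:
    """Assumed --resume semantics: 'none' redoes everything,
    'skip-completed' drops ids already in completed_ids.json, and
    'failed' retries only ids recorded in failed_chunks.json."""
    def load(name: str) -> set:
        path = run_dir / name
        return set(json.loads(path.read_text())) if path.exists() else set()

    if resume == "none":
        return list(chunk_ids)
    if resume == "skip-completed":
        done = load("completed_ids.json")
        return [c for c in chunk_ids if c not in done]
    if resume == "failed":
        failed = load("failed_chunks.json")
        return [c for c in chunk_ids if c in failed]
    raise ValueError(f"unknown resume mode: {resume}")
```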

Step-by-step Tasks

1. Subsample unified docs

  • Preconditions: ctunify completed; cleantech_data/silver/unified/unified_docs.parquet exists.
  • CLI:
    Bash
    ctsubsample --frac 0.25 --by doc_type,lang --seed 42
    
  • Python:
    Python
    from subsample.cli import run as subsample_run
    subsample_run(frac=0.25, by="doc_type,lang", seed=42)
    
  • Expected artifacts: silver_subsample/unified_docs_subsample.parquet, silver_subsample/manifest.json with metrics.tv_lang ≤ 0.1 for well-balanced samples.
  • Troubleshooting: ensure doc_type column present; set CT_DATA_ROOT if running outside repo layout.

2. Fixed-size chunking

  • Preconditions: Subsample parquet written; no GPU required (tokenizers run on CPU).
  • CLI:
    Bash
    ctchunk fixed --max-tokens 512 --overlap 64 --doc-types media,patent
    
  • Python:
    Python
    from ctchunk.cli import fixed as chunk_fixed
    chunk_fixed(max_tokens=512, overlap=64, doc_types="media,patent")
    
  • Expected artifacts: silver_subsample_chunk/fixed_size/<today>/chunks.parquet and chunks.jsonl (~3–6 chunks per medium-length article).
  • Troubleshooting: the tokenizer download requires transformers; install with pip install cleantech-pipeline[chunk]. If --overlap ≥ --max-tokens, Typer aborts.

3. Enrichment & Gold write

  • Preconditions: GEMINI_KEY or OPENAI_KEY; optional LANGEXTRACT installed (pip install langextract).
  • CLI (all doc types, both datasets):
    Bash
    python -m ctenrichement.cli all --dataset fixed_size \
      --provider auto --rps 8 --max-workers 12 --resume none --with-topics
    
    Run again with --dataset semantic if semantic chunks are present.
  • Python:
    Python
    from ctenrichement import cli as lex_cli
    lex_cli.all(dataset="fixed_size", provider="auto", with_topics=True)
    
  • Expected artifacts: gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet (or latest run date), JSONL extractions, interactive HTML viewer, updated completed_ids.json.
  • Troubleshooting:
  • Missing provider key → RuntimeError: No provider keys found; set GEMINI_KEY or OPENAI_KEY.
  • Out-of-memory: use --max-workers 4 and LEX_RPS=4.
  • Tokenizer mismatch: delete .cache/huggingface and rerun ctchunk.
  • Resume stuck: call python -m ctenrichement.cli progress --dataset fixed_size to inspect counts.

Validation & Quality Gates

  • Subsample metrics: inspect manifest.json for tv_doc_type, tv_lang, tv_date_month (expect ≤ 0.15). Re-run with different --seed if needed.
  • Chunking sanity:
    Python
    import pandas as pd
    df = pd.read_parquet("cleantech_data/silver_subsample_chunk/fixed_size/2025-09-14/chunks.parquet")
    assert df.id.is_unique
    assert df.text.str.len().ge(200).mean() > 0.8
    
  • Enrichment checks:
    Python
    import json, pandas as pd
    gold = pd.read_parquet("cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet")
    meta = gold.metadata.apply(json.loads)
    assert {"doc_id", "entities", "validation"} <= set(meta.iloc[0])
    assert meta.apply(lambda m: m["validation"]["passed"]).mean() > 0.9
    assert meta.apply(lambda m: len(m["entities"]["technology"])).mean() >= 1
    
  • Use python -m ctenrichement.cli progress --dataset fixed_size to verify completed == total_target.
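The tv_* metrics above are presumably total variation distances between the full corpus and the subsample over a categorical column: half the L1 distance between the two normalized value-count distributions. A sketch under that assumption:

```python
import pandas as pd

def tv_distance(full: pd.Series, sample: pd.Series) -> float:
    """Total variation distance between two categorical distributions,
    computed as half the L1 distance between normalized value counts.
    Assumed to match the manifest's tv_doc_type / tv_lang metrics."""
    p = full.value_counts(normalize=True)
    q = sample.value_counts(normalize=True)
    idx = p.index.union(q.index)
    return 0.5 * (p.reindex(idx, fill_value=0)
                  - q.reindex(idx, fill_value=0)).abs().sum()

# 80/20 corpus vs. a 90/10 sample: TV = 0.5 * (0.1 + 0.1) = 0.1,
# right at the tv_lang threshold quoted for well-balanced samples.
full = pd.Series(["media"] * 80 + ["patent"] * 20)
sample = pd.Series(["media"] * 18 + ["patent"] * 2)
print(round(tv_distance(full, sample), 3))  # → 0.1
```

A value of 0 means the sample reproduces the corpus proportions exactly; 1 means the distributions are disjoint, so re-running with a different --seed when a tv_* metric exceeds the 0.15 gate is a cheap fix.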

Reproducibility

  • RNG seeds: ctsubsample --seed and deterministic iteration order ensure repeatable sampling.
  • Chunks deterministic given identical tokenizer/model versions.
  • Enrichment caches (.lex_cache) persist extraction/summary responses; delete to force re-run.
  • Manifest & state files live next to each dataset bucket; commit manifest.json to capture parameters.
  • Logs emitted via Typer; redirect > logs/ctenrich_$(date +%F).log for long runs.

See also