# Gold

The Gold layer turns the unified Silver corpus into enriched, retrieval-ready chunks. It subsamples `ctunify` output, slices each document into fixed windows, and annotates every chunk with LLM-derived metadata.
## Inputs & Outputs

- **Inputs**
    - `cleantech_data/silver/unified/unified_docs.parquet` — produced by `ctunify`.
    - Optional environment overrides via `CT_DATA_ROOT` / `DATA_ROOT`.
- **Intermediate outputs**
    - Subsample: `cleantech_data/silver_subsample/unified_docs_subsample.parquet` + `manifest.json` (`by`, `seed`, `allocation`, `metrics.tv_doc_type`, `metrics.tv_lang`, ...).
    - Chunking: `cleantech_data/silver_subsample_chunk/fixed_size/<YYYY-MM-DD>/chunks.parquet` and `chunks.jsonl`.
- **Gold artifacts** (per dataset = `fixed_size` | `semantic`)
    - `cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/chunks_enriched.parquet`
    - `cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/grounded_extractions.jsonl`
    - `cleantech_data/gold_subsample_chunk/<dataset>/<RUN_DATE>/visualization_interactive.html`
    - Progress state: `completed_ids.json`, `failed_chunks.json`, `failed_details.json`, `manifest.json`.
## Schema (`chunks_enriched.parquet`)

| column | dtype | notes |
|---|---|---|
| `id` | string | chunk identifier (`<doc_id>_<chunk_type>_chunk_<index>`). |
| `text` | string | chunk body (title optionally prepended). |
| `metadata` | string (JSON) | JSON object preserving provenance and enrichment (see below). |
Example metadata payload (abridged):
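A hypothetical abridged payload — field names are inferred from the schema table and the parameters documented on this page, not taken verbatim from the pipeline, so the actual keys may differ:

```json
{
  "doc_id": "patent_000123",
  "chunk_type": "fixed_size",
  "chunk_index": 0,
  "doc_type": "patent",
  "lang": "en",
  "provider": "gemini",
  "summary": "…",
  "extractions": [
    {"class": "…", "text": "…", "span": [0, 42]}
  ]
}
```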
## Parameters

### Subsample (`ctsubsample`)

- `--input` / `--outdir` default to `silver/unified/` and `silver_subsample/` (auto-detects parquet/csv).
- `--by` (default `doc_type,lang`) controls strata columns.
- Provide exactly one of `--n` or `--frac`; optional `--min-per-stratum`, `--cap-per-stratum`.
- `--seed` (default `42`) for reproducible RNG.
- `--dedupe` removes duplicate `doc_id` before sampling; `--write-csv` writes `.csv.gz` alongside parquet.
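The stratified allocation behind these flags can be sketched as follows — a minimal sketch assuming proportional allocation with a per-stratum floor; the actual `ctsubsample` implementation may allocate differently:

```python
import random
from collections import defaultdict

def stratified_sample(rows, by, n, seed=42, min_per_stratum=1):
    """Proportionally allocate n samples across strata defined by the `by` columns."""
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in by)].append(row)
    total = len(rows)
    rng = random.Random(seed)  # fixed seed => reproducible sampling
    out = []
    for key, members in sorted(strata.items()):  # sorted => deterministic order
        # proportional share, floored at min_per_stratum, capped at stratum size
        k = max(min_per_stratum, round(n * len(members) / total))
        k = min(k, len(members))
        out.extend(rng.sample(members, k))
    return out
```

With `--by doc_type,lang`, each `(doc_type, lang)` pair forms one stratum; a `--cap-per-stratum` would simply lower the `min(k, ...)` bound further.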
### Fixed-size chunking (`ctchunk fixed`)

- `--max-tokens` (default `512`) and `--overlap` (default `64`) operate on tokenizer token counts. Tokenizer auto-load order: `mosaicml/mpt-7b-storywriter` → `bert-base-uncased` → whitespace fallback.
- `--prepend-title` (default on) prefixes the chunk text with the original title.
- `--doc-types` filters to `media`, `patent`, `topic`, etc. Input auto-resolves to the subsample output.
- Output directories bucket by run date: `cleantech_data/silver_subsample_chunk/fixed_size/<YYYY-MM-DD>/`.
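The fixed-window slicing can be sketched like this, using the whitespace fallback for illustration (real runs prefer the Hugging Face tokenizers listed above, so token boundaries will differ):

```python
def fixed_size_chunks(text, max_tokens=512, overlap=64, title=None, prepend_title=True):
    """Slide a window of max_tokens over the token stream, stepping by max_tokens - overlap."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max-tokens")  # CLI aborts on this
    tokens = text.split()  # whitespace fallback; real runs use a HF tokenizer
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        body = " ".join(tokens[start:start + max_tokens])
        chunks.append(f"{title}\n{body}" if prepend_title and title else body)
        if start + max_tokens >= len(tokens):
            break  # last window reached the end; avoid a trailing sliver
    return chunks
```

Consecutive chunks share their last/first `overlap` tokens, which is what keeps sentence fragments retrievable at window boundaries.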
### Enrichment (`python -m ctenrichement.cli ...`)

- Commands: `all`, `media`, `patent`, `topics` (Typer CLI). Default dataset list = both `fixed_size` and `semantic`.
- Providers: auto-detects between `gemini` / `openai` via `GEMINI_KEY` / `OPENAI_KEY`; override with `--provider`, `--model-id`.
- Summary options: `--summary-provider`, `--summary-model-id`, `LEX_SUMMARY` toggle, `LEX_SUMMARY_SENTENCES` (default `3`).
- Rate limits: `--rps` (`8.0`), `--max-workers` (`12`), `LEX_MAX_RETRIES` (`5`), `LEX_MIN_EXTRACTIONS` (`1`), `LEX_MIN_SPAN_INTEGRITY` (`40`).
- Resume controls: `--resume {none|failed|skip-completed}`; caches stored under `LEX_CACHE_DIR` (default `.lex_cache`).
- Optional OpenAlex topic join via `python -m ctenrichement.cli all --with-topics` (writes `topics/` and `topic_chunk_join/`).
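The three `--resume` modes can be approximated by filtering against the progress-state files; a sketch assuming `completed_ids.json` and `failed_chunks.json` hold flat JSON lists of chunk ids (the real files may carry more structure):

```python
import json
from pathlib import Path

def select_pending(chunk_ids, state_dir, resume="skip-completed"):
    """Return the chunk ids still to process under the given --resume mode."""
    def load(name):
        p = Path(state_dir) / name
        return set(json.loads(p.read_text())) if p.exists() else set()

    completed = load("completed_ids.json")
    failed = load("failed_chunks.json")
    if resume == "none":
        return list(chunk_ids)                          # reprocess everything
    if resume == "failed":
        return [c for c in chunk_ids if c in failed]    # retry only failures
    return [c for c in chunk_ids if c not in completed]  # skip-completed
```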
## Step-by-step Tasks

### 1. Subsample unified docs

- ⬜ Preconditions: `ctunify` completed; `cleantech_data/silver/unified/unified_docs.parquet` exists.
- ⬜ CLI: run `ctsubsample` with the flags listed above (Bash).
- ⬜ Python: programmatic equivalent of the CLI call.
- ✅ Expected artifacts: `silver_subsample/unified_docs_subsample.parquet`, `silver_subsample/manifest.json` with `metrics.tv_lang` ≤ 0.1 for well-balanced samples.
- ⬜ Troubleshooting: ensure the `doc_type` column is present; set `CT_DATA_ROOT` if running outside the repo layout.
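The balance expectation above is easy to verify programmatically; a small check assuming the manifest stores the total-variation metrics under a `metrics` key, as the Inputs & Outputs section shows:

```python
import json
from pathlib import Path

def check_balance(manifest_path, threshold=0.1):
    """Return the tv_* metrics that exceed the balance threshold (empty dict = balanced)."""
    manifest = json.loads(Path(manifest_path).read_text())
    metrics = manifest.get("metrics", {})
    return {k: v for k, v in metrics.items()
            if k.startswith("tv_") and v > threshold}
```

A non-empty result suggests re-running `ctsubsample` with a different `--seed` or a larger `--min-per-stratum`.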
### 2. Fixed-size chunking

- ⬜ Preconditions: subsample parquet written; no GPU required (tokenizers run on CPU).
- ⬜ CLI: run `ctchunk fixed` with the flags listed above (Bash).
- ⬜ Python: programmatic equivalent of the CLI call.
- ✅ Expected artifacts: `silver_subsample_chunk/fixed_size/<today>/chunks.parquet` and `chunks.jsonl` (~3–6 chunks per medium-length article).
- ⬜ Troubleshooting: the tokenizer download requires `transformers`; install with `pip install cleantech-pipeline[chunk]`. If overlap ≥ max tokens, Typer aborts.
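The written chunks can be spot-checked against the id scheme from the schema table; this sketch assumes `chunk_type` takes the dataset names (`fixed_size` | `semantic`) documented above:

```python
import re

# id scheme from the schema table: <doc_id>_<chunk_type>_chunk_<index>
CHUNK_ID = re.compile(
    r"^(?P<doc_id>.+)_(?P<chunk_type>fixed_size|semantic)_chunk_(?P<index>\d+)$"
)

def parse_chunk_id(chunk_id):
    """Split a chunk id into (doc_id, chunk_type, index); raise on malformed ids."""
    m = CHUNK_ID.match(chunk_id)
    if not m:
        raise ValueError(f"malformed chunk id: {chunk_id}")
    return m.group("doc_id"), m.group("chunk_type"), int(m.group("index"))
```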
### 3. Enrichment & Gold write

- ⬜ Preconditions: `GEMINI_KEY` or `OPENAI_KEY` set; optionally `langextract` installed (`pip install langextract`).
- ⬜ CLI (all doc types, both datasets): `python -m ctenrichement.cli all` (Bash). Run again with `--dataset semantic` if semantic chunks are present.
- ⬜ Python: programmatic equivalent of the CLI call.
- ✅ Expected artifacts: `gold_subsample_chunk/fixed_size/2025-09-14/chunks_enriched.parquet` (or the latest run date), JSONL extractions, the interactive HTML viewer, and an updated `completed_ids.json`.
- ⬜ Troubleshooting:
    - Missing provider key → `RuntimeError: No provider keys found`; set `GEMINI_KEY` or `OPENAI_KEY`.
    - Out of memory: use `--max-workers 4` and `LEX_RPS=4`.
    - Tokenizer mismatch: delete `.cache/huggingface` and rerun `ctchunk`.
    - Resume stuck: call `python -m ctenrichement.cli progress --dataset fixed_size` to inspect counts.
## Validation & Quality Gates

- Subsample metrics: inspect `manifest.json` for `tv_doc_type`, `tv_lang`, `tv_date_month` (expect ≤ 0.15). Re-run with a different `--seed` if needed.
- Chunking sanity: spot-check chunk counts and token lengths in `chunks.parquet`.
- Enrichment checks: use `python -m ctenrichement.cli progress --dataset fixed_size` to verify `completed == total_target`.
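Assuming the `tv_*` metrics are standard total-variation distances between the subsample's and the full corpus's categorical distributions (a pipeline-specific detail not stated explicitly here), they can be recomputed for a manual gate check:

```python
from collections import Counter

def tv_distance(sample_values, population_values):
    """Total variation distance: half the L1 distance between two empirical distributions."""
    p = Counter(sample_values)
    q = Counter(population_values)
    n_p, n_q = sum(p.values()), sum(q.values())
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)
```

A value of 0.0 means identical category proportions; 1.0 means fully disjoint support, so the ≤ 0.15 gate above tolerates only mild drift per stratum column.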
## Reproducibility

- RNG seeds: `ctsubsample --seed` and a deterministic iteration order ensure repeatable sampling.
- Chunks are deterministic given identical tokenizer/model versions.
- Enrichment caches (`.lex_cache`) persist extraction/summary responses; delete them to force a re-run.
- Manifest & state files live next to each dataset bucket; commit `manifest.json` to capture run parameters.
- Logs are emitted via Typer; redirect with `> logs/ctenrich_$(date +%F).log` for long runs.
## Related notebooks
- 03_subsample_unified — stratified sampling analysis.
- 04_subsample_unified_chunked — chunk size diagnostics.
- 05_metadata_extraction — enrichment QA, cache strategies, and viewer screenshots.
## See also

- Bronze and Silver lineage.
- Architecture for end-to-end context.
- RAG overview for downstream consumers.
- CLI references: `cli.md` and `reference.md`.