Fixed‑Size vs. Semantic Chunking — Cleantech Subsample (Full Report)

This notebook compares chunk statistics between fixed‑size and semantic chunking for a Cleantech subsample. It produces separate plots per dataset (Fixed vs. Semantic) and a further breakdown by document type (topic, media, patent), using explicit metadata when available or inferring the type from doc_id prefixes like topic:..., media:..., patent:....

Outputs

  • Dataset‑level tables and figures
  • CSV exports

Paths & outputs

Inputs (adjust to your environment if needed):

Python
fixed_path    = Path('../cleantech_data/silver_subsample_chunk/fixed_size/2025-08-29/chunks.parquet')
semantic_path = Path('../cleantech_data/silver_subsample_chunk/semantic/2025-08-29/chunks.parquet')

Outputs (this notebook writes here):

  • Figures: reports/chunking/figs/<dataset>/<doc_type>/...
  • Tables: reports/chunking/<dataset>/<doc_type>/... (plus dataset‑level under reports/chunking/<dataset>/)

If the input Parquet files are not found, a small synthetic sample is used so you can still run the notebook end‑to‑end.

Python
from __future__ import annotations

import json
from typing import Dict, Iterable

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from pathlib import Path
from datetime import datetime

# Render crisper figures; DO NOT set styles or colors to keep defaults.
plt.rcParams['figure.dpi'] = 130

# --- Latest input parquet paths
fixed_root = Path('../cleantech_data/silver_subsample_chunk/fixed_size')
semantic_root = Path('../cleantech_data/silver_subsample_chunk/semantic')

def resolve_latest_chunk_path(root: Path, filename: str = 'chunks.parquet', date_fmt: str = '%Y-%m-%d') -> Path:
    """Return <root>/<YYYY-MM-DD>/<filename> for the most recent date that has the file.
    If none found, return <root>/<filename> so your loader can fall back to synthetic data.
    """
    candidates = []
    if root.exists():
        for child in root.iterdir():
            if child.is_dir():
                try:
                    dt = datetime.strptime(child.name, date_fmt).date()
                except ValueError:
                    continue
                fpath = child / filename
                if fpath.exists():
                    candidates.append((dt, fpath))
    if candidates:
        dt, latest = max(candidates, key=lambda t: t[0])
        print(f"Resolved latest under {root}: {dt} -> {latest}")
        return latest
    print(f"WARNING: No dated subfolders containing {filename} under {root}. Will attempt fallback/synthetic.")
    return root / filename  # likely non-existent; your loader can detect and use synthetic fallback

# use newest fixed & semantic chunk files
fixed_path    = resolve_latest_chunk_path(fixed_root)
semantic_path = resolve_latest_chunk_path(semantic_root)


# --- Output roots (as in your snippet) ---
CHUNKING_DIR = Path('reports/chunking')
PLOTS_DIR = CHUNKING_DIR / 'figs'     # figures
TABLES_DIR = CHUNKING_DIR             # tables; dataset/type subfolders are created automatically

PLOTS_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)
Text Only
Resolved latest under ..\cleantech_data\silver_subsample_chunk\fixed_size: 2025-09-03 -> ..\cleantech_data\silver_subsample_chunk\fixed_size\2025-09-03\chunks.parquet
Resolved latest under ..\cleantech_data\silver_subsample_chunk\semantic: 2025-09-03 -> ..\cleantech_data\silver_subsample_chunk\semantic\2025-09-03\chunks.parquet

Helper functions

  • Loading with Parquet → synthetic fallback
  • Metadata expansion (handles JSON strings or dicts)
  • Doc type inference (topic, media, patent) from either metadata or doc_id prefix
  • Stats utilities (global and per‑doc)
  • Plotting utilities (histogram, ECDF, boxplot), each plot in its own figure
Python
def _read_parquet(path: Path) -> pd.DataFrame | None:
    try:
        if path.exists():
            return pd.read_parquet(path)
    except Exception as e:
        print(f"WARNING: Failed to read {path} as Parquet: {e}")
    return None


def load_parquet_or_sample(path: Path, kind: str) -> pd.DataFrame:
    """Load Parquet file if available; otherwise provide small synthetic sample."""
    df = _read_parquet(path)
    if df is not None:
        print(f"Loaded {kind} dataset from {path} with {len(df):,} rows.")
        return df

    print(f"WARNING: {kind} path {path} not found. Using synthetic sample data for demonstration.")
    if kind == 'fixed':
        data = [
            {"text": "Fixed sample chunk A", "metadata": json.dumps({"doc_id": "topic_doc1", "token_count": 100})},
            {"text": "Fixed sample chunk B", "metadata": json.dumps({"doc_id": "media_doc1", "token_count": 110})},
            {"text": "Fixed sample chunk C", "metadata": json.dumps({"doc_id": "patent_doc2", "token_count": 90})},
        ]
    else:
        data = [
            {"text": "Semantic sample chunk A", "metadata": json.dumps({"doc_id": "topic:doc1", "token_count": 80})},
            {"text": "Semantic sample chunk B", "metadata": json.dumps({"doc_id": "media:doc2", "token_count": 95})},
        ]
    return pd.DataFrame(data)


def expand_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Expand 'metadata' JSON; enforce doc_id and token_count; keep text/doc_type if present."""
    if 'metadata' in df.columns:
        meta = df['metadata'].apply(lambda m: m if isinstance(m, dict) else json.loads(m))
        meta_df = pd.json_normalize(meta)
        df = pd.concat([df.drop(columns=['metadata'], errors='ignore'), meta_df], axis=1)

    if 'doc_id' not in df.columns:
        df['doc_id'] = 'unknown'
    df['doc_id'] = df['doc_id'].astype(str)

    token_col = next((c for c in ['token_count', 'tokens', 'n_tokens'] if c in df.columns), None)
    if token_col is None:
        raise ValueError("No token count field found in metadata. Expected one of: token_count, tokens, n_tokens.")

    df['token_count'] = pd.to_numeric(df[token_col], errors='coerce')
    before = len(df)
    df = df.dropna(subset=['token_count']).copy()
    df['token_count'] = df['token_count'].astype(int)
    dropped = before - len(df)
    if dropped:
        print(f"Dropped {dropped} rows without valid token_count.")

    keep = [c for c in ['doc_id', 'text', 'token_count', 'doc_type'] if c in df.columns]
    return df[keep]


def add_doc_type(df: pd.DataFrame) -> pd.DataFrame:
    """Ensure 'doc_type' exists. Prefer explicit metadata, otherwise infer from doc_id prefix."""
    out = df.copy()
    # Normalize provided doc_type if any
    if 'doc_type' in out.columns:
        dt = out['doc_type'].astype(str).str.lower().str.strip()
        dt = dt.replace({'topics':'topic', 'patents':'patent', 'medias':'media'})
    else:
        dt = pd.Series(index=out.index, dtype=object)

    base = out['doc_id'].astype(str).str.lower().str.strip()
    inferred = pd.Series('unknown', index=out.index)
    for t in ['topic', 'media', 'patent']:
        # Match both separators seen in doc_ids, e.g. 'topic:doc1' and 'topic_T10001'
        inferred[base.str.startswith((f"{t}:", f"{t}_"))] = t

    # After astype(str), missing values become the string 'nan', so check for those too
    dt_final = dt.mask(dt.isin(['', 'nan', 'none']), inferred) if 'doc_type' in out.columns else inferred
    out['doc_type'] = dt_final
    return out


def describe_series(x: Iterable[float | int]) -> Dict[str, float]:
    arr = np.asarray(list(x), dtype=float)
    arr = arr[~np.isnan(arr)]
    if arr.size == 0:
        return {k: 0.0 for k in ['count','mean','median','std','min','max','p05','p25','p75','p95']}
    return {
        'count': float(arr.size),
        'mean': float(arr.mean()),
        'median': float(np.median(arr)),
        'std': float(arr.std(ddof=0)),
        'min': float(arr.min()),
        'max': float(arr.max()),
        'p05': float(np.percentile(arr, 5)),
        'p25': float(np.percentile(arr, 25)),
        'p75': float(np.percentile(arr, 75)),
        'p95': float(np.percentile(arr, 95)),
    }


def per_doc_stats(df: pd.DataFrame) -> pd.DataFrame:
    g = df.groupby('doc_id')['token_count']
    out = g.agg(
        chunk_count='count',
        mean='mean',
        median='median',
        std=lambda s: float(np.std(s.to_numpy(dtype=float), ddof=0)),
        min='min',
        max='max'
    ).reset_index().sort_values('chunk_count', ascending=False)
    return out


def fd_bins(arr: np.ndarray) -> int:
    x = np.asarray(arr, dtype=float)
    x = x[~np.isnan(x)]
    n = x.size
    if n <= 1:
        return max(1, n)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    if iqr == 0:
        return max(5, int(np.sqrt(n)))
    h = 2 * iqr / (n ** (1/3))
    if h <= 0:
        return max(5, int(np.sqrt(n)))
    bins = int(np.ceil((x.max() - x.min()) / h))
    return max(5, min(bins, 100))


def plot_hist(series: pd.Series, title: str, fname: Path, xlabel: str = "Tokens per chunk", ylabel: str = "Frequency"):
    vals = series.dropna().to_numpy()
    if len(vals) == 0:
        return
    fname.parent.mkdir(parents=True, exist_ok=True)
    plt.figure()
    plt.hist(vals, bins=fd_bins(vals))
    plt.title(title)
    plt.xlabel(xlabel)   # supports "Chunks per document"
    plt.ylabel(ylabel)   # can be customized if desired
    plt.tight_layout()
    plt.savefig(fname, bbox_inches='tight')
    plt.close()


def plot_ecdf(series: pd.Series, title: str, fname: Path):
    vals = np.sort(series.dropna().to_numpy(dtype=float))
    if vals.size == 0:
        return
    fname.parent.mkdir(parents=True, exist_ok=True)
    y = np.arange(1, vals.size + 1) / vals.size
    plt.figure()
    plt.step(vals, y, where='post')
    plt.title(title)
    plt.xlabel('Tokens per chunk')
    plt.ylabel('ECDF')
    plt.tight_layout()
    plt.savefig(fname, bbox_inches='tight')
    plt.close()


def plot_box(series: pd.Series, title: str, fname: Path):
    vals = series.dropna().to_numpy(dtype=float)
    if len(vals) == 0:
        return
    fname.parent.mkdir(parents=True, exist_ok=True)
    plt.figure()
    plt.boxplot(vals, vert=True, showfliers=True)
    plt.title(title)
    plt.ylabel('Tokens per chunk')
    plt.tight_layout()
    plt.savefig(fname, bbox_inches='tight')
    plt.close()


def dataset_slug(label: str) -> str:
    return label.lower().replace(' ', '_')

Analysis routines

Two layers of analysis:

  1. Dataset‑level (Fixed Size vs. Semantic): global stats & per‑doc stats, plus token distribution plots.
  2. Doc‑type breakdown within each dataset: global & per‑doc stats per topic, media, patent (and any others found).

Python
def analyze_one_dataset(df: pd.DataFrame, label: str):
    """Dataset-level stats & plots (no mixing with other dataset)."""
    df_expanded = expand_metadata(df)
    df_expanded = add_doc_type(df_expanded)  # ensure doc_type present for downstream

    display(df_expanded.head())

    # Global summary (dataset level)
    gstats = describe_series(df_expanded['token_count'])
    global_df = pd.DataFrame([gstats])
    global_df.insert(0, 'dataset', label)
    display(global_df)

    # Per-document summary (dataset level)
    perdoc = per_doc_stats(df_expanded)
    display(perdoc.head())

    # Save tables (dataset-level)
    ds_slug = dataset_slug(label)
    (TABLES_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
    global_df.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_global_summary.csv", index=False)
    perdoc.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_per_doc_summary.csv", index=False)

    # Plots (dataset-level)
    (PLOTS_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
    plot_hist(df_expanded['token_count'], f"{label}: Token Count Distribution", PLOTS_DIR / ds_slug / f"{ds_slug}_token_hist.png")
    plot_ecdf(df_expanded['token_count'], f"{label}: Token Count ECDF", PLOTS_DIR / ds_slug / f"{ds_slug}_token_ecdf.png")
    plot_box(df_expanded['token_count'], f"{label}: Token Count Boxplot", PLOTS_DIR / ds_slug / f"{ds_slug}_token_boxplot.png")

    # Chunks-per-document distribution (dataset-level) — CORRECT x-axis label
    plot_hist(
        perdoc['chunk_count'],
        f"{label}: Chunks per Document",
        PLOTS_DIR / ds_slug / f"{ds_slug}_chunks_per_doc_hist.png",
        xlabel="Chunks per document",
        ylabel="Number of documents",
    )

    return df_expanded, perdoc, global_df


def analyze_by_doc_type(df_expanded: pd.DataFrame, label: str):
    """Break down one dataset by doc_type (topic/media/patent/unknown)."""
    ds_slug = dataset_slug(label)

    # Ensure doc_type column exists
    if 'doc_type' not in df_expanded.columns:
        df_expanded = add_doc_type(df_expanded)

    order = ['topic', 'media', 'patent']
    present = list(pd.unique(df_expanded['doc_type'].dropna().astype(str)))
    ordered = [t for t in order if t in present] + [t for t in present if t not in order]

    global_rows = []
    perdoc_rows = []

    for dt in ordered:
        df_dt = df_expanded[df_expanded['doc_type'] == dt]
        if df_dt.empty:
            continue

        # Global per-type stats
        gstats = describe_series(df_dt['token_count'])
        gdf = pd.DataFrame([gstats])
        gdf.insert(0, 'doc_type', dt)
        gdf.insert(0, 'dataset', label)
        global_rows.append(gdf)

        # Per-doc stats within this type
        perdoc = per_doc_stats(df_dt)
        perdoc.insert(0, 'doc_type', dt)
        perdoc.insert(0, 'dataset', label)
        perdoc_rows.append(perdoc)

        # Ensure subdirs
        tdir = TABLES_DIR / ds_slug / dt
        pdir = PLOTS_DIR / ds_slug / dt
        tdir.mkdir(parents=True, exist_ok=True)
        pdir.mkdir(parents=True, exist_ok=True)

        # Save tables
        gdf.to_csv(tdir / f"{ds_slug}_{dt}_global_summary.csv", index=False)
        perdoc.to_csv(tdir / f"{ds_slug}_{dt}_per_doc_summary.csv", index=False)

        # Plots for this type
        plot_hist(df_dt['token_count'], f"{label} [{dt}] Token Count Distribution", pdir / f"{ds_slug}_{dt}_token_hist.png")
        plot_ecdf(df_dt['token_count'], f"{label} [{dt}] Token Count ECDF", pdir / f"{ds_slug}_{dt}_token_ecdf.png")
        plot_box(df_dt['token_count'], f"{label} [{dt}] Token Count Boxplot", pdir / f"{ds_slug}_{dt}_token_boxplot.png")

        # Chunks-per-doc within this type — CORRECT x-axis label
        plot_hist(
            perdoc['chunk_count'],
            f"{label} [{dt}] Chunks per Document",
            pdir / f"{ds_slug}_{dt}_chunks_per_doc_hist.png",
            xlabel="Chunks per document",
            ylabel="Number of documents",
        )

    global_by_type = pd.concat(global_rows, ignore_index=True) if global_rows else pd.DataFrame()
    perdoc_by_type = pd.concat(perdoc_rows, ignore_index=True) if perdoc_rows else pd.DataFrame()

    # Save combined (per dataset)
    if not global_by_type.empty:
        (TABLES_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
        global_by_type.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_global_by_doc_type.csv", index=False)
    if not perdoc_by_type.empty:
        (TABLES_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
        perdoc_by_type.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_per_doc_with_doc_type.csv", index=False)

    return perdoc_by_type, global_by_type

Run the analysis

This cell loads the datasets (or synthetic fallbacks), runs the dataset‑level analyses, then the doc‑type breakdowns, and finally writes comparison tables.

Python
# Load (with synthetic fallback)
df_fixed_raw = load_parquet_or_sample(fixed_path, 'fixed')
df_sem_raw   = load_parquet_or_sample(semantic_path, 'semantic')

# Dataset-level analyses
display("\n### FIXED DATASET ###")
fixed_expanded, fixed_perdoc, fixed_global = analyze_one_dataset(df_fixed_raw, label='Fixed Size')

display("\n### SEMANTIC DATASET ###")
sem_expanded, sem_perdoc, sem_global = analyze_one_dataset(df_sem_raw, label='Semantic')

# Doc-type breakdowns
display("\n### FIXED by doc_type ###")
fixed_perdoc_by_type, fixed_global_by_type = analyze_by_doc_type(fixed_expanded, label='Fixed Size')
display(fixed_global_by_type.head())

display("\n### SEMANTIC by doc_type ###")
sem_perdoc_by_type, sem_global_by_type = analyze_by_doc_type(sem_expanded, label='Semantic')
display(sem_global_by_type.head())

# Side-by-side comparisons
global_compare = pd.concat([fixed_global, sem_global], ignore_index=True)
display(global_compare)

global_by_doc_type_compare = pd.concat([fixed_global_by_type, sem_global_by_type], ignore_index=True)
display(global_by_doc_type_compare)

# Save comparison tables
global_compare.to_csv(TABLES_DIR / "global_comparison_fixed_vs_semantic.csv", index=False)
global_by_doc_type_compare.to_csv(TABLES_DIR / "global_by_doc_type_comparison.csv", index=False)

print("\nArtifacts written to:\n")
print(f" - Tables: {TABLES_DIR.resolve()} ")
print(f" - Plots:  {PLOTS_DIR.resolve()} \n")
Text Only
Loaded fixed dataset from ..\cleantech_data\silver_subsample_chunk\fixed_size\2025-09-03\chunks.parquet with 8,538 rows.
Loaded semantic dataset from ..\cleantech_data\silver_subsample_chunk\semantic\2025-09-03\chunks.parquet with 23,482 rows.



'\n### FIXED DATASET ###'
doc_id text token_count doc_type
0 topic_T10001 Geological and Geochemical Analysis: This clus... 72 topic
1 topic_T10002 Advanced Chemical Physics Studies: This cluste... 74 topic
2 topic_T10003 Innovation and Knowledge Management: This clus... 66 topic
3 topic_T10004 Soil Carbon and Nitrogen Dynamics: This cluste... 87 topic
4 topic_T10005 Ecology and Vegetation Dynamics Studies: This ... 68 topic
dataset count mean median std min max p05 p25 p75 p95
0 Fixed Size 8538.0 209.343172 95.0 175.160279 37.0 559.0 59.0 73.0 293.0 529.0
doc_id chunk_count mean median std min max
724 media_bafbbb3bb6c1273e0d4c07057aae7a535a4340d5 9 477.333333 527.0 140.478547 80 527
815 media_d637cac2b4315a7c8df0722c0e9721955937d8a8 8 511.125000 524.0 34.064048 421 524
260 media_40cb388f766424705c2f6722a253b59641df20fd 7 498.428571 526.0 67.535931 333 526
516 media_83c3952e19cd0f435869f4ded51d08487b696a96 7 481.000000 529.0 117.575508 193 529
690 media_b422f60e44394719c787a827d3749f4e0976b7c8 7 505.428571 530.0 58.563414 362 530
Text Only
'\n### SEMANTIC DATASET ###'
doc_id text token_count doc_type
0 topic_T10001 Geological and Geochemical Analysis: This clus... 74 topic
1 topic_T10002 Advanced Chemical Physics Studies: This cluste... 76 topic
2 topic_T10003 Innovation and Knowledge Management: This clus... 68 topic
3 topic_T10004 Soil Carbon and Nitrogen Dynamics: This cluste... 62 topic
4 topic_T10004 Soil Carbon and Nitrogen Dynamics: The researc... 35 topic
dataset count mean median std min max p05 p25 p75 p95
0 Semantic 23482.0 82.775573 77.0 36.710339 9.0 681.0 34.0 67.0 90.0 149.0
doc_id chunk_count mean median std min max
815 media_d637cac2b4315a7c8df0722c0e9721955937d8a8 58 75.310345 71.5 10.896388 62 100
724 media_bafbbb3bb6c1273e0d4c07057aae7a535a4340d5 55 82.836364 83.0 13.220870 33 111
668 media_adc00155be930b42ad9abdff11a1db7b22890545 49 77.306122 75.0 12.201044 62 133
258 media_4087c68626961181bb87b47d4594be44992ee389 48 81.375000 80.0 10.080148 65 104
516 media_83c3952e19cd0f435869f4ded51d08487b696a96 47 80.510638 78.0 10.846849 67 110
Text Only
'\n### FIXED by doc_type ###'
dataset doc_type count mean median std min max p05 p25 p75 p95
0 Fixed Size topic 4516.0 75.001550 74.0 12.719124 37.0 145.0 56.0 66.0 83.0 97.0
1 Fixed Size media 2451.0 441.106079 522.0 131.487840 78.0 559.0 143.5 361.0 527.0 534.0
2 Fixed Size patent 1571.0 233.936346 240.0 54.945130 37.0 536.0 138.0 207.0 266.0 302.0
Text Only
'\n### SEMANTIC by doc_type ###'
dataset doc_type count mean median std min max p05 p25 p75 p95
0 Semantic topic 5835.0 61.108312 66.0 18.636447 16.0 107.0 26.0 55.0 74.0 85.0
1 Semantic media 14728.0 81.419337 79.0 19.820953 11.0 681.0 60.0 70.0 90.0 112.0
2 Semantic patent 2919.0 132.930798 117.0 68.566729 9.0 444.0 43.0 78.0 184.0 257.1
dataset count mean median std min max p05 p25 p75 p95
0 Fixed Size 8538.0 209.343172 95.0 175.160279 37.0 559.0 59.0 73.0 293.0 529.0
1 Semantic 23482.0 82.775573 77.0 36.710339 9.0 681.0 34.0 67.0 90.0 149.0
dataset doc_type count mean median std min max p05 p25 p75 p95
0 Fixed Size topic 4516.0 75.001550 74.0 12.719124 37.0 145.0 56.0 66.0 83.0 97.0
1 Fixed Size media 2451.0 441.106079 522.0 131.487840 78.0 559.0 143.5 361.0 527.0 534.0
2 Fixed Size patent 1571.0 233.936346 240.0 54.945130 37.0 536.0 138.0 207.0 266.0 302.0
3 Semantic topic 5835.0 61.108312 66.0 18.636447 16.0 107.0 26.0 55.0 74.0 85.0
4 Semantic media 14728.0 81.419337 79.0 19.820953 11.0 681.0 60.0 70.0 90.0 112.0
5 Semantic patent 2919.0 132.930798 117.0 68.566729 9.0 444.0 43.0 78.0 184.0 257.1
Text Only
Artifacts written to:

 - Tables: C:\Users\gerbe\PycharmProjects\MT\notebooks\reports\chunking 
 - Plots:  C:\Users\gerbe\PycharmProjects\MT\notebooks\reports\chunking\figs

Interpreting the outputs

  • Token Count Distribution (histogram/ECDF/boxplot): Characterizes chunk sizes. Look for long tails or multi‑modal patterns signaling uneven chunking.
  • Chunks per Document (histogram): How many chunks each document produced. Expect a tighter spread for fixed‑size chunking; semantic chunking may vary with content.
  • Global vs Per‑Doc Tables: Use the global tables to compare central tendency and dispersion across datasets or doc types; use per‑doc tables to diagnose specific documents (outliers, heavy chunk counts, etc.).
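One quick way to act on the global tables is a simple tail check: if p95 is several times the median, the token distribution is long‑tailed and worth a closer look. A minimal sketch (synthetic numbers; the 3× rule of thumb is an illustrative assumption, not a setting from this notebook):

```python
import numpy as np

def tail_ratio(token_counts) -> float:
    """Ratio of the 95th percentile to the median; >~3 suggests a long tail."""
    arr = np.asarray(token_counts, dtype=float)
    return float(np.percentile(arr, 95) / np.median(arr))

# A tight (fixed-size-like) distribution vs. a long-tailed one.
tight = np.full(100, 500.0)
tailed = np.concatenate([np.full(90, 80.0), np.full(10, 600.0)])

print(f"tight:  {tail_ratio(tight):.2f}")   # → 1.00
print(f"tailed: {tail_ratio(tailed):.2f}")  # → 7.50
```

Applied to the tables above, Fixed Size (p95 = 529 vs. median = 95) is far more tail‑heavy overall than Semantic (149 vs. 77), which matches the per‑type breakdown.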

Axis labels are set accordingly: token plots show “Tokens per chunk”; per‑doc histograms show “Chunks per document” on the x‑axis and “Number of documents” on the y‑axis.
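The ECDF also lets you read off a retrieval budget directly: the fraction of chunks at or below N tokens is the empirical CDF evaluated at N. A self‑contained sketch (the 512‑token budget is an illustrative assumption):

```python
import numpy as np

def frac_within_budget(token_counts, budget: int) -> float:
    """Fraction of chunks with token_count <= budget (ECDF evaluated at budget)."""
    vals = np.sort(np.asarray(token_counts, dtype=float))
    # searchsorted with side='right' counts values <= budget
    return float(np.searchsorted(vals, budget, side='right') / vals.size)

counts = [60, 80, 95, 120, 300, 550, 700]
print(f"{frac_within_budget(counts, 512):.3f} of chunks fit a 512-token budget")  # → 0.714
```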