Fixed‑Size vs. Semantic Chunking — Cleantech Subsample (Full Report)

This notebook compares chunk statistics between fixed‑size and semantic chunking for a Cleantech subsample. It produces separate plots per dataset (Fixed vs. Semantic) and a further breakdown by document type (topic, media, patent), using explicit metadata when available or inferring the type from doc_id prefixes like topic:..., media:..., patent:....

Outputs

  • Dataset‑level tables and figures
  • CSV exports

Paths & outputs

Inputs (adjust to your environment if needed):

Python
fixed_path    = Path('../cleantech_data/silver_subsample_chunk/fixed_size/2025-08-29/chunks.parquet')
semantic_path = Path('../cleantech_data/silver_subsample_chunk/semantic/2025-08-29/chunks.parquet')

Outputs (this notebook writes here):

  • Figures: reports/chunking/figs/<dataset>/<doc_type>/...
  • Tables: reports/chunking/<dataset>/<doc_type>/... (plus dataset‑level under reports/chunking/<dataset>/)

If the input Parquet files are not found, a small synthetic sample is used so you can still run the notebook end‑to‑end.

Python
from __future__ import annotations

import json
from typing import Dict, Iterable

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from pathlib import Path
from datetime import datetime

# Render crisper figures; DO NOT set styles or colors to keep defaults.
plt.rcParams['figure.dpi'] = 130

# --- Latest input parquet paths
fixed_root = Path('../cleantech_data/silver_subsample_chunk/fixed_size')
semantic_root = Path('../cleantech_data/silver_subsample_chunk/semantic')

def resolve_latest_chunk_path(root: Path, filename: str = 'chunks.parquet', date_fmt: str = '%Y-%m-%d') -> Path:
    """Return <root>/<YYYY-MM-DD>/<filename> for the most recent date that has the file.
    If none found, return <root>/<filename> so your loader can fall back to synthetic data.
    """
    candidates = []
    if root.exists():
        for child in root.iterdir():
            if child.is_dir():
                try:
                    dt = datetime.strptime(child.name, date_fmt).date()
                except ValueError:
                    continue
                fpath = child / filename
                if fpath.exists():
                    candidates.append((dt, fpath))
    if candidates:
        dt, latest = max(candidates, key=lambda t: t[0])
        print(f"Resolved latest under {root}: {dt} -> {latest}")
        return latest
    print(f"WARNING: No dated subfolders containing {filename} under {root}. Will attempt fallback/synthetic.")
    return root / filename  # likely non-existent; your loader can detect and use synthetic fallback

# use newest fixed & semantic chunk files
fixed_path    = resolve_latest_chunk_path(fixed_root)
semantic_path = resolve_latest_chunk_path(semantic_root)


# --- Output roots (as in your snippet) ---
CHUNKING_DIR = Path('reports/chunking')
PLOTS_DIR = CHUNKING_DIR / 'figs'     # figures
TABLES_DIR = CHUNKING_DIR             # tables; dataset/type subfolders are created automatically

PLOTS_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)
Text Only
Resolved latest under ..\cleantech_data\silver_subsample_chunk\fixed_size: 2025-09-03 -> ..\cleantech_data\silver_subsample_chunk\fixed_size\2025-09-03\chunks.parquet
Resolved latest under ..\cleantech_data\silver_subsample_chunk\semantic: 2025-09-03 -> ..\cleantech_data\silver_subsample_chunk\semantic\2025-09-03\chunks.parquet

Helper functions

  • Loading with Parquet → synthetic fallback
  • Metadata expansion (handles JSON strings or dicts)
  • Doc type inference (topic, media, patent) from either metadata or doc_id prefix
  • Stats utilities (global and per‑doc)
  • Plotting utilities (histogram, ECDF, boxplot), each plot in its own figure
Python
def _read_parquet(path: Path) -> pd.DataFrame | None:
    try:
        if path.exists():
            return pd.read_parquet(path)
    except Exception as e:
        print(f"WARNING: Failed to read {path} as Parquet: {e}")
    return None


def load_parquet_or_sample(path: Path, kind: str) -> pd.DataFrame:
    """Load Parquet file if available; otherwise provide small synthetic sample."""
    df = _read_parquet(path)
    if df is not None:
        print(f"Loaded {kind} dataset from {path} with {len(df):,} rows.")
        return df

    print(f"WARNING: {kind} path {path} not found. Using synthetic sample data for demonstration.")
    if kind == 'fixed':
        data = [
            {"text": "Fixed sample chunk A", "metadata": json.dumps({"doc_id": "topic_doc1", "token_count": 100})},
            {"text": "Fixed sample chunk B", "metadata": json.dumps({"doc_id": "media_doc1", "token_count": 110})},
            {"text": "Fixed sample chunk C", "metadata": json.dumps({"doc_id": "patent_doc2", "token_count": 90})},
        ]
    else:
        data = [
            {"text": "Semantic sample chunk A", "metadata": json.dumps({"doc_id": "topic:doc1", "token_count": 80})},
            {"text": "Semantic sample chunk B", "metadata": json.dumps({"doc_id": "media:doc2", "token_count": 95})},
        ]
    return pd.DataFrame(data)


def expand_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Expand 'metadata' JSON; enforce doc_id and token_count; keep text/doc_type if present."""
    if 'metadata' in df.columns:
        meta = df['metadata'].apply(lambda m: m if isinstance(m, dict) else json.loads(m))
        meta_df = pd.json_normalize(meta)
        df = pd.concat([df.drop(columns=['metadata'], errors='ignore'), meta_df], axis=1)

    if 'doc_id' not in df.columns:
        df['doc_id'] = 'unknown'
    df['doc_id'] = df['doc_id'].astype(str)

    token_col = next((c for c in ['token_count', 'tokens', 'n_tokens'] if c in df.columns), None)
    if token_col is None:
        raise ValueError("No token count field found in metadata. Expected one of: token_count, tokens, n_tokens.")

    df['token_count'] = pd.to_numeric(df[token_col], errors='coerce')
    before = len(df)
    df = df.dropna(subset=['token_count']).copy()
    df['token_count'] = df['token_count'].astype(int)
    dropped = before - len(df)
    if dropped:
        print(f"Dropped {dropped} rows without valid token_count.")

    keep = [c for c in ['doc_id', 'text', 'token_count', 'doc_type'] if c in df.columns]
    return df[keep]


def add_doc_type(df: pd.DataFrame) -> pd.DataFrame:
    """Ensure 'doc_type' exists. Prefer explicit metadata, otherwise infer from doc_id prefix."""
    out = df.copy()
    # Normalize provided doc_type if any
    if 'doc_type' in out.columns:
        dt = out['doc_type'].astype(str).str.lower().str.strip()
        dt = dt.replace({'topics':'topic', 'patents':'patent', 'medias':'media'})
    else:
        dt = pd.Series(index=out.index, dtype=object)

    base = out['doc_id'].astype(str).str.lower().str.strip()
    inferred = pd.Series('unknown', index=out.index)
    for t in ['topic', 'media', 'patent']:
        # Match both separators seen in doc_ids, e.g. 'topic:doc1' and 'topic_T10001'
        inferred[base.str.startswith((f"{t}:", f"{t}_"))] = t

    # After astype(str), missing values become the string 'nan', so check for those too
    dt_final = dt.mask(dt.isin(['', 'nan', 'none']), inferred) if 'doc_type' in out.columns else inferred
    out['doc_type'] = dt_final
    return out


def describe_series(x: Iterable[float | int]) -> Dict[str, float]:
    arr = np.asarray(list(x), dtype=float)
    arr = arr[~np.isnan(arr)]
    if arr.size == 0:
        return {k: 0.0 for k in ['count','mean','median','std','min','max','p05','p25','p75','p95']}
    return {
        'count': float(arr.size),
        'mean': float(arr.mean()),
        'median': float(np.median(arr)),
        'std': float(arr.std(ddof=0)),
        'min': float(arr.min()),
        'max': float(arr.max()),
        'p05': float(np.percentile(arr, 5)),
        'p25': float(np.percentile(arr, 25)),
        'p75': float(np.percentile(arr, 75)),
        'p95': float(np.percentile(arr, 95)),
    }


def per_doc_stats(df: pd.DataFrame) -> pd.DataFrame:
    g = df.groupby('doc_id')['token_count']
    out = g.agg(
        chunk_count='count',
        mean='mean',
        median='median',
        std=lambda s: float(np.std(s.to_numpy(dtype=float), ddof=0)),
        min='min',
        max='max'
    ).reset_index().sort_values('chunk_count', ascending=False)
    return out


def fd_bins(arr: np.ndarray) -> int:
    x = np.asarray(arr, dtype=float)
    x = x[~np.isnan(x)]
    n = x.size
    if n <= 1:
        return max(1, n)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    if iqr == 0:
        return max(5, int(np.sqrt(n)))
    h = 2 * iqr / (n ** (1/3))
    if h <= 0:
        return max(5, int(np.sqrt(n)))
    bins = int(np.ceil((x.max() - x.min()) / h))
    return max(5, min(bins, 100))


def plot_hist(series: pd.Series, title: str, fname: Path, xlabel: str = "Tokens per chunk", ylabel: str = "Frequency"):
    vals = series.dropna().to_numpy()
    if len(vals) == 0:
        return
    fname.parent.mkdir(parents=True, exist_ok=True)
    plt.figure()
    plt.hist(vals, bins=fd_bins(vals))
    plt.title(title)
    plt.xlabel(xlabel)   # supports "Chunks per document"
    plt.ylabel(ylabel)   # can be customized if desired
    plt.tight_layout()
    plt.savefig(fname, bbox_inches='tight')
    plt.close()


def plot_ecdf(series: pd.Series, title: str, fname: Path):
    vals = np.sort(series.dropna().to_numpy(dtype=float))
    if vals.size == 0:
        return
    fname.parent.mkdir(parents=True, exist_ok=True)
    y = np.arange(1, vals.size + 1) / vals.size
    plt.figure()
    plt.step(vals, y, where='post')
    plt.title(title)
    plt.xlabel('Tokens per chunk')
    plt.ylabel('ECDF')
    plt.tight_layout()
    plt.savefig(fname, bbox_inches='tight')
    plt.close()


def plot_box(series: pd.Series, title: str, fname: Path):
    vals = series.dropna().to_numpy(dtype=float)
    if len(vals) == 0:
        return
    fname.parent.mkdir(parents=True, exist_ok=True)
    plt.figure()
    plt.boxplot(vals, vert=True, showfliers=True)
    plt.title(title)
    plt.ylabel('Tokens per chunk')
    plt.tight_layout()
    plt.savefig(fname, bbox_inches='tight')
    plt.close()


def dataset_slug(label: str) -> str:
    return label.lower().replace(' ', '_')

Analysis routines

Two layers of analysis:

  1. Dataset‑level (Fixed Size vs. Semantic): global stats & per‑doc stats, plus token distribution plots.
  2. Doc‑type breakdown within each dataset: global & per‑doc stats per topic, media, patent (and any others found).

Python
def analyze_one_dataset(df: pd.DataFrame, label: str):
    """Dataset-level stats & plots (no mixing with other dataset)."""
    df_expanded = expand_metadata(df)
    df_expanded = add_doc_type(df_expanded)  # ensure doc_type present for downstream

    display(df_expanded.head())

    # Global summary (dataset level)
    gstats = describe_series(df_expanded['token_count'])
    global_df = pd.DataFrame([gstats])
    global_df.insert(0, 'dataset', label)
    display(global_df)

    # Per-document summary (dataset level)
    perdoc = per_doc_stats(df_expanded)
    display(perdoc.head())

    # Save tables (dataset-level)
    ds_slug = dataset_slug(label)
    (TABLES_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
    global_df.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_global_summary.csv", index=False)
    perdoc.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_per_doc_summary.csv", index=False)

    # Plots (dataset-level)
    (PLOTS_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
    plot_hist(df_expanded['token_count'], f"{label}: Token Count Distribution", PLOTS_DIR / ds_slug / f"{ds_slug}_token_hist.png")
    plot_ecdf(df_expanded['token_count'], f"{label}: Token Count ECDF", PLOTS_DIR / ds_slug / f"{ds_slug}_token_ecdf.png")
    plot_box(df_expanded['token_count'], f"{label}: Token Count Boxplot", PLOTS_DIR / ds_slug / f"{ds_slug}_token_boxplot.png")

    # Chunks-per-document distribution (dataset-level) — CORRECT x-axis label
    plot_hist(
        perdoc['chunk_count'],
        f"{label}: Chunks per Document",
        PLOTS_DIR / ds_slug / f"{ds_slug}_chunks_per_doc_hist.png",
        xlabel="Chunks per document",
        ylabel="Number of documents",
    )

    return df_expanded, perdoc, global_df


def analyze_by_doc_type(df_expanded: pd.DataFrame, label: str):
    """Break down one dataset by doc_type (topic/media/patent/unknown)."""
    ds_slug = dataset_slug(label)

    # Ensure doc_type column exists
    if 'doc_type' not in df_expanded.columns:
        df_expanded = add_doc_type(df_expanded)

    order = ['topic', 'media', 'patent']
    present = list(pd.unique(df_expanded['doc_type'].dropna().astype(str)))
    ordered = [t for t in order if t in present] + [t for t in present if t not in order]

    global_rows = []
    perdoc_rows = []

    for dt in ordered:
        df_dt = df_expanded[df_expanded['doc_type'] == dt]
        if df_dt.empty:
            continue

        # Global per-type stats
        gstats = describe_series(df_dt['token_count'])
        gdf = pd.DataFrame([gstats])
        gdf.insert(0, 'doc_type', dt)
        gdf.insert(0, 'dataset', label)
        global_rows.append(gdf)

        # Per-doc stats within this type
        perdoc = per_doc_stats(df_dt)
        perdoc.insert(0, 'doc_type', dt)
        perdoc.insert(0, 'dataset', label)
        perdoc_rows.append(perdoc)

        # Ensure subdirs
        tdir = TABLES_DIR / ds_slug / dt
        pdir = PLOTS_DIR / ds_slug / dt
        tdir.mkdir(parents=True, exist_ok=True)
        pdir.mkdir(parents=True, exist_ok=True)

        # Save tables
        gdf.to_csv(tdir / f"{ds_slug}_{dt}_global_summary.csv", index=False)
        perdoc.to_csv(tdir / f"{ds_slug}_{dt}_per_doc_summary.csv", index=False)

        # Plots for this type
        plot_hist(df_dt['token_count'], f"{label} [{dt}] Token Count Distribution", pdir / f"{ds_slug}_{dt}_token_hist.png")
        plot_ecdf(df_dt['token_count'], f"{label} [{dt}] Token Count ECDF", pdir / f"{ds_slug}_{dt}_token_ecdf.png")
        plot_box(df_dt['token_count'], f"{label} [{dt}] Token Count Boxplot", pdir / f"{ds_slug}_{dt}_token_boxplot.png")

        # Chunks-per-doc within this type — CORRECT x-axis label
        plot_hist(
            perdoc['chunk_count'],
            f"{label} [{dt}] Chunks per Document",
            pdir / f"{ds_slug}_{dt}_chunks_per_doc_hist.png",
            xlabel="Chunks per document",
            ylabel="Number of documents",
        )

    global_by_type = pd.concat(global_rows, ignore_index=True) if global_rows else pd.DataFrame()
    perdoc_by_type = pd.concat(perdoc_rows, ignore_index=True) if perdoc_rows else pd.DataFrame()

    # Save combined (per dataset)
    if not global_by_type.empty:
        (TABLES_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
        global_by_type.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_global_by_doc_type.csv", index=False)
    if not perdoc_by_type.empty:
        (TABLES_DIR / ds_slug).mkdir(parents=True, exist_ok=True)
        perdoc_by_type.to_csv(TABLES_DIR / ds_slug / f"{ds_slug}_per_doc_with_doc_type.csv", index=False)

    return perdoc_by_type, global_by_type

Run the analysis

This cell loads the datasets (or synthetic fallbacks), runs the dataset‑level analyses, then the doc‑type breakdowns, and finally writes comparison tables.

Python
# Load (with synthetic fallback)
df_fixed_raw = load_parquet_or_sample(fixed_path, 'fixed')
df_sem_raw   = load_parquet_or_sample(semantic_path, 'semantic')

# Dataset-level analyses
display("\n### FIXED DATASET ###")
fixed_expanded, fixed_perdoc, fixed_global = analyze_one_dataset(df_fixed_raw, label='Fixed Size')

display("\n### SEMANTIC DATASET ###")
sem_expanded, sem_perdoc, sem_global = analyze_one_dataset(df_sem_raw, label='Semantic')

# Doc-type breakdowns
display("\n### FIXED by doc_type ###")
fixed_perdoc_by_type, fixed_global_by_type = analyze_by_doc_type(fixed_expanded, label='Fixed Size')
display(fixed_global_by_type.head())

display("\n### SEMANTIC by doc_type ###")
sem_perdoc_by_type, sem_global_by_type = analyze_by_doc_type(sem_expanded, label='Semantic')
display(sem_global_by_type.head())

# Side-by-side comparisons
global_compare = pd.concat([fixed_global, sem_global], ignore_index=True)
display(global_compare)

global_by_doc_type_compare = pd.concat([fixed_global_by_type, sem_global_by_type], ignore_index=True)
display(global_by_doc_type_compare)

# Save comparison tables
global_compare.to_csv(TABLES_DIR / "global_comparison_fixed_vs_semantic.csv", index=False)
global_by_doc_type_compare.to_csv(TABLES_DIR / "global_by_doc_type_comparison.csv", index=False)

print("\nArtifacts written to:\n")
print(f" - Tables: {TABLES_DIR.resolve()} ")
print(f" - Plots:  {PLOTS_DIR.resolve()} \n")
Text Only
Loaded fixed dataset from ..\cleantech_data\silver_subsample_chunk\fixed_size\2025-09-03\chunks.parquet with 8,538 rows.
Loaded semantic dataset from ..\cleantech_data\silver_subsample_chunk\semantic\2025-09-03\chunks.parquet with 23,482 rows.



'\n### FIXED DATASET ###'
doc_id text token_count doc_type
0 topic_T10001 Geological and Geochemical Analysis: This clus... 72 topic
1 topic_T10002 Advanced Chemical Physics Studies: This cluste... 74 topic
2 topic_T10003 Innovation and Knowledge Management: This clus... 66 topic
3 topic_T10004 Soil Carbon and Nitrogen Dynamics: This cluste... 87 topic
4 topic_T10005 Ecology and Vegetation Dynamics Studies: This ... 68 topic
dataset count mean median std min max p05 p25 p75 p95
0 Fixed Size 8538.0 209.343172 95.0 175.160279 37.0 559.0 59.0 73.0 293.0 529.0
doc_id chunk_count mean median std min max
724 media_bafbbb3bb6c1273e0d4c07057aae7a535a4340d5 9 477.333333 527.0 140.478547 80 527
815 media_d637cac2b4315a7c8df0722c0e9721955937d8a8 8 511.125000 524.0 34.064048 421 524
260 media_40cb388f766424705c2f6722a253b59641df20fd 7 498.428571 526.0 67.535931 333 526
516 media_83c3952e19cd0f435869f4ded51d08487b696a96 7 481.000000 529.0 117.575508 193 529
690 media_b422f60e44394719c787a827d3749f4e0976b7c8 7 505.428571 530.0 58.563414 362 530
Text Only
'\n### SEMANTIC DATASET ###'
doc_id text token_count doc_type
0 topic_T10001 Geological and Geochemical Analysis: This clus... 74 topic
1 topic_T10002 Advanced Chemical Physics Studies: This cluste... 76 topic
2 topic_T10003 Innovation and Knowledge Management: This clus... 68 topic
3 topic_T10004 Soil Carbon and Nitrogen Dynamics: This cluste... 62 topic
4 topic_T10004 Soil Carbon and Nitrogen Dynamics: The researc... 35 topic
dataset count mean median std min max p05 p25 p75 p95
0 Semantic 23482.0 82.775573 77.0 36.710339 9.0 681.0 34.0 67.0 90.0 149.0
doc_id chunk_count mean median std min max
815 media_d637cac2b4315a7c8df0722c0e9721955937d8a8 58 75.310345 71.5 10.896388 62 100
724 media_bafbbb3bb6c1273e0d4c07057aae7a535a4340d5 55 82.836364 83.0 13.220870 33 111
668 media_adc00155be930b42ad9abdff11a1db7b22890545 49 77.306122 75.0 12.201044 62 133
258 media_4087c68626961181bb87b47d4594be44992ee389 48 81.375000 80.0 10.080148 65 104
516 media_83c3952e19cd0f435869f4ded51d08487b696a96 47 80.510638 78.0 10.846849 67 110
Text Only
'\n### FIXED by doc_type ###'
dataset doc_type count mean median std min max p05 p25 p75 p95
0 Fixed Size topic 4516.0 75.001550 74.0 12.719124 37.0 145.0 56.0 66.0 83.0 97.0
1 Fixed Size media 2451.0 441.106079 522.0 131.487840 78.0 559.0 143.5 361.0 527.0 534.0
2 Fixed Size patent 1571.0 233.936346 240.0 54.945130 37.0 536.0 138.0 207.0 266.0 302.0
Text Only
'\n### SEMANTIC by doc_type ###'
dataset doc_type count mean median std min max p05 p25 p75 p95
0 Semantic topic 5835.0 61.108312 66.0 18.636447 16.0 107.0 26.0 55.0 74.0 85.0
1 Semantic media 14728.0 81.419337 79.0 19.820953 11.0 681.0 60.0 70.0 90.0 112.0
2 Semantic patent 2919.0 132.930798 117.0 68.566729 9.0 444.0 43.0 78.0 184.0 257.1
dataset count mean median std min max p05 p25 p75 p95
0 Fixed Size 8538.0 209.343172 95.0 175.160279 37.0 559.0 59.0 73.0 293.0 529.0
1 Semantic 23482.0 82.775573 77.0 36.710339 9.0 681.0 34.0 67.0 90.0 149.0
dataset doc_type count mean median std min max p05 p25 p75 p95
0 Fixed Size topic 4516.0 75.001550 74.0 12.719124 37.0 145.0 56.0 66.0 83.0 97.0
1 Fixed Size media 2451.0 441.106079 522.0 131.487840 78.0 559.0 143.5 361.0 527.0 534.0
2 Fixed Size patent 1571.0 233.936346 240.0 54.945130 37.0 536.0 138.0 207.0 266.0 302.0
3 Semantic topic 5835.0 61.108312 66.0 18.636447 16.0 107.0 26.0 55.0 74.0 85.0
4 Semantic media 14728.0 81.419337 79.0 19.820953 11.0 681.0 60.0 70.0 90.0 112.0
5 Semantic patent 2919.0 132.930798 117.0 68.566729 9.0 444.0 43.0 78.0 184.0 257.1
Text Only
Artifacts written to:

 - Tables: C:\Users\gerbe\PycharmProjects\MT\notebooks\reports\chunking 
 - Plots:  C:\Users\gerbe\PycharmProjects\MT\notebooks\reports\chunking\figs

Interpreting the outputs

  • Token Count Distribution (histogram/ECDF/boxplot): Characterizes chunk sizes. Look for long tails or multi‑modal patterns signaling uneven chunking.
  • Chunks per Document (histogram): How many chunks each document produced. Expect a tighter spread for fixed‑size chunking; semantic chunking may vary with content.
  • Global vs Per‑Doc Tables: Use the global tables to compare central tendency and dispersion across datasets or doc types; use per‑doc tables to diagnose specific documents (outliers, heavy chunk counts, etc.).
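One quick way to act on the global tables is a simple tail check: if p95 is several times the median, the token distribution is long‑tailed and worth a closer look. A minimal sketch (synthetic numbers; the 3× rule of thumb is an illustrative assumption, not a setting from this notebook):

```python
import numpy as np

def tail_ratio(token_counts) -> float:
    """Ratio of the 95th percentile to the median; >~3 suggests a long tail."""
    arr = np.asarray(token_counts, dtype=float)
    return float(np.percentile(arr, 95) / np.median(arr))

# A tight (fixed-size-like) distribution vs. a long-tailed one.
tight = np.full(100, 500.0)
tailed = np.concatenate([np.full(90, 80.0), np.full(10, 600.0)])

print(f"tight:  {tail_ratio(tight):.2f}")   # → 1.00
print(f"tailed: {tail_ratio(tailed):.2f}")  # → 7.50
```

Applied to the tables above, Fixed Size (p95 = 529 vs. median = 95) is far more tail‑heavy overall than Semantic (149 vs. 77), which matches the per‑type breakdown.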

Axis labels are set accordingly: token plots show “Tokens per chunk”; per‑doc histograms show “Chunks per document” on the x‑axis and “Number of documents” on the y‑axis.
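The ECDF also lets you read off a retrieval budget directly: the fraction of chunks at or below N tokens is the empirical CDF evaluated at N. A self‑contained sketch (the 512‑token budget is an illustrative assumption):

```python
import numpy as np

def frac_within_budget(token_counts, budget: int) -> float:
    """Fraction of chunks with token_count <= budget (ECDF evaluated at budget)."""
    vals = np.sort(np.asarray(token_counts, dtype=float))
    # searchsorted with side='right' counts values <= budget
    return float(np.searchsorted(vals, budget, side='right') / vals.size)

counts = [60, 80, 95, 120, 300, 550, 700]
print(f"{frac_within_budget(counts, 512):.3f} of chunks fit a 512-token budget")  # → 0.714
```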