GraphRAG

GraphRAG builds a heterogeneous chunk–entity graph from Gold artifacts, stores it in Neo4j, and derives multi-scale communities with summaries and embeddings for downstream retrieval.

Inputs & Outputs

Inputs: Gold chunks (gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet), OpenAlex topic enrichments (optional joins), Neo4j credentials (NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD).

Outputs:

CSV exports next to each Gold bucket: gr_nodes.csv, gr_edges.csv.
Neo4j entities: Chunk, person, org, location, tech, subfield, field, domain, topic nodes; CHUNK_MENTIONS_*, CHUNK_IN_*, TECH_IN_SUBFIELD, CHUNK_ABOUT_TOPIC relationships.
Community layer: Community nodes with properties (cid, dataset, level, ingest_tag, size, params, summary, summaryEmbedding, embeddingDim, summaryProvider, summaryModel), and (:Chunk)-[:IN_COMMUNITY {level, ingest_tag}]->(:Community) relationships.
Inspection helpers: completed_ids.json (enrichment state), visualization_interactive.html (chunk viewer) reused during debugging.

Node schema (gr_nodes.csv)

column	dtype	example
`node_id`	string	`chunk:media_2024-02-19_000123_fixed_chunk_0`
`node_type`	string	`chunk`, `person`, `org`, `tech`, `subfield`, ...
`display_name`	string	`HydrogenCo`
`props_b64`	base64 JSON	`{"doc_type":"media","chunk_type":"fixed","lang":"en","summary":"…"}` encoded as base64.
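
The props_b64 column can be unpacked with the standard library. The following is a minimal sketch, assuming standard (not URL-safe) base64 over UTF-8 JSON; the helper names decode_props and encode_props are illustrative and not part of graphbuild.

```python
import base64
import json


def decode_props(props_b64: str) -> dict:
    """Decode a props_b64 cell (base64-encoded JSON) into a dict."""
    if not props_b64:
        return {}  # treat an empty cell as "no properties"
    return json.loads(base64.b64decode(props_b64).decode("utf-8"))


def encode_props(props: dict) -> str:
    """Inverse of decode_props; used here only to build a sample value."""
    raw = json.dumps(props, ensure_ascii=False).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Round-trip a payload shaped like the example in the table above.
sample = encode_props({"doc_type": "media", "chunk_type": "fixed", "lang": "en"})
props = decode_props(sample)
print(props["chunk_type"])  # prints "fixed"
```

If the exporter turns out to use URL-safe base64, base64.urlsafe_b64decode would be the drop-in replacement.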

Edge schema (gr_edges.csv)

column	dtype	example
`src`	string	`chunk:media_...`
`dst`	string	`org:hydrogenco`
`edge_type`	string	`CHUNK_MENTIONS_ORG`, `TECH_IN_SUBFIELD`, `CHUNK_ABOUT_TOPIC`
`props_b64`	base64 JSON	e.g., `{"score":0.82}` for weighted edges.
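
To inspect weighted edges outside Neo4j, the score can be read back from props_b64. This sketch assumes the weight is stored under the key "score" as in the example above and that unweighted edges carry an empty props_b64 cell (an assumption). The in-memory rows are made-up stand-ins for gr_edges.csv, not real data.

```python
import base64
import csv
import io
import json


def _b64(props: dict) -> str:
    """Encode a dict the same way the props_b64 column is described."""
    return base64.b64encode(json.dumps(props).encode("utf-8")).decode("ascii")


# Tiny in-memory stand-in for gr_edges.csv (illustrative rows only).
sample_edges = io.StringIO()
writer = csv.DictWriter(sample_edges, fieldnames=["src", "dst", "edge_type", "props_b64"])
writer.writeheader()
writer.writerow({"src": "chunk:a", "dst": "topic:x", "edge_type": "CHUNK_ABOUT_TOPIC",
                 "props_b64": _b64({"score": 0.82})})
writer.writerow({"src": "chunk:a", "dst": "topic:y", "edge_type": "CHUNK_ABOUT_TOPIC",
                 "props_b64": _b64({"score": 0.30})})
writer.writerow({"src": "chunk:a", "dst": "org:hydrogenco", "edge_type": "CHUNK_MENTIONS_ORG",
                 "props_b64": ""})
sample_edges.seek(0)


def weighted_edges(handle, min_score: float):
    """Yield (src, dst, edge_type, score) for edges with score >= min_score."""
    for row in csv.DictReader(handle):
        if not row["props_b64"]:
            continue  # unweighted edge: nothing to decode
        props = json.loads(base64.b64decode(row["props_b64"]))
        score = props.get("score")
        if score is not None and score >= min_score:
            yield row["src"], row["dst"], row["edge_type"], score


kept = list(weighted_edges(sample_edges, min_score=0.5))
print(kept)
```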

Parameters

python -m graphbuild csv / ingest:

--dataset {fixed_size|semantic}, --date YYYY-MM-DD (default latest), --ingest-tag, --delete-tag, --batch-size (default 1000).
Automatically sets NEO4J_DATABASE from NEO4J_DATABASE_FIXED_SIZE if it is unset. On the VPS, which runs Neo4j Community Edition, use a single database (graph-fixed-size) and do not set NEO4J_DATABASE_SEMANTIC.
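
The database selection described above could be sketched as follows. This is a guess at the precedence, not the graphbuild implementation: an explicit NEO4J_DATABASE wins, then a per-dataset variable (the use of NEO4J_DATABASE_SEMANTIC for the semantic dataset is an assumption), then NEO4J_DATABASE_FIXED_SIZE as the shared fallback. The function name resolve_database is hypothetical.

```python
from typing import Mapping, Optional

# Assumed mapping from --dataset value to its database variable.
PER_DATASET_VAR = {
    "fixed_size": "NEO4J_DATABASE_FIXED_SIZE",
    "semantic": "NEO4J_DATABASE_SEMANTIC",
}


def resolve_database(dataset: str, env: Mapping[str, str]) -> Optional[str]:
    """Pick the Neo4j database name for a dataset from environment values."""
    if env.get("NEO4J_DATABASE"):
        return env["NEO4J_DATABASE"]  # explicit setting always wins
    per_dataset = env.get(PER_DATASET_VAR[dataset])
    if per_dataset:
        return per_dataset
    # Single-database setups: fall back to the fixed-size database.
    return env.get("NEO4J_DATABASE_FIXED_SIZE")


# Single-database configuration: both datasets land in the same database.
single_db_env = {"NEO4J_DATABASE_FIXED_SIZE": "graph-fixed-size"}
print(resolve_database("fixed_size", single_db_env))
print(resolve_database("semantic", single_db_env))
```

In real use, os.environ would be passed as env.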

python -m communities communities:

--levels comma-separated list of name:gamma pairs (e.g., C0:0.6,C1:1.2,C2:1.6), --min-weight (edge weight threshold), --min-size (minimum entities per community), --ingest-tag (required), --replace / --delete-tag for cleanup.
Runs gds.leiden.stream with supplied gamma (resolution) per level on the in-memory projection entityCooc_<dataset>.
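
The --levels value maps each level name to the gamma passed to Leiden for that level. A minimal parsing sketch, assuming the name:gamma format shown in the example; parse_levels is an illustrative helper, not the CLI's own parser.

```python
def parse_levels(spec: str) -> dict[str, float]:
    """Parse 'C0:0.6,C1:1.2' into {'C0': 0.6, 'C1': 1.2}."""
    levels: dict[str, float] = {}
    for item in spec.split(","):
        name, sep, gamma = item.strip().partition(":")
        if not sep or not name:
            raise ValueError(f"expected NAME:GAMMA, got {item!r}")
        levels[name] = float(gamma)  # raises ValueError on a non-numeric gamma
    return levels


levels = parse_levels("C0:0.6,C1:1.2,C2:1.6")
for name, gamma in levels.items():
    print(name, gamma)
```

Each (name, gamma) pair then corresponds to one Leiden run at that resolution.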

python -m communities summaries:

--level, --ingest-tag, --refresh, --limit, --estimate-only.
Summary provider/model from env: OA_COMM_SUMMARY_PROVIDER, OA_COMM_SUMMARY_MODEL, fallback to OPENAI_KEY (gpt-4o-mini) or GEMINI_KEY (gemini-2.5-flash).
Embedding provider/model overrides: OA_COMM_EMBED_PROVIDER, OA_COMM_EMBED_MODEL, OA_COMM_EMBED_BATCH.
python -m communities ensure-index ensures the Neo4j vector index on Community.summaryEmbedding (dimension auto-detected or provided via --dim).
Retrieval helpers: python -m communities search (--top-k, --dataset, --level), python -m communities retrieve (--k-comms, --top-k, --rerank-mode {dense|summary}, --dense-date, --dense-persist, --hydrate).
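
The summary provider fallback can be pictured as a small resolution function. This is a sketch under assumptions: the provider identifiers "openai" and "gemini" are guesses, and the order of preference (OpenAI before Gemini) simply follows the order in which the keys are listed above.

```python
from typing import Mapping, Tuple

# Default models taken from the fallback description; provider names are assumed.
DEFAULT_MODELS = {"openai": "gpt-4o-mini", "gemini": "gemini-2.5-flash"}


def resolve_summary_model(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (provider, model) from overrides or from whichever key is set."""
    provider = env.get("OA_COMM_SUMMARY_PROVIDER")
    if not provider:
        if env.get("OPENAI_KEY"):
            provider = "openai"
        elif env.get("GEMINI_KEY"):
            provider = "gemini"
        else:
            raise RuntimeError("no summary provider configured and no API key found")
    model = env.get("OA_COMM_SUMMARY_MODEL") or DEFAULT_MODELS.get(provider)
    if model is None:
        raise RuntimeError(f"no default model known for provider {provider!r}")
    return provider, model


print(resolve_summary_model({"GEMINI_KEY": "dummy"}))
```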

Step-by-step Tasks

1. Build CSV exports

⬜ Preconditions: Gold bucket exists; optional topic joins already written by enrichment.

⬜ Command:

Bash
python -m graphbuild csv --dataset fixed_size --date 2025-09-14

✅ Expected output: gold_subsample_chunk/fixed_size/2025-09-14/gr_nodes.csv (roughly one row per chunk plus one row per entity) and gr_edges.csv.

2. Ingest into Neo4j

⬜ Preconditions: Neo4j server reachable; set NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD, optional database mapping envs.

⬜ Command:

Bash
python -m graphbuild ingest --dataset fixed_size --date 2025-09-14 --ingest-tag fixed_2025_09_14

✅ Expected output: Console prints [ingest] ingesting nodes (...) and [ingest] done.; Neo4j contains Chunk nodes with props_b64 metadata, entity nodes, and relationships.
⬜ Troubleshooting: Use python -m graphbuild csv first to ensure CSVs exist; set --delete-tag to remove outdated ingest tags; confirm Bolt port in NEO4J_URI (defaults to neo4j://127.0.0.1:7687).
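
To confirm the Bolt port before ingesting, the URI can be parsed and probed with the standard library. This is a rough sketch; bolt_endpoint and port_open are illustrative helpers, and 7687 is used as the fallback port to match the default URI above.

```python
import socket
from urllib.parse import urlparse


def bolt_endpoint(uri: str = "neo4j://127.0.0.1:7687"):
    """Split a Neo4j URI into (scheme, host, port), defaulting the port to 7687."""
    parsed = urlparse(uri)
    return parsed.scheme, parsed.hostname, parsed.port or 7687


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


scheme, host, port = bolt_endpoint("neo4j://127.0.0.1:7687")
print(scheme, host, port)  # prints "neo4j 127.0.0.1 7687"
# port_open(host, port) can then be called to check that the server is listening.
```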

3. Run community detection

⬜ Preconditions: Graph ingested; gds plugin available in Neo4j; choose ingest tag.

⬜ Command:

Bash
python -m communities communities --dataset fixed_size --ingest-tag comm_fixed_C1_g1_2 --levels "C0:0.6,C1:1.2,C2:1.6" --min-weight 1 --min-size 8

✅ Expected output: A coverage ratio printed per level, ideally ≥0.95; communities stamped with params (min weight, resolution, min size).
⬜ Troubleshooting: If the run fails with "No :Chunk with a usable chunk_type found", ensure chunk_type matches the dataset or run graphbuild ingest again.
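
The exact definition of the coverage ratio is not spelled out here. One plausible reading, used in the sketch below as an assumption, is the share of clustered nodes that end up in a community of at least --min-size members. The function name and the sample assignment are illustrative.

```python
from collections import Counter


def coverage_ratio(assignments: dict[str, int], min_size: int) -> float:
    """Fraction of nodes whose community has at least min_size members."""
    if not assignments:
        return 0.0
    sizes = Counter(assignments.values())  # community id -> member count
    covered = sum(1 for cid in assignments.values() if sizes[cid] >= min_size)
    return covered / len(assignments)


# Ten nodes: eight in community 0, two in a small community 1.
example = {f"n{i}": 0 for i in range(8)}
example.update({"n8": 1, "n9": 1})
print(coverage_ratio(example, min_size=8))  # prints 0.8
```

Under this reading, lowering --min-size or changing gamma so that fewer tiny communities form would raise coverage.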

4. Summaries, embeddings, and search

⬜ Summaries:

Bash
python -m communities summaries --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --refresh
python -m communities ensure-index --dataset fixed_size --level C1

✅ Neo4j Community nodes now have summary, summaryEmbedding, embeddingDim; vector index ONLINE.

⬜ Search & retrieval:

Bash
python -m communities search "solid-state battery" --dataset fixed_size --level C1
python -m communities retrieve "solid-state battery" --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --k-comms 24 --top-k 50 --rerank --hydrate

✅ Retrieval prints ranked chunks with doc IDs, plus hydrated snippets when --hydrate is provided.
⬜ Troubleshooting: Missing provider keys → set OPENAI_KEY/GEMINI_KEY; ensure dense index exists when using --rerank-mode dense (see rag_retrievers.md).
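
Conceptually, community search ranks summary embeddings by similarity to the query embedding. The pure-Python sketch below is a stand-in for what the Neo4j vector index does, not the CLI's code path. Cosine similarity is assumed as the metric, and the two-dimensional vectors are toy values.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_communities(query_vec, embeddings: dict[str, list[float]], k: int):
    """Return the k (cid, score) pairs with the highest similarity to query_vec."""
    scored = [(cosine(query_vec, emb), cid) for cid, emb in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(cid, score) for score, cid in scored[:k]]


toy_embeddings = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
ranked = top_k_communities([1.0, 0.1], toy_embeddings, k=2)
print([cid for cid, _ in ranked])  # prints ['c1', 'c3']
```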

Validation & Quality Gates

CSV sanity:

Python
import pandas as pd
nodes = pd.read_csv("cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/gr_nodes.csv")
edges = pd.read_csv("cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/gr_edges.csv")
assert nodes.node_id.is_unique
assert (edges.edge_type == "CHUNK_MENTIONS_TECH").any()
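
A further check that may be worth adding is referential integrity: every src and dst in gr_edges.csv should exist as a node_id in gr_nodes.csv. The sketch below uses only the standard library and two small in-memory samples. The sample rows and the helper name dangling_endpoints are illustrative.

```python
import csv
import io


def dangling_endpoints(nodes_handle, edges_handle) -> set[str]:
    """Return edge endpoints (src or dst) that have no matching node_id."""
    node_ids = {row["node_id"] for row in csv.DictReader(nodes_handle)}
    missing: set[str] = set()
    for row in csv.DictReader(edges_handle):
        for column in ("src", "dst"):
            if row[column] not in node_ids:
                missing.add(row[column])
    return missing


# Illustrative samples; with real exports, pass open(...) file handles instead.
nodes_csv = io.StringIO(
    "node_id,node_type,display_name,props_b64\n"
    "chunk:a,chunk,,\n"
    "org:hydrogenco,org,HydrogenCo,\n"
)
edges_csv = io.StringIO(
    "src,dst,edge_type,props_b64\n"
    "chunk:a,org:hydrogenco,CHUNK_MENTIONS_ORG,\n"
    "chunk:a,tech:missing,CHUNK_MENTIONS_TECH,\n"
)
missing = dangling_endpoints(nodes_csv, edges_csv)
print(missing)  # prints {'tech:missing'}
```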

Neo4j checks:
MATCH (c:Chunk)-[:CHUNK_MENTIONS_TECH]->(t:tech) RETURN count(*) > 0 should return true.
MATCH (c:Community {ingest_tag:"comm_fixed_C1_g1_2"}) RETURN count(c) should equal the number of communities reported by the CLI.
Community coverage: Inspect CLI output for coverage_ratio; rerun with adjusted gamma if it is below 0.95.
Summary embeddings: MATCH (c:Community) WHERE c.summaryEmbedding IS NOT NULL RETURN count(c) should match the processed count; SHOW INDEXES (which replaces CALL db.indexes() in Neo4j 5) should list the vector index as ONLINE.

Reproducibility

Use stable ingest tags (fixed_2025_09_14, comm_fixed_C1_g1_2) recorded in commits or runbooks.
Graph CSVs are deterministic given Gold metadata; committing them provides a frozen snapshot for audits.
Community summaries and their embeddings are cached in Neo4j; re-running with --refresh rewrites them, processing communities in a deterministic order via ORDER BY c.size DESC.
Capture CLI stdout to log files (graphbuild_ingest.log, communities_C1.log) for traceability.
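
Stable tags are easier to keep consistent when generated rather than typed. The helpers below reproduce the two example tags above. The naming pattern is inferred from those examples only, and the functions and their prefix parameter are hypothetical.

```python
from datetime import date


def ingest_tag(prefix: str, day: date) -> str:
    """Build a graph ingest tag such as 'fixed_2025_09_14'."""
    return f"{prefix}_{day:%Y_%m_%d}"


def community_tag(prefix: str, level: str, gamma: float) -> str:
    """Build a community tag such as 'comm_fixed_C1_g1_2' (dot becomes underscore)."""
    gamma_part = str(gamma).replace(".", "_")
    return f"comm_{prefix}_{level}_g{gamma_part}"


print(ingest_tag("fixed", date(2025, 9, 14)))  # prints "fixed_2025_09_14"
print(community_tag("fixed", "C1", 1.2))       # prints "comm_fixed_C1_g1_2"
```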

Related notebooks

05_metadata_extraction: entity extraction feeding graph nodes.
06_graphrag_gamma_selection: gamma sweep analysis and recommended ingest tags.
