GraphRAG
GraphRAG builds a heterogeneous chunk–entity graph from Gold artifacts, stores it in Neo4j, and derives multi-scale communities with summaries and embeddings for downstream retrieval.
Inputs & Outputs
Inputs: Gold chunks (`gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet`), OpenAlex topic enrichments (optional joins), Neo4j credentials (`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`).
Outputs:
- CSV exports next to each Gold bucket: `gr_nodes.csv`, `gr_edges.csv`.
- Neo4j entities: `Chunk`, `person`, `org`, `location`, `tech`, `subfield`, `field`, `domain`, `topic` nodes; `CHUNK_MENTIONS_*`, `CHUNK_IN_*`, `TECH_IN_SUBFIELD`, `CHUNK_ABOUT_TOPIC` relationships.
- Community layer: `Community` nodes with properties (`cid`, `dataset`, `level`, `ingest_tag`, `size`, `params`, `summary`, `summaryEmbedding`, `embeddingDim`, `summaryProvider`, `summaryModel`), and `(:Chunk)-[:IN_COMMUNITY {level, ingest_tag}]->(:Community)` relationships.
- Inspection helpers: `completed_ids.json` (enrichment state) and `visualization_interactive.html` (chunk viewer), reused during debugging.
Node schema (gr_nodes.csv)
| column | dtype | example |
|---|---|---|
| `node_id` | string | `chunk:media_2024-02-19_000123_fixed_chunk_0` |
| `node_type` | string | `chunk`, `person`, `org`, `tech`, `subfield`, ... |
| `display_name` | string | `HydrogenCo` |
| `props_b64` | base64 JSON | `{"doc_type":"media","chunk_type":"fixed","lang":"en","summary":"…"}` encoded as base64. |
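The `props_b64` column has to be decoded before its properties can be inspected. A minimal sketch of the round trip follows, assuming the standard base64 alphabet and UTF-8 encoded JSON (the source only says "base64 JSON"); the helper names `encode_props` and `decode_props` are hypothetical and not part of the pipeline.

```python
import base64
import json


def encode_props(props: dict) -> str:
    """Serialize a property dict to base64-encoded JSON (UTF-8 assumed)."""
    raw = json.dumps(props, ensure_ascii=False, separators=(",", ":"))
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def decode_props(props_b64: str) -> dict:
    """Decode a props_b64 cell back into a dict; an empty cell becomes {}."""
    if not props_b64:
        return {}
    return json.loads(base64.b64decode(props_b64).decode("utf-8"))


# Round-trip a property dict shaped like the example in the table above.
cell = encode_props({"doc_type": "media", "chunk_type": "fixed", "lang": "en"})
props = decode_props(cell)
print(props["chunk_type"])  # fixed
```

The same decoder applies to `props_b64` in `gr_edges.csv`, for example to read a `score` from a weighted edge.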
Edge schema (gr_edges.csv)
| column | dtype | example |
|---|---|---|
| `src` | string | `chunk:media_...` |
| `dst` | string | `org:hydrogenco` |
| `edge_type` | string | `CHUNK_MENTIONS_ORG`, `TECH_IN_SUBFIELD`, `CHUNK_ABOUT_TOPIC` |
| `props_b64` | base64 JSON | e.g., `{"score":0.82}` for weighted edges. |
Parameters
`python -m graphbuild csv` / `ingest`:
- `--dataset {fixed_size|semantic}`, `--date YYYY-MM-DD` (default latest), `--ingest-tag`, `--delete-tag`, `--batch-size` (default 1000).
- Automatically sets `NEO4J_DATABASE` from `NEO4J_DATABASE_FIXED_SIZE` if unset. On the VPS (Community), use a single DB (`graph-fixed-size`) and do not set `NEO4J_DATABASE_SEMANTIC`.
`python -m communities communities`:
- `--levels` comma list (e.g., `C0:0.6,C1:1.2,C2:1.6`), `--min-weight` (edge weight threshold), `--min-size` (min entities per community), `--ingest-tag` (required), `--replace` / `--delete-tag` for cleanup.
- Runs `gds.leiden.stream` with the supplied gamma (resolution) per level on the in-memory projection `entityCooc_<dataset>`.
`python -m communities summaries`:
- `--level`, `--ingest-tag`, `--refresh`, `--limit`, `--estimate-only`.
- Summary provider/model from env: `OA_COMM_SUMMARY_PROVIDER`, `OA_COMM_SUMMARY_MODEL`, with fallback to `OPENAI_KEY` (`gpt-4o-mini`) or `GEMINI_KEY` (`gemini-2.5-flash`).
- Embedding provider/model overrides: `OA_COMM_EMBED_PROVIDER`, `OA_COMM_EMBED_MODEL`, `OA_COMM_EMBED_BATCH`.
- `python -m communities ensure-index` ensures the Neo4j vector index on `Community.summaryEmbedding` (dimension auto-detected or provided via `--dim`).
- Retrieval helpers: `python -m communities search` (`--top-k`, `--dataset`, `--level`), `python -m communities retrieve` (`--k-comms`, `--top-k`, `--rerank-mode {dense|summary}`, `--dense-date`, `--dense-persist`, `--hydrate`).
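The `--levels` value pairs each level name with a Leiden gamma, as in `C0:0.6,C1:1.2,C2:1.6`. A minimal sketch of how such a string could be parsed is shown below; `parse_levels` is a hypothetical helper, and the CLI's actual parsing and validation rules may differ.

```python
def parse_levels(spec: str) -> dict[str, float]:
    """Parse 'C0:0.6,C1:1.2' into {'C0': 0.6, 'C1': 1.2}.

    Assumes NAME:GAMMA items separated by commas; blank items are skipped.
    """
    levels: dict[str, float] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, gamma = item.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected NAME:GAMMA, got {item!r}")
        # float() raises ValueError on a missing or malformed gamma.
        levels[name.strip()] = float(gamma)
    return levels


print(parse_levels("C0:0.6,C1:1.2,C2:1.6"))
# {'C0': 0.6, 'C1': 1.2, 'C2': 1.6}
```

Parsing into a mapping keeps the level name and its resolution together, which is convenient when each level is later stamped with its own `params`.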
Step-by-step Tasks
1. Build CSV exports
- ⬜ Preconditions: Gold bucket exists; optional topic joins already written by enrichment.
- ⬜ Command (Bash): `python -m graphbuild csv` with `--dataset` and `--date` for the target bucket.
- ✅ Expected output: `gold_subsample_chunk/fixed_size/2025-09-14/gr_nodes.csv` (approximately one row per chunk plus one per entity) and `gr_edges.csv`.
2. Ingest into Neo4j
- ⬜ Preconditions: Neo4j server reachable; set `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`, and the optional database-mapping env vars.
- ⬜ Command (Bash): `python -m graphbuild ingest`, using the flags listed under Parameters.
- ✅ Expected output: Console prints `[ingest] ingesting nodes (...)` and `[ingest] done.`; Neo4j contains `Chunk` nodes with `props_b64` metadata, entity nodes, and relationships.
- ⬜ Troubleshooting: Run `python -m graphbuild csv` first to ensure the CSVs exist; set `--delete-tag` to remove outdated ingest tags; confirm the Bolt port in `NEO4J_URI` (defaults to `neo4j://127.0.0.1:7687`).
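Steps 1 and 2 share the same flag set, so the two invocations can be assembled from one helper. The sketch below only builds the argument list from flags named under Parameters; `graphbuild_argv` is a hypothetical wrapper, and the flag placement after the subcommand is an assumption about the CLI.

```python
import sys
from typing import Optional


def graphbuild_argv(
    subcommand: str,
    dataset: str,
    date: Optional[str] = None,
    ingest_tag: Optional[str] = None,
    batch_size: Optional[int] = None,
) -> list:
    """Build an argv list for `python -m graphbuild csv|ingest` (not executed here)."""
    if subcommand not in {"csv", "ingest"}:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    if dataset not in {"fixed_size", "semantic"}:
        raise ValueError(f"unknown dataset: {dataset!r}")
    argv = [sys.executable, "-m", "graphbuild", subcommand, "--dataset", dataset]
    if date is not None:
        argv += ["--date", date]  # omitted: the CLI defaults to the latest date
    if ingest_tag is not None:
        argv += ["--ingest-tag", ingest_tag]
    if batch_size is not None:
        argv += ["--batch-size", str(batch_size)]  # CLI default is 1000
    return argv


csv_cmd = graphbuild_argv("csv", "fixed_size", date="2025-09-14")
ingest_cmd = graphbuild_argv(
    "ingest", "fixed_size", date="2025-09-14", ingest_tag="fixed_2025_09_14"
)
print(" ".join(csv_cmd[1:]))
print(" ".join(ingest_cmd[1:]))
```

Each list can be passed to `subprocess.run(argv, check=True)` so that a failed CSV build stops the run before ingestion starts.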
3. Run community detection
- ⬜ Preconditions: Graph ingested; `gds` plugin available in Neo4j; choose ingest tag.
- ⬜ Command (Bash): `python -m communities communities` with `--levels` and the required `--ingest-tag`.
- ✅ Expected output: Coverage ratios of about 0.95 or higher, printed per level; communities stamped with `params` (min weight, resolution, min size).
- ⬜ Troubleshooting: If the CLI reports `No :Chunk with a usable chunk_type found`, ensure `chunk_type` matches the dataset or run `graphbuild ingest` again.
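The 0.95 coverage target can be checked mechanically once the per-level ratios have been copied out of the CLI output. The sketch below assumes they have been collected into a dict by hand or by a log parser (the CLI's output format is not documented here); the function name and the sample ratios are illustrative, not real results.

```python
COVERAGE_THRESHOLD = 0.95  # target stated in this guide


def levels_below_threshold(coverage: dict, threshold: float = COVERAGE_THRESHOLD) -> list:
    """Return the level names whose coverage ratio falls below the threshold."""
    return sorted(level for level, ratio in coverage.items() if ratio < threshold)


# Illustrative numbers only; substitute the ratios printed by the CLI.
observed = {"C0": 0.99, "C1": 0.97, "C2": 0.91}
failing = levels_below_threshold(observed)
print(failing)  # ['C2']
```

Any level returned here is a candidate for a rerun with an adjusted gamma.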
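4. Summaries, embeddings, and search
- ⬜ Summaries: `python -m communities summaries`, then `python -m communities ensure-index` (flags as listed under Parameters).
- ✅ Neo4j `Community` nodes now have `summary`, `summaryEmbedding`, `embeddingDim`; vector index ONLINE.
- ⬜ Search & retrieval: `python -m communities search` and `python -m communities retrieve`.
- ✅ Retrieval prints ranked chunks with doc IDs, plus hydrated snippets when `--hydrate` is provided.
- ⬜ Troubleshooting: Missing provider keys → set `OPENAI_KEY` / `GEMINI_KEY`; ensure the dense index exists when using `--rerank-mode dense` (see rag_retrievers.md).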
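When a summaries run fails on missing keys, it helps to see which backend the environment would select. The sketch below mirrors the fallback described under Parameters; the provider strings `openai` and `gemini`, the OpenAI-before-Gemini precedence, and the function name are assumptions, not the CLI's confirmed behavior.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_summary_backend(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (provider, model) using explicit overrides first, then key-based fallback."""
    env = os.environ if env is None else env
    provider = env.get("OA_COMM_SUMMARY_PROVIDER")
    model = env.get("OA_COMM_SUMMARY_MODEL")
    if provider:
        # Explicit override wins; the model may still be unset.
        return provider, model
    if env.get("OPENAI_KEY"):
        return "openai", model or "gpt-4o-mini"  # assumed provider identifier
    if env.get("GEMINI_KEY"):
        return "gemini", model or "gemini-2.5-flash"  # assumed provider identifier
    raise RuntimeError("no summary provider configured: set OPENAI_KEY or GEMINI_KEY")


# Placeholder key value for illustration only.
print(resolve_summary_backend({"GEMINI_KEY": "placeholder"}))
# ('gemini', 'gemini-2.5-flash')
```

Passing a plain dict instead of `os.environ` makes the resolution easy to check without touching real credentials.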
Validation & Quality Gates
- CSV sanity: inspect `gr_nodes.csv` and `gr_edges.csv` in Python.
- Neo4j checks: `MATCH (c:Chunk)-[:CHUNK_MENTIONS_TECH]->(t:tech) RETURN count(*)` returns a value > 0; `MATCH (c:Community {ingest_tag:"comm_fixed_C1_g1_2"}) RETURN count(c)` equals the number of communities reported by the CLI.
- Community coverage: Inspect CLI output for `coverage_ratio`; rerun with adjusted gamma if <0.95.
- Summary embeddings: `MATCH (c:Community) WHERE c.summaryEmbedding IS NOT NULL RETURN count(c)` should match the processed count; `SHOW INDEXES` shows the vector index `ONLINE`.
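One useful CSV sanity check is referential integrity: every `src` and `dst` in the edge file should appear as a `node_id` in the node file. The sketch below is an assumed check, not the original one; it presumes comma-delimited files with a header row matching the schema tables, and uses in-memory sample rows so that it runs without the real exports.

```python
import csv
import io


def dangling_edges(nodes_fh, edges_fh) -> list:
    """Return (src, dst, edge_type) for edges that reference an unknown node_id."""
    node_ids = {row["node_id"] for row in csv.DictReader(nodes_fh)}
    return [
        (row["src"], row["dst"], row["edge_type"])
        for row in csv.DictReader(edges_fh)
        if row["src"] not in node_ids or row["dst"] not in node_ids
    ]


# Sample rows for illustration; the second edge points at a node that is absent.
nodes_csv = io.StringIO(
    "node_id,node_type,display_name,props_b64\n"
    "chunk:media_2024-02-19_000123_fixed_chunk_0,chunk,,\n"
    "org:hydrogenco,org,HydrogenCo,\n"
)
edges_csv = io.StringIO(
    "src,dst,edge_type,props_b64\n"
    "chunk:media_2024-02-19_000123_fixed_chunk_0,org:hydrogenco,CHUNK_MENTIONS_ORG,\n"
    "chunk:media_2024-02-19_000123_fixed_chunk_0,tech:missing,CHUNK_MENTIONS_TECH,\n"
)
result = dangling_edges(nodes_csv, edges_csv)
print(len(result), "dangling edge(s)")
```

For the real exports, open `gr_nodes.csv` and `gr_edges.csv` with `open(path, newline="", encoding="utf-8")` and pass the handles in; an empty result means every edge endpoint resolves.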
Reproducibility
- Use stable ingest tags (`fixed_2025_09_14`, `comm_fixed_C1_g1_2`) recorded in commits or runbooks.
- Graph CSVs are deterministic given Gold metadata; committing them provides a frozen snapshot for audits.
- Community summaries cache embeddings in Neo4j; re-running with `--refresh` rewrites them while maintaining deterministic order via `ORDER BY c.size DESC`.
- Capture CLI stdout to log files (`graphbuild_ingest.log`, `communities_C1.log`) for traceability.
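Capturing stdout to a log file can be done from Python as well as with shell redirection. The sketch below is one possible wrapper; `run_and_log` is a hypothetical helper, and the demo uses a stand-in command in a temporary directory so that it runs without the pipeline installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(argv: list, log_path) -> int:
    """Run a command, writing stdout and stderr to log_path; return the exit code."""
    log_path = Path(log_path)
    with log_path.open("w", encoding="utf-8") as log:
        proc = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "graphbuild_ingest.log"
        # Stand-in command; replace with the real CLI argv in practice.
        code = run_and_log([sys.executable, "-c", "print('[ingest] done.')"], log_file)
        print(code, log_file.read_text(encoding="utf-8").strip())
```

Merging stderr into the same file keeps error messages next to the progress lines they relate to, which makes a failed run easier to trace.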
Related notebooks
- 05_metadata_extraction — entity extraction feeding graph nodes.
- 06_graphrag_gamma_selection — gamma sweep analysis and recommended ingest tags.
See also
- Upstream Gold pipeline.
- Retrieval integration: RAG overview, Retrievers.
- Bench evaluation hooks: RAG benchmarks.