GraphRAG

GraphRAG builds a heterogeneous chunk–entity graph from Gold artifacts, stores it in Neo4j, and derives multi-scale communities with summaries and embeddings for downstream retrieval.

Inputs & Outputs

Inputs: Gold chunks (gold_subsample_chunk/<dataset>/<date>/chunks_enriched.parquet), OpenAlex topic enrichments (optional joins), Neo4j credentials (NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD).

Outputs:

  • CSV exports next to each Gold bucket: gr_nodes.csv, gr_edges.csv.
  • Neo4j entities: Chunk, person, org, location, tech, subfield, field, domain, topic nodes; CHUNK_MENTIONS_*, CHUNK_IN_*, TECH_IN_SUBFIELD, CHUNK_ABOUT_TOPIC relationships.
  • Community layer: Community nodes with properties (cid, dataset, level, ingest_tag, size, params, summary, summaryEmbedding, embeddingDim, summaryProvider, summaryModel), and (:Chunk)-[:IN_COMMUNITY {level, ingest_tag}]->(:Community) relationships.
  • Inspection helpers: completed_ids.json (enrichment state) and visualization_interactive.html (chunk viewer), both reused during debugging.

Node schema (gr_nodes.csv)

column | dtype | example
node_id | string | chunk:media_2024-02-19_000123_fixed_chunk_0
node_type | string | chunk, person, org, tech, subfield, ...
display_name | string | HydrogenCo
props_b64 | base64 JSON | {"doc_type":"media","chunk_type":"fixed","lang":"en","summary":"…"}, base64-encoded
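
Reading props_b64 back takes a two-step decode. The following is a minimal sketch, assuming standard (not URL-safe) base64 over UTF-8 JSON; the helper names decode_props and encode_props are hypothetical and not part of graphbuild.

```python
import base64
import json


def decode_props(props_b64):
    """Decode a props_b64 cell into a dict; empty or missing cells yield {}."""
    # pandas reads empty CSV cells as NaN (a float), so guard on the type too.
    if not isinstance(props_b64, str) or not props_b64:
        return {}
    return json.loads(base64.b64decode(props_b64).decode("utf-8"))


def encode_props(props):
    """Inverse of decode_props, used here only to build a sample cell."""
    raw = json.dumps(props, ensure_ascii=False).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


cell = encode_props({"doc_type": "media", "chunk_type": "fixed", "lang": "en"})
props = decode_props(cell)
print(props["chunk_type"])
```

With pandas, the same helper can be applied column-wise, for example nodes["props_b64"].map(decode_props).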

Edge schema (gr_edges.csv)

column | dtype | example
src | string | chunk:media_...
dst | string | org:hydrogenco
edge_type | string | CHUNK_MENTIONS_ORG, TECH_IN_SUBFIELD, CHUNK_ABOUT_TOPIC
props_b64 | base64 JSON | e.g., {"score":0.82} for weighted edges
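
As an illustration of how weighted edges might be inspected, the sketch below reads the score out of props_b64 and keeps only the stronger CHUNK_ABOUT_TOPIC edges. The sample rows, the topic:t1 node ID, the 0.5 cut-off, and the default of 1.0 for unweighted edges are assumptions made for the example; the pipeline itself may treat scores differently.

```python
import base64
import json


def edge_score(props_b64, default=1.0):
    """Return the "score" stored in an edge's props_b64 cell, else `default`."""
    if not isinstance(props_b64, str) or not props_b64:
        return default  # assumption: unweighted edges count as weight 1.0
    props = json.loads(base64.b64decode(props_b64))
    return float(props.get("score", default))


def b64(props):
    """Build a sample props_b64 cell for the demonstration rows below."""
    return base64.b64encode(json.dumps(props).encode("utf-8")).decode("ascii")


# Hypothetical rows shaped like gr_edges.csv.
edges = [
    {"src": "chunk:a", "dst": "org:hydrogenco",
     "edge_type": "CHUNK_MENTIONS_ORG", "props_b64": ""},
    {"src": "chunk:a", "dst": "topic:t1",
     "edge_type": "CHUNK_ABOUT_TOPIC", "props_b64": b64({"score": 0.82})},
    {"src": "chunk:b", "dst": "topic:t1",
     "edge_type": "CHUNK_ABOUT_TOPIC", "props_b64": b64({"score": 0.40})},
]

strong = [
    e for e in edges
    if e["edge_type"] == "CHUNK_ABOUT_TOPIC" and edge_score(e["props_b64"]) >= 0.5
]
print([(e["src"], e["dst"]) for e in strong])
```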

Parameters

python -m graphbuild csv / ingest:

  • --dataset {fixed_size|semantic}, --date YYYY-MM-DD (default latest), --ingest-tag, --delete-tag, --batch-size (default 1000).
  • If NEO4J_DATABASE is unset, it is set automatically from NEO4J_DATABASE_FIXED_SIZE. On the VPS (Community Edition), use a single database (graph-fixed-size) and do not set NEO4J_DATABASE_SEMANTIC.
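
The database fallback described above can be pictured roughly as follows. This is a sketch under assumptions: the function name resolve_database is hypothetical, and deriving the variable name from the dataset (NEO4J_DATABASE_<DATASET>) is inferred from the two variables named here rather than taken from the graphbuild source.

```python
import os
from typing import Mapping, Optional


def resolve_database(dataset: str,
                     env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Pick the Neo4j database: explicit NEO4J_DATABASE wins, else the per-dataset variable."""
    env = os.environ if env is None else env
    explicit = env.get("NEO4J_DATABASE")
    if explicit:
        return explicit
    # Assumed naming scheme: fixed_size -> NEO4J_DATABASE_FIXED_SIZE.
    return env.get(f"NEO4J_DATABASE_{dataset.upper()}")


print(resolve_database("fixed_size", {"NEO4J_DATABASE_FIXED_SIZE": "graph-fixed-size"}))
```

Returning None when nothing is configured leaves the choice to the driver, which then uses the server's default database.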

python -m communities communities:

  • --levels comma-separated list of level:gamma pairs (e.g., C0:0.6,C1:1.2,C2:1.6), --min-weight (edge weight threshold), --min-size (minimum entities per community), --ingest-tag (required), --replace / --delete-tag for cleanup.
  • Runs gds.leiden.stream once per level, with that level's gamma (resolution), on the in-memory projection entityCooc_<dataset>.
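
To make the --levels format concrete, here is a minimal sketch of parsing the spec into per-level gamma values and pairing each with a Leiden call. The function names are hypothetical, and the Cypher shows only the gamma setting; the options the CLI actually passes to gds.leiden.stream are not documented here.

```python
from typing import Dict

# Minimal Leiden call; the real CLI may pass more configuration than gamma.
LEIDEN_QUERY = (
    "CALL gds.leiden.stream($graph, {gamma: $gamma}) "
    "YIELD nodeId, communityId RETURN nodeId, communityId"
)


def parse_levels(spec: str) -> Dict[str, float]:
    """Parse 'C0:0.6,C1:1.2' into {'C0': 0.6, 'C1': 1.2}."""
    levels: Dict[str, float] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, gamma = item.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected LEVEL:GAMMA, got {item!r}")
        levels[name.strip()] = float(gamma)
    return levels


def leiden_params(dataset: str, gamma: float) -> dict:
    """Query parameters for one level on the entityCooc_<dataset> projection."""
    return {"graph": f"entityCooc_{dataset}", "gamma": gamma}


for level, gamma in parse_levels("C0:0.6,C1:1.2,C2:1.6").items():
    print(level, leiden_params("fixed_size", gamma))
```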

python -m communities summaries:

  • --level, --ingest-tag, --refresh, --limit, --estimate-only.
  • Summary provider/model from env: OA_COMM_SUMMARY_PROVIDER, OA_COMM_SUMMARY_MODEL, fallback to OPENAI_KEY (gpt-4o-mini) or GEMINI_KEY (gemini-2.5-flash).
  • Embedding provider/model overrides: OA_COMM_EMBED_PROVIDER, OA_COMM_EMBED_MODEL, OA_COMM_EMBED_BATCH.
  • python -m communities ensure-index ensures the Neo4j vector index on Community.summaryEmbedding (dimension auto-detected or provided via --dim).
  • Retrieval helpers: python -m communities search (--top-k, --dataset, --level), python -m communities retrieve (--k-comms, --top-k, --rerank-mode {dense|summary}, --dense-date, --dense-persist, --hydrate).
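
The provider fallback for summaries can be sketched as below. Assumptions are flagged in the comments: the provider strings "openai" and "gemini" are hypothetical, the explicit pair is honoured only when both variables are set, and OPENAI_KEY is checked before GEMINI_KEY because that is the order in which they are listed above.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_summary_model(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (provider, model) for community summaries."""
    env = os.environ if env is None else env
    provider = env.get("OA_COMM_SUMMARY_PROVIDER")
    model = env.get("OA_COMM_SUMMARY_MODEL")
    if provider and model:
        return provider, model
    # Fallbacks; provider labels are illustrative, not the CLI's own values.
    if env.get("OPENAI_KEY"):
        return "openai", "gpt-4o-mini"
    if env.get("GEMINI_KEY"):
        return "gemini", "gemini-2.5-flash"
    raise RuntimeError("no summary provider configured: set OPENAI_KEY or GEMINI_KEY")


print(resolve_summary_model({"GEMINI_KEY": "example-key"}))
```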

Step-by-step Tasks

1. Build CSV exports

  • Preconditions: Gold bucket exists; optional topic joins already written by enrichment.
  • Command:
    Bash
    python -m graphbuild csv --dataset fixed_size --date 2025-09-14
    
  • Expected output: gold_subsample_chunk/fixed_size/2025-09-14/gr_nodes.csv (~#chunks + entities rows) and gr_edges.csv.

2. Ingest into Neo4j

  • Preconditions: Neo4j server reachable; set NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD, optional database mapping envs.
  • Command:
    Bash
    python -m graphbuild ingest --dataset fixed_size --date 2025-09-14 --ingest-tag fixed_2025_09_14
    
  • Expected output: Console prints [ingest] ingesting nodes (...) and [ingest] done.; Neo4j contains Chunk nodes with props_b64 metadata, entity nodes, and relationships.
  • Troubleshooting: Use python -m graphbuild csv first to ensure CSVs exist; set --delete-tag to remove outdated ingest tags; confirm Bolt port in NEO4J_URI (defaults to neo4j://127.0.0.1:7687).
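
When ingest cannot connect, a quick way to confirm the host and Bolt port in NEO4J_URI is to parse the URI and probe the socket. This is a standalone sketch using only the standard library; the function names are hypothetical and the check is independent of graphbuild.

```python
import socket
from urllib.parse import urlparse

BOLT_SCHEMES = {"neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s", "bolt+ssc"}


def bolt_endpoint(uri: str = "neo4j://127.0.0.1:7687"):
    """Return (host, port) from a Neo4j URI, defaulting the port to 7687."""
    parsed = urlparse(uri)
    if parsed.scheme not in BOLT_SCHEMES:
        raise ValueError(f"unexpected scheme in NEO4J_URI: {parsed.scheme!r}")
    return parsed.hostname, parsed.port or 7687


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = bolt_endpoint("neo4j://127.0.0.1:7687")
print(host, port)
```

Calling port_is_open(host, port) then distinguishes an unreachable server from an authentication or database-name problem.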

3. Run community detection

  • Preconditions: Graph ingested; Graph Data Science (GDS) plugin available in Neo4j; an ingest tag chosen for this community run.
  • Command:
    Bash
    python -m communities communities --dataset fixed_size --ingest-tag comm_fixed_C1_g1_2 --levels "C0:0.6,C1:1.2,C2:1.6" --min-weight 1 --min-size 8
    
  • Expected output: Coverage ratios of 0.95 or higher, printed per level; communities stamped with params (min weight, resolution, min size).
  • Troubleshooting: If you see "No :Chunk with a usable chunk_type found", ensure chunk_type matches the dataset or run graphbuild ingest again.
  • Summaries:
    Bash
    python -m communities summaries --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --refresh
    python -m communities ensure-index --dataset fixed_size --level C1
    
  • ✅ Neo4j Community nodes now have summary, summaryEmbedding, embeddingDim; vector index ONLINE.
  • Search & retrieval:
    Bash
    python -m communities search "solid-state battery" --dataset fixed_size --level C1
    python -m communities retrieve "solid-state battery" --dataset fixed_size --level C1 --ingest-tag comm_fixed_C1_g1_2 --k-comms 24 --top-k 50 --rerank --hydrate
    
  • ✅ Retrieval prints ranked chunks with doc IDs and hydrated snippets when --hydrate provided.
  • Troubleshooting: Missing provider keys → set OPENAI_KEY/GEMINI_KEY; ensure dense index exists when using --rerank-mode dense (see rag_retrievers.md).
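
The two-stage idea behind --k-comms and --top-k can be sketched in plain Python: rank communities by similarity between the query and their summary embeddings, then rank only the chunks that belong to the selected communities. This is an illustration, not the CLI's implementation; the in-memory records, the per-chunk embedding field, and the use of cosine similarity are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, communities, chunks, k_comms=24, top_k=50):
    """Stage 1: pick the k_comms closest communities. Stage 2: rank their chunks."""
    ranked = sorted(
        communities,
        key=lambda c: cosine(query_vec, c["summaryEmbedding"]),
        reverse=True,
    )[:k_comms]
    allowed = {c["cid"] for c in ranked}
    candidates = [ch for ch in chunks if ch["cid"] in allowed]
    candidates.sort(key=lambda ch: cosine(query_vec, ch["embedding"]), reverse=True)
    return candidates[:top_k]


# Tiny hypothetical data set with 2-dimensional embeddings.
communities = [
    {"cid": 1, "summaryEmbedding": [1.0, 0.0]},
    {"cid": 2, "summaryEmbedding": [0.0, 1.0]},
]
chunks = [
    {"chunk_id": "a", "cid": 1, "embedding": [0.9, 0.1]},
    {"chunk_id": "b", "cid": 1, "embedding": [0.5, 0.5]},
    {"chunk_id": "c", "cid": 2, "embedding": [1.0, 0.0]},
]
hits = retrieve([1.0, 0.0], communities, chunks, k_comms=1, top_k=50)
print([h["chunk_id"] for h in hits])
```

Chunk c matches the query exactly but is excluded because its community was not selected, which is the trade-off a small --k-comms makes: less to rerank, at the cost of recall outside the chosen communities.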

Validation & Quality Gates

  • CSV sanity:
    Python
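    import pandas as pd

    # Load the exported graph CSVs for one Gold bucket.
    nodes = pd.read_csv("cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/gr_nodes.csv")
    edges = pd.read_csv("cleantech_data/gold_subsample_chunk/fixed_size/2025-09-14/gr_edges.csv")
    # Node IDs must be unique, and at least one tech mention edge must exist.
    assert nodes.node_id.is_unique, "duplicate node_id values in gr_nodes.csv"
    assert (edges.edge_type == "CHUNK_MENTIONS_TECH").any(), "no CHUNK_MENTIONS_TECH edges found"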
    
  • Neo4j checks:
      • MATCH (c:Chunk)-[:CHUNK_MENTIONS_TECH]->(t:tech) RETURN count(*) > 0 should return true.
      • MATCH (c:Community {ingest_tag:"comm_fixed_C1_g1_2"}) RETURN count(c) should equal the number of communities reported by the CLI.
  • Community coverage: Inspect CLI output for coverage_ratio; rerun with adjusted gamma if <0.95.
  • Summary embeddings: MATCH (c:Community) WHERE c.summaryEmbedding IS NOT NULL RETURN count(c) should match the processed count; SHOW INDEXES should list the vector index with state ONLINE.
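
The Neo4j checks above can be bundled into one gate. The sketch below takes a query runner as an argument so it can be exercised without a server; the function names are hypothetical, filtering the embedding count by ingest_tag is an added assumption, and the expected counts are whatever the CLI reported for your run.

```python
CHECKS = {
    "tech_mentions": "MATCH (c:Chunk)-[:CHUNK_MENTIONS_TECH]->(t:tech) "
                     "RETURN count(*) AS n",
    "communities": "MATCH (c:Community {ingest_tag: $tag}) RETURN count(c) AS n",
    "embedded": "MATCH (c:Community {ingest_tag: $tag}) "
                "WHERE c.summaryEmbedding IS NOT NULL RETURN count(c) AS n",
}


def run_checks(run_query, tag, expected_communities, expected_embedded):
    """Run each count query via run_query(cypher, params) -> int; return failure messages."""
    counts = {name: run_query(q, {"tag": tag}) for name, q in CHECKS.items()}
    failures = []
    if counts["tech_mentions"] == 0:
        failures.append("no CHUNK_MENTIONS_TECH relationships found")
    if counts["communities"] != expected_communities:
        failures.append(
            f"community count {counts['communities']} != CLI-reported {expected_communities}"
        )
    if counts["embedded"] != expected_embedded:
        failures.append(
            f"embedded count {counts['embedded']} != processed {expected_embedded}"
        )
    return failures


# Stand-in runner with made-up counts, so the sketch runs without Neo4j.
fake_counts = {CHECKS["tech_mentions"]: 120, CHECKS["communities"]: 42, CHECKS["embedded"]: 40}
failures = run_checks(lambda q, p: fake_counts[q], "comm_fixed_C1_g1_2", 42, 42)
print(failures)
```

Against a live server, run_query could wrap the official Python driver, for example lambda q, p: driver.execute_query(q, p).records[0]["n"], assuming a 5.x driver where execute_query is available.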

Reproducibility

  • Use stable ingest tags (fixed_2025_09_14, comm_fixed_C1_g1_2) recorded in commits or runbooks.
  • Graph CSVs are deterministic given Gold metadata; committing them provides a frozen snapshot for audits.
  • Community summaries and their embeddings are cached in Neo4j; re-running with --refresh rewrites them, processing communities in a deterministic order (ORDER BY c.size DESC).
  • Capture CLI stdout to log files (graphbuild_ingest.log, communities_C1.log) for traceability.
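
One lightweight way to make a frozen snapshot auditable is to record a checksum of each CSV next to the ingest tag in the runbook. This is a suggested sketch, not part of the pipeline; sha256_file is a hypothetical helper, and the temporary file below only stands in for gr_nodes.csv.

```python
import hashlib
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstration on a throwaway file standing in for gr_nodes.csv.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"node_id,node_type,display_name,props_b64\n")
    sample_path = tmp.name

checksum = sha256_file(sample_path)
os.remove(sample_path)
print(checksum)
```

If a later export of the same Gold bucket yields a different digest, the inputs or the export logic have changed since the snapshot was taken.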

See also