GRAINS

Graph-aware RAG for Innovation Scouting

GRAINS is a reproducible data → RAG toolkit. It ingests Kaggle and OpenAlex sources (Bronze), curates Silver tables, merges them into a Unified corpus, and produces Gold assets ready for retrieval: subsampled, chunked, and LLM‑enriched. From there, it builds graph & community assets, dense and BM25 indices, and offers benchmarks with TREC metrics and latency logging.

UI status

Streamlit was used as a test/prototype UI. The production UI is now implemented and deployed as a React frontend backed by the API.

See GRAINS in action

Watch the demo
Request access: Contact form


What you get

  • Bronze — Immutable, date‑bucketed snapshots with per‑run manifests; optional JSONL mirrors.
  • Silver — Canonicalized media & patents; OpenAlex Topics normalized and linked.
  • Unified — Single Parquet across sources for downstream tasks.
  • Gold — Subsample → Chunk (fixed/semantic) → Enrich (summaries, entities, topics).
  • RAG assets — Graph CSV export & Neo4j ingest, Leiden communities, dense & BM25 indexes.
  • Benchmarks — Generate QA, run BM25/Dense/Graph/Hybrid adapters, evaluate (MAP, nDCG@10, P@K, Recall@K), log end‑to‑end & stage latency.
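The ranking metrics named in the Benchmarks item can be sketched in a few lines. This is a minimal illustration that assumes binary relevance judgments (a document is either relevant or not); it is not the GRAINS evaluator itself, and the function names and the toy document IDs are hypothetical. MAP is the mean of the per-query average precision shown here.

```python
import math

# Illustrative sketch only: binary relevance, hypothetical function names.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """DCG of the ranking divided by the DCG of an ideal ranking, both cut at k."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Toy query: four retrieved documents, two of them relevant.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
scores = {
    "P@2": precision_at_k(ranked, relevant, 2),
    "Recall@2": recall_at_k(ranked, relevant, 2),
    "AP": average_precision(ranked, relevant),
    "nDCG@10": ndcg_at_k(ranked, relevant, 10),
}
```

Averaging `average_precision` over all benchmark queries would give MAP. Graded relevance, if the benchmark uses it, would need a gain term in place of the constant 1.0 in `ndcg_at_k`.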

The data root defaults to ./cleantech_data. Override it with CLEANTECH_DATA_DIR or per‑command flags. Provider credentials: KAGGLE_USERNAME/KAGGLE_KEY, GEMINI_KEY, OPENAI_KEY, NEO4J_* (see FAQ).
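As a rough sketch of how such a data root lookup might behave: the helper below is hypothetical, and the precedence it uses (per-command flag first, then CLEANTECH_DATA_DIR, then the default) is an assumption, since the text only says that both can override the default.

```python
import os
from pathlib import Path

DEFAULT_DATA_ROOT = Path("./cleantech_data")

def resolve_data_root(flag_value=None, env=None):
    """Hypothetical helper: pick the data root from a flag, the env var, or the default.

    Assumed precedence (not confirmed by the docs): flag > CLEANTECH_DATA_DIR > default.
    """
    env = os.environ if env is None else env
    if flag_value:
        return Path(flag_value)
    env_value = env.get("CLEANTECH_DATA_DIR")
    if env_value:
        return Path(env_value)
    return DEFAULT_DATA_ROOT

# With no flag, the result depends on whether CLEANTECH_DATA_DIR is set.
data_root = resolve_data_root()
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.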


Start here


RAG stack

LLM pipeline

App