GRAINS
Graph-aware RAG for Innovation Scouting
GRAINS is a reproducible data → RAG toolkit. It ingests Kaggle and OpenAlex sources (Bronze), curates Silver tables, merges them into a Unified corpus, and produces Gold assets ready for retrieval: subsampled, chunked, and LLM‑enriched. From there, it builds graph & community assets, dense and BM25 indices, and offers benchmarks with TREC metrics and latency logging.
UI status
Streamlit was used as a test/prototype UI. The production UI is now implemented and deployed as a React frontend backed by the API.
See GRAINS in action
Watch the demo · Request access: Contact form
What you get
- Bronze — Immutable, date‑bucketed snapshots with per‑run manifests; optional JSONL mirrors.
- Silver — Canonicalized media & patents; OpenAlex Topics normalized and linked.
- Unified — Single Parquet across sources for downstream tasks.
- Gold — Subsample → Chunk (fixed/semantic) → Enrich (summaries, entities, topics).
- RAG assets — Graph CSV export & Neo4j ingest, Leiden communities, dense & BM25 indexes.
- Benchmarks — Generate QA, run BM25/Dense/Graph/Hybrid adapters, evaluate (MAP, nDCG@10, P@K, Recall@K), log end‑to‑end & stage latency.
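To make the benchmark metrics listed above concrete, here is a minimal sketch of how P@K, Recall@K, average precision (the per-query term behind MAP), and nDCG@10 are commonly computed for binary relevance. This is not GRAINS' own evaluation code: the function names are hypothetical, relevance is assumed to be binary, and each ranked list is assumed to contain no duplicate document IDs.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values at each rank where a relevant doc appears.

    MAP is the mean of this value over all benchmark queries.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Illustrative data only: a made-up ranking and relevance set.
ranked_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = {"d1", "d3"}
print(precision_at_k(ranked_ids, relevant_ids, 2))
print(recall_at_k(ranked_ids, relevant_ids, 2))
print(average_precision(ranked_ids, relevant_ids))
print(ndcg_at_k(ranked_ids, relevant_ids, 10))
```

Graded relevance judgments would need a gain term in place of the constant 1.0 in `ndcg_at_k`; the binary form is used here only to keep the sketch short.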
Data root defaults to ./cleantech_data. Override with CLEANTECH_DATA_DIR or per‑command flags. Providers: KAGGLE_USERNAME / KAGGLE_KEY, GEMINI_KEY, OPENAI_KEY, NEO4J_* (see FAQ).
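The data-root override described above could be resolved along these lines. This is a sketch under stated assumptions rather than the toolkit's actual code: `resolve_data_root` is a hypothetical helper, and the precedence (per-command flag first, then `CLEANTECH_DATA_DIR`, then the default) is assumed, since the text only says that both overrides exist.

```python
import os
from pathlib import Path
from typing import Optional

# Default stated in the docs.
DEFAULT_DATA_ROOT = Path("./cleantech_data")


def resolve_data_root(cli_value: Optional[str] = None) -> Path:
    """Hypothetical helper: pick the data root.

    Assumed precedence: explicit per-command flag value,
    then the CLEANTECH_DATA_DIR environment variable, then the default.
    """
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = os.environ.get("CLEANTECH_DATA_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_DATA_ROOT


# Example: with neither a flag nor the variable set, the default is used.
os.environ.pop("CLEANTECH_DATA_DIR", None)
print(resolve_data_root())
```

Setting `CLEANTECH_DATA_DIR` once in the shell keeps every command pointed at the same location, while a flag value suits one-off runs against a different directory.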
Start here
- Read the Guide and the Architecture overview.
- Skim the CLI cheatsheet for end‑to‑end commands: CLI.
- Deep dive into callable modules: API Reference.
RAG stack
LLM pipeline
App
Handy links
- CLI · API Reference · FAQ · Changelog · References