GRAINS

Graph-aware RAG for Innovation Scouting

GRAINS is a reproducible data → RAG toolkit. It ingests Kaggle and OpenAlex sources (Bronze), curates Silver tables, merges them into a Unified corpus, and produces Gold assets ready for retrieval: subsampled, chunked, and LLM‑enriched. From there, it builds graph & community assets, dense and BM25 indices, and offers benchmarks with TREC metrics and latency logging.

UI status

Streamlit was used as a test/prototype UI. The production UI is now implemented and deployed as a React frontend backed by the API.

See GRAINS in action

Watch the demo · Request access: Contact form

What you get

Bronze — Immutable, date‑bucketed snapshots with per‑run manifests; optional JSONL mirrors.
Silver — Canonicalized media & patents; OpenAlex Topics normalized and linked.
Unified — Single Parquet across sources for downstream tasks.
Gold — Subsample → Chunk (fixed/semantic) → Enrich (summaries, entities, topics).
RAG assets — Graph CSV export & Neo4j ingest, Leiden communities, dense & BM25 indexes.
Benchmarks — Generate QA, run BM25/Dense/Graph/Hybrid adapters, evaluate (MAP, nDCG@10, P@K, Recall@K), log end‑to‑end & stage latency.
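For orientation, the benchmark metrics listed above can be sketched in a few lines of Python. This is a minimal illustration using the textbook definitions, not the GRAINS evaluation code: the function names are hypothetical, nDCG assumes linear gain with a log2 rank discount, and each ranked list is assumed to contain no duplicate document IDs. MAP is the mean of `average_precision` over all queries.

```python
import math
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: four retrieved documents, two of them relevant.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(recall_at_k(ranked, relevant, 2))      # 0.5
```

Evaluation tools differ in details such as gain scaling and tie handling, so scores from this sketch may not match a given TREC evaluator exactly.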

The data root defaults to ./cleantech_data. Override it with the CLEANTECH_DATA_DIR environment variable or with per‑command flags. Provider settings are read from KAGGLE_USERNAME/KAGGLE_KEY, GEMINI_KEY, OPENAI_KEY, and NEO4J_* (see FAQ).
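The data-root lookup described above might look like the following sketch. The helper name `resolve_data_root` is hypothetical, and the precedence shown (explicit flag first, then CLEANTECH_DATA_DIR, then the default) is an assumption; check the CLI reference for the actual behavior.

```python
import os
from pathlib import Path
from typing import Optional

# Default location stated in the docs.
DEFAULT_DATA_ROOT = Path("./cleantech_data")


def resolve_data_root(cli_override: Optional[str] = None) -> Path:
    """Hypothetical helper; assumed order: flag > CLEANTECH_DATA_DIR > default."""
    if cli_override:
        return Path(cli_override)
    env_value = os.environ.get("CLEANTECH_DATA_DIR")
    if env_value:
        return Path(env_value)
    return DEFAULT_DATA_ROOT


print(resolve_data_root())
```

Resolving the path in one place keeps every pipeline stage pointed at the same Bronze, Silver, Unified, and Gold directories.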

Start here

Read the Guide and the Architecture overview.
Skim the CLI cheatsheet for end‑to‑end commands: CLI.
Dig into the callable modules: API Reference.

RAG stack

LLM pipeline

Pipeline overview

App

Handy links

CLI · API Reference · FAQ · Changelog · References