# LLM Pipeline

The LLM pipeline consumes retrieval context and produces structured outputs for constraints, discovery, ideation, and prioritization. It also powers follow-up answers and optional chat summaries. The logic lives in `src/rag/demo/llm_pipeline.py` and is called by the app layer after retrieval (React frontend via API in production; Streamlit in legacy test mode).
See Diagram for a retrieval + synthesis overview.
## Diagram
The diagram shows how retrieval context feeds synthesis stages, how stage memory is built, and how follow-up chat uses that memory.
## Inputs

- `ctx_pack`: retrieval output with `items[].chunk_id`, `chunk_summary`, `doc_type`, and related metadata (see `rag.md`).
- `user_query`: the user question.
- `weights`: feasibility, ROI, strategic_fit, and environmental_impact weights (normalized).
- `vllm_cfg`: provider settings (URL, model, API key) plus optional per-stage overrides.
## Stage flow (`run_all`)
The main pipeline runs in this order:
- constraints
- discovery
- ideation
- prioritization
Each stage emits JSON with `evidence_chunk_ids` that must map to `ctx_pack.items[].chunk_id`.
| Stage | Purpose | Key outputs |
|---|---|---|
| constraints | Extract user constraints from the query only. | timeframe, regions, doc_type_preference, keywords, output_intent, confidence |
| discovery | Summarize evidence from ctx_pack using constraints. | themes, trends, gaps, patent_signals, key_numbers, key_dates |
| ideation | Propose ideas grounded in discovery evidence. | ideas[] with evidence_chunk_ids |
| prioritization | Rank ideas using weights. | ranking[] with scores and evidence_chunk_ids |
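The stage chain above can be sketched as follows. This is a minimal illustration, not the actual `run_all` API: the stage-runner callback, function name, and field access are all assumptions.

```python
# Hypothetical sketch of the four-stage chain. The real run_all in
# llm_pipeline.py has a different signature; run_stage stands in for
# the per-stage LLM calls.
def run_all_sketch(ctx_pack, user_query, weights, run_stage):
    """Run the stages in order, threading outputs forward."""
    known_ids = {item["chunk_id"] for item in ctx_pack["items"]}

    constraints = run_stage("constraints", query=user_query)
    discovery = run_stage("discovery", ctx=ctx_pack, constraints=constraints)
    ideation = run_stage("ideation", discovery=discovery)
    ranking = run_stage("prioritization", ideas=ideation, weights=weights)

    # Evidence-citing stages must only reference known chunk ids.
    for stage_out in (discovery, ideation, ranking):
        for cid in stage_out.get("evidence_chunk_ids", []):
            if cid not in known_ids:
                raise ValueError(f"unknown evidence chunk id: {cid}")

    return {"constraints": constraints, "discovery": discovery,
            "ideation": ideation, "prioritization": ranking}
```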
## Stage memory and follow-up chat

### Stage memory
Stage memory carries context across follow-up turns and combines:
- pinned evidence from prior synthesis outputs (prioritization, ideation, discovery)
- optional chat summary
- optional recent chat messages
The chat summary can be refreshed after follow-up responses and reused for the next turn.
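A minimal sketch of how stage memory could be assembled from these three parts; the field names (`pinned_evidence`, `chat_summary`, `recent_messages`) and the priority order of stages are assumptions, not the actual structures in `llm_pipeline.py`.

```python
# Illustrative stage-memory assembly. Pinned evidence is deduplicated in
# an assumed priority order: prioritization, then ideation, then discovery.
def build_stage_memory(stage_outputs, chat_summary=None, recent_messages=None,
                       max_recent=6):
    """Combine pinned evidence, an optional summary, and recent turns."""
    pinned, seen = [], set()
    for stage in ("prioritization", "ideation", "discovery"):
        for cid in stage_outputs.get(stage, {}).get("evidence_chunk_ids", []):
            if cid not in seen:
                seen.add(cid)
                pinned.append(cid)
    return {
        "pinned_evidence": pinned,
        "chat_summary": chat_summary or "",
        # Keep only the most recent turns to bound prompt size.
        "recent_messages": (recent_messages or [])[-max_recent:],
    }
```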
### Follow-up answers (chat)

Follow-up answers are generated by `run_followup_answer` and return:

- `items[]`: paragraphs with `evidence_chunk_ids`
- `citations`: unique chunk ids
- `follow_up_questions`: 4 suggestions
- `confidence`: 0 to 1
The follow-up stage uses:
- selected evidence items (see next section)
- stage memory (chat summary + recent chat messages)
## Evidence selection and the optional second rerank

Evidence for follow-up answers is selected in `rag/demo/followups.py`:
- Pinned evidence is taken from prior pipeline stages (prioritization, ideation, discovery).
- If `FOLLOWUP_RERANK=1`, evidence items are re-ranked using the standard reranker stack.
- The reranker spec comes from `RERANK_SPEC` (or defaults to Cohere if a key is present, otherwise HF).
- If `FOLLOWUP_RERANK` is not enabled, a token-overlap heuristic ranks evidence items.
This follow-up rerank is separate from the retrieval rerank described in `rag_rerankers.md`.
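The token-overlap fallback can be sketched as follows. The tokenization and scoring here are illustrative, not the exact heuristic in `rag/demo/followups.py`.

```python
import re

# Illustrative token-overlap fallback ranker, used when FOLLOWUP_RERANK
# is not enabled. The real heuristic may weight or tokenize differently.
def token_overlap_rank(query, evidence_items, top_k=8):
    """Rank evidence by shared lowercase word tokens with the query."""
    q_tokens = set(re.findall(r"\w+", query.lower()))

    def score(item):
        text_tokens = set(re.findall(r"\w+", item["chunk_summary"].lower()))
        return len(q_tokens & text_tokens)

    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(evidence_items, key=score, reverse=True)[:top_k]
```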
## Defaults (stage planner)

Defaults are defined in `rag/demo/stage_planner.py`:

- `discovery_max_findings`: 8
- `n_ideas`: 12
- `top_n`: 5
- weights (normalized): feasibility=0.25, roi=0.25, strategic_fit=0.25, environmental_impact=0.25
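The normalized default weights above imply a normalization step like the following; this is a sketch under assumed behavior, not the actual `stage_planner.py` code.

```python
# Illustrative weight normalization: scale the four priority weights so
# they sum to 1.0, falling back to equal weights when none are given.
def normalize_weights(weights):
    keys = ("feasibility", "roi", "strategic_fit", "environmental_impact")
    raw = {k: float(weights.get(k, 0.0)) for k in keys}
    total = sum(raw.values())
    if total <= 0:
        # Matches the documented defaults: 0.25 each.
        return {k: 1.0 / len(keys) for k in keys}
    return {k: v / total for k, v in raw.items()}
```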
## Evidence validation and aliasing

- Chunk ids may be aliased for prompt size; outputs are remapped back to original ids.
- `LLM_PIPELINE_STRICT_EVIDENCE_IDS=1` enforces that all evidence ids are known.
- Ideation must use discovery evidence ids; prioritization must use discovery ids and ideation titles.
- `INCLUDE_EVIDENCE_ID_MAP=1` adds `_evidence_id_map` to the output for alias tracing.
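Aliasing and remapping can be sketched as below; the alias format (`E1`, `E2`, …) and function names are assumptions, not the scheme actually used in `llm_pipeline.py`.

```python
# Illustrative chunk-id aliasing for prompt-size reduction, plus the
# reverse mapping applied to model outputs.
def alias_chunk_ids(chunk_ids):
    """Map long chunk ids to short aliases (E1, E2, ...) for the prompt."""
    to_alias = {cid: f"E{i + 1}" for i, cid in enumerate(chunk_ids)}
    from_alias = {alias: cid for cid, alias in to_alias.items()}
    return to_alias, from_alias

def remap_evidence(output_ids, from_alias, strict=False):
    """Translate model-emitted aliases back to original chunk ids.

    With strict=True (cf. LLM_PIPELINE_STRICT_EVIDENCE_IDS=1), unknown
    ids raise; otherwise they are silently dropped.
    """
    remapped = []
    for eid in output_ids:
        if eid in from_alias:
            remapped.append(from_alias[eid])
        elif strict:
            raise ValueError(f"unknown evidence id: {eid}")
    return remapped
```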
## Per-stage generation overrides

You can override generation settings per stage using environment variables:

- `LLM_MAX_TOKENS_{STAGE}`
- `LLM_TEMPERATURE_{STAGE}`
- `LLM_TOP_P_{STAGE}`
- `LLM_TIMEOUT_S_{STAGE}`
Stage names are uppercase: `CONSTRAINTS`, `DISCOVERY`, `IDEATION`, `PRIORITIZATION`, `FOLLOWUP`, `CHAT_SUMMARY`.
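Resolution of these overrides could look like the following sketch; the default values and the `env` parameter are illustrative, not the pipeline's actual defaults.

```python
import os

# Illustrative per-stage override lookup: each LLM_*_{STAGE} variable,
# if set, replaces the corresponding default generation parameter.
def stage_gen_params(stage, defaults, env=None):
    if env is None:
        env = os.environ
    stage = stage.upper()
    params = dict(defaults)
    for key, prefix, cast in (
        ("max_tokens", "LLM_MAX_TOKENS_", int),
        ("temperature", "LLM_TEMPERATURE_", float),
        ("top_p", "LLM_TOP_P_", float),
        ("timeout_s", "LLM_TIMEOUT_S_", float),
    ):
        raw = env.get(f"{prefix}{stage}")
        if raw is not None:
            params[key] = cast(raw)
    return params
```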
## Provider and fallback behavior

- `VLLM_URL` (default `http://localhost:8000/v1/chat/completions`)
- `VLLM_MODEL` (default `local-model`)
- `VLLM_API_KEY`, or `api_key_env` in `vllm_cfg`
- `LLM_VLLM_RETRY_ON_FAILURE` (default true)
- `LLM_VLLM_RETRY_BACKOFF_S` (default 1.0)
- `LLM_STICKY_FALLBACK_OPENAI` (default true)
- `OPENAI_FALLBACK_MODEL` (default `gpt-4o`)
- `VLLM_DISABLE_RESPONSE_FORMAT=1` disables `response_format` for structured outputs
- `LLM_PIPELINE_TRACE=1` adds `_llm_trace` to the pipeline output
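The retry-then-fallback flow can be sketched as below; the call signatures are hypothetical, and the sticky-fallback bookkeeping (remembering that vLLM is down across calls) is noted but not implemented here.

```python
import time

# Illustrative provider flow: try vLLM, optionally retry once after a
# backoff (LLM_VLLM_RETRY_ON_FAILURE / LLM_VLLM_RETRY_BACKOFF_S), then
# fall back to OpenAI. call_vllm/call_openai are stand-in callables.
def call_with_fallback(call_vllm, call_openai, payload,
                       retry=True, backoff_s=1.0, sleep=time.sleep):
    attempts = 2 if retry else 1
    for attempt in range(attempts):
        try:
            return call_vllm(payload), "vllm"
        except Exception:
            if attempt + 1 < attempts:
                sleep(backoff_s)
    # Sticky fallback (LLM_STICKY_FALLBACK_OPENAI) would additionally
    # remember this failure so later calls skip vLLM; not shown here.
    return call_openai(payload), "openai"
```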
## Where it is used

- Production app (React frontend + API): runs `run_all` after retrieval and uses follow-up answers for chat.
- Legacy Streamlit app: runs the same `run_all` pipeline for local testing/fallback.
- Retrieval and rerank details are documented in `rag.md` and `rag_rerankers.md`.