LLM Pipeline

The LLM pipeline consumes retrieval context and produces structured outputs for constraints, discovery, ideation, and prioritization. It also powers follow-up answers and optional chat summaries. The logic lives in src/rag/demo/llm_pipeline.py and is called by the app layer after retrieval (React frontend via API in production; Streamlit in legacy test mode).

See Diagram for a retrieval + synthesis overview.

Diagram

LLM pipeline diagram

The diagram shows how retrieval context feeds synthesis stages, how stage memory is built, and how follow-up chat uses that memory.

Inputs

  • ctx_pack: retrieval output with items[].chunk_id, chunk_summary, doc_type, and related metadata (see rag.md).
  • user_query: the user question.
  • weights: feasibility, ROI, strategic_fit, and environmental_impact weights (normalized).
  • vllm_cfg: provider settings (URL, model, API key) plus optional per-stage overrides.
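The input shapes can be sketched as follows. Field names mirror the list above, but the exact dict layout in llm_pipeline.py may differ, and `normalize_weights` is a hypothetical helper illustrating what "normalized" means here:

```python
# Minimal sketch of the pipeline inputs (illustrative, not the real schema).
ctx_pack = {
    "items": [
        {
            "chunk_id": "doc1#c3",
            "chunk_summary": "Patent filing describing a solid-state battery cathode.",
            "doc_type": "patent",
        }
    ]
}

def normalize_weights(raw: dict) -> dict:
    """Scale weights so they sum to 1.0 (the pipeline expects normalized weights)."""
    total = sum(raw.values()) or 1.0
    return {k: v / total for k, v in raw.items()}

weights = normalize_weights(
    {"feasibility": 2, "roi": 1, "strategic_fit": 1, "environmental_impact": 1}
)
```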

Stage flow (run_all)

The main pipeline runs in this order:

  1. constraints
  2. discovery
  3. ideation
  4. prioritization

Each stage emits JSON with evidence_chunk_ids that must map to ctx_pack.items[].chunk_id.
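The mapping invariant can be checked with a small helper. This is a hypothetical sketch; the real validation in llm_pipeline.py also handles aliased ids (see "Evidence validation and aliasing" below):

```python
def validate_evidence_ids(stage_output: dict, ctx_pack: dict) -> list:
    """Return evidence ids in a stage output that do not map to any
    ctx_pack.items[].chunk_id (illustrative check only)."""
    known = {item["chunk_id"] for item in ctx_pack["items"]}
    return [cid for cid in stage_output.get("evidence_chunk_ids", []) if cid not in known]
```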

| Stage | Purpose | Key outputs |
| --- | --- | --- |
| constraints | Extract user constraints from the query only. | timeframe, regions, doc_type_preference, keywords, output_intent, confidence |
| discovery | Summarize evidence from ctx_pack using constraints. | themes, trends, gaps, patent_signals, key_numbers, key_dates |
| ideation | Propose ideas grounded in discovery evidence. | ideas[] with evidence_chunk_ids |
| prioritization | Rank ideas using weights. | ranking[] with scores and evidence_chunk_ids |
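An illustrative constraints-stage output follows. The field names come from the table above, but the values are made up and the exact schema emitted by llm_pipeline.py may differ:

```python
# Example constraints-stage JSON (illustrative values only).
constraints = {
    "timeframe": "2019-2024",
    "regions": ["EU", "US"],
    "doc_type_preference": ["patent"],
    "keywords": ["solid-state battery", "cathode"],
    "output_intent": "ideation",
    "confidence": 0.8,
}
```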

Stage memory and follow-up chat

Stage memory

Stage memory carries context across follow-up turns and combines:

  • pinned evidence from prior synthesis outputs (prioritization, ideation, discovery)
  • optional chat summary
  • optional recent chat messages

The chat summary can be refreshed after follow-up responses and reused for the next turn.
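Combining the three sources above might look like the following. This is a hypothetical layout; the real memory structure in llm_pipeline.py may differ:

```python
from typing import Optional

def build_stage_memory(pinned_evidence: list,
                       chat_summary: Optional[str] = None,
                       recent_messages: Optional[list] = None) -> dict:
    """Assemble follow-up context from pinned evidence, an optional chat
    summary, and optional recent messages (illustrative sketch)."""
    return {
        "pinned_evidence": pinned_evidence,
        "chat_summary": chat_summary,
        "recent_messages": recent_messages or [],
    }
```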

Follow-up answers (chat)

Follow-up answers are generated by run_followup_answer and return:

  • items[] paragraphs with evidence_chunk_ids
  • citations (unique chunk ids)
  • follow_up_questions (4 suggestions)
  • confidence (a value from 0 up to, but not including, 1)

The follow-up stage uses:

  • selected evidence items (see next section)
  • stage memory (chat summary + recent chat messages)
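An illustrative response shape for run_followup_answer, matching the output list above (values are made up; the real schema may differ):

```python
# Example follow-up answer payload (illustrative values only).
answer = {
    "items": [
        {
            "text": "Solid-state cathodes dominate recent filings.",
            "evidence_chunk_ids": ["doc1#c3", "doc2#c1"],
        },
    ],
    "citations": ["doc1#c3", "doc2#c1"],  # unique chunk ids
    "follow_up_questions": [
        "Which assignees file most often?",
        "How do EU and US filings compare?",
        "Which materials recur across patents?",
        "Which gaps remain unaddressed?",
    ],
    "confidence": 0.7,
}
```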

Evidence selection and the optional second rerank

Evidence for follow-up answers is selected in rag/demo/followups.py:

  • Pinned evidence is taken from prior pipeline stages (prioritization, ideation, discovery).
  • If FOLLOWUP_RERANK=1, evidence items are re-ranked using the standard reranker stack.
  • The reranker spec comes from RERANK_SPEC; if unset, it defaults to Cohere when a Cohere API key is present, otherwise to a Hugging Face (HF) reranker.
  • If FOLLOWUP_RERANK is not enabled, a token-overlap heuristic ranks evidence items.

This follow-up rerank is separate from the retrieval rerank described in rag_rerankers.md.
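The token-overlap fallback can be sketched as below. This is a minimal illustration of the heuristic, not the exact scoring in followups.py:

```python
def token_overlap_score(query: str, text: str) -> float:
    """Fraction of query tokens that also appear in the evidence text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def rank_evidence(query: str, items: list) -> list:
    """Order evidence items by token overlap with the follow-up query
    (sketch of the fallback used when FOLLOWUP_RERANK is disabled)."""
    return sorted(
        items,
        key=lambda it: token_overlap_score(query, it["chunk_summary"]),
        reverse=True,
    )
```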

Defaults (stage planner)

Defaults are defined in rag/demo/stage_planner.py:

  • discovery_max_findings: 8
  • n_ideas: 12
  • top_n: 5
  • weights (normalized): feasibility=0.25, roi=0.25, strategic_fit=0.25, environmental_impact=0.25

Evidence validation and aliasing

  • Chunk ids may be replaced with short aliases to reduce prompt size; stage outputs are remapped back to the original ids.
  • LLM_PIPELINE_STRICT_EVIDENCE_IDS=1 enforces that all evidence ids are known.
  • Ideation must use discovery evidence ids; prioritization must use discovery ids and ideation titles.
  • INCLUDE_EVIDENCE_ID_MAP=1 adds _evidence_id_map to the output for alias tracing.
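The remapping step can be sketched as follows. The alias format ("E1") is a hypothetical example; the real aliasing scheme in llm_pipeline.py may differ:

```python
def remap_aliases(ids: list, alias_map: dict) -> list:
    """Map short prompt aliases back to original chunk ids; ids without an
    alias entry pass through unchanged (illustrative sketch)."""
    return [alias_map.get(cid, cid) for cid in ids]
```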

Per-stage generation overrides

You can override generation settings per stage using environment variables:

  • LLM_MAX_TOKENS_{STAGE}
  • LLM_TEMPERATURE_{STAGE}
  • LLM_TOP_P_{STAGE}
  • LLM_TIMEOUT_S_{STAGE}

Stage names are uppercase: CONSTRAINTS, DISCOVERY, IDEATION, PRIORITIZATION, FOLLOWUP, CHAT_SUMMARY.
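Resolving such an override might look like the sketch below; `stage_setting` is a hypothetical helper illustrating the naming scheme, and the real parsing in llm_pipeline.py may coerce types differently:

```python
import os

def stage_setting(name: str, stage: str, default: float) -> float:
    """Read a per-stage override like LLM_TEMPERATURE_DISCOVERY from the
    environment, falling back to the given default (illustrative sketch)."""
    raw = os.environ.get(f"{name}_{stage.upper()}")
    return float(raw) if raw is not None else default
```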

Provider and fallback behavior

  • VLLM_URL (default http://localhost:8000/v1/chat/completions)
  • VLLM_MODEL (default local-model)
  • VLLM_API_KEY or api_key_env in vllm_cfg
  • LLM_VLLM_RETRY_ON_FAILURE (default true)
  • LLM_VLLM_RETRY_BACKOFF_S (default 1.0)
  • LLM_STICKY_FALLBACK_OPENAI (default true)
  • OPENAI_FALLBACK_MODEL (default gpt-4o)
  • VLLM_DISABLE_RESPONSE_FORMAT=1 disables response_format for structured outputs
  • LLM_PIPELINE_TRACE=1 adds _llm_trace to the pipeline output
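The retry-then-fallback behavior implied by LLM_VLLM_RETRY_ON_FAILURE, LLM_VLLM_RETRY_BACKOFF_S, and LLM_STICKY_FALLBACK_OPENAI can be sketched as below; the real client also keeps the fallback "sticky" across subsequent calls:

```python
import time

def call_with_fallback(primary, fallback, retries: int = 1, backoff_s: float = 1.0):
    """Try the vLLM endpoint, retry after a backoff on failure, then fall
    back to the OpenAI path (illustrative sketch, not the real client)."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                time.sleep(backoff_s)
    return fallback()
```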

Where it is used

  • Production app (React frontend + API): runs run_all after retrieval and uses follow-up answers for chat.
  • Legacy Streamlit app: runs the same run_all pipeline for local testing/fallback.
  • Retrieval and rerank details are documented in rag.md and rag_rerankers.md.