LLM Pipeline

The LLM pipeline consumes retrieval context and produces structured outputs for constraints, discovery, ideation, and prioritization. It also powers follow-up answers and optional chat summaries. The logic lives in src/rag/demo/llm_pipeline.py and is called by the app layer after retrieval (React frontend via API in production; Streamlit in legacy test mode).

See Diagram for a retrieval + synthesis overview.

Diagram

LLM pipeline diagram

The diagram shows how retrieval context feeds synthesis stages, how stage memory is built, and how follow-up chat uses that memory.

Inputs

  • ctx_pack: retrieval output with items[].chunk_id, chunk_summary, doc_type, and related metadata (see rag.md).
  • user_query: the user question.
  • weights: feasibility, ROI, strategic_fit, and environmental_impact weights (normalized).
  • vllm_cfg: provider settings (URL, model, API key) plus optional per-stage overrides.
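The input shapes can be sketched as follows. Field names mirror the list above, but the exact dict layout in llm_pipeline.py may differ, and `normalize_weights` is a hypothetical helper illustrating what "normalized" means here:

```python
# Minimal sketch of the pipeline inputs (illustrative, not the real schema).
ctx_pack = {
    "items": [
        {
            "chunk_id": "doc1#c3",
            "chunk_summary": "Patent filing describing a solid-state battery cathode.",
            "doc_type": "patent",
        }
    ]
}

def normalize_weights(raw: dict) -> dict:
    """Scale weights so they sum to 1.0 (the pipeline expects normalized weights)."""
    total = sum(raw.values()) or 1.0
    return {k: v / total for k, v in raw.items()}

weights = normalize_weights(
    {"feasibility": 2, "roi": 1, "strategic_fit": 1, "environmental_impact": 1}
)
```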

Stage flow (run_all)

The main pipeline runs in this order:

  1. constraints
  2. discovery
  3. ideation
  4. prioritization

Each stage emits JSON with evidence_chunk_ids that must map to ctx_pack.items[].chunk_id.
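The mapping invariant can be checked with a small helper. This is a hypothetical sketch; the real validation in llm_pipeline.py also handles aliased ids (see "Evidence validation and aliasing" below):

```python
def validate_evidence_ids(stage_output: dict, ctx_pack: dict) -> list:
    """Return evidence ids in a stage output that do not map to any
    ctx_pack.items[].chunk_id (illustrative check only)."""
    known = {item["chunk_id"] for item in ctx_pack["items"]}
    return [cid for cid in stage_output.get("evidence_chunk_ids", []) if cid not in known]
```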

| Stage | Purpose | Key outputs |
| --- | --- | --- |
| constraints | Extract user constraints from the query only. | timeframe, regions, doc_type_preference, keywords, output_intent, confidence |
| discovery | Summarize evidence from ctx_pack using constraints. | themes, trends, gaps, patent_signals, key_numbers, key_dates |
| ideation | Propose ideas grounded in discovery evidence. | ideas[] with evidence_chunk_ids |
| prioritization | Rank ideas using weights. | ranking[] with scores and evidence_chunk_ids |
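An illustrative constraints-stage output follows. The field names come from the table above, but the values are made up and the exact schema emitted by llm_pipeline.py may differ:

```python
# Example constraints-stage JSON (illustrative values only).
constraints = {
    "timeframe": "2019-2024",
    "regions": ["EU", "US"],
    "doc_type_preference": ["patent"],
    "keywords": ["solid-state battery", "cathode"],
    "output_intent": "ideation",
    "confidence": 0.8,
}
```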

Stage memory and follow-up chat

Stage memory

Stage memory carries context across follow-up turns and combines:

  • pinned evidence from prior synthesis outputs (prioritization, ideation, discovery)
  • optional chat summary
  • optional recent chat messages

The chat summary can be refreshed after follow-up responses and reused for the next turn.
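Combining the three sources above might look like the following. This is a hypothetical layout; the real memory structure in llm_pipeline.py may differ:

```python
from typing import Optional

def build_stage_memory(pinned_evidence: list,
                       chat_summary: Optional[str] = None,
                       recent_messages: Optional[list] = None) -> dict:
    """Assemble follow-up context from pinned evidence, an optional chat
    summary, and optional recent messages (illustrative sketch)."""
    return {
        "pinned_evidence": pinned_evidence,
        "chat_summary": chat_summary,
        "recent_messages": recent_messages or [],
    }
```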

Follow-up answers (chat)

Follow-up answers are generated by run_followup_answer and return:

  • items[] paragraphs with evidence_chunk_ids
  • citations (unique chunk ids)
  • follow_up_questions (4 suggestions)
  • confidence (a value from 0 up to, but not including, 1)

The follow-up stage uses:

  • selected evidence items (see next section)
  • stage memory (chat summary + recent chat messages)
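An illustrative response shape for run_followup_answer, matching the output list above (values are made up; the real schema may differ):

```python
# Example follow-up answer payload (illustrative values only).
answer = {
    "items": [
        {
            "text": "Solid-state cathodes dominate recent filings.",
            "evidence_chunk_ids": ["doc1#c3", "doc2#c1"],
        },
    ],
    "citations": ["doc1#c3", "doc2#c1"],  # unique chunk ids
    "follow_up_questions": [
        "Which assignees file most often?",
        "How do EU and US filings compare?",
        "Which materials recur across patents?",
        "Which gaps remain unaddressed?",
    ],
    "confidence": 0.7,
}
```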

Evidence selection and the optional second rerank

Evidence for follow-up answers is selected in rag/demo/followups.py:

  • Pinned evidence is taken from prior pipeline stages (prioritization, ideation, discovery).
  • If FOLLOWUP_RERANK=1, evidence items are re-ranked using the standard reranker stack.
  • The reranker spec comes from RERANK_SPEC; if unset, it defaults to Cohere when a Cohere API key is present, otherwise to a Hugging Face (HF) reranker.
  • If FOLLOWUP_RERANK is not enabled, a token-overlap heuristic ranks evidence items.

This follow-up rerank is separate from the retrieval rerank described in rag_rerankers.md.
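The token-overlap fallback can be sketched as below. This is a minimal illustration of the heuristic, not the exact scoring in followups.py:

```python
def token_overlap_score(query: str, text: str) -> float:
    """Fraction of query tokens that also appear in the evidence text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def rank_evidence(query: str, items: list) -> list:
    """Order evidence items by token overlap with the follow-up query
    (sketch of the fallback used when FOLLOWUP_RERANK is disabled)."""
    return sorted(
        items,
        key=lambda it: token_overlap_score(query, it["chunk_summary"]),
        reverse=True,
    )
```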

Defaults (stage planner)

Defaults are defined in rag/demo/stage_planner.py:

  • discovery_max_findings: 8
  • n_ideas: 12
  • top_n: 5
  • weights (normalized): feasibility=0.25, roi=0.25, strategic_fit=0.25, environmental_impact=0.25

Evidence validation and aliasing

  • Chunk ids may be replaced with short aliases to reduce prompt size; stage outputs are remapped back to the original ids.
  • LLM_PIPELINE_STRICT_EVIDENCE_IDS=1 enforces that all evidence ids are known.
  • Ideation must use discovery evidence ids; prioritization must use discovery ids and ideation titles.
  • INCLUDE_EVIDENCE_ID_MAP=1 adds _evidence_id_map to the output for alias tracing.
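The remapping step can be sketched as follows. The alias format ("E1") is a hypothetical example; the real aliasing scheme in llm_pipeline.py may differ:

```python
def remap_aliases(ids: list, alias_map: dict) -> list:
    """Map short prompt aliases back to original chunk ids; ids without an
    alias entry pass through unchanged (illustrative sketch)."""
    return [alias_map.get(cid, cid) for cid in ids]
```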

Per-stage generation overrides

You can override generation settings per stage using environment variables:

  • LLM_MAX_TOKENS_{STAGE}
  • LLM_TEMPERATURE_{STAGE}
  • LLM_TOP_P_{STAGE}
  • LLM_TIMEOUT_S_{STAGE}

Stage names are uppercase: CONSTRAINTS, DISCOVERY, IDEATION, PRIORITIZATION, FOLLOWUP, CHAT_SUMMARY.
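Resolving such an override might look like the sketch below; `stage_setting` is a hypothetical helper illustrating the naming scheme, and the real parsing in llm_pipeline.py may coerce types differently:

```python
import os

def stage_setting(name: str, stage: str, default: float) -> float:
    """Read a per-stage override like LLM_TEMPERATURE_DISCOVERY from the
    environment, falling back to the given default (illustrative sketch)."""
    raw = os.environ.get(f"{name}_{stage.upper()}")
    return float(raw) if raw is not None else default
```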

Provider and fallback behavior

  • VLLM_URL (default http://localhost:8000/v1/chat/completions)
  • VLLM_MODEL (default local-model)
  • VLLM_API_KEY or api_key_env in vllm_cfg
  • LLM_VLLM_RETRY_ON_FAILURE (default true)
  • LLM_VLLM_RETRY_BACKOFF_S (default 1.0)
  • LLM_STICKY_FALLBACK_OPENAI (default true)
  • OPENAI_FALLBACK_MODEL (default gpt-4o)
  • VLLM_DISABLE_RESPONSE_FORMAT=1 disables response_format for structured outputs
  • LLM_PIPELINE_TRACE=1 adds _llm_trace to the pipeline output
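The retry-then-fallback behavior implied by LLM_VLLM_RETRY_ON_FAILURE, LLM_VLLM_RETRY_BACKOFF_S, and LLM_STICKY_FALLBACK_OPENAI can be sketched as below; the real client also keeps the fallback "sticky" across subsequent calls:

```python
import time

def call_with_fallback(primary, fallback, retries: int = 1, backoff_s: float = 1.0):
    """Try the vLLM endpoint, retry after a backoff on failure, then fall
    back to the OpenAI path (illustrative sketch, not the real client)."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                time.sleep(backoff_s)
    return fallback()
```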

Where it is used

  • Production app (React frontend + API): runs run_all after retrieval and uses follow-up answers for chat.
  • Legacy Streamlit app: runs the same run_all pipeline for local testing/fallback.
  • Retrieval and rerank details are documented in rag.md and rag_rerankers.md.