Investment analysts face a daunting challenge: synthesizing insights from hundreds of pages of dense research reports with conflicting structures and terminology. As one developer discovered while comparing 2026 macroeconomic outlooks from Goldman Sachs, UBS, and Barclays, manually tracking claims and their page references across reports quickly became unsustainable. This friction inspired an ambitious experiment: Could large language models (LLMs) automate synthesis while maintaining verifiable accuracy?

The developer constructed a six-pass Python pipeline around GPT-5.1 (the model name as reported; presumably a current-generation model) to transform unstructured PDFs into comparable insights (a code sketch of the pass structure follows the list):

  1. Structural Normalization: Each report was summarized into a standardized template covering macro policies, asset classes, and risk factors
  2. Vector Definition: The LLM proposed comparison axes (like "Policy Path" or "AI CapEx") which were refined through human-guided iterations
  3. Evidence Extraction: For each report-vector pair, the system pulled claims with mandatory page references (e.g., [p.42]), coverage scores, and quantitative data
  4. Synthesis Generation: Cross-report analyses highlighted consensus, disagreements, and conditional risks
  5. Interactive Rendering: Results were output as HTML with accordion-style comparison tables
  6. Citation Linking: Page references became clickable deep links into source PDFs
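
The write-up doesn't include the pipeline code itself, but the six passes map naturally onto a chain of prompt-driven functions. The sketch below is a minimal reconstruction, assuming each pass is a single call to a generic llm(prompt) -> str callable; the function names, prompts, and data shapes are illustrative rather than the original implementation.

# Minimal sketch of passes 1-3; passes 4-6 (synthesis, HTML, citation links)
# consume the evidence dict this returns. `llm` is any callable that sends a
# prompt to the model and returns its text response.
def normalize(report_text, llm):
    # Pass 1: summarize into a fixed template (macro policy, asset classes, risks)
    return llm("Summarize this report into the standard template:\n" + report_text)

def define_vectors(summaries, llm):
    # Pass 2: propose comparison axes, then refine them with a human in the loop
    proposal = llm("Propose comparison axes for these summaries:\n\n" + "\n\n".join(summaries.values()))
    return [line.strip() for line in proposal.splitlines() if line.strip()]

def extract_claims(summary, vector, llm):
    # Pass 3: claims with mandatory [p.N] references, coverage scores, and figures
    return llm(f"For the axis '{vector}', list claims with [p.N] references:\n" + summary)

def run_pipeline(reports, llm):
    summaries = {name: normalize(text, llm) for name, text in reports.items()}
    vectors = define_vectors(summaries, llm)
    return {(name, vector): extract_claims(summary, vector, llm)
            for name, summary in summaries.items() for vector in vectors}

Keeping the model behind a plain callable also keeps each pass individually testable: a stub llm can be swapped in to verify the plumbing before real reports go through.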

Crucially, the pipeline enforced mechanical accountability:

# Simplified extraction logic: every claim must carry a page reference
def format_reference(page_number, claim_uncertain):
    if claim_uncertain:
        return "[p.?]"  # placeholder flags the claim for manual review
    return f"[p.{page_number}]"  # later rewritten as a PDF deep link
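
Downstream, passes 5 and 6 turn these markers into the interactive output. The snippet below is a minimal sketch, assuming the comparison tables use native HTML <details>/<summary> accordions and that the reader's PDF viewer honors the standard #page=N open parameter; the regex and function names are illustrative, not taken from the project.

# Sketch of passes 5-6: accordion rows plus clickable page references
import re

def link_references(html_fragment, pdf_path):
    # Rewrite "[p.42]"-style markers as deep links into the source PDF
    return re.sub(r"\[p\.(\d+)\]",
                  rf'<a href="{pdf_path}#page=\1">[p.\1]</a>',
                  html_fragment)

def render_row(vector, claims_html):
    # One collapsible row per comparison axis
    return f"<details><summary>{vector}</summary>{claims_html}</details>"

A convenient side effect of the strict format is that uncertain [p.?] markers don't match the digit-only pattern, so they remain visibly unlinked until a human resolves them.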

This structure yielded striking results across 140 extractions: Spot checks of 100 citations revealed single-digit error rates. As the developer noted:

"Mandatory format, page markers in source text, and spot-check verification produced very low hallucination rates... Any task requiring traceable claims can now start to use LLMs for the heavy lifting, as long as the system is designed to enforce citations."

Three technical breakthroughs enabled this reliability:

  • Explicit Page Anchors: PDFs were preprocessed with page-break markers (e.g., --- Page 42 ---) so every citation had a concrete anchor (a preprocessing sketch follows this list)
  • Coverage Thresholds: Claims with coverage below 50% were marked "n.a." rather than reported as if well supported (see the threshold sketch below)
  • Verification Workflow: The [p.?] placeholder flagged uncertain references for human review
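
The page anchors are easy to reproduce mechanically. The snippet below is a sketch of that preprocessing step, assuming pypdf for text extraction; the library choice and helper name are assumptions, since the write-up only specifies the marker idea.

# Sketch: insert explicit page-break markers so extracted claims can cite pages
from pypdf import PdfReader

def pdf_to_marked_text(path):
    reader = PdfReader(path)
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        pages.append(f"--- Page {number} ---\n" + (page.extract_text() or ""))
    return "\n\n".join(pages)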
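
The coverage threshold is similarly mechanical. A small sketch, assuming coverage arrives as a 0-1 score; the 0.5 cutoff mirrors the 50% rule above, and the function name is illustrative.

# Sketch: suppress low-signal claims rather than reporting them as findings
def gated_claim(claim_text, coverage):
    if coverage < 0.5:
        return "n.a."  # not enough support in the source to state a claim
    return claim_text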

Despite success, limitations emerged:

  • Visual data in charts wasn't captured
  • Numeric data remained unstructured text rather than computable values
  • Cross-vector analysis wasn't implemented

This experiment shows that hallucination isn't an inevitable LLM trait; it's a solvable engineering challenge. For developers building research tools, the implications are profound: Rigorous citation architectures could transform domains from academic literature reviews to competitive intelligence. As financial institutions and research teams adopt these techniques, we may soon see AI move from distrusted assistant to verifiable collaborator.

Source: Methodology documented at 2026 Macro Analysis Project