A benchmark analysis of 11 large language models reveals that government-facing chatbots produce excessively verbose and factually inconsistent responses, and that instructing them to be concise makes accuracy worse.

Artificial intelligence chatbots deployed for government services generate excessively verbose responses that bury critical information, while attempts to force conciseness significantly degrade factual accuracy, according to comprehensive benchmarking by the Open Data Institute (ODI). The study evaluated 11 large language models across 22,000 citizen queries modeled after GOV.UK content, exposing fundamental reliability issues for public sector implementations.
Performance Benchmark Methodology
ODI researchers constructed CitizenQuery-UK, a dataset of 22,066 synthetically generated citizen questions aligned with UK government services. Each query received responses from models including Anthropic's Claude 4.5 Haiku, Meta's Llama 3.1 8B, OpenAI's GPT-4.1, and Alibaba's Qwen3-32B. Each response was scored on three dimensions (a simplified scoring sketch follows the list):
- Verbosity: Word count and information density relative to authoritative GOV.UK sources
- Accuracy: Factual correctness verified against government documentation
- Refusal Rate: Willingness to decline unanswerable queries
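To make the rubric concrete, here is a minimal, hypothetical scoring harness in Python. The refusal patterns, the word-count verbosity ratio, and the keyword-based fact-coverage proxy are illustrative assumptions, not ODI's published methodology.

```python
# Hypothetical scoring harness for one response; the metric definitions here
# are simplified assumptions rather than ODI's actual rubric.
import re
from dataclasses import dataclass

# Crude refusal detector: looks for common declining phrases.
REFUSAL_PATTERNS = re.compile(
    r"\b(i don't know|i cannot answer|unable to (?:answer|help))\b",
    re.IGNORECASE,
)

@dataclass
class Scores:
    verbosity_ratio: float  # response length relative to the GOV.UK source passage
    fact_coverage: float    # share of expected key facts present in the response
    refused: bool           # whether the model declined to answer

def score_response(response: str, source_text: str, key_facts: list[str]) -> Scores:
    """Score one model response against an authoritative GOV.UK passage."""
    resp_words = len(response.split())
    src_words = max(len(source_text.split()), 1)
    hits = sum(1 for fact in key_facts if fact.lower() in response.lower())
    return Scores(
        verbosity_ratio=resp_words / src_words,
        fact_coverage=hits / max(len(key_facts), 1),
        refused=bool(REFUSAL_PATTERNS.search(response)),
    )

print(score_response(
    response="The Sure Start Maternity Grant is a one-off payment of £500 ...",
    source_text="The Sure Start Maternity Grant is a one-off payment of £500.",
    key_facts=["£500", "one-off payment"],
))
```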
Testing revealed systemic verbosity across all models, with Claude 4.5 Haiku producing the most extreme "word salad" outputs. When researchers appended "be concise" instructions to prompts, accuracy dropped by 22-38% across models as key details were omitted.
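The conciseness ablation can be pictured as follows: run each query twice, once as-is and once with a brevity instruction appended, then compare a per-variant accuracy score. The suffix wording, the `query_model` placeholder, and the fact-coverage accuracy proxy below are assumptions for illustration, not ODI's exact protocol.

```python
# Illustrative sketch of the "be concise" ablation; not ODI's exact protocol.
from statistics import mean

CONCISE_SUFFIX = " Be concise."  # assumed wording of the brevity instruction

def query_model(prompt: str) -> str:
    """Placeholder: call the model under test and return its reply."""
    raise NotImplementedError("wire this up to your inference client")

def fact_coverage(response: str, key_facts: list[str]) -> float:
    """Crude accuracy proxy: share of expected key facts mentioned verbatim."""
    return sum(f.lower() in response.lower() for f in key_facts) / max(len(key_facts), 1)

def mean_accuracy_drop(queries: list[tuple[str, list[str]]]) -> float:
    """Relative drop in mean fact coverage when the brevity instruction is added."""
    baseline = [fact_coverage(query_model(q), facts) for q, facts in queries]
    concise = [fact_coverage(query_model(q + CONCISE_SUFFIX), facts) for q, facts in queries]
    return 1.0 - mean(concise) / max(mean(baseline), 1e-9)
```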
Critical Accuracy Failures
Despite being trained on data that includes GOV.UK content, the models produced inconsistent, factually incorrect claims with potentially severe real-world consequences:
| Model | Erroneous Claim | Correct Government Policy |
|---|---|---|
| GPT-OSS-20B | Guardian's Allowance requires a deceased child | Eligibility depends on deceased parents, not a deceased child |
| Llama 3.1 8B | Court order needed for birth certificate changes | Only birth re-registration required |
| Qwen3-32B | £500 Sure Start Grant available in Scotland | England/Wales/NI only; Scotland excluded |
Alarmingly, models attempted to answer 98.7% of queries regardless of competency, displaying near-zero refusal rates for unanswerable questions. This "failure to refuse" creates high-risk scenarios where citizens might act on authoritative-sounding misinformation regarding benefits, legal procedures, or public services.
Efficiency and Deployment Implications
Benchmark data revealed that smaller open-source models like Llama 3.1 8B achieved accuracy comparable to proprietary giants like GPT-4.1 at 34% lower inference cost. This demonstrates that model size alone doesn't guarantee government-service reliability and highlights the importance of avoiding vendor lock-in through flexible deployment architectures.
For government implementations like the planned GOV.UK chatbot launching in 2026, ODI recommends the following (a minimal guardrail sketch follows the list):
- Source Anchoring: Strict adherence to verbatim GOV.UK content without extrapolation
- Uncertainty Signaling: Explicit confidence indicators when responses extend beyond core sources
- Refusal Protocols: Mandatory "I don't know" responses for unverifiable queries
- Hybrid Deployment: Combining smaller specialized models with human verification layers
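As a rough illustration of how source anchoring and a refusal protocol could combine, here is a minimal retrieval-gated answering sketch. The in-memory passage store, token-overlap similarity, and 0.2 threshold are assumptions for illustration; a real deployment would retrieve over the full GOV.UK corpus and use calibrated confidence signalling.

```python
# Minimal sketch combining source anchoring with a refusal protocol.
# The passage store, similarity measure and threshold are illustrative assumptions.
import re

REFUSAL_MESSAGE = "I don't know. Please check GOV.UK or contact the relevant service."

# Tiny stand-in for a GOV.UK content index.
GOVUK_PASSAGES = {
    "sure-start-maternity-grant": (
        "The Sure Start Maternity Grant is a one-off payment of £500 towards the "
        "costs of having a child. It is available in England, Wales and Northern Ireland."
    ),
    "guardians-allowance": (
        "Guardian's Allowance is a payment you can get if you are bringing up a "
        "child whose parents have died."
    ),
}

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9£]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between a query and a stored passage."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / max(len(ta | tb), 1)

def answer(query: str, threshold: float = 0.2) -> str:
    """Answer verbatim from stored GOV.UK text; refuse when nothing matches well."""
    slug, passage = max(GOVUK_PASSAGES.items(), key=lambda kv: jaccard(query, kv[1]))
    if jaccard(query, passage) < threshold:
        return REFUSAL_MESSAGE  # refusal protocol: no extrapolation beyond sources
    # Uncertainty signalling: cite the source and quote it rather than paraphrase.
    return f"According to GOV.UK ({slug}): {passage}"

print(answer("How much is the Sure Start Maternity Grant?"))
print(answer("Can I appeal a parking fine issued in France?"))
```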
The findings underscore that while LLMs excel at synthesizing information from diverse sources, this strength becomes a critical weakness in contexts that demand absolute factual precision, such as government services. As the UK Department for Work and Pensions experiments with Universal Credit chatbots, these benchmarks provide essential performance baselines for safe public deployment.
