OpenAI doubled GPT-5.5 token prices over GPT-5.4 while touting shorter outputs to offset costs, but OpenRouter's analysis of real user switching data finds net cost increases of 49 to 92 percent across all prompt lengths, with most users seeing no benefit from reduced verbosity.
GPT-5.5 Price Increase: What It Actually Costs | OpenRouter
A post published May 4, 2026 by Justin Summerville on OpenRouter breaks down the real-world cost impact of OpenAI's GPT-5.5 model, which launched in late April 2026 with a 2x price increase over its predecessor GPT-5.4. Input tokens rose from $2.50 per million to $5.00, while output tokens climbed from $15 per million to $30. OpenAI also noted that the new model produces shorter completions for identical tasks, a claim that would theoretically offset some of the price hike for users. To separate marketing talking points from real-world impact, the OpenRouter team replicated a cost analysis framework they previously applied to Anthropic's Opus 4.7 model, focusing on users who switched their primary usage from GPT-5.4 to GPT-5.5 in the weeks surrounding the launch.

OpenRouter's analysis relies on a controlled "switcher cohort" to isolate the impact of the model change. The team identified users whose top model by request count was GPT-5.4 in the three days before GPT-5.5 launched (April 21-23, 2026), and whose top model shifted to GPT-5.5 in the four days after launch (April 25-28, 2026, excluding launch day itself). This group provides a direct before-and-after comparison: same users, same workflows, only the model version differs. Because GPT-5.4 and 5.5 use the same tokenizer family, there is no need to adjust for differences in how text is converted to tokens, a common confounding factor in cross-model cost analyses.
The sample includes only text-only, non-cancelled requests with valid token counts, excluding media (images, files, audio, video), cancelled requests, and zero-token responses. Costs are normalized to dollars per million OpenRouter tokens, a metric that counts both input and output tokens independently of OpenAI's internal counting, providing a consistent baseline across model versions. Results are bucketed by prompt token count to identify how context length affects both output verbosity and final cost.
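The cohort-selection logic can be sketched in a few lines of pandas. The schema here (a `requests` table of per-user, per-model daily request counts) and all column names are illustrative assumptions, not OpenRouter's actual pipeline:

```python
from datetime import date

import pandas as pd

# Hypothetical request log: one row per (user, model, day, request_count).
requests = pd.DataFrame([
    ("u1", "gpt-5.4", date(2026, 4, 22), 40),
    ("u1", "gpt-5.5", date(2026, 4, 26), 35),
    ("u2", "gpt-5.4", date(2026, 4, 21), 10),
    ("u2", "gpt-5.4", date(2026, 4, 27), 12),  # kept using 5.4: not a switcher
    ("u3", "claude",  date(2026, 4, 22), 50),
    ("u3", "gpt-5.5", date(2026, 4, 25), 20),  # pre-launch top model wasn't 5.4
], columns=["user", "model", "day", "n_requests"])

PRE = (date(2026, 4, 21), date(2026, 4, 23))   # three days before launch
POST = (date(2026, 4, 25), date(2026, 4, 28))  # post-launch window, launch day excluded

def top_model(df: pd.DataFrame, window: tuple) -> pd.Series:
    """Top model by total request count per user inside a date window."""
    in_window = df[df["day"].between(*window)]
    counts = in_window.groupby(["user", "model"])["n_requests"].sum()
    # idxmax returns (user, model) tuples; keep just the model name.
    return counts.groupby("user").idxmax().map(lambda ix: ix[1])

pre_top = top_model(requests, PRE)
post_top = top_model(requests, POST)

# Switcher cohort: top model was GPT-5.4 before launch and GPT-5.5 after.
switchers = [
    u for u in pre_top.index.intersection(post_top.index)
    if pre_top[u] == "gpt-5.4" and post_top[u] == "gpt-5.5"
]
print(switchers)  # only u1 qualifies
```

The point of the before/after top-model test is that it excludes users who merely sampled the new model while keeping their main workload elsewhere.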
Verbosity Changes Vary Sharply by Prompt Length
OpenAI's claim that GPT-5.5 is less verbose holds only for a subset of use cases. OpenRouter measured median completion lengths for the switcher cohort across prompt size buckets:
| Prompt Size | Median Completion (5.4) | Median Completion (5.5) | Change |
|---|---|---|---|
| < 2K tokens | 121 | 129 | +7% |
| 2K – 10K | 140 | 213 | +52% |
| 10K – 25K | 211 | 143 | -32% |
| 25K – 50K | 185 | 150 | -19% |
| 50K – 128K | 188 | 136 | -28% |
| 128K+ | 215 | 143 | -34% |
For prompts shorter than 10K tokens, GPT-5.5 produces longer completions than GPT-5.4. Prompts under 2K tokens, which cover a large share of typical LLM use cases including chat, short-form content generation, and quick code fixes, see a 7% increase in output length. The 2K-10K bucket, which includes longer code generation tasks, document summarization, and medium-length research queries, sees a 52% jump in completion length, worsening the cost impact of the price hike.
Only prompts over 10K tokens see the promised reduction in verbosity. The 10K-25K bucket sees a 32% drop in output tokens, 25K-50K a 19% drop, 50K-128K a 28% drop, and 128K+ a 34% drop. These longer prompts are used for tasks like analyzing full-length documents, processing large codebases, and multi-turn conversations with extensive context history.
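The bucketing behind these medians can be sketched with `pd.cut` over per-request token counts. The sample data below is synthetic; only the bucket edges come from the article:

```python
import numpy as np
import pandas as pd

# Synthetic per-request sample standing in for real request logs.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "prompt_tokens": rng.integers(100, 200_000, size=1_000),
    "completion_tokens": rng.integers(50, 400, size=1_000),
})

# Bucket edges matching the table above (right-closed intervals).
edges = [0, 2_000, 10_000, 25_000, 50_000, 128_000, np.inf]
labels = ["< 2K", "2K – 10K", "10K – 25K", "25K – 50K", "50K – 128K", "128K+"]
df["bucket"] = pd.cut(df["prompt_tokens"], bins=edges, labels=labels)

# Median completion length per prompt-size bucket.
medians = df.groupby("bucket", observed=True)["completion_tokens"].median()
print(medians)
```

Running the same grouping once per model version, restricted to the switcher cohort, yields the two median columns in the table.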
Net Cost Impacts: 49% to 92% Increases Across All Buckets
Even with reduced output lengths for long prompts, every prompt bucket saw a net cost increase. OpenRouter calculated average billed cost per million tokens for the switcher cohort, normalizing for prompt length to isolate model-driven cost changes:
| Prompt Size | Avg $/M OR Tokens (5.4) | Avg $/M OR Tokens (5.5) | Change |
|---|---|---|---|
| < 2K tokens | $4.89 | $9.37 | +92% |
| 2K – 10K | $2.25 | $3.81 | +69% |
| 10K – 25K | $1.42 | $2.15 | +51% |
| 25K – 50K | $1.02 | $1.65 | +62% |
| 50K – 128K | $0.74 | $1.10 | +49% |
| 128K+ | $0.71 | $1.31 | +85% |

The smallest increases, 49% to 51%, fall in the 10K-25K and 50K-128K buckets, where verbosity reductions are most consistent; for these users, shorter outputs offset roughly half of the 100% headline price increase. Every other bucket fares worse. Users with prompts under 10K tokens, who make up the majority of typical LLM users, face 69% to 92% higher costs, with the increase compounded rather than offset by the longer completions in those buckets. Even the longest prompts (128K+) see an 85% cost increase, as the 34% reduction in output tokens is not enough to overcome the doubled base pricing for both input and output tokens.
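As a quick sanity check, the table's Change column can be recomputed directly from the reported dollar averages:

```python
# Reported avg $/M OpenRouter tokens: bucket -> (GPT-5.4, GPT-5.5).
costs = {
    "< 2K":       (4.89, 9.37),
    "2K – 10K":   (2.25, 3.81),
    "10K – 25K":  (1.42, 2.15),
    "25K – 50K":  (1.02, 1.65),
    "50K – 128K": (0.74, 1.10),
    "128K+":      (0.71, 1.31),
}
for bucket, (old, new) in costs.items():
    # Percent change, rounded to whole percent as in the table.
    print(f"{bucket:>10}: {100 * (new / old - 1):+.0f}%")
```

Each recomputed figure matches the published Change column, from +49% up to +92%.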
Context and Implications
This analysis mirrors OpenRouter's prior work on Anthropic's Opus 4.7 model, which found similar gaps between provider efficiency claims and real-world user costs. Across major model updates, providers often pair price hikes with talking points about improved efficiency, but these benefits rarely apply uniformly. For GPT-5.5, the efficiency gain is limited to a narrow slice of long-context use cases; most users pay far more with no improvement in output length, and some pay more for outputs that are actually longer.
Users considering switching to GPT-5.5 should audit their own prompt length distributions to estimate their actual cost exposure. A user whose workloads are entirely in the 50K-128K token bucket may see a ~49% increase, while a user focused on short chat prompts may see costs nearly double. OpenRouter's independent token counting provides a more reliable benchmark than provider-reported metrics, as it avoids discrepancies in how tokenizers count multi-turn context or special characters.
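One way to run such an audit, assuming you can export per-request prompt lengths: weight the per-bucket increases observed in the table by your own prompt-length mix. This simple estimator weights every request equally; a spend-weighted version would be more accurate for workloads with highly uneven request costs:

```python
# Observed avg cost increase per bucket, taken from the article's table.
BUCKET_INCREASE = {
    (0, 2_000): 0.92,
    (2_000, 10_000): 0.69,
    (10_000, 25_000): 0.51,
    (25_000, 50_000): 0.62,
    (50_000, 128_000): 0.49,
    (128_000, float("inf")): 0.85,
}

def estimate_increase(prompt_lengths: list) -> float:
    """Expected cost increase for a workload, weighting each request equally."""
    total = 0.0
    for n in prompt_lengths:
        for (lo, hi), inc in BUCKET_INCREASE.items():
            if lo <= n < hi:
                total += inc
                break
    return total / len(prompt_lengths)

# Hypothetical chat-heavy workload: mostly short prompts, some long-context.
workload = [800] * 70 + [5_000] * 20 + [60_000] * 10
print(f"~{estimate_increase(workload):.0%} expected cost increase")
```

For this hypothetical mix the estimate lands near the top of the range, illustrating how quickly short-prompt traffic pulls the blended increase toward the worst-case +92%.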
The analysis also highlights the value of normalized cost metrics for comparing model versions. Per-token price hikes are easy to announce, but real cost impact depends on how models behave in practice for specific workloads. For now, GPT-5.5's price increase delivers far less value for most users than the headline numbers suggest, with only a small subset of long-context users seeing meaningful offset from shorter outputs.
