Goldman Sachs predicts agentic AI will drive token usage up to 24 times current levels, forcing major tech firms to rethink pricing, hardware refresh cycles and supply‑chain allocations for next‑gen GPUs.

AI Token Consumption Set to Surge 24‑Fold, Pressuring Budgets at Microsoft, Uber and Beyond

Announcement

Goldman Sachs released a report this week warning that the rise of autonomous AI agents could push total token consumption to 24 times today’s baseline within the next three years. The forecast comes as Microsoft and Uber have publicly disclosed budget overruns linked to token‑based billing for tools such as GitHub Copilot and Claude Code. Both companies are now moving developers onto internal, lower‑cost platforms and re‑evaluating hardware purchase plans.

Technical specs and supply‑chain context

Token demand per workload

Workload type	Tokens per request (average)	Typical daily requests per 1 k engineers	Estimated daily token count
Simple chatbot	150	2 M	300 M
Code‑completion (Copilot)	600	1 M	600 M
Agentic AI (autonomous toolchain)	12 k	200 k	2.4 B

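Per request, the agentic AI row uses 20 times the tokens of standard code‑completion (12 k versus 600), although its lower request volume makes the daily total only 4 times larger (2.4 B versus 600 M). If agent usage grows at the 30 % annual rate cited by the report, the daily token volume would reach ≈ 7 B after about four years of compounding (2.4 B × 1.3^4 ≈ 6.9 B). That is roughly 23 times the 300 M simple‑chatbot baseline, close to the 24‑times multiplier projected.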
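The table's arithmetic can be reproduced with a short sketch. The function and variable names are illustrative and do not come from the Goldman Sachs report; the figures are the table's own, and the compounding step assumes the 30 % annual rate applies directly to daily token volume.

```python
# Figures copied from the workload table above; all names are illustrative.
# Each entry is (average tokens per request, daily requests per 1 k engineers).
WORKLOADS = {
    "simple_chatbot": (150, 2_000_000),
    "code_completion": (600, 1_000_000),
    "agentic_ai": (12_000, 200_000),
}


def daily_tokens(tokens_per_request: int, daily_requests: int) -> int:
    """Daily token count for one workload per 1 k engineers."""
    return tokens_per_request * daily_requests


def compounded(volume: float, annual_rate: float, years: int) -> float:
    """Volume after compounding at a fixed annual growth rate (an assumption)."""
    return volume * (1 + annual_rate) ** years


if __name__ == "__main__":
    for name, (tokens, requests) in WORKLOADS.items():
        print(f"{name}: {daily_tokens(tokens, requests) / 1e6:,.0f} M tokens/day")

    agentic = daily_tokens(*WORKLOADS["agentic_ai"])
    for years in range(1, 5):
        grown = compounded(agentic, 0.30, years)
        print(f"after {years} year(s): {grown / 1e9:.2f} B tokens/day")
```

At 30 % a year, the agentic row passes 5 B tokens a day after three years and approaches 7 B after four.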

Hardware efficiency gap

Current data‑center GPUs such as Nvidia Blackwell deliver about 0.9 TFLOPS per watt for FP8 inference. Nvidia’s upcoming Vera Rubin platform, slated for a 2026 launch, promises 10 × the performance‑per‑watt of Blackwell by moving to a 3 nm process node and integrating a dedicated tensor‑streaming engine.

GPU	Process node	Peak FP8 throughput	Performance/Watt
Nvidia Blackwell	5 nm	60 TFLOPS	0.9 TFLOPS/W
Nvidia Vera Rubin (projected)	3 nm	600 TFLOPS	9 TFLOPS/W

Even with a tenfold efficiency gain, the raw token increase would still outpace cost reductions. Assuming a linear cost model, a 24‑times token rise on hardware that is 10 times more efficient would still require ≈ 2.4 × the energy budget of today’s AI clusters (24 ÷ 10), more than cancelling out the savings from Vera Rubin.
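The linear cost model behind that estimate fits in a few lines. This is a minimal sketch with a hypothetical function name; it assumes energy scales linearly with token volume and inversely with performance per watt, and it ignores fixed overheads such as cooling and networking.

```python
def relative_energy(token_multiplier: float, efficiency_gain: float) -> float:
    """Energy budget relative to today under a linear cost model.

    Assumes energy grows in proportion to tokens processed and shrinks in
    proportion to performance per watt. Fixed overheads are not modelled.
    """
    if efficiency_gain <= 0:
        raise ValueError("efficiency_gain must be positive")
    return token_multiplier / efficiency_gain


if __name__ == "__main__":
    # 24x tokens on hardware with 10x performance per watt.
    print(relative_energy(24, 10))
    # Same token growth with no hardware refresh at all.
    print(relative_energy(24, 1))
```

Under this model the energy budget only stays flat when the efficiency gain matches the token multiplier, so a 24‑times token rise would need a 24‑times gain in performance per watt.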

Supply‑chain implications

Foundry capacity – TSMC’s 3 nm lines are booked at 85 % utilization through 2025. Scaling Vera Rubin production will compete with high‑end mobile SoCs, tightening lead times for AI‑focused datacenter orders.
Memory bandwidth – HBM3E, the next‑generation memory for Vera Rubin, is limited to ~1.2 TB/s per stack. To sustain the projected token throughput, data‑center designers will need dual‑stack configurations, effectively doubling the memory bill of each server.
Power infrastructure – A 10 MW AI pod built on Blackwell draws ~9 MW under full load. At Vera Rubin’s projected 10 × efficiency, today’s workload would need only ~0.9 MW, but the 24‑times token volume would push that to ~22 MW, forcing operators to upgrade cooling and power distribution.

Market implications

Pricing model shifts – Microsoft’s move to token‑based billing for Copilot mirrors a broader industry trend: converting AI usage into a consumable metric to better align revenue with compute cost. Companies that continue offering flat‑rate subscriptions risk margin erosion as token counts climb.
Capital‑expenditure re‑timing – Uber’s 2026 AI budget was exhausted in months, prompting a pause on new GPU purchases until the next fiscal cycle. This mirrors a pattern where firms delay hardware refreshes, opting instead for software‑level optimizations (prompt engineering, model distillation) to stretch existing capacity.
Vendor competition – Nvidia’s Vera Rubin advantage will be most valuable to early adopters that can secure supply. Firms that lock in early orders may achieve 30‑40 % lower TCO versus competitors stuck on Blackwell or older Hopper GPUs.
Potential consolidation – If token costs continue to outstrip hardware efficiency gains, mid‑size AI service providers may be forced to merge with larger players that can amortize the infrastructure spend across broader revenue streams.
Regulatory focus – The EU AI Act includes provisions for “energy‑intensive AI systems.” Companies reporting runaway token usage could face additional compliance reporting, adding another layer of operational cost.
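The margin‑erosion risk described under pricing model shifts can be illustrated with a toy model. The subscription price, per‑token compute cost and usage figures below are made‑up placeholders, not numbers from Microsoft, Uber or the report, and the function names are hypothetical.

```python
def flat_rate_margin(monthly_price: float, tokens_used: int,
                     cost_per_million: float) -> float:
    """Gross margin per seat on a flat-rate plan, as a fraction of price."""
    compute_cost = tokens_used / 1_000_000 * cost_per_million
    return (monthly_price - compute_cost) / monthly_price


def metered_margin(price_per_million: float, cost_per_million: float) -> float:
    """Gross margin on token-based billing; independent of usage volume."""
    return (price_per_million - cost_per_million) / price_per_million


if __name__ == "__main__":
    PRICE = 20.0          # hypothetical flat monthly price per seat (USD)
    COST_PER_M = 2.0      # hypothetical compute cost per million tokens (USD)
    BASE_TOKENS = 2_000_000   # hypothetical monthly tokens per seat today

    for multiplier in (1, 4, 24):
        margin = flat_rate_margin(PRICE, BASE_TOKENS * multiplier, COST_PER_M)
        print(f"{multiplier:>2}x usage: flat-rate margin {margin:.0%}")
    print(f"metered margin: {metered_margin(3.0, COST_PER_M):.0%}")
```

With these placeholder numbers the flat‑rate margin is 80 % at today's usage and turns sharply negative at 24 times the tokens, while the metered margin does not move with volume.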

Outlook

Goldman Sachs’ 24‑times token growth scenario is not a distant hypothetical; the data‑center metrics above show that even a modest adoption of autonomous agents can drive token consumption into the billions per day. While next‑generation GPUs promise dramatic efficiency gains, the supply‑chain bottlenecks at the foundry and memory levels mean those gains will be realized gradually.

In the short term, we expect two parallel strategies:

Software‑first cost control – Prompt‑tuning, model pruning, and selective token‑caching will become standard practice to keep daily token bills below the $10 M threshold that triggered Uber’s budget crisis.
Hardware‑first acceleration – Companies with deep pockets will pre‑pay for Vera Rubin shipments, securing a performance edge that could translate into a 5‑10 % revenue uplift for AI‑driven products.
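As one illustration of the software‑first approach, the sketch below shows an exact‑match response cache and a daily bill estimate. It is a simplified stand‑in for the token‑caching mentioned above, not any vendor's implementation: `call_model` and `count_tokens` are hypothetical callables supplied by the caller, and the price per million tokens is a placeholder.

```python
import hashlib


class PromptCache:
    """Exact-match response cache keyed on a hash of the prompt text."""

    def __init__(self):
        self._store = {}
        self.tokens_billed = 0   # tokens actually sent to or returned by the model
        self.tokens_saved = 0    # tokens avoided thanks to cache hits

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def complete(self, prompt, call_model, count_tokens):
        """Return a cached response if the prompt was seen, else call the model."""
        key = self._key(prompt)
        if key in self._store:
            cached = self._store[key]
            self.tokens_saved += count_tokens(prompt) + count_tokens(cached)
            return cached
        response = call_model(prompt)
        self.tokens_billed += count_tokens(prompt) + count_tokens(response)
        self._store[key] = response
        return response


def daily_bill(tokens, usd_per_million):
    """Estimated daily spend for a token count at a flat per-million price."""
    return tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    # Stand-ins: a fake model and a whitespace word count instead of a tokenizer.
    fake_model = lambda prompt: "three issues found"
    word_count = lambda text: len(text.split())

    cache = PromptCache()
    for _ in range(3):
        cache.complete("list open issues", fake_model, word_count)
    print(cache.tokens_billed, cache.tokens_saved)
    print(daily_bill(2_400_000_000, 4.0))  # placeholder price of $4 per million
```

Exact‑match caching only helps when agents repeat identical prompts; a production system would more likely need prefix or semantic caching, which this sketch does not attempt.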

The balance between these approaches will determine which firms stay financially viable as token demand explodes.

For further reading on Nvidia’s upcoming platform, see the official announcement.

