# LLMs

Subquadratic expands context windows to 12 million tokens, reshaping LLM interaction

Startups Reporter
4 min read

Subquadratic announced a $45 million Series A round led by A16Z to launch a 12‑million‑token context window for its language model, promising cheaper long‑form reasoning and new use cases in code analysis, legal review, and scientific research.

Subquadratic – tackling the context‑window bottleneck

Subquadratic, a San Francisco‑based startup founded in 2023, builds large language models (LLMs) that keep the computational cost of attention sub‑quadratic. The company’s flagship model, SQ‑12, can ingest up to 12 million tokens in a single prompt – roughly a hundred full‑length novels, or a sizable codebase – while staying within the same price envelope as a 4 k‑token‑window model from a major cloud provider.

The problem: context windows are a hard limit

Most commercial LLM APIs cap prompts at 2 k–8 k tokens. That forces developers to chunk documents, maintain external state, or accept degraded performance on tasks that need a holistic view, such as:

  • Legal contract review – a single contract can easily exceed 10 k tokens.
  • Software repository analysis – a medium‑size repo with documentation, tests, and CI configs can run to millions of tokens of text.
  • Scientific literature synthesis – meta‑analyses often require reading dozens of papers in one go.

When the model can only see a slice, it must guess about missing context, leading to hallucinations or incomplete answers. Subquadratic’s approach eliminates that guesswork.
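The workaround today is to chunk documents into overlapping windows and stitch the per‑chunk answers back together. A minimal sketch of that pattern is below; the budget and overlap figures are illustrative, not any vendor’s defaults:

```python
def chunk_by_tokens(tokens, budget=4000, overlap=200):
    """Split a token sequence into overlapping windows that each fit a budget.

    The overlap gives each chunk some shared context with its neighbor, which
    reduces (but does not eliminate) the loss of cross-chunk information.
    """
    step = budget - overlap
    return [tokens[i:i + budget]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = list(range(10_000))        # stand-in for a tokenized document
chunks = chunk_by_tokens(doc)
print(len(chunks))               # 3 overlapping chunks of <= 4000 tokens each
```

Every boundary in that list is a place where the model loses sight of the rest of the document, which is exactly the failure mode a 12‑million‑token window avoids.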

How they achieve it

The core of SQ‑12 is a linear‑complexity attention mechanism that approximates full attention using a combination of low‑rank projections and reversible memory buffers. In practice this means:

  • Memory usage grows linearly with token count, not quadratically, keeping GPU RAM requirements manageable.
  • Inference latency stays comparable to a 4 k token pass on the same hardware, because the algorithm trades a small amount of precision for massive speed gains.

The trade‑off is a modest drop in per‑token perplexity (about 0.2 % higher than a standard transformer of the same size), which most downstream users find acceptable when the alternative is to split the prompt.
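Subquadratic has not published SQ‑12’s exact mechanism, but kernelized linear attention (in the style of Katharopoulos et al., 2020) shows how attention cost can become linear in sequence length: keys and values are compressed into a fixed‑size d×d summary that every query reads, instead of every query attending to every key. The sketch below is an illustrative stand‑in, not SQ‑12’s actual algorithm:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: O(n * d^2) instead of O(n^2 * d).

    Uses the elu(x) + 1 feature map to keep scores positive, then folds
    all keys/values into one (d, d) state reused by every query.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)          # (n, d) feature-mapped queries/keys
    KV = Kp.T @ V                    # (d, d): fixed-size summary of K and V
    Z = Qp @ Kp.sum(axis=0) + eps    # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]    # (n, d) attention output

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)                     # (1024, 64)
```

Because the (d, d) summary has constant size, memory grows with n rather than n², matching the scaling behavior the bullets above describe; the approximation error is the source of the small perplexity gap noted in the article.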

Funding and market positioning

On May 2, 2026, Subquadratic closed a $45 million Series A round. The lead investor was Andreessen Horowitz, with participation from Sequoia Capital, DCVC, and Element AI. The round was described by A16Z partner Katie Haun as “a bet on the next generation of LLM infrastructure that removes the artificial ceiling on context.”

The capital will fund:

  1. Scaling the model family – a 30 M‑token variant for enterprise‑grade workloads.
  2. Developer tooling – a CLI and SDK that automatically handle token budgeting, chunk‑aware prompting, and result stitching.
  3. Cloud partnership integrations – early pilots with Microsoft Azure and Google Cloud to offer SQ‑12 as a managed service.

Subquadratic positions itself between the “cloud‑native LLM APIs” that prioritize ease of use and the “research‑grade open‑source models” that require heavy engineering. By offering a pay‑as‑you‑go pricing model that is roughly 0.12 ¢ per 1 k tokens for the 12 M‑token window, they aim to undercut the cost of stitching multiple calls to existing APIs, which can quickly exceed $1 per request for long documents.
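A back‑of‑the‑envelope check of the quoted rate, assuming it applies uniformly to the full input and using a hypothetical $0.01 per 1 k tokens for the incumbent API being replaced:

```python
tokens = 12_000_000                  # one maximal SQ-12 prompt
sq12_rate = 0.0012                   # $/1k tokens (0.12 cents, as quoted)
incumbent_rate = 0.01                # $/1k tokens -- hypothetical incumbent rate

sq12_cost = tokens / 1000 * sq12_rate
chunked_cost = tokens / 1000 * incumbent_rate

print(f"SQ-12 single prompt: ${sq12_cost:.2f}")     # SQ-12 single prompt: $14.40
print(f"Chunked incumbent:   ${chunked_cost:.2f}")  # Chunked incumbent:   $120.00
```

The gap widens further once the chunked approach’s overlap tokens and orchestration retries are counted, which is the basis of the undercutting claim.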

Early traction

  • LegalTech startup LexiAI reports a 40 % reduction in review time after switching to SQ‑12 for contract analysis.
  • Open‑source code‑review tool CodeLens integrated the model and can now generate full‑repo summaries in under a minute.
  • University of Cambridge’s Bioinformatics group used the 12 M‑token window to run a single‑prompt literature review across 200 papers, cutting manual curation from weeks to hours.

These pilots suggest a market appetite for “single‑prompt” solutions that avoid the engineering overhead of prompt‑chaining.

What it means for the ecosystem

If Subquadratic’s pricing holds, developers may start designing applications that treat the LLM as a stateful knowledge store rather than a stateless function. That could shift architecture patterns away from complex prompt orchestration frameworks toward simpler, more maintainable codebases.

At the same time, the approach raises questions about hardware requirements. While the linear attention algorithm reduces memory pressure, running a 12 M‑token inference still needs GPUs with 40 GB+ VRAM or multi‑GPU setups. Subquadratic’s upcoming managed service will be crucial for smaller teams that lack such hardware.

Risks and skeptics

  • Precision loss – the approximation may not be suitable for tasks that demand exact token‑level reasoning, such as formal theorem proving.
  • Vendor lock‑in – early cloud partnerships could tie customers to specific platforms unless the SDK remains truly cloud‑agnostic.
  • Competitive response – larger players (OpenAI, Anthropic) have hinted at longer context windows; they could release comparable capabilities with deeper integration into existing ecosystems.

Bottom line

Subquadratic’s 12 million‑token window removes a practical ceiling that has limited LLM adoption in many enterprise and research domains. Backed by a solid Series A round and early customer wins, the company is poised to push the industry toward “single‑prompt” workflows. Whether the trade‑offs in accuracy and hardware cost will be acceptable at scale remains to be seen, but the move forces the broader AI community to rethink how context is managed.

