Moonshot AI ships Kimi K2.7-Code, a trillion-parameter model built to finish long coding jobs

Moonshot AI says its new coding model can carry complex software tasks end to end while using about 30% fewer reasoning tokens than its predecessor. The company's benchmarks put it ahead of its own prior release and within striking distance of GPT-5.5 and Claude Opus 4.8 on several coding tests, though it still trails them on the hardest ones.

Moonshot AI has released Kimi K2.7-Code, a coding-focused model that the Beijing company aims squarely at the part of software work that current models handle worst: long, multi-step tasks where the agent has to stay coherent across dozens of tool calls without losing the thread.

The model is built on the company's earlier Kimi K2.6 and targets what Moonshot calls "real-world long-horizon coding tasks." The pitch is end-to-end completion across complex engineering workflows, the kind of job that involves reading a codebase, planning changes, running commands, and reacting to the results rather than answering a single prompt. The headline efficiency claim is a roughly 30% reduction in thinking-token usage compared with K2.6, which matters because reasoning tokens are where agentic models quietly run up the bill.

What is actually under the hood

Kimi K2.7-Code is a Mixture-of-Experts model with 1 trillion total parameters and 32 billion activated per token. That gap is the whole point of the MoE design. Of its 384 experts, the router selects 8 per token plus one shared expert, so the model only pays the compute cost of a 32B network on any given forward pass while keeping the capacity of a much larger one.
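
To make the routing arithmetic concrete, here is a minimal sketch of top-8 selection over 384 routed experts plus one shared expert. The expert counts and the 32B-of-1T figure come from the numbers above; the random router scores and the softmax over only the selected experts are assumptions for illustration, not Moonshot's actual router.

```python
import heapq
import math
import random

NUM_ROUTED_EXPERTS = 384   # from the article
TOP_K = 8                  # routed experts selected per token
NUM_SHARED_EXPERTS = 1     # shared expert that runs for every token


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and normalize their gate weights.

    Softmax over only the selected logits is an assumption of this sketch;
    real routers differ in scoring and normalization details.
    """
    # Indices of the top_k highest-scoring experts.
    chosen = heapq.nlargest(top_k, range(len(router_logits)),
                            key=router_logits.__getitem__)
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: w / total for i, w in exps.items()}


# Made-up router scores for a single token (illustrative only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
gates = route_token(logits)

experts_run = len(gates) + NUM_SHARED_EXPERTS
active_param_share = 32e9 / 1e12  # 32B activated of 1T total

print(f"experts run for this token: {experts_run} of "
      f"{NUM_ROUTED_EXPERTS + NUM_SHARED_EXPERTS}")
print(f"share of parameters active: {active_param_share:.1%}")
```

Only 9 of the 385 expert networks execute for a given token, which is how the active parameter share stays near 3.2% of the total.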

The rest of the spec sheet reads like a serious agentic system. The model has 61 layers and uses Multi-head Latent Attention (MLA) with 64 attention heads, SwiGLU activations, and a 160K vocabulary. Context length is 256K tokens, enough to hold a substantial chunk of a repository and a long tool-call history in working memory at once. There is also a vision side: a 400M-parameter MoonViT encoder gives the model image and video input, an unusual feature for something marketed as a coding model. The video chat path is still experimental and only available through Moonshot's own API.

One design decision stands out. K2.7-Code forces both thinking and preserve_thinking to be on, and neither can be turned off. Preserve-thinking retains the full reasoning content across multi-turn interactions, so a coding agent keeps access to its earlier chain of thought instead of discarding it after each turn. Moonshot's own documentation illustrates this with a small trick: ask for three random numbers, and the model's hidden reasoning lists five; ask for the other two, and it recalls the 215 and 222 it only ever "thought" about. For an agent debugging across many steps, that persistence is the difference between remembering why it made a decision and re-deriving it.
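
The effect of preserving reasoning can be sketched with a toy conversation store. Everything here is hypothetical: the `Turn` and `Conversation` classes and the `reasoning` field are invented for illustration and are not Moonshot's message schema. The `preserve_thinking=False` branch exists only for contrast, since K2.7-Code does not allow it.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str
    content: str
    reasoning: str = ""   # hypothetical field holding hidden reasoning text


@dataclass
class Conversation:
    preserve_thinking: bool
    turns: list = field(default_factory=list)

    def add(self, turn):
        self.turns.append(turn)

    def context_for_next_turn(self):
        """Return the history as it would be fed back for the next reply."""
        visible = []
        for t in self.turns:
            # Without preservation, earlier reasoning is dropped from history.
            kept = t.reasoning if self.preserve_thinking else ""
            visible.append(Turn(t.role, t.content, kept))
        return visible


def build(preserve):
    convo = Conversation(preserve_thinking=preserve)
    convo.add(Turn("user", "Give me three random numbers."))
    convo.add(Turn("assistant", "<three numbers>",
                   reasoning="considered five candidates, including 215 and 222"))
    convo.add(Turn("user", "What were the other two?"))
    return convo.context_for_next_turn()


kept = build(True)
dropped = build(False)
print("with preservation:   ", repr(kept[1].reasoning))
print("without preservation:", repr(dropped[1].reasoning))
```

With preservation on, the follow-up question can be answered from the stored reasoning; with it off, the two unmentioned numbers are simply gone from the context.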

The benchmark picture, read skeptically

Moonshot publishes a comparison table against three reference points: its own K2.6, GPT-5.5, and Claude Opus 4.8. The improvement over K2.6 is consistent. On the company's Kimi Code Bench v2, K2.7-Code jumps from 50.9 to 62.0. On Program Bench it moves from 48.3 to 53.6, and on the agentic MCP Atlas test from 69.4 to 76.0. On MCP Mark Verified it posts 81.1, ahead of Claude Opus 4.8's 76.4.

The honest reading is more mixed. On Kimi Code Bench v2, GPT-5.5 (69.0) and Claude Opus 4.8 (67.4) still sit clearly ahead. On Program Bench the gap is wider, with GPT-5.5 at 69.1 against K2.7-Code's 53.6. On MLS Bench Lite, Claude Opus 4.8 leads the field at 42.8 while K2.7-Code manages 35.1. A few of these benchmarks are also Moonshot's own (Kimi Code Bench, Kimi Claw 24/7 Bench), so part of the evaluation rests on tests authored by the team that built the model. The fair summary is that K2.7-Code closes much of the distance to the frontier on coding and pulls ahead on a couple of agentic tool-use measures, without claiming the top spot outright.
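
The gains and gaps are easier to compare side by side. The short script below uses only the scores quoted above and computes, for each benchmark, the improvement over K2.6 and the margin against the best rival score mentioned; where no figure is quoted, it reports None rather than guessing.

```python
# Scores as quoted in this article; a missing key means no figure was quoted.
scores = {
    "Kimi Code Bench v2": {"K2.6": 50.9, "K2.7-Code": 62.0,
                           "GPT-5.5": 69.0, "Claude Opus 4.8": 67.4},
    "Program Bench": {"K2.6": 48.3, "K2.7-Code": 53.6, "GPT-5.5": 69.1},
    "MCP Atlas": {"K2.6": 69.4, "K2.7-Code": 76.0},
    "MCP Mark Verified": {"K2.7-Code": 81.1, "Claude Opus 4.8": 76.4},
    "MLS Bench Lite": {"K2.7-Code": 35.1, "Claude Opus 4.8": 42.8},
}


def summarize(row):
    """Return (gain over K2.6, margin vs best quoted rival) for one benchmark."""
    ours = row["K2.7-Code"]
    gain = round(ours - row["K2.6"], 1) if "K2.6" in row else None
    rivals = [v for k, v in row.items() if k not in ("K2.6", "K2.7-Code")]
    margin = round(ours - max(rivals), 1) if rivals else None
    return gain, margin


for name, row in scores.items():
    gain, margin = summarize(row)
    print(f"{name:20s} gain over K2.6: {gain}  margin vs best rival: {margin}")
```

A negative margin means K2.7-Code trails the best rival figure quoted here; the only positive margin in this set is MCP Mark Verified.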

Running it

This is an open-weights release under a Modified MIT License, with the weights on Hugging Face at roughly 1.1T parameters in BF16, F32, and I32 tensor types. Moonshot ships a native INT4 quantization, the same method used in Kimi-K2-Thinking, which is the realistic path for anyone hoping to serve a trillion-parameter model without a data center's worth of accelerators.
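
A back-of-the-envelope calculation shows why the quantized build matters. This sketch assumes a flat 1 trillion parameters and that every weight is stored at the stated width, which overstates the INT4 savings somewhat: real checkpoints typically keep some tensors at higher precision and carry quantization scales, and serving also needs memory for the KV cache and activations.

```python
TOTAL_PARAMS = 1.0e12  # 1 trillion total parameters, per the article

# Storage per parameter in bytes (16 bits and 4 bits respectively).
BYTES_PER_PARAM = {"BF16": 2.0, "INT4": 0.5}


def weight_terabytes(params, fmt):
    """Weights-only footprint in terabytes (1 TB = 1e12 bytes)."""
    return params * BYTES_PER_PARAM[fmt] / 1e12


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_terabytes(TOTAL_PARAMS, fmt):.1f} TB of weights")
```

Under these assumptions the weights alone drop from about 2 TB to about 0.5 TB, still large but within reach of a single multi-accelerator node rather than a cluster.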

Because K2.7-Code shares its architecture with Kimi K2.5 and K2.6, existing deployment setups carry over. The recommended inference engines are vLLM, SGLang, and KTransformers, with a transformers requirement of >=4.57.1 and <5.0.0. For teams that would rather not host it, Moonshot offers a hosted API on its platform with OpenAI- and Anthropic-compatible endpoints, which lowers the switching cost for anyone already wired into those SDKs. Recommended sampling for thinking mode is a temperature of 1.0 and top_p of 0.95, and instant (non-thinking) mode is not offered.
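
For anyone scripting a deployment, the version window and sampling settings above can be captured in a few lines. This is a rough sketch: the helper names are made up for illustration, and the simplified parser ignores pre-release suffixes, so a proper tool such as pip's own resolver remains the authority on version matching.

```python
import re

MIN_VERSION = (4, 57, 1)   # inclusive lower bound quoted above
MAX_VERSION = (5, 0, 0)    # exclusive upper bound quoted above

# Sampling values reported as recommended for thinking mode.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95}


def parse_version(text):
    """Turn '4.57.1' into (4, 57, 1); suffixes such as 'rc1' are ignored."""
    parts = [int(n) for n in re.findall(r"\d+", text)[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_supported(version_text):
    """True if the version falls in the >=4.57.1, <5.0.0 window."""
    return MIN_VERSION <= parse_version(version_text) < MAX_VERSION


for candidate in ("4.57.0", "4.57.1", "4.99.0", "5.0.0"):
    print(candidate, "->", transformers_supported(candidate))
```

The `RECOMMENDED_SAMPLING` dictionary can be passed along as request or generation parameters in whichever client a team already uses.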

Moonshot recommends pairing the model with its own Kimi Code CLI as the agent framework, which tells you where the company expects this to be used: inside a coding agent loop, not a chat window.

Why it matters

The interesting story here is not a single benchmark number but the direction of the open-weights coding field. A trillion-parameter MoE model with a 256K context, forced persistent reasoning, and native INT4 quantization, released under a permissive license, is a credible alternative to closed frontier APIs for teams that want to run their coding agents on infrastructure they control. The token-efficiency gain matters more than it sounds, because long-horizon agents fail and overspend in the same place: they think too much and lose track of what they already decided. Cutting reasoning tokens by roughly 30% while improving task completion attacks both problems at once.

Whether K2.7-Code earns a place in real engineering workflows will come down to behavior on private codebases rather than published tests. But Moonshot has put out a model that is specific about what it is for, transparent about where it still trails GPT-5.5 and Claude Opus 4.8, and open enough to actually evaluate. That combination is rarer than another record-breaking score, and it gives the rest of the field something concrete to measure against.