Kimi K2.7 Code HighSpeed: 260 Tok/s & $0.95/M (2026)
Kimi K2.7 Code HighSpeed hits 180-260 tokens/sec with a 256K context window. Open weights, $0.95/$4.00 per 1M tokens, 81.1 on MCP Mark Verified tool use.
Kimi K2.7 Code HighSpeed is Moonshot AI's fast-serving mode for its 1T-parameter open-weight coding model, released June 15, 2026 with 256K context and 180-260 tokens/sec throughput. It costs $0.95/$4.00 per 1M input/output tokens and scores 81.1 on MCP Mark Verified, beating Claude Opus 4.8's 76.4 on tool-call reliability.
Provider: Moonshot AI · Family: Kimi K2
Context window: 262,144 tokens · Max output: 49,152
Input modalities: text, image, video, tool-calls, code · Output: text, tool-calls, code
About Kimi K2.7 Code HighSpeed
Kimi K2.7 Code HighSpeed is a fast-inference variant of Kimi K2.7 Code, the coding-focused release Moonshot AI shipped on June 12, 2026 as a fine-tune of its Kimi K2.6 base. The HighSpeed mode itself followed three days later, on June 15, 2026, rolling out first to the Kimi Code Beta channel. Architecturally it is unchanged from K2.6: a Mixture-of-Experts transformer with 1 trillion total parameters and 32 billion activated per token, 61 layers, 384 experts with 8 routed plus 1 shared expert per token, Multi-head Latent Attention (MLA) for KV cache compression, SwiGLU activations, a 160K vocabulary, and a 256K (262,144-token) context window. A 400M-parameter MoonViT vision encoder gives it native image and video input alongside text.
Moonshot has not submitted K2.7 Code to any independently audited coding or reasoning suite: no SWE-bench Verified, SWE-bench Pro, LiveCodeBench, or GPQA Diamond score has been published as of late June 2026. Apart from MCP Mark Verified, the only numbers on the table are Moonshot's own Kimi Code Bench v2, where K2.7 Code scores 62.0 versus K2.6's 50.9 (a 21.8% relative gain), Program Bench at 53.6 versus 48.3 (+11.0%), and MLS Bench Lite at 35.1 versus 26.7 (+31.5%). On MCP Mark Verified, a third-party tool-invocation benchmark covering Notion, GitHub, Postgres, Filesystem, and Playwright environments, K2.7 Code scores 81.1, ahead of Claude Opus 4.8's 76.4 on that specific test, though Opus 4.8 leads on broader coding suites like Terminal-Bench 2.1. Its closest open-weight rival, Zhipu's GLM 5.2, holds a stronger position on audited benchmarks (62.1% SWE-bench Pro), so K2.7 Code's real edge is token efficiency and tool-call reliability rather than raw solve rate. The defining change from K2.6 is a roughly 30% cut in reasoning-token usage per task, which lowers effective cost on reasoning-heavy coding workloads even though the per-token price is nearly identical.
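How much a 30% cut in reasoning tokens saves depends on what share of a task's output is reasoning. The sketch below is a rough illustration only. It assumes reasoning tokens are billed at the $4.00 per 1M output rate, that K2.6 and K2.7 Code charge the same output price (the text says only "nearly identical"), and that the answer length is unchanged. The 150K/50K token split is a hypothetical workload, not a measured one.

```python
OUTPUT_PRICE_PER_M = 4.00   # USD per 1M output tokens (native Kimi API)
REASONING_CUT = 0.30        # "roughly 30%" fewer reasoning tokens than K2.6


def output_cost(reasoning_tokens, answer_tokens, price_per_m=OUTPUT_PRICE_PER_M):
    """USD cost of the output side, assuming reasoning is billed as output."""
    return (reasoning_tokens + answer_tokens) * price_per_m / 1_000_000


def savings_pct(reasoning_tokens, answer_tokens, cut=REASONING_CUT):
    """Percent saved on output cost when only the reasoning tokens shrink."""
    before = output_cost(reasoning_tokens, answer_tokens)
    if before == 0:
        return 0.0
    after = output_cost(reasoning_tokens * (1 - cut), answer_tokens)
    return round((before - after) / before * 100, 1)


# Hypothetical task: 150K reasoning tokens and 50K answer tokens on K2.6.
print(savings_pct(150_000, 50_000))   # 22.5
# If the output were all reasoning, the saving would reach the full 30%.
print(savings_pct(100_000, 0))        # 30.0
```

Under these assumptions the saving on output cost is always below 30% unless the output is pure reasoning, and input cost is unaffected.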
K2.7 Code always runs with extended thinking enabled. A request sent with thinking explicitly disabled is silently rerouted to K2.6 rather than returning an error, and interleaved thinking is preserved across multi-turn tool-calling sessions for coherent long-horizon coding runs. The HighSpeed mode is a serving-side optimization, not a retrained model: Moonshot reports roughly 180 tokens/sec on median-length coding completions and bursts to about 260 tokens/sec on short-context requests, close to six times the throughput of the standard K2.7 Code deployment. Third-party inference provider Crusoe reports over 430 output tokens/sec running K2.7 on its own optimized stack, which shows the ceiling is provider-dependent rather than fixed by the model.
On Moonshot's native Kimi API, K2.7 Code HighSpeed is priced at $0.95 per 1M input tokens and $4.00 per 1M output tokens, with cache-hit input at $0.19 per 1M tokens (up from $0.16 on K2.6). OpenRouter lists it cheaper, at $0.74 input and $3.50 output per 1M tokens. The model is also live on Cloudflare Workers AI and the Vercel AI Gateway. A daily coding-agent workload pushing roughly 1M input and 200K output tokens costs about $1.75 on the native API ($0.95 for input plus $0.80 for output), and a 100K-token document review runs about $0.50 (the 100K input tokens alone account for about $0.10).
Weights are published under a Modified MIT License on Hugging Face and GitHub: free for research and most commercial use, with an attribution clause that kicks in for very large-scale commercial deployments (the same style of condition Moonshot has used since K2). Self-hosting the FP8 weights needs roughly 1TB of HBM, realistically an 8x H200 SXM5 node (about 1128GB total), leaving around 128GB for KV cache at full 256K context with small batches. Moonshot also ships native INT4 weights trained with quantization-aware training rather than post-hoc quantization, cutting VRAM roughly in half with minimal quality loss; vLLM, SGLang, and KTransformers are the officially recommended INT4 serving engines.
Moonshot has not published a system card for K2.7 Code, so there is no documented refusal policy, jailbreak-resistance figure, or agentic-misuse evaluation specific to this release. An independent academic safety evaluation of the prior K2.5 model found it showed a lower overrefusal rate and higher willingness to cooperate with borderline requests than GPT-5.2 and Claude Opus 4.5, a pattern worth assuming carries over until Moonshot publishes its own K2.7 assessment. Training data cutoff is unconfirmed for K2.7 Code specifically; K2.6 was reported at approximately April 2025 but that figure is itself unverified by Moonshot.
K2.7 Code HighSpeed fits teams running autonomous coding agents that call tools frequently and want to keep latency and token cost down, especially teams already comfortable self-hosting or routing through OpenRouter/Cloudflare for the cheaper rate. It is a weaker fit for anyone who needs an audited SWE-bench or GPQA number for a procurement process, needs published safety documentation for a regulated deployment, or is doing general-purpose writing and analysis work, where Moonshot itself recommends the base K2.6 model instead.
Pricing
Native Kimi API: $0.95 per 1M input tokens, $4.00 per 1M output tokens, $0.19 per 1M cached input tokens. OpenRouter lists it cheaper at $0.74 input / $3.50 output per 1M tokens. Self-hosting the open weights avoids per-token fees entirely but requires roughly 1TB of GPU memory for FP8.
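The per-token prices above translate into workload costs with simple arithmetic. The following is a minimal sketch, not an official calculator. The provider keys, function name, and the 50% cache-hit scenario are hypothetical, and it assumes cached tokens are a subset of the input tokens, with the full input rate applied where no cache-hit price is listed.

```python
# USD per 1M tokens, copied from the pricing text above.
PRICES = {
    "kimi_native": {"input": 0.95, "cached_input": 0.19, "output": 4.00},
    "openrouter": {"input": 0.74, "output": 3.50},
}


def estimate_cost(provider, input_tokens, output_tokens, cached_input_tokens=0):
    """Return the estimated USD cost of one workload on the given provider."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    rates = PRICES[provider]
    # Fall back to the full input rate when no cache-hit price is listed.
    cached_rate = rates.get("cached_input", rates["input"])
    fresh_tokens = input_tokens - cached_input_tokens
    total = (
        fresh_tokens * rates["input"]
        + cached_input_tokens * cached_rate
        + output_tokens * rates["output"]
    ) / 1_000_000
    return round(total, 4)


# Daily agent workload from the text: 1M input tokens, 200K output tokens.
print(estimate_cost("kimi_native", 1_000_000, 200_000))   # 1.75
print(estimate_cost("openrouter", 1_000_000, 200_000))    # 1.44
# Hypothetical variant: half of the input served from cache on the native API.
print(estimate_cost("kimi_native", 1_000_000, 200_000, cached_input_tokens=500_000))  # 1.37
```

Output tokens dominate quickly at a roughly 4:1 output-to-input price ratio, so the output volume of a workload matters more to the bill than its prompt size.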
Key Features
- HighSpeed serving mode: Reaches roughly 180 tokens/sec on median coding completions and up to 260 tokens/sec on short-context requests, about 6x standard K2.7 Code throughput.
- 256K context window: 262,144-token context with Multi-head Latent Attention (MLA) compression to keep the KV cache manageable at self-hosted scale.
- Always-on interleaved thinking: Extended thinking runs on every request and is preserved across multi-turn tool-calling sessions for coherent long-horizon coding runs.
- Native MoonViT vision encoder: 400M-parameter vision encoder gives native image and video input alongside text and code, without a separate captioning step.
- Native INT4 quantization: Quantization-aware-trained INT4 weights ship alongside FP8, cutting self-hosting VRAM roughly in half with vLLM, SGLang, and KTransformers support.
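To put the throughput figures in wall-clock terms, here is a minimal sketch that converts tokens/sec into generation time for a maximum-length step of 49,152 tokens. It assumes a steady decode rate and ignores time-to-first-token, queueing, and network latency. The standard-mode rate is derived by dividing the 180 tokens/sec median figure by six, which is an assumption based on the "about 6x" claim rather than a published number.

```python
MAX_OUTPUT_TOKENS = 49_152   # per-step output limit from the spec line
MEDIAN_TPS = 180             # HighSpeed, median-length coding completions
BURST_TPS = 260              # HighSpeed, short-context requests
STANDARD_TPS = MEDIAN_TPS / 6  # assumed: "about 6x" applied to the median rate


def generation_seconds(output_tokens, tokens_per_second):
    """Seconds needed to decode output_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


for label, tps in [("standard (assumed)", STANDARD_TPS),
                   ("HighSpeed median", MEDIAN_TPS),
                   ("HighSpeed burst", BURST_TPS)]:
    seconds = generation_seconds(MAX_OUTPUT_TOKENS, tps)
    print(f"{label}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

Under these assumptions a full-length step drops from roughly 27 minutes to between 3 and 5 minutes. Because thinking is always on, reasoning tokens are part of what has to be decoded.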
Pros
- MCP Mark Verified score of 81.1 beats Claude Opus 4.8 (76.4) on tool-invocation reliability.
- Open weights under a Modified MIT License with native INT4 quantization, halving self-hosting VRAM versus FP8.
- Roughly 30% fewer reasoning tokens than K2.6 lowers effective cost on reasoning-heavy coding tasks.
Cons
- No independently audited SWE-bench, LiveCodeBench, or GPQA score exists for this release.
- No published system card, so refusal policy and jailbreak resistance are undocumented.
- HighSpeed mode is beta-gated to the Kimi Code Beta channel and may not be available on every account.
Benchmarks
- Program Bench: 53.6
- MLS Bench Lite: 35.1
- MCP Mark Verified: 81.1
- Kimi Code Bench v2: 62.0
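The relative gains quoted in the About section can be reproduced from these scores and the K2.6 baselines given there. This is a minimal sketch of that arithmetic; the dictionary layout and function name are illustrative. MCP Mark Verified is left out because no K2.6 baseline is quoted for it.

```python
# (K2.6 score, K2.7 Code score) as quoted in the About section.
SCORES = {
    "Kimi Code Bench v2": (50.9, 62.0),
    "Program Bench": (48.3, 53.6),
    "MLS Bench Lite": (26.7, 35.1),
}


def relative_gain_pct(old, new):
    """Relative improvement of new over old, in percent, to one decimal."""
    return round((new - old) / old * 100, 1)


for name, (k26, k27) in SCORES.items():
    print(f"{name}: {k26} -> {k27} (+{relative_gain_pct(k26, k27)}%)")
# Kimi Code Bench v2: 50.9 -> 62.0 (+21.8%)
# Program Bench: 48.3 -> 53.6 (+11.0%)
# MLS Bench Lite: 26.7 -> 35.1 (+31.5%)
```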
Frequently Asked Questions
What is Kimi K2.7 Code HighSpeed and who built it?
Kimi K2.7 Code HighSpeed is a fast-serving mode of Kimi K2.7 Code, an open-weight coding model built by Moonshot AI, a Beijing lab founded in 2023. The base K2.7 Code model launched June 12, 2026 as a fine-tune of Moonshot's K2.6 checkpoint, and the HighSpeed serving mode followed on June 15, 2026, rolling out first to the Kimi Code Beta channel. It uses a Mixture-of-Experts transformer with 1 trillion total parameters and 32 billion activated per token, 61 layers, 384 experts with 8 routed plus 1 shared expert, and Multi-head Latent Attention for KV cache compression. It was built to push agentic coding and tool-use performance past K2.6 while cutting reasoning-token overhead by roughly 30%. It scores 62.0 on Moonshot's own Kimi Code Bench v2 and 81.1 on the third-party MCP Mark Verified tool-use benchmark, ahead of Claude Opus 4.8's 76.4 on that test. The model costs $0.95 per 1M input tokens and $4.00 per 1M output tokens on Moonshot's native API.
How much does Kimi K2.7 Code HighSpeed cost per 1M tokens?
On Moonshot's native Kimi API, Kimi K2.7 Code HighSpeed costs $0.95 per 1M input tokens and $4.00 per 1M output tokens, with cache-hit input priced at $0.19 per 1M tokens. OpenRouter offers it cheaper, at $0.74 input and $3.50 output per 1M tokens, for teams that want third-party routing instead of a direct Moonshot account. A daily coding agent pushing roughly 1M input tokens and 200K output tokens costs about $1.75 on the native API. A 100K-token codebase review runs about $0.50. Because the weights are open under a Modified MIT License, teams can also self-host for the cost of GPU infrastructure alone, roughly 1TB of HBM for FP8 weights or about half that for the native INT4 build, with no per-token fee to Moonshot at all. There is no provisioned-throughput tier publicly listed.
What is Kimi K2.7 Code HighSpeed's context window and max output?
Kimi K2.7 Code HighSpeed shares its 256K (262,144-token) context window with the standard K2.7 Code release, unchanged from the K2.6 base model. The per-step generation limit is 49,152 tokens, though the model can chain multiple generation steps within a single long-horizon agentic session up to the full 262,144-token context budget. Multi-head Latent Attention compresses the KV cache, which is what makes self-hosting at full context feasible: on an 8x H200 SXM5 node with roughly 1TB of HBM used for FP8 weights, about 128GB remains for KV cache at 256K context with small batch sizes. Moonshot has not published an independent needle-in-haystack or long-context recall evaluation for K2.7 Code specifically, so recall quality above 100K tokens is unverified rather than benchmarked. Document and multi-file handling relies on the same context budget as any other input; there is no separate extended-context tier.
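The KV cache headroom figure follows from simple subtraction. The sketch below reproduces it under stated assumptions: 141GB of HBM per H200 (which gives the 1128GB node total quoted in the About section), "roughly 1TB" read as 1000GB, and the INT4 build taken as exactly half of FP8. It ignores activations, framework overhead, and fragmentation, so real headroom will be lower.

```python
GPU_COUNT = 8
HBM_PER_GPU_GB = 141                   # H200 SXM5 capacity; 8 x 141 = 1128 GB
FP8_WEIGHTS_GB = 1000                  # "roughly 1TB", read as 1000 GB
INT4_WEIGHTS_GB = FP8_WEIGHTS_GB / 2   # assumed: exactly half of FP8


def kv_headroom_gb(weights_gb, gpu_count=GPU_COUNT, hbm_per_gpu_gb=HBM_PER_GPU_GB):
    """HBM left for KV cache after loading the weights (upper bound)."""
    headroom = gpu_count * hbm_per_gpu_gb - weights_gb
    if headroom < 0:
        raise ValueError("weights do not fit in the node's total HBM")
    return headroom


print(kv_headroom_gb(FP8_WEIGHTS_GB))    # 128
print(kv_headroom_gb(INT4_WEIGHTS_GB))   # 628.0
```

On the same node the INT4 build would leave several times more room for KV cache, which matters for larger batches or more concurrent 256K-context sessions.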
How does Kimi K2.7 Code HighSpeed compare on benchmarks vs GLM 5.2 and Claude Opus 4.8?
The comparison is uneven because Moonshot has not submitted K2.7 Code to any independently audited coding benchmark. GLM 5.2 leads on audited suites, scoring 62.1% on SWE-bench Pro versus roughly 58.6% for K2.6 (K2.7 Code's predecessor; no K2.7 number exists), and Claude Opus 4.8 leads Terminal-Bench 2.1 at 85.0 against GLM 5.2's 81.0. Where K2.7 Code does win is MCP Mark Verified, a tool-invocation benchmark, scoring 81.1 against Claude Opus 4.8's 76.4. On Moonshot's own Kimi Code Bench v2, K2.7 Code scores 62.0, up 21.8% from K2.6's 50.9, but that benchmark has no independent verification or cross-vendor comparison points. In practice this means K2.7 Code HighSpeed is a credible choice for tool-heavy agentic workflows where it has a measured edge, but a weaker choice anywhere a team needs an audited SWE-bench or GPQA number to justify the pick.
Is Kimi K2.7 Code HighSpeed open source or proprietary?
Kimi K2.7 Code, and the HighSpeed serving mode built on it, is open-weight rather than fully open source: Moonshot AI publishes the model weights on Hugging Face and GitHub under a Modified MIT License. The license permits commercial use and modification, with an attribution requirement that applies specifically to very large-scale commercial deployments, the same style of condition Moonshot has used since the original K2 release. Weights are available in FP8 and native INT4 formats, the latter trained with quantization-aware training rather than post-hoc quantization for better quality retention. Community GGUF conversions are also available via Unsloth on Hugging Face. Self-hosting the FP8 weights needs roughly 1TB of GPU memory, realistically an 8x H200 SXM5-class node, while the native INT4 build needs roughly half that. There is no separate closed-source tier: the HighSpeed serving optimization is a deployment-side change, not a different license.
What modalities does Kimi K2.7 Code HighSpeed support?
Kimi K2.7 Code HighSpeed accepts text, image, and video input through a 400M-parameter MoonViT vision encoder built into the model, alongside native code and tool-call handling. Output is text, code, and structured tool calls; there is no audio input or output, so voice workflows need a separate ASR model in front of it and a TTS model behind it. Function calling and tool use are a specific strength: the model scores 81.1 on MCP Mark Verified across Notion, GitHub, Postgres, Filesystem, and Playwright environments, and preserves interleaved reasoning across multi-turn tool-calling sessions when preserve_thinking is set. Extended thinking runs on every request by default and cannot be fully disabled: a request that tries is silently rerouted to the K2.6 model instead. There is no documented web-browsing or code-execution sandbox built into the model itself.
Does Kimi K2.7 Code HighSpeed train on user data?
Moonshot AI has not publicly disclosed a data retention or training-on-inputs policy specifically for Kimi K2.7 Code HighSpeed, and no system card exists for this release to check against. There is no published SOC 2 Type II, ISO 27001, HIPAA-eligibility, or GDPR-compliance statement for the API, and no stated EU AI Act classification. Because the weights are open under a Modified MIT License, the most concrete way to control data handling is to self-host: run the FP8 or INT4 weights on your own infrastructure and no request data leaves your environment at all. Teams with strict data-governance requirements and no appetite for self-hosting should treat the hosted API as undocumented on this axis and confirm directly with Moonshot before sending sensitive data.
Who is Kimi K2.7 Code HighSpeed best for and who should avoid it?
It is best for teams running autonomous coding agents with heavy tool use, where its 81.1 MCP Mark Verified score, 180-260 tokens/sec HighSpeed throughput, and roughly 30% lower reasoning-token usage improve tool-call reliability, latency, and cost on long agentic loops. It also fits infrastructure teams that want to self-host an open-weight coding model, given the Modified MIT License and native INT4 quantization halving VRAM needs. Teams already using Kimi Code CLI as their agent framework get the tightest integration. It is a poor fit for procurement processes that require an independently audited SWE-bench or GPQA score, since none has been published for this release. It should also be avoided for regulated deployments needing documented safety posture, since no system card exists, and for general-purpose writing or chat use cases, where Moonshot itself recommends the base K2.6 model instead. Teams needing native audio should look elsewhere, since there is no audio input or output.