Mistral Just Shipped a 128B Open-Weight Frontier Coder. The Numbers Make Sonnet Sweat.
Mistral Medium 3.5 went into public preview this weekend. It is a 128B dense model with a 256K context, 77.6% on SWE-Bench Verified, and a price tag of $1.50 per million input tokens and $7.50 per million output. It also ships with open weights under a modified MIT license. That last sentence is the one that should make every API-only frontier lab pay attention.
I've been pulling the launch numbers into our tracker all morning. Here is what is actually different, and why this release matters more than the version bump suggests.
One Model, One Set of Weights
Mistral spent most of 2025 fragmenting its lineup. Magistral for reasoning, Devstral for code, Medium 3.x for general chat. Medium 3.5 collapses all three back into a single set of weights. One 128B dense model handles instruction following, multi-step reasoning, and coding tasks, with a configurable reasoning effort knob you set per request.
That last bit matters for anyone building agentic systems. You can hit the same endpoint with reasoning effort set low for a quick lookup, or crank it up for a complex agentic run, without switching models or paying to host two checkpoints. It is the same trick OpenAI and Anthropic use internally with thinking budgets, but exposed cleanly at the API level.
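In practice that looks like sending the same payload twice with one field changed. A minimal sketch, assuming a `reasoning_effort` field and the `mistral-medium-3.5` model ID — both names are my guesses, not confirmed parameter names from Mistral's API reference:

```python
# Sketch of per-request reasoning effort against a single endpoint.
# "reasoning_effort" and the model ID are assumed names; check
# Mistral's API docs for the real field.

def build_request(prompt: str, effort: str = "low") -> dict:
    """Build a chat-completions payload for Medium 3.5.

    `effort` is the hypothetical reasoning knob: "low" for quick
    lookups, "high" for long agentic runs. Same weights either way.
    """
    return {
        "model": "mistral-medium-3.5",
        "reasoning_effort": effort,  # hypothetical field name
        "messages": [{"role": "user", "content": prompt}],
    }

quick = build_request("What port does Redis default to?", effort="low")
deep = build_request("Refactor the billing module.", effort="high")

# Same model, same endpoint; only the knob differs.
print(quick["reasoning_effort"], deep["reasoning_effort"])  # -> low high
```

The point is operational: one deployed checkpoint serves both traffic classes, so there is no second model to host, version, or route to.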
Context window is 256K, larger than Sonnet 4.6's 200K and double GPT-5.4's old 128K. That is enough to fit roughly 500 pages of text or a good chunk of a real codebase in a single pass.
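The 500-page figure checks out on a back-of-envelope basis, assuming roughly 0.75 English words per token and 400 words per page (both are my rough conversion factors, not anything from Mistral's docs):

```python
# Back-of-envelope check on the "roughly 500 pages" claim.
# 0.75 words/token and 400 words/page are rough assumptions
# for English prose.
context_tokens = 256_000
words = context_tokens * 0.75   # ~192,000 words
pages = words / 400             # ~480 pages
print(round(pages))             # -> 480
```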
The Benchmarks: Real Coding Work, Not Trivia
Mistral leaned hard on practical benchmarks for this release, and it is the right call. The headline number is 77.6% on SWE-Bench Verified. That is the benchmark that actually matters for coding agents: real GitHub issues from real open-source repos, scored by whether the model's patch passes the hidden test suite.
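The scoring rule is worth spelling out because it is binary and unforgiving. A toy sketch of the mechanic, where `run_hidden_tests` stands in for the benchmark's real harness:

```python
# Minimal sketch of SWE-Bench-style scoring: an issue counts as
# resolved only if the model's patch makes the hidden test suite
# pass. `run_hidden_tests` is a stand-in, not the real harness.

def score(instances, run_hidden_tests) -> float:
    resolved = sum(1 for inst in instances if run_hidden_tests(inst["patch"]))
    return resolved / len(instances)

# Toy data: two of three patches pass the hidden tests.
toy = [{"patch": "ok"}, {"patch": "ok"}, {"patch": "broken"}]
print(round(score(toy, lambda p: p == "ok"), 3))  # -> 0.667
```

Partial credit does not exist: a patch that fixes the bug but breaks an unrelated test scores zero for that instance.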
Claude Sonnet 4.6 sits at about 79.6% on the same benchmark. So Sonnet still wins by two points. But Mistral Medium 3.5 costs half as much per token, ships with open weights, and runs self-hosted on as few as four GPUs through vLLM, SGLang, or Ollama. Two points of SWE-Bench is not nothing, but at this price point it is a comfortable trade.
The other number Mistral is highlighting is 91.4% on tau-cubed Telecom, a domain-specific agentic benchmark that tests tool use and multi-step problem solving in a customer support setting. That is a strong agentic score for a model in this weight class.
| Model | Input (per 1M) | Output (per 1M) | Context | Open |
|---|---|---|---|---|
| Mistral Medium 3.5 | $1.50 | $7.50 | 256K | Yes (modified MIT) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | No |
| GPT-5.5 | $5.00 | $30.00 | 1M | No |
| Gemini 3.1 Pro | $1.25 | $5.00 | 2M | No |
| DeepSeek V4 | $0.55 | $2.20 | 128K | Yes (MIT) |
On price-to-capability, Mistral is now sitting in a real sweet spot. Cheaper than Sonnet 4.6, more capable on practical coding than the open-weight runners-up at this size, and with open weights so you can actually rent or own the inference instead of being locked to a single API gateway.
Vibe Goes Async, Le Chat Goes to Work
The model is half the story. The other half is what Mistral built on top of it.
Vibe, Mistral's coding agent, now runs in the cloud. You can spawn a session from the CLI or from Le Chat, and the agent runs asynchronously in an isolated sandbox. Sessions opened locally can be teleported to the cloud without losing history or state. Translation: you can kick off a long-running agentic coding task on your laptop, close the lid, and have it keep working. When you reopen, it picks up exactly where it was.
That is the same shape Anthropic offered with Claude Code's remote sessions and what OpenAI shipped with Codex on Bedrock last week. Mistral is now in that bracket too, and they have done it with an open-weight model underneath, which means anyone can host the agent infrastructure end-to-end on their own iron if they want to.
Le Chat, Mistral's consumer-facing assistant, also got a Work mode. It runs the same Medium 3.5 model and can chain multi-step workflows across email and calendar through built-in connectors. Sensitive actions require explicit user approval before the agent executes them, which matches the consent model we wrote about in our coverage of the Cloudflare-Stripe agent provisioning protocol from Friday.
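The consent model is a simple gate in shape, even if Le Chat's internals are surely more involved. An illustrative sketch, not Mistral's actual implementation — the action names and the `approve` callback are mine:

```python
# Illustrative consent gate for agent actions. Anything classed as
# sensitive (sending, deleting, paying) is held until the user
# explicitly approves it. Not Le Chat's real implementation.

SENSITIVE = {"send_email", "delete_event", "make_payment"}

def execute(action: str, payload: dict, approve) -> str:
    if action in SENSITIVE and not approve(action, payload):
        return f"blocked: user declined {action}"
    return f"executed: {action}"

# Reading a calendar needs no approval; sending mail does.
print(execute("read_calendar", {}, approve=lambda a, p: False))
# -> executed: read_calendar
print(execute("send_email", {"to": "x@example.com"}, approve=lambda a, p: False))
# -> blocked: user declined send_email
```

The design choice that matters is that the gate sits between planning and execution, so the agent can draft the email freely but cannot send it without a human in the loop.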
Why the Open Weights Part Actually Matters
The pricing is good. The benchmarks are good. The agent infrastructure is good. But the piece that resets the conversation is the license.
Mistral Medium 3.5 ships with open weights under a modified MIT license. You can pull the checkpoint from Hugging Face, host it on your own four-GPU node with vLLM, fine-tune it on your domain data, and ship it inside an air-gapped network. None of that is true for Sonnet 4.6, GPT-5.5, or Gemini 3.1 Pro, which are all closed-weight, API-only products.
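The four-GPU claim is plausible on paper. A rough feasibility check, under my own assumptions (FP8 quantization at about one byte per parameter, 80 GB cards) rather than anything Mistral has published:

```python
# Rough feasibility check for the four-GPU claim. Assumptions are
# mine, not Mistral's: FP8 weights at ~1 byte/param, 80 GB GPUs,
# with the remainder left for KV cache and activations.
params = 128e9
weight_gb = params * 1 / 1e9   # ~128 GB of weights at FP8
node_gb = 4 * 80               # 320 GB across four 80 GB GPUs
headroom_gb = node_gb - weight_gb
print(weight_gb, headroom_gb)  # -> 128.0 192.0
```

Roughly 190 GB of headroom for KV cache is what makes the 256K context usable on a node that size; at BF16 the weights alone would take 256 GB and the math gets much tighter.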
We covered this dynamic with DeepSeek V4 earlier in the year: when an open-weight model gets close enough to frontier on capability, the locked-in pricing of the proprietary labs starts looking like rent. Mistral is now playing the same card, but at a quality tier where the trade-off versus Sonnet is two points on SWE-Bench instead of ten.
For regulated industries, defense contractors, and anyone in a country where US-hosted AI is a procurement headache, that two-point benchmark gap stops mattering. Self-hosted Medium 3.5 on your own hardware lets you keep the data in your VPC, customize the model with your own fine-tunes, and skip the per-token meter entirely once you have amortized the GPU spend.
The Pricing Criticism
One caveat. The hosted API price of $1.50 input and $7.50 output drew some pushback from the Mistral community on launch day. Compared to DeepSeek V4 at $0.55 in and $2.20 out, or Gemini 3.1 Pro at $1.25 in and $5.00 out, Mistral sits at the top of the value tier. The argument from cost-sensitive shops is that it is squeezed from both sides: DeepSeek owns the bargain end, and if you are going to pay for quality anyway, why not spend the extra $1.50 and buy Sonnet at $3.00 in?
The answer Mistral is implicitly giving is that you should run it yourself. The hosted API is for evaluation and convenience. The real value proposition is the open weights and the inference economics that come with hosting it on your own infrastructure. If you are running enough volume that the per-token math matters, you should not be hitting the Mistral API endpoint in the first place.
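The per-token math is easy to run yourself. A break-even sketch using the launch prices from the table above; the $8/hour all-in cost for a rented four-GPU node is a made-up placeholder, so swap in your own quote:

```python
# Break-even sketch: at what monthly metered spend does a rented
# four-GPU node beat the hosted API? The $8/hr node cost is a
# hypothetical placeholder; per-token prices are the launch numbers.
node_cost_month = 8 * 24 * 30        # $5,760/month, assumed rental
in_price, out_price = 1.50, 7.50     # $ per 1M tokens, hosted API

def api_cost(million_in: float, million_out: float) -> float:
    return million_in * in_price + million_out * out_price

# e.g. 2,000M input + 400M output tokens per month:
print(api_cost(2000, 400))  # -> 6000.0
print(node_cost_month)      # -> 5760
```

Under these assumptions, a shop burning a few billion tokens a month is already past break-even, which is exactly the audience Mistral is telling to download the weights.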
Where This Leaves the Market
The frontier model market is now stratifying in a way that is starting to matter for buyers. The top tier (GPT-5.5 and Claude Opus 4.7) is doubling down on premium pricing for raw capability. The mid-tier (Sonnet 4.6, Gemini 3.1 Pro) is fighting on price-to-capability ratios. And the open-weight tier (Mistral Medium 3.5, DeepSeek V4, Llama 4) is pulling closer to the closed mid-tier on benchmarks while owning the self-hosting and fine-tuning use cases outright.
The harness gap we wrote about last week is also relevant here. SWE-Bench Verified scores depend heavily on the agent harness wrapping the model. Mistral built Vibe specifically to extract maximum performance from Medium 3.5 on agentic coding. If you wire Medium 3.5 into a generic harness, expect a lower number. If you use it through Vibe, you should be able to reproduce close to the published 77.6%.
Our Take
Mistral Medium 3.5 is the most interesting open-weight release of the year so far. It is not topping every leaderboard, and it is not the cheapest option on the market. But it is the first model that meaningfully closes the SWE-Bench gap to Claude Sonnet 4.6 while staying open-weight, runnable on commodity GPUs, and licensed permissively enough for commercial use.
For our own stack, this changes the calculus on a few internal tools we have been drafting against Sonnet. If we can run Medium 3.5 on a Hetzner box for the price of one good dinner per day and keep our data on our own iron, the API cost of Sonnet starts looking like a tax we are paying for two points of benchmark performance.
We are adding Medium 3.5 to our models tracker, cost calculator, and benchmarks page today. We are also wiring it into our edge latency probe so we can compare cold-start and first-token times against Sonnet, GPT-5.5, and Gemini under real load.
Watch this one closely. The open-weight frontier is no longer a quality compromise. It is just a different deployment story, and Mistral just made that story two points more expensive to ignore.