
Microsoft Shipped Seven of Its Own Models. The One That Counts Lives Inside Copilot.

Ripper · 6 min read
MODEL RELEASE

At Build on June 2, Microsoft rolled out seven models it built itself, spanning image, voice, transcription, reasoning, and coding. Two of them matter. MAI-Thinking-1 is the company's first in-house reasoning model. MAI-Code-1-Flash is a small coding model that already lives in GitHub Copilot and, by Microsoft's own numbers, beats Claude Haiku 4.5 on accuracy while undercutting it on cost. The headline everyone wrote was about benchmarks. The story underneath is Microsoft building a door it can walk through if it ever wants out of the OpenAI deal.

I've spent the day reading the model cards and the Build keynote so you do not have to. Here is what shipped, what the numbers actually say, and why the small coding model is the one I would watch.

What Microsoft Actually Launched

The release came from Microsoft AI's Superintelligence team, and it is a family, not a single flagship. The named models include MAI-Thinking-1 (reasoning), MAI-Code-1 and MAI-Code-1-Flash (coding), MAI-Image-2.5 (image generation and editing), MAI-Transcribe-1.5 (speech to text), and MAI-Voice-2 (text to speech). Microsoft says all of it was trained end to end on clean, appropriately licensed data, and reporting on MAI-Thinking-1 says it was built without OpenAI data. That detail is the whole point.

MAI-Image-2.5 reportedly entered the LMArena image editing board near the No. 2 spot. MAI-Transcribe-1.5 is being pitched on FLEURS and Artificial Analysis accuracy. Those are fine. But the two that move the competitive map are the reasoning model and the small coding model, so let me take them in turn.

MAI-Code-1-Flash: Small, Cheap, and Aimed at Haiku

MAI-Code-1-Flash is the one to study, and not because it is the biggest. It is the opposite. Reported at roughly 5 billion parameters, it is a lightweight, agentic coding model trained directly against the GitHub Copilot harness that developers actually run. Microsoft did not optimize it for a leaderboard. It optimized it for the environment where it ships, which is a different and harder thing to do well.

Microsoft benchmarked it head to head against Claude Haiku 4.5 on four coding evaluations using the same production harness, measuring both pass rate and the average number of tokens spent per task. By Microsoft's numbers, MAI-Code-1-Flash wins on all four, and the margin on the messy, real-world tasks of SWE-Bench Pro is the one that stands out: 51.2% versus 35.2%, a 16-point lead. It is also leaner, using up to 60% fewer tokens on SWE-Bench Verified.

| Metric | MAI-Code-1-Flash | Claude Haiku 4.5 |
| --- | --- | --- |
| SWE-Bench Pro | 51.2% | 35.2% |
| SWE-Bench Verified token use | up to 60% fewer | baseline |
| IF Bench (precise instructions) | +28.9-point lead | baseline |
| Adversarial reasoning set (186 questions) | 85.8% | below MAI |
| Copilot token price | cheaper | baseline |
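
To make the comparison concrete, here is a minimal Python sketch of the arithmetic behind a table like this one. The 51.2% and 35.2% pass rates are the figures Microsoft reported. The token counts and prices in the cost example are hypothetical placeholders, because the announcement only gives relative figures ("up to 60% fewer", "cheaper"). The helper names are mine and are not part of Microsoft's harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessResult:
    """One model's result on one benchmark run (illustrative field names)."""
    pass_rate_pct: float   # share of tasks solved, in percent
    avg_tokens: float      # average tokens spent per attempted task
    price_per_mtok: float  # price per million tokens, in dollars


def point_lead(a_pct: float, b_pct: float) -> float:
    """Absolute gap in percentage points, the '16-point lead' style of figure."""
    return a_pct - b_pct


def relative_lead_pct(a_pct: float, b_pct: float) -> float:
    """Gap expressed as a percentage of the baseline's own pass rate."""
    return (a_pct - b_pct) / b_pct * 100.0


def cost_per_solved_task(result: HarnessResult) -> float:
    """Dollars spent per task actually solved: failed attempts still cost tokens."""
    cost_per_attempt = result.avg_tokens / 1_000_000 * result.price_per_mtok
    return cost_per_attempt / (result.pass_rate_pct / 100.0)


if __name__ == "__main__":
    # Reported SWE-Bench Pro pass rates from the article.
    print("point lead:", round(point_lead(51.2, 35.2), 1))
    print("relative lead (%):", round(relative_lead_pct(51.2, 35.2), 1))

    # HYPOTHETICAL token counts and prices, chosen only to show the formula.
    small_model = HarnessResult(pass_rate_pct=50.0, avg_tokens=40_000, price_per_mtok=1.0)
    baseline = HarnessResult(pass_rate_pct=35.0, avg_tokens=100_000, price_per_mtok=1.0)
    print("cost per solved task (small):", cost_per_solved_task(small_model))
    print("cost per solved task (baseline):", cost_per_solved_task(baseline))
```

The point of dividing by pass rate is that a higher pass rate and a lower token count compound: a model that solves more tasks while spending fewer tokens per attempt gets cheaper per solved task on both axes at once.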

A couple of caveats before anyone treats this as settled. These are Microsoft's own numbers, run on Microsoft's harness, against a single competitor it picked. Haiku 4.5 is Anthropic's cheap, fast tier, not its strong one, so beating it is the right fight for a 5B model, but it is not a claim about frontier coding. Microsoft also notes its own weak spots: on adversarial categories like Einstellung traps, where a familiar problem is twisted to punish pattern-matching, the model still sits below 50%. Honest of them to publish that. Watch for independent reproductions before you rewire your stack.

The reason it still matters: this thing is already in the Copilot model picker and the Auto picker for individual VS Code users, rolling out across paid tiers. It is priced under Haiku 4.5 in Copilot's token billing. If you write code in Copilot, a Microsoft model is about to start quietly handling some of your requests whether you chose it or not, and the per-token bill goes to Microsoft instead of to a third party.

MAI-Thinking-1: The First Reasoning Model Out of Redmond

MAI-Thinking-1 is the bigger sibling and the bigger statement. It is a mixture-of-experts model with about 35 billion active parameters and a 256K context window, and it is Microsoft's first reasoning model of its own. The reported scores are competitive rather than dominant: 97% on AIME 2025 and 53% on SWE-Bench Pro. More interesting than the raw numbers is the human evaluation: independent raters on Surge reportedly preferred it over Claude Sonnet 4.6 in blind side-by-side quality comparisons. Take that with the usual salt, but a midsize model trading blows with Sonnet at lower token cost is a real result.

It is in private preview through Microsoft Foundry, not general availability, so this is a signal of intent more than a product you can build on today. The signal is loud anyway. Microsoft now has its own reasoning model, its own coding models, its own image, voice, and transcription models, and a next-generation GB200 cluster it says is already running them. That is a full stack, owned end to end.

The Real Story Is Independence, Not Benchmarks

Strip away the scores and here is the structural move. Microsoft has spent years as OpenAI's largest backer and its primary distribution channel, paying to run OpenAI's models for hundreds of millions of Copilot and Azure customers. Every one of those calls is a cost line and a dependency. The MAI family lets Microsoft serve a growing share of those calls on its own Azure infrastructure, at a price it sets, with a model it controls.

That is the same playbook every hyperscaler is running right now. Own the layer you used to rent. The economics are simple: as the leading models get more expensive, a good-enough in-house model that you do not pay a third party for goes straight to margin, and you can pass some of the savings to developers to keep them on your platform. Coding is the smartest place to start, because Copilot is a captive, high-volume surface where a small fast model earns its keep on millions of low-stakes completions.
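
The margin argument is a weighted average, and a rough sketch shows its shape. Everything numeric here is a hypothetical placeholder: Microsoft has not published per-call serving costs or the share of traffic it routes to MAI models, and the function names are mine.

```python
def blended_cost_per_call(in_house_share: float,
                          in_house_cost: float,
                          third_party_cost: float) -> float:
    """Average cost per call when a share of traffic is served in-house.

    in_house_share is a fraction between 0 and 1; both costs use the same unit.
    """
    if not 0.0 <= in_house_share <= 1.0:
        raise ValueError("in_house_share must be between 0 and 1")
    return (in_house_share * in_house_cost
            + (1.0 - in_house_share) * third_party_cost)


def savings_per_call(in_house_share: float,
                     in_house_cost: float,
                     third_party_cost: float) -> float:
    """Cost avoided per call, relative to sending every call to the third party."""
    return third_party_cost - blended_cost_per_call(
        in_house_share, in_house_cost, third_party_cost)


if __name__ == "__main__":
    # HYPOTHETICAL inputs: in-house call costs 1 unit, third-party call costs 4.
    for share in (0.0, 0.25, 0.5, 1.0):
        print(share, blended_cost_per_call(share, 1.0, 4.0),
              savings_per_call(share, 1.0, 4.0))
```

Under these assumptions the savings scale linearly with the share of calls moved in-house, which is why a high-volume surface like Copilot is the obvious place to start shifting traffic.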

It also lands inside a louder week. Microsoft Build, Nvidia's GTC Taipei, and ServiceNow Knowledge all hit the same window and converged on one message: the agent runtime is the product now, and the model is becoming a component inside it. Google pushed Gemini 3.5 Flash to general availability in the same stretch, and OpenAI confirmed it is retiring GPT-4.5 from ChatGPT on June 27. The frontier labs are still setting the ceiling. The platforms are busy commoditizing the floor underneath them, and Microsoft just planted a flag on the floor.

Our Take

MAI-Code-1-Flash will not show up on a frontier leaderboard, and that is exactly why it is the most important thing Microsoft shipped this week. A 5B model that beats a competitor on real coding tasks, burns fewer tokens, costs less, and is already wired into the editor where developers live is a commercial weapon, not a research flex. The benchmark you should care about is the invoice.

For now, nothing about your model choices has to change. The frontier still belongs to the biggest models from OpenAI, Anthropic, and Google, and MAI-Thinking-1 is a preview, not a dependency you can take. But the direction is unmistakable. The companies that distribute AI are no longer content to resell it. We are adding the MAI models to our models tracker and watching the independent benchmark reproductions over the next week. Microsoft's self-reported numbers are strong. Third-party validation is what counts.

One more thing worth saying plainly: a vendor publishing its own losses, like that sub-50% adversarial result, is a small act of credibility in a field that mostly does not bother. I will take more of that.