
DeepSeek V4 Is The First Open Source Frontier Model. Closed Labs Should Be Worried.

Marcus Chen · 7 min read

Yesterday afternoon, while the Western AI press was still digesting GPT-5.5, a Chinese quant fund's research lab quietly uploaded 1.6 trillion parameters to Hugging Face under an MIT license. DeepSeek V4 is here. Twenty-four hours of community testing later, one thing is clear: this is the first time an open-weight model has actually caught the frontier, and the gap between open and closed has effectively collapsed for most production workloads.

I've spent the morning running V4 through our pricing pipeline, comparing benchmark submissions, and reading the technical report. The pricing alone would have been the headline. But the architectural work underneath might be more important.

What Actually Shipped

DeepSeek V4 is a two-model family. V4-Pro is the flagship: 1.6 trillion total parameters with 49 billion active per token, pre-trained on 33 trillion tokens. V4-Flash is the smaller sibling at 284 billion total and 13 billion active, trained on 32 trillion tokens. Both ship with a 1-million-token context window and a 384K maximum output. Both use a Mixture of Experts (MoE) architecture. Both are open source under MIT.
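
The MoE figures above imply extreme sparsity per forward pass. A quick back-of-envelope check, using only the parameter counts quoted in this section:

```python
# Back-of-envelope MoE sparsity from the published parameter counts.
# All numbers come straight from the model card figures quoted above.
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of total parameters used per token in an MoE model."""
    return active_params_b / total_params_b

v4_pro = active_fraction(1600, 49)    # 1.6T total, 49B active
v4_flash = active_fraction(284, 13)   # 284B total, 13B active

print(f"V4-Pro activates {v4_pro:.1%} of its weights per token")
print(f"V4-Flash activates {v4_flash:.1%} of its weights per token")
```

Roughly 3% of V4-Pro's weights fire on any given token, which is how a 1.6-trillion-parameter model can price like a much smaller one.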

The weights are on Hugging Face. The API went live at midnight UTC, supports both the OpenAI ChatCompletions and Anthropic Messages formats, and is currently flagged as a preview release.
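
In practice the OpenAI-compatible surface means existing client code should port with a URL and model-id swap. A minimal sketch of the request shape — note the base URL and the model id `deepseek-v4-flash` are my assumptions, not published identifiers; only the ChatCompletions schema itself is standard:

```python
import json

# Hypothetical request against DeepSeek's OpenAI-compatible endpoint.
# BASE_URL and the model id are assumptions for illustration.
BASE_URL = "https://api.deepseek.com/v1/chat/completions"  # assumed

payload = {
    "model": "deepseek-v4-flash",  # assumed model id
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Summarize this diff."},
    ],
    "max_tokens": 1024,
}

# Any OpenAI-compatible SDK, or a plain HTTP POST with a bearer token,
# can send this body as-is:
body = json.dumps(payload)
print(body[:60])
```

Drop-in compatibility is the point: if your stack already speaks ChatCompletions, evaluating V4 is a config change, not a rewrite.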

The team also confirmed close integration with Huawei's new Ascend 950 inference chips. This is the part Western coverage keeps glossing over. DeepSeek is no longer dependent on Nvidia for inference on its own platform. That decoupling is a story all by itself.

The Pricing Is Not A Typo

Here is the table that has been making the rounds since launch. I rebuilt it with our own numbers and added the relevant Western frontier comparisons.

| Model | Input (per 1M) | Output (per 1M) | Context | License |
|---|---|---|---|---|
| DeepSeek V4 Flash | $0.14 | $0.28 | 1M | MIT |
| DeepSeek V4 Pro | $1.74 | $3.48 | 1M | MIT |
| GPT-5.5 | $5.00 | $30.00 | 1M | Proprietary |
| Claude Opus 4.7 | $15.00 | $75.00 | 1M | Proprietary |
| Gemini 3.1 Pro | $1.25 | $5.00 | 2M | Proprietary |

V4 Pro output tokens cost $3.48 per million. GPT-5.5 output tokens cost $30. That is roughly an 8.6x gap on output and a 2.9x gap on input, before we even account for the fact that the V4 weights are downloadable and self-hostable. If you want to skip the DeepSeek API entirely and run on your own GPUs, you can. The license permits it.

V4 Flash is more interesting still. At $0.14 in and $0.28 out, this is frontier-class performance at chatbot pricing. The closest analog from a US lab is GPT-4o Mini at $0.15 in and $0.60 out. Flash is cheaper than that and probably more capable.
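
To make the gap concrete, here is a small cost calculator built from the table above. The prices are the ones quoted in this article; the 8K-in / 1K-out request mix is an illustrative assumption, roughly a single agent step:

```python
# Per-request cost comparison using the prices quoted in the table above
# (USD per million tokens, input/output). The token mix is an assumption.
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro":   (1.74, 3.48),
    "gpt-4o-mini":       (0.15, 0.60),
    "gpt-5.5":           (5.00, 30.00),
}

def request_cost(model: str, input_toks: int, output_toks: int) -> float:
    """Cost in USD for one request with the given token counts."""
    inp, out = PRICES[model]
    return (input_toks * inp + output_toks * out) / 1_000_000

# Example: one agent step with 8K input tokens and 1K output tokens.
for model in PRICES:
    print(f"{model:18s} ${request_cost(model, 8_000, 1_000):.4f}")
```

At that mix, a V4 Flash request costs $0.0014 against GPT-4o Mini's $0.0018, and V4 Pro's $0.0174 against GPT-5.5's $0.07; the multipliers in the prose fall straight out of the arithmetic.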

The Benchmarks Are Closer Than I Expected

DeepSeek released V4 Pro's benchmark numbers alongside the model card. Independent reproductions have been trickling in all morning. The picture so far:

| Benchmark | V4 Pro | Reference |
|---|---|---|
| SWE-bench Verified | 80.6% | Within 0.2 pts of Claude Opus 4.6 |
| Codeforces (Elo) | 3,206 | 23rd among human competitors |
| FrontierMath Tier 4 | 22.1% | Below GPT-5.5 (35.4%), above Opus 4.7 |
| Terminal-Bench 2.0 | 71.4% | GPT-5.5 leads at 82.7% |
| Artificial Analysis Index | 54 | Sits between GPT-5.2 and GPT-5.4 |

80.6% on SWE-bench Verified is the number that matters. That is the gold standard for real-world coding agents, and V4 Pro is essentially tied with Claude Opus 4.6, the model that anchored the high end of the agentic coding market for most of last year. Opus 4.7 and GPT-5.5 are still ahead, but the gap is now small enough to call a tie for many production workloads.

The 3,206 Codeforces rating is the kind of competitive programming result that used to require a frontier closed model. Ranking 23rd among human competitors is not a marginal score. It is genuinely strong.

The FrontierMath number is the only soft spot. 22.1% is well below GPT-5.5's 35.4%, which I covered in yesterday's GPT-5.5 piece. Frontier mathematics still favors the labs with the deepest reasoning training pipelines. That gap will close eventually, but for now, if your application is dominated by hard math, GPT-5.5 is still the right model.

The Architecture Story Is The Real News

Pricing wars get the headlines, but the technical paper is where this gets interesting. DeepSeek introduced what they call Hybrid Attention Architecture, a routing scheme that mixes traditional dense attention with a learned sparse retrieval layer. The claim is that at 1M token context, V4 Pro requires only 27% of the per-token inference FLOPs and 10% of the KV cache footprint of V3.2.

That is not a small optimization. Long context inference cost has been the main reason frontier labs charge a premium for million token windows. If V4 Pro genuinely runs 1M context at a tenth of the memory cost of last year's architecture, it changes the economics of long context applications. Codebase ingestion, multi-document reasoning, and agent loops with extensive tool history all become significantly cheaper to run.
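
The KV-cache claim is easy to sanity-check with standard transformer arithmetic. None of the dimensions below are published for V4 Pro — the layer count, KV heads, head size, and fp8 cache precision are assumptions chosen only to show the scale of a conventional dense-attention cache at 1M tokens, and what the claimed 10% footprint would mean:

```python
# Illustrative KV-cache arithmetic. layers / kv_heads / head_dim / fp8
# are ASSUMED values, not V4 Pro's real (unpublished) configuration.
def kv_cache_gb(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per=1):
    """Memory for K and V caches across all layers, in GB."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

baseline = kv_cache_gb(1_000_000)   # conventional dense-attention cache
hybrid = baseline * 0.10            # the paper's claimed 10% footprint

print(f"baseline: {baseline:.1f} GB, hybrid: {hybrid:.1f} GB")
```

Under those assumptions, a 1M-token dense cache runs to roughly 120 GB per sequence, while the hybrid scheme would need about 12 GB — the difference between dedicating a whole accelerator to one conversation and batching many of them.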

For self-hosting, this matters even more. A model whose 1M context fits comfortably on a small cluster of H200s or Ascend 950s is fundamentally different from one that requires a small data center to serve. DeepSeek published serving cost estimates of around $0.40 per million input tokens at typical utilization on Ascend hardware. That is the floor.
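
To turn that $0.40-per-million floor into an operating budget: the throughput figure below is an illustrative assumption, not anything DeepSeek published.

```python
# Translating DeepSeek's published serving estimate into a monthly bill.
# The sustained-throughput figure is an illustrative assumption.
COST_PER_M_INPUT = 0.40  # USD per million input tokens (published estimate)

def monthly_cost(tokens_per_sec: float, days: int = 30) -> float:
    """Monthly input-token serving cost at a sustained throughput."""
    million_tokens = tokens_per_sec * 86_400 * days / 1e6
    return million_tokens * COST_PER_M_INPUT

# e.g. a cluster sustaining 5,000 input tokens/sec around the clock:
print(f"${monthly_cost(5_000):,.0f}/month")
```

At 5,000 input tokens per second around the clock, the floor works out to about $5,200 a month — the kind of number that makes self-hosted long-context serving a line item rather than a capital project.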

What Closed Labs Have To Say About It

OpenAI launched GPT-5.5 two days ago at double the price of GPT-5.4. The pitch was that frontier capability deserves a premium. That pitch held up for about 36 hours before DeepSeek undercut the entire frontier. GPT-5.5 still wins on FrontierMath and Terminal-Bench, but the price gap is now 8 to 10x for tasks where V4 Pro is roughly equivalent. Many enterprise buyers will not pay that gap.

Anthropic is in a more complicated spot. Opus 4.7 is the most expensive proprietary model on the market at $15 in and $75 out. The pitch has been deep reasoning on agent workflows and code. V4 Pro just matched Opus 4.6 on SWE-bench Verified at less than a twentieth of the output cost. The Opus tier needs a clearer differentiation story than it has today, and Claude Mythos cannot be that story for paying customers because it is gated.

Google's Gemini 3.1 Pro is actually fine. The 2M context window is still unmatched, and Gemini Flash is competitive with V4 Flash at the budget end. Google was always playing a different game on price. They are mostly insulated from this release.

What This Means If You're Building

For new applications, V4 Flash is now the default starting point for prototyping anything that does not require the absolute frontier. It is cheaper than the budget tier of every US lab, more capable, and the weights are downloadable if you ever want to leave the managed API. There is almost no reason to start a new project on GPT-4o Mini today.

For production agentic coding applications, V4 Pro is now a serious option. Real benchmark parity with Claude Opus 4.6 at a fraction of the cost is a genuine offer. The risk is deployment maturity. The DeepSeek API is still labeled preview and has had two minor incidents in the last 24 hours according to our incident log. If you need 99.9% uptime today, stick with the established providers and revisit V4 in six weeks.

For anyone running self-hosted inference, this is an unambiguous upgrade path. The MIT license, the architectural efficiency gains, and the published serving cost guidance mean you can rebuild your inference stack on V4 weights with confidence. Several inference platforms (Together, Fireworks, Lepton) already announced V4 Pro endpoints this morning at sub-$2 output pricing.

Our Take

For two years, the open source AI conversation has been about whether the gap to frontier closed models was closing fast enough to matter for real applications. DeepSeek V4 is the answer. The gap is closed for the workloads that account for most enterprise AI spend: code generation, agent loops, long context retrieval, and routine reasoning. The gap is still real, but only at the very top of the capability curve, where models like GPT-5.5 and Claude Mythos still hold a clear lead.

The pricing implications are going to ripple for months. Closed labs will have to either justify their premium with capabilities V4 cannot match, drop prices into the V4 range, or accept that the floor of the frontier is now open source. None of those options are easy. We covered the broader pricing trajectory in our pricing floor analysis two weeks ago. V4 just yanked that floor down another notch.

We are adding both V4 models to our models tracker and cost calculator today. We will run our own SWE-bench Verified reproduction over the weekend and update the page with results. Independent benchmark verification is what matters, and the next two weeks of community testing will tell us whether the released numbers hold under adversarial conditions.

One last thought: a year ago, DeepSeek R1 was a surprise. Today, V4 is an expected move from a lab that has earned its reputation. The next surprise is probably Mistral, Qwen, or a lab nobody is watching yet. The frontier is no longer a closed club.