
Anthropic Just Ran the First Real-Money AI Agent Marketplace. The Results Reveal a Coming Inequality.

Kira Nolan · 7 min read

Anthropic just published the results of Project Deal, an internal experiment where 69 employees handed their wallets to Claude and let the model buy and sell physical goods on their behalf. The agents made 186 trades. They moved more than $4,000 in real money. And Anthropic ran a secret A/B test on top of all of it, the kind of test that should make anyone planning to deploy agents at scale stop and read it twice.

The headline finding is not that AI agents can negotiate. We knew that. The finding that matters is what happened when people were given different model tiers without their knowledge. Some got Opus 4.5. Some got Haiku 4.5. The Opus users won, every time. None of the Haiku users noticed they were losing.

What Project Deal Actually Was

Anthropic ran the experiment in December 2025 inside its San Francisco office. Sixty-nine employees opted in. Each one got a $100 gift card balance and a small marketplace where they could list items they owned and bid on items their coworkers had listed. Snowboards. Ping-pong balls. Office plants. Real stuff, real money, real ownership transfers when the experiment ended.

The catch: every participant was represented by a Claude agent. The agent posted listings, set asking prices, made offers, ran counteroffers, and closed deals. Humans set high-level preferences ("I want to sell this snowboard, target $80") and then stepped back. The agents talked to each other. The market cleared.

Anthropic released full results on April 25. You can read their summary on the Project Deal page they published. The numbers are interesting. The methodology is the part that should land harder.

The Hidden A/B Test

What participants did not know was that Anthropic was running four parallel marketplaces in the background. One was the "real" marketplace, where every participant got the same model and the trades were honored. The other three were study markets, where Anthropic randomly assigned participants to Opus 4.5 or Haiku 4.5.

This was a clean experiment. Same participants. Same items. Same starting cash. Different agents. The results were not subtle.

| Metric | Opus 4.5 advantage | Notes |
| --- | --- | --- |
| Seller revenue per item | +$2.68 | Opus sellers got higher final prices |
| Buyer savings per item | +$2.45 | Opus buyers paid less per deal |
| Deals closed | +2.07 | Opus completed more transactions overall |
| User-perceived performance | No difference | Haiku users did not notice they were losing |

Two and a half dollars per item does not sound like much. Scale it. If your agent runs hundreds of decisions a day, and each one costs you a couple of dollars in expected value because it is reasoning with a smaller model, you are bleeding cash. The other side is bleeding it onto a competitor with deeper pockets.

The user-perception part is the one that rattles me. Every Haiku participant came away from the experiment thinking their agent did fine. They had nothing to compare it to. They saw their listings sell, their bids accepted, their balance move. From the inside, the experience felt complete. From the outside, they were systematically getting worse outcomes.

Why This Matters Outside an Office

Project Deal is the first real-money agent-to-agent commerce study at scale that I am aware of, and Anthropic published it because they want to highlight that this kind of market is coming. It is. MCP just crossed 97 million installs. OpenAI shipped workspace agents this week. Every major lab is putting agents into the loop on tasks that used to be human-only. Procurement. Travel. Negotiation. Hiring funnels. Insurance.

When those agents start meeting other agents, the negotiation dynamics from Project Deal stop being a curiosity and start being a market structure. If your supplier's agent is running on a $0.25 input model and yours is running on a $5 input model, there is now a measurable expectation that the better-resourced agent extracts more value, every time, on every line item.

That is a kind of inequality the agent era has not had to confront yet. The cheap models are good enough for most consumer chat. They are clearly not good enough for adversarial negotiation against a frontier-tier counterparty.

The Pricing Math, Made Concrete

Pull up the actual API costs and it gets uglier. Here is what the gap looks like at current list prices.

| Model | Input (per 1M) | Output (per 1M) | Tier |
| --- | --- | --- | --- |
| Claude Opus 4.7 | $15.00 | $75.00 | Frontier |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Mid |
| Claude Haiku 4.5 | $0.25 | $1.25 | Budget |
| GPT-5.5 | $5.00 | $30.00 | Frontier |
| Gemini 3.1 Pro | $1.25 | $5.00 | Mid |

Opus output costs 60x more than Haiku output. The Project Deal numbers say that 60x cost gap turned into roughly $5 of edge per closed deal (the seller side plus buyer side). On a household-level shopping agent that closes one deal a week, the Haiku version saves you about $80 a year on inference and loses you about $260 in worse outcomes. The cheap option is not the cheap option.
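The break-even is easy to check yourself. A minimal sketch of that household-agent math, using the list prices and the roughly $5.13 per-deal edge ($2.68 seller side plus $2.45 buyer side) from the article; the token volumes per negotiation (80k input / 5k output) are illustrative assumptions, not figures from the study.

```python
# Back-of-envelope break-even: frontier vs. budget model for a shopping agent.
# Prices are list prices from the table above; token counts are assumed.

def annual_cost(input_price, output_price, input_tok, output_tok, deals_per_year):
    """Inference spend per year; prices are dollars per 1M tokens."""
    per_deal = (input_tok * input_price + output_tok * output_price) / 1_000_000
    return per_deal * deals_per_year

DEALS = 52                        # one closed deal a week
IN_TOK, OUT_TOK = 80_000, 5_000   # assumed tokens consumed per negotiation

haiku = annual_cost(0.25, 1.25, IN_TOK, OUT_TOK, DEALS)
opus = annual_cost(15.00, 75.00, IN_TOK, OUT_TOK, DEALS)
edge = 5.13 * DEALS               # per-deal outcome edge reported in the study

print(f"extra inference cost for Opus: ${opus - haiku:,.2f}/yr")
print(f"expected outcome edge:         ${edge:,.2f}/yr")
print("Opus pays for itself" if edge > opus - haiku else "Haiku wins")
```

Under these assumptions the extra inference spend is about $80 a year and the expected edge is about $267, which is the article's point: the budget tier only looks cheap until you price the outcomes.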

This is the real argument for the premium tier. Not bigger context windows. Not nicer prose. Negotiation outcomes. The model that is smart enough to bluff, anchor, and walk away when the counterparty is doing the same.

The Consent Question

I want to flag something about how this experiment was designed. Anthropic ran a hidden A/B test on paid employees with real money on the line. Some of them lost real value because they were quietly assigned the cheaper model. The participants opted into Project Deal. Anthropic disclosed the methodology only after the experiment ended.

For an internal study at a research lab, that is reasonable. For the same setup deployed in a consumer product, it is the start of a regulatory conversation. If a marketplace assigns me an inferior agent and does not tell me, while routing my counterparty to a better one, that is the algorithmic version of yield-managing the customer. The FTC has shown an appetite for chasing this kind of thing in airline pricing. Agent assignment will get there.

Forty-six percent of Project Deal participants said they would pay for a similar service. That is a strong signal for product-market fit. It is also a signal that the consumer agent layer is going to ship before the regulatory framework around it does. Which is normal for the internet. It is just noticeable when the stakes are real cash on real items in real time.

Our Take

Project Deal is a quiet announcement with a loud subtext. The tech works. Agents can negotiate real deals with real money and the markets clear cleanly. That part is settled.

The harder finding is that the model you assign your agent matters more than anyone has been publicly admitting. Cheap models lose to expensive ones in negotiation, and the people running cheap models cannot tell. We are about to live through several years of agent-to-agent commerce where the price-performance curve looks one way (cheap is fine, smart is luxury) and the actual outcome curve looks completely different (cheap pays a tax, you just cannot see it).

For developers building agent products: do not default to the cheapest model for tasks that involve adversarial negotiation, contract terms, or pricing decisions. Run the math the way Anthropic ran it. Hold every input constant. Swap only the model. Measure outcome. The right tier is the one that breaks even on inference cost plus expected delta in outcomes, not the one that wins on raw API price.
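The paired design is the important part: same scenario, same random draws, only the model swapped. A toy sketch of that evaluation loop, where `run_negotiation` and the `SKILL` values are stand-ins for your real agent harness and model tiers, not anything from Anthropic's setup:

```python
import random
import statistics

# Hypothetical per-model negotiation skill; a real harness would call the API.
SKILL = {"frontier": 0.9, "budget": 0.6}

def run_negotiation(model, rng):
    """Toy negotiation: weaker models concede more from the reservation price."""
    reservation = rng.uniform(60, 100)          # scenario draw (shared via seed)
    concession = (1 - SKILL[model]) * rng.uniform(0, 20)
    return reservation - concession             # price the seller captures

def paired_trial(n=200):
    """Hold everything constant per trial; swap only the model; measure delta."""
    deltas = []
    for i in range(n):
        # Re-seed so both tiers face the identical scenario and noise draws.
        a = run_negotiation("frontier", random.Random(i))
        b = run_negotiation("budget", random.Random(i))
        deltas.append(a - b)
    return statistics.mean(deltas)

print(f"frontier edge per deal: ${paired_trial():.2f}")
```

Swap the stub for real API calls and the per-deal delta, times your deal volume, is the number to weigh against the tier's extra inference cost.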

You can compare frontier model pricing on our models tracker and run your specific workload through the cost calculator if you want to model the break-even yourself.

The agent economy is not arriving with a press release. It is arriving with experiments like this one. Quiet. Methodologically tight. Findings that change how you should price your stack. If you missed Project Deal in the noise this week, do not. It is the most important thing Anthropic has published in a month.