
OpenAI Just Shipped Voice Models That Reason Mid-Sentence. ElevenLabs Has a Pricing Problem.

Kira Nolan · 6 min read

On Tuesday, OpenAI quietly reset the voice agent stack. Three models, one launch post, and a price sheet that turns most of the voice vendor middle into a margin question. GPT-Realtime-2 is the headliner. It is the first OpenAI voice model with GPT-5-class reasoning, a 128K context window, and the ability to keep a conversation moving while it thinks, calls a tool, and recovers from the user interrupting it.

The other two models are doing more strategic work than the launch post made obvious. GPT-Realtime-Translate handles 70+ input languages into 13 output languages with speaker-pace streaming. GPT-Realtime-Whisper is a streaming speech-to-text model that transcribes live as the speaker talks. Together the three of them form a stack that covers most of what the voice agent layer was paying middleware vendors to do.

I've spent the last 36 hours running the API against our voice latency probes and re-running the unit economics for an agent doing real phone work. The headline is simple: the price floor for a useful voice agent just dropped roughly 4x for inference, and the inference is now smarter than what most ElevenLabs and Deepgram customers were stitching together a week ago.

What Actually Shipped

Three models, all in the Realtime API. None of them are research previews. All three are generally available with documented rate limits and production SLAs. That matters, because OpenAI has historically used the Realtime API to ship preview-quality voice features and let them sit in beta for months. Not this time.

GPT-Realtime-2 is the first voice model OpenAI has shipped that inherits the reasoning stack from GPT-5. The earlier gpt-realtime model was trained on a smaller multimodal base and topped out around GPT-4o-class quality on tool use and multi-turn coherence. GPT-Realtime-2 expands the context window from 32K to 128K, which is the practical unblock for any voice agent that has to hold a transcript, a system prompt, a CRM payload, and a tool history in the same session without summarizing aggressively.
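To make the 32K-vs-128K point concrete, here is a back-of-envelope session budget for the kind of agent described above. Every token count below is an illustrative assumption for the sake of the arithmetic, not a measurement of any real deployment:

```python
# Illustrative token budget for a phone-support voice agent session.
# All counts are rough assumptions chosen to show why a 32K window
# forces aggressive summarization and a 128K window does not.
SESSION_BUDGET = {
    "system_prompt": 2_000,      # persona, policies, escalation rules
    "crm_payload": 6_000,        # customer record serialized as JSON
    "tool_schemas": 3_000,       # function definitions exposed to the model
    "tool_history": 8_000,       # prior tool calls and their results
    "audio_transcript": 12_000,  # the conversation so far
}

total = sum(SESSION_BUDGET.values())
print(f"tokens in flight: {total:,}")            # 31,000
print(f"fits in 32K window: {total <= 32_000}")   # barely, with no headroom
print(f"fits in 128K window: {total <= 128_000}")
```

Under these assumed counts the session consumes 31K tokens before the next user turn arrives, which is exactly the "summarize aggressively or drop context" corner the 128K window removes.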

GPT-Realtime-Translate is the model I underestimated on first read. It is not just a translator. It is a real-time translator that streams output at the speaker's pace across a 70-to-13 language matrix. That is the configuration that breaks the standard three-vendor pipeline of STT plus translation plus TTS, because the model is doing all three steps in one forward pass without going through text as an intermediate representation.

GPT-Realtime-Whisper rounds out the stack as the streaming transcription primitive. The original Whisper API was non-streaming, which meant production voice agents had to either chunk audio and live with the latency or move to Deepgram or AssemblyAI for streaming STT. That gap is now closed.

The Pricing: Where It Hurts

OpenAI did not bury the price sheet. The numbers are aggressive enough that they read as a market-share play, not a cost-recovery release.

| Model | Input | Output | Notes |
| --- | --- | --- | --- |
| GPT-Realtime-2 | $32 / 1M audio tokens | $64 / 1M audio tokens | 128K context, $0.40 cached input |
| GPT-Realtime-Translate | $0.034 / minute | included | 70 input langs, 13 output langs |
| GPT-Realtime-Whisper | $0.017 / minute | streaming text | Live STT, low latency |
| ElevenLabs Conversational | $0.08 / minute (Business) | bundled | Per-minute, model not included |
| Deepgram Streaming STT | $0.0043 / minute | n/a | Cheaper, transcription only |

Translate at $0.034 per minute is the line that breaks the most spreadsheets. The vendor stacks I've seen for cross-language voice agents typically charge $0.10 to $0.18 per minute when you sum up STT, machine translation, and TTS with a passable voice. OpenAI is delivering the same job at roughly a third of that floor, and the translation quality is already at parity with the standalone NMT services that have dominated this category for a decade.
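The spreadsheet math is simple enough to show inline. The component split of the legacy pipeline below is a hypothetical decomposition of the $0.10-to-$0.18 range cited above (only the Deepgram rate comes from the table; the NMT and TTS rates are assumed):

```python
# Per-minute unit economics for a cross-language voice agent.
# LEGACY_STACK is a hypothetical three-vendor pipeline; only the
# streaming-STT rate is a published figure, the rest are assumptions.
LEGACY_STACK = {
    "streaming_stt": 0.0043,  # Deepgram streaming STT, $/min
    "translation": 0.03,      # assumed standalone NMT rate, $/min
    "tts_voice": 0.08,        # assumed per-minute TTS with a passable voice
}
legacy_per_min = sum(LEGACY_STACK.values())  # ~$0.114/min

translate_per_min = 0.034  # GPT-Realtime-Translate, $/min

print(f"legacy stack: ${legacy_per_min:.4f}/min")
print(f"Translate:    ${translate_per_min:.3f}/min")
print(f"ratio:        {legacy_per_min / translate_per_min:.1f}x")
```

Under these assumptions the single-model price lands at roughly a third of the stitched-together pipeline, consistent with the range above.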

Whisper streaming at $0.017 per minute is more interesting strategically than economically. Deepgram is still cheaper on raw streaming STT. The OpenAI pitch is that you don't have to integrate a separate vendor when GPT-Realtime-2 is doing the conversational work in the same session. The bundle is what they are selling, not the unit price.

GPT-Realtime-2 itself at $32/$64 per million audio tokens is in the same neighborhood as the original gpt-realtime, which is the right move. If they had hiked the price to match the GPT-5.5 reasoning premium, the launch would have read very differently. Instead they held the line and added the reasoning lift. That is a deliberate signal to the market that voice is now a volume category, not a premium-pricing one.

What This Does to the Voice Vendor Middle

ElevenLabs is the company most directly squeezed. Their Business plan at $0.08 per minute for Conversational AI assumed customers would pay a premium for voice quality and the convenience of a single-vendor stack. GPT-Realtime-2 collapses both assumptions in the same release. The voice quality is good enough for production agents, the bundle includes reasoning, and the per-minute math is roughly a quarter of ElevenLabs at typical usage profiles.

ElevenLabs still has a real wedge in branded voices and emotional expressiveness for audio content, audiobooks, and games. That is a smaller TAM than the conversational agent category they were building toward. The next quarter will be the first time customers do real production-volume comparisons, and the comparison is no longer flattering.

Deepgram is in a different position. Their pure streaming STT is cheaper than GPT-Realtime-Whisper and they have an enterprise compliance story OpenAI cannot match on day one. But the gravitational pull of a single-vendor voice stack is real, and customers buying GPT-Realtime-2 for the reasoning will tend to pick up Whisper for the STT in the same purchase order. That is the squeeze.

The bigger structural shift is that OpenAI is now selling voice the way AWS sells compute. Three primitives, one bill, one set of API keys. The middleware story (we stitch together STT plus translation plus an LLM plus TTS for you) gets harder to tell when the platform vendor ships those primitives natively at one third the price.

The Reasoning Mid-Sentence Detail

The capability OpenAI led with in the launch post is the one that is hardest to benchmark and easiest to underestimate: GPT-Realtime-2 can keep talking while it reasons. Earlier voice models would either pause, hallucinate filler, or commit to a wrong answer when a complex request hit. GPT-Realtime-2 reportedly handles tool calls, mid-turn corrections, and multi-step reasoning without the conversational stall that gave away the older voice agents as soon as the customer asked a hard question.

This is the feature that finally makes a voice agent usable for things like outbound sales discovery calls, tier-one support escalations, and patient intake. Not because it sounds better, but because the cognitive load is no longer hidden by latency tricks. If the model needs to think for two seconds, it can think for two seconds while saying "let me check that for you" in a way that doesn't feel scripted.
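The concurrency pattern behind "think for two seconds while saying something" can be sketched in a few lines of asyncio. This is an illustration of the pattern, not OpenAI's API surface; every name and timing below is invented for the example:

```python
# Minimal sketch of the "keep talking while reasoning" pattern:
# start a slow tool call, emit a holding phrase immediately instead
# of going silent, then deliver the answer when the tool resolves.
# All names and delays are illustrative, not any vendor's actual API.
import asyncio


async def slow_tool_call() -> str:
    await asyncio.sleep(0.2)  # stand-in for a CRM lookup or reasoning step
    return "order 4412 shipped Tuesday"


async def answer_hard_question() -> list[str]:
    spoken: list[str] = []
    tool = asyncio.create_task(slow_tool_call())  # kick off the slow work
    spoken.append("Let me check that for you...")  # speak without blocking
    result = await tool  # tool latency is hidden behind the holding phrase
    spoken.append(f"Found it: {result}.")
    return spoken


turn = asyncio.run(answer_hard_question())
print(turn)
```

The point of the sketch is the ordering: the holding phrase is emitted before the tool call resolves, so the caller never hears dead air, which is the behavior the older voice models could not deliver without scripted filler.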

The honest caveat: I haven't seen third-party benchmarks on this yet. OpenAI is self-reporting the capability, and voice agent quality is notoriously hard to measure. We are running our own evaluation pipeline against it this week and the early signal is positive but not proven. Treat the reasoning claim as plausible and worth testing, not yet validated.

Where This Lands in the Stack

GPT-Realtime-2 fits into a larger pattern. OpenAI is consolidating the agent stack one modality at a time. Text reasoning (GPT-5.5). Coding (the Workspace Agents launch). Voice (this week). Each release knocks out a layer of vendors that were thriving when the platform layer was thinner. The Realtime stack is the voice version of the same move.

For TensorFeed, the practical impact is that we are adding GPT-Realtime-2 to our models tracker and reworking our voice leaderboards to track per-minute economics across model plus middleware combinations rather than treating the categories separately. The voice agent layer is no longer a stack of decoupled vendors. It is a single product question now: which platform owns the session.

And for anyone building a voice agent product on the older Whisper plus GPT-4o plus ElevenLabs stack: rerun your numbers. The bill of materials probably just dropped 60% and the reasoning quality probably just went up. That is not a porting decision you want to put off until your competitor ships first.
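Rerunning the numbers is a one-function exercise. The rates below are placeholders, not quotes from any vendor (in particular, the per-minute figure for GPT-Realtime-2 depends on an audio-token-to-minute conversion you should take from OpenAI's own docs); substitute your metered prices before drawing conclusions:

```python
# Back-of-envelope bill-of-materials calculator for a voice agent.
# Every rate below is a hypothetical placeholder, $/min -- swap in
# your own contracted prices; none of these are vendor quotes.
def stack_cost_per_min(components: dict[str, float]) -> float:
    """Sum per-minute component rates into a BOM figure."""
    return sum(components.values())


old_stack = {  # hypothetical legacy pipeline
    "chunked_whisper_stt": 0.006,
    "gpt4o_reasoning": 0.024,
    "elevenlabs_tts": 0.05,
}
new_stack = {  # hypothetical consolidated Realtime stack
    "realtime_whisper_stt": 0.017,
    "gpt_realtime_2": 0.015,
}

old = stack_cost_per_min(old_stack)  # $0.080/min
new = stack_cost_per_min(new_stack)  # $0.032/min
print(f"old: ${old:.3f}/min  new: ${new:.3f}/min  "
      f"drop: {100 * (1 - new / old):.0f}%")
```

With these particular placeholder rates the drop works out to 60 percent, which is in the neighborhood the article describes, but the only number that matters is the one you get from your own traffic profile.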

Our Take

This is the most aggressive voice release OpenAI has done. Three models, one stack, pricing that reads like a land grab. The reasoning lift is real if the self-reported capability survives third-party benchmarking. The translation model is the sleeper. And the squeeze on ElevenLabs and Deepgram is structural, not cyclical.

The interesting question is which lab answers next. Anthropic doesn't have a voice product in market, and the absence is starting to look strategic rather than incidental. Google has Gemini Live, but the pricing and capability story has not been updated since January. xAI is in the conversation only because Grok Voice exists, not because anyone is building production agents on it. If a real competitive answer shows up in the next 30 days, it will most likely come from Google. If it doesn't, OpenAI owns the voice layer the way they once owned the chatbot layer, and the implications for downstream voice vendors get harder from here.