I Tracked AI Service Outages for a Month. Here's What I Found.

Ripper · 4 min read

TensorFeed polls the status of every major AI service every two minutes. We've been doing this since launch, and the data is starting to tell a clear story. I pulled the numbers from February 20 to March 20, 2026, and the results were more interesting than I expected.

The short version: AI services are less reliable than you think, outages cluster in predictable patterns, and the differences between providers are significant enough to matter for your architecture decisions.

The Scoreboard

Over 30 days, here's how the major services performed:

| Service | Incidents | Total Downtime | Avg Resolution | Uptime % |
|---|---|---|---|---|
| Claude API | 3 | 2h 14m | 45 min | 99.69% |
| OpenAI API | 7 | 5h 38m | 48 min | 99.22% |
| Gemini API | 4 | 3h 22m | 50 min | 99.53% |
| AWS Bedrock | 2 | 1h 05m | 32 min | 99.85% |
| Mistral Platform | 5 | 4h 10m | 50 min | 99.42% |
| Replicate | 6 | 7h 45m | 78 min | 98.92% |

AWS Bedrock was the most reliable, which makes sense. It's running on Amazon's infrastructure with their decades of ops experience. The direct-to-provider APIs (Claude, OpenAI, Gemini) were in the middle. Replicate had the roughest month, with the longest total downtime and the slowest recovery times.

When Outages Happen

This was the most interesting finding. Outages are not randomly distributed across the week. They cluster heavily around two patterns.

Tuesday and Wednesday afternoons (US Pacific). This is when most providers do their deployments. New model versions, infrastructure updates, scaling changes. The deployment window is the riskiest period, and Tuesday/Wednesday catch most of it. Of the 27 total incidents I tracked, 14 happened on Tuesdays or Wednesdays.

Monday mornings (US Eastern). Usage spikes at the start of the work week, and services that were fine over the weekend sometimes buckle under the sudden load increase. Five incidents happened during the Monday 8am to 11am Eastern window.

Weekends were nearly spotless. Only two incidents happened on Saturday or Sunday across the entire month. If you're planning a critical demo or deadline, Saturday is statistically your safest bet.

The Cascade Effect

Something I didn't expect: when one major provider goes down, others often degrade within the next hour. Not because they're technically linked, but because traffic shifts. When OpenAI goes down, developers switch to Claude or Gemini. The surge in requests to the fallback provider can push it past its capacity limits.

I saw this happen twice during the tracking period. OpenAI had a significant outage on a Wednesday afternoon, and within 40 minutes, Claude's API response times doubled. Not a full outage, but enough degradation to affect production workloads. The providers know this happens, but the surge capacity isn't always there to absorb it.

What Developers Should Do

Based on a month of data, here are my practical recommendations.

Always have a fallback provider. If your production app uses Claude as the primary model, configure an OpenAI or Gemini fallback. Not as a theoretical plan, but as working code that's been tested. The switch should be automatic.
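Here's a minimal sketch of what that might look like in Python, assuming the official anthropic and openai SDKs. The model names are placeholders, and a real implementation would want retries and logging, but the shape is the point: the fallback fires on its own.

```python
# Minimal automatic fallback: try Claude first, switch to OpenAI on failure.
# Assumes the official `anthropic` and `openai` SDKs; model names are placeholders.
import anthropic
import openai

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
oai = openai.OpenAI()           # reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    try:
        resp = claude.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    except anthropic.APIError:
        # Primary is degraded or down: switch automatically, no human in the loop.
        resp = oai.chat.completions.create(
            model="gpt-4o",             # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```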

Monitor proactively. Don't wait for your users to tell you the AI API is down. Use our status dashboard or set up outage alerts to get notified the moment a service degrades. The average time between incident start and official status page update was 12 minutes. That's 12 minutes your users are getting errors if you're only watching the provider's status page.
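If you'd rather roll your own, the pattern is simple. This is a hedged sketch, not our documented API: the endpoint URL and JSON shape below are assumptions, and the alert function is a stand-in for whatever notification channel you actually use.

```python
# Hypothetical status poller: the URL and response shape are assumptions,
# not a documented API. Polls every two minutes and alerts on degradation.
import time
import requests

STATUS_URL = "https://example.com/api/v1/status"  # hypothetical endpoint

def alert(message: str) -> None:
    print(message)  # stand-in for Slack, PagerDuty, email, etc.

def watch(service: str, interval: int = 120) -> None:
    while True:
        try:
            state = requests.get(STATUS_URL, timeout=10).json().get(service, "unknown")
        except requests.RequestException:
            state = "unknown"  # the monitor failing to reach the API is also a signal
        if state != "operational":
            alert(f"{service} is {state}")
        time.sleep(interval)
```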

Implement circuit breakers. When you detect elevated error rates or latency, stop sending requests and switch to your fallback immediately. Don't wait for the provider to confirm an outage. Your error rate is all the confirmation you need.
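A bare-bones version of that logic, as a sketch: trip after a run of consecutive failures, route everything to the fallback while the breaker is open, then let the primary try again after a cool-down. The thresholds here are illustrative, not tuned recommendations, and `primary`/`fallback` are whatever callables wrap your providers.

```python
# Minimal circuit breaker: trip on consecutive failures, serve the fallback
# while open, retry the primary after a cool-down. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, cooldown: float = 60.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    @property
    def open(self) -> bool:
        if self.failures < self.max_failures:
            return False
        if time.monotonic() - self.opened_at > self.cooldown:
            self.failures = 0  # half-open: give the primary another chance
            return False
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0

breaker = CircuitBreaker()

def guarded_call(primary, fallback, prompt: str):
    if breaker.open:
        return fallback(prompt)  # don't even attempt the primary while tripped
    try:
        result = primary(prompt)
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        return fallback(prompt)
```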

Cache aggressively. If your AI responses can be cached (and many can), cache them. A cache hit that serves a slightly stale response is infinitely better than a failed request during an outage.
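The simplest useful version is a stale-on-error cache: refresh the entry on every successful call, and only reach for the cached copy when the provider fails. A sketch, assuming an in-memory dict (production would want Redis or similar, plus TTLs):

```python
# Stale-on-error cache sketch: serve fresh responses while the provider is
# healthy, fall back to the last known-good response during an outage.
cache: dict[str, str] = {}

def cached_complete(prompt: str, call_model) -> str:
    try:
        response = call_model(prompt)  # call_model is a hypothetical provider call
        cache[prompt] = response       # refresh the cache on every success
        return response
    except Exception:
        if prompt in cache:
            return cache[prompt]       # slightly stale beats a failed request
        raise                          # nothing cached: surface the error
```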

What Comes Next

I'm going to keep tracking this data and publishing regular updates. The one-month snapshot is useful, but the real value will come from tracking trends over quarters. Are services getting more reliable as they scale? Are deployment windows getting safer? Does competition drive better uptime?

All of the raw data feeds into our incident database, which is open and queryable through the API. If you want to run your own analysis or build monitoring into your infrastructure, it's all there.
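For instance, pulling a date range of incidents and eyeballing it yourself takes a few lines. The endpoint path, parameters, and field names below are hypothetical; check the API docs for the real shape.

```python
# Hypothetical query against the incident database; endpoint and fields are
# assumptions for illustration, not the documented API.
import requests

resp = requests.get(
    "https://example.com/api/v1/incidents",            # hypothetical endpoint
    params={"from": "2026-02-20", "to": "2026-03-20"},  # hypothetical parameters
    timeout=10,
)
for incident in resp.json():
    print(incident["service"], incident["started_at"], incident["duration_minutes"])
```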

The AI services we depend on are still young. Their reliability will improve. But right now, treating any single provider as a guaranteed always-on service is a mistake. Plan for failure, build in redundancy, and keep watching the data.