I Tracked AI Service Outages for a Month. Here's What I Found.

Ripper · 4 min read

TensorFeed polls the status of every major AI service every two minutes. We've been doing this since launch, and the data is starting to tell a clear story. I pulled the numbers from February 20 to March 20, 2026, and the results were more interesting than I expected.

The short version: AI services are less reliable than you think, outages cluster in predictable patterns, and the differences between providers are significant enough to matter for your architecture decisions.

The Scoreboard

Over 30 days, here's how the major services performed:

| Service | Incidents | Total Downtime | Avg Resolution | Uptime % |
|---|---|---|---|---|
| Claude API | 3 | 2h 14m | 45 min | 99.69% |
| OpenAI API | 7 | 5h 38m | 48 min | 99.22% |
| Gemini API | 4 | 3h 22m | 50 min | 99.53% |
| AWS Bedrock | 2 | 1h 05m | 32 min | 99.85% |
| Mistral Platform | 5 | 4h 10m | 50 min | 99.42% |
| Replicate | 6 | 7h 45m | 78 min | 98.92% |

AWS Bedrock was the most reliable, which makes sense. It's running on Amazon's infrastructure with their decades of ops experience. The direct-to-provider APIs (Claude, OpenAI, Gemini) were in the middle. Replicate had the roughest month, with the longest total downtime and the slowest recovery times.

When Outages Happen

This was the most interesting finding. Outages are not randomly distributed across the week. They cluster heavily around two patterns.

Tuesday and Wednesday afternoons (US Pacific). This is when most providers do their deployments. New model versions, infrastructure updates, scaling changes. The deployment window is the riskiest period, and Tuesday/Wednesday catch most of it. Of the 27 total incidents I tracked, 14 happened on Tuesdays or Wednesdays.

Monday mornings (US Eastern). Usage spikes at the start of the work week, and services that were fine over the weekend sometimes buckle under the sudden load increase. Five incidents happened during the Monday 8am to 11am Eastern window.

Weekends were nearly spotless. Only two incidents happened on Saturday or Sunday across the entire month. If you're planning a critical demo or deadline, Saturday is statistically your safest bet.

The Cascade Effect

Something I didn't expect: when one major provider goes down, others often degrade within the next hour. Not because they're technically linked, but because traffic shifts. When OpenAI goes down, developers switch to Claude or Gemini. The surge in requests to the fallback provider can push it past its capacity limits.

I saw this happen twice during the tracking period. OpenAI had a significant outage on a Wednesday afternoon, and within 40 minutes, Claude's API response times doubled. Not a full outage, but enough degradation to affect production workloads. The providers know this happens, but the surge capacity isn't always there to absorb it.

What Developers Should Do

Based on a month of data, here are my practical recommendations.

Always have a fallback provider. If your production app uses Claude as the primary model, configure an OpenAI or Gemini fallback. Not as a theoretical plan, but as working code that's been tested. The switch should be automatic.
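Here's a minimal sketch of what that might look like in Python, assuming the official anthropic and openai SDKs. The model names are placeholders, and a real implementation would want retries and logging, but the shape is the point: the fallback fires on its own.

```python
# Minimal automatic fallback: try Claude first, switch to OpenAI on failure.
# Assumes the official `anthropic` and `openai` SDKs; model names are placeholders.
import anthropic
import openai

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
oai = openai.OpenAI()           # reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    try:
        resp = claude.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    except anthropic.APIError:
        # Primary is degraded or down: switch automatically, no human in the loop.
        resp = oai.chat.completions.create(
            model="gpt-4o",             # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```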

Monitor proactively. Don't wait for your users to tell you the AI API is down. Use our status dashboard or set up outage alerts to get notified the moment a service degrades. The average time between incident start and official status page update was 12 minutes. That's 12 minutes your users are getting errors if you're only watching the provider's status page.
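If you'd rather roll your own, the pattern is simple. This is a hedged sketch, not our documented API: the endpoint URL and JSON shape below are assumptions, and the alert function is a stand-in for whatever notification channel you actually use.

```python
# Hypothetical status poller: the URL and response shape are assumptions,
# not a documented API. Polls every two minutes and alerts on degradation.
import time
import requests

STATUS_URL = "https://example.com/api/v1/status"  # hypothetical endpoint

def alert(message: str) -> None:
    print(message)  # stand-in for Slack, PagerDuty, email, etc.

def watch(service: str, interval: int = 120) -> None:
    while True:
        try:
            state = requests.get(STATUS_URL, timeout=10).json().get(service, "unknown")
        except requests.RequestException:
            state = "unknown"  # the monitor failing to reach the API is also a signal
        if state != "operational":
            alert(f"{service} is {state}")
        time.sleep(interval)
```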

Implement circuit breakers. When you detect elevated error rates or latency, stop sending requests and switch to your fallback immediately. Don't wait for the provider to confirm an outage. Your error rate is all the confirmation you need.
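A bare-bones version of that logic, as a sketch: trip after a run of consecutive failures, route everything to the fallback while the breaker is open, then let the primary try again after a cool-down. The thresholds here are illustrative, not tuned recommendations, and `primary`/`fallback` are whatever callables wrap your providers.

```python
# Minimal circuit breaker: trip on consecutive failures, serve the fallback
# while open, retry the primary after a cool-down. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, cooldown: float = 60.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    @property
    def open(self) -> bool:
        if self.failures < self.max_failures:
            return False
        if time.monotonic() - self.opened_at > self.cooldown:
            self.failures = 0  # half-open: give the primary another chance
            return False
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0

breaker = CircuitBreaker()

def guarded_call(primary, fallback, prompt: str):
    if breaker.open:
        return fallback(prompt)  # don't even attempt the primary while tripped
    try:
        result = primary(prompt)
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        return fallback(prompt)
```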

Cache aggressively. If your AI responses can be cached (and many can), cache them. A cache hit that serves a slightly stale response is infinitely better than a failed request during an outage.
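The simplest useful version is a stale-on-error cache: refresh the entry on every successful call, and only reach for the cached copy when the provider fails. A sketch, assuming an in-memory dict (production would want Redis or similar, plus TTLs):

```python
# Stale-on-error cache sketch: serve fresh responses while the provider is
# healthy, fall back to the last known-good response during an outage.
cache: dict[str, str] = {}

def cached_complete(prompt: str, call_model) -> str:
    try:
        response = call_model(prompt)  # call_model is a hypothetical provider call
        cache[prompt] = response       # refresh the cache on every success
        return response
    except Exception:
        if prompt in cache:
            return cache[prompt]       # slightly stale beats a failed request
        raise                          # nothing cached: surface the error
```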

What Comes Next

I'm going to keep tracking this data and publishing regular updates. The one-month snapshot is useful, but the real value will come from tracking trends over quarters. Are services getting more reliable as they scale? Are deployment windows getting safer? Does competition drive better uptime?

All of the raw data feeds into our incident database, which is open and queryable through the API. If you want to run your own analysis or build monitoring into your infrastructure, it's all there.
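For instance, pulling a date range of incidents and eyeballing it yourself takes a few lines. The endpoint path, parameters, and field names below are hypothetical; check the API docs for the real shape.

```python
# Hypothetical query against the incident database; endpoint and fields are
# assumptions for illustration, not the documented API.
import requests

resp = requests.get(
    "https://example.com/api/v1/incidents",            # hypothetical endpoint
    params={"from": "2026-02-20", "to": "2026-03-20"},  # hypothetical parameters
    timeout=10,
)
for incident in resp.json():
    print(incident["service"], incident["started_at"], incident["duration_minutes"])
```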

The AI services we depend on are still young. Their reliability will improve. But right now, treating any single provider as a guaranteed always-on service is a mistake. Plan for failure, build in redundancy, and keep watching the data.