
What Really Happened During the Cloudflare Outage on November 18, 2025? A Simple Explanation

November 24, 2025 · 8 min read

On November 18, 2025, at 11:20 AM UTC, millions of people around the world suddenly couldn't access some of the internet's most popular websites. ChatGPT, Spotify, X (Twitter), Shopify, and Canva all went dark. The immediate reaction? "The cloud is down!" But here's the surprising truth: the cloud was fine. The apps were healthy. What broke was something most people have never heard of: Cloudflare.

This incident perfectly demonstrates why understanding the difference between a network outage and a cloud outage matters—especially if you run a business that depends on the internet (which is basically every business today).

What Is Cloudflare, and Why Does It Matter?

Think of the internet like a massive highway system. Your website or app is like a store at the end of the road. Cloudflare is like the highway, traffic lights, and security checkpoints that sit between your customers and your store.

Specifically, Cloudflare provides:

  • Speed: It caches (stores copies of) your website content in 330+ locations worldwide, so users get faster load times
  • Security: It blocks hackers, bots, and DDoS attacks before they reach your servers
  • Routing: It directs traffic efficiently, like a GPS for internet requests
  • DNS: It translates website names (like "google.com") into the numerical addresses computers understand
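
To make the DNS piece concrete, here's a minimal Python sketch that resolves a hostname into the numerical addresses computers actually connect to. The domain is just a placeholder:

```python
# Minimal illustration of DNS resolution: turning a name into IP addresses.
# "example.com" is a placeholder domain, not a Cloudflare-specific endpoint.
import socket

hostname = "example.com"
addresses = sorted({info[4][0] for info in socket.getaddrinfo(hostname, 443)})
print(f"{hostname} resolves to: {addresses}")
```

When a site is proxied through Cloudflare, those addresses typically point at Cloudflare's edge rather than the site's own servers, which is exactly why an edge failure makes an app unreachable even when the app itself is healthy.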

Here's the key point: Cloudflare sits in front of your application. It's the front door. So when Cloudflare goes down, users can't reach your app—even if your app is running perfectly fine behind the scenes.

What Happened on November 18, 2025?

The outage wasn't caused by a cyberattack or a hardware failure. It was triggered by something surprisingly simple: a file got too big.

The Technical Breakdown (In Plain English)

Cloudflare uses a system called Bot Management to identify and block automated bots (the bad kind that scrape data or launch attacks). This system relies on a special "feature file" that gets updated every few minutes with information about new bot behaviors.

Here's what went wrong:

  1. A database permissions change: Cloudflare's team was updating permissions in their ClickHouse database (a system that stores and analyzes data)
  2. Duplicate data appeared: Due to this change, the database started outputting duplicate rows of information into the bot feature file
  3. The file doubled in size: What was normally a manageable file suddenly became twice as large
  4. The system couldn't handle it: Cloudflare's software had a size limit built in. When the file exceeded that limit, the software crashed (see the sketch after this list)
  5. Global propagation: Because this file gets distributed to all 330+ Cloudflare data centers worldwide every few minutes, the problem spread globally within minutes
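
To make step 4 concrete, here's a minimal Python sketch of the failure mode: a consumer with a hard limit that aborts instead of degrading when its input unexpectedly doubles. This is not Cloudflare's actual code; the limit value and data are invented for illustration:

```python
# Hypothetical sketch of the failure mode: a hard limit that aborts instead of
# degrading when the generated file unexpectedly doubles in size.
MAX_FEATURES = 200  # illustrative limit, not Cloudflare's exact configuration

def load_features(rows: list[str]) -> list[str]:
    if len(rows) > MAX_FEATURES:
        # The crash path: instead of dropping extra rows or falling back to the
        # previous file, the process simply dies and stops serving traffic.
        raise RuntimeError(f"{len(rows)} features exceeds limit of {MAX_FEATURES}")
    return rows

normal_file = [f"feature_{i}" for i in range(150)]
duplicated_file = normal_file + normal_file  # duplicate rows double the file

load_features(normal_file)  # fine
try:
    load_features(duplicated_file)
except RuntimeError as exc:
    print(f"crash path: {exc}")
```

The fix Cloudflare eventually deployed was essentially the fallback this sketch lacks: stop propagating the oversized file and roll back to a known good version.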

The Confusing Part

What made this outage particularly difficult to diagnose was that it kept flickering on and off. Every five minutes, the system would generate a new file. Sometimes it was good (when the query ran on an old database node), sometimes it was bad (when it ran on an updated node). This made Cloudflare's engineers initially think they were under a massive DDoS attack, not experiencing an internal configuration error.

The Timeline

  • 11:05 UTC - Permissions change applied to database
  • 11:20 UTC - Outage begins; users worldwide see error pages
  • ~14:30 UTC - Root cause identified; fix deployed (rolled back to a known good file)
  • 17:06 UTC - All systems fully recovered

Total impact duration: ~6 hours, with the worst period lasting about 3 hours.

Cloudflare Outage vs. Cloud Outage: What's the Difference?

This is where many people get confused. When websites go down, we often hear "the cloud is down." But there are actually two very different types of outages:

| Aspect | Cloudflare Outage (Network/Edge) | Cloud Outage (AWS/Azure/GCP) |
| --- | --- | --- |
| What Failed | The "front door": routing, security, DNS | The "house itself": servers, databases, storage |
| What Users See | "Cloudflare Error 500/502" pages; can't reach the site | App errors, slow performance, data loss, system crashes |
| Your App Status | Usually healthy and running fine | Degraded, offline, or experiencing failures |
| Typical Duration | Minutes to hours | Hours to days (may require data recovery) |
| Real-World Analogy | The highway to your store is blocked | Your store is on fire |

During the November 18 outage, AWS, Azure, and Google Cloud were all running perfectly. ChatGPT's servers were fine. Spotify's databases were healthy. But users couldn't reach them because the path was blocked.

What Services Were Affected?

The outage impacted multiple Cloudflare services:

  • Core CDN and Security: The main issue - websites returned HTTP 500 errors
  • Turnstile (Cloudflare's CAPTCHA replacement): Failed to load, preventing logins
  • Workers KV (edge storage): Returned errors as requests couldn't reach the gateway
  • Dashboard: Most users couldn't log in due to Turnstile being down
  • Access (Zero Trust authentication): Authentication failures for most users
  • Email Security: Temporary loss of IP reputation data, reducing spam detection accuracy

Key Lessons for Businesses

1. The Internet Has Single Points of Failure

Cloudflare powers a massive portion of the internet. When it goes down, the impact is enormous. This incident shows that even with the best engineering teams and infrastructure, no system is immune to failure.

2. Multi-CDN Architecture Is No Longer Optional

The companies that survived this outage with minimal disruption used a multi-CDN strategy (a rough code sketch follows this list):

  • Primary CDN: Cloudflare (for normal operations)
  • Secondary CDN: Fastly, Akamai, or AWS CloudFront (for failover)
  • Health Checks: Automated monitoring that switches traffic when the primary fails
  • DNS Failover: Intelligent routing that redirects users to healthy endpoints
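
As a rough idea of what the failover decision looks like, here's a small Python health checker that probes a primary CDN endpoint and falls back to a secondary one when it stops responding. The URLs are placeholders, and in production this logic normally lives in a managed DNS health-check service rather than in application code:

```python
# Hypothetical multi-CDN failover sketch: probe the primary edge, fall back to
# a secondary when it stops returning healthy responses. Endpoints are placeholders.
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://primary-cdn.example.com/health",    # e.g. the Cloudflare-fronted path
    "https://secondary-cdn.example.com/health",  # e.g. a Fastly/CloudFront-fronted path
]

def pick_healthy_endpoint(timeout: float = 3.0) -> str | None:
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # unreachable or erroring: try the next CDN
    return None  # every path is down: time to page someone

if __name__ == "__main__":
    healthy = pick_healthy_endpoint()
    print(f"routing traffic via: {healthy or 'no healthy endpoint'}")
```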

This approach is similar to having multiple suppliers for critical business components—you never want to depend entirely on one vendor.

3. Workarounds Exist (But Aren't User-Friendly)

During the outage, some users found ways around the problem:

  • Mobile apps: Many continued working because they bypass Cloudflare's proxy
  • Direct API access: Developers could still call APIs directly (e.g., OpenAI's API worked fine)
  • Direct IP access: Technical users could connect directly to origin servers if they knew the IP addresses
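
For the technically curious, here's a rough Python illustration of the "direct IP access" workaround: connecting to the origin server by IP and passing the site's hostname in the Host header. The address and hostname are placeholders, and over HTTPS you would also have to deal with SNI and certificate validation, which is exactly why this isn't practical for ordinary users:

```python
# Hypothetical illustration of reaching an origin server directly by IP,
# bypassing the CDN proxy. The IP (from the RFC 5737 documentation range)
# and hostname are placeholders.
import http.client

ORIGIN_IP = "203.0.113.10"    # placeholder origin address
HOSTNAME = "app.example.com"  # placeholder site normally fronted by the CDN

conn = http.client.HTTPConnection(ORIGIN_IP, 80, timeout=5)
# The Host header tells the origin which site to serve, since we connected by
# raw IP rather than the CDN-fronted DNS name.
conn.request("GET", "/", headers={"Host": HOSTNAME})
response = conn.getresponse()
print(response.status, response.reason)
```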

However, these aren't practical solutions for average users, which is why architectural resilience is so important.

4. Small Changes Can Have Massive Consequences

A simple database permissions change caused a global outage affecting millions. This highlights the importance of:

  • Thorough testing of configuration changes
  • Gradual rollouts rather than instant global deployments
  • Size limits and validation on critical files and configurations (see the sketch after this list)
  • Better monitoring to catch anomalies before they propagate
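
As a sketch of the "validate before you propagate" idea, here's a small Python check that compares a freshly generated configuration file against the last known good one and refuses to ship anything that looks anomalous. The file names and growth threshold are invented for the example:

```python
# Hypothetical pre-propagation check: compare a freshly generated config file
# against the last known good one and keep the old one if the new one looks wrong.
# File names, contents, and the growth threshold are invented for illustration.
from pathlib import Path

MAX_GROWTH_FACTOR = 1.5  # reject files that suddenly grow by more than 50%

def choose_config(new: Path, last_good: Path) -> Path:
    new_size, good_size = new.stat().st_size, last_good.stat().st_size
    if new_size == 0 or new_size > good_size * MAX_GROWTH_FACTOR:
        # Anomalous output (empty, or suspiciously large): keep serving the
        # previous file instead of pushing a bad one to every data center.
        return last_good
    return new

# Simulate the incident: the new file is double the size of the old one.
Path("features_last_good.conf").write_text("feature_a\nfeature_b\n")
Path("features_new.conf").write_text("feature_a\nfeature_b\n" * 2)
print(f"propagating: {choose_config(Path('features_new.conf'), Path('features_last_good.conf'))}")
```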

What Cloudflare Is Doing to Prevent This

In their post-incident report, Cloudflare acknowledged the failure and outlined several improvements:

  • Better file size validation: Implementing checks before propagating configuration files
  • Gradual rollouts: Testing changes on a small subset of servers before global deployment (sketched after this list)
  • Enhanced monitoring: Detecting anomalies in configuration file sizes earlier
  • Improved error handling: Ensuring systems degrade gracefully rather than crashing completely
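
To illustrate what a gradual rollout looks like in practice, here's a simplified Python sketch that ships a change to a small canary slice of servers first, verifies health, and only then expands to the rest of the fleet. The server names, batch sizes, and the deploy/health-check functions are all stand-ins:

```python
# Hypothetical canary rollout: deploy to a small slice of the fleet, verify
# health, then expand in stages. Everything here is a stand-in for real tooling.
SERVERS = [f"edge-{i:03d}" for i in range(1, 101)]  # pretend fleet of 100 nodes

def deploy(server: str) -> None:
    pass  # stand-in for pushing the new configuration to one server

def healthy(server: str) -> bool:
    return True  # stand-in for a real post-deploy health probe

def rollout(servers: list[str], stages=(0.01, 0.10, 0.50, 1.0)) -> None:
    done = 0
    for fraction in stages:
        target = max(1, int(len(servers) * fraction))
        for server in servers[done:target]:
            deploy(server)
        if not all(healthy(s) for s in servers[:target]):
            # A bad canary stops here, touching a handful of servers instead of
            # every data center in the world at once.
            raise RuntimeError(f"rollout halted at {target}/{len(servers)} servers")
        done = target
        print(f"stage ok: {done}/{len(servers)} servers updated")

rollout(SERVERS)
```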

Final Thoughts: Building Resilient Systems

The November 18, 2025 Cloudflare outage is a reminder that the internet, for all its scale, remains fragile: it's a chain of interconnected systems, and a single weak link can cause widespread disruption.

For businesses, the lesson is clear:

  • Diversify your infrastructure: Don't rely on a single CDN, cloud provider, or DNS service
  • Implement failover strategies: Automate the switch to backup systems when primary ones fail
  • Monitor at every layer: Track performance at the edge, network, and application levels
  • Test your disaster recovery: Regularly simulate outages to ensure your failover actually works
  • Communicate clearly: Have a plan to inform customers when third-party services fail

The good news? With the right architecture and planning, you can minimize the impact of these inevitable outages. The companies that came through November 18 unscathed weren't just lucky—they were prepared.


Build Resilient Infrastructure for Your Business

Don't let the next major outage take your business offline. Our team specializes in designing multi-layered, resilient infrastructure with automated failover, intelligent monitoring, and disaster recovery planning. We help you build systems that stay online even when the internet breaks.