
Cloudflare explains the mistake that took down large chunks of the internet yesterday

Huge chunks of the internet were completely unavailable yesterday, with many other websites and services experiencing slow performance. It was immediately clear that the problem was with the Cloudflare network, but it took some time for the company to establish the true cause.

Cloudflare says that it initially believed it was experiencing a massive cyber-attack, but subsequently realized the problems were caused by a “painful” error with a software update …

As we reported yesterday, the outage was a massive one.

A large number of apps and websites are currently taken entirely offline, or experiencing significant outages, due to an issue with the popular Cloudflare infrastructure network provider. The Cloudflare CDN powers the websites behind many high-profile apps, so any outage at Cloudflare has wide-reaching implications. That includes social media site X (formerly Twitter), where users are currently unable to publish new posts or refresh their timelines. The problem appears to be impacting web users worldwide.

Why Cloudflare thought it was under attack

Cloudflare said it saw connections go offline for around five minutes at a time, recover, and then go offline again. This pattern led the company to believe that it was experiencing what it described as a hyperscale DDoS attack, since a technical error would not normally fix itself.

A distributed denial-of-service (DDoS) attack is when a malicious actor directs a very large volume of requests, typically from many machines at once, at a server in order to use up all its available capacity, meaning that genuine users are unable to access the service.

What appeared to be further evidence for a cyber attack turned out to be pure coincidence.

Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page. 

The true cause was a Cloudflare error

However, the company subsequently discovered that the real problem was a botched update to a file used by its bot management system.

There’s an unwritten rule in IT that if you’re experiencing a problem with weird symptoms, it will be a permissions issue – and that was the case here.

It was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
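To make the failure mode concrete, here is a minimal Python sketch of how duplicated entries can push a file past a hard size limit. It is not Cloudflare's code: the limit of 200, the feature count of 120, and every name in it (MAX_FEATURES, load_feature_file, load_or_keep_previous) are assumptions chosen purely for illustration.

```python
# Illustrative sketch only; the limit and all names here are hypothetical.
MAX_FEATURES = 200  # assumed hard limit on entries the loader accepts


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more entries than the loader allows."""


def load_feature_file(rows, limit=MAX_FEATURES):
    """Return the rows as a list, or raise if there are more than `limit`."""
    if len(rows) > limit:
        raise FeatureFileTooLarge(
            f"{len(rows)} entries exceeds the limit of {limit}"
        )
    return list(rows)


def load_or_keep_previous(rows, previous, limit=MAX_FEATURES):
    """A more forgiving variant: fall back to the last good file on failure."""
    try:
        return load_feature_file(rows, limit)
    except FeatureFileTooLarge:
        return previous


# A normal file fits comfortably under the limit.
normal = [f"feature_{i}" for i in range(120)]

# If the database emits every entry twice, the file doubles in size
# and no longer fits.
duplicated = normal * 2
```

In this sketch, load_feature_file(duplicated) raises, which stands in for the software failing, while load_or_keep_previous(duplicated, normal) shows one possible alternative design: reject the oversized file and keep serving the last known good one.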

There was also a simple explanation for the odd five-minute cycle.

The file was being generated every five minutes by a query running on a ClickHouse database cluster, which was being gradually updated to improve permissions management. Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
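The on-again, off-again pattern can be illustrated with a small simulation. This is a hedged sketch rather than a model of Cloudflare's actual systems: the function names, the feature counts, the limit, and the idea of picking a node at random each cycle are all assumptions made for the example.

```python
import random

# Illustrative simulation only; every name and number here is hypothetical.


def generate_file(node_updated, base_features):
    """Updated nodes return each entry twice; other nodes return it once."""
    if node_updated:
        return base_features * 2
    return list(base_features)


def simulate_cycles(n_cycles, updated_fraction, base_features, limit, seed=0):
    """Simulate n_cycles five-minute runs against a partially updated cluster.

    Each cycle, the query lands on an updated node with probability
    `updated_fraction`. The result is "bad" if the generated file is
    larger than `limit`, otherwise "good".
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_cycles):
        node_updated = rng.random() < updated_fraction
        rows = generate_file(node_updated, base_features)
        outcomes.append("bad" if len(rows) > limit else "good")
    return outcomes


features = [f"feature_{i}" for i in range(120)]

# Halfway through the rollout, good and bad files alternate unpredictably.
mixed = simulate_cycles(12, 0.5, features, limit=200)

# Once every node has been updated, every file is bad.
all_bad = simulate_cycles(12, 1.0, features, limit=200)
```

With the rollout only partly complete, the outcomes flip between good and bad from one cycle to the next, which is the kind of self-healing and relapsing behavior that could be mistaken for an attack. Once the update reaches every node, every generated file is bad.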

The company issued an apology, describing its mistake as “deeply painful.”

We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare’s importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today.

You can read a more detailed explanation in a Cloudflare blog post.


Photo by David Pupăză on Unsplash


Author

Ben Lovejoy

Ben Lovejoy is a British technology writer and EU Editor for 9to5Mac. He’s known for his op-eds and diary pieces, exploring his experience of Apple products over time, for a more rounded review. He also writes fiction, with two technothriller novels, a couple of SF shorts and a rom-com!

