Network delivery and security giant Cloudflare has experienced a multi-day outage on its control plane and analytics services.
From 2 November, 22:44 AEDT (11:44 UTC), until 4 November, 15:25 AEDT (04:25 UTC), the systems suffered an outage, resulting in some customers being unable to access certain services.
“Beginning on Thursday, November 2, 2023 at 11:43 UTC Cloudflare's control plane and analytics services experienced an outage. Here are the details https://t.co/WIoobciSYm” (Cloudflare, @Cloudflare, November 4, 2023)
The control plane outage meant the customer-facing interface for all services, including the website and APIs, was unavailable. Most of these issues were resolved, and services were restored by 3 November, 04:57 AEDT (2 November, 17:57 UTC), through the company’s disaster recovery facility.
Similarly, reporting and the other services that make up the company’s analytics offering suffered outages.
“Many customers would not have experienced issues with most of our products after the disaster recovery facility came online,” said Cloudflare.
“However, other services took longer to restore, and customers that used them may have seen issues until we fully resolved the incident.
“Our raw log services were unavailable for most customers for the duration of the incident.”
Cloudflare said the incident resulted from a power outage at one of its data centres, which are spaced apart to minimise the chance of an outage in the event of a natural disaster.
One of its three Oregon facilities, “PDX-DC04”, houses the company’s “largest analytics cluster as well as more than a third of the machines for our high availability cluster”.
Cloudflare rents space in this facility, which is run by Flexential, a data centre solution and facilities organisation.
The issue arose when Portland General Electric, the utility that supplies power to the site, carried out unplanned maintenance that shut down one power feed into PDX-DC04. While Flexential diverted power to its emergency generators, it failed to inform Cloudflare that it had done so.
“It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centres to drop off the grid when power demands are high and run exclusively on generators,” said Cloudflare, which revealed that Flexential should have been able to run the facility from the utility feed and did not need the generators.
While the exact reason for the outage and the events that followed the power issue are still inconclusive, Cloudflare has speculated that the cause may be linked to Portland General Electric’s DSG service, which enables the local utility to run a data centre’s generators to supply additional power to the grid, with the power company assisting in maintenance and fuel for the generators.
Cloudflare said there was a ground fault on one of the transformers at the site, and that it believes this was the transformer that “stepped down power from the grid for the second feed that was still running as it entered the data centre”.
It also said it believes, although this is unconfirmed, that the ground fault was caused by the unplanned maintenance.
Because ground faults on high-voltage power lines are dangerous, electrical systems are designed to shut down to prevent damage. In this case, the protective shutdown also took the facility’s generators offline, leaving both the generators and the utility lines without power.
The facility had additional emergency battery power, which is supposed to last for 10 minutes to allow issues to be fixed, but the batteries failed after only four minutes. It also took Flexential much longer than 10 minutes to fix the issues.
For the full breakdown of the outages, head to the Cloudflare blog.