How a Critical to Reliability (CTR) Approach Enhances Colocation Provider and Hyperscaler Uptime
The spike in demand for IT and data center capacity is unprecedented. Colocation and hyperscale data centers are operating at their optimal levels, but reliability and uptime are still a concern. The Uptime Institute’s 2019 Global Data Center Survey, for example, indicates that a third of the 1,600 survey respondents experienced a downtime incident or severe service degradation. A number of these incidents resulted in serious business-related financial losses. Nearly 80 percent of respondents indicated their most recent outage could have been prevented. The time to full recovery for most outages was one to four hours, with over a third reporting a recovery time of five hours or longer.
For the colocation centers and hyperscalers whose customers require predictable reliability, such a situation is untenable. Innovative approaches have to go beyond just assuring that individual data center physical infrastructure products are reliable. Although a product like an Uninterruptible Power Supply (UPS) might be capable of running for an extended period of time without breaking down, a more holistic approach is needed to guarantee a higher degree of operational reliability.
This is where the Critical to Reliability (CTR) approach adds value. The CTR approach takes the many physical components of data center infrastructure that ensure uptime, such as UPSs, switchgear, SCADA systems, breakers, power monitoring software, and Programmable Logic Controllers (PLCs), and manages them as a collective whole. If properly implemented, the CTR approach helps to boost the overall reliability of colocation and hyperscale operations.
How CTR Provides a More Detailed Uptime Forecast
To implement a CTR approach, colocation data center stakeholders must first recognize the difference between product quality and product reliability. A quality UPS, for instance, might work fine after it is manufactured, tested and commissioned. However, once the UPS is operating in a live production environment, an element of time comes into play: how long that UPS will operate in the field. That element of time, which is critical to the notion of product reliability, also factors in when a customer or others are impacted by a failure (i.e., how fast that failure is rectified).
The CTR approach uses a formulaic method that enables hyperscalers and colocation providers to meet the reliability standards that they’re promising their customers. “If they install 10 UPSs and 10 sets of switchgear, for example, they know that all of these systems have to work together without a problem for five years in order to deliver on their reliability promise,” says Andy Durand, Strategic Account Customer Advocate for Internet Giants with Schneider Electric’s Customer Satisfaction & Quality team.
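To make the "10 UPSs and 10 sets of switchgear for five years" example concrete, here is a minimal sketch of how such a combined figure could be estimated. It is not Schneider Electric's CTR formula, which the article does not spell out. It assumes a constant failure rate per unit (an exponential model), independent failures, and no redundancy, so that a single failure breaks the promise. The MTBF values and function names are hypothetical and chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 8760


def unit_reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability that one unit runs for mission_hours without failing.

    Assumption: constant failure rate of 1 / mtbf_hours (exponential model).
    """
    return math.exp(-mission_hours / mtbf_hours)


def series_reliability(mtbf_hours_per_unit, mission_hours: float) -> float:
    """Probability that every unit survives the whole mission.

    Assumption: failures are independent and there is no redundancy,
    so the failure rates of the individual units simply add up.
    """
    total_rate = sum(1.0 / mtbf for mtbf in mtbf_hours_per_unit)
    return math.exp(-total_rate * mission_hours)


# Hypothetical MTBF figures, not vendor data:
# 10 UPSs at 500,000 hours and 10 sets of switchgear at 1,000,000 hours.
fleet = [500_000] * 10 + [1_000_000] * 10
five_years = 5 * HOURS_PER_YEAR

print(f"Single UPS over five years:  {unit_reliability(500_000, five_years):.3f}")
print(f"All 20 systems, no failures: {series_reliability(fleet, five_years):.3f}")
```

Under these assumptions the combined figure is always lower than that of any single unit, which is the gap between product reliability and system reliability that the CTR approach is meant to address. Real facilities use redundancy, which this sketch deliberately ignores.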
“Today, a lot of data is being gathered from data center physical infrastructure assets and their performance in the field,” Andy says. “That data allows stakeholders who are analyzing a fleet of assets, for example, to know how long systems and groups of systems are running without failures.” Establishing such a baseline of time-until-failure metrics deepens the understanding of true system reliability.
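One common way to turn such field data into a baseline is a fleet-level mean time between failures (MTBF): total operating hours divided by total failures. The sketch below illustrates that idea only. The record layout, field names and sample figures are assumptions made for this example and do not describe any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AssetRecord:
    """Hypothetical field record for one asset (layout assumed for illustration)."""
    asset_id: str
    operating_hours: float
    failures: int


def fleet_mtbf(records: Iterable[AssetRecord]) -> Optional[float]:
    """Fleet MTBF in hours: total operating hours divided by total failures.

    Returns None when no failures have been observed, because the data
    then only gives a lower bound rather than an estimate.
    """
    records = list(records)
    total_hours = sum(r.operating_hours for r in records)
    total_failures = sum(r.failures for r in records)
    if total_failures == 0:
        return None
    return total_hours / total_failures


# Made-up sample data for three UPS units.
sample = [
    AssetRecord("ups-01", 20_000, 1),
    AssetRecord("ups-02", 30_000, 0),
    AssetRecord("ups-03", 10_000, 2),
]
print(fleet_mtbf(sample))
```

Units that have not failed still contribute their operating hours, which is why a fleet-wide view gives a fuller picture than looking only at the assets that broke down.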
Failure Analysis Adds to the Predictive Equation
Failure analysis is also an important factor. A sub-CTR process called issue-to-prevention incorporates automated repair work orders and mechanisms to dispatch and coordinate services to systems. These services also rely on KPIs to measure the effectiveness and speed of each case dispatched, and the data collected and analyzed, once again, improves the accuracy of reliability forecasts. Once a problem is fixed, a final phase investigates why the system failed. This technical assessment considers the defective parts and compares them with other incidents to determine whether a systemic issue exists, such as an increase in capacitor failures.
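The comparison across incidents can be pictured with a simple count of failed parts per reporting period. The sketch below flags parts whose failure count has risen sharply from one period to the next. The function name, the thresholds (`min_count`, `ratio`) and the sample incident lists are all hypothetical, and a real assessment would rely on engineering judgment as well as counts.

```python
from collections import Counter
from typing import Iterable, List


def rising_components(previous: Iterable[str], current: Iterable[str],
                      min_count: int = 3, ratio: float = 2.0) -> List[str]:
    """Return parts whose failure count rose sharply between two periods.

    A part is flagged when it failed at least min_count times in the
    current period and at least ratio times as often as before.
    Both thresholds are illustrative assumptions, not industry values.
    """
    prev_counts = Counter(previous)
    curr_counts = Counter(current)
    flagged = []
    for part, count in curr_counts.items():
        # Treat a part with no earlier failures as having one, to avoid
        # flagging everything that appears for the first time.
        baseline = max(prev_counts.get(part, 0), 1)
        if count >= min_count and count >= ratio * baseline:
            flagged.append(part)
    return sorted(flagged)


# Made-up incident logs: the defective part recorded for each repair.
last_quarter = ["capacitor", "fan", "fan"]
this_quarter = ["capacitor"] * 4 + ["fan", "fan", "breaker"]

print(rising_components(last_quarter, this_quarter))
```

With these sample lists only the capacitor is flagged: its failures went from one to four, while the fan and breaker counts stay below the minimum.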
The intent of the CTR process is to preempt failures through more accurate predictions, to document issues as they occur, and to rank them in terms of criticality. As these issues are addressed through the process, either they are brought under better control or the underlying product design defects are eliminated, in support of data center power stability.
Read more about how colocation providers focus on reliability for hyperscaler and enterprise customers in this blog post: Why Hyperscalers Count on Colocation Providers to Accommodate High Demand for Data Center Capacity and Services.