ASHBURN, VA – MAY 9: People walk through the hallways at Equinix Data Center in Ashburn, Virginia, on May 9, 2024. (Amanda Andrade-Rhoades for The Washington Post via Getty Images)
Artificial intelligence is achieving utility‑like status. Virtually every company, organization and institution is adopting AI to some degree in its business processes. As reliance on AI grows, so does the need for companies to ensure their AI is as resilient as the rest of their critical infrastructure.
Behind every chatbot, image generator or enterprise application lies a vast, mostly invisible network of data centers, power plants, transmission lines, batteries, cooling systems and security layers. Given that complexity, how can businesses and society avoid losing this new, increasingly critical capacity to work more efficiently, and sometimes even “think” smarter?
Power Is the Biggest Constraint
In a country already at risk of electricity shortfalls, U.S. data center power demand is projected to double or triple by 2028, driven by AI’s soaring computing needs. That raises big questions. As developers build increasingly large data centers, how will they be powered? Traditional offsite generation and long-distance transmission to the point of use will clearly be strained, potentially at the expense of the existing power needs of businesses and homes.
Co-locating power generation facilities (whether hydrocarbon-fueled, renewables with battery storage, or even small modular nuclear reactors in the future) has the advantage of eliminating the need for transmission. But it places data center operators (or their partners) in the position of operating as, or developing direct partnerships with, power generators.
This trend introduces risk. More than half of data center operators surveyed in 2024 had outages in the past three years – most often from power failures and cooling issues. Downtime is expensive: business interruptions can cost as much as $9,000/minute – up to $5M/hour in critical sectors. Both the power and computing aspects of AI are also vulnerable to the same hazards affecting every business: flood, fire, storms and cyberattacks.
AI resilience is too important to ignore. Governments worldwide – including the U.S. – are framing AI leadership as a national security issue, further intensifying global competition and rapid growth.
So how can we collectively ensure our AI tools will work when needed?
Where Your AI Comes From
Your AI comes from two different types of data centers: training data centers and inference data centers.
Training data centers are where large language models are schooled on enormous volumes of data so they can later respond to user prompts. They often use the most powerful chips (GPUs and other accelerators) and house large, concentrated clusters that draw power around the clock. Since consumers never interact with models during training, there’s little benefit in locating these centers closer to users.
Inference data centers are where chatbots and AI tools generate real‑time answers using the parameters, often numbering in the trillions, learned when the models were trained. Like conventional cloud data centers, inference data centers need to be dispersed geographically to improve response times for end users.
Training requires a massive upfront surge of electricity, while inference sustains a continuous but lower‑level draw. Interestingly, there’s a tradeoff: the more resources you invest in training a model, the less you need for inference, and vice versa.
In addition to large amounts of power, all data centers need water (for effective cooling and fire protection) and robust network connections. These factors, plus the ever-present specter of natural hazards, complicate efforts to bring inference data centers close to users. And given the substantial resource demand, some communities are questioning their value and limiting new data center construction.
Points of Failure—And What You Can Do
Engineering teams like those at FM help data center owners design for and operate with resilience. Here are some of the main risks and resilience strategies for owners and operators; they also serve as key requirements for any business assessing the resilience of its AI provider:
- Power interruptions: Strategies include onsite backup systems, alternate providers, and disaster recovery plans.
- Heat and fire risks: Mitigate risk with compartmentalized fire barriers, water‑based cooling, sprinklers or water mist systems, and thermal-runaway detection tools for batteries.
- Network limitations: AI responses require high bandwidth on demand. Invest in high‑speed, redundant connections.
- Physical hazards: Wildfires, floods and severe weather can damage AI infrastructure. Use hazard maps to understand which hazards threaten vital locations, and take measures to reduce the consequences. These might include low-flammability construction, hail‑resistant and tiltable solar panels, ample water supply for fire protection, or flood protection such as flood barriers and stormwater management.
- Cybersecurity: AI is only as secure as its weakest link. Conduct thorough physical and digital security audits and harden defenses accordingly.
- Human error: Even with automation, people remain in the loop. Operators need rigorous training, clear priorities and regular drills to ensure continuous operations.
- Revenue loss: Disruptions are costly. Protect investments with adequate insurance from design and construction through operation.
The Bottom Line
Although AI may sometimes feel like magic, it is more like a utility that relies on complex infrastructure with built‑in risks and extraordinary pressure to scale.
If your business depends on AI, whether you’re a user or provider, prioritize understanding and ensuring resilience. Companies that prepare now will be best positioned to thrive, sustaining continuity that supports growth, profitability and investor confidence – while others are left waiting for their systems to return to normal, or for a response to their prompt.