The unprecedented FAA outage that resulted in all domestic flights being grounded has everyone asking questions:
How did this happen?
Who is responsible?
How do we prevent something similar from occurring again?
This outage has put us on notice, highlighting that even the systems that we consider the most secure, trusted and validated can fail.
While outages that rise to the level of public awareness are rare, one that occurs in a life-critical system can trigger an avalanche of catastrophic results affecting safety, security and the economy. We are seeing this now in the disruption of transportation and in the overloaded web and app services inundated by thousands of passengers scrambling to get to their destinations.
While today’s FAA outage is considered a system failure, it was a graceful degradation failure. That means, luckily, no deaths resulted from the failure and the system effectively shut down before more damage was done.
This is fortunate, but not encouraging.
Testing has always been used in manufacturing to detect defects. Fault simulation, for example, was a method of artificially “breaking” a device to see whether diagnostic tests would detect and isolate failures down to their root causes. In software, engineers are taught to design to the specification of what a system should do functionally. Much less effort has been spent looking for catastrophic scenarios, or the “perfect storm” of conditions that must coincide to cause system failure. Anticipating these conditions can help us build in mechanisms that proactively detect and prevent catastrophic failure.
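The fault-simulation idea described above can be sketched in a few lines of software: break one component at a time and check whether a diagnostic routine detects and isolates it. This is a minimal illustration only. The component names and the `run_diagnostics` function are hypothetical and do not describe any real FAA system.

```python
from typing import Dict, List

# Hypothetical component names, chosen for illustration only.
COMPONENTS = ["database", "message_queue", "backup_link"]


def make_healthy_system() -> Dict[str, bool]:
    """Return a toy system state in which every component is healthy."""
    return {name: True for name in COMPONENTS}


def run_diagnostics(system: Dict[str, bool]) -> List[str]:
    """Toy diagnostic: list every checked component that has failed.

    The check for 'backup_link' is deliberately missing, to show how
    fault simulation exposes a gap in diagnostic coverage.
    """
    checked = ["database", "message_queue"]
    return [name for name in checked if not system[name]]


def fault_simulation() -> Dict[str, bool]:
    """Inject one fault at a time; record whether diagnostics isolate it."""
    results = {}
    for faulty in COMPONENTS:
        system = make_healthy_system()
        system[faulty] = False  # artificially "break" this component
        results[faulty] = run_diagnostics(system) == [faulty]
    return results


if __name__ == "__main__":
    for component, detected in fault_simulation().items():
        print(f"{component}: {'detected' if detected else 'MISSED'}")
```

In this toy setup the injected fault in `backup_link` goes unnoticed, which is exactly the kind of blind spot such testing is meant to surface before a real failure does.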
Preventing Future Outages and Other Critical Infrastructure Failures
With the proliferation of cloud computing and Artificial Intelligence solutions, we now have enough computational power to evaluate millions of operational scenarios and identify which ones might lead to catastrophic failure.
For the FAA, it should now be possible to proactively analyze conditions and data from all domestic airports; aircraft in the sky, on the ground and scheduled for future use; control tower communications and related infrastructure; passengers; weather; and security, in order to game out scenarios that may result in system failure.
Given the complexity of this system's interactions and interdependencies, examining every point of failure is clearly a daunting proposition.
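To make "gaming out scenarios" concrete, here is a minimal sketch that enumerates every combination of a handful of operating conditions and keeps the combinations that a failure rule marks as catastrophic. The condition names and the rule in `system_fails` are invented for illustration. A real analysis would involve far more variables and a model derived from the actual system.

```python
from itertools import product
from typing import Dict, List

# Hypothetical operating conditions, each either present (True) or absent.
CONDITIONS = ["primary_down", "backup_down", "peak_load", "maintenance_active"]


def system_fails(s: Dict[str, bool]) -> bool:
    """Invented toy rule: which combinations of conditions cause an outage."""
    both_servers_lost = s["primary_down"] and s["backup_down"]
    overloaded_during_maintenance = (
        s["primary_down"] and s["peak_load"] and s["maintenance_active"]
    )
    return both_servers_lost or overloaded_during_maintenance


def find_failure_scenarios() -> List[Dict[str, bool]]:
    """Enumerate all 2**n condition combinations and keep the failing ones."""
    failures = []
    for values in product([False, True], repeat=len(CONDITIONS)):
        scenario = dict(zip(CONDITIONS, values))
        if system_fails(scenario):
            failures.append(scenario)
    return failures


if __name__ == "__main__":
    for scenario in find_failure_scenarios():
        active = [name for name, present in scenario.items() if present]
        print("failure when:", ", ".join(active))
```

Four conditions produce only 16 combinations, but the count doubles with every condition added, which is why exhaustive enumeration quickly gives way to sampling and automated search at realistic scale.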
Artificial Intelligence can help to analyze this overwhelming amount of data to proactively look for patterns and behaviors that might pose challenges to FAA systems.
This is not unprecedented, as Artificial Intelligence has been utilized to better examine traffic patterns for optimized scheduling and logistics.
The technology can also be deployed as a powerful defense mechanism to provide early detection of cyberattacks and/or abnormal behaviors in systems. The key to effectively deploying such systems will be isolating those specific outliers and conditions so they can be vetted by human experts.
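As a rough illustration of isolating outliers for human review, the sketch below flags readings that sit far from the median, using the modified z-score with its conventional 0.6745 scaling constant and 3.5 cutoff. The readings are made-up numbers, and a production system would rely on much richer models. The point is only that flagged items go to a person rather than triggering automatic action.

```python
from statistics import median
from typing import List, Tuple


def flag_outliers(
    readings: List[float], threshold: float = 3.5
) -> List[Tuple[int, float]]:
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    The score is based on the median absolute deviation (MAD), so a single
    extreme reading does not distort the baseline it is compared against.
    """
    center = median(readings)
    mad = median(abs(x - center) for x in readings)
    if mad == 0:
        return []  # no measurable spread, so nothing can be scored
    flagged = []
    for index, value in enumerate(readings):
        score = 0.6745 * abs(value - center) / mad
        if score > threshold:
            flagged.append((index, value))
    return flagged


if __name__ == "__main__":
    # Made-up response times in milliseconds, for illustration only.
    response_ms = [102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 480.0, 100.0]
    for index, value in flag_outliers(response_ms):
        print(f"reading {index} ({value} ms) queued for expert review")
```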
There are many lessons to be learned from the FAA outage, and in time we’ll have a clearer picture of what occurred. But, for now, it is apparent that emerging technologies, such as Artificial Intelligence, that enable the proactive detection of system failures and other challenges that may arise have a prominent role to play in how we maintain our critical infrastructure moving forward.
Source: https://www.forbes.com/sites/karenpanetta/2023/01/11/the-perfect-storm-of-the-faa-outage-why-catastrophic-scenario-testing-beyond-manufacturing-is-essential-for-critical-infrastructure-security/