The recent massive British Airways data centre outage highlights just how damaging data centre issues can be. Data centre managers are constantly replacing ageing infrastructure systems and components in order to avoid failures and downtime. However, most data centre outages aren’t actually caused by tech failure, but by issues with people and processes.
In BA's case, a power outage caused the initial downtime, but the wider outage followed because the staff working in the data centre didn't know how to safely restore all the systems when switching them back on. This knowledge gap, due to outsourcing and insufficient training, made the initial outage far worse and turned it into a full-blown disaster event. Data centre managers have to focus a lot of energy and resources on keeping infrastructure up to date, but to remain resilient they need to place just as much importance on the processes around these systems.
Two-thirds of data centre outages are related to process, not infrastructure systems. The costs related to an outage are far-reaching. Initial costs include damage to mission-critical data, lost productivity and equipment damage. Longer-term consequences can include legal and regulatory impacts and lost confidence and trust among stakeholders and customers. For BA, the loss of reputation and market share has been significant.
Because people and processes matter so much in reducing downtime, data centre managers should take the following actions:
Make maintenance coherent and repeatable
When you constantly have to patch servers, you need a way to store regular maintenance routines so that they are repeatable, sustainable and updatable processes that don't rely solely on human knowledge. This will reduce the risk of an outage when maintenance is being run.
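One way to picture a stored, repeatable routine is as an ordered list of named steps kept under version control. The sketch below is only an illustration under that assumption: the `Step` and `Runbook` classes, the routine name and the step names are all hypothetical, and each lambda stands in for a real operation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success


@dataclass
class Runbook:
    name: str
    version: int  # bumped whenever the routine is updated
    steps: List[Step] = field(default_factory=list)

    def run(self) -> List[Tuple[str, str]]:
        """Run the steps in order and stop at the first failure."""
        log = []
        for step in self.steps:
            ok = step.action()
            log.append((step.name, "ok" if ok else "failed"))
            if not ok:
                break  # later steps are not attempted
        return log


# Hypothetical patch routine; every action here is a placeholder.
patch_runbook = Runbook(
    name="monthly-server-patch",
    version=3,
    steps=[
        Step("drain traffic", lambda: True),
        Step("apply patches", lambda: True),
        Step("run health check", lambda: False),  # simulated failure
        Step("restore traffic", lambda: True),
    ],
)

result = patch_runbook.run()
```

Because the routine stops at the failed health check, traffic is never restored to a server that has not been verified, and the log records exactly how far the run got.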
Constantly learn and improve
When an issue does occur, it is essential to update processes and procedures accordingly. If you gather knowledge and information whenever there is an issue or outage and use this to regularly update processes, the amount of downtime will decrease over time.
Provide status visualisation
Releases often fail for infrastructure reasons: people on the software side of the business try to release a piece of software without knowing that those running the infrastructure are carrying out maintenance. If we can make maintenance more predictable and surface reliable information, we can reduce the number of failures that occur. The ability to view real-time information about what is being done to each system will help to avoid collisions between software updates and infrastructure maintenance.
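As a rough sketch of how such a collision could be caught before a release goes out, the example below checks a planned release against scheduled maintenance on the same system. The `Window` class, the `conflicts` function, the system names and the dates are all assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Window:
    system: str
    start: datetime
    end: datetime
    owner: str


def conflicts(planned: Window, scheduled: List[Window]) -> List[Window]:
    """Return scheduled windows on the same system that overlap the planned one."""
    return [
        w for w in scheduled
        if w.system == planned.system
        and w.start < planned.end
        and planned.start < w.end
    ]


# Hypothetical maintenance schedule published by the infrastructure team.
maintenance = [
    Window("db-cluster-1", datetime(2024, 1, 10, 22, 0),
           datetime(2024, 1, 11, 2, 0), "infrastructure"),
    Window("web-tier", datetime(2024, 1, 12, 1, 0),
           datetime(2024, 1, 12, 3, 0), "infrastructure"),
]

# Hypothetical release planned by the software team.
release = Window("db-cluster-1", datetime(2024, 1, 10, 23, 30),
                 datetime(2024, 1, 11, 0, 30), "software")

clashes = conflicts(release, maintenance)
```

Here `clashes` contains the `db-cluster-1` maintenance window, so the release team would see the conflict before starting rather than after a failed deployment.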
Good communication and status visibility will improve response times and better enable teams to deal with issues. People need the best information available to make informed decisions and everyone needs to be aware of current status. There should also be a clear policy for how and when clients are notified, but communicating internally is just as important.
Have recovery plans ready
Storing data centre recovery test plans as flexible templates means that you can build subsequent tests on existing capabilities and knowledge rather than starting from scratch. It also means that when a real-life disaster recovery event occurs, the necessary information is readily available, reducing the response time in a crisis.
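A recovery plan template can be as simple as an ordered list of steps with placeholders that are filled in for each test or real event. The following is a minimal sketch under that assumption; the step wording, the placeholder names and the `build_plan` helper are hypothetical.

```python
from string import Template
from typing import Dict, List

# Hypothetical restart sequence; the order of steps is part of the stored knowledge.
RECOVERY_TEMPLATE: List[str] = [
    "Confirm power is stable at $site",
    "Start $storage before any dependent servers",
    "Bring up $database and verify replication",
    "Start application servers and notify $contact",
]


def build_plan(template: List[str], values: Dict[str, str]) -> List[str]:
    """Fill every placeholder; a missing value raises KeyError."""
    return [Template(line).substitute(values) for line in template]


plan = build_plan(
    RECOVERY_TEMPLATE,
    {
        "site": "site-a",
        "storage": "storage-array-1",
        "database": "orders-db",
        "contact": "duty-manager",
    },
)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: an incomplete plan fails loudly when it is built, not in the middle of a crisis.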
With the average cost of a data centre outage rising (38% since 2010), managers need to do everything they can to reduce the risk of an outage and ensure they have the right processes in place to deal with outages quickly when they do occur. Reducing the risks involved in routine maintenance, increasing visibility for teams working on and using the services, and having repeatable recovery processes in place can significantly reduce downtime.