In 2017, British Airways had a massive data centre outage that highlighted just how damaging data centre issues can be. Data centre managers are constantly replacing ageing infrastructure systems and components in order to avoid failures and downtime. However, most data centre outages aren’t actually caused by tech failure, but by issues with people and processes.
In BA's case, a power outage caused the initial downtime, but it was staff not knowing how to safely restore the systems when switching them back on that led to the wider outage. This knowledge gap, attributed to outsourcing and insufficient training, made the initial outage far worse, turning it into a full-blown crisis management event. Data centre managers rightly devote energy and resources to keeping infrastructure up to date, but just as much importance should be placed on the processes that maintain and support these services and keep them resilient.
Two-thirds of data centre outages are related to process, not infrastructure systems. The costs related to an outage are far-reaching, including not only initial costs such as damage to mission-critical data, lost productivity and equipment damage but also legal and regulatory impacts, loss of reputation in the eyes of customers and loss of market confidence.
If people and processes are so important in reducing downtime, data centre managers should take the following actions:
1. Make maintenance coherent and repeatable
When servers are constantly being patched, regular maintenance routines need to be stored as repeatable, sustainable and updateable processes that don't rely solely on human knowledge. Stored, repeatable maintenance plans reduce the risk of an outage caused by human error while maintenance is being run.
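To make the idea concrete, here is a minimal sketch of a maintenance routine stored as structured data rather than tribal knowledge, with a generic executor that runs the steps in order. The runbook name, steps and schema are hypothetical, invented for illustration:

```python
# Hypothetical runbook schema: a stored, versioned maintenance routine that
# any operator can re-run, instead of knowledge held only in someone's head.
MAINTENANCE_RUNBOOK = {
    "name": "monthly-server-patching",
    "version": 3,
    "steps": [
        {"order": 1, "action": "drain traffic from node", "verify": "no active sessions"},
        {"order": 2, "action": "apply OS patches", "verify": "patch level matches baseline"},
        {"order": 3, "action": "reboot and health-check node", "verify": "all services green"},
        {"order": 4, "action": "restore traffic", "verify": "error rate at baseline"},
    ],
}

def execute_runbook(runbook, run_step):
    """Run each step in order; stop and report at the first failure.

    `run_step` is a callable that performs one step and returns True on
    success, so the same stored runbook can be exercised with stubs in a
    test and with real actions in production.
    """
    completed = []
    for step in sorted(runbook["steps"], key=lambda s: s["order"]):
        if not run_step(step):
            return {"status": "failed", "failed_step": step["order"], "completed": completed}
        completed.append(step["order"])
    return {"status": "ok", "completed": completed}
```

Because the routine is data, it can be versioned, reviewed and updated after every run rather than living solely in an engineer's memory.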
2. Constantly learn and improve
When an issue does occur and recovery is needed, it is essential to update processes and procedures accordingly. If knowledge is gathered after every issue or outage and fed back into regularly updated processes, those processes improve continuously and downtime falls. A proper source of record provides the analytics for this improvement.
3. Provide status visualisation
Even with complex Configuration Management Databases (CMDBs), infrastructure is often the root cause of failed implementations: people on the software side of the business release software without knowing that those running the infrastructure are carrying out maintenance. Making maintenance events more predictable and surfacing more reliable information reduces the number of failures. Visualising change in real time massively reduces the likelihood of event failures, and long-term visualisation of change across the organisation can prevent clashes between changes.
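The clash-prevention idea can be sketched in a few lines: if every team registers its planned window, a simple overlap check flags a software release that would collide with an infrastructure maintenance window before either change starts. The change names and dates are invented for illustration:

```python
from datetime import datetime

def windows_overlap(start_a, end_a, start_b, end_b):
    """Two windows clash if each starts before the other ends."""
    return start_a < end_b and start_b < end_a

def find_clashes(changes):
    """Return pairs of change names whose planned windows overlap."""
    clashes = []
    for i, a in enumerate(changes):
        for b in changes[i + 1:]:
            if windows_overlap(a["start"], a["end"], b["start"], b["end"]):
                clashes.append((a["name"], b["name"]))
    return clashes

# Hypothetical registered changes from the infrastructure and software sides.
changes = [
    {"name": "storage-firmware-upgrade",
     "start": datetime(2018, 3, 10, 22, 0), "end": datetime(2018, 3, 11, 2, 0)},
    {"name": "billing-app-release",
     "start": datetime(2018, 3, 11, 0, 0), "end": datetime(2018, 3, 11, 1, 0)},
    {"name": "network-switch-patch",
     "start": datetime(2018, 3, 12, 22, 0), "end": datetime(2018, 3, 12, 23, 0)},
]
```

Here the app release lands inside the firmware upgrade's window, so the pair is flagged; visualising the same data on a shared timeline gives both teams the warning before the failure, not after.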
4. Improve communication
Good monitoring and visualisation are key to effective communication during both maintenance events and incidents. The impacted service owners and managers need to quickly understand their own recovery options and have the best information available to make informed decisions.
5. Have recovery plans ready
Storing data centre recovery test plans as flexible templates will mean that subsequent tests are built on existing capabilities and knowledge rather than starting from scratch. It also means that when real-life disaster recovery plans are needed, the necessary information will be readily available, reducing the response time in a crisis.
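A recovery plan stored as a flexible template might look like the hypothetical sketch below: the placeholders (service, standby site, backup, owner) are filled in per test or per incident, so each run starts from existing knowledge rather than a blank page. All the names here are assumptions for illustration:

```python
import string

# Hypothetical stored template: the plan structure is reusable, only the
# details of the affected service change from one test or incident to the next.
RECOVERY_PLAN_TEMPLATE = string.Template(
    "Recovery plan for $service\n"
    "1. Fail over $service to $standby_site\n"
    "2. Verify data integrity against last known-good backup ($backup_id)\n"
    "3. Notify owner $owner and record the outcome for the next revision"
)

def instantiate_plan(service, standby_site, backup_id, owner):
    """Fill the stored template with the details of the affected service."""
    return RECOVERY_PLAN_TEMPLATE.substitute(
        service=service, standby_site=standby_site,
        backup_id=backup_id, owner=owner)
```

Because the template is updated after every test, the instantiated plan in a real crisis already reflects everything learned so far, cutting response time.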
With the average cost of a data centre outage rising (up 38% since 2010), managers need to do everything they can to ensure the right processes are in place to deal with outages quickly when they do occur. Reducing the risks in routine maintenance, increasing visibility for teams working on and using the services, and having repeatable recovery processes in place can significantly reduce downtime.