5 Ways to Avoid Critical IT Event Failure

Critical IT events are the primary enablers of essential business change. Events are at the heart of realising business benefits, whether rolling out new revenue-generating products and services, entering a new market, looking to reduce the cost and risk of existing services or satisfying the evolving needs of regulators. In this article, we look at 5 ways to avoid critical event failure.

However, these critical events are high-cost and high-risk in nature. Technical transitions have to be executed seamlessly to avoid disruption to key clients and core services.

Banking is currently undergoing one of the biggest technological overhauls in its history, replacing and integrating decades of evolved “legacy” systems while continuously implementing and expanding services to meet the needs of the digital age. This technological transition is driving unprecedented levels of change, both in terms of frequency and the amount of change required to remain market relevant.

The adoption of agile and DevOps processes and tools is revolutionising this space. However, it has not eradicated the need for highly orchestrated critical IT events, executed by global teams across a finite supply of weekend deployment windows.

These live weekends are in themselves critical assets and the volume of change being deployed across them is significant. This, coupled with the complex, integrated architectures that IT services now form a part of, is steadily increasing the risk and complexity of these events and, with it, the likelihood of execution failures.

Strict regulation also brings the additional risk of legal action and fines being imposed on a bank because of the impact of a failed event. A bank’s reputation is also at stake when a critical event fails, especially given the speed at which information travels on social media.

75% of all major IT outages have their root cause in change and the consequences of these outages can be profound at both an industry and global level. Failure of a payment system can directly impact market liquidity.

It is vital that financial institutions look at how they can address the risks linked to critical event failure, for themselves and for their customers. Some of the most common causes of critical event failure are poor communication, undefined goals and inaccurate time estimates. Addressing these causes allows banks to avoid both financial losses and significant fines, while maintaining their reputation and protecting their customers.

Five best practices for avoiding critical event failure:

1. Break up key areas into manageable chunks

Look for opportunities to de-risk the overall event by fully understanding the exact periods for which each impacted application or service will be suspended. Make sure that your release plan covers the full restoration of each service, not just the completion of the release activity. Ensure that these services are brought back to full operational readiness at the earliest possible opportunity during the event and that the restoration status of each is clearly reported.

2. Dry runs and dress rehearsals

Test and rehearse the release process throughout the development lifecycle. Dry runs provide early assurance that sequencing and dependencies are in a logical order. The dress rehearsals provide a real-time opportunity to validate task timings, which are critical when working to a fixed end time.
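As a rough illustration of validating task timings against a fixed end time, the sketch below walks a small runbook, works out when each task would finish given its dependencies, and reports how much slack remains before the window closes. The task names, durations and window times are hypothetical, and the sketch assumes every task starts as soon as its prerequisites are done; real rehearsal timings would replace these figures.

```python
from datetime import datetime, timedelta

# Hypothetical runbook: task name -> (duration in minutes, prerequisite tasks).
# In practice the durations would come from dress-rehearsal measurements.
RUNBOOK = {
    "suspend_service": (15, []),
    "backup_database": (45, ["suspend_service"]),
    "deploy_release": (60, ["backup_database"]),
    "migrate_data": (90, ["backup_database"]),
    "smoke_test": (30, ["deploy_release", "migrate_data"]),
    "restore_service": (20, ["smoke_test"]),
}


def finish_offsets(runbook):
    """Return the minutes-from-start at which each task finishes, assuming
    each task begins as soon as all of its prerequisites have finished."""
    finished = {}

    def finish(task, visiting=()):
        if task in visiting:
            raise ValueError(f"circular dependency involving {task!r}")
        if task not in finished:
            duration, deps = runbook[task]
            start = max((finish(d, visiting + (task,)) for d in deps), default=0)
            finished[task] = start + duration
        return finished[task]

    for name in runbook:
        finish(name)
    return finished


def slack_minutes(runbook, window_start, window_end):
    """Minutes left between the projected end of the runbook and the fixed
    end of the deployment window (negative means the plan overruns)."""
    total = max(finish_offsets(runbook).values())
    projected_end = window_start + timedelta(minutes=total)
    return (window_end - projected_end).total_seconds() / 60


if __name__ == "__main__":
    start = datetime(2024, 1, 6, 22, 0)  # hypothetical window start
    end = datetime(2024, 1, 7, 2, 0)     # hypothetical fixed end time
    print(f"Slack before window closes: {slack_minutes(RUNBOOK, start, end):.0f} min")
```

A dry run corresponds to checking that the dependency walk succeeds at all (no circular or missing prerequisites); a dress rehearsal is what supplies trustworthy durations.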

3. Ensure environment consistency

Often, software release failures are caused not by the software itself but by configuration differences between the testing and production environments. To avoid this, ensure that all integrated system testing and QA activities are performed in production replica environments. Use environment comparison tools to ensure that consistency is maintained throughout the pre-release period.
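To make the idea of an environment comparison concrete, here is a minimal sketch that reports every setting whose value differs between two environments, including settings present in only one of them. The configuration keys and values are invented for illustration, and the sketch assumes each environment's configuration has already been loaded into a flat dictionary; it is not a substitute for a dedicated comparison tool.

```python
MISSING = "<missing>"  # marker for a key that exists in only one environment


def config_drift(reference, candidate):
    """Return {key: (reference_value, candidate_value)} for every key whose
    value differs between the two environments."""
    drift = {}
    for key in sorted(set(reference) | set(candidate)):
        ref_value = reference.get(key, MISSING)
        cand_value = candidate.get(key, MISSING)
        if ref_value != cand_value:
            drift[key] = (ref_value, cand_value)
    return drift


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    production = {"db.pool_size": "50", "tls.min_version": "1.2", "feature.x": "off"}
    test_env = {"db.pool_size": "20", "tls.min_version": "1.2", "cache.ttl": "300"}

    for key, (prod_value, test_value) in config_drift(production, test_env).items():
        print(f"{key}: production={prod_value} test={test_value}")
```

Running a check like this on a schedule throughout the pre-release period, rather than once, is what catches drift introduced after the replica environment was first built.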

4. Prevent recurring issues

Whenever issues arise in repeatable release mechanisms, it is critical that they are escalated into formal Problem Management tickets. These should be addressed by the technical or business teams so that the issue is fixed properly and does not happen again. Often, Problem Management tickets are not addressed before the next release, which leads to a repeat of the same problems.

5. Have an exit strategy

Critical event failure can be assessed at two grades of severity. The higher severity is when the failed event goes on to have significant and wide-scale operational impact. Significant time and money will then be required to carry out “post-mortems” rather than more straightforward Post-Implementation Reviews. The lower severity is when unforeseen problems occur during the event and the Release Manager is able to invoke pre-prepared “exit Runbooks” that restore all services to a go-forward, business-ready state for the next working day. The robust ability to exit safely when things start to go wrong is the difference between an event failing gracefully and a catastrophic failure that impacts customers and reputation and ultimately exposes the bank to regulatory penalties.
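One way to reason about when an exit Runbook must be invoked is to work backwards from the end of the window: the exit itself takes time, so there is a latest point at which it can still complete safely. The sketch below is an assumption-laden illustration of that arithmetic; the function names, the 30-minute buffer and the three outcome labels are hypothetical rather than part of any established process.

```python
from datetime import datetime, timedelta


def latest_exit_decision(window_end, exit_runbook_minutes, buffer_minutes=30):
    """Latest moment at which the exit Runbook can start and still finish,
    with a safety buffer, before the window closes."""
    return window_end - timedelta(minutes=exit_runbook_minutes + buffer_minutes)


def exit_decision(now, projected_completion, window_end,
                  exit_runbook_minutes, buffer_minutes=30):
    """Return a hypothetical recommendation for the Release Manager."""
    deadline = latest_exit_decision(window_end, exit_runbook_minutes, buffer_minutes)
    if projected_completion <= window_end:
        return "continue"             # forward plan still fits the window
    if now <= deadline:
        return "invoke_exit_runbook"  # overrun expected, exit is still safe
    return "escalate"                 # exit can no longer finish in time


if __name__ == "__main__":
    window_end = datetime(2024, 1, 8, 6, 0)  # hypothetical start of business
    print(latest_exit_decision(window_end, exit_runbook_minutes=120))
    print(exit_decision(datetime(2024, 1, 8, 2, 0), datetime(2024, 1, 8, 7, 0),
                        window_end, exit_runbook_minutes=120))
```

The point of computing the deadline in advance is that the decision to exit is made against an agreed time rather than under pressure during the event.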

Find out more about Cutover and resilience.
