Tackling Operational Risk When Performing Surgery on Critical IT Systems

Have you ever wondered why IT outages result from making changes to critical IT systems, or why CEOs of major institutions are regularly on the news apologising? It is not because these institutions lack capable people or fail to take things seriously. It is because the act of making a change to a critical IT system exposes an institution to a complex, and sometimes unpredictable, landscape of operational risk.

As system integration, automation, architectural complexity and functional capability keep increasing, it becomes less and less achievable to truly replicate the live environment in testing and to predict all potential outcomes. Under these circumstances, it is very difficult to lead a team through a precise, interlinked set of activities to migrate to new systems or apply updates whilst remaining ready to manage the unexpected. This is further complicated by globally distributed operating models, which are designed to make the most of time zones, skills and cost benefits around the world.

To mitigate operational risk, today's Release Managers need more sophisticated tools than email and spreadsheets. They need real-time visualisation and collaboration tools and the ability to direct proceedings with accuracy and auditable transparency.

Run over email and conference calls, a cutover quickly becomes like trying to get a team to put on a Broadway musical remotely. The further complication is the inevitable issue that changes the production from My Fair Lady to West Side Story in the middle of the night on a Saturday.

This is a fundamental problem of team orchestration, and it is only going to become more prominent. Globally distributed teams that work from a list of tasks, with no real-time context on who is doing what, are prone to behaviours that are problematic but hard to avoid: they locally optimise their work, which leads to tasks being done out of order, or they do nothing whilst awaiting instruction from the centre. Who could blame anyone for locally optimising their work? It looks good: people get a lot done and appear efficient. But it leads to running roughshod over critical dependencies. Doing nothing has similarly issue-ridden consequences.

This problem requires a number of strategies to counteract the operational risk:

1. Organisations in battle versus on parade

The rapid and changing nature of a cutover is like moving from supporting an organisation on parade, doing business-as-usual (BAU) activities day to day, to supporting one in battle that requires coordination on a minute-by-minute basis. The emphasis moves from planning to supporting the organisation in the cutover theatre and ensuring that support and help are available on a real-time, hypercare basis. Getting an organisation fit for these kinds of operations requires recognising that a cutover is a different way of operating, and ensuring that enough practice takes place.

2. Intensity of Communication

In a rapidly changing, complex situation, it's all too easy to want to add detail to plans and generate complexity to match complexity. Such plans become increasingly difficult to keep up to date. An investment in communications really helps. When you have a communications ticker that captures people's queries, issues and status updates, you gain a better picture of the whole and avoid local work optimisation issues.

3. Timecode

For the last two decades of modern music, the difference between a cacophony and a symphony has been a black box that spits out the time code to tell instruments when to play their part in an orchestration. People involved in critical events need this too. When you have a system of record that prompts you in real time when to start things, you have a central trigger that avoids local optimisation issues and maintains the latest picture of dependencies. This removes the need for participants to reference a complex plan and know it intimately. It lets the talent on the team focus on solving problems rather than asking whether they can begin rote tasks. It unlocks the next step in the maturity of critical event management.
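
To make the idea of a central trigger concrete, here is a minimal Python sketch of how a system of record might decide whom to prompt. Everything in it is an assumption for illustration: the Task structure, the ready_tasks and prompts helpers, and the five-step runbook are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One step in a cutover runbook (hypothetical structure)."""
    name: str
    owner: str
    depends_on: set = field(default_factory=set)


def ready_tasks(tasks, completed, started=frozenset()):
    """Return tasks that may begin now: every dependency is complete
    and the task itself has been neither started nor completed."""
    return [
        t for t in tasks
        if t.name not in completed
        and t.name not in started
        and t.depends_on <= completed
    ]


def prompts(tasks, completed, started=frozenset()):
    """Turn the ready tasks into prompts addressed to their owners."""
    return [f"{t.owner}: start {t.name}"
            for t in ready_tasks(tasks, completed, started)]


# Illustrative runbook; task names and owners are invented for this example.
RUNBOOK = [
    Task("freeze_writes", "dba"),
    Task("backup_database", "dba", {"freeze_writes"}),
    Task("deploy_release", "release", {"backup_database"}),
    Task("update_dns", "network", {"freeze_writes"}),
    Task("smoke_test", "qa", {"deploy_release", "update_dns"}),
]

if __name__ == "__main__":
    # Once writes are frozen, two teams can be prompted in parallel.
    for line in prompts(RUNBOOK, completed={"freeze_writes"}):
        print(line)
```

In this sketch no participant needs to hold the whole plan in their head. The owner of smoke_test is prompted only once both deploy_release and update_dns are recorded as complete, so the dependency picture lives in one place rather than in each team's local view.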

The cost of an aborted event is enormous, both in the failure to release revenue-generating or regulatory-driven change and, even more importantly, in the knock-on impact on other scheduled business-critical release slots. In these cases, all of the operational risk is incurred with none of the reward. It is critical to use strategies like those above to protect against this significant downside.