What
I led the technical aspects of an ERP system upgrade for a customer whose installation had fallen four versions behind the current release and carried a large data set. The goal was to upgrade to the current version, which contained new features the client required as well as long-standing bug fixes that made the software more stable. The success criteria were to complete the upgrade within budget, inside the weekend outage window, and with fewer than a specified number of low-, medium- and high-priority functional issues.
The unique challenges identified prior to starting the project included:
The customer was based in Sydney and the Product Engineering team in London, so the teams had to communicate across time zones with no working-day overlap.
An upgrade of this magnitude regarding both the number of versions and the volume of data had not been attempted by a remote deployment team before.
I asked a team member skilled with the product to perform a trial upgrade to establish a baseline for time and effort that we could use to scope the work, mindful of the success criteria above. The trial took more than a month, during which I became increasingly anxious and demanding about the duration. Once it was complete, the system exhibited a net increase in functional issues; the team member could not accurately estimate the time or effort required, having stopped and restarted many times as issues were encountered; and accurate records of the activities had not been kept.
Engineering suggested that we may have missed some steps or that some of the initial data was corrupt, so we ran another trial upgrade on a fresh data set with the same process, but the results were very similar. Two months into the project, we still could not provide a solid effort estimate. The customer had very low confidence that the upgrade could meet the success criteria and decided to delay it. I felt we had failed to understand the job and had forged ahead, wasting time and effort without a plan to reach the scoping objectives.
So What
A post mortem found that the upgrade package quality was poor: instructions for the manual steps were ambiguous and lacked the detail needed to be repeatable, the sheer number of those manual steps slowed our efforts, the build process introduced unnecessary repeated steps, and there were no sanity checks to validate that progress was consistent.
I hadn’t appreciated the volume of work, effort or skills required to address these quality issues. I had assumed the upgrade could be carried out with the distributed materials alone, in isolation from other inputs.
My initial reaction was to approach Engineering for solutions, but they refused: they had other priorities, and we were unable to detail all of the issues we had found. I was frustrated, but realised the current trajectory would not lead to success. I also wondered how I could be responsible for issues shipped in the upgrade distribution, as they seemed outside my control.
The division leader suggested I read Extreme Ownership: How U.S. Navy SEALs Lead and Win by Jocko Willink and Leif Babin (St. Martin's Press, 2015). Its key lessons: leadership is responsible for everything and is the decisive factor in success; simplify; prioritize and execute; and measure and analyse your methods so they can be refined with lessons learned.
After some thought, I realised that rather than applying more pressure to complete the upgrade, I should have stopped the initial trial, reset the project structure around the scoping objectives and started again. I applied some of the strategies suggested in Extreme Ownership:
Simplify - Breaking the project into a series of logically atomic units and documenting the individual steps let us verify that each step was consistently repeatable, and revealed opportunities for re-sequencing and parallelisation of tasks to shorten the overall duration and meet the outage-window objective.
Measure and Analyze - Timing each step allowed us to prioritize opportunities for performance tuning, since the longest steps offered the highest potential for improvement. This also identified repeated steps that could be eliminated and further steps that could be parallelized.
Documenting and then categorizing similar issue-resolution actions allowed us to prioritize the issues so that the fixes delivering the most progress were tackled earliest.
Documenting the steps identified manual work that could be automated.
Documenting provided the detail needed to feed issues back to Engineering for fixes.
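The Measure and Analyze approach above can be sketched in code. This is a minimal illustration, not the tooling we actually used: the step names, durations and data structures are hypothetical, and it simply ranks steps by total time spent while flagging re-runs as elimination candidates.

```python
from collections import Counter

def prioritize_steps(timings):
    """Rank upgrade steps by total time spent, flagging repeats.

    timings: list of (step_name, duration_minutes) tuples recorded
    during a trial run. Repeated step names indicate re-runs that
    are candidates for elimination.
    """
    totals = Counter()
    runs = Counter()
    for step, minutes in timings:
        totals[step] += minutes
        runs[step] += 1
    report = []
    for step, total in totals.most_common():  # longest total time first
        report.append({
            "step": step,
            "total_minutes": total,
            "runs": runs[step],
            "repeated": runs[step] > 1,
        })
    return report

# Hypothetical timings from one trial upgrade iteration
trial = [
    ("schema migration", 240),
    ("data validation", 90),
    ("index rebuild", 180),
    ("data validation", 95),  # re-run after a failure: elimination candidate
]
for row in prioritize_steps(trial):
    print(row)
```

Sorting by total duration surfaces the tuning targets first, which is the prioritisation the runsheet was built to support.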
On further reflection, I realised that I had failed in a number of areas:
I hadn’t provided parameters for the team members to determine when to escalate or take initiative.
I hadn’t provided sources to seek further information if needed.
I had essentially asked the individuals to work in isolation.
We re-scheduled the upgrade with three trial iterations, during which we refined the process until it fit within the outage window and met the quality metrics; however, the budget was exceeded by 50% due to the failed initial iterations. The duration and quality dimensions of this project were very difficult to meet, and I was proud of the team's efforts and of the upgrade process we had started to define.
Now What
The initial assumption I made, that a skilled engineer should be able to run the upgrade successfully on their own, blinded me to the reality of the situation, and I held onto it too long. We got our heads together and came up with a better process to address the continuing inconsistencies with upgrades, which included the following:
Plan for a number of upgrade iterations, estimating the number of iterations based on complexity metrics such as data size, upgrade steps, and specific organisational experience with the process
Automate the manual steps to ensure repeatability
Include sanity checks to ensure major issues are identified early
Identify unnecessary repeat steps and remove them
A documentation set including:
Upgrade Runsheet - a summary list of tasks showing a brief description, resources, start and end times, and durations, used to identify the longest tasks and target performance-tuning activities - the prioritisation.
Step by Step Detailed Instructions - a procedure to follow during the upgrade, with sufficient detail that any Applications Engineer could execute it consistently.
Errors and Issues Document - A detailed description of each issue encountered and the resolution of those issues to feed back to Engineering.
A retrospective for each upgrade cycle.
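The sanity-check idea in the process above can be illustrated with a short sketch. This is a hypothetical harness, not our actual tooling: each atomic step carries a consistency check, and the run stops at the first failed check so a bad iteration surfaces early instead of wasting days.

```python
def run_with_sanity_checks(steps):
    """Run upgrade steps in order, stopping at the first failed check.

    steps: list of (name, action, check) tuples, where action()
    performs the step and check() returns True if the system is
    still consistent afterwards. Failing fast means a major issue
    is identified at the step that caused it.
    """
    completed = []
    for name, action, check in steps:
        action()
        if not check():
            raise RuntimeError(f"sanity check failed after step: {name}")
        completed.append(name)
    return completed

# Hypothetical illustration: copy a record set, then verify row counts match
source = {"orders": list(range(1000))}
target = {"orders": []}

steps = [
    ("copy orders",
     lambda: target["orders"].extend(source["orders"]),
     lambda: len(target["orders"]) == len(source["orders"])),
]
print(run_with_sanity_checks(steps))  # -> ['copy orders']
```

In a real upgrade the checks would compare row counts, checksums or key invariants between source and target after each atomic unit.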
These lessons have been applied many times since that upgrade, with great success.