There is a ubiquitous saying, “never let a crisis go to waste,” which has been applied across many fields and topics. While some may regard a crisis as an opportunity to advance an agenda, technology professionals who create and operate products and services regard crises as something to be systematically eradicated, like termites or polio. Crises distract technologists from their main purpose – creating value for their customers.
Given we should strive to eradicate crises, how should we go about that? It starts with an effective incident management program. What outcomes should an incident management process achieve?
Incident Management Outcomes
- Continuously improve the quality of service
- Maximize application availability
- Minimize incident impact and duration
- Drive underlying causes to resolution, reducing recurrence
What type of things should incident management include?
Incident Management Program Components
- Incident severity classification criteria
- Defined and rehearsed response actions
- Communications protocols addressing internal and external communications during and after an incident
- A problem management program including a postmortem process with exit criteria to ensure the team is preventing recurrence, reducing impact, and improving responsiveness
We will focus on defined and rehearsed response actions for this post.
Why defined and rehearsed?
- Defining actions means thinking about them and documenting them
- Rehearsing actions means practicing them and learning from that process
- Responding to emergencies is one of the five types of work an engineering team should be expected to do on a regular basis
Defined and rehearsed reminds me of a piece of advice I received over 30 years ago, advice that sticks in my mind because it is both wry and salient. I was attending the US Army’s Airborne School during a hot and humid summer. The school was three weeks long. The first two weeks included hundreds of repetitions of what you need to do to successfully complete a static line parachute jump. The actions and the order in which they were done became muscle memory. As we nervously awaited our first jump – intentionally jumping out of a perfectly good airplane – one of the senior instructors said,
“If you encounter a main chute malfunction, do not panic, you have the rest of your Airborne life to deploy your reserve chute.”
Gulp! The senior instructor was quite right. Trying to recall a PowerPoint slide on what to do after you have jumped out of the door and begun a 9-second journey to the ground is not a recipe for success.
A customer-impacting crisis is not the time for a clean-sheet brainstorming exercise.
How many crisis situations should be prepared for and rehearsed? Each one consumes engineering resources, so it would be best to restrain those with overactive imaginations.
Sample Crisis and Rehearsal Frequency
| Crisis | Impact | Probability of Occurrence | Rehearsal Frequency | Note |
|---|---|---|---|---|
| Zombie apocalypse | Total | None | None | An entertaining distraction for 10 minutes |
| Single server failure | Low | Guaranteed | None | If a single server failure is a crisis, focus on your resume |
| Data center/cloud site failure | High | Medium | Annually | Strive to run production workload from DR or alternate site |
| P1 production bug | High | High | Quarterly | Isolate, rollback, restore |
| DB data corruption | Medium | Low | Annually | Restore from clean backup |
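
A rehearsal schedule like the one in the table is easy to let slip, so it can help to track it in code. The following is a minimal Python sketch, not a prescribed tool: the interval values are an assumed translation of "Quarterly" and "Annually" into days, and the names `REHEARSAL_INTERVAL_DAYS` and `overdue_rehearsals` are hypothetical.

```python
from datetime import date

# Rehearsal intervals in days, mirroring the sample table.
# None means the scenario is deliberately not rehearsed.
# The day counts (91, 365) are assumptions standing in for "Quarterly" and "Annually".
REHEARSAL_INTERVAL_DAYS = {
    "Zombie apocalypse": None,
    "Single server failure": None,
    "Data center/cloud site failure": 365,
    "P1 production bug": 91,
    "DB data corruption": 365,
}


def overdue_rehearsals(last_rehearsed, today):
    """Return the scenarios whose rehearsal is overdue or has never happened.

    last_rehearsed maps a scenario name to the date of its last rehearsal.
    """
    overdue = []
    for crisis, interval in REHEARSAL_INTERVAL_DAYS.items():
        if interval is None:
            continue  # not worth rehearsing
        last = last_rehearsed.get(crisis)
        if last is None or (today - last).days > interval:
            overdue.append(crisis)
    return overdue
```

Calling `overdue_rehearsals` with the dates of your last rehearsals and today's date returns the scenarios that need attention, in table order.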
Remember, if it’s worth doing, it’s worth measuring. Did your rehearsal results meet your RTO (recovery time objective) and RPO (recovery point objective)? Conducting a retrospective would be productive as well.
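
As one way to make that measurement concrete, here is a small Python sketch that compares the timestamps recorded during a rehearsal against RTO and RPO targets. The class, field, and function names are hypothetical, and it assumes you capture three timestamps per rehearsal: when the simulated failure began, when service was restored, and the time of the newest data that survived recovery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RehearsalResult:
    """Timestamps captured during one rehearsal (field names are hypothetical)."""
    outage_start: datetime      # when the simulated failure began
    service_restored: datetime  # when the service was usable again
    last_good_data: datetime    # newest data point that survived recovery


def evaluate(result, rto, rpo):
    """Compare measured recovery time and data-loss window to the targets."""
    recovery_time = result.service_restored - result.outage_start
    data_loss_window = result.outage_start - result.last_good_data
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "met_rto": recovery_time <= rto,
        "met_rpo": data_loss_window <= rpo,
    }


# Example with made-up numbers: 45 minutes to recover, 10 minutes of data lost.
example = RehearsalResult(
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 45),
    last_good_data=datetime(2024, 1, 1, 11, 50),
)
report = evaluate(example, rto=timedelta(hours=1), rpo=timedelta(minutes=5))
```

In this made-up example the rehearsal meets a one-hour RTO but misses a five-minute RPO, which is exactly the kind of finding a retrospective should pick up.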
Professionals in many fields rehearse planned actions exhaustively. A Formula 1 pit crew, an NBA basketball team, the medical team in a trauma center – all define what is to be done, in which order, and by whom, and practice those actions repeatedly. Do your customers matter enough to you to do the same?
Are you suffering from recurring crises that erode customer satisfaction and sap the velocity of your development teams? Is the daily tactical grind distracting you from strategic planning? Contact us. We’ve been in your shoes, having managed numerous technical crises and possibly even caused a few. We can evaluate your existing capabilities, conduct training sessions for your team, or even provide interim leadership to help you get where you need to go.