Throw a Disaster Party
For the next sprint, run a disaster recovery test. Of course, you have a plan for that! Having a plan for a disaster is nice, but remember the old saying that backups are irrelevant if you can’t restore them. So test your recovery plans as well.
Why run disaster recovery tests, you might ask? Unfortunately, it’s not a matter of whether recovery is needed, but of when and how much recovery you need to do. When will your complete system be out and in need of recovery? The big impacts of the CrowdStrike, SolarWinds and US energy sector outages should be examples enough. Besides that, there are plenty of lessons to learn from such an exercise: lessons that can help your onboarding activities and improve your delivery cadence. The better you are at disaster recovery, the more mature and maintainable your system is.
Before a disaster recovery exercise, try to establish how long you can be out of business before it hurts. What does the company's cash flow from customers look like? There are at least two viewpoints here: how long can your product or solution be unavailable to your customers? And how long can your (development) team be without a platform to build from? In the cyber security vocabulary, this would be the difference between disaster recovery and business continuity. But enough semantics.
I once did a project for customer A, switching their whole IT platform (infrastructure and application) from one data centre to another. While the final system cutover had to fit a 12-hour window, another angle was equally important. As we were switching all development, testing and staging environments, the real challenge was to design a cadence of shifts that would fit the agile sprint calendar.
Have Plans for All Your Systems
You might assume that disaster recovery is relevant only for custom-built systems running on VMs in a data centre. Unfortunately, disasters don't care about the technical solution, and no platform is really safe. Each technical platform has its trade-offs. You have to consider disasters and unavailability for all the systems you build: native apps, web applications and even solutions built on cloud and "as-a-service" components. Besides the "out of order" time frame for your business discussed above, every platform has things to consider from a recovery perspective:
- Do you have a fallback option if your build and deploy environment is out?
- If your hosting vendor has been hit by ransomware, do you have off-site ways to rebuild your website?
- Do you have contingency plans if the APIs you depend upon are out for a longer time? Do you have offline versions of the APIs for postal codes or even financial gateways?
- Do you have a log of all the customizations to your Salesforce solution, so that you can rebuild from a vanilla instance?
- If Slack is out (or wherever you communicate and collaborate), how do you communicate? And how easy is the switch? Perhaps practise meeting on another communications platform as if the primary were out.
Make a Runbook
To throw the best disaster party, have a plan. As with any other party, it consists of three parts:
- Setting up the party: Decide which systems to take out, take appropriate backups, find the licence keys and change logs.
- Having the party: Make it happen - take out the server and shut down the database. Switch to the secondary instance and delete PROD!
- Recovering from the party: Recover safely and roll in the backup. Restore the world and take two aspirin before you load the real data in again.
Throughout your party, write down the timestamp of each event. Then compare both the time to recover the platform (infrastructure and applications) and the time to load and verify the data. You will be interested in two metrics: the time to recover the system, measured against your recovery time objective (RTO), and the data loss, measured against your recovery point objective (RPO), as I covered in Aligning with Management. You need the timestamps to confirm your assumption from above: can your system actually recover in the time necessary for your business purposes? And what do you do in the meantime?
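To make the timestamp bookkeeping concrete, here is a minimal Python sketch of a drill log. It is only an illustration: the event names (`last_good_backup`, `outage_start`, `platform_restored`, `data_verified`), the `DrillLog` class, the `evaluate` function, the sample timestamps and the 4-hour and 24-hour targets are all made up for this example, so replace them with the events and targets from your own runbook.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class DrillLog:
    """Timestamps written down during a disaster recovery exercise."""

    events: Dict[str, datetime] = field(default_factory=dict)

    def mark(self, name: str, when: Optional[datetime] = None) -> None:
        # Record an event; defaults to "now" when used during a live drill.
        self.events[name] = when or datetime.now()

    def elapsed(self, start: str, end: str) -> timedelta:
        # Time between two recorded events.
        return self.events[end] - self.events[start]


def evaluate(log: DrillLog, rto: timedelta, rpo: timedelta) -> Dict[str, object]:
    """Compare measured downtime and data loss with the agreed targets."""
    downtime = log.elapsed("outage_start", "data_verified")
    data_loss = log.elapsed("last_good_backup", "outage_start")
    return {
        # Time to recover the platform (infrastructure and applications).
        "platform_time": log.elapsed("outage_start", "platform_restored"),
        # Time to load the data and verify it.
        "data_time": log.elapsed("platform_restored", "data_verified"),
        "downtime": downtime,
        # Everything written after the last usable backup is lost.
        "data_loss": data_loss,
        "meets_rto": downtime <= rto,
        "meets_rpo": data_loss <= rpo,
    }


# Hypothetical timestamps from one exercise (illustrative values only).
log = DrillLog()
log.mark("last_good_backup", datetime(2024, 1, 10, 2, 0))
log.mark("outage_start", datetime(2024, 1, 10, 9, 0))
log.mark("platform_restored", datetime(2024, 1, 10, 11, 30))
log.mark("data_verified", datetime(2024, 1, 10, 13, 15))

# Assumed targets: 4 hours of downtime, 24 hours of data loss.
result = evaluate(log, rto=timedelta(hours=4), rpo=timedelta(hours=24))
for key, value in result.items():
    print(f"{key}: {value}")
```

With these sample numbers the backup is recent enough, but the total downtime of 4 hours 15 minutes misses the assumed 4-hour target. Keeping the platform time and the data time separate shows which part of the recovery to shorten first.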