Your App Just Crashed – Will You Survive?

It’s rare, particularly in early-stage companies, to find any kind of disaster recovery plan. It’s a common misconception that disaster recovery (or DR for short) is what big companies do, and that smaller companies have no need for such bureaucratic nonsense. After all, nothing ever breaks. Right?

If I were to walk into your company today and ask you “What would happen if you lost your application server?” or “Is your database being backed up?”, would you know the answer definitively?

If you’re squirming right now, that’s OK. As my high school math teacher used to say, “There’s hope if you know.”

Let’s create a simple DR plan that will help you sleep a little better tonight.

A Simple DR Plan

  • Keep documentation up to date – As I talked about in a previous post, knowing where everything is and how to log in to all of your systems is a simple, but often overlooked, first step. In the fog of an outage, you’ll be surprised how simple things can cause massive delays.
  • Back up key databases – A simple rule of thumb for how often you back up is: How much business can you afford to lose? If you can afford to wind the clock back a day, back up once a day. If you can’t afford any data loss, you have to put real-time replication in place. All of the major databases support both regular backups and real-time replication.
  • Make offsite backup copies – If you only store your backups on the same machine where your app is running, and you lose that machine, you’ll probably be out of business. Take the additional step of moving your backups to a different machine in addition to the local copy.
  • Test your backups – Having lived through the terrifying moment of a backup being invalid or incomplete on numerous occasions, I strongly recommend testing your backups at least once a month.
  • Create a production replica – Given how cheap virtual machines are now, there’s no reason not to have a standby environment that looks exactly like your production environment ready to go. If anything happens on your production instance, you can quickly cut over to the replica, restore any data, repoint your DNS and be back up and running.
  • Have a communication strategy – Think through how you’ll communicate with your customers / users if you have an outage. By now everyone understands that outages occur. Where you run into trouble is when customers don’t know what’s happening. Have a process written down for how you’ll reach out to your users both during and after any event. Services like StatusPage.io become invaluable during unplanned outages.
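
To make the backup steps above a little more concrete, here is a minimal Python sketch of copying a backup to a second location, confirming the copy matches the original, and checking that the backup is recent enough. The function names, the `max_age_hours` parameter and the idea that the offsite location is reachable as a directory (for example a mounted volume on another machine) are all assumptions made for illustration, not a prescribed setup.

```python
import hashlib
import shutil
import time
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_backup_offsite(backup: Path, offsite_dir: Path) -> Path:
    """Copy a backup to a second location and verify the copy by checksum.

    Assumption: offsite_dir is a path on a different machine that is
    reachable from here, such as a mounted remote volume.
    """
    offsite_dir.mkdir(parents=True, exist_ok=True)
    target = offsite_dir / backup.name
    shutil.copy2(backup, target)
    if sha256_of(backup) != sha256_of(target):
        raise IOError(f"Checksum mismatch after copying {backup} to {target}")
    return target


def is_fresh(backup: Path, max_age_hours: float, now: Optional[float] = None) -> bool:
    """Return True if the backup file was modified within the allowed window."""
    current = time.time() if now is None else now
    age_seconds = current - backup.stat().st_mtime
    return age_seconds <= max_age_hours * 3600
```

If you decided you can afford to wind the clock back a day, calling `is_fresh(backup, 24)` from a scheduled job is one simple way to notice when the nightly backup has quietly stopped running.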

There are numerous ways to accomplish the above tasks. For example, if you use Amazon RDS to host your databases, it takes care of most of the database items above for you.
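
If you manage backups yourself, the monthly backup test can be a small script. The sketch below uses SQLite purely because it ships with Python and keeps the example self-contained; that choice, the function name, and the `table` and `min_rows` parameters are assumptions for illustration. With a server database you would typically restore the dump into a scratch instance using that database's own tools and then run similar checks.

```python
import sqlite3
from pathlib import Path


def verify_sqlite_backup(backup: Path, table: str, min_rows: int = 1) -> bool:
    """Check that a SQLite backup file opens, passes SQLite's integrity
    check, and contains at least min_rows rows in a key table."""
    # Refuse missing files (sqlite3.connect would silently create one)
    # and table names that are not plain identifiers.
    if not backup.is_file() or not table.isidentifier():
        return False
    conn = sqlite3.connect(str(backup))
    try:
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            return False
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        return row_count >= min_rows
    except sqlite3.DatabaseError:
        # Raised for corrupt files, non-database files, and missing tables.
        return False
    finally:
        conn.close()
```

Checking a row count as well as file integrity matters because a backup can be a perfectly valid file and still be incomplete.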

Also, as your product infrastructure scales, DR gets both easier and more difficult. It’s easier because you will not have single points of failure. It’s more difficult because testing for possible failures is more complex. Of course, you’ll also have additional resources and expertise at that point to help solve the problem.

Just Remember

  • It’s your responsibility to have a DR plan in place. Don’t assume it’s being handled by your development team.
  • A simple DR plan, like the one above, takes a day or two to set up and maybe an hour a month to test. Seems like a small price to pay, doesn’t it?

Your Assignment

Meet with your technology team and get a firm understanding of your current DR status. Have your team walk you through how you’d recover from the most common scenarios. Identify any gaps and put a basic plan in place to ensure continuity during any unforeseen outage.

If you’re starting from scratch, use the above plan as an initial checklist for your team. You can always build on it over time as you learn, but spending a little time to think about it now could be the difference between survival and going out of business.