Pager Duty

If you’re building a product that users will expect to be up and working most if not all of the time (which you are if you’re building a software product), you’re going to need a plan for handling after-hours support of customers and the system.

Let’s talk about how you handle system support after regular business hours, and we’ll talk about customer-facing support in a separate post some other time.

Rotation

When you’re just getting started and maybe have one developer (either a co-founder or lead developer), the bad news for them is that they get all of the support rotations all of the time. They’re on call 24 hours a day, 7 days a week. There’s no other way to slice it.

Understanding this requirement is actually a key qualifying question as you’re interviewing early technology hires, particularly if they’re coming from larger companies where support is someone else’s job. Many people will opt out of the hiring process if they understand this requirement up front, and you’re not being fair if you don’t make it explicit at the beginning.

As the team grows, everyone should be a part of the support rotation, even if they’re more junior and can’t fix everything. Having to take tier one support calls and deal with them in the middle of the night is a part of living the startup life, and everyone needs to share in the suffering.

Even if all the person can do is research the issue and call someone else to help get it resolved, you’re building knowledge into the team.
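To make the rotation concrete, here is a minimal sketch of a weekly round-robin schedule with a fallback person to call. The team names, the start date, and the function names are all placeholders I made up for illustration; a real team would likely use a scheduling tool rather than a script.

```python
from datetime import date
from typing import Optional

# Hypothetical team, listed in rotation order; the names are placeholders.
TEAM = ["alice", "bob", "carol"]
# Assumed escalation contact for issues the on-call person can't fix alone.
ESCALATION = "alice"
# Arbitrary Monday used as the start of the first rotation week.
ROTATION_START = date(2024, 1, 1)


def on_call_for(day: date) -> str:
    """Return who is on call on the given day, rotating once per week."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAM[weeks_elapsed % len(TEAM)]


def escalation_for(day: date) -> Optional[str]:
    """Return who to call for help, or None if the on-call person
    is already the escalation contact."""
    return None if on_call_for(day) == ESCALATION else ESCALATION
```

With a team of one, `on_call_for` returns the same person every week, which is exactly the 24/7 situation described above.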

What if you’re working with a contractor? They share the same burden. They have to be willing to make someone available to you during off hours once your system goes into production. As with the interview questions, this should be a part of the screening process for hiring a contractor. Not willing to handle off-hours support? Not your contractor. Period.

Establishing the Procedures

Next, you’ll need to establish your after-hours support procedures, particularly if you’re dealing with a contractor. Here are some key questions you’ll want to answer:

  • What constitutes an “emergency” situation vs. an issue that can wait?
  • What’s the expected response time?
  • How is after-hours support compensated?
  • What equipment is necessary to effectively support your product?

I’m not going to get into the details of each of these here, but suffice it to say you need to set all of these expectations clearly, and then follow through on your part of the deal. You might need to supply a mobile hotspot, for instance, to ensure that there’s network coverage wherever someone might be.

My rules of thumb are:

  • Only wake someone up or bug them after hours when a critical system component is unusable
  • After-hours response time is no more than an hour (particularly given how connected everyone is now)
  • After-hours support should be compensated, either with a small bit of extra pay or perhaps an occasional half day off
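The first two rules of thumb above are easy to write down as code, which helps when you later wire them into an alerting tool. This is only a sketch: the list of critical components, the `Incident` fields, and the function names are my own assumptions, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed set of components considered critical; adjust for your product.
CRITICAL_COMPONENTS = {"api", "database", "checkout"}
# Response-time target taken from the rule of thumb above.
MAX_RESPONSE = timedelta(hours=1)


@dataclass
class Incident:
    component: str         # which part of the system is affected
    usable: bool           # can customers still use this component?
    reported_at: datetime  # when the issue was detected


def should_page(incident: Incident) -> bool:
    """Page after hours only when a critical component is unusable."""
    return incident.component in CRITICAL_COMPONENTS and not incident.usable


def respond_by(incident: Incident) -> datetime:
    """Latest acceptable time for the on-call person to respond."""
    return incident.reported_at + MAX_RESPONSE
```

Anything for which `should_page` returns False waits until regular business hours.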

Monitoring

If the tech team is doing their job and has alerts and monitoring set up correctly, they should know about an issue before you do. There are many tools for checking that your system is up and visible to the outside world, and for handling and triaging the errors that occur. Something completely unexpected can still happen (Amazon’s down!), but those cases are getting rarer.
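At its simplest, "is the system up and visible to the outside world" is an HTTP request made from somewhere outside your own network. Here is a minimal sketch using only Python's standard library; the function name `is_up` and the example URL are placeholders, and a hosted monitoring service will do far more than this.

```python
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

You would run something like `is_up("https://example.com/health")` on a schedule and alert the on-call person when it returns False several times in a row.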

You can spend as much time and money as you want on monitoring your system. You’ll need to do the cost-benefit analysis and figure out the right amount of monitoring and uptime measurement for your product or service.
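One input to that cost-benefit analysis is how much downtime a given uptime target actually permits. The arithmetic is simple; the targets below are just common examples I picked, not recommendations.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year permitted by an uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

A 99.9% target works out to roughly 8.76 hours a year, so each extra nine cuts your allowance by a factor of ten and usually costs a lot more to achieve.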

Making sure your system is stable and available for your customers is a part of the price of admission. Tolerance for downtime, particularly if you’re selling a critical service, is very low. People expect tech to work all the time.

Having a basic on-call system in place before you go into production will eliminate some of the chaos when things inevitably break.