We had some downtime on a server at work over the holiday weekend, and as a result my boss is interested in having that never happen again. He asked about establishing, basically, an on-call system.
It sounds pretty straightforward – just have someone who will keep an eye on things. “Keep the lights on” is how my manager described it. Well, nothing is ever quite that simple, is it?
I’m of two minds on this.
First, obviously, in this day and age it is embarrassing at best to have your website down. I’m an internet software professional and I hate the thought of a down website. So, in that sense, yes, I’m all for having a way to keep it from happening or to resolve it quickly.
Second, however, I’m a person recovering from over a decade working in the news industry. In terms of on-call and after-hours responsibilities, I was taken advantage of in ways I am still coming to terms with. (That sounds melodramatic, I know – let me buy you a drink & tell you about it, then you tell me if it still sounds that way.) One of the reasons I work where I work is that I don’t have to be on call.
I had to deal with on-call issues first as an employee on call and later as an on-call manager who also had employees on call. A lot of questions came up over the years, along with a lot of intentional and unintentional abuses of the system.
I put down all the questions I could remember, and the list is staggering. It reminds me of Steve Yegge’s seminal blog post on complexity.
There are no right or wrong answers to most of these questions; that is what makes them so hard. That, and the fact that there are nearly 100 of them.
Without further ado:
Are weeknights included in the on-call schedule, or just weekend days?
What about weekend nights? Holidays?
During what hours are responses expected?
What is the required response time?
What happens if the response time is not met?
What happens if it’s not met repeatedly – is that a performance issue?
What should the caller do if the phone is not answered? Leave a voicemail?
How quickly should the caller expect a response to a voicemail?
If they’ve called and left a voicemail, are they allowed to call back? How many times?
What should happen if the caller doesn’t ever receive a response to a voicemail?
How long before it’s “ever”?
What should happen if the caller thinks the problem is not being resolved quickly enough?
How quickly is quickly enough?
What should the on-call person do if they can’t resolve the problem on their own?
Escalate? To who? Is there a secondary on-call person?
What if they can’t reach anyone? Are they allowed to keep trying ad infinitum?
How should an issue be reported – can someone call the on-call person, or only email them?
If so, who is allowed to call?
What do we do if someone who is not allowed to call does call?
If not, what do we do when someone does call?
Are callers allowed to attempt to use a home number to reach someone for support?
If not, how should that be handled?
Is only the website included in this on-call duty? If yes, which websites? All?
If no, what other systems are supported after hours? Who will troubleshoot those?
How should they be contacted?
Is the on-call person expected to deal with problems during the workday as well as during on-call hours?
How often is a person on call?
Does it rotate through the team, i.e. A B C D A B C D … ?
Or is the schedule set for a month or a quarter or a year at a time?
If the schedule is set, when will it be published?
Can scheduled on-call duties be changed or traded? If so, what’s the procedure for doing so?
What if head count changes (up, or down)?
What if a person’s on-call duty falls during a desired vacation?
If multiple vacations conflict with on-call, whose vacation request wins?
Is it first-come first-served, or by seniority?
What if the on-call person uses a sick day – are they still on call?
How will holidays be handled?
If scheduled in advance, how far out will the schedule be available?
If rotated, is it within the year or year over year?
What if the on-call person has no internet access to enable them to solve the issue?
Or are they required to stay at an internet connection, i.e. at home?
If allowed to travel, is the on-call person expected to have a laptop with them at all times?
How far away is considered “travel”?
Is the employee required to have a home internet connection?
Will the company pay for a portion of that connection?
If allowed to travel, what about a mobile internet connection?
Is the on-call person expected to use their personal cell phone for on-call-related communication?
If no, then what? A company phone or pager?
What is the procedure for handing off a company device to the next person?
If yes, then will the company pay for their plan? Or a portion of it?
What if on-call issues cause the person to incur overage charges with their carrier?
If yes, how does the new number get communicated to those who may need to call?
Is on-call duty something the employee gets paid for?
If yes, is that hourly, per incident, or per diem?
If hourly, is it only for the hours you’re actively solving a problem?
Are there different hourly rates for passive vs active on-call duty?
Can the on-call person flex time spent solving after-hours issues?
How should the time be accounted for?
What hours are “after hours”?
What day and time does on-call duty shift to the next person?
What if that day and time cannot work for some reason?
If not paid, is on-call duty something the company is allowed to require of existing employees?
If you’re not on call and you get a call anyway, do you get paid for it?
What if it’s a call escalated from the on-call person? Do you get paid? Do they?
If you’re on vacation and you get a call, do you get the day back?
Is the on-call person required to log calls? How? What do they need to record?
How soon are they required to record it?
Does the on-call person need to inform the rest of the team or department or their manager during an outage?
After how much time?
What happens if they don’t?
Does the on-call person need to send an update during an outage? Just one? Or how often?
Does the on-call person need to send a summary after an outage? To who? How soon afterward?
Does it need to be reviewed by management before it’s sent?