
Building an on-call rotation that doesn't burn out your team

On-call is the part of operations engineering most likely to make people quit. Here's a pragmatic playbook for rotation structure, escalation, and recovery time.

Why on-call burns people out

On-call is the most-cited reason engineers leave operationally heavy roles. The complaints are remarkably consistent across companies and seniority levels:

  • Alerts that aren't actionable.
  • Alerts during the night for things that could have waited until morning.
  • Same alert firing repeatedly without anyone fixing the root cause.
  • No handoff — you inherit a full alert queue from the previous on-call.
  • No recovery time after a brutal week.
  • No pay differential despite being on-call effectively 24/7.
  • Single-person rotations where you can never truly disconnect.

The fix isn't "on-call is just hard, deal with it." Each of these is a system design problem with a solution.

Picking a rotation cadence

The standard options:

Daily rotation

Sounds gentle — one day at a time. In practice, exhausting. You're always either preparing for on-call, on-call, or recovering. Context never builds because every incident hand-off is to someone else within hours.

Weekly rotation

The pragmatic default. One week is long enough to build context (you remember the alert that fired Monday when it fires again Friday) and short enough to recover. Handoff happens once a week with shared context.

Bi-weekly or monthly

Common at larger companies but tends to destroy off-week context. By week 3 of "off-call" you've forgotten what the production system looks like. When you come back on-call, you're relearning.

Recommendation: weekly. It's the standard for a reason.

The handoff ritual

The transition from one week's on-call to the next is where context lives or dies. A good handoff is a 15–30 minute structured conversation, not a Slack message saying "you're up."

What to cover:

  • What incidents fired this week and how they were resolved.
  • Open issues or known weirdness still in flight.
  • Anything currently degraded or under elevated risk.
  • Scheduled maintenance, deploys, or external events in the coming week.
  • Any alert thresholds that were temporarily silenced and should be reviewed.

Document the handoff in writing (a simple shared doc or PR works). The written record helps the person who goes on-call the week after next and didn't see the verbal handoff.
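One way to keep the written record consistent is to generate an empty skeleton for the outgoing on-call to fill in. A minimal sketch — the section names mirror the checklist above; the function and naming are invented for illustration, not a required format:

```python
from datetime import date

# Sections mirror the handoff checklist from the text above.
HANDOFF_SECTIONS = [
    "Incidents this week and how they were resolved",
    "Open issues / known weirdness still in flight",
    "Currently degraded or under elevated risk",
    "Scheduled maintenance, deploys, external events next week",
    "Temporarily silenced alerts to review",
]

def handoff_skeleton(outgoing: str, incoming: str, when: date) -> str:
    """Render an empty handoff doc for the outgoing on-call to fill in."""
    lines = [f"# On-call handoff {when.isoformat()}: {outgoing} -> {incoming}", ""]
    for section in HANDOFF_SECTIONS:
        lines += [f"## {section}", "- (none)", ""]
    return "\n".join(lines)

print(handoff_skeleton("alice", "bob", date(2024, 3, 4)))
```

Drop the output into the shared doc or open it as a PR; either way, an empty "- (none)" under each heading is an explicit claim, not an omission.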

Escalation policy design

Two questions determine the entire escalation policy:

  1. If the primary doesn't ack within X minutes, who gets it next?
  2. What's the maximum chain length before someone definitely answers?

A reasonable default for a small team:

  • Page primary on-call; wait 5 minutes for an ack.
  • If unack'd, page secondary on-call; wait 5 minutes.
  • If unack'd, page the entire team Slack channel + tech lead; wait 15 minutes.
  • If still unack'd, page CTO/founder.

The 5-minute first delay is important: it lets primary actually look at their phone, walk to a computer, and ack. Shorter and you're paging secondary while primary is putting on pants.
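The chain above is just data: a list of targets with wait times. A sketch of how a pager would walk it — the target names and delays mirror the example policy, and everything here is illustrative rather than any particular tool's API:

```python
# Escalation chain as data: (who to page, minutes to wait for an ack
# before escalating further). Mirrors the example policy above.
ESCALATION_CHAIN = [
    ("primary-oncall", 5),
    ("secondary-oncall", 5),
    ("team-channel+tech-lead", 15),
    ("cto-founder", 0),  # last resort; nothing further to escalate to
]

def pages_sent(minutes_unacked: float) -> list[str]:
    """Everyone who has been paged after an alert sat unacked this long."""
    paged, elapsed = [], 0
    for target, wait in ESCALATION_CHAIN:
        paged.append(target)
        elapsed += wait
        if minutes_unacked < elapsed:
            break  # someone still has time to ack before the next hop
    return paged

print(pages_sent(0))   # primary only
print(pages_sent(7))   # primary + secondary
print(pages_sent(30))  # the whole chain
```

Keeping the chain as data (rather than buried in tool configuration screens) also makes it trivial to publish in the wiki, which is the point of the next paragraph.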

Document the escalation policy publicly in your wiki. Surprise escalations breed resentment.

The pay question

This is contentious but the data is clear: explicit compensation for on-call duty significantly improves retention.

Options:

  • Weekly on-call stipend. $100–500 per week of on-call. Paid regardless of incident count.
  • Per-incident pay. 1.5x or 2x normal hourly rate for time worked outside business hours.
  • Comp time. A day off for a "rough" on-call week.
  • Equity adjustment. Higher equity for roles with on-call responsibility.

The best pattern combines an explicit stipend (signals "we value this time") with comp time after rough weeks (signals "we know it sucked"). The total dollar amount matters less than the explicit acknowledgement.

Recovery time after rough weeks

If on-call had real overnight incidents, the next day or two off shouldn't require justification. Build this into policy:

  • Any incident that involved > 1 hour of overnight work: comp time the following day.
  • A week with multiple overnight incidents: half-day off the following week, no questions asked.
  • Any incident that pulled someone off vacation: their next on-call rotation gets swapped to someone else as compensation.

The rule of thumb: people should not be net-worse-off after a tough on-call week. If they are, you're creating a system that selects for "people who tolerate bad treatment."

The weekly alert review

The single most effective practice for keeping on-call sustainable: a 30-minute weekly review of every alert that fired.

For each alert:

  1. Was it a real incident?
  2. If yes: what's the root cause? What's the fix to prevent recurrence?
  3. If no: why did it fire? Tune the threshold, add multi-region confirmation, or delete the alert entirely.
  4. Did it fire at a reasonable time? If at 3 AM for something that wasn't time-sensitive, downgrade priority.
  5. Was the runbook helpful? Update it.

This is mostly about preventing the slow drift toward alert noise. Without a forcing function, alerts only ever get added — never tuned or removed.
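The review loop can start from nothing fancier than last week's alert log. A sketch that buckets each firing by the questions above — the field names are invented, not from any particular paging tool:

```python
from collections import Counter

# One row per alert firing; in practice this comes from your paging
# tool's weekly export. All field names here are invented.
alerts = [
    {"name": "disk-full",   "real": True,  "hour": 3,  "time_sensitive": True},
    {"name": "cert-expiry", "real": True,  "hour": 3,  "time_sensitive": False},
    {"name": "blip-5xx",    "real": False, "hour": 14, "time_sensitive": True},
    {"name": "blip-5xx",    "real": False, "hour": 15, "time_sensitive": True},
]

def triage(alerts: list[dict]) -> Counter:
    """Bucket each firing into the review outcome it suggests."""
    actions = Counter()
    for a in alerts:
        if not a["real"]:
            actions["tune or delete"] += 1                # question 3
        elif a["hour"] < 7 and not a["time_sensitive"]:
            actions["downgrade to business hours"] += 1   # question 4
        else:
            actions["find root cause and fix"] += 1       # question 2
    return actions

print(triage(alerts))
```

Run over a real week, the counts themselves are the agenda: a large "tune or delete" bucket is the alert-noise drift the review exists to catch.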

Team-size realities

Different team sizes have different on-call realities, and pretending otherwise is dishonest:

1–2 people

You're always on. There's no rotation. Mitigate by minimizing alert volume aggressively and being transparent in hiring that on-call is part of the job. Don't pretend otherwise.

3–5 people

Real rotation is possible but rough. One week on, two to four weeks off. Pay explicitly. Have an off-rotation backup ready to swap.

6–10 people

The sustainable zone for most teams. Weekly rotation with a primary and a secondary; each person is primary only every six to ten weeks. Manageable.

10+ people

You can split into product or service teams with separate rotations. Avoid having one team "cover everything" — on-callers can't reasonably know all the systems.

The larger insight: on-call isn\'t something you "scale through" by adding people. You scale through reducing alert volume, improving runbooks, automating recovery, and treating the on-call role with the respect (and pay) it deserves.

Frequently asked questions

Should we do follow-the-sun on-call?

Only if you have engineers in actually different timezones. Faking it (one team, "covering" all hours) just spreads pain further. Real follow-the-sun requires staffing in multiple regions.

How much should we pay for on-call?

Industry rates vary, but $100–500 per week of on-call duty is a common range, plus 1.5x or 2x rate for actual incidents worked outside hours. Even a small explicit stipend signals that you value the time, which matters more than the dollar amount.

What if our team is too small for a rotation?

Below 3 people, "rotation" is fictional — you're always on. Mitigate by reducing alert volume aggressively, paying explicitly for the burden, and hiring or promoting toward a real rotation as a priority.

Should the same person be primary every week?

No — rotate everyone (including managers). One-person primary creates a single point of failure and burns them out. Engineers who don't do on-call lose context for what production looks like.

How do we handle vacation overlapping with on-call?

Build the rotation schedule a quarter ahead so swaps can happen with notice. Allow no-questions swaps. The system should be flexible — what matters is that someone is always covering.
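Building the quarter ahead is a one-liner with a round-robin; swaps are then just edits to the resulting list, made with weeks of notice. A sketch (names and the 13-week quarter are placeholders):

```python
from datetime import date, timedelta
from itertools import cycle

def quarterly_rotation(team: list[str], start: date, weeks: int = 13) -> list[tuple[date, str]]:
    """Assign one primary per week, round-robin, for a quarter ahead."""
    people = cycle(team)
    return [(start + timedelta(weeks=w), next(people)) for w in range(weeks)]

schedule = quarterly_rotation(["alice", "bob", "carol", "dan"], date(2024, 1, 1))
for week_start, primary in schedule[:5]:
    print(week_start, primary)  # alice, bob, carol, dan, alice again
```

Publishing this list a quarter out is what makes "no-questions swaps" workable: a vacation conflict in week 9 is visible in week 1, while everyone still has room to trade.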
