Reducing alert fatigue: practical patterns

When everything is an alert, nothing is. Six patterns we've seen work for cutting alert volume by 60%+ without missing real incidents.

The fatigue cascade

The pattern is depressingly consistent across teams:

  1. Set up monitoring. Add alerts.
  2. Add more alerts whenever something breaks ("we should have caught that earlier!").
  3. Alerts start firing for non-incidents (network blips, transient failures, regional issues).
  4. People start ignoring alerts during normal work hours ("just check it later").
  5. Alerts at night get silenced more aggressively ("I can't deal with this right now").
  6. A real incident fires alongside three false alarms; people miss it.
  7. Postmortem says "we should have caught that earlier!"
  8. Add more alerts. Loop.

Breaking the cycle requires going against the natural instinct ("more visibility good") and being aggressive about deleting noise.

Pattern 1: Multi-region confirmation

Single biggest source of alert noise: regional connectivity issues being treated as outages.

The fix: require 2 or 3 regions to confirm a failure before declaring an incident. Most quality monitoring tools support this; turn it on for every monitor.

Expected impact: 60–90% reduction in false-positive alerts. (Covered in depth in our multi-region monitoring post.)
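The quorum logic itself is simple. A minimal sketch, assuming a 2-of-3 quorum and illustrative region names (not taken from any particular tool):

```python
# Sketch of multi-region confirmation: declare an incident only when a
# quorum of regions (here 2 of 3) independently observe the failure.
# Region names and the quorum size are illustrative defaults.

def confirmed_failure(region_results: dict[str, bool], quorum: int = 2) -> bool:
    """region_results maps region name -> True if the check FAILED there."""
    failing = [region for region, failed in region_results.items() if failed]
    return len(failing) >= quorum

# A blip seen from one region only: no incident.
print(confirmed_failure({"us-east": True, "eu-west": False, "ap-south": False}))  # False

# Two regions agree: confirmed outage.
print(confirmed_failure({"us-east": True, "eu-west": True, "ap-south": False}))  # True
```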

Pattern 2: Failure-count thresholds

Even with multi-region confirmation, single-check failures happen. Don't alert on the first failure — require 2 or 3 consecutive failures before alerting.

Trade-off: detection delay increases by (N-1) check intervals. With 30-second checks and a 2-failure threshold, your detection delay grows by ~30 seconds. With 5-minute checks, it grows by 5 minutes — which is why this pattern works much better with fast checks than slow ones.

The safe default: 2-of-3 confirmations + 2 consecutive failures. Eliminates almost all transient noise.
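The consecutive-failure rule is just a counter that any success resets. A minimal sketch (class and method names are illustrative):

```python
# Sketch of an N-consecutive-failures threshold: a monitor only alerts
# after `threshold` failed checks in a row; any success resets the count.

class FailureCounter:
    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True when it is time to alert."""
        if check_ok:
            self.consecutive = 0  # a single success wipes the streak
            return False
        self.consecutive += 1
        return self.consecutive >= self.threshold

counter = FailureCounter(threshold=2)
results = [counter.record(ok) for ok in [True, False, True, False, False]]
print(results)  # [False, False, False, False, True] -- alert on the 2nd failure in a row
```

Note that the lone failure in the middle never alerts: the success after it resets the counter, which is exactly the transient noise this pattern absorbs.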

Pattern 3: Incident bundling

When something major breaks, lots of monitors fail at once. Without bundling, you get 50 separate pages within seconds.

Good monitoring tools bundle related failures into a single incident. The first failure creates the incident; subsequent failures within a window join it rather than creating new ones.

Things to look for in your tool:

  • Time-window bundling (failures within N seconds become one incident).
  • Component-aware bundling (all "API" monitor failures group together).
  • Single notification per incident, not per monitor.
  • Single resolution notification when everything recovers.
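Time-window bundling can be sketched in a few lines, assuming failures arrive sorted by timestamp (the window size and names are illustrative):

```python
# Sketch of time-window bundling: failures arriving within `window`
# seconds of an open incident join it instead of opening a new one.
# Timestamps are plain floats (epoch seconds).

def bundle(failures: list[tuple[float, str]], window: float = 60.0) -> list[list[str]]:
    """failures: (timestamp, monitor_name) pairs sorted by timestamp.
    Returns one list of monitor names per incident."""
    incidents: list[list[str]] = []
    incident_start = None
    for ts, monitor in failures:
        if incident_start is None or ts - incident_start > window:
            incidents.append([monitor])    # open a new incident (one page)
            incident_start = ts
        else:
            incidents[-1].append(monitor)  # join the open incident (no page)
    return incidents

pages = bundle([(0, "api"), (2, "db"), (5, "cache"), (300, "ssl")])
print(len(pages))  # 2 incidents -> 2 pages instead of 4
```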

Pattern 4: Urgency tiering

Not every alert is urgent enough to wake someone up. Define explicit tiers:

Critical (page immediately, any hour)

  • Customer-facing transactional functionality is down (login, checkout, payments).
  • Data loss in progress.
  • Security incident.

High (page within business hours, Slack at night)

  • Marketing site or non-transactional features down.
  • Significant performance degradation.
  • Internal tools broken.

Medium (Slack, no page)

  • SSL cert expiry warnings.
  • Error rate elevated but not critical.
  • Cron heartbeat missed once (give it a chance to recover before paging).

Low (daily digest)

  • Disk usage trending up.
  • Slow responses in non-critical areas.
  • Things to look at, but not now.

The mistake most teams make is treating everything as "high" because it feels safer. The result is everyone ignoring everything because the signal-to-noise ratio collapses.
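The tier table above maps directly to a routing rule. A minimal sketch, where the channel names and business-hours window are illustrative assumptions:

```python
# Sketch of tier-based notification routing following the four tiers
# above. Channel names ("page", "slack", "digest") and the 9-18
# business-hours window are illustrative defaults.

TIER_CHANNELS = {
    "critical": {"day": "page",   "night": "page"},    # wake someone up, any hour
    "high":     {"day": "page",   "night": "slack"},   # page in business hours only
    "medium":   {"day": "slack",  "night": "slack"},   # never page
    "low":      {"day": "digest", "night": "digest"},  # daily digest
}

def channel_for(tier: str, hour: int) -> str:
    period = "day" if 9 <= hour < 18 else "night"
    return TIER_CHANNELS[tier][period]

print(channel_for("high", hour=14))  # page (business hours)
print(channel_for("high", hour=3))   # slack (don't wake anyone for this)
```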

Pattern 5: Quiet hours and routing rules

Per-recipient quiet hours: "Don't page me between 10pm and 7am unless it's critical." Most tools support this; configure it.

Per-monitor routing: the marketing site outage shouldn't page the same person as the payments outage. Different systems, different on-calls.

Per-time-of-day routing: night-time alerts go to a smaller "real on-call" group; daytime alerts go to whoever's in the office.

Don't set up alerting that pages the same person on every channel for every alert. That's the fastest path to ignoring all of them.
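The quiet-hours rule with a critical override is a one-function check. A sketch, assuming a 10pm–7am window and the tier names from Pattern 4:

```python
# Sketch of per-recipient quiet hours with a critical override:
# between quiet_start and quiet_end, only "critical" alerts page this
# person. The window and tier names are illustrative assumptions.

def should_page(tier: str, hour: int, quiet_start: int = 22, quiet_end: int = 7) -> bool:
    in_quiet_hours = hour >= quiet_start or hour < quiet_end  # window wraps midnight
    if in_quiet_hours:
        return tier == "critical"
    return tier in ("critical", "high")

print(should_page("high", hour=23))      # False: it can wait until morning
print(should_page("critical", hour=23))  # True: wake them up
```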

Pattern 6: The weekly audit

Once a week, 30 minutes, the on-call team (or whoever owns alerting) reviews every alert that fired:

  1. Real incident? Keep, write down root cause.
  2. False positive? Why did it fire? Add multi-region, raise threshold, narrow scope.
  3. Real but not actionable? Downgrade priority or delete.
  4. Real but the runbook didn't help? Update the runbook.
  5. Same alert fired multiple times? Bundle, deduplicate, or fix the underlying flapping.

Without this forcing function, alerts only ever get added — never tuned or removed. The audit is what keeps the system livable long-term.
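The audit is easier when the week's alerts are tallied up front. A minimal sketch of that tally, with an illustrative data shape and an assumed "3+ false fires per week" tuning cutoff:

```python
# Sketch of the weekly audit as data: tally last week's alerts by
# monitor and flag the noisiest ones for tuning. The alert records and
# the 3-false-fires cutoff are illustrative assumptions.

from collections import Counter

alerts = [
    {"monitor": "api-latency", "real": False},
    {"monitor": "api-latency", "real": False},
    {"monitor": "checkout",    "real": True},
    {"monitor": "api-latency", "real": False},
]

false_fires = Counter(a["monitor"] for a in alerts if not a["real"])

# Any monitor with 3+ false fires in a week is a candidate for
# multi-region confirmation, a higher threshold, or deletion.
candidates = [monitor for monitor, n in false_fires.items() if n >= 3]
print(candidates)  # ['api-latency']
```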

The honest conversation: when to delete monitors

The hardest part of alert hygiene is deleting alerts that "might catch something someday." It feels like reducing safety. It's actually the opposite.

An alert that fires 200 times a year and is real twice has a 99% false positive rate. The two real ones get lost in the noise. Deleting the alert and accepting that the two real cases will be caught some other way (customer report, downstream alert, manual check) is almost always net safer than keeping it.

Questions to ask before deleting:

  • How many times has this alert fired in the last 90 days?
  • How many of those fires were real incidents?
  • For the real ones, was this alert the only signal? Or did something else also fire?
  • What's the cost of missing the next real one vs the cost of N more false fires?

If the alert is providing <20% real signal and isn\'t the only detection path, delete it. Your on-call will thank you. Real incident detection will improve, not degrade, because the remaining alerts will get the attention they deserve.
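The delete-or-keep heuristic reduces to one ratio. A sketch using the 200-fires/2-real example from above and the 20% cutoff from the text:

```python
# Sketch of the delete-or-keep heuristic: compute the real-signal ratio
# over the last 90 days and flag monitors under the 20% cutoff named in
# the text. The function name is illustrative.

def real_signal_ratio(total_fires: int, real_incidents: int) -> float:
    return real_incidents / total_fires if total_fires else 1.0

ratio = real_signal_ratio(total_fires=200, real_incidents=2)
print(f"{ratio:.0%}")  # 1% real signal -> strong delete candidate (if not the only detection path)
```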

Frequently asked questions

What's an acceptable alert volume per week?

For a small team running a moderately busy SaaS, 0–5 actionable alerts per week is sustainable. Above 10/week and people start ignoring things. Above 20/week and alert fatigue is doing real harm to your team.

How do we get buy-in to reduce alerts?

Track the ratio: real incidents / total alerts. If real incidents are 20% or less of total alerts, you have at least 80% noise. Show this number to leadership; they'll prioritize fixing it.

What if removing an alert means missing a real incident?

This is the right concern, but it's usually unfounded. The alerts that fire 50 times a week and are never real are the ones to delete. The alerts that fire rarely but are always real are the ones to keep.

Should alerts be opt-in or opt-out by team member?

Defaults should be opt-in by team. Each team member opts into the categories they're responsible for. "Everyone gets every alert" creates the fastest path to mass ignore-mode.

How do we handle alerts that are genuinely informational, not actionable?

Don't use the alerting system for them. Pipe them to a Slack channel marked "informational" or to a daily digest email. Keep the alerting system reserved for things that require action.
