The fatigue cascade
The pattern is depressingly consistent across teams:
- Set up monitoring. Add alerts.
- Add more alerts whenever something breaks ("we should have caught that earlier!").
- Alerts start firing for non-incidents (network blips, transient failures, regional issues).
- People start ignoring alerts during normal work hours ("just check it later").
- Alerts at night get silenced more aggressively ("I can't deal with this right now").
- A real incident fires alongside three false alarms; people miss it.
- Postmortem says "we should have caught that earlier!"
- Add more alerts. Loop.
Breaking the cycle requires going against the natural instinct ("more visibility is good") and being aggressive about deleting noise.
Pattern 1: Multi-region confirmation
Single biggest source of alert noise: regional connectivity issues being treated as outages.
The fix: require 2 or 3 regions to confirm a failure before declaring an incident. Most quality monitoring tools support this; turn it on for every monitor.
Expected impact: 60–90% reduction in false-positive alerts. (Covered in depth in our multi-region monitoring post.)
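To make the rule concrete, here's a minimal sketch of the 2-of-3 check, assuming you can see per-region results; the names (RegionResult, required_confirmations) are illustrative, not any particular tool's API.

```python
# Sketch: declare an incident only when enough regions agree the check failed.
# RegionResult and required_confirmations are illustrative names, not a real API.
from dataclasses import dataclass

@dataclass
class RegionResult:
    region: str   # e.g. "us-east-1"
    ok: bool      # did the check pass from this region?

def confirmed_failure(results: list[RegionResult], required_confirmations: int = 2) -> bool:
    """True only if at least `required_confirmations` regions saw the failure."""
    failing = sum(1 for r in results if not r.ok)
    return failing >= required_confirmations

# A single-region blip stays quiet; two or more failing regions open an incident.
results = [
    RegionResult("us-east-1", ok=False),
    RegionResult("eu-west-1", ok=True),
    RegionResult("ap-south-1", ok=True),
]
print(confirmed_failure(results))  # False -> no alert for a one-region blip
```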
Pattern 2: Failure-count thresholds
Even with multi-region confirmation, single-check failures happen. Don't alert on the first failure — require 2 or 3 consecutive failures before alerting.
Trade-off: detection delay increases by (N-1) check intervals. With 30-second checks and a 2-failure threshold, your detection delay grows by ~30 seconds. With 5-minute checks, it grows by 5 minutes — which is why this pattern works much better with fast checks than slow ones.
The safe default: 2-of-3 confirmations + 2 consecutive failures. Eliminates almost all transient noise.
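For illustration, a rough sketch of the consecutive-failure counter behind that default; the class name and shape are made up, and in practice your monitoring tool tracks this for you.

```python
# Sketch: alert only after N consecutive failed checks, resetting on any success.
# ConsecutiveFailureGate is a hypothetical name, not a real library class.
class ConsecutiveFailureGate:
    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.streak = 0

    def record(self, check_passed: bool) -> bool:
        """Feed in each check result; returns True when it's time to alert."""
        if check_passed:
            self.streak = 0          # any success resets the streak
            return False
        self.streak += 1
        return self.streak >= self.threshold

gate = ConsecutiveFailureGate(threshold=2)
for passed in [True, False, True, False, False]:
    if gate.record(passed):
        print("alert: 2 consecutive failures")   # fires only on the final result
```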
Pattern 3: Incident bundling
When something major breaks, lots of monitors fail at once. Without bundling, you get 50 separate pages within seconds.
Good monitoring tools bundle related failures into a single incident. The first failure creates the incident; subsequent failures within a window join it rather than creating new ones.
Things to look for in your tool:
- Time-window bundling (failures within N seconds become one incident).
- Component-aware bundling (all "API" monitor failures group together).
- Single notification per incident, not per monitor.
- Single resolution notification when everything recovers.
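As a rough illustration of the time-window variant, here's a sketch; the window length, names, and notification call are assumptions, not a specific product's behaviour.

```python
# Sketch: group monitor failures arriving within a time window into one incident.
# The window length and all names here are illustrative.
import time

BUNDLE_WINDOW_SECONDS = 120
open_incident = None

def notify(message: str):
    print(message)

def bundle_failure(monitor_name: str, now=None):
    """Attach the failure to the open incident if it's recent; otherwise open a new one."""
    global open_incident
    now = now if now is not None else time.time()
    if open_incident and now - open_incident["started_at"] <= BUNDLE_WINDOW_SECONDS:
        open_incident["monitors"].add(monitor_name)    # joins quietly, no new page
        return open_incident
    open_incident = {"started_at": now, "monitors": {monitor_name}}
    notify(f"incident opened by {monitor_name}")       # one notification per incident
    return open_incident

bundle_failure("api-checkout")
bundle_failure("api-login")   # lands in the same incident, no second page
```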
Pattern 4: Urgency tiering
Not every alert is urgent enough to wake someone up. Define explicit tiers:
Critical (page immediately, any hour)
- Customer-facing transactional functionality is down (login, checkout, payments).
- Data loss in progress.
- Security incident.
High (page within business hours, Slack at night)
- Marketing site or non-transactional features down.
- Significant performance degradation.
- Internal tools broken.
Medium (Slack, no page)
- SSL cert expiry warnings.
- Error rate elevated but not critical.
- Cron heartbeat missed once (give it a chance to recover before paging).
Low (daily digest)
- Disk usage trending up.
- Slow responses in non-critical areas.
- Things to look at, but not now.
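One way to keep the tiers from drifting back into tribal knowledge is to write them down as data. A sketch mirroring the tiers above; the channel names, monitor names, and default are assumptions.

```python
# Sketch: urgency tiers as an explicit routing table instead of tribal knowledge.
# Channel names, monitor names, and the "medium" default are illustrative.
TIER_POLICY = {
    "critical": {"page": "always",         "channel": "pager"},
    "high":     {"page": "business_hours", "channel": "pager+slack"},
    "medium":   {"page": "never",          "channel": "slack"},
    "low":      {"page": "never",          "channel": "daily_digest"},
}

MONITOR_TIERS = {
    "checkout-flow":  "critical",
    "marketing-site": "high",
    "ssl-expiry":     "medium",
    "disk-usage":     "low",
}

def policy_for(monitor_name: str) -> dict:
    """Unknown monitors default to medium: visible, but nobody gets woken up."""
    return TIER_POLICY[MONITOR_TIERS.get(monitor_name, "medium")]

print(policy_for("checkout-flow"))  # {'page': 'always', 'channel': 'pager'}
```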
The mistake most teams make is treating everything as "high" because it feels safer. The result is that everyone ignores everything, because the signal-to-noise ratio collapses.
Pattern 5: Quiet hours and routing rules
Per-recipient quiet hours: "Don't page me between 10pm and 7am unless it's critical." Most tools support this; configure it.
Per-monitor routing: the marketing site outage shouldn't page the same person as the payments outage. Different systems, different on-calls.
Per-time-of-day routing: night-time alerts go to a smaller "real on-call" group; daytime alerts go to whoever's in the office.
Don't set up alerting that pages the same person on every channel for every alert. That's the fastest path to ignoring all of them.
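Put together, the routing decision is small enough to sketch; the teams, hours, and routing table below are placeholders for whatever your tool lets you configure.

```python
# Sketch: combine per-monitor ownership, urgency, and quiet hours into one decision.
# Team names, hours, and the routing table are illustrative placeholders.
from datetime import datetime

MONITOR_OWNERS = {"payments-api": "payments-oncall", "marketing-site": "web-team"}
QUIET_WINDOWS = (range(22, 24), range(0, 7))   # 10pm to 7am

def in_quiet_hours(hour: int) -> bool:
    return any(hour in window for window in QUIET_WINDOWS)

def route(monitor: str, tier: str, now: datetime) -> str:
    """Return the notification target for this alert at this time of day."""
    owner = MONITOR_OWNERS.get(monitor, "default-oncall")
    if tier == "critical":
        return owner                    # critical pages the owner at any hour
    if in_quiet_hours(now.hour):
        return "slack-only"             # non-critical alerts wait for morning
    return owner

print(route("marketing-site", "high", datetime(2024, 1, 10, 2, 30)))    # slack-only
print(route("payments-api", "critical", datetime(2024, 1, 10, 2, 30)))  # payments-oncall
```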
Pattern 6: The weekly audit
Once a week, 30 minutes, the on-call team (or whoever owns alerting) reviews every alert that fired:
- Real incident? Keep, write down root cause.
- False positive? Why did it fire? Add multi-region, raise threshold, narrow scope.
- Real but not actionable? Downgrade priority or delete.
- Real but the runbook didn't help? Update the runbook.
- Same alert fired multiple times? Bundle, deduplicate, or fix the underlying flapping.
Without this forcing function, alerts only ever get added — never tuned or removed. The audit is what keeps the system livable long-term.
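A lightweight way to prepare that half hour is to pull the week's alert history and rank alerts by noise. A sketch, assuming you can export fired alerts with a was-it-real flag; the field names are made up.

```python
# Sketch: summarise a week of alert history for the audit meeting.
# The record shape ("alert", "was_real_incident") is an assumed export format.
from collections import defaultdict

def audit_summary(alert_log: list) -> list:
    """Return (alert_name, times_fired, real_ratio), noisiest alerts first."""
    fired, real = defaultdict(int), defaultdict(int)
    for entry in alert_log:
        fired[entry["alert"]] += 1
        if entry["was_real_incident"]:
            real[entry["alert"]] += 1
    rows = [(name, fired[name], real[name] / fired[name]) for name in fired]
    return sorted(rows, key=lambda row: row[2])   # lowest real-signal ratio first

week = [
    {"alert": "api-latency", "was_real_incident": False},
    {"alert": "api-latency", "was_real_incident": False},
    {"alert": "checkout-down", "was_real_incident": True},
]
for name, count, ratio in audit_summary(week):
    print(f"{name}: fired {count}x, {ratio:.0%} real")
```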
The honest conversation: when to delete monitors
The hardest part of alert hygiene is deleting alerts that "might catch something someday." It feels like reducing safety. It's actually the opposite.
An alert that fires 200 times a year and is real twice has a 99% false positive rate. The two real ones get lost in the noise. Deleting the alert and accepting that the two real cases will be caught some other way (customer report, downstream alert, manual check) is almost always net safer than keeping it.
Questions to ask before deleting:
- How many times has this alert fired in the last 90 days?
- How many of those fires were real incidents?
- For the real ones, was this alert the only signal? Or did something else also fire?
- What's the cost of missing the next real one vs the cost of N more false fires?
If the alert is providing <20% real signal and isn't the only detection path, delete it. Your on-call will thank you. Real incident detection will improve, not degrade, because the remaining alerts will get the attention they deserve.
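That heuristic is easy to run against the same audit data. The 20% threshold comes from the paragraph above; the function and parameter names are illustrative.

```python
# Sketch: the "delete it" heuristic from this section, applied to audit numbers.
# The 0.20 threshold comes from the text; the function name is illustrative.
def should_delete(times_fired: int, real_incidents: int, only_detection_path: bool) -> bool:
    """Delete if real signal is under 20% and something else would also catch it."""
    if times_fired == 0 or only_detection_path:
        return False
    return (real_incidents / times_fired) < 0.20

# The 200-fires, 2-real example above: 1% real signal, other detection paths exist.
print(should_delete(times_fired=200, real_incidents=2, only_detection_path=False))  # True
```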