The silent failure problem
Of all the things that can break in production, scheduled jobs are uniquely bad at telling you they failed.
A web service that's down generates failed user requests, support tickets, and obvious metric blips. Customers notice within minutes. A scheduled job that doesn't run produces nothing — no traffic, no errors, no obvious symptoms. The only thing missing is the result, and the result was something nobody was watching for.
Common silent failures:
- Nightly backups that didn't run for two weeks. Discovered when someone needed to restore.
- Customer reports that should have been emailed daily; users complain after a week of no reports.
- Queue processors that died; tasks pile up but the metric nobody set up doesn't fire.
- Sync jobs between systems; data drifts silently until someone notices.
- Certificate renewal automation that hasn't run; cert expires (related: our cert post).
How heartbeat monitoring works
Heartbeat monitoring inverts the normal monitoring pattern. Instead of the monitoring tool checking your service from the outside, your service pings the monitoring tool when it does its job.
The mechanic:
- You configure a heartbeat monitor in your tool, set to expect a ping every (say) 24 hours, with a grace period of (say) 1 hour.
- Your job, when it completes successfully, sends an HTTP request to a URL the tool gives you.
- The tool resets its countdown.
- If the countdown ever hits zero (no ping received in the expected window), the tool fires an alert.
It's sometimes called "dead man's switch" monitoring — the alert fires when the heartbeat stops.
What jobs deserve heartbeat monitoring
The simple test: does anything bad happen if this job doesn\'t run? If yes, monitor it.
Common candidates:
- Database backups (the textbook example).
- Data syncs between systems (CRM ↔ warehouse, billing ↔ accounting).
- Scheduled email sends (digests, reports, billing notifications).
- Cleanup jobs that prevent disk-full conditions.
- Certificate renewal automation.
- Index rebuilds, cache warmers.
- Recurring billing cycles.
- Compliance / audit log archival.
Skip: ad-hoc cleanup of temp files, debug noise generators, anything where "didn't run" is harmless.
Implementing heartbeats: the simplest version
The implementation is trivial. Most monitoring tools give you a unique heartbeat URL; your job hits it on success.
Cron
0 3 * * * /usr/local/bin/backup.sh && curl -fsS https://anyping.com/heartbeat/abc123 > /dev/null
The && ensures the heartbeat only fires if the backup succeeds. The > /dev/null stops cron from emailing the heartbeat response.
Bash script with start + complete pings
#!/usr/bin/env bash
HEARTBEAT_URL="https://anyping.com/heartbeat/abc123"

# Tell monitoring we started
curl -fsS "$HEARTBEAT_URL/start" > /dev/null

if /usr/local/bin/backup.sh; then
    curl -fsS "$HEARTBEAT_URL" > /dev/null
else
    curl -fsS "$HEARTBEAT_URL/fail" > /dev/null
    exit 1
fi
Application code (Python example)
import requests

def daily_report_job():
    try:
        send_daily_reports()
        requests.get("https://anyping.com/heartbeat/abc123", timeout=5)
    except Exception:
        requests.get("https://anyping.com/heartbeat/abc123/fail", timeout=5)
        raise
Kubernetes CronJob
Add the heartbeat curl as the last command in the container, or as a sidecar that fires after the main container exits successfully.
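A sketch of the first pattern as a CronJob manifest (the image name, schedule, and secret are all placeholders, not a tested configuration):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: example.com/backup:latest   # placeholder image
              command:
                - /bin/sh
                - -c
                # Heartbeat fires only if the backup exits 0.
                - /usr/local/bin/backup.sh && curl -fsS "$HEARTBEAT_URL" > /dev/null
              env:
                - name: HEARTBEAT_URL
                  valueFrom:
                    secretKeyRef:          # keeps the URL out of the manifest
                      name: heartbeat-urls
                      key: nightly-backup
```

Pulling the URL from a Secret also anticipates the "don't commit heartbeat URLs" pitfall covered below.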
Handling real-world noise
Real cron jobs don't run on perfect schedules. Things to plan for:
Jitter in execution time
A "1am" cron job might run at 1:00:30 one night and 1:01:45 the next. Set the grace window 10–20% larger than the longest expected runtime so normal variation doesn\'t trip the alert.
DST transitions
Cron jobs scheduled in local time can skip or double-fire across DST transitions. Either schedule in UTC or accept that twice a year you'll have an oddity.
Network blips on the heartbeat send
The heartbeat ping itself can fail due to network issues even when the job ran fine. Most monitoring tools let you tolerate missed pings before alerting; allowing one or two misses rather than alerting on the first avoids false alarms, at the cost of slightly slower detection.
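Another way to absorb transient failures is to retry the ping on the client side before giving up. A best-effort sketch using only the standard library (retry counts and backoff are illustrative choices):

```python
import time
import urllib.request

def ping_heartbeat(url: str, attempts: int = 3, backoff_s: float = 2.0,
                   timeout: float = 5.0) -> bool:
    """Best-effort heartbeat ping with retries. Never raises, so a
    monitoring hiccup can't fail an otherwise-successful job."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                return True
        except OSError:  # covers URLError, timeouts, connection refused
            if attempt + 1 < attempts:
                time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    return False
```

Call it after the job succeeds; a False return is worth logging, but it shouldn't change the job's exit code.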
Job runtime growing over time
A backup that took 45 minutes when you set up monitoring may take 2 hours after a year of data growth. Set up a separate alert for when actual runtime approaches the grace window, so you can widen it before the heartbeat alert starts firing falsely.
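A simple guard is to time each run and warn as it creeps toward the grace window. A sketch, where the 80% threshold and the grace value are arbitrary choices:

```python
import time

def run_with_runtime_check(job, grace_window_s: float, warn_fraction: float = 0.8) -> float:
    """Run `job`, return elapsed seconds, and warn when the runtime
    approaches the heartbeat's grace window."""
    start = time.monotonic()
    job()
    elapsed = time.monotonic() - start
    if elapsed > warn_fraction * grace_window_s:
        print(f"warning: runtime {elapsed:.0f}s is at "
              f"{elapsed / grace_window_s:.0%} of the {grace_window_s:.0f}s grace window")
    return elapsed

# Example: a fast job against a 1h grace window stays quiet.
elapsed = run_with_runtime_check(lambda: time.sleep(0.1), grace_window_s=3600)
```

In practice you'd send the warning to your alerting system rather than stdout; the point is that the warning fires well before the heartbeat alert would.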
Common pitfalls
Sending the heartbeat unconditionally
Bad: backup.sh; curl heartbeat. The semicolon means the heartbeat fires whether the backup succeeded or not, so you'll never know about backup failures. Use &&.
Heartbeat URL hardcoded in source control
Treat the heartbeat URL like a secret. Anyone with the URL can ping your monitor and silence alerts. Pass it in via an environment variable instead of committing it to git.
Same heartbeat URL for multiple jobs
Each job needs its own heartbeat. Sharing means one job running covers for another that's broken. The whole point is independent verification per job.
No alert routing for heartbeat failures
"Backup didn\'t run" should page someone on the data team, not the general on-call. Set up routing so heartbeat alerts go to the team that owns the job, not to whoever happens to be on rotation.
Forgetting to monitor the cron daemon itself
If cron itself dies, all your heartbeats fail simultaneously. Worth a separate (rough) check that the cron process is running — or even simpler, have a "cron is alive" job that runs every 5 minutes and pings a heartbeat.
Heartbeat monitoring is one of those patterns that costs almost nothing to implement and prevents a class of incidents that nothing else catches. Set it up for every important scheduled job; thank yourself later.