Why this still happens in 2026
Every operations engineer has at least one expired-cert story. It's a rite of passage. Surely with Let's Encrypt, automated renewal, and a decade of TLS being mainstream, this should be a solved problem. It is not.
Why expired certs still take sites down regularly:
- Renewal automation runs on a schedule that itself can fail (cron daemon dies, container gets cycled, IAM permissions change).
- Let's Encrypt rate limits silently throttle renewals when an incident sends retries into a loop.
- DNS-01 challenges break when DNS records get cleaned up by another team.
- Certs on intermediate infrastructure (load balancers, reverse proxies, internal CAs) get forgotten.
- Manually renewed certs from years-old vendor relationships get missed when the responsible person leaves.
- The cert renews fine but doesn't get deployed to all the places it needs to be.
Cert expiry is the kind of incident where everyone afterward says "we should have known." With monitoring, you do.
Why standard uptime checks miss it
Most monitoring tools check HTTP endpoints. By default, many of them ignore TLS errors — they're checking whether your application responds, not whether the connection itself is valid.
So your monitor happily reports "200 OK" while customers see this:
```
Your connection is not private
NET::ERR_CERT_DATE_INVALID
```
Customers can't click past it (without scary "advanced" steps). Your monitor doesn't know anything is wrong. You find out when the support tickets arrive.
The fix is configuring your monitor to either:
- Reject connections with invalid/expired TLS, or
- Use a dedicated SSL certificate check that inspects the cert directly.
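To see the difference concretely, here's a minimal sketch using Python's standard library. expired.badssl.com is a public test host that deliberately serves an expired cert; everything else is stdlib.

```python
import ssl
import urllib.error
import urllib.request

URL = "https://expired.badssl.com/"  # public test host with an expired cert

# The lax check, mimicking a monitor that ignores TLS errors: the expired
# cert sails through and the check reports success.
lax = ssl.create_default_context()
lax.check_hostname = False
lax.verify_mode = ssl.CERT_NONE
with urllib.request.urlopen(URL, context=lax, timeout=10) as resp:
    print("lax check:", resp.status)  # 200, despite the expired cert

# The strict check: the default context validates the chain, hostname, and
# validity dates, so the expired cert fails the handshake.
try:
    urllib.request.urlopen(URL, timeout=10)
except urllib.error.URLError as exc:
    print("strict check failed:", exc.reason)
```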
What real cert monitoring looks like
A proper SSL cert monitor doesn't just check whether the connection works — it inspects the certificate itself.
What it should look at:
- Expiry date. The "not after" timestamp on the cert.
- Issuer. Did the cert change CAs unexpectedly? Sometimes the only signal of a misconfigured renewal.
- Subject and SANs. Does the cert cover the hostname you're hitting?
- Chain validity. Is the cert signed by a trusted root and is the chain complete?
- OCSP status. Has the cert been revoked?
- Algorithm. Is it using deprecated crypto (e.g. SHA-1)?
Most teams just need expiry monitoring. Cert chain validation is a useful belt-and-suspenders for catching misconfigurations.
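As a rough illustration, the expiry, issuer, and SAN fields can all be read off a validated handshake with Python's standard library. inspect_cert below is a hypothetical helper, and note the stdlib alone won't get you OCSP or full algorithm checks:

```python
import socket
import ssl
from datetime import datetime, timezone

def inspect_cert(hostname, port=443):
    """Connect, validate the chain, and pull the fields worth alerting on."""
    ctx = ssl.create_default_context()  # verifies chain and hostname for us
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()  # dict form of the validated leaf cert
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return {
        "days_left": (not_after - datetime.now(timezone.utc)).days,
        "issuer": dict(pair[0] for pair in cert["issuer"]),
        "sans": [v for k, v in cert.get("subjectAltName", ()) if k == "DNS"],
    }

print(inspect_cert("example.com"))
```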
The 30/14/3 alert pattern
The standard cadence we recommend is three escalating alerts:
- 30 days out: low-priority. Email or Slack notice. "Heads up, you should plan to renew."
- 14 days out: medium-priority. Slack ping. "Why hasn't this renewed yet? Investigate."
- 3 days out: high-priority. SMS or PagerDuty. "This is going to break customer connections in 72 hours. Drop what you\'re doing."
The 30-day window is most important: it gives you time to investigate and fix without panic. The 3-day alert is the safety net for when you missed the earlier ones.
Some teams add a 1-day alert as a final escalation. We'd argue that if you still haven't acted one day out, another alert on the same channel won't help.
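If you're wiring the cadence up yourself, it reduces to a threshold table. A sketch, feeding in the days_left value from the hypothetical inspect_cert helper above:

```python
# Hypothetical mapping from days-to-expiry to the 30/14/3 escalation channels.
THRESHOLDS = [(3, "page"), (14, "slack"), (30, "email")]

def alert_channel(days_left):
    for limit, channel in THRESHOLDS:
        if days_left <= limit:
            return channel  # first (most urgent) threshold that applies
    return None  # more than 30 days out: stay quiet

assert alert_channel(29) == "email"
assert alert_channel(2) == "page"
```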
Common renewal-automation failures
Patterns we've seen break otherwise-working renewal pipelines:
The IAM permissions drift
A renewal job uses an IAM role to update Route53 records or write to S3. Six months later someone tightens the IAM policy and removes a permission. Renewal cron starts failing silently. Three months later: outage.
The container that was never restarted
Renewal succeeds. New cert is written to the right path. But the load balancer or web server is running in a container that loaded the cert at startup — it's still serving the old one until restarted. Outage when the old cert expires.
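One cheap guard against this: compare what the server is actually serving with what renewal wrote to disk. A sketch with a hypothetical cert path; if the file holds a full chain, compare only the leaf block:

```python
import ssl

# Fetch the PEM cert the server is serving right now (no validation needed;
# we only care whether it matches the file).
served = ssl.get_server_certificate(("example.com", 443))

# Hypothetical path: wherever your renewal job writes the new cert.
with open("/etc/ssl/certs/example.com.pem") as f:
    on_disk = f.read()

if served.strip() not in on_disk:
    print("served cert differs from on-disk cert; did anything reload after renewal?")
```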
The DNS record someone deleted
DNS-01 challenge needs a TXT record. A team cleaning up "stale" DNS records deletes it. Renewal fails. The error is buried in a log nobody reads.
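If your setup relies on a persistent challenge record (for example a delegation record another team might mistake for cruft), it's cheap to probe it directly. A sketch assuming the third-party dnspython package and a hypothetical domain:

```python
import dns.resolver  # pip install dnspython

try:
    answers = dns.resolver.resolve("_acme-challenge.example.com", "TXT")
    print([rdata.to_text() for rdata in answers])
except dns.resolver.NXDOMAIN:
    print("challenge record is gone: the next DNS-01 renewal will fail")
```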
The Let's Encrypt rate limit
You hit a transient deploy issue and cert renewal retries 50 times in an hour, tripping Let's Encrypt's rate limit. The next legitimate renewal attempt is throttled. Cert expires.
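The defensive pattern is to cap retries and back off instead of hammering the CA. A minimal sketch, where renew is a stand-in for whatever invokes your actual ACME client:

```python
import random
import time

def retry_renewal(renew, max_attempts=4, base_delay=900):
    """Cap attempts and back off exponentially so a transient failure
    can't burn through the CA's failed-validation rate limit."""
    for attempt in range(max_attempts):
        if renew():  # stand-in for the real ACME client call
            return True
        time.sleep(base_delay * 2**attempt + random.uniform(0, 60))
    return False
```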
The forgotten manual cert
One service uses a manually purchased EV cert from a year ago. The person who bought it left. Nobody renews it because nobody knows it's there.
Certs people forget to monitor
The certs most likely to bite you:
- Internal-only services (admin tools, dashboards) — hidden from customers, but expiry breaks employee workflows.
- Mail server certs (SMTP, IMAP). Mail clients show terrible errors when these expire.
- Custom domain certs on status pages. Status page is the worst place to have a cert error during an incident.
- Webhook receiver endpoints. Webhooks can fail silently if your cert breaks.
- API endpoints behind separate hostnames. Easy to forget if api.acme.com isn't in your main monitoring config.
- Marketing campaign landing pages on subdomains.
The pattern: anything with a public hostname that does TLS needs a cert monitor. Inventory all of them. Add monitoring for each.
A practical setup checklist
- Inventory every public hostname your business owns. Subdomains too.
- For each, add an SSL certificate check, separate from your HTTP uptime check (see the sketch after this checklist).
- Configure 30/14/3-day expiry alerts.
- Route 30-day alerts to email/Slack; 3-day alerts to SMS or pager.
- Quarterly: audit the inventory. New subdomains? New services? New marketing sites?
- Set up a calendar reminder to verify your renewal automation actually ran in the last 30 days.
- For wildcards, set 45-day alerts — the blast radius justifies the earlier warning.
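Put together, the whole checklist reduces to a loop over the inventory. A sketch with hypothetical hostnames, reusing the inspect_cert and alert_channel helpers from the earlier sketches:

```python
# Hypothetical inventory; in practice this list is the output of your audit.
INVENTORY = ["www.acme.com", "api.acme.com", "status.acme.com", "mail.acme.com"]

for hostname in INVENTORY:
    days = inspect_cert(hostname)["days_left"]
    channel = alert_channel(days)
    if channel:
        print(f"{hostname}: {days} days to expiry, alert via {channel}")
```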
This is one of the lowest-effort, highest-value monitoring patterns. The day it saves you from a public TLS error, it pays for the entire monitoring stack.