Active certificate monitoring/notifications

Let’s Encrypt ended support for certificate expiration emails about a month ago (see "Ending Support for Expiration Notification Emails - API Announcements - Let's Encrypt Community Support" and "Ending Support for Expiration Notification Emails - Let's Encrypt"), so the default method of notifying folks that certs are about to expire no longer works. There are third-party services (e.g., "Never miss an expiring certificate with Red Sift Certificates Lite") that can monitor this for you, but then of course you’re at the mercy of that third party.

NS8 knows what certificates it’s obtained, and it knows when they expire. At a minimum it should raise a warning in the /cluster-admin pages, and really should send out an email, if

  • Renewal is failing for some reason (and, of course, describe that reason), or
  • A cert is due to expire in under 30 days (which would likely result from the above).
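
To make the two conditions above concrete, here is a minimal sketch of the decision rule in bash. The function name `cert_alert`, the status strings, and the extra "expired" case are my own assumptions for illustration, not anything NS8 provides today.

```shell
#!/bin/bash
# Hypothetical helper: classify a certificate by the two conditions listed above.
#   $1 = days until expiry (may be negative)
#   $2 = "yes" if the last renewal attempt failed, anything else otherwise
WARN_DAYS=30   # threshold proposed in this post

cert_alert() {
  local days_left=$1 renewal_failed=${2:-no}
  if (( days_left < 0 )); then
    echo "expired"            # assumption: worth its own status
  elif [[ $renewal_failed == yes ]]; then
    echo "renewal-failing"    # should also carry the failure reason
  elif (( days_left < WARN_DAYS )); then
    echo "expiring-soon"
  else
    echo "ok"
  fi
}

cert_alert 12 no
```

Anything other than "ok" would be the trigger for the /cluster-admin warning and the email.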

Failure to implement this leads to threads like "Certificateupdate from an existing Certificate", where users have no idea there’s a problem until the cert expires.

@hucky read this. Could that be your situation?

I admit I’m assuming this, based on the statement in that thread that his cert expired a day ago, and this is the first time he’s raised the issue.

Yes, of course. I was just pointing it out for @hucky.

I didn’t notice it because I was on vacation :frowning:

I think it’s a good idea to get notified when certs are not working BEFORE the users are affected.
Traefik renews the certs automatically, so I don’t know if there’s a hook to catch a failure.
Here is a first draft of a script that checks whether the certs obtained by Traefik are valid and whether they expire in under 30 days:

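#!/bin/bash
# For every domain in Traefik's acme.json, fetch the certificate served on port 443
# and report whether it matches the host name and how many days it has left.
ACMEPATH=/home/traefik1/.config/state/acme/acme.json
for i in $(jq -r '.[] .Certificates | .[]? | .domain.main' "${ACMEPATH}" | sort -u); do
  # Connect once, sending the host name via SNI, and keep the served certificate.
  cert=$(openssl s_client -servername "${i}" -connect "${i}:443" </dev/null 2>/dev/null | openssl x509 2>/dev/null)
  if [[ -z $cert ]]; then
    echo "${i}: could not retrieve a certificate"
    continue
  fi
  cert_end=$(openssl x509 -noout -enddate <<<"$cert" | cut -d "=" -f 2)
  days_left=$(( ($(date -d "$cert_end" +%s) - $(date +%s)) / 86400 ))
  # The host name must appear somewhere in the certificate text.
  if openssl x509 -noout -text <<<"$cert" | grep -qF "${i}"; then
    if [[ $days_left -lt 0 ]]; then
      echo "${i} has EXPIRED. Days left: $days_left"
    elif [[ $days_left -lt 30 ]]; then
      echo "${i} is valid but renewal doesn't seem to work. Days left: $days_left"
    else
      echo "${i} is valid. Days left: $days_left"
    fi
  else
    echo "${i} is NOT valid. Days left: $days_left"
  fi
done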
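
A possible variation, in case the host isn't reachable on port 443 from where the script runs: read the expiry date from the stored certificate instead of the served one. This is only a sketch. It assumes each entry in acme.json keeps its PEM chain base64-encoded in a `certificate` field, and `days_until` is a helper name I made up. It needs GNU date, like the script above.

```shell
#!/bin/bash
# Sketch: read expiry dates straight from Traefik's acme.json, no network needed.
# Assumption: each entry has a base64-encoded PEM chain in a "certificate" field.
ACMEPATH=${ACMEPATH:-/home/traefik1/.config/state/acme/acme.json}

# Whole days between an openssl-style date ("Jun 30 18:45:39 2025 GMT")
# and a reference time in epoch seconds (defaults to now).
days_until() {
  local end_epoch now_epoch
  end_epoch=$(date -d "$1" +%s) || return 1
  now_epoch=${2:-$(date +%s)}
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

if [[ -r $ACMEPATH ]]; then
  jq -r '.[] .Certificates | .[]? | "\(.domain.main) \(.certificate)"' "$ACMEPATH" |
  while read -r domain b64; do
    # openssl x509 reads the first certificate of the chain, i.e. the leaf.
    not_after=$(base64 -d <<<"$b64" | openssl x509 -noout -enddate | cut -d "=" -f 2)
    echo "${domain}: $(days_until "$not_after") days left"
  done
fi
```

Note that this tells you what Traefik has on disk, not what clients actually get, so it complements the check above rather than replacing it.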

As we saw in the other thread, it will at least log that it tried and failed, though it doesn’t look like it logs any detail about why it failed. I assume (and hope) it logs that info somewhere, but it doesn’t seem to go into the main system log, which makes troubleshooting a challenge.

Using the following log searches should provide a failure reason:

2025-06-30T20:45:39+02:00 2025-06-30T18:45:39Z ERR Error renewing certificate from LE: {wiki.domain.tld []} error="error: one or more domains had a problem:\n[wiki.domain.tld] invalid authorization: acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: 1.2.3.4: Invalid response from http://wiki.domain.tld/.well-known/acme-challenge/63nMJK_Q5oeZci_U1cq-cRx6JZNKsdfafasdfasdf: 404\n" acmeCA=https://acme-v02.api.letsencrypt.org/directory providerName=acmeServer.acme
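
If you want just the domain and the ACME error type out of lines like that, a rough filter could look like this. It's a sketch based only on the sample line above. `renewal_failures` is a name I made up, and how you get the Traefik log onto stdin is left open.

```shell
#!/bin/bash
# Reduce Traefik log lines on stdin to "<domain> <ACME error type>" for each
# failed renewal. The pattern is derived from the single sample line above.
renewal_failures() {
  grep -F 'Error renewing certificate' |
  sed -nE 's/.*from LE: \{([^ ]+) .*urn:ietf:params:acme:error:([A-Za-z]+).*/\1 \2/p'
}

# Example with a shortened copy of the sample line:
renewal_failures <<'EOF'
2025-06-30T18:45:39Z ERR Error renewing certificate from LE: {wiki.domain.tld []} error="acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: 1.2.3.4: Invalid response"
EOF
```

For the sample line this prints `wiki.domain.tld unauthorized`, which is enough to tell a challenge failure apart from, say, a rate limit.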

We could also check whether the right port is open and whether DNS is set up correctly.
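
A minimal sketch of such a check, assuming bash with `/dev/tcp` support plus `getent` and `timeout` on the node. The function names are made up. Port 80 is included because the failed challenge in the log above was an HTTP one.

```shell
#!/bin/bash
# Print the first address the name resolves to; fail if it doesn't resolve.
resolve_host() {
  local addr
  addr=$(getent hosts "$1" | awk 'NR==1 {print $1}')
  [[ -n $addr ]] && echo "$addr"
}

# Succeed if a TCP connection to host $1, port $2 opens within 3 seconds.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

check_domain() {
  local domain=$1 addr port
  if addr=$(resolve_host "$domain"); then
    echo "${domain} resolves to ${addr}"
  else
    echo "${domain} does not resolve"
    return 1
  fi
  for port in 80 443; do
    if port_open "$domain" "$port"; then
      echo "port ${port} is reachable"
    else
      echo "port ${port} is NOT reachable"
    fi
  done
}
```

Usage would be `check_domain wiki.domain.tld`. Keep in mind that a check run from the server itself only shows the local view. It can't prove the ports are reachable from the internet, or that public DNS returns the same address.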

I agree this is an important feature. We will actively monitor certificate expiration dates with the existing core Prometheus/Grafana stack; see "Redesign of the TLS Certificates Page · Issue #7544 · NethServer/dev · GitHub".
