Nethserver 8 backup recap and failure notifications v.2

Hi, I’ve made a new thread since I’ve modified the previous version a lot. [Last Thread]

Nethserver 8 backup recap and failure notifications

The initial script shared by @giacomo caused NS8 to send one mail per module with the backup status.

Since I currently have around 13 modules active, receiving 13 mails in the middle of the night quickly became annoying.
So I decided to come up with a better solution: a single summary mail that includes the backup status of all modules. Additionally, in case the repository is unavailable, a dedicated “full failure” mail is sent.

This was the old 20notify script, located in /var/lib/nethserver/cluster/events/backup-status-changed/ :

#!/bin/bash

# Change the following variables to match your environment
MAIL_FROM="no-reply@nethserver.org"
MAIL_TO="giacomo@nethesis.it"
MAIL_SUBJECT="Backup status changed:"
MAIL_TEMPLATE="The backup status for {BACKUP_NAME} on {MODULE_ID} has changed to {STATUS}. Please check the system for details."

# WARNING - DO NOT EDIT BELOW THIS LINE (unless you know what you're doing)

# Redis command
rdb="redis-cli --raw"

# Read event data from stdin
read -r event_data
if ! echo "$event_data" | jq . >/dev/null 2>&1; then
    echo "Failed to parse JSON input" >&2
    exit 1
fi

# Extract necessary fields from event_data
module_id=$(echo "$event_data" | jq -r '.module_id')
backup_id=$(echo "$event_data" | jq -r '.backup_id')

leader_id=$($rdb hget cluster/environment NODE_ID)
self_id=$NODE_ID

if [[ "$self_id" != "$leader_id" ]]; then
    exit 0 # LEADER ONLY! Do not run this procedure in worker nodes.
fi
backup_name=$($rdb hget "cluster/backup/$backup_id" "name")

errors=$($rdb hget "module/$module_id/backup_status/$backup_id" errors)
if [[ -z "$errors" ]]; then
    echo "INFO: Status unknown, exiting." >&2
    exit 0
fi

if [[ "$errors" == "0" ]]; then
    status="SUCCESS"
else
    status="FAIL"
fi

# Send email
subject="$backup_name ($module_id): $status"
msg="$(echo "$MAIL_TEMPLATE" | sed "s/{BACKUP_NAME}/$backup_name/g; s/{STATUS}/$status/g; s/{MODULE_ID}/$module_id/g")"
echo "$msg" | runagent ns8-sendmail -s "$subject" -f "$MAIL_FROM" "$MAIL_TO"

The new logic

Previously, the script was executed immediately after each single module backup completed.

The new setup works like this:

    1. The script launched immediately after the run-backup command is now just a simple wrapper:
    • In the /var/lib/nethserver/cluster/actions/run-backup directory, where the 50run_backup and 80upload_cluster_backup scripts are located, create a new script called 90notify containing the following:

      #!/bin/bash
      set -euo pipefail
      
      event_data="$(cat)"
      backup_id="$(printf '%s\n' "$event_data" | jq -r '.id // empty')"
      
      [ -n "$backup_id" ] || exit 0
      
      exec /usr/local/bin/ns8-backup-notify/ns8-backup-recap "$backup_id"
      
      
    • Make it executable with sudo chmod +x /var/lib/nethserver/cluster/actions/run-backup/90notify.

    • When the two previous scripts complete successfully, 90notify runs and simply calls the next script, passing along the backup ID it read from the JSON input on stdin.

    2. The actual scripts are stored outside the NethServer core, which makes the setup somewhat safer across updates.

    • In the /usr/local/bin directory, create a sub-directory called ns8-backup-notify.

    • Inside it, create a file called ns8-backup-recap with the following content:

      #!/usr/bin/env python3
      
      import argparse
      import html
      import json
      import subprocess
      import sys
      import time
      
      # Replace "MAIL_FROM_PLACEHOLDER" and "MAIL_TO_PLACEHOLDER" with the corresponding real values.
      
      MAIL_FROM = "MAIL_FROM_PLACEHOLDER"
      MAIL_TO = "MAIL_TO_PLACEHOLDER"
      MAIL_SUBJECT_PREFIX = "Backup recap"
      
      # WARNING | Do not edit below this line, or the script may stop working.
      
      def esc(value):
          if value is None:
              return "-"
          return html.escape(str(value), quote=True)
      
      
      def run_cmd(cmd, input_text=None, check=True):
          proc = subprocess.run(
              cmd,
              input=input_text,
              text=True,
              capture_output=True
          )
          if check and proc.returncode != 0:
              raise RuntimeError(
                  proc.stderr.strip()
                  or proc.stdout.strip()
                  or f"command failed: {' '.join(cmd)}"
              )
          return proc
      
      
      def get_backup_data(backup_id):
          raw = run_cmd(["api-cli", "run", "list-backups"]).stdout
          data = json.loads(raw)
          for backup in data.get("backups", []):
              if str(backup.get("id")) == str(backup_id):
                  return backup
          return None
      
      
      def human_size(num):
          units = ["B", "KB", "MB", "GB", "TB", "PB"]
          n = float(num or 0)
          for unit in units:
              if n < 1024 or unit == units[-1]:
                  if unit == "B":
                      return f"{int(n)} {unit}"
                  return f"{n:.2f} {unit}"
              n /= 1024.0
      
      
      def fmt_ts(ts):
          if not ts:
              return "-"
          return time.strftime("%Y-%m-%d %H:%M:%S %Z", time.localtime(int(ts)))
      
      
      def summarize_backup(backup):
          instances = backup.get("instances", [])
          rows = []
          failed_modules = []
          total_size = 0
          total_files = 0
          started = []
          ended = []
      
          for inst in instances:
              module_id = inst.get("module_id", "_")
              status = inst.get("status") or {}
              success = status.get("success") is True
              state = "SUCCESS" if success else "FAIL"
      
              if not success:
                  failed_modules.append(module_id)
      
              total_size += int(status.get("total_size", 0) or 0)
              total_files += int(status.get("total_file_count", 0) or 0)
      
              if status.get("start"):
                  started.append(int(status["start"]))
              if status.get("end"):
                  ended.append(int(status["end"]))
      
              rows.append({
                  "module_id": module_id,
                  "state": state,
                  "size": int(status.get("total_size", 0) or 0),
                  "files": int(status.get("total_file_count", 0) or 0),
                  "snapshots": int(status.get("snapshots_count", 0) or 0),
              })
      
          return {
              "overall": "SUCCESS",
              "has_module_failures": bool(failed_modules),
              "rows": sorted(rows, key=lambda x: x["module_id"]),
              "failed_modules": sorted(failed_modules),
              "total_instances": len(instances),
              "total_size": total_size,
              "total_files": total_files,
              "start": min(started) if started else None,
              "end": max(ended) if ended else None,
          }
      
      
      def build_subject(backup_name, summary):
          if summary["failed_modules"]:
              return (
                  f"{MAIL_SUBJECT_PREFIX}: SUCCESS with module errors - "
                  f"{backup_name} - {', '.join(summary['failed_modules'])}"
              )
          return f"{MAIL_SUBJECT_PREFIX}: SUCCESS - {backup_name}"
      
      
      def build_body(backup, summary):
          name = backup.get("name", "backup")
          backup_id = backup.get("id", "")
          repository = backup.get("repository", "-")
          retention = backup.get("retention", "-")
          schedule = backup.get("schedule", "-")
      
          overall_color = "#1f7a1f"
          overall_bg = "#eaf7ea"
      
          rows_html = []
          for row in summary["rows"]:
              status_color = "#1f7a1f" if row["state"] == "SUCCESS" else "#b42318"
              status_bg = "#eaf7ea" if row["state"] == "SUCCESS" else "#fdecec"
      
              rows_html.append(f"""
                  <tr>
                  <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;">{esc(row['module_id'])}</td>
                  <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;">
                      <span style="display:inline-block;padding:4px 10px;border-radius:999px;font-weight:600;color:{status_color};background:{status_bg};">
                      {esc(row['state'])}
                      </span>
                  </td>
                  <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;text-align:right;">{esc(human_size(row['size']))}</td>
                  <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;text-align:right;">{esc(row['files'])}</td>
                  <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;text-align:right;">{esc(row['snapshots'])}</td>
                  </tr>
              """)
      
          if summary["failed_modules"]:
              failed_html = "".join(
                  f"<li style='margin:4px 0;'>{esc(mod)}</li>"
                  for mod in summary["failed_modules"]
              )
              failed_block = f"""
                  <div style="margin-top:24px;padding:12px 14px;border-radius:8px;background:#fff7ed;color:#b45309;">
                  <h3 style="margin:0 0 8px 0;font-size:16px;color:#111827;">Modules backup jobs failed:</h3>
                  <ul style="margin:0;padding-left:20px;color:#374151;">
                      {failed_html}
                  </ul>
                  </div>
              """
              overall_label = "Overall status: SUCCESS (with module errors)"
          else:
              failed_block = """
                  <div style="margin-top:24px;padding:12px 14px;border-radius:8px;background:#eaf7ea;color:#1f7a1f;font-weight:600;">
                  All modules backup jobs were successfully completed.
                  </div>
              """
              overall_label = "Overall status: SUCCESS"
      
          return f"""<!DOCTYPE html>
      <html lang="en">
      <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
      <title>Backup recap</title>
      </head>
      <body style="margin:0;padding:24px;background:#f3f4f6;font-family:Trebuchet MS,Segoe UI,sans-serif;color:#111827;">
      <div style="max-width:860px;margin:0 auto;background:#ffffff;border:1px solid #e5e7eb;border-radius:12px;overflow:hidden;">
          <div style="padding:24px 28px;background:#161616;color:#ffffff;">
          <h1 style="margin:0;font-size:24px;line-height:1.2;">Backup status recap for job: {esc(name)}</h1>
          <p style="margin:8px 0 0 0;font-size:14px;color:#d1d5db;">
              Final status recap of backup job: {esc(name)}
          </p>
          </div>
      
          <div style="padding:24px 28px;">
          <div style="margin-bottom:20px;">
              <span style="display:inline-block;padding:6px 12px;border-radius:999px;font-size:14px;font-weight:700;color:{overall_color};background:{overall_bg};">
              {esc(overall_label)}
              </span>
              {failed_block}
          </div>
      
          <table style="width:100%;border-collapse:collapse;margin-bottom:24px;">
              <tr>
              <td style="padding:2px 0;color:#6b7280;width:180px;">Backup name</td>
              <td style="padding:2px 0;font-weight:600;">{esc(name)}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Backup ID</td>
              <td style="padding:2px 0;">{esc(backup_id)}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Repository</td>
              <td style="padding:2px 0;">{esc(repository)}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Schedule</td>
              <td style="padding:2px 0;">{esc(schedule)}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Retention</td>
              <td style="padding:2px 0;">{esc(retention)}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Start time</td>
              <td style="padding:2px 0;">{esc(fmt_ts(summary['start']))}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">End time</td>
              <td style="padding:2px 0;">{esc(fmt_ts(summary['end']))}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Instances</td>
              <td style="padding:2px 0;">{esc(summary['total_instances'])}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Total size</td>
              <td style="padding:2px 0;">{esc(human_size(summary['total_size']))}</td>
              </tr>
              <tr>
              <td style="padding:2px 0;color:#6b7280;">Total files</td>
              <td style="padding:2px 0;">{esc(summary['total_files'])}</td>
              </tr>
          </table>
      
          <h2 style="margin:0 0 12px 0;font-size:18px;color:#111827;">Modules details</h2>
      
          <table style="width:100%;border-collapse:collapse;border:1px solid #e5e7eb;border-radius:8px;overflow:hidden;">
              <thead>
              <tr style="background:#f9fafb;">
                  <th style="text-align:left;padding:12px;border-bottom:1px solid #e5e7eb;">Module</th>
                  <th style="text-align:left;padding:12px;border-bottom:1px solid #e5e7eb;">Status</th>
                  <th style="text-align:right;padding:12px;border-bottom:1px solid #e5e7eb;">Size</th>
                  <th style="text-align:right;padding:12px;border-bottom:1px solid #e5e7eb;">Files</th>
                  <th style="text-align:right;padding:12px;border-bottom:1px solid #e5e7eb;">Snapshots</th>
              </tr>
              </thead>
              <tbody>
              {''.join(rows_html)}
              </tbody>
          </table>
          </div>
      </div>
      </body>
      </html>
      """
      
      
      def build_failure_subject(label="repository unavailable"):
          return f"{MAIL_SUBJECT_PREFIX}: FAIL - {label}"
      
      
      def build_failure_body(error_msg):
          return f"""<!DOCTYPE html>
      <html lang="en">
      <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
      <title>Backup failure</title>
      </head>
      <body style="margin:0;padding:24px;background:#f3f4f6;font-family:Trebuchet MS,Segoe UI,sans-serif;color:#111827;">
      <div style="max-width:760px;margin:0 auto;background:#fff;border:1px solid #e5e7eb;border-radius:12px;overflow:hidden;">
          <div style="padding:24px 28px;background:#161616;color:#fff;">
          <h1 style="margin:0;font-size:24px;">Backup recap: FAIL</h1>
          <p style="margin:8px 0 0 0;font-size:14px;color:#fff;">The backup job failed before completion.</p>
          </div>
          <div style="padding:24px 28px;">
          <p style="margin:0 0 12px 0;"><strong>Error</strong></p>
          <pre style="white-space:pre-wrap;background:#fdecec;border:1px solid #e5e7eb;padding:16px;border-radius:8px;">{esc(error_msg)}</pre>
          </div>
      </div>
      </body>
      </html>
      """
      
      
      def send_mail(subject, body):
          cmd = [
              "runagent", "ns8-sendmail",
              "-s", subject,
              "-f", MAIL_FROM,
              MAIL_TO
          ]
          proc = subprocess.run(cmd, input=body, text=True, capture_output=True)
          if proc.returncode != 0:
              raise RuntimeError(
                  proc.stderr.strip()
                  or proc.stdout.strip()
                  or f"ns8-sendmail failed with exit code {proc.returncode}"
              )
      
      
      def main():
          parser = argparse.ArgumentParser()
          parser.add_argument("backup_id", nargs="?")
          parser.add_argument("--failed", default="")
          args = parser.parse_args()
      
          if args.failed:
              subject = build_failure_subject()
              body = build_failure_body(args.failed)
              send_mail(subject, body)
              return
      
          if not args.backup_id:
              return
      
          backup = get_backup_data(args.backup_id)
          if not backup:
              subject = f"{MAIL_SUBJECT_PREFIX}: FAIL - backup data not available"
              body = build_failure_body("Backup data not available")
              send_mail(subject, body)
              return
      
          summary = summarize_backup(backup)
          backup_name = backup.get("name", f"backup-{args.backup_id}")
          subject = build_subject(backup_name, summary)
          body = build_body(backup, summary)
          send_mail(subject, body)
      
      
      if __name__ == "__main__":
          main()
      
    • Make it executable with sudo chmod +x /usr/local/bin/ns8-backup-notify/ns8-backup-recap
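
To make the hand-off in step 1 more concrete, here is a rough Python analogue of what the 90notify wrapper does with its input. It assumes the payload on stdin is a JSON object with an `id` field, which is what the `jq` filter in the wrapper implies. `extract_backup_id` is a hypothetical helper name, and unlike the shell version this sketch treats invalid JSON as "no ID" instead of exiting with an error.

```python
import json


def extract_backup_id(event_data: str) -> str:
    """Return the backup ID as text, or "" when it is missing.

    Rough analogue of: jq -r '.id // empty'
    """
    try:
        payload = json.loads(event_data)
    except json.JSONDecodeError:
        # Assumption: unparsable input is treated as "nothing to do".
        return ""
    if not isinstance(payload, dict):
        return ""
    value = payload.get("id")
    # jq's `//` operator falls back when the value is null or false.
    if value is None or value is False:
        return ""
    return str(value)


if __name__ == "__main__":
    print(repr(extract_backup_id('{"id": 1}')))
    print(repr(extract_backup_id('{"name": "nightly"}')))
    print(repr(extract_backup_id("not json")))
```

When the result is empty, the wrapper exits quietly and no recap is generated, which matches the `[ -n "$backup_id" ] || exit 0` line in 90notify.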


With this approach, a recap mail is sent only when the backup job runs to completion. The report lists every module and highlights the ones that failed.
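
As an illustration of how the recap separates the job result from the per-module results, the sketch below applies the same rule as summarize_backup (a module counts as failed unless `status.success` is exactly true) to a made-up record. The sample data and module names are invented; only the field names (`instances`, `module_id`, `status.success`) come from the script above, and `failed_modules` and `subject_for` are hypothetical helper names.

```python
# Hypothetical sample shaped like one entry of the list-backups output,
# reduced to the fields the recap script reads.
sample_backup = {
    "id": 1,
    "name": "nightly",
    "instances": [
        {"module_id": "mail1", "status": {"success": True, "total_size": 2048}},
        {"module_id": "nextcloud1", "status": {"success": False, "total_size": 0}},
        {"module_id": "samba1", "status": None},  # no status recorded
    ],
}


def failed_modules(backup):
    """Return the sorted IDs of modules whose status is not an explicit success."""
    failed = []
    for inst in backup.get("instances", []):
        status = inst.get("status") or {}
        if status.get("success") is not True:
            failed.append(inst.get("module_id", "_"))
    return sorted(failed)


def subject_for(backup, prefix="Backup recap"):
    """Build the mail subject with the same wording as build_subject."""
    failed = failed_modules(backup)
    if failed:
        return (
            f"{prefix}: SUCCESS with module errors - "
            f"{backup['name']} - {', '.join(failed)}"
        )
    return f"{prefix}: SUCCESS - {backup['name']}"


if __name__ == "__main__":
    print(failed_modules(sample_backup))
    print(subject_for(sample_backup))
```

Note that a module with no status at all is reported as FAIL, the same as a module whose backup ran and failed.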

For safety, I also wanted to receive a notification if the entire backup job fails, so this is the solution I came up with.


Failed backup job monitoring

As of today (May 2026), NethServer has no hook that fires when a whole backup job fails.
Therefore, an external watcher is required.

  • Create a new ns8-backup-fail-watch script in the /usr/local/bin/ns8-backup-notify/ directory containing:
    #!/usr/bin/env python3
    import json
    import re
    import subprocess
    import sys
    from pathlib import Path
    
    BASE_DIR = Path('/usr/local/bin/ns8-backup-notify')
    STATE_FILE = BASE_DIR / 'last_notified_tasks.json'
    RECAP_BIN = str(BASE_DIR / 'ns8-backup-recap')
    
    cmd = ['journalctl', '--since', '10 minutes ago', '-o', 'json']
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if proc.returncode != 0:
        print(proc.stderr, file=sys.stderr)
        sys.exit(1)
    
    entries = []
    for line in proc.stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except Exception:
            continue
        msg = obj.get('MESSAGE', '')
        if msg:
            entries.append(msg)
    
    task_re = re.compile(r'task/module/([^/]+)/([a-f0-9-]+):?\s+action\s+"?run-backup"?\s+status\s+is\s+"?aborted"?\s+\(1\)\s+at\s+step\s+50run[_-]?backup', re.I)
    error_patterns = {
        'repository init failed': re.compile(r'restic init failed|unable to open repository|failed to create file system', re.I),
        'SMB unreachable': re.compile(r'no route to host|network is unreachable|connection refused|transport endpoint is not connected|host is down', re.I),
        'credentials/auth': re.compile(r'access denied|authentication failed|logon failure|invalid credentials|permission denied', re.I),
    }
    
    state = {}
    if STATE_FILE.exists():
        try:
            state = json.loads(STATE_FILE.read_text())
            if not isinstance(state, dict):
                state = {}
        except Exception:
            state = {}
    
    found = {}
    last_context = None
    for msg in entries:
        for label, rx in error_patterns.items():
            if rx.search(msg):
                last_context = label
                break
        m = task_re.search(msg)
        if not m:
            continue
        module, task_id = m.group(1), m.group(2)
        token = f'{module}:{task_id}'
        if token not in found:
            found[token] = {
                'module': module,
                'task_id': task_id,
                'token': token,
                'category': last_context or 'backup failure',
                'message': msg,
            }
    
    new_items = [v for k, v in found.items() if k not in state]
    if not new_items:
        sys.exit(0)
    
    category_order = ['repository init failed', 'SMB unreachable', 'credentials/auth', 'backup failure']
    for item in new_items:
        if item['category'] not in category_order:
            item['category'] = 'backup failure'
    
    lines = ['Cluster backup failures:']
    for item in sorted(new_items, key=lambda x: (category_order.index(x['category']) if x['category'] in category_order else 99, x['module'], x['task_id'])):
        lines.append(f"- {item['module']} ({item['task_id']}): {item['category']} | {item['message']}")
    
    recap_text = '\n'.join(lines)
    subprocess.run([RECAP_BIN, '--failed', recap_text], check=False)
    
    state.update({item['token']: {'category': item['category'], 'notified_at': True} for item in new_items})
    STATE_FILE.write_text(json.dumps(state, indent=2, sort_keys=True) + '\n')
    
  • Make it executable with sudo chmod +x /usr/local/bin/ns8-backup-notify/ns8-backup-fail-watch

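This script reads the journal entries from the last 10 minutes and uses regular expressions to find run-backup tasks that were aborted at the 50run_backup step. Error keywords found earlier in the log (repository, SMB, or credential problems) are used to label the failure category.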
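
To show what the matching step does, here is a small sketch that runs the watcher's task regex against one log line. The sample line is hand-written to fit the pattern, not copied from a real journal, so compare it with your own journalctl output before relying on it. `parse_failure` is a hypothetical helper name.

```python
import re

# Same pattern as in ns8-backup-fail-watch, split over two lines for readability.
task_re = re.compile(
    r'task/module/([^/]+)/([a-f0-9-]+):?\s+action\s+"?run-backup"?\s+'
    r'status\s+is\s+"?aborted"?\s+\(1\)\s+at\s+step\s+50run[_-]?backup',
    re.I,
)

# Hypothetical journal message, written to fit the pattern above.
sample = (
    'task/module/mail1/0f3c2a1b-1111-4222-8333-abcdefabcdef: '
    'action "run-backup" status is "aborted" (1) at step 50run_backup'
)


def parse_failure(message):
    """Return (module, task_id) for an aborted run-backup task, else None."""
    match = task_re.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_failure(sample))
    print(parse_failure("backup completed"))
```

The module name and task ID captured here are what the watcher joins into the `module:task_id` token it uses to recognize a failure it has already reported.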

When a new match is found, the script calls the previous script, ns8-backup-recap, with the --failed flag and passes the message that will be included in the email body. It then creates or updates a state file (last_notified_tasks.json) in the same directory, so that an error that has already been notified is not reported again.
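
Because the timer fires every minute while the script looks back 10 minutes, the same failure shows up in several consecutive runs, and the state file is what prevents repeated mails. The sketch below isolates that deduplication step. `filter_new` and the temporary path are placeholders of mine, while the token format (`module:task_id`) follows the watcher script.

```python
import json
import tempfile
from pathlib import Path


def filter_new(tokens, state_file):
    """Return the tokens not yet recorded in state_file, then record them."""
    state = {}
    if state_file.exists():
        try:
            loaded = json.loads(state_file.read_text())
            if isinstance(loaded, dict):
                state = loaded
        except json.JSONDecodeError:
            state = {}  # a corrupt file is treated as empty, as in the watcher
    new = [t for t in tokens if t not in state]
    state.update({t: {"notified_at": True} for t in new})
    state_file.write_text(json.dumps(state, indent=2, sort_keys=True) + "\n")
    return new


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "last_notified_tasks.json"
        print(filter_new(["mail1:abc"], path))                # first run
        print(filter_new(["mail1:abc", "samba1:def"], path))  # a later run
```

In the watcher itself, the recap is called with check=False and the state is written afterwards either way, so a mail that fails to send is not retried on the next run.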

  • Now create a service that will run this script:
    • In /etc/systemd/system create a file ns8-backup-fail-watch.service containing:

      [Unit]
      Description=NS8 backup failure watcher
      
      [Service]
      Type=oneshot
      ExecStart=/usr/local/bin/ns8-backup-notify/ns8-backup-fail-watch
      
    • In the same directory create a file ns8-backup-fail-watch.timer containing:

      [Unit]
      Description=Run NS8 backup failure watcher every minute
      
      [Timer]
      OnBootSec=2min
      OnUnitActiveSec=1min
      AccuracySec=15s
      Unit=ns8-backup-fail-watch.service
      
      [Install]
      WantedBy=timers.target
      
    • Then run:

      systemctl daemon-reload
      
      systemctl enable --now ns8-backup-fail-watch.timer
      
      systemctl start ns8-backup-fail-watch.service
      
  • Now, every minute, the watcher checks journalctl for signs of a backup failure and, if it finds a new one, calls the ns8-backup-recap script, which sends the failure email.

This setup now sends a single email with the backup status instead of one email per module, and it also solves the previous issue where no recap was sent when the repository was unavailable.

I’d be happy to receive any advice or suggestions.

Your LK