Hi, I’ve made a new thread since I’ve modified the previous version a lot. [Last Thread]
Nethserver 8 backup recap and failure notifications
The initial script shared by @giacomo caused NS8 to send one mail per module with the backup status.
Since I currently have around 13 active modules, receiving 13 emails in the middle of the night quickly became annoying.
So I decided to come up with a better solution: a single summary email that includes the backup status of all modules. Additionally, if the repository is unavailable, a dedicated “full failure” email is sent.
This was the old 20notify script, located in /var/lib/nethserver/cluster/events/backup-status-changed/:
#!/bin/bash
# Change the following variables to match your environment
MAIL_FROM="no-reply@nethserver.org"
MAIL_TO="giacomo@nethesis.it"
MAIL_SUBJECT="Backup status changed:"
MAIL_TEMPLATE="The backup status for {BACKUP_NAME} on {MODULE_ID} has changed to {STATUS}. Please check the system for details."
# WARNING - DO NOT EDIT BELOW THIS LINE (unless you know what you're doing)
# Redis command
rdb="redis-cli --raw"
# Read event data from stdin
read -r event_data
if ! echo "$event_data" | jq . >/dev/null 2>&1; then
    echo "Failed to parse JSON input" >&2
    exit 1
fi
# Extract necessary fields from event_data
module_id=$(echo "$event_data" | jq -r '.module_id')
backup_id=$(echo "$event_data" | jq -r '.backup_id')
leader_id=$($rdb hget cluster/environment NODE_ID)
self_id=$NODE_ID
if [[ "$self_id" != "$leader_id" ]]; then
    exit 0 # LEADER ONLY! Do not run this procedure in worker nodes.
fi
backup_name=$($rdb hget "cluster/backup/$backup_id" "name")
errors=$($rdb hget "module/$module_id/backup_status/$backup_id" errors)
if [[ -z "$errors" ]]; then
    echo "INFO: Status unknown, exiting." >&2
    exit 0
fi
if [[ "$errors" == "0" ]]; then
    status="SUCCESS"
else
    status="FAIL"
fi
# Send email
subject="$backup_name ($module_id): $status"
msg="$(echo "$MAIL_TEMPLATE" | sed "s/{BACKUP_NAME}/$backup_name/g; s/{STATUS}/$status/g; s/{MODULE_ID}/$module_id/g")"
echo "$msg" | runagent ns8-sendmail -s "$subject" -f "$MAIL_FROM" "$MAIL_TO"
The new logic
Previously the script was executed immediately after each single module backup completed.
The new logic works as follows:
- The script launched immediately after the run-backup command is now just a simple wrapper:
  - In the /var/lib/nethserver/cluster/actions/run-backup directory, where the 50run_backup and 80upload_cluster_backup scripts are located, create a new script called 90notify containing the following:

#!/bin/bash
set -euo pipefail
event_data="$(cat)"
backup_id="$(printf '%s\n' "$event_data" | jq -r '.id // empty')"
[ -n "$backup_id" ] || exit 0
exec /usr/local/bin/ns8-backup-notify/ns8-backup-recap "$backup_id"

  - Make it executable with
sudo chmod +x /var/lib/nethserver/cluster/actions/run-backup/90notify
  - When the two previous scripts complete successfully, 90notify is executed and simply calls the next script using values passed from the previous steps.
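For anyone who prefers to see the wrapper's JSON handling outside of jq, here is a rough Python sketch of what the `.id // empty` step does. The function name extract_backup_id and the sample payloads are my own assumptions, not part of NS8. One difference to keep in mind: the bash wrapper aborts on invalid JSON because of set -e, while this sketch just returns an empty string.

```python
import json


def extract_backup_id(event_text):
    """Return the backup id as text, or "" when it is missing, null or false.

    Hypothetical helper that mirrors jq -r '.id // empty' from the wrapper.
    """
    try:
        event = json.loads(event_text)
    except json.JSONDecodeError:
        # Assumption: treat unparsable input like a missing id.
        return ""
    if not isinstance(event, dict):
        return ""
    value = event.get("id")
    # jq's `//` operator falls back when the left side is null or false.
    if value is None or value is False:
        return ""
    # Strings are returned raw (like jq -r); other values as JSON text.
    return value if isinstance(value, str) else json.dumps(value)


if __name__ == "__main__":
    print(repr(extract_backup_id('{"id": 3}')))
    print(repr(extract_backup_id('{"id": null}')))
```

An empty result corresponds to the wrapper's early exit 0, so no recap is attempted when the action input carries no backup id.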
- The actual scripts are stored outside the NethServer core, which makes the setup somewhat safer across updates.
  - In the /usr/local/bin directory, create a sub-directory called ns8-backup-notify.
  - Inside it, create a file called ns8-backup-recap and put this inside:

#!/usr/bin/env python3
import argparse
import html
import json
import subprocess
import sys
import time

# Replace "MAIL_FROM_PLACEHOLDER" and "MAIL_TO_PLACEHOLDER" with the corresponding real values.
MAIL_FROM = "MAIL_FROM_PLACEHOLDER"
MAIL_TO = "MAIL_TO_PLACEHOLDER"
MAIL_SUBJECT_PREFIX = "Backup recap"

# WARNING | Do not edit below this line, or the script may stop working.


def esc(value):
    if value is None:
        return "-"
    return html.escape(str(value), quote=True)


def run_cmd(cmd, input_text=None, check=True):
    proc = subprocess.run(
        cmd,
        input=input_text,
        text=True,
        capture_output=True
    )
    if check and proc.returncode != 0:
        raise RuntimeError(
            proc.stderr.strip() or proc.stdout.strip() or f"command failed: {' '.join(cmd)}"
        )
    return proc


def get_backup_data(backup_id):
    raw = run_cmd(["api-cli", "run", "list-backups"]).stdout
    data = json.loads(raw)
    for backup in data.get("backups", []):
        if str(backup.get("id")) == str(backup_id):
            return backup
    return None


def human_size(num):
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    n = float(num or 0)
    for unit in units:
        if n < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(n)} {unit}"
            return f"{n:.2f} {unit}"
        n /= 1024.0


def fmt_ts(ts):
    if not ts:
        return "-"
    return time.strftime("%Y-%m-%d %H:%M:%S %Z", time.localtime(int(ts)))


def summarize_backup(backup):
    instances = backup.get("instances", [])
    rows = []
    failed_modules = []
    total_size = 0
    total_files = 0
    started = []
    ended = []
    for inst in instances:
        module_id = inst.get("module_id", "_")
        status = inst.get("status") or {}
        success = status.get("success") is True
        state = "SUCCESS" if success else "FAIL"
        if not success:
            failed_modules.append(module_id)
        total_size += int(status.get("total_size", 0) or 0)
        total_files += int(status.get("total_file_count", 0) or 0)
        if status.get("start"):
            started.append(int(status["start"]))
        if status.get("end"):
            ended.append(int(status["end"]))
        rows.append({
            "module_id": module_id,
            "state": state,
            "size": int(status.get("total_size", 0) or 0),
            "files": int(status.get("total_file_count", 0) or 0),
            "snapshots": int(status.get("snapshots_count", 0) or 0),
        })
    return {
        "overall": "SUCCESS",
        "has_module_failures": bool(failed_modules),
        "rows": sorted(rows, key=lambda x: x["module_id"]),
        "failed_modules": sorted(failed_modules),
        "total_instances": len(instances),
        "total_size": total_size,
        "total_files": total_files,
        "start": min(started) if started else None,
        "end": max(ended) if ended else None,
    }


def build_subject(backup_name, summary):
    if summary["failed_modules"]:
        return (
            f"{MAIL_SUBJECT_PREFIX}: SUCCESS with module errors - "
            f"{backup_name} - {', '.join(summary['failed_modules'])}"
        )
    return f"{MAIL_SUBJECT_PREFIX}: SUCCESS - {backup_name}"


def build_body(backup, summary):
    name = backup.get("name", "backup")
    backup_id = backup.get("id", "")
    repository = backup.get("repository", "-")
    retention = backup.get("retention", "-")
    schedule = backup.get("schedule", "-")
    overall_color = "#1f7a1f"
    overall_bg = "#eaf7ea"
    rows_html = []
    for row in summary["rows"]:
        status_color = "#1f7a1f" if row["state"] == "SUCCESS" else "#b42318"
        status_bg = "#eaf7ea" if row["state"] == "SUCCESS" else "#fdecec"
        rows_html.append(f"""
        <tr>
          <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;">{esc(row['module_id'])}</td>
          <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;">
            <span style="display:inline-block;padding:4px 10px;border-radius:999px;font-weight:600;color:{status_color};background:{status_bg};">
              {esc(row['state'])}
            </span>
          </td>
          <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;text-align:right;">{esc(human_size(row['size']))}</td>
          <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;text-align:right;">{esc(row['files'])}</td>
          <td style="padding:10px 12px;border-bottom:1px solid #e5e7eb;text-align:right;">{esc(row['snapshots'])}</td>
        </tr>
        """)
    if summary["failed_modules"]:
        failed_html = "".join(
            f"<li style='margin:4px 0;'>{esc(mod)}</li>"
            for mod in summary["failed_modules"]
        )
        failed_block = f"""
        <div style="margin-top:24px;padding:12px 14px;border-radius:8px;background:#fff7ed;color:#b45309;">
          <h3 style="margin:0 0 8px 0;font-size:16px;color:#111827;">Modules backup jobs failed:</h3>
          <ul style="margin:0;padding-left:20px;color:#374151;">
            {failed_html}
          </ul>
        </div>
        """
        overall_label = "Overall status: SUCCESS (with module errors)"
    else:
        failed_block = """
        <div style="margin-top:24px;padding:12px 14px;border-radius:8px;background:#eaf7ea;color:#1f7a1f;font-weight:600;">
          All modules backup jobs were successfully completed.
        </div>
        """
        overall_label = "Overall status: SUCCESS"
    return f"""<!DOCTYPE html>
<html lang="it">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Backup recap</title>
</head>
<body style="margin:0;padding:24px;background:#f3f4f6;font-family:Trebuchet MS,Segoe UI,sans-serif;color:#111827;">
  <div style="max-width:860px;margin:0 auto;background:#ffffff;border:1px solid #e5e7eb;border-radius:12px;overflow:hidden;">
    <div style="padding:24px 28px;background:#161616;color:#ffffff;">
      <h1 style="margin:0;font-size:24px;line-height:1.2;">Backup status recap for job: {esc(name)}</h1>
      <p style="margin:8px 0 0 0;font-size:14px;color:#d1d5db;">
        Final status recap of backup job: {esc(name)}
      </p>
    </div>
    <div style="padding:24px 28px;">
      <div style="margin-bottom:20px;">
        <span style="display:inline-block;padding:6px 12px;border-radius:999px;font-size:14px;font-weight:700;color:{overall_color};background:{overall_bg};">
          {esc(overall_label)}
        </span>
        {failed_block}
      </div>
      <table style="width:100%;border-collapse:collapse;margin-bottom:24px;">
        <tr>
          <td style="padding:2px 0;color:#6b7280;width:180px;">Backup name</td>
          <td style="padding:2px 0;font-weight:600;">{esc(name)}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Backup ID</td>
          <td style="padding:2px 0;">{esc(backup_id)}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Repository</td>
          <td style="padding:2px 0;">{esc(repository)}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Schedule</td>
          <td style="padding:2px 0;">{esc(schedule)}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Retention</td>
          <td style="padding:2px 0;">{esc(retention)}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Start time</td>
          <td style="padding:2px 0;">{esc(fmt_ts(summary['start']))}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">End time</td>
          <td style="padding:2px 0;">{esc(fmt_ts(summary['end']))}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Instances</td>
          <td style="padding:2px 0;">{esc(summary['total_instances'])}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Total size</td>
          <td style="padding:2px 0;">{esc(human_size(summary['total_size']))}</td>
        </tr>
        <tr>
          <td style="padding:2px 0;color:#6b7280;">Total files</td>
          <td style="padding:2px 0;">{esc(summary['total_files'])}</td>
        </tr>
      </table>
      <h2 style="margin:0 0 12px 0;font-size:18px;color:#111827;">Modules details</h2>
      <table style="width:100%;border-collapse:collapse;border:1px solid #e5e7eb;border-radius:8px;overflow:hidden;">
        <thead>
          <tr style="background:#f9fafb;">
            <th style="text-align:left;padding:12px;border-bottom:1px solid #e5e7eb;">Module</th>
            <th style="text-align:left;padding:12px;border-bottom:1px solid #e5e7eb;">Status</th>
            <th style="text-align:right;padding:12px;border-bottom:1px solid #e5e7eb;">Size</th>
            <th style="text-align:right;padding:12px;border-bottom:1px solid #e5e7eb;">Files</th>
            <th style="text-align:right;padding:12px;border-bottom:1px solid #e5e7eb;">Snapshots</th>
          </tr>
        </thead>
        <tbody>
          {''.join(rows_html)}
        </tbody>
      </table>
    </div>
  </div>
</body>
</html>
"""


def build_failure_subject(label="repository unavailable"):
    return f"{MAIL_SUBJECT_PREFIX}: FAIL - {label}"


def build_failure_body(error_msg):
    return f"""<!DOCTYPE html>
<html lang="it">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Backup failure</title>
</head>
<body style="margin:0;padding:24px;background:#f3f4f6;font-family:Trebuchet MS,Segoe UI,sans-serif;color:#111827;">
  <div style="max-width:760px;margin:0 auto;background:#fff;border:1px solid #e5e7eb;border-radius:12px;overflow:hidden;">
    <div style="padding:24px 28px;background:#161616;color:#fff;">
      <h1 style="margin:0;font-size:24px;">Backup recap: FAIL</h1>
      <p style="margin:8px 0 0 0;font-size:14px;color:#fff;">The backup job failed before completion.</p>
    </div>
    <div style="padding:24px 28px;">
      <p style="margin:0 0 12px 0;"><strong>Error</strong></p>
      <pre style="white-space:pre-wrap;background:#fdecec;border:1px solid #e5e7eb;padding:16px;border-radius:8px;">{esc(error_msg)}</pre>
    </div>
  </div>
</body>
</html>
"""


def send_mail(subject, body):
    cmd = [
        "runagent", "ns8-sendmail",
        "-s", subject,
        "-f", MAIL_FROM,
        MAIL_TO
    ]
    proc = subprocess.run(cmd, input=body, text=True, capture_output=True)
    if proc.returncode != 0:
        raise RuntimeError(
            proc.stderr.strip() or proc.stdout.strip() or f"ns8-sendmail failed with exit code {proc.returncode}"
        )


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("backup_id", nargs="?")
    parser.add_argument("--failed", default="")
    args = parser.parse_args()
    if args.failed:
        subject = build_failure_subject()
        body = build_failure_body(args.failed)
        send_mail(subject, body)
        return
    if not args.backup_id:
        return
    backup = get_backup_data(args.backup_id)
    if not backup:
        subject = f"{MAIL_SUBJECT_PREFIX}: FAIL - backup data not available"
        body = build_failure_body("Backup data not available")
        send_mail(subject, body)
        return
    summary = summarize_backup(backup)
    backup_name = backup.get("name", f"backup-{args.backup_id}")
    subject = build_subject(backup_name, summary)
    body = build_body(backup, summary)
    send_mail(subject, body)


if __name__ == "__main__":
    main()

  - Make it executable with
sudo chmod +x /usr/local/bin/ns8-backup-notify/ns8-backup-recap
With this approach, a single recap email is sent only when the backup job runs to completion. The report lists every module and highlights the ones that failed.
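To make the recap logic a bit more concrete, here is a trimmed-down sketch of how the per-module statuses turn into the mail subject. It only mirrors the counting and subject parts of ns8-backup-recap; the function names summarize and subject, the module names and the sample payload are made up for illustration and may not match what list-backups really returns on your system.

```python
def summarize(instances):
    """Collect failed module ids and the total size from a list of instances."""
    failed = []
    total_size = 0
    for inst in instances:
        # A missing or null status counts as a failure, as in the full script.
        status = inst.get("status") or {}
        if status.get("success") is not True:
            failed.append(inst.get("module_id", "_"))
        total_size += int(status.get("total_size", 0) or 0)
    return {"failed_modules": sorted(failed), "total_size": total_size}


def subject(prefix, backup_name, summary):
    """Build the subject line in the same shape as the recap script."""
    if summary["failed_modules"]:
        return (
            f"{prefix}: SUCCESS with module errors - "
            f"{backup_name} - {', '.join(summary['failed_modules'])}"
        )
    return f"{prefix}: SUCCESS - {backup_name}"


# Hypothetical sample data, not real NS8 output.
sample = [
    {"module_id": "mail1", "status": {"success": True, "total_size": 2048}},
    {"module_id": "nextcloud1", "status": {"success": False, "total_size": 0}},
    {"module_id": "samba1", "status": None},
]

summary = summarize(sample)
print(subject("Backup recap", "nightly", summary))
# prints: Backup recap: SUCCESS with module errors - nightly - nextcloud1, samba1
```

Note that a module without any status is reported as failed, so a module that never ran will also show up in the subject.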
For safety, I also wanted to receive a notification if the entire backup job fails, so this is the solution I came up with.
Failed backup job monitoring
As of today (May 2026), NethServer has no hook that fires when an entire backup job fails.
Therefore, an external watcher is required.
- Create a new ns8-backup-fail-watch script in the /usr/local/bin/ns8-backup-notify/ directory containing:

#!/usr/bin/env python3
import json
import re
import subprocess
import sys
from pathlib import Path

BASE_DIR = Path('/usr/local/bin/ns8-backup-notify')
STATE_FILE = BASE_DIR / 'last_notified_tasks.json'
RECAP_BIN = str(BASE_DIR / 'ns8-backup-recap')

# Read the last 10 minutes of the journal, one JSON object per line
cmd = ['journalctl', '--since', '10 minutes ago', '-o', 'json']
proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
if proc.returncode != 0:
    print(proc.stderr, file=sys.stderr)
    sys.exit(1)

entries = []
for line in proc.stdout.splitlines():
    line = line.strip()
    if not line:
        continue
    try:
        obj = json.loads(line)
    except Exception:
        continue
    msg = obj.get('MESSAGE', '')
    # journalctl emits non-UTF-8 messages as byte arrays; keep only plain strings
    if isinstance(msg, str) and msg:
        entries.append(msg)

# Matches a run-backup task aborted at the 50run_backup step
task_re = re.compile(r'task/module/([^/]+)/([a-f0-9-]+):?\s+action\s+"?run-backup"?\s+status\s+is\s+"?aborted"?\s+\(1\)\s+at\s+step\s+50run[_-]?backup', re.I)

error_patterns = {
    'repository init failed': re.compile(r'restic init failed|unable to open repository|failed to create file system', re.I),
    'SMB unreachable': re.compile(r'no route to host|network is unreachable|connection refused|transport endpoint is not connected|host is down', re.I),
    'credentials/auth': re.compile(r'access denied|authentication failed|logon failure|invalid credentials|permission denied', re.I),
}

# Load the tasks that were already notified
state = {}
if STATE_FILE.exists():
    try:
        state = json.loads(STATE_FILE.read_text())
        if not isinstance(state, dict):
            state = {}
    except Exception:
        state = {}

found = {}
last_context = None
for msg in entries:
    for label, rx in error_patterns.items():
        if rx.search(msg):
            last_context = label
            break
    m = task_re.search(msg)
    if not m:
        continue
    module, task_id = m.group(1), m.group(2)
    token = f'{module}:{task_id}'
    if token not in found:
        found[token] = {
            'module': module,
            'task_id': task_id,
            'token': token,
            'category': last_context or 'backup failure',
            'message': msg,
        }

new_items = [v for k, v in found.items() if k not in state]
if not new_items:
    sys.exit(0)

category_order = ['repository init failed', 'SMB unreachable', 'credentials/auth', 'backup failure']
for item in new_items:
    if item['category'] not in category_order:
        item['category'] = 'backup failure'

lines = ['Cluster backup failures:']
for item in sorted(new_items, key=lambda x: (category_order.index(x['category']) if x['category'] in category_order else 99, x['module'], x['task_id'])):
    lines.append(f"- {item['module']} ({item['task_id']}): {item['category']} | {item['message']}")
recap_text = '\n'.join(lines)

# Send the failure mail, then remember what was notified
subprocess.run([RECAP_BIN, '--failed', recap_text], check=False)
state.update({item['token']: {'category': item['category'], 'notified_at': True} for item in new_items})
STATE_FILE.write_text(json.dumps(state, indent=2, sort_keys=True) + '\n')

- Make it executable with
sudo chmod +x /usr/local/bin/ns8-backup-notify/ns8-backup-fail-watch
This script reads the journalctl logs from the last 10 minutes and searches for failure-related keywords associated with run-backup using regular expressions.
When a match is found that has not been reported yet, it calls the previous script, ns8-backup-recap, with the --failed flag and passes the message that will be included in the email body. It then creates or updates a state file (last_notified_tasks.json) in the same directory to record that the error has already been notified.
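The matching and de-duplication described above can be tried in isolation with a small sketch like the one below. The regular expression is the one from the watcher; the function name new_failures and the sample journal line are my own guesses at what such a log entry looks like, so please check them against your real journalctl output before relying on them.

```python
import re

# Same pattern the watcher uses to spot an aborted run-backup task.
TASK_RE = re.compile(
    r'task/module/([^/]+)/([a-f0-9-]+):?\s+action\s+"?run-backup"?\s+status\s+is'
    r'\s+"?aborted"?\s+\(1\)\s+at\s+step\s+50run[_-]?backup',
    re.I,
)


def new_failures(messages, already_notified):
    """Return {token: message} for failed tasks that were not notified yet."""
    found = {}
    for msg in messages:
        m = TASK_RE.search(msg)
        if not m:
            continue
        token = f"{m.group(1)}:{m.group(2)}"
        if token not in already_notified:
            # Keep the first message seen for each module/task pair.
            found.setdefault(token, msg)
    return found


# Hypothetical journal line, shaped after what the regex expects.
sample_line = (
    'task/module/mail1/0a1b2c3d-1111-2222-3333-444455556666: '
    'action "run-backup" status is "aborted" (1) at step 50run_backup'
)

first_run = new_failures([sample_line, "some unrelated message"], {})
second_run = new_failures([sample_line], {"mail1:0a1b2c3d-1111-2222-3333-444455556666": {}})
print(sorted(first_run))   # the failure is reported once
print(sorted(second_run))  # and skipped when already in the state
```

Because the watcher runs every minute over a 10-minute window, the same journal line is seen several times; the state lookup is what keeps that from turning into repeated emails.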
- Now create a service that will run this script:
  - In /etc/systemd/system create a file ns8-backup-fail-watch.service containing:

[Unit]
Description=NS8 backup failure watcher

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ns8-backup-notify/ns8-backup-fail-watch

  - In the same directory create a file ns8-backup-fail-watch.timer containing:

[Unit]
Description=Run NS8 backup failure watcher every minute

[Timer]
OnBootSec=2min
OnUnitActiveSec=1min
AccuracySec=15s
Unit=ns8-backup-fail-watch.service

[Install]
WantedBy=timers.target

  - Then run:

systemctl daemon-reload
systemctl enable --now ns8-backup-fail-watch.timer
systemctl start ns8-backup-fail-watch.service

- Now, every minute, the watcher checks journalctl for signs of a backup failure and calls the ns8-backup-recap script, which sends the failure email.