Spurious warnings?

(Dan) #1

So I signed up for a subscription a day or two ago, and today I’ve gotten some warning emails that don’t make much sense. I got a warning early this morning about system load, and a follow-up email a minute or two later saying it was OK, but then I got two later this morning that look like this:
The other was for the boot partition.

This looks serious, but what does it mean? The server was up and running at the time and remains so (40 days’ uptime right now), so clearly the root partition hasn’t actually failed. I logged into my dashboard, but there was nothing more there (other than that this was classified as “medium” severity, which hardly seems correct if the root partition had in fact failed). So how do I track down what triggered this warning?

(Michael Träumner) #2

I saw the same behaviour in my tests. @giacomo suspected that it was caused by a job such as a backup. In my case it wasn’t a backup, but my test server doesn’t have many resources, so I guess that was the problem.
Perhaps @giacomo can help you track down your other problem.

(Giacomo Sanchietti) #3

It means that the root partition is almost full. I agree with you that the label is not descriptive enough.

Labels are defined here (https://github.com/nethesis/dartagnan/blob/master/athos/utils/utils.go#L111), with something like:

T(".. something ..").

We just need to change the “something” text. I’d suggest these corrections:

  • Boot partition -> Boot partition free space
  • Root partition -> Root partition free space
  • SWAP partition -> SWAP free space

You can find everything in /var/log/messages, with entries like these:

/var/log/messages-20180107:Dec 31 06:09:30 nethservice collectd[18129]: [WARNING] load:load: Host nethservice.nethesis.it, plugin load type load: Data source "midterm" is currently 5.330000. That is above the warning threshold of 5.000000.
/var/log/messages-20180107:Dec 31 06:19:20 nethservice collectd[18129]: [OK] load:load: Host nethservice.nethesis.it, plugin load type load: All data sources are within range again. Current value of "shortterm" is 1.050000.
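
To pull entries like these out of a large log, a small filter can help. The following is only a sketch: the regular expression is derived from the two sample lines above and may not match other collectd notification formats, and the function name `parse_collectd_notifications` is made up for this example.

```python
import re

# Pattern inferred from the two sample log lines in this thread (an assumption,
# not an official collectd format description).
LINE_RE = re.compile(
    r"collectd\[\d+\]: \[(?P<severity>[A-Z]+)\] "
    r"(?P<check>\S+): Host (?P<host>[^,]+), (?P<message>.*)$"
)


def parse_collectd_notifications(lines):
    """Yield (severity, check, host, message) for each line that matches."""
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            yield (
                match["severity"],
                match["check"],
                match["host"],
                match["message"],
            )


sample = [
    '/var/log/messages-20180107:Dec 31 06:09:30 nethservice collectd[18129]: '
    '[WARNING] load:load: Host nethservice.nethesis.it, plugin load type load: '
    'Data source "midterm" is currently 5.330000. '
    'That is above the warning threshold of 5.000000.',
    '/var/log/messages-20180107:Dec 31 06:19:20 nethservice collectd[18129]: '
    '[OK] load:load: Host nethservice.nethesis.it, plugin load type load: '
    'All data sources are within range again. '
    'Current value of "shortterm" is 1.050000.',
]

for severity, check, host, message in parse_collectd_notifications(sample):
    print(severity, check, host, message, sep=" | ")
```

To run it against a real log, pass an open file object instead of `sample`; lines that do not match the pattern are skipped.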

The same alert rules are applied to thousands of active systems, and they usually spot real problems.
Of course, not every machine is the same and the thresholds could be adjusted (we have it on the roadmap, but I can’t give you any hint about when it will be available).

You can check for load spikes by looking at the collectd graphs and logs for the time when the alert was active.

(Dan) #4

…which isn’t true; it’s about 60% empty (so over 350 GB free). And the boot partition, for which I received the same message, is 76% empty (690M free).

Hmmm, not so much. There’s one mention of threshold in the relevant log, and it’s about a load average. Nothing about disk space.

(Giacomo Sanchietti) #5

This is the default config: at least 15% of free space for 2 ticks (https://github.com/NethServer/nethserver-subscription/blob/master/root/etc/e-smith/templates/etc/collectd.d/threshold.conf/10threshold20df#L25).
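
As a rough illustration of that rule, here is a sketch of the check. It is not collectd’s actual code: it assumes “2 ticks” means two consecutive readings below the minimum, and the function and parameter names are made up for this example.

```python
def should_alert(free_percent_readings, min_free_percent=15.0, hits=2):
    """Return True once `hits` consecutive readings fall below the minimum.

    Assumption: "2 ticks" is read here as two consecutive readings.
    """
    consecutive = 0
    for free in free_percent_readings:
        if free < min_free_percent:
            consecutive += 1
            if consecutive >= hits:
                return True
        else:
            # A healthy reading resets the streak.
            consecutive = 0
    return False


print(should_alert([60.0, 60.0, 60.0]))  # False: 60% free is well above 15%
print(should_alert([14.0, 20.0, 14.0]))  # False: the low readings are not consecutive
print(should_alert([14.0, 13.5]))        # True: two consecutive readings below 15%
```

Under this reading, a partition that is steadily 60% or 76% free would never trigger the rule.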

You’re right, I have no idea why collectd doesn’t log it.
If you still want to dig into it, you can enable debug for the Python plugin and see everything that is sent to the server (https://github.com/NethServer/nethserver-subscription#collectd-python-plugin)

(Dan) #6

I’d suggest that the emails should be made much more specific: “At $TIME on $DATE, your server’s disk space was 86% full; the warning threshold is 85%.”
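
A message in that shape could be built along these lines. This is only a sketch of the suggested wording: it assumes the measured value and the threshold are available to whatever composes the email, and the function name and sample values are made up for this example.

```python
from datetime import datetime


def format_disk_alert(when, used_percent, threshold_percent):
    """Build an alert sentence that states the measured value and the threshold."""
    return (
        f"At {when:%H:%M} on {when:%Y-%m-%d}, your server's disk space was "
        f"{used_percent:.0f}% full; the warning threshold is "
        f"{threshold_percent:.0f}%."
    )


# Arbitrary sample values, chosen only to show the output format.
print(format_disk_alert(datetime(2018, 1, 7, 6, 9), 86, 85))
```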

(Giacomo Sanchietti) #7

The date and time are already in the mail, and both are also saved in the alert history (which is not currently displayed in the UI - https://github.com/nethesis/dartagnan/blob/master/deploy/roles/athos/files/database.sql#L82).

The current value and the threshold configuration are currently neither sent to nor stored on the server; implementing that would require much more work 🙂