What constitutes an offline node

Got this in my mailbox this morning:

What are the conditions for considering a node offline?

I’m running NS8 in an ESXi VM, which shows no evidence of being underpowered or resource constrained.

Do you run more nodes? It seems that just node 1 is affected.

The node agents communicate with the API server via the Redis database. If a node does not answer after some attempts, it is considered offline.
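
To illustrate the "no answer after some attempts" idea, here is a minimal shell sketch. It is not the NS8 implementation: the function name check_node, the default of 3 attempts, and the delay between attempts are all assumptions made up for this example.

```shell
#!/bin/sh
# Illustrative sketch only, NOT the actual NS8 logic.
# check_node PROBE [ATTEMPTS] [DELAY]
#   PROBE    command that succeeds (exit 0) when the node answers
#   ATTEMPTS how many times to try before giving up (assumed default: 3)
#   DELAY    seconds to wait between attempts (assumed default: 1)
# Prints "online" on the first successful probe, "offline" if all fail.
check_node() {
    probe=$1
    attempts=${2:-3}
    delay=${3:-1}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # $probe is intentionally unquoted so it may carry arguments
        if $probe >/dev/null 2>&1; then
            echo online
            return 0
        fi
        # Wait before the next attempt, but not after the last one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo offline
    return 1
}
```

For example, check_node "ping -c 1 -W 1 10.0.0.2" 3 2 would probe a (hypothetical) node address three times, two seconds apart. A single slow answer does not mark the node offline; only a run of consecutive failures does.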

To check the cluster status from CLI:

api-cli run cluster/get-cluster-status | jq

It may help to restart the metrics or Loki services.

Metrics:

runagent -m metrics1 systemctl --user restart prometheus.service alertmanager.service alert-proxy.service

Loki:

runagent -m loki1 systemctl --user restart loki.service

Please also check the logs for errors; they may contain more information about the cause.
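
One possible way to narrow the logs down, as a sketch: the filter_errors helper and its keyword list are assumptions for this example, not something NS8 provides. journalctl with --since and --no-pager is standard systemd.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that mention common failure
# keywords (case-insensitive). Adjust the pattern to taste.
filter_errors() {
    grep -iE 'error|fail|timeout'
}

# Usage (run on the affected node, as root):
#   journalctl --since "1 hour ago" --no-pager | filter_errors
```

Adjusting the --since value to cover the time when the node was reported offline keeps the output short.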