Unable to get cluster status after OS upgrade

I’m using Markus’s Rocky image from here, which is v9.1. Recently, upgrade packages were released to move to v9.2. I applied these and then rebooted.

Trying to get the Cluster Status now throws this:
[screenshot of the error]
What other information can I collect on this?

Cheers.


I can’t reproduce it on Proxmox; maybe it’s another ESXi issue?

Is it a single node or a cluster?

Which modules did you install?

ls /home/


Single node.

  • Samba file sharing
  • Mail
  • Nginx
  • CrowdSec
  • MinIO
[rocky@node ~]$ ls /home/
ldapproxy1  loki1  mail1  minio1  rocky  samba1  traefik1  webserver1
[rocky@node ~]$

Cheers.

Not sure if any of this helps, but here’s a random selection of messages pulled from the log, starting very early in the boot process:

May 23 18:26:31 node cluster[830]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:32 node node[832]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:32 node crowdsec1[831]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:32 node promtail1[833]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
--- snip ---
May 23 18:26:43 node minio1[1057]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:43 node systemd[944]: Started Rootless module/webserver1 agent.
May 23 18:26:43 node loki1[1064]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:43 node systemd[948]: Starting Create User's Volatile Files and Directories...
May 23 18:26:44 node samba1[1078]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:44 node ldapproxy1[1072]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
--- snip ---
May 23 18:26:44 node webserver1[1093]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:44 node systemd[941]: Starting Traefik edge proxy...
May 23 18:26:44 node systemd[948]: Started Rootless module/mail1 agent.
May 23 18:26:44 node traefik1[1103]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 23 18:26:44 node mail1[1112]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
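
(For reference, lines like these can be pulled from the current boot’s journal with something like the following; just a sketch, and the exact filter may vary:)

journalctl -b | grep 'Task queue pop error'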

The following block was also repeated for mail1 (two different task numbers) and ldapproxy1:

May 23 18:26:46 node crowdsec1[917]: Traceback (most recent call last):
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/connection.py", line 559, in connect
May 23 18:26:46 node crowdsec1[917]:    sock = self._connect()
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/connection.py", line 615, in _connect
May 23 18:26:46 node crowdsec1[917]:    raise err
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/connection.py", line 603, in _connect
May 23 18:26:46 node crowdsec1[917]:    sock.connect(socket_address)
May 23 18:26:46 node crowdsec1[917]: ConnectionRefusedError: [Errno 111] Connection refused
May 23 18:26:46 node crowdsec1[917]: During handling of the above exception, another exception occurred:
May 23 18:26:46 node crowdsec1[917]: Traceback (most recent call last):
May 23 18:26:46 node crowdsec1[917]:  File "/var/lib/nethserver/crowdsec1/bin/expand-configuration", line 37, in <module>
May 23 18:26:46 node crowdsec1[917]:    for kenv in rdb.scan_iter(match='module/*/environment'):
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/client.py", line 2131, in scan_iter
May 23 18:26:46 node crowdsec1[917]:    cursor, data = self.scan(cursor=cursor, match=match,
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/client.py", line 2112, in scan
May 23 18:26:46 node crowdsec1[917]:    return self.execute_command('SCAN', *pieces)
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/client.py", line 898, in execute_command
May 23 18:26:46 node crowdsec1[917]:    conn = self.connection or pool.get_connection(command_name, **options)
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/connection.py", line 1192, in get_connection
May 23 18:26:46 node crowdsec1[917]:    connection.connect()
May 23 18:26:46 node crowdsec1[917]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/connection.py", line 563, in connect
May 23 18:26:46 node crowdsec1[917]:    raise ConnectionError(self._error_message(e))
May 23 18:26:46 node crowdsec1[917]: redis.exceptions.ConnectionError: Error 111 connecting to cluster-leader:6379. Connection refused.
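
(Side note: once Redis is reachable again, the key scan that expand-configuration performs can be reproduced from a shell; a sketch, assuming redis-cli is available inside the container and no extra authentication is required:)

podman exec redis redis-cli --scan --pattern 'module/*/environment'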

Then this:

May 23 18:26:53 node crowdsec1[1468]: Traceback (most recent call last):
May 23 18:26:53 node crowdsec1[1468]:  File "/var/lib/nethserver/crowdsec1/bin/expand-configuration", line 37, in <module>
May 23 18:26:53 node crowdsec1[1468]:    for kenv in rdb.scan_iter(match='module/*/environment'):
May 23 18:26:53 node crowdsec1[1468]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/client.py", line 2131, in scan_iter
May 23 18:26:53 node crowdsec1[1468]:    cursor, data = self.scan(cursor=cursor, match=match,
May 23 18:26:53 node crowdsec1[1468]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/client.py", line 2112, in scan
May 23 18:26:53 node crowdsec1[1468]:    return self.execute_command('SCAN', *pieces)
May 23 18:26:53 node crowdsec1[1468]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/client.py", line 898, in execute_command
May 23 18:26:53 node crowdsec1[1468]:    conn = self.connection or pool.get_connection(command_name, **options)
May 23 18:26:53 node crowdsec1[1468]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/connection.py", line 1192, in get_connection
May 23 18:26:53 node crowdsec1[1468]:    connection.connect()
May 23 18:26:53 node crowdsec1[1468]:  File "/usr/local/agent/pyenv/lib64/python3.9/site-packages/redis/connection.py", line 563, in connect
May 23 18:26:53 node crowdsec1[1468]:    raise ConnectionError(self._error_message(e))
May 23 18:26:53 node crowdsec1[1468]: redis.exceptions.ConnectionError: Error 111 connecting to cluster-leader:6379. Connection refused.

More refused connections:

May 23 18:27:11 node redis[2428]: /usr/bin/bash: connect: Connection refused
May 23 18:27:11 node redis[2428]: /usr/bin/bash: line 1: /dev/tcp/127.0.0.1/6379: Connection refused
--- snip ---
May 23 18:27:14 node mail1[3050]: /usr/bin/bash: connect: Connection refused
May 23 18:27:14 node mail1[3050]: /usr/bin/bash: line 1: /dev/tcp/127.0.0.1/9288: Connection refused
--- snip ---
May 23 18:27:15 node mail1[3050]: /usr/bin/bash: connect: Connection refused
May 23 18:27:15 node mail1[3050]: /usr/bin/bash: line 1: /dev/tcp/127.0.0.1/9288: Connection refused
May 23 18:27:15 node bash[2086]: /usr/bin/bash: connect: Connection refused
May 23 18:27:15 node bash[2086]: /usr/bin/bash: line 1: /dev/tcp/192.168.0.109/53: Connection refused
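
(These messages come from health-check loops that probe a TCP port with bash’s built-in /dev/tcp pseudo-device, the same pattern visible in the redis unit’s ExecStartPost below. The probe can be run by hand:)

bash -c 'exec 3<>/dev/tcp/127.0.0.1/6379 && echo port open'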

At this point, it got more and more difficult to pull things, as CrowdSec started to flood the log with thousands and thousands of messages.

Cheers.

It seems that redis is not running.
As root, you can check the status with:

systemctl status redis

You might also take a look at podman:

podman ps -a
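
If the container is there, you can also check its log output directly (the core container is named redis, as the podman ps output below shows):

podman logs redis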

Try to restart redis:

systemctl restart redis
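
Afterwards, to double-check that something is actually listening on the Redis port, you can run (assuming the iproute tools are installed, which they are by default on Rocky):

ss -tlnp | grep 6379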

If something goes wrong, please paste the relevant part from journalctl.

Again, thanks for your time!

[root@node ~]# systemctl status redis
● redis.service - Core Redis DB
     Loaded: loaded (/etc/systemd/system/redis.service; enabled; preset: disabl>
     Active: active (running) since Wed 2023-05-24 15:56:57 UTC; 1min 12s ago
       Docs: https://github.com/NethServer/ns8-core
    Process: 1689 ExecStartPre=/bin/rm -f /run/redis.pid /run/redis.cid (code=e>
    Process: 1690 ExecStart=/usr/bin/podman run --conmon-pidfile=/run/redis.pid>
    Process: 2140 ExecStartPost=/usr/bin/bash -c while ! exec 3<>/dev/tcp/127.0>
    Process: 2212 ExecStartPost=/usr/local/bin/acl-load (code=exited, status=0/>
    Process: 2223 ExecStartPost=/usr/local/sbin/apply-vpn-routes (code=exited, >
   Main PID: 2076 (conmon)
      Tasks: 1 (limit: 48748)
     Memory: 1.1M
        CPU: 1.060s
     CGroup: /system.slice/redis.service
             └─2076 /usr/bin/conmon --api-version 1 -c f49abb81d38341a971820806>

May 24 15:56:53 node.ns8.test redis[2076]: 1:M 24 May 2023 15:56:53.819 * Done >
May 24 15:56:53 node.ns8.test redis[2076]: 1:M 24 May 2023 15:56:53.819 * DB lo>
May 24 15:56:53 node.ns8.test redis[2076]: 1:M 24 May 2023 15:56:53.819 * Ready>
May 24 15:56:54 node.ns8.test redis[2212]: ACLs loading skipped on the leader n>
May 24 15:56:55 node.ns8.test redis[2223]: wg set wg0 peer ywiHj3ul4V8eGxhTBh2J>
May 24 15:56:55 node.ns8.test redis[2223]: Address 192.168.0.109 is not routed >
May 24 15:56:55 node.ns8.test redis[2223]: ip route replace 10.5.4.1 nexthop de>
May 24 15:56:55 node.ns8.test redis[2223]: wg-quick save wg0
May 24 15:56:56 node.ns8.test redis[2274]: [#] wg showconf wg0
May 24 15:56:57 node.ns8.test systemd[1]: Started Core Redis DB.
[root@node ~]# podman ps -a
CONTAINER ID  IMAGE                                           COMMAND               CREATED        STATUS        PORTS       NAMES
9860614080f7  docker.io/grafana/promtail:2.7.3                -config.file=/etc...  4 minutes ago  Up 4 minutes              promtail1
f49abb81d383  ghcr.io/nethserver/redis:1.0.1                  redis-server /dat...  4 minutes ago  Up 4 minutes              redis
2e38a5f3c433  docker.io/crowdsecurity/crowdsec:v1.4.6-debian                        4 minutes ago  Up 4 minutes              crowdsec1
[root@node ~]#

It didn’t make any difference; I’m still seeing the error, and the bars on the icons are flowing back and forth continuously.

The only difference is that the log isn’t getting flooded with millions of CrowdSec messages this time.

Still seeing messages like this:

May 24 16:04:53 node node[834]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node crowdsec1[833]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node cluster[832]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node promtail1[835]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node ldapproxy1[1109]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node loki1[1097]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node traefik1[1089]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node minio1[1075]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node samba1[1065]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node webserver1[1081]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused
May 24 16:04:53 node mail1[1117]: Task queue pop error: dial tcp 127.0.0.1:6379: connect: connection refused

Cheers.

Maybe SELinux?
Look inside /var/log/audit/audit.log, or just try whether setenforce 0 fixes the issue.
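
For example, recent denials can be listed with ausearch (from the audit package), or by grepping the log directly:

ausearch -m avc -ts recent
grep denied /var/log/audit/audit.log

Note that setenforce 0 only switches SELinux to permissive mode until the next reboot, so it is safe as a test.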


That might have worked. I want to do a couple more tests, though, before I am certain, based on the way it asked me to renew my credentials partway through.

Cheers.


Really wild question from left field.

Could any of the above be caused by trying to access the UI while the admin user’s session was expired? During all the tests, even through the reboot, I had a browser session open and was only using the menu to navigate.

It was only when I entered the SELinux command that I hit refresh in the browser, which said the user had expired and I needed to log in again. After that, everything worked.

I also rebooted again, made sure I logged in following the reboot, and everything is working without having to use the SELinux command.

Cheers.


Yes, I think so. There’s also a notification when the websocket is lost, and you then have the option to reload.

Which browser did you use?


Chrome.

Is the expiry time of the user based on sign-in time or on last use? What is the expiry time?

Cheers.


I do not recall the details, but maybe @andre8244 or @edoardo_spadoni can give more info.


Hi Markus, the expiry time is based on sign-in time and occurs after two weeks.
