NethServer 8 cluster joining error

wahmed · April 27, 2024, 10:52pm

NethServer Version: NS8
Module: Cluster

Now that I have an NS8 with at least mail and WebTop working, I wanted to install a second NS8 and join it to the cluster to see the feature in action.
I installed the second NS8 on a new VM with an identical resource configuration, on Debian with BTRFS.

After the install I clicked on Join Cluster and copy/pasted the join code from the leader node. No matter how many times I tried or how long I waited for the task to finish, the screen just sits at a spinning wheel.

I did not think I needed any further configuration on the leader node for the cluster feature to work. Have I missed something? I should mention that I tried to join the node with TLS validation both enabled and disabled, with the same result.

mrmarkuz · April 28, 2024, 9:31am

I really can’t understand why there are so many issues. Are the system requirements fulfilled?
I’m sorry to repeat myself but…in logs we trust. Please share logs so we’re able to check what’s going on.
Could you please also share some hardware information (like RAM, CPU and disk type) and whether there’s a special (non-default) Proxmox VM config?
Is this VM stored on Ceph?
Which Debian install image did you use? The installation should really be minimal, see Minimum Requirements To Run NethServer 8 - #6 by mrmarkuz

Some ideas:

I assume you disabled the firewall for the virtual network device of the NS8 VM in Proxmox.

Let’s check the nodes’ FQDNs:

hostname -f

Can the future worker node reach the cluster leader HTTPS port? (gives back “Found” when working)

curl -k https://nodename.domain.tld/cluster-admin
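The reachability check above could be wrapped in a small script, as a rough sketch. The function names leader_reply_ok and check_leader are made up for illustration, nodename.domain.tld is the same placeholder as in the command above, and the only assumption about the leader is the one already stated: a working leader answers with “Found”.

```shell
#!/bin/sh
# Hypothetical helper: does the reply text look like a healthy leader?
# Assumption (from the check above): a working leader answers "Found".
leader_reply_ok() {
    case "$1" in
        *Found*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: query the leader and classify the reply.
# $1 = leader FQDN, e.g. the placeholder nodename.domain.tld
# Exit status: 0 = "Found", 1 = unexpected reply, 2 = curl itself failed.
check_leader() {
    reply=$(curl -k -s --max-time 10 "https://$1/cluster-admin") || return 2
    leader_reply_ok "$reply"
}
```

Running check_leader with the leader's FQDN on the future worker and then looking at the exit status separates "no connection at all" (2) from "connected, but got something unexpected" (1).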

I tested NS8 on Debian 12 with ext4 and xfs and there were no issues. I think the filesystem is possibly the wrong direction, as podman seems to work with most filesystems.
There are, for example, known issues regarding NFS mounts, see Ns8 merged error - #15 by davidep

No, I don’t think you missed anything; no additional configuration is needed on the leader.

wahmed · April 28, 2024, 6:28pm

Indeed, it is odd that my instance is having so many issues. I do not consider my virtual environment unique as far as architecture goes. Here are the details of the VM for NS8:
Hypervisor: Proxmox 7.4-13
VM Type: KVM
Cores/RAM/VM Disk Type: 4/8GB/VirtIO
Storage: Initially NS8 was on Ceph. Currently running on local ZFS storage
No special Proxmox configuration. No Proxmox firewall

Which log would be helpful for finding out why the new node cannot join? The log from the leader or from the slave?
If you need the log from the slave, where are the logs stored? Since I cannot log in to the GUI without first creating or joining a cluster, I cannot pull the log via the cluster-admin GUI.

The new node just returns Found for the curl command with the FQDN of the leader.

mrmarkuz · April 28, 2024, 6:52pm

Thanks for the info.

Logs from both nodes, I guess. To get the logs you could also use journalctl on the CLI.

To export a time range of the logs to a file:

journalctl --since "2024-04-28 11:30" --until "2024-04-28 11:40:00" > mylogs.log
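To take the same export on both nodes with identical time limits, the command can be wrapped in a small function. This is only a sketch: export_window is a made-up name, while --since, --until and --no-pager are standard journalctl options.

```shell
#!/bin/sh
# Hypothetical helper: export a time range of the journal to a file.
# $1 = start time, $2 = end time, $3 = output file
export_window() {
    journalctl --since "$1" --until "$2" --no-pager > "$3"
}
```

For example, export_window "2024-04-28 11:30" "2024-04-28 11:40:00" leader.log on the leader and the same call with another file name on the worker. journalctl also accepts relative times such as "1 hour ago" and "now".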

That “Found” reply is good news.

wahmed · April 28, 2024, 8:09pm

After a few attempts I was able to move past the cluster joining page. The leader node now actually recognizes that a node was added to the cluster. However, it shows the node as offline, although the node is online and accessible.

After this my leader node GUI is now struggling. It became very slow and mostly sits with gray placeholders.

I have included the logs from both nodes below. Hopefully they can shed some light.
NS8 leader node error

NS8 slave node error log

mrmarkuz · April 29, 2024, 8:26pm

Apr 28 12:37:43 ca0401nth02 agent@cluster[7550]: Error: NAME_CONFLICT: new_service(): 'ns-wireguard'
Apr 28 12:37:43 ca0401nth02 agent@cluster[7550]: task/cluster/42a1ae52-d05d-43a6-a66a-1fb1d87d710e: action "join-node" status is "aborted" (26) at step 20wgboot
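Lines like the two above can be pulled out of a long exported log with a simple filter. A minimal sketch, assuming only the two text patterns visible in these lines ("Error:" and status is "aborted"); failed_tasks is a made-up name, and other failures may well be worded differently.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that look like failures.
# Reads the files given as arguments, or standard input if there are none.
failed_tasks() {
    grep -E 'Error:|status is "aborted"' "$@"
}
```

It can be fed an exported file (failed_tasks mylogs.log) or a pipe (journalctl --no-pager | failed_tasks).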

There’s a similar issue:

To list the services in the firewall:

firewall-cmd --list-services --zone public

To remove the wireguard firewall service before retrying to join:

firewall-cmd --remove-service=ns-wireguard
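The two firewall commands above can be combined so that the removal only runs when the service is actually listed. This is a sketch under assumptions: has_service and remove_if_present are made-up names, and the public zone is taken from the list command above.

```shell
#!/bin/sh
# Hypothetical helper: is a name in a space-separated service list?
# $1 = service list as printed by --list-services, $2 = service name
has_service() {
    for s in $1; do
        [ "$s" = "$2" ] && return 0
    done
    return 1
}

# Hypothetical helper: remove a service from the public zone if present.
# $1 = service name, e.g. ns-wireguard
remove_if_present() {
    services=$(firewall-cmd --list-services --zone=public) || return 2
    if has_service "$services" "$1"; then
        firewall-cmd --zone=public --remove-service="$1"
    else
        echo "$1 is not enabled in the public zone"
    fi
}
```

Called as remove_if_present ns-wireguard before retrying the join. Note that without --permanent, firewall-cmd only changes the runtime configuration.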

wahmed · May 12, 2024, 10:19pm

I made one last attempt to get this to work.

I created 2 fresh VMs with identical specs (2 vCPU, 8GB RAM) and installed Debian 12.
I installed NS8 on both VMs following the guide, using the curl command.
On node 1, I created a new cluster and nothing else: no app, no configuration. Even after waiting for hours and refreshing many times, the GUI never fully loads. Something is always missing and the page looks unfinished.

On node 2, I clicked on Join Cluster and copy/pasted the join code from node 1. The join never finishes. The GUI shows the spinning wheel indefinitely.

I am all out of ideas to try on this. How much more basic could I get than a clean, freshly installed OS?

Is there any Debian user out there who has a fully functioning NS8 that loads the GUI every time?

stephdl · May 13, 2024, 7:51am

Look in the logs on both servers. I suspect you ran into a timeout like the other day. What I can say is that you have now reproduced the issue twice.

wahmed · May 13, 2024, 7:51pm

Where does Neth store the logs, so I can access them via the console?
I can reproduce the issue repeatedly on a fresh Debian and cannot figure out what it is about my environment that causes this. Hundreds of other VMs are working fine. I don’t want to believe that NS8 requires something very special.

Andy_Wismer · May 13, 2024, 8:15pm

@wahmed

I just screwed up another attempt to migrate to a freshly installed Debian 12…
Worked well, cluster worked.

Started Migration from freshly updated NS7 and BOOM !!!

After reboot, NS8 just spins a wheel, NS7 won’t connect anymore. No network access to NS8, only console via Proxmox.

I was even using BtrFS and on a very performant Proxmox with local NVME storage…

I still don’t see myself using Rocky or Alma in the future. Almost all the code in both is untrusted Red Hat code…

This statement is exactly what I am worried about…
Another nice one:

Developed by Red Hat or “intensive development” by the community? You can’t have both, or bugs would (should) disappear bit by bit…

At least backups / rollbacks of plain vanilla Debian 12 and NS7 using PBS are very fast!

I’d really like to see NS8 run with the same stability I see in all other Debian based products I use, among them Proxmox PVE & PBS, OpenMediaVault and others…

My 2 cents
Andy

PS

I do have a working NS8 running on Debian as a VM on Proxmox.
But that’s exactly ONE from 30…

stephdl · May 13, 2024, 8:37pm

journalctl is the way; run it on both servers at the same time.

stephdl · May 13, 2024, 8:48pm

I just tested on Debian 12 and it works as expected, but my hypervisor is a race horse, dedicated only to cloning and starting machines as fast as possible.

At least I am honest

it is interesting to use real hardware/hypervisor.

this scares me

what are the load and the iowait on the hypervisor when you install the cluster ???

wahmed · May 14, 2024, 1:23am

Looks like I have solved the cluster joining issue. Thanks @stephdl for the journalctl tip. I ran it on both servers, then attempted to join the second one to the cluster again. It looks like there was a DNS issue: NS8 was looking up the FQDN of the host instead of using the IP. I added the host to /etc/hosts. After that the node joined the cluster fine.
I still have the very slow GUI loading issue, but at least I was able to join the cluster.

Not sure what that really meant. Probably something I don’t want to get into.

This was not meant to show off the prowess of my virtual environment, but to convey that all other VMs were working without any IO issue or anything related to the hypervisor itself. The issue was most likely isolated to the NS8 deployment itself, as turned out to be the case in this instance. Even though it was my DNS causing the issue, the issue was still inside NS8 itself and not outside.

This cluster joining issue can be called closed. Bottom line: NS8 nodes use each other’s FQDNs to reach each other, so each FQDN must resolve, either through a DNS entry or through an entry in /etc/hosts.
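For anyone hitting the same thing, here is a rough sketch of how the name resolution could be checked on each node before joining. The helper names are made up, and the address and names in the comments (192.0.2.10, node1.example.org) are documentation placeholders, not values from my setup.

```shell
#!/bin/sh
# Example /etc/hosts line (placeholder values):
#   192.0.2.10  node1.example.org  node1

# Hypothetical helper: does a hosts file contain a given name?
# $1 = hosts file, $2 = name to look for. Comments are ignored.
hosts_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # strip comments, fields are re-split
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$1"
}

# Hypothetical helper: does the system resolver know the name?
# getent consults /etc/hosts and DNS in the configured order.
resolves() {
    getent hosts "$1" > /dev/null
}
```

On each node, resolves followed by the FQDN of the other node (as printed there by hostname -f) should succeed. If it does not, hosts_has_name /etc/hosts with that FQDN tells whether the /etc/hosts entry is missing.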

stephdl · May 14, 2024, 6:01am

Yep, DNS matters a lot. You can think of NS8 as a Proxmox running plenty of VMs: that is what containers are, just little OSes.

Regarding DNS with /etc/hosts: yes, the containers can read it, but it is better to have a DNS wildcard pointing to the DNS entry of your server.

In my home lab I have an NS7 acting as a firewall with a DNS wildcard (a *.deb12-pve.org DNS entry)… everything is routed to the NS8 VM, and I did the same for each VM. I never have a DNS issue in tests: when you install a module, every sub.deb12-pve.org name is routed to the VM and usable by the module.

Regarding the slow UI, the consensus is that the UI is quite fast, so there may be another issue on your side. I would monitor the hardware with glances to check the load, iowait, bandwidth… of your hypervisor server.
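If glances is not installed, a quick first look at load and iowait is possible with plain shell, as a rough sketch. load1 and iowait_percent are made-up names; the field positions follow the Linux /proc/loadavg and /proc/stat layouts, and the iowait figure is an average since boot, not a live value.

```shell
#!/bin/sh
# Hypothetical helper: 1-minute load average.
# $1 = loadavg file (defaults to /proc/loadavg)
load1() {
    cut -d' ' -f1 "${1:-/proc/loadavg}"
}

# Hypothetical helper: share of CPU time spent in iowait since boot.
# $1 = stat file (defaults to /proc/stat)
# In the aggregate "cpu" line, fields 2..9 are user, nice, system,
# idle, iowait, irq, softirq and steal; iowait is field 6.
iowait_percent() {
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9 && i <= NF; i++) total += $i
        if (total > 0) printf "%.1f\n", 100 * $6 / total
        exit
    }' "${1:-/proc/stat}"
}
```

Running both on the hypervisor while the cluster is being installed gives a first hint; a tool like glances shows the live values.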

I have a RAID 6 of spinning SATA drives on my Proxmox, and I recall that the real game changer was adding an NVMe drive for my testing work: less than 2 minutes to bring up an NS8. Of course, I still use the spinning drives when I want to simulate slow computers.

In the beginning we had some service synchronisation issues, with containers that could start querying a database before it was fully up, but it should be good now… at least it is a case we are now aware of.

Timeouts could be an issue with systemd because we set a limit, but a user can increase it manually through the systemd template system if he wants.
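As an illustration only, this is roughly what raising a start timeout with a systemd drop-in looks like. Everything specific here is an assumption: the unit name example@.service is a placeholder and not a real NS8 unit, TimeoutStartSec is assumed to be the relevant limit, and timeout_dropin is a made-up helper.

```shell
#!/bin/sh
# Hypothetical helper: print a drop-in that sets the start timeout.
# $1 = timeout in seconds
timeout_dropin() {
    printf '[Service]\nTimeoutStartSec=%s\n' "$1"
}

# Possible usage as root, with a placeholder unit name:
#   mkdir -p /etc/systemd/system/example@.service.d
#   timeout_dropin 300 > /etc/systemd/system/example@.service.d/timeout.conf
#   systemctl daemon-reload
```

A drop-in on a template unit (name ending in @.service) applies to all instances of that template.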

I am happy to see you testing NS8; the new version of mail, 1.4, will bring a lot of improvements.