NS8 just killed itself

Hi everyone… I think I’m going to have to bin NS8.
I restarted the Proxmox system today, and now I get the error below (even after rolling back I get the same error; it occurs once memory has been cleared, or after a full reboot).

{"context":{"action":"list-domain-groups","data":{"domain":"ad.flashelectrical.co.nz"},"extra":{"eventId":"72007e69-2d58-4e07-93da-582bf8e317cc","isNotificationHidden":true,"title":"List domain groups"},"id":"d5fa6924-8644-42fa-a691-0353f333d51c","parent":"","queue":"cluster/tasks","timestamp":"2024-01-01T22:01:56.77019952Z","user":"admin"},"status":"aborted","progress":100,"subTasks":[],"validated":false,"result":{"error":"Traceback (most recent call last):\n  File \"/var/lib/nethserver/cluster/actions/list-domain-groups/50list_groups\", line 33, in <module>\n    groups = Ldapclient.factory(**domain).list_groups()\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/agent/pypkg/agent/ldapclient/__init__.py\", line 29, in factory\n    return LdapclientAd(**kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/agent/pypkg/agent/ldapclient/base.py\", line 37, in __init__\n    self.ldapconn = ldap3.Connection(self.ldapsrv,\n                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/agent/pyenv/lib64/python3.11/site-packages/ldap3/core/connection.py\", line 363, in __init__\n    self._do_auto_bind()\n  File \"/usr/local/agent/pyenv/lib64/python3.11/site-packages/ldap3/core/connection.py\", line 389, in _do_auto_bind\n    self.bind(read_server_info=True)\n  File \"/usr/local/agent/pyenv/lib64/python3.11/site-packages/ldap3/core/connection.py\", line 607, in bind\n    response = self.post_send_single_response(self.send('bindRequest', request, controls))\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/agent/pyenv/lib64/python3.11/site-packages/ldap3/strategy/sync.py\", line 160, in post_send_single_response\n    responses, result = self.get_response(message_id)\n                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/agent/pyenv/lib64/python3.11/site-packages/ldap3/strategy/base.py\", line 370, in get_response\n    raise 
LDAPSessionTerminatedByServerError(self.connection.last_error)\nldap3.core.exceptions.LDAPSessionTerminatedByServerError: session terminated by server\n","exit_code":1,"file":"task/cluster/d5fa6924-8644-42fa-a691-0353f333d51c","output":""}}
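The final exception, LDAPSessionTerminatedByServerError, means the Samba AD DC dropped the session during the bind (or the service behind the port isn’t healthy). As a first sanity check, a stdlib-only sketch like this can confirm whether the LDAP ports are even accepting connections — the host address here is an assumption; point it at the Samba container’s actual IP:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical address — substitute the Samba AD container's IP.
for port in (389, 636):  # LDAP and LDAPS
    print(port, port_open("127.0.0.1", port))
```

If the ports answer but the bind still dies, the problem is inside the DC (service startup order, time sync, or database state) rather than networking.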

Consider the hours I have spent manually migrating over to NS8 because the migration tool did not work correctly… There were a couple of errors I resolved along the way (I had to remove WebTop: even though I had correct DNS and a virtual hostname assigned on the old server, I got an error saying there was no virtual host). Then the migration worked (all green ticks), but the following day everything that had been migrated was deleted by the sync script and the software uninstalled. Since I had a bare-metal server (I was unable to install Proxmox at the time because the hardware was too new), I could not redo the migration even after following the re-enable-services steps on GitHub. I could migrate Nextcloud, but all other options failed or could not be run (no option to sync or migrate the domain or file server).

I can’t log in as any user, or access groups or shares. I needed this running today, and now I have nothing to show for my time. I have spent the whole Christmas holidays setting up two servers, and both are now defunct.

Question: where does the LDAP service live, and how do I find it to debug this? Is it a corrupt database? I highly doubt it, but one never knows — both servers could have decided to corrupt their ZFS disks (even though they are RAID 10).
DNS servers have not changed and are working the same as when I started setup.

Overall I have enjoyed Nethserver and its community, and even after checking out UCS/Zentyal/YunoHost, Nethserver is still the best option for our business, with the least amount of issues. It is a shame that this NS8 RC1 release is not actually production-ready, as per the notice. If anyone from the design team can point me in the right direction to fix the above error I’d be extremely grateful, but for now I’m taking a backup from a week ago and rolling out 7.9 on the Proxmox, which hopefully will work.

Sorry to hear that.
This won’t provide a solution but the error is similar to the one reported here:


Hi @Turbond

I am considered the local Proxmox “guru” on this forum, but I can’t quite follow your story…

Here, you talk about bare-metal servers and hardware being too new…
(And no bears are harmed in the process!) :slight_smile:

yet at the start of the post you say:

so I’m not quite sure where we stand, or what is the cause for a corrupted ZFS system.

Is this a Proxmox Issue?
Is this a hardware issue?
Is this really a NS8 issue?

What Proxmox setup are we talking about? What kind of CPU, RAM, disks, system (ZFS boot?)…
And, just as important: PBS available, NAS or only local backups?

If it’s as urgent as you say, you may contact me now per PM…

My 2 cents


Hi Andy…

Sorry for the confusion…

The first server I built is around seven years old and was using NVMe SSDs. At the time, my RAID array was not compatible with Proxmox… although I was really hoping to use it back then. So I ended up with a bare-metal server that has no rollback features, which is why I could not do the migration again: it appears the NS8 migration sets some flags that are not removed by the re-enable options mentioned on GitHub (ns7-migration tool).

I was wondering if the Proxmox ZFS pool was corrupt. I’ve done a little more testing, and now I’m wondering if the Samba container is timing out or has time-sync issues on restart, due to the large number of files I’ve just copied over (500,000-odd).

Restarting the Proxmox VM with its current memory intact, I get the same issue, but a systemctl restart chronyd.service resolves it after a minute or two. It’s very frustrating, as I’m not sure whether it’s an LDAP service not starting or a Samba service issue. I’m not up to speed with the container version of NS8 yet :slight_smile: And yes, no bears were harmed… sorry, I’ve had little sleep, as I’ve been working on server migrations (bare metal, I meant).
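If time sync is the trigger, that would fit: Kerberos, which Samba AD uses for authentication, rejects clients whose clocks are more than 5 minutes out of step by default. As a sketch, one could parse the offset out of the "System time" line that chronyc tracking prints (the sample values below are made up for illustration):

```python
import re

# Hypothetical sample of the "System time" line from `chronyc tracking`.
sample = "System time     : 0.000123456 seconds fast of NTP time"

def clock_offset(line: str) -> float:
    """Return the clock offset in seconds: positive if fast, negative if slow."""
    m = re.search(r"System time\s*:\s*([\d.]+) seconds (fast|slow)", line)
    if not m:
        raise ValueError("unexpected chronyc tracking output")
    value = float(m.group(1))
    return value if m.group(2) == "fast" else -value

# Kerberos' default clock-skew tolerance is 300 seconds.
print(abs(clock_offset(sample)) < 300)  # prints True
```

An offset anywhere near 300 seconds right after boot would explain why a chronyd restart (forcing a resync) brings LDAP back.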

OK, so if I understand the situ right, you have an older server with NS7 running on Proxmox.
The newer server is still bare metal.


Can understand that very well. Did my first NS8 migration 2 months ago, for a doctor’s practice.
NS8 wasn’t the issue in the end, it was storage / NAS / Backups…



Hi Andy,
Old server: NS7 running on bare metal.
New server: a Proxmox VM with 4 cores and 8GB RAM.
(I’d like to increase the RAM and CPU cores, but then I end up with the LDAP error above.)

What is the new server exactly (hardware?)


Here is what I’m using at said doctor’s practice:

NS8 here is running with AD, no LDAP.

Server is a HPE Proliant:

VMs are running off a PCI dual-NVMe card; the system itself is on the original HP RAID on small SSDs…


NS8 has now been running for 2 Months…


Hi Andy,
The new server hardware is HP Z640s with 6TB HDD (RAID 10) and 64GB RAM. Thanks for the images… I’d like to match your setup, but then I end up with the issue I posted at the top when restarting the server… I was actually going to increase to 32GB RAM, as we have 75 users. My scsi0 is 2000GB in size.

Sorry, as I wasn’t quite familiar with your hardware, needed to check out the specs…

I do read in the specs that the RAM seems officially limited to 32 GB, yet you have 64 inside now…

Can the VM start & work with 16 GB RAM?


The VM starts up with 16GB, but the LDAP does not :frowning:

Any reason why LDAP, when you’re using Samba?


The current model I have handles up to 128GB of RAM. HP created a few models of this workstation/server combo.


If you look at the error, it refers to LDAP even though I’m using Samba. I set it up as a Samba domain, with users, groups and shares all in the web GUI.

What filesystem are you using on the HP Z640? ext4? XFS? ZFS?
I assume you’re using HP’s built-in RAID for the RAID 10?


I’m using ZFS, and no — the inbuilt array is software-based and would cause me grief. I’m using Proxmox’s software ZFS RAID instead (which is more stable).


AKA Fake-RAID…


Another important question:

Did you set any RAM limitations for ZFS?
If not, ZFS will eat up half your RAM, even if you have a petabyte of RAM…

Also there’s the question of SWAP. Any SWAP partition prepped?


Hi Andy… Yes, I’ve set a limit on ZFS eating my RAM lol.
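For anyone landing here later: on a Proxmox/ZFS host the usual way to cap the ARC is a modprobe option. The 8 GiB value below is only an example — pick a cap that leaves the VMs enough headroom, then run update-initramfs -u and reboot for it to take effect:

```
# /etc/modprobe.d/zfs.conf — example: cap the ZFS ARC at 8 GiB (8 * 1024^3 bytes)
options zfs zfs_arc_max=8589934592
```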

And swap?

A lot of folks aren’t aware of the fact that all UN*Xes and Linux run better with SWAP than without, even if you have a petabyte of RAM!
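Since swap came up: on a Proxmox host, a swap file is typically created once (with fallocate, mkswap and swapon) and then made permanent with an fstab entry along these lines — the path and size here are purely illustrative:

```
# /etc/fstab — hypothetical swap file entry
/swapfile none swap sw 0 0
```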


I only have a 32GB swap file… I didn’t think I’d need more.