NethServer crashes (kernel panic) in conntrack "nf_ct_deliver_cached_events"

yummiweb · August 31, 2021, 8:07am

NethServer Version: NethServer release 7.9.2009 (final)
Kernel Release: 3.10.0-1160.36.2.el7.x86_64
Module: conntrack?

Host: PROXMOX_VE (standalone)
Virtual NIC: VirtIO (paravirtualized)

Hi,

My NethServer has been crashing for a few days with the message “nf_ct_deliver_cached_events” on the console.
I have enabled “kernel.log”, but it contains no helpful messages about the crash.

First, I checked the virtual hard disk (no problems) and then the underlying physical disk (an SSD). That SSD was a few years old and showed some SMART errors, so I moved the virtual disks (with the VM stopped) in Proxmox to a new SSD. I also restarted Proxmox because a kernel update was pending. But the NethServer crashes came back (at varying intervals), and the new SSD shows no errors. Other VMs (qm) and containers (ct) run on the same Proxmox host and the same SSD without problems.

Then I went back to the second-newest kernel (on the NethServer) and afterwards reinstalled the newest kernel. No change, the problem persists.

Next I would install “crash” to get a full crash dump, but maybe the “nf_ct_deliver_cached_events” message in the panic output (shown in the Proxmox console) is enough information.

Could it be a problem with “conntrack”? What could cause it? Could network attacks or a DDoS be overfilling a corresponding cache?

Any suggestions?

Regards
yummiweb

filippo_carletti · August 31, 2021, 8:44am

Is suricata (IDS) enabled?
If true, could you please try to temporarily disable it?

yummiweb · August 31, 2021, 9:09am

Yes, the Intrusion prevention system is enabled.

I could disable it for some time, but I have doubts, because some log entries in “/var/log/suricata/fast.log” from a few minutes before each crash show:

This time:
“[Drop] [] [1:2403399:68355] ET CINS Active Threat Intelligence Poor Reputation IP group 100 [] [Classification: Misc Attack] [Priority: 2] {TCP} SOURCE_IP:57810 -> TARGET_IP:443”

Last time:
[Drop] [] [1:2017616:4] ET SCAN NETWORK Incoming Masscan detected [] [Classification: Detection of a Network Scan] [Priority: 3] {TCP} SOURCE_IP:21345 -> TARGET_IP:80
08/29/2021-02:36:01.793293 [Drop] [] [1:2403365:68331] ET CINS Active Threat Intelligence Poor Reputation IP group 66 [] [Classification: Misc Attack] [Priority: 2] {TCP} SOURCE_IP:54775 -> TARGET_IP:443
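To check whether entries like these really cluster just before each crash, the log can be filtered by time. The following is only a minimal sketch: the names `parse_line` and `alerts_before` are hypothetical helpers, and the pattern is derived solely from the timestamped line quoted above. The separator is accepted both as `[]` (as quoted here) and as `[**]`; lines without a leading timestamp are skipped.

```python
import re
from datetime import datetime, timedelta

# Pattern for one timestamped fast.log line, modelled on the quoted example.
# Assumption: the action tag ("[Drop]") may be absent on plain alert lines.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2}\.\d{1,6})\s+"
    r"(?:\[(?P<action>[A-Za-z]+)\]\s+)?"
    r"\[\**\]\s+"
    r"\[(?P<gid>\d+):(?P<sid>\d+):(?P<rev>\d+)\]\s+"
    r"(?P<msg>.*?)\s+\[\**\]\s+"
    r"\[Classification:\s*(?P<classification>[^\]]*)\]\s+"
    r"\[Priority:\s*(?P<priority>\d+)\]\s+"
    r"\{(?P<proto>[^}]+)\}\s+"
    r"(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s*$"
)


def parse_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    # Timestamps in the quoted line use month/day/year and microseconds.
    alert["time"] = datetime.strptime(alert.pop("ts"), "%m/%d/%Y-%H:%M:%S.%f")
    for key in ("gid", "sid", "rev", "priority"):
        alert[key] = int(alert[key])
    return alert


def alerts_before(lines, crash_time, window_minutes=10):
    """Return parsed entries logged within window_minutes before crash_time."""
    start = crash_time - timedelta(minutes=window_minutes)
    parsed = (parse_line(line) for line in lines)
    return [a for a in parsed if a is not None and start <= a["time"] <= crash_time]
```

Passing an open file handle for the log and the time of a crash (taken from the Proxmox console, for example) to `alerts_before` returns the entries from the preceding ten minutes. If the same signature IDs also show up at times when nothing crashed, they are unlikely to be the trigger on their own.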

If I deactivate the IPS, such packets would no longer be dropped, I think.

The time between crashes ranges from under a day to several days, so the NethServer would be unprotected during that time.

Is there some cache that could be enlarged, or something similar?
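One way to test the "overfilled cache" idea is to compare the number of tracked connections with the table limit. This is a sketch under assumptions: it reads the standard procfs entries `nf_conntrack_count` and `nf_conntrack_max`, which only exist while the `nf_conntrack` module is loaded, and the helper names are made up for illustration. Note that a full table normally shows up in the kernel log as "nf_conntrack: table full, dropping packet" rather than as a panic, so this can rule the hypothesis out but not confirm a cause.

```python
from pathlib import Path

# Standard location of the conntrack sysctls on Linux (module must be loaded).
PROC_DIR = Path("/proc/sys/net/netfilter")


def read_int(path):
    """Read a single integer from a procfs file."""
    return int(Path(path).read_text().strip())


def conntrack_usage(count, maximum):
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def main():
    try:
        count = read_int(PROC_DIR / "nf_conntrack_count")
        maximum = read_int(PROC_DIR / "nf_conntrack_max")
    except OSError as exc:
        # Typically means the module is not loaded or this is not Linux.
        print(f"conntrack counters not available: {exc}")
        return
    usage = conntrack_usage(count, maximum)
    print(f"{count} of {maximum} entries in use ({usage:.1%})")


if __name__ == "__main__":
    main()
```

Run periodically (from cron, for example), this shows whether usage ever approaches the limit in the minutes before a crash.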

yummiweb · August 31, 2021, 9:17am

Maybe I could run a copy of this NethServer VM in a separate environment or with a disabled NIC, but I guess the problem would not appear if no packets are processed in netfilter/conntrack.

filippo_carletti · August 31, 2021, 10:37am

Correct, it will be unprotected from some kinds of threats.

yummiweb · August 31, 2021, 8:18pm

Then I will first prepare an external firewall or filter.

In the meantime I have changed the Proxmox network adapter to “vmxnet3”. Let's see what happens.

pike · August 31, 2021, 9:15pm

Which is a “quite good old piece of software”. With good and bad sides.

yummiweb · September 1, 2021, 5:47pm

OK, the type of network adapter has nothing to do with it. Same crash today.

In the meantime there was a kernel update (3.10.0-1160.41.1.el7.x86_64), which I had already installed on another NethServer VM on a different host in a different network. Now the same crash has occurred on that machine as well.

I thought I could set up an alternative firewall solution in the meantime, but I would have to rearrange the network structure too much. Unfortunately, that is too time-consuming at the moment. So I have to deactivate the IPS without replacing it with another solution.

A copy of the problematic machine (with disabled NIC) has run without problems so far.

yummiweb · September 7, 2021, 9:48pm

After deactivating the IPS, the server ran for 4 days without problems. Now I am reactivating IPS to see what happens.

If the IPS is really the cause, where should I investigate further?

The IPS always ran without problems before, and the NethServer ran normally for months.

So far the only problems with the firewall or ThreatShield appeared after I blocked certain IP ranges (GeoIP blocking). While that was active, the firewall sometimes failed to start at boot or during a reconfiguration (observed on different systems). So I eventually deactivated this function again. But that was months ago, and the server ran without any problems after that. Until now…

filippo_carletti · September 8, 2021, 8:35am

My guess is that the virtual machine does not have enough CPU resources to keep up with traffic analysis.
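A rough way to check this guess from inside the guest is to sample the aggregate `cpu` line of `/proc/stat` twice and look at where the time went. This is only a sketch: the function names are hypothetical, and it considers just the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal). A high `softirq` share would point to packet processing, a high `steal` share to the hypervisor not giving the VM enough CPU time.

```python
import time

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if len(parts) < 1 + len(FIELDS) or parts[0] != "cpu":
        raise ValueError("not an aggregate cpu line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Return each counter's share of the ticks elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: delta / total for name, delta in deltas.items()}


def sample(interval=1.0, path="/proc/stat"):
    """Take two samples interval seconds apart and return the shares."""
    with open(path) as fh:
        before = parse_cpu_line(fh.readline())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_cpu_line(fh.readline())
    return cpu_shares(before, after)


if __name__ == "__main__":
    try:
        for name, share in sample().items():
            print(f"{name:8s} {share:6.1%}")
    except (OSError, ValueError) as exc:
        print(f"could not read CPU counters: {exc}")
```

Values taken during normal traffic and again under load give a baseline for judging whether adding vCPUs is likely to help.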

yummiweb · October 20, 2021, 6:11pm

Thanks for the hint.

Because the Nethserver runs in a VM, I have adjusted its resources upwards accordingly in several stages. Unfortunately, this did not bring lasting success.

In the meantime, identical problems also occurred in another installation in a different network. I therefore now have to delegate the corresponding tasks (IPS/IDS) to an OPNsense instance.

This makes things a little more complex and maintenance-intensive, but having a separate firewall might not be the worst idea after all. I’ve been postponing this decision for a while, and now fate has given me a little kick in that direction.

Thank you for your help!