NethServer crashes (kernel panic) in conntrack "nf_ct_deliver_cached_events"

NethServer Version: NethServer release 7.9.2009 (final)
Kernel Release: 3.10.0-1160.36.2.el7.x86_64
Module: conntrack?

Host: PROXMOX_VE (standalone)
Virtual NIC: VirtIO (paravirtualized)

Hi,

My NethServer has been crashing for a few days with an “nf_ct_deliver_cached_events” message on the console.
I have enabled “kernel.log”, but it contains no helpful messages about the crash.

First, I checked the virtual hard disk (no problems) and then the underlying physical disk (SSD). That SSD was a few years old and showed some SMART errors, so I moved the virtual disks (with the VM stopped) in Proxmox to a new SSD. I also restarted Proxmox because a kernel update was pending. But the NethServer crashes came back (at varying intervals), and the new SSD shows no errors. Some other VMs (qt) and containers (ct) run on the same Proxmox host and the same SSD without problems.

Then I went back to the second-newest kernel (on the NethServer) and afterwards reinstalled the newest kernel. No change, the problem persists.

Next I would install “crash” to get a “full” crash dump, but maybe the “nf_ct_deliver_cached_events” message in the panic output (shown in the Proxmox console) is already enough information.

Could it be a problem with “conntrack”? What could be the cause? Could network attacks or a DDoS be overfilling a corresponding cache?
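On the question of an overfilled cache: the connection tracking table has a fixed maximum size, and its fill level can be read from /proc. Below is a minimal sketch in Python, assuming the nf_conntrack module is loaded so that the files nf_conntrack_count and nf_conntrack_max exist under /proc/sys/net/netfilter. The function names are made up for illustration. A full table normally shows up as "nf_conntrack: table full, dropping packet" messages in the kernel log rather than as a panic, so this check can only help to rule the theory in or out.

```python
from pathlib import Path

# Standard sysctl location when the nf_conntrack module is loaded (assumption:
# the module is active on this machine).
PROC_DIR = Path("/proc/sys/net/netfilter")


def fill_ratio(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table in use (0.0 to 1.0)."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_conntrack_fill(proc_dir: Path = PROC_DIR) -> float:
    """Read the current and maximum entry counts and return the fill ratio."""
    count = int((proc_dir / "nf_conntrack_count").read_text().strip())
    maximum = int((proc_dir / "nf_conntrack_max").read_text().strip())
    return fill_ratio(count, maximum)
```

Calling read_conntrack_fill() at regular intervals (for example from cron) and logging the result would show whether the table ever gets close to 1.0 before a crash.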

Any suggestions?

Regards
yummiweb

Is Suricata (IDS) enabled?
If so, could you please try to temporarily disable it?

Yes, the Intrusion prevention system is enabled.

I could disable it for some time, but I have doubts, because some log entries in “/var/log/suricata/fast.log” from a few minutes before the crash show:

This time:
“[Drop] [] [1:2403399:68355] ET CINS Active Threat Intelligence Poor Reputation IP group 100 [] [Classification: Misc Attack] [Priority: 2] {TCP} SOURCE_IP:57810 -> TARGET_IP:443”

Last time:
[Drop] [] [1:2017616:4] ET SCAN NETWORK Incoming Masscan detected [] [Classification: Detection of a Network Scan] [Priority: 3] {TCP} SOURCE_IP:21345 -> TARGET_IP:80
08/29/2021-02:36:01.793293 [Drop] [] [1:2403365:68331] ET CINS Active Threat Intelligence Poor Reputation IP group 66 [] [Classification: Misc Attack] [Priority: 2] {TCP} SOURCE_IP:54775 -> TARGET_IP:443
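To correlate such entries with the crash times, the fast.log lines can be split into fields. The following is a rough sketch, not an official Suricata parser: the regular expression is derived only from the lines quoted above, it treats the timestamp and the action as optional, and it accepts the separator both as "[]" (as it appears in this post) and as "[**]". The function name parse_fast_log_line is hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Pattern derived from the log lines quoted in this thread (assumption: other
# fast.log lines follow the same layout).
FAST_LOG_RE = re.compile(
    r"^\s*(?:(?P<timestamp>\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2}\.\d+)\s+)?"
    r"(?:\[(?P<action>[A-Za-z]+)\]\s+)?"
    r"\[\**\]\s+"
    r"\[(?P<gid>\d+):(?P<sid>\d+):(?P<rev>\d+)\]\s+"
    r"(?P<msg>.*?)\s+"
    r"\[\**\]\s+"
    r"\[Classification:\s*(?P<classification>[^\]]*)\]\s+"
    r"\[Priority:\s*(?P<priority>\d+)\]\s+"
    r"\{(?P<proto>[^}]+)\}\s+"
    r"(?P<src>\S+):(?P<sport>\d+)\s+->\s+(?P<dst>\S+):(?P<dport>\d+)\s*$"
)


def parse_fast_log_line(line: str) -> Optional[Dict[str, Union[str, int, None]]]:
    """Return the fields of one fast.log line, or None if it does not match."""
    match = FAST_LOG_RE.match(line)
    if match is None:
        return None
    fields: Dict[str, Union[str, int, None]] = dict(match.groupdict())
    # Convert the purely numeric fields to integers for easier filtering.
    for key in ("gid", "sid", "rev", "priority", "sport", "dport"):
        fields[key] = int(match.group(key))
    return fields
```

Counting the parsed entries per minute, or per sid, in the window before each crash would show whether the panics follow a burst of dropped traffic or one specific rule.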

If I deactivate the IPS, such packets would no longer be dropped, I think.

The time between crashes ranges from less than a day to several days, so the NethServer would be unprotected during that time.

Is there some cache that could be enlarged, or something similar?

Maybe I could run a copy of this NethServer VM in a separate environment or with the NIC disabled, but I guess the problem would not appear if no packets are processed in netfilter/conntrack.

Correct, it will be unprotected from some kinds of threats.

Then I will first prepare an external firewall or filter.

In the meantime I have changed the Proxmox network adapter to “vmxnet3”. Let's see what happens.

Which is a “quite good old piece of software”. With good and bad sides.

OK, the type of network adapter has nothing to do with it. Same crash today.

In the meantime there was a kernel update (3.10.0-1160.41.1.el7.x86_64), which I had already installed on another NethServer VM on a different host in a different network. Now the same crash has occurred on that machine as well.

I thought I could set up an alternative firewall solution in the meantime, but I would have to rearrange the network structure too much. Unfortunately, this is too time-consuming at the moment. So I have to deactivate the IPS without replacing it with another solution.

A copy of the problematic machine (with the NIC disabled) has run without problems so far.

After deactivating the IPS, the server ran for 4 days without problems. Now I am reactivating IPS to see what happens.

If the IPS really is the cause, where should I investigate further?

The IPS always ran without problems before; the NethServer ran normally for months.

So far there have only been problems with the firewall or ThreatShield after I blocked certain IP ranges (GeoIP blocking). When this was active, the firewall could fail to come up at boot or during a reconfiguration (observed on different systems). So I eventually deactivated this function again. But that was months ago, and the server ran without any problems after that. Until now…

My guess is that the virtual machine does not have enough CPU resources to keep up with the traffic analysis.
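One way to test this guess from inside the VM is to sample the aggregate "cpu" line of /proc/stat twice and compare the counters. The sketch below assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical. A high softirq share would suggest that packet processing is saturating the CPU, while a high steal share would suggest that the Proxmox host is not giving the VM enough CPU time.

```python
from typing import Dict, List


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the first line of /proc/stat, which holds the aggregate counters."""
    with open(path) as fh:
        return fh.readline()


def parse_cpu_line(line: str) -> List[int]:
    """Turn 'cpu  1 2 3 ...' into a list of integer tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in parts[1:]]


def cpu_shares(before: str, after: str) -> Dict[str, float]:
    """Compute busy, softirq and steal shares between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    # Only the first eight fields are summed; the guest fields that follow are
    # already accounted for in user and nice.
    deltas = [b - a for a, b in zip(first[:8], second[:8])]
    if len(deltas) < 8:
        raise ValueError("expected at least eight counters per sample")
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    idle_ticks = deltas[3] + deltas[4]  # idle + iowait
    return {
        "busy": (total - idle_ticks) / total,
        "softirq": deltas[6] / total,
        "steal": deltas[7] / total,
    }
```

For example, taking one sample with read_cpu_line(), waiting ten seconds, taking a second sample and passing both to cpu_shares() gives the shares for that interval. Running this while the IPS is enabled and again while it is disabled would show how much of the load the traffic analysis actually causes.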