NS7 -> NS8 migration: Loki keeps restarting

NethServer Version: NS8 on Rocky Linux 9.4
Module: Loki
Repositories: default and nethforge, no testing enabled

Last weekend I spent a great deal of my time migrating my NS7 server to NS8. The migration procedure (Nextcloud, email, and AD/file server) got stuck on the email module (see my earlier post), so I eventually skipped the email migration and completed the rest. Afterwards I migrated the email manually, which went well thanks to the handy imapsync module. However, I still have quite a few problems to solve, which I will address in separate posts. This is the first one.

While looking for clues in /var/log/messages I saw it was being flooded with messages about Loki restarting indefinitely. As a result, there are no logs available in the NS8 interface. I also see some CrowdSec error messages, but I am not sure whether they are related or just a consequence of the Loki problem.
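Not part of the original post, but the "stat /loki/boltdb-shipper-active/uploader/name: permission denied" error shown below typically points at a UID mismatch on the module's data directory (for example, files left over from a migration owned by a different user). A minimal, self-contained sketch of that kind of check, using a throwaway directory and the path from the error message purely for illustration (this is not the real Loki volume):

```shell
#!/bin/sh
# Illustrative sketch only: build a throwaway copy of the path from the
# error message and compare its owner's UID with the current process UID.
# A mismatch is the usual cause of "permission denied" in a rootless container.
demo=$(mktemp -d)
mkdir -p "$demo/loki/boltdb-shipper-active/uploader"

dir_uid=$(stat -c %u "$demo/loki/boltdb-shipper-active/uploader")
if [ "$dir_uid" -eq "$(id -u)" ]; then
  status="ownership matches"
else
  status="UID mismatch: directory owned by $dir_uid, running as $(id -u)"
fi
echo "$status"

rm -rf "$demo"
```

Since mktemp creates the directory as the current user, the demo prints "ownership matches"; on a real broken volume the same comparison would report the mismatch.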

Excerpt of the Loki restart messages:

Jun 24 10:36:09 server2 systemd[2309]: Starting Loki pod service...
Jun 24 10:36:09 server2 systemd[2309]: Created slice cgroup user-libpod_pod_c2e89c0b44372e46375bcad9b942aed9f5ed5629b68a95de7c071b37615f7e75.slice.
Jun 24 10:36:09 server2 podman[2750167]: c2e89c0b44372e46375bcad9b942aed9f5ed5629b68a95de7c071b37615f7e75
Jun 24 10:36:09 server2 systemd[2309]: Started libcrun container.
Jun 24 10:36:09 server2 podman[2750177]: c2e89c0b44372e46375bcad9b942aed9f5ed5629b68a95de7c071b37615f7e75
Jun 24 10:36:09 server2 systemd[2309]: Started Loki pod service.
Jun 24 10:36:09 server2 systemd[2309]: Starting Loki server...
Jun 24 10:36:09 server2 systemd[2309]: Starting Loki frontend HTTP proxy...
Jun 24 10:36:09 server2 systemd[2309]: Started libcrun container.
Jun 24 10:36:09 server2 systemd[2309]: Started libcrun container.
Jun 24 10:36:09 server2 podman[2750213]: 086a1dd08ce95e76f288b8dc55ea08552209e5a8dfe9188387571b012f56cf4d
Jun 24 10:36:09 server2 systemd[2309]: Started Loki server.
Jun 24 10:36:09 server2 podman[2750222]: 1e7c44ac121068d36d27c914ba33ae99826c2ec2c828b5b39e1af540c3a6e1f4
Jun 24 10:36:09 server2 systemd[2309]: Started Loki frontend HTTP proxy.
Jun 24 10:36:09 server2 loki-server[2750237]: level=error ts=2024-06-24T08:36:09.603266567Z caller=log.go:106 msg="error running loki" err="stat /loki/boltdb-shipper-active/uploader/name: permission denied\nerror creating index client\ngithub.com/cortexproject/cortex/pkg/chunk/storage.NewStore\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:179\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:319\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:103\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nerror initialising module: store\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:105\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"
Jun 24 10:36:09 server2 traefik[2750241]: time="2024-06-24T08:36:09Z" level=info msg="Configuration loaded from flags."
Jun 24 10:36:09 server2 systemd[2309]: loki-server.service: Main process exited, code=exited, status=1/FAILURE
Jun 24 10:36:09 server2 podman[2750279]: 086a1dd08ce95e76f288b8dc55ea08552209e5a8dfe9188387571b012f56cf4d
Jun 24 10:36:09 server2 systemd[2309]: loki-server.service: Failed with result 'exit-code'.
Jun 24 10:36:09 server2 crowdsec1[3238]: time="2024-06-24T08:36:09Z" level=error msg="UnmarshalJSON : invalid character 'i' in literal true (expecting 'r')" line="time=\"2024-06-24T08:36:09Z\" level=info msg=\"Configuration loaded from flags.\""
Jun 24 10:36:09 server2 crowdsec1[3238]: time="2024-06-24T08:36:09Z" level=warning msg="failed to run filter : invalid character 'i' in literal true (expecting 'r') (1:1)\n | UnmarshalJSON(evt.Parsed.message, evt.Unmarshaled, \"traefik\") in [\"\", nil]\n | ^" id=empty-shape name=child-crowdsecurity/traefik-logs stage=s01-parse
Jun 24 10:36:10 server2 systemd[2309]: loki-server.service: Scheduled restart job, restart counter is at 14644.
Jun 24 10:36:10 server2 systemd[2309]: Stopped Loki server.
Jun 24 10:36:10 server2 systemd[2309]: Stopping Loki frontend HTTP proxy...
Jun 24 10:36:10 server2 traefik[2750241]: time="2024-06-24T08:36:10Z" level=error msg="accept tcp [::]:8080: use of closed network connection" entryPointName=loki
Jun 24 10:36:10 server2 traefik[2750241]: time="2024-06-24T08:36:10Z" level=error msg="Error while starting server: http: Server closed" entryPointName=loki
Jun 24 10:36:10 server2 traefik[2750241]: time="2024-06-24T08:36:10Z" level=error msg="Error while starting server: http: Server closed" entryPointName=loki
Jun 24 10:36:10 server2 traefik[2750241]: time="2024-06-24T08:36:10Z" level=error msg="close tcp [::]:8080: use of closed network connection" entryPointName=loki
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=error msg="UnmarshalJSON : invalid character 'i' in literal true (expecting 'r')" line="time=\"2024-06-24T08:36:10Z\" level=error msg=\"accept tcp [::]:8080: use of closed network connection\" entryPointName=loki"
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=warning msg="failed to run filter : invalid character 'i' in literal true (expecting 'r') (1:1)\n | UnmarshalJSON(evt.Parsed.message, evt.Unmarshaled, \"traefik\") in [\"\", nil]\n | ^" id=empty-shape name=child-crowdsecurity/traefik-logs stage=s01-parse
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=error msg="UnmarshalJSON : invalid character 'i' in literal true (expecting 'r')" line="time=\"2024-06-24T08:36:10Z\" level=error msg=\"Error while starting server: http: Server closed\" entryPointName=loki"
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=warning msg="failed to run filter : invalid character 'i' in literal true (expecting 'r') (1:1)\n | UnmarshalJSON(evt.Parsed.message, evt.Unmarshaled, \"traefik\") in [\"\", nil]\n | ^" id=empty-shape name=child-crowdsecurity/traefik-logs stage=s01-parse
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=error msg="UnmarshalJSON : invalid character 'i' in literal true (expecting 'r')" line="time=\"2024-06-24T08:36:10Z\" level=error msg=\"Error while starting server: http: Server closed\" entryPointName=loki"
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=warning msg="failed to run filter : invalid character 'i' in literal true (expecting 'r') (1:1)\n | UnmarshalJSON(evt.Parsed.message, evt.Unmarshaled, \"traefik\") in [\"\", nil]\n | ^" id=empty-shape name=child-crowdsecurity/traefik-logs stage=s01-parse
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=error msg="UnmarshalJSON : invalid character 'i' in literal true (expecting 'r')" line="time=\"2024-06-24T08:36:10Z\" level=error msg=\"close tcp [::]:8080: use of closed network connection\" entryPointName=loki"
Jun 24 10:36:10 server2 crowdsec1[3238]: time="2024-06-24T08:36:10Z" level=warning msg="failed to run filter : invalid character 'i' in literal true (expecting 'r') (1:1)\n | UnmarshalJSON(evt.Parsed.message, evt.Unmarshaled, \"traefik\") in [\"\", nil]\n | ^" id=empty-shape name=child-crowdsecurity/traefik-logs stage=s01-parse
Jun 24 10:36:11 server2 podman[2750289]: 1e7c44ac121068d36d27c914ba33ae99826c2ec2c828b5b39e1af540c3a6e1f4
Jun 24 10:36:11 server2 podman[2750309]: 1e7c44ac121068d36d27c914ba33ae99826c2ec2c828b5b39e1af540c3a6e1f4
Jun 24 10:36:11 server2 systemd[2309]: Stopped Loki frontend HTTP proxy.
Jun 24 10:36:11 server2 systemd[2309]: Stopping Loki pod service...
Jun 24 10:36:11 server2 systemd[2309]: Removed slice cgroup user-libpod_pod_c2e89c0b44372e46375bcad9b942aed9f5ed5629b68a95de7c071b37615f7e75.slice.
Jun 24 10:36:11 server2 podman[2750319]: c2e89c0b44372e46375bcad9b942aed9f5ed5629b68a95de7c071b37615f7e75
Jun 24 10:36:11 server2 podman[2750340]: c2e89c0b44372e46375bcad9b942aed9f5ed5629b68a95de7c071b37615f7e75
Jun 24 10:36:11 server2 systemd[2309]: Stopped Loki pod service.
Jun 24 10:36:11 server2 systemd[2309]: Starting Loki pod service...

Do you have Grafana installed?

Not a solution, but I guess the key here is the ‘permission denied’ issue…?

No, Grafana is not installed. Should I do that?

No need. Loki is part of the NS8 core and handles the logs.

Grafana Loki is a special database designed to store, index and search system logs. For more information, see nethserver/ns8-loki


This problem still exists and I have no clue how to solve it. Loki keeps restarting every three seconds. Is it possible to perform some kind of re-installation of Loki to see whether that helps? What else can I do about it?

Probably the main error message, as @LayLow mentioned earlier:

Jun 28 11:46:27 server2 loki-server[866964]: level=error ts=2024-06-28T09:46:27.462578956Z caller=log.go:106 msg="error running loki" err="stat /loki/boltdb-shipper-active/uploader/name: permission denied\nerror creating index client\ngithub.com/cortexproject/cortex/pkg/chunk/storage.NewStore\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:179\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:319\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:103\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nerror initialising module: store\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:105\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"

As you come from a migration, I assume you can lose all your logs.

If this is the case, and your leader node is node 1, execute the following procedure on node 1.

  1. install another Loki instance on node 1

    add-module loki 1 
    

    Here I assume the new instance is loki2.

  2. make the new instance the active one

    redis-cli set cluster/default_instance/loki loki2
    redis-cli publish cluster/event/default-instance-changed '{"instance":"loki","previous":"loki1","current":"loki2"}'
    
  3. remove the old Loki instance (optional)

    remove-module --no-preserve --force loki1
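
The three steps above can also be collected into one script for review before running. This is a convenience sketch, not part of the original answer: by default it only prints the commands (dry run), and the `RUN=1` switch is an illustrative convention, not an NS8 one. It assumes, as in the post, that the fresh instance comes up as `loki2`.

```shell
#!/bin/sh
# Dry-run wrapper around the replacement procedure. The commands are exactly
# those from the steps above; set RUN=1 to actually execute them on node 1.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

run add-module loki 1
run redis-cli set cluster/default_instance/loki loki2
run redis-cli publish cluster/event/default-instance-changed \
  '{"instance":"loki","previous":"loki1","current":"loki2"}'
run remove-module --no-preserve --force loki1
```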
    

Hi Davide,

I can confirm that your solution did the trick; Loki is running now! Thank you very much for your help, it makes debugging problems a lot easier. :+1: