Latest Core Update fails

Task Trace:

{"context":{"action":"update-core","data":{"nodes":[1]},"extra":{"description":"Processing","eventId":"d8da22a9-b39b-4a04-9f8c-adf3b2c5db54","title":"Update core"},"id":"e0a87263-e1b8-4669-8790-c6f0e36416c4","parent":"","queue":"cluster/tasks","timestamp":"2024-04-17T21:52:22.602785163Z","user":"admin"},"status":"aborted","progress":100,"subTasks":[{"context":{"action":"list-actions","data":{},"extra":{},"id":"b2e415b9-b63f-47fa-af8a-d5fe16a5a344","parent":"e0a87263-e1b8-4669-8790-c6f0e36416c4"},"status":"completed","progress":100,"subTasks":[],"result":{"error":"","exit_code":0,"file":"task/node/1/b2e415b9-b63f-47fa-af8a-d5fe16a5a344","output":["add-module","add-public-service","add-tun","remove-module","start-support-session","get-facts","get-fqdn","get-node-status","remove-public-service","update-os","add-custom-zone","get-firewall-status","remove-custom-zone","set-name","update-core","stop-support-session","get-info","get-name","get-support-session","remove-tun","set-fqdn","list-actions","cancel-task"]}},{"context":{"action":"update-core","data":{"core_url":"ghcr.io/nethserver/core:2.7.0","force":false},"extra":{},"id":"a900bcde-9bd6-4e1e-a826-3ba94b2bb08d","parent":"e0a87263-e1b8-4669-8790-c6f0e36416c4"},"status":"running","progress":60,"subTasks":[]}],"validated":true,"result":{"error":"_acontrol_task request attempt failed (Connection closed by server.). 
Retrying...\n_acontrol_task request recovered successfully at attempt 2\n<7>run-scriptdir /var/lib/nethserver/cluster/update-core-pre-modules.d/\nRunning /var/lib/nethserver/cluster/update-core-pre-modules.d/50update_grants...\nTask cluster/update-module run failed: {'output': '', 'error': 'Traceback (most recent call last):\\n  File \"/var/lib/nethserver/cluster/actions/update-module/50update\", line 54, in <module>\\n    ping_errors = agent.tasks.runp_brief([{\"agent_id\": f\"module/{mid}\", \"action\": \"list-actions\"} for mid in instances],\\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/run.py\", line 61, in runp_brief\\n    results = asyncio.run(_runp(tasks, **kwargs))\\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/lib64/python3.11/asyncio/runners.py\", line 190, in run\\n    return runner.run(main)\\n           ^^^^^^^^^^^^^^^^\\n  File \"/usr/lib64/python3.11/asyncio/runners.py\", line 118, in run\\n    return self._loop.run_until_complete(task)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/lib64/python3.11/asyncio/base_events.py\", line 653, in run_until_complete\\n    return future.result()\\n           ^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/run.py\", line 120, in _runp\\n    return await asyncio.gather(*runners, return_exceptions=(len(tasks) > 1))\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/run.py\", line 127, in _run_with_protocol\\n    return await run_redisclient(taskrq, **pconn)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/redisclient.py\", line 77, in run_redisclient\\n    await _task_submission_check_client_idle(rdb, taskrq, kwargs[\\'check_idle_time\\'])\\n  File \"/usr/local/agent/pypkg/agent/tasks/redisclient.py\", line 41, in 
_task_submission_check_client_idle\\n    raise TaskSubmissionCheckFailed(f\"Client \\\\\"{taskrq[\\'agent_id\\']}\\\\\" was not found\")\\nagent.tasks.exceptions.TaskSubmissionCheckFailed: Client \"module/traefik1\" was not found\\n', 'exit_code': 1}\nTask cluster/update-module run failed: {'output': '', 'error': 'Traceback (most recent call last):\\n  File \"/var/lib/nethserver/cluster/actions/update-module/50update\", line 54, in <module>\\n    ping_errors = agent.tasks.runp_brief([{\"agent_id\": f\"module/{mid}\", \"action\": \"list-actions\"} for mid in instances],\\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/run.py\", line 61, in runp_brief\\n    results = asyncio.run(_runp(tasks, **kwargs))\\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/lib64/python3.11/asyncio/runners.py\", line 190, in run\\n    return runner.run(main)\\n           ^^^^^^^^^^^^^^^^\\n  File \"/usr/lib64/python3.11/asyncio/runners.py\", line 118, in run\\n    return self._loop.run_until_complete(task)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/lib64/python3.11/asyncio/base_events.py\", line 653, in run_until_complete\\n    return future.result()\\n           ^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/run.py\", line 120, in _runp\\n    return await asyncio.gather(*runners, return_exceptions=(len(tasks) > 1))\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/run.py\", line 127, in _run_with_protocol\\n    return await run_redisclient(taskrq, **pconn)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \"/usr/local/agent/pypkg/agent/tasks/redisclient.py\", line 77, in run_redisclient\\n    await _task_submission_check_client_idle(rdb, taskrq, kwargs[\\'check_idle_time\\'])\\n  File 
\"/usr/local/agent/pypkg/agent/tasks/redisclient.py\", line 41, in _task_submission_check_client_idle\\n    raise TaskSubmissionCheckFailed(f\"Client \\\\\"{taskrq[\\'agent_id\\']}\\\\\" was not found\")\\nagent.tasks.exceptions.TaskSubmissionCheckFailed: Client \"module/loki1\" was not found\\n', 'exit_code': 1}\n<7>run-scriptdir /var/lib/nethserver/cluster/update-core-post-modules.d/\nupdate-core failed in some core modules\n  File \"/var/lib/nethserver/cluster/actions/update-core/70update_modules\", line 57, in <module>\n    agent.assert_exp(update_module_errors == 0, 'update-core failed in some core modules')\n","exit_code":2,"file":"task/cluster/e0a87263-e1b8-4669-8790-c6f0e36416c4","output":""}}

Cheers.

The error comes from a consistency check that runs before starting the update. Agents of core modules are not connected to Redis, which is an error condition: they were stopped, killed or their Redis credentials were removed.
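To see at a glance which agents the check complained about, the trace text can be scanned for the "was not found" message. The following is only a minimal sketch: the helper name `missing_agents` is made up for illustration, and it assumes the message wording stays exactly as it appears in the trace above.

```python
import re

# Matches the message raised by the pre-update check, e.g.
#   Client "module/loki1" was not found
# The quotes may be preceded by backslashes in the JSON-escaped trace.
# The capture is limited to ID-like characters so that the f-string
# template shown in the traceback source line is not matched.
_MISSING = re.compile(r'Client \\*"([A-Za-z0-9_./-]+)\\*" was not found')


def missing_agents(trace_text: str) -> list[str]:
    """Return the agent IDs reported as not found, in order, without duplicates."""
    seen: list[str] = []
    for agent_id in _MISSING.findall(trace_text):
        if agent_id not in seen:
            seen.append(agent_id)
    return seen


if __name__ == "__main__":
    # Two lines copied from the trace, with the JSON-escaped quotes kept.
    sample = (
        "agent.tasks.exceptions.TaskSubmissionCheckFailed: "
        'Client \\"module/traefik1\\" was not found\n'
        "agent.tasks.exceptions.TaskSubmissionCheckFailed: "
        'Client \\"module/loki1\\" was not found\n'
    )
    print(missing_agents(sample))  # prints ['module/traefik1', 'module/loki1']
```

For the trace in this thread that gives `module/traefik1` and `module/loki1`, the two core modules whose agents were not connected to Redis.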

How many nodes does your cluster have? Did you remove the first node in the past? Please provide some background information.

Loki is the backend of the Logs page: check whether that page is working.

If you want to inspect the agent status:

  runagent -m loki1 systemctl --user status agent
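Since the trace names both traefik1 and loki1, the same check can be repeated for each of them. Here is a rough sketch that only builds and prints the command lines by default; the function names are made up for illustration, and actually running the commands (`run=True`) assumes you are on the NS8 node, where `runagent` is available.

```python
import shlex
import subprocess


def status_command(module_id: str) -> list[str]:
    """Build the status command shown above for one module instance."""
    return ["runagent", "-m", module_id, "systemctl", "--user", "status", "agent"]


def check_agents(module_ids: list[str], run: bool = False) -> None:
    """Print the status command for each module; execute it only if run=True."""
    for module_id in module_ids:
        argv = status_command(module_id)
        print("$", shlex.join(argv))
        if run:
            # check=False: a non-zero exit status is left for the user to read
            # in the command output instead of raising an exception here.
            subprocess.run(argv, check=False)


if __name__ == "__main__":
    # Dry run: only prints the two command lines.
    check_agents(["traefik1", "loki1"])
```

Call `check_agents(["traefik1", "loki1"], run=True)` on the node to execute the commands instead of just printing them.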

As a quick workaround you can try to reboot all the cluster nodes and see if they come back online.


The core update was the first thing I tried after completing the NS7 → NS8 migration, which appeared to go well: I could see everything I expected to on the NS8 side.

My system rebooted last night after applying some Rocky updates and now hasn’t assigned itself an IPv4 address, so all I have is the Proxmox console to work with. Any ideas on why this might have happened and how to recover?

But to the issue at hand:

runagent: [FATAL] Cannot find runtime directory for loki1

Minor point: The WireGuard listener for the old NS7 is still being activated. It wasn’t torn down as a final act of the migration.

Cheers.

Sorry, forgot to add this part earlier.

Just a single node home setup, nothing special.

Did you want to try and do a post-mortem on this? If not, I’m quite happy to just blow away the NS8, build a new one from one of the published images, and re-migrate using the lessons learned from the first time. My NS7 is still up and running, thanks to this.

Cheers.
