Grafana and Prometheus only work on the leader node

Tested with AlmaLinux 9 on Proxmox and VirtualBox.

Grafana and Prometheus work on the cluster leader only; installing/configuring them on a worker node didn’t work.

Both Grafana and Prometheus showed the following error on worker nodes:

FileNotFoundError: [Errno 2] No such file or directory: '/etc/nethserver/logcli.env'

This file exists only on the leader node.

Grafana

Grafana installs on worker nodes, but configure-module fails; it only works on the leader.
No containers were started.

Grafana worker node /var/log/messages:

May 22 17:29:04 alma2 systemd[1281]: Starting Grafana server...
May 22 17:29:05 alma2 runagent[1666]: Traceback (most recent call last):
May 22 17:29:05 alma2 runagent[1666]:  File "/home/grafana1/.config/bin/provision", line 32, in <module>
May 22 17:29:05 alma2 runagent[1666]:    logcli = agent.read_envfile("/etc/nethserver/logcli.env")
May 22 17:29:05 alma2 runagent[1666]:  File "/usr/local/agent/pypkg/agent/__init__.py", line 89, in read_envfile
May 22 17:29:05 alma2 runagent[1666]:    fo = open(file_path, 'r')
May 22 17:29:05 alma2 runagent[1666]: FileNotFoundError: [Errno 2] No such file or directory: '/etc/nethserver/logcli.env'
May 22 17:29:05 alma2 systemd[1281]: grafana.service: Control process exited, code=exited, status=1/FAILURE
May 22 17:29:05 alma2 systemd[1281]: grafana.service: Failed with result 'exit-code'.
May 22 17:29:05 alma2 systemd[1281]: Failed to start Grafana server.
May 22 17:29:06 alma2 systemd[1281]: grafana.service: Scheduled restart job, restart counter is at 4.
May 22 17:29:06 alma2 systemd[1281]: Stopped Grafana server.
May 22 17:29:06 alma2 systemd[1281]: Starting Grafana server...
May 22 17:29:06 alma2 runagent[1675]: Traceback (most recent call last):
May 22 17:29:06 alma2 runagent[1675]:  File "/home/grafana1/.config/bin/provision", line 32, in <module>
May 22 17:29:06 alma2 runagent[1675]:    logcli = agent.read_envfile("/etc/nethserver/logcli.env")
May 22 17:29:06 alma2 runagent[1675]:  File "/usr/local/agent/pypkg/agent/__init__.py", line 89, in read_envfile
May 22 17:29:06 alma2 runagent[1675]:    fo = open(file_path, 'r')
May 22 17:29:06 alma2 runagent[1675]: FileNotFoundError: [Errno 2] No such file or directory: '/etc/nethserver/logcli.env'
May 22 17:29:06 alma2 systemd[1281]: grafana.service: Control process exited, code=exited, status=1/FAILURE
May 22 17:29:07 alma2 systemd[1281]: grafana.service: Failed with result 'exit-code'.
May 22 17:29:07 alma2 systemd[1281]: Failed to start Grafana server.
May 22 17:29:07 alma2 systemd[1281]: grafana.service: Scheduled restart job, restart counter is at 5.
May 22 17:29:07 alma2 systemd[1281]: Stopped Grafana server.
May 22 17:29:07 alma2 systemd[1281]: grafana.service: Start request repeated too quickly.
May 22 17:29:07 alma2 systemd[1281]: grafana.service: Failed with result 'exit-code'.
May 22 17:29:07 alma2 systemd[1281]: Failed to start Grafana server.

configure-module failed; task log from the UI:

{
  "context": {
    "action": "configure-module",
    "data": {
      "host": "grafana.mrmarkuz.ddnss.eu",
      "http2https": true,
      "lets_encrypt": false
    },
    "extra": {
      "description": "Processing...",
      "eventId": "5aa01173-ab84-4234-ab2d-2ada0b2667f8",
      "title": "Configure grafana2"
    },
    "id": "af151d12-da28-4e81-a6e8-af8933fdc5f4",
    "parent": "",
    "queue": "module/grafana2/tasks",
    "timestamp": "2023-05-22T17:36:13.253892893Z",
    "user": "admin"
  },
  "status": "aborted",
  "progress": 50,
  "subTasks": [
    {
      "context": {
        "action": "set-route",
        "data": {
          "host": "grafana.mrmarkuz.ddnss.eu",
          "http2https": true,
          "instance": "grafana2",
          "lets_encrypt": false,
          "url": "http://127.0.0.1:20009"
        },
        "extra": {},
        "id": "64acb3b4-ac62-4c28-9820-bf8fd56b09f3",
        "parent": "af151d12-da28-4e81-a6e8-af8933fdc5f4",
        "queue": "module/traefik2/tasks",
        "timestamp": "2023-05-22T17:36:13.78511532Z",
        "user": "module/grafana2"
      },
      "status": "completed",
      "progress": 100,
      "subTasks": [],
      "result": {
        "error": "",
        "exit_code": 0,
        "file": "task/module/traefik2/64acb3b4-ac62-4c28-9820-bf8fd56b09f3",
        "output": ""
      }
    }
  ],
  "validated": true,
  "result": {
    "error": "Job for grafana.service failed because the control process exited with error code.\nSee \"systemctl --user status grafana.service\" and \"journalctl --user -xeu grafana.service\" for details.\n",
    "exit_code": 1,
    "file": "task/module/grafana2/af151d12-da28-4e81-a6e8-af8933fdc5f4",
    "output": ""
  }
}

Prometheus

Prometheus installation fails on worker nodes.

UI log:

{
  "context": {
    "action": "add-module",
    "data": {
      "image": "ghcr.io/nethserver/prometheus:1.0.0",
      "node": 2
    },
    "extra": {
      "completion": {
        "extraTextParams": [
          "node"
        ],
        "i18nString": "software_center.instance_installed_on_node",
        "outputTextParams": [
          "module_id"
        ]
      },
      "description": "Installing on Node 2",
      "eventId": "9fb9b467-b0c3-4fcf-934e-2ebd2fab5430",
      "node": "Node 2",
      "title": "Install Prometheus"
    },
    "id": "f4970be3-839a-4f6f-a041-181db02b3cad",
    "parent": "",
    "queue": "cluster/tasks",
    "timestamp": "2023-05-22T17:45:22.232176793Z",
    "user": "admin"
  },
  "status": "aborted",
  "progress": 33,
  "subTasks": [
    {
      "context": {
        "action": "add-module",
        "data": {
          "environment": {
            "IMAGE_DIGEST": "sha256:e4c50fceb5c7c3170a25ea1af47e1232ad8b7e77b0e24672b4fff219327e59df",
            "IMAGE_ID": "e28418b111cf330277d3dcfb29b9a9cc167bb370e8e3464e1ff4fb6562aeda05",
            "IMAGE_REOPODIGEST": "ghcr.io/nethserver/prometheus@sha256:e4c50fceb5c7c3170a25ea1af47e1232ad8b7e77b0e24672b4fff219327e59df",
            "IMAGE_URL": "ghcr.io/nethserver/prometheus:1.0.0",
            "MODULE_ID": "prometheus1",
            "MODULE_UUID": "b4057f5d-82c2-48d9-b39d-8a9770f0f449",
            "NODE_ID": "2",
            "TCP_PORT": "20011",
            "TCP_PORTS": "20011"
          },
          "is_rootfull": false,
          "module_id": "prometheus1"
        },
        "extra": {},
        "id": "1e464de0-2041-428f-b9e7-953807ad141d",
        "parent": "f4970be3-839a-4f6f-a041-181db02b3cad"
      },
      "status": "completed",
      "progress": 100,
      "subTasks": [],
      "result": {
        "error": "<7>useradd -m -k /etc/nethserver/skel -s /bin/bash prometheus1\n<7>extract-image ghcr.io/nethserver/prometheus:1.0.0\nTrying to pull ghcr.io/nethserver/prometheus:1.0.0...\nGetting image source signatures\nCopying blob sha256:7c29cac7af4b7e18ee9c8cb18f6c0534c49728b0f02aecc110c2189120d5eaaf\nCopying config sha256:e28418b111cf330277d3dcfb29b9a9cc167bb370e8e3464e1ff4fb6562aeda05\nWriting manifest to image destination\nStoring signatures\nExtracting container filesystem imageroot to /home/prometheus1/.config\nTotal bytes read: 7987200 (7.7MiB, 89MiB/s)\nimageroot/actions/\nimageroot/actions/create-module/\nimageroot/actions/create-module/20configure\nimageroot/actions/create-module/80start_services\nimageroot/actions/create-module/validate-input.json\nimageroot/actions/destroy-module/\nimageroot/actions/destroy-module/20destroy\nimageroot/actions/get-configuration/\nimageroot/actions/get-configuration/20read\nimageroot/actions/get-configuration/validate-output.json\nimageroot/bin/\nimageroot/bin/reload_configuration\nimageroot/bin/validation.json\nimageroot/etc/\nimageroot/etc/state-include.conf\nimageroot/events/\nimageroot/events/service-prometheus-metrics-updated/\nimageroot/events/service-prometheus-metrics-updated/10handler\nimageroot/systemd/\nimageroot/systemd/user/\nimageroot/systemd/user/prometheus.service\nchanged ownership of './state/environment' from root:root to prometheus1:prometheus1\nchanged ownership of './state/agent.env' from root:root to prometheus1:prometheus1\nchanged ownership of './systemd/user/prometheus.service' from root:root to prometheus1:prometheus1\nchanged ownership of './.imageroot.lst' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/create-module/20configure' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/create-module/80start_services' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/create-module/validate-input.json' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/create-module' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/destroy-module/20destroy' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/destroy-module' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/get-configuration/20read' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/get-configuration/validate-output.json' from root:root to prometheus1:prometheus1\nchanged ownership of './actions/get-configuration' from root:root to prometheus1:prometheus1\nchanged ownership of './actions' from root:root to prometheus1:prometheus1\nchanged ownership of './bin/reload_configuration' from root:root to prometheus1:prometheus1\nchanged ownership of './bin/validation.json' from root:root to prometheus1:prometheus1\nchanged ownership of './bin' from root:root to prometheus1:prometheus1\nchanged ownership of './etc/state-include.conf' from root:root to prometheus1:prometheus1\nchanged ownership of './etc' from root:root to prometheus1:prometheus1\nchanged ownership of './events/service-prometheus-metrics-updated/10handler' from root:root to prometheus1:prometheus1\nchanged ownership of './events/service-prometheus-metrics-updated' from root:root to prometheus1:prometheus1\nchanged ownership of './events' from root:root to prometheus1:prometheus1\n8110f86df277cb9ad22b879854df24efa4f65af9c7d9413ea51b2d0d1f00ea8e\n<7>loginctl enable-linger prometheus1\n",
        "exit_code": 0,
        "file": "task/node/2/1e464de0-2041-428f-b9e7-953807ad141d",
        "output": {
          "redis_sha256": "8f677f8ae005c949202e630474440ed05448d71299651e8b1e057fed27568f11"
        }
      }
    },
    {
      "context": {
        "action": "create-module",
        "data": {
          "images": [
            "quay.io/prometheus/prometheus:v2.37.5"
          ]
        },
        "extra": {},
        "id": "678a5ab9-4600-4bb6-b168-d651836ad8e9",
        "parent": "f4970be3-839a-4f6f-a041-181db02b3cad"
      },
      "status": "aborted",
      "progress": 77,
      "subTasks": [
        {
          "context": {
            "action": "set-route",
            "data": {
              "http2https": false,
              "instance": "prometheus1",
              "lets_encrypt": false,
              "path": "/589c23e0551e4481a4a3e80ddff21b56",
              "url": "http://127.0.0.1:20011"
            },
            "extra": {},
            "id": "529ba1d3-c1d5-4776-aa68-d13ef58cdfb1",
            "parent": "678a5ab9-4600-4bb6-b168-d651836ad8e9",
            "queue": "module/traefik2/tasks",
            "timestamp": "2023-05-22T17:45:44.382109352Z",
            "user": "module/prometheus1"
          },
          "status": "completed",
          "progress": 100,
          "subTasks": [],
          "result": {
            "error": "",
            "exit_code": 0,
            "file": "task/module/traefik2/529ba1d3-c1d5-4776-aa68-d13ef58cdfb1",
            "output": ""
          }
        }
      ],
      "result": {
        "error": "Add to module/prometheus1 environment PROMETHEUS_IMAGE=quay.io/prometheus/prometheus:v2.37.5\n<7>dump_env() is deprecated and implemented as a no-op\n<7>podman-pull-missing quay.io/prometheus/prometheus:v2.37.5\nTrying to pull quay.io/prometheus/prometheus:v2.37.5...\nGetting image source signatures\nCopying blob sha256:a364fdffa75b874796f397b75df22b943243de461006ddc5f6ed693af248c0cc\nCopying blob sha256:5c12815fee558b157dee7f7509dedbaba0a8379098858a65ec869e1f1526ea0c\nCopying blob sha256:22b70bddd3acadc892fca4c2af4260629bfda5dfd11ebc106a93ce24e752b5ed\nCopying blob sha256:ea71be1a17871f09737ccfb04116b09e7c85c921d91f310f544741d1907a92bd\nCopying blob sha256:f91bda4f4397a941b6706ca8c6a526fe80f187039bb05e2983eab8acb712e347\nCopying blob sha256:2596b6b1b5e8d84f84fc6adea2114aa358afaba0cb55a22bd7739c837f2566f9\nCopying blob sha256:5c3550bf7ce22ef28073d72f9e91f05ae3d10aa4ad42c48b42a750cce5f5d247\nCopying blob sha256:9e7f059b5bedeb9d64749f8176d28c96ca5b80cc8654a9b4bcae9ca3490d2d52\nCopying blob sha256:10b9a1426ff25cfb9391c4eef2705dc9ee7a0bedea500ed4153b2b3807cf3795\nCopying blob sha256:22a71175390db83d1f984b08620ef56d8a23e0df562c22245b510040f4428223\nCopying blob sha256:9a2560986b70c9d1f975e7f9bc1dfefa444918503db1c7c74dac074379745dd7\nCopying blob sha256:cca231f698f0adbb824708c35b6395cdb1b603716389a659bf80370453a75a1a\nCopying config sha256:cdc06c473c9b6f1ae04422ae8a901af087e35753439d7edad95706c2ba65134e\nWriting manifest to image destination\nStoring signatures\ncdc06c473c9b6f1ae04422ae8a901af087e35753439d7edad95706c2ba65134e\nTraceback (most recent call last):\n  File \"/home/prometheus1/.config/actions/create-module/20configure\", line 53, in <module>\n    logcli = agent.read_envfile(\"/etc/nethserver/logcli.env\")\n  File \"/usr/local/agent/pypkg/agent/__init__.py\", line 89, in read_envfile\n    fo = open(file_path, 'r')\nFileNotFoundError: [Errno 2] No such file or directory: '/etc/nethserver/logcli.env'\n",
        "exit_code": 1,
        "file": "task/module/prometheus1/678a5ab9-4600-4bb6-b168-d651836ad8e9",
        "output": ""
      }
    }
  ],
  "validated": true,
  "result": {
    "error": "<7>podman-pull-missing ghcr.io/nethserver/prometheus:1.0.0\nTrying to pull ghcr.io/nethserver/prometheus:1.0.0...\nGetting image source signatures\nCopying blob sha256:7c29cac7af4b7e18ee9c8cb18f6c0534c49728b0f02aecc110c2189120d5eaaf\nCopying config sha256:e28418b111cf330277d3dcfb29b9a9cc167bb370e8e3464e1ff4fb6562aeda05\nWriting manifest to image destination\nStoring signatures\ne28418b111cf330277d3dcfb29b9a9cc167bb370e8e3464e1ff4fb6562aeda05\n<7>extract-ui ghcr.io/nethserver/prometheus:1.0.0\nExtracting container filesystem ui to /var/lib/nethserver/cluster/ui/apps/prometheus1\nui/css/\nui/css/about~31ecd969.aaeac019.css\nui/css/app~748942c6.75af5757.css\nui/i18n/\nui/i18n/en/\nui/i18n/en/translation.json\nui/i18n/it/\nui/i18n/it/translation.json\nui/img/\nui/img/module_default_logo.cce7147c.png\nui/index.html\nui/js/\nui/js/about~31ecd969.174fb5c7.js\nui/js/about~31ecd969.174fb5c7.js.map\nui/js/app~748942c6.36add9e5.js\nui/js/app~748942c6.36add9e5.js.map\nui/js/chunk-vendors~02576867.44c88c8d.js\nui/js/chunk-vendors~02576867.44c88c8d.js.map\nui/js/chunk-vendors~0605657e.7f254d6a.js\nui/js/chunk-vendors~0605657e.7f254d6a.js.map\nui/js/chunk-vendors~0f485567.2b4008a3.js\nui/js/chunk-vendors~0f485567.2b4008a3.js.map\nui/js/chunk-vendors~17faf02d.7f68aab6.js\nui/js/chunk-vendors~17faf02d.7f68aab6.js.map\nui/js/chunk-vendors~1d97ff09.290bda9d.js\nui/js/chunk-vendors~1d97ff09.290bda9d.js.map\nui/js/chunk-vendors~2a42e354.0df57e14.js\nui/js/chunk-vendors~2a42e354.0df57e14.js.map\nui/js/chunk-vendors~2aa62147.38204dba.js\nui/js/chunk-vendors~2aa62147.38204dba.js.map\nui/js/chunk-vendors~41d44f25.e7779fa1.js\nui/js/chunk-vendors~41d44f25.e7779fa1.js.map\nui/js/chunk-vendors~46852254.0b6d19c2.js\nui/js/chunk-vendors~46852254.0b6d19c2.js.map\nui/js/chunk-vendors~57473a66.f586d2ed.js\nui/js/chunk-vendors~57473a66.f586d2ed.js.map\nui/js/chunk-vendors~5bb1f863.3efba861.js\nui/js/chunk-vendors~5bb1f863.3efba861.js.map\nui/js/chunk-vendors~5eba3806.b8428291.js\nui/js/chunk-vendors~5eba3806.b8428291.js.map\nui/js/chunk-vendors~690b702c.fd888ae3.js\nui/js/chunk-vendors~690b702c.fd888ae3.js.map\nui/js/chunk-vendors~7274e1de.72bc2eab.js\nui/js/chunk-vendors~7274e1de.72bc2eab.js.map\nui/js/chunk-vendors~86f6b1bc.a87cb010.js\nui/js/chunk-vendors~86f6b1bc.a87cb010.js.map\nui/js/chunk-vendors~b5906859.deff14e4.js\nui/js/chunk-vendors~b5906859.deff14e4.js.map\nui/js/chunk-vendors~bc21d4b3.da6ccf35.js\nui/js/chunk-vendors~bc21d4b3.da6ccf35.js.map\nui/js/chunk-vendors~c8728516.dd1ccc58.js\nui/js/chunk-vendors~c8728516.dd1ccc58.js.map\nui/js/chunk-vendors~d2305125.6a99a4ff.js\nui/js/chunk-vendors~d2305125.6a99a4ff.js.map\nui/js/chunk-vendors~d9886323.1a1a66f6.js\nui/js/chunk-vendors~d9886323.1a1a66f6.js.map\nui/js/chunk-vendors~db300d2f.ddb8641c.js\nui/js/chunk-vendors~db300d2f.ddb8641c.js.map\nui/js/chunk-vendors~ec8c427e.5c1734c3.js\nui/js/chunk-vendors~ec8c427e.5c1734c3.js.map\nui/js/chunk-vendors~fdc6512a.81c5c86c.js\nui/js/chunk-vendors~fdc6512a.81c5c86c.js.map\nui/js/lang-en-translation-json~9b60384d.182e4fbe.js\nui/js/lang-en-translation-json~9b60384d.182e4fbe.js.map\nui/js/lang-it-translation-json~e043826f.3fae064e.js\nui/js/lang-it-translation-json~e043826f.3fae064e.js.map\nui/metadata.json\nui/shortcuts.json\n619e08db58e8f3259403d0b58f89169dcfc205a4a9b330f9eb1998a75a860cc6\nAssertion failed\n  File \"/var/lib/nethserver/cluster/actions/add-module/50update\", line 202, in <module>\n    agent.assert_exp(create_module_result['exit_code'] == 0) # Ensure create-module is successful\n",
    "exit_code": 2,
    "file": "task/cluster/f4970be3-839a-4f6f-a041-181db02b3cad",
    "output": ""
  }
}

Prometheus worker node /var/log/messages:

May 22 18:24:29 alma2 prometheus2[5451]: Traceback (most recent call last):
May 22 18:24:29 alma2 prometheus2[5451]:  File "/home/prometheus2/.config/actions/create-module/20configure", line 53, in <module>
May 22 18:24:29 alma2 prometheus2[5451]:    logcli = agent.read_envfile("/etc/nethserver/logcli.env")
May 22 18:24:29 alma2 prometheus2[5451]:  File "/usr/local/agent/pypkg/agent/__init__.py", line 89, in read_envfile
May 22 18:24:29 alma2 prometheus2[5451]:    fo = open(file_path, 'r')
May 22 18:24:29 alma2 prometheus2[5451]: FileNotFoundError: [Errno 2] No such file or directory: '/etc/nethserver/logcli.env'
May 22 18:24:29 alma2 prometheus2[5451]: task/module/prometheus2/b1f598fb-2101-481e-9283-aeca9fb17c68: action "create-module" status is "aborted" (1) at step 20configure

I’m pretty sure I tested it a few months ago :thinking:

Anyway, thanks and noted!


@giacomo, what should we do?

I can see that the action ns8-core/core/imageroot/var/lib/nethserver/cluster/actions/create-cluster/70logcli_env at main · NethServer/ns8-core · GitHub creates the file, probably so it can be used locally with the CLI ns8-core/logcli at main · NethServer/ns8-core · GitHub.

But indeed, when you trigger the Prometheus create-module action ns8-prometheus/20configure at main · NethServer/ns8-prometheus · GitHub or the Grafana script ns8-grafana/provision at main · NethServer/ns8-grafana · GitHub, we do not have the file locally.

Why not make a request to Redis across the cluster, like here ns8-core/70logcli_env at c2f91171c0395b782dd3dc287076703e3483b5dd · NethServer/ns8-core · GitHub, instead of reading a local file?
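
Something like this rough, untested sketch in Python. It assumes the agent package exposes a redis_connect() helper usable by module agents, and the Redis key names below are only placeholders, not the real cluster schema:

import agent  # NS8 agent helpers from /usr/local/agent/pypkg

# Untested sketch: resolve the logcli/Loki settings from the cluster Redis DB
# instead of reading /etc/nethserver/logcli.env, which exists only on the leader.
# The key names below are placeholders for wherever the settings really live.
rdb = agent.redis_connect()  # assumption: this helper is available to module agents
loki_id = rdb.get("cluster/default_instance/loki")  # placeholder key, not the real schema
logcli = rdb.hgetall(f"module/{loki_id}/environment") if loki_id else {}
# If logcli stays empty, skip the configuration parts that need it.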

Can we trust that these settings will always be the same for all instances, or must each Loki module have its own credentials?

We use logcli.env in different parts of the code: Sign in to GitHub · GitHub

I’d say no.
We could try to access the /etc/nethserver/logcli.env file and, if it’s not present, just skip the configuration part that uses it.
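
Something along these lines in 20configure and provision (untested sketch; read_envfile() is the same helper that appears in the tracebacks above and raises FileNotFoundError because it just open()s the file):

import agent  # provides read_envfile(), as seen in the tracebacks

# Tolerate a missing logcli.env on worker nodes and let provisioning continue
# without the Loki-related configuration.
try:
    logcli = agent.read_envfile("/etc/nethserver/logcli.env")
except FileNotFoundError:
    logcli = None  # not on the leader node: skip whatever needs logcli below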

You mean that on a worker we do not need to read the file, and we could just exit if it doesn’t exist locally? But from the variables of this file we build YAML files in both the Prometheus and Grafana modules. I’m no expert, but maybe they are mandatory?

I’d just skip the yaml part that uses it.
Not tested :slight_smile:


OK, for Grafana the doc states:

Grafana can be installed only on the leader node. After installation, you will need to configure the Host name with a valid FQDN to access the Grafana instance. Enable Let's Encrypt and HTTP to HTTPS options accordingly to your needs.

However, it’s not elegant; we should add a better warning inside the Software Center.


Relevant to ns8-grafana

Sorry to bother you @giacomo, but we cannot simply skip expanding the config local.yml when /etc/nethserver/logcli.env is missing, because local.yml is required to start the container.

My proposal would be to disable the UI if we cannot find /etc/nethserver/logcli.env, and to require installing it on the worker.

What do you think?

I didn’t mean that, but rather to disable only the relevant yaml part, not the whole file.
In this case, I’d remove these lines.
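
For example, something like this untested illustration: the datasource structure, the output path and the LOKI_* variable names are assumptions based on standard Grafana datasource provisioning and the standard logcli environment variables, not necessarily what the module really writes.

import json
import agent  # provides read_envfile(), as in the tracebacks above

try:
    logcli = agent.read_envfile("/etc/nethserver/logcli.env")
except FileNotFoundError:
    logcli = None  # worker node: no Loki datasource

datasources = []
if logcli:
    # Render the Loki datasource only when the env file is readable.
    datasources.append({
        "name": "Loki",
        "type": "loki",
        "url": logcli.get("LOKI_ADDR", ""),
        "basicAuth": True,
        "basicAuthUser": logcli.get("LOKI_USERNAME", ""),
        "secureJsonData": {"basicAuthPassword": logcli.get("LOKI_PASSWORD", "")},
    })

# YAML is a superset of JSON, so dumping JSON keeps the sketch dependency-free.
with open("datasources.yml", "w") as fp:  # hypothetical output path
    json.dump({"apiVersion": 1, "datasources": datasources}, fp, indent=2)

The idea is just that the Loki entry gets rendered only when the file is readable; everything else in local.yml stays as it is.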

No warning in the UI that Grafana is not installed on the right node?

Yes, we could add it if there is no better solution.
But I’d prefer to overcome the limitation :slight_smile:
