I woke up this morning and suddenly none of my internet-facing services were working. Since they run on multiple servers, this was a surprise. I thought maybe something had gone wrong with the router, but alas, after spending half the day troubleshooting, the only remaining suspect is Traefik.
I’ve run all the NS8 updates and am on the latest core as of today. Traefik shows no errors in the log. However, I can go to the LAN address of a service and it loads perfectly. When I try to connect to the same service using its DNS name, it fails to load. This happens even if I eliminate the router and put the DNS name in a hosts file pointed at the NS8 server.
Hi guys. Don’t wish to be a pain here, but all of my services are out. This happened without me adjusting/touching/doing ANYTHING. I’ve traced the problem to Traefik. I suppose I could go hunt around on Traefik forums, but I can’t stop thinking about…
Because NS8 chooses to do things its own way, me poking about in Traefik config files is as likely to do damage as to help. Surely someone has some suggestions about where I can look to get this sorted?
Ok. Disregard. Suddenly it’s all working again… I would still love to know where to look next time this happens. If there is a doc on Traefik or a forum post I’ve missed that I need to read up on, I’d be grateful to whoever points me in the right direction.
Traefik appears to be working for a few minutes after a restart. Then it quickly fails in some way I’ve yet to determine.
Here’s some additional info in hopes that it will draw out some ideas and suggestions:
Traefik acts as a reverse proxy for six services:
email (running via the NS8 module)
calibre-web (a docker container on OMV6 - same proxmox as NS8)
immich (hosted on a separate RPi without any virtualization)
emby (its own VM on the same proxmox as NS8)
wordpress 1 (hosted on NS8 - wordpress module)
wordpress 2 (hosted on NS8 - wordpress module)
I can connect to calibre-web, immich and emby using their internal addresses. They respond instantly. However, if I enter the internet-facing addresses that Traefik is supposed to be routing, the request hangs for a very long time and then times out.
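To put numbers on that difference, here is a minimal Python sketch that requests the same service once directly and once through its proxied name, with a short timeout so a hang shows up in seconds. The function names `probe` and `compare`, and any addresses passed to them, are illustrative assumptions and not something NS8 or Traefik provides.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Fetch url once and return (outcome, seconds).

    outcome is the HTTP status code if the server answered,
    otherwise a short string describing the failure.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            outcome = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success status.
        outcome = exc.code
    except urllib.error.URLError as exc:
        # Refused connection, DNS failure, connect timeout, TLS error...
        outcome = f"URLError: {exc.reason}"
    except OSError as exc:
        # For example a timeout while reading the response.
        outcome = type(exc).__name__
    return outcome, time.monotonic() - start


def compare(direct_url, proxied_url, timeout=5.0):
    """Probe the internal address and the proxied name of one service."""
    return {
        "direct": probe(direct_url, timeout),
        "proxied": probe(proxied_url, timeout),
    }
```

For example, `compare("http://192.168.1.20:8096/", "https://emby.example.com/")` (both addresses hypothetical) returns one result per path. A fast `200` on `direct` next to an error string on `proxied` points at the proxy rather than the service.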
I am unable to connect to email or either wordpress site. I suppose I could try to create an SSH tunnel and bypass Traefik, but I feel like I’ve already isolated the problem.
As you can see in the log snippet above, the Traefik logs are almost entirely heartbeats for a partially migrated nextcloud server. Since that migration won’t complete and I need to see what’s really happening, I’ve tried to remove that instance, but uninstall failed.
I re-sorted the services reverse-proxied by Traefik (italic for NS8-hosted),
then sorted them by closeness:
-same host, different guest
-same host, container on a different guest
-another host
Please consider temporarily deleting one service at a time, starting from immich (excluding NS8 modules), to diagnose whether any one of them is causing the issue.
When you find the “breaking” host, avoid adding it back, and try with the others.
@pike you are amazing! I still don’t understand the cause. However, I removed all of the non-NS8 traefik entries as well as all NS8 modules that were not currently in use. The NS8 modules were immediately accessible again.
Then I slowly added back the services in Traefik. So far, everything is working. I will monitor for a few hours and report back.
Here’s my question – why would bad behavior at the far end of a reverse proxy prevent access to other proxied resources?
Personal opinion
I’d bet the Traefik configuration is well tested, polished and verified for every NS8 module, both for behaviour and for application compatibility. The dev team is skilled and careful about avoiding issues for adopters and customers.
Then… there’s the whole world of web applications out there that may or may not work with Traefik, and that could need some tuning and testing for any other “foreign host” (from Traefik’s perspective).
Last but not least: foreign hosts, even on the LAN, might need specific settings, including for timing/timeouts. And if a service doesn’t respond in a timely manner… maybe Traefik doesn’t like that very much?
IDK.
Keep updating this thread as your tests proceed.
Hmm… Traefik dropped again. So, once more, I removed all of the non-NS8 entries. Unfortunately, this time I’m still unable to access the NS8 resources via Traefik.
I thought perhaps my brief success was related to removing some unnecessary NS8 modules, so I reinstalled and removed Webserver – taking the time to create a fake virtual host entry and be sure it all populated into Traefik. No luck. Nothing I’ve done has restored access.
Ok. More poking. More information. I’ve learned a bit more about Traefik in NS8. It’s actually what’s providing the cluster-admin interface – which means that Traefik is working for some things.
If someone understands the process and could help me trace the distinctions between the api and the app modules that would be helpful. I went ahead and started digging into the configs. They’re located (by default) in /home/traefik1/.config/state/configs
Cluster Admin appears to be set up in the _api_server.yml file:
I’ve no idea what this one does. Note, that odd space on the first line is how the file is on my server. Is that correct?
I’m going to skip the certificate files for now and just compare an NS8 module to one of my other entries (again, I can remove or add these with no effect).
Here's a manually created entry for Emby: Emby.yml
Continuing in my efforts to reverse engineer how Traefik works in NS8…
I created a new node, added it to my cluster and moved one of my wordpress sites over. Of course it works flawlessly on the new node.
Comparing Traefik configs on the new node, _api_server.yml is line for line identical. _api.yml contains the same odd starting carriage return on the new node and is the same except that the prefix codes are different.
So far, this remains completely unhelpful as to why one works and the other does not. I did run the core update yesterday, which included a bumped version of Traefik. This made zero difference in my problem.
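The file-by-file comparison between the two nodes can also be scripted, which makes it easier to repeat after each change. This is a minimal sketch using Python’s `difflib`, assuming the `configs` directory from each node has been copied somewhere readable; the function names and the directory arguments are hypothetical.

```python
import difflib
from pathlib import Path


def diff_configs(old_path, new_path):
    """Return the unified diff between two text files as a list of lines."""
    old = Path(old_path).read_text().splitlines(keepends=True)
    new = Path(new_path).read_text().splitlines(keepends=True)
    return list(
        difflib.unified_diff(old, new, fromfile=str(old_path), tofile=str(new_path))
    )


def diff_config_dirs(old_dir, new_dir):
    """Compare every *.yml file found in either directory.

    Returns {filename: lines}. Identical files are left out; files that
    exist on one side only get a single "only in ..." line.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    names = sorted(
        {p.name for p in old_dir.glob("*.yml")}
        | {p.name for p in new_dir.glob("*.yml")}
    )
    report = {}
    for name in names:
        old_file, new_file = old_dir / name, new_dir / name
        if not old_file.exists():
            report[name] = [f"only in {new_dir}"]
        elif not new_file.exists():
            report[name] = [f"only in {old_dir}"]
        else:
            changes = diff_configs(old_file, new_file)
            if changes:
                report[name] = changes
    return report
```

Whitespace differences, such as an extra blank first line in one copy of a file, show up in the diff like any other change, so an empty report means the two directories match exactly.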
Someone that knows this software, please help! I’d love to say I’ve screwed up somewhere. I think I’ve done that on this forum many times. But, a running system which I didn’t touch any part of suddenly stopped working. My mail server and several internet facing websites have been broken for more than a week, and I’ve no idea how to solve it.
Surely someone knows what part of Traefik can change on its own? That would be a starting point at least.
Next idea… I set up a simple Nginx reverse proxy on a separate machine to handle incoming http and https requests, thus allowing me to bypass Traefik for non-NS8 modules.
This works great for all of the non-NS8 services. However, Traefik won’t play nice with the reverse proxy… at least not as far as I can tell. Sites on NS8 simply dump to root on the nginx machine.
Is there any way to address NS8 modules outside of their 127.0.0.1 addressing?
Alright, after further testing, I have all of my non-NS8 sites working using an external nginx proxy.
However, the nginx proxy gets a 404 error when it tries to contact sites still on NS8, so those sites remain trapped on my NS8 box with the broken Traefik. No longer trusting Traefik, I went ahead and migrated the sites I could away from NS8.
That said, my email server and one of my websites are still hostage, inaccessible on my NS8 box. I was unsuccessful migrating them to a new cluster node. I’ll try that again later today.
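One possible cause of the 404 seen through the nginx proxy, not confirmed anywhere in this thread: Traefik usually picks a router by the requested host name and answers 404 when no router matches, so a proxy that forwards the request under a different `Host` header would produce exactly that. A minimal Python sketch to test the idea sends a request to the server’s IP while presenting a chosen host name. The function name, the IP and the site names are hypothetical.

```python
import http.client


def status_with_host(ip, host_header, port=80, path="/", timeout=5.0):
    """Send one GET to ip:port, presenting host_header as the site name.

    Returns the HTTP status code of the response.
    """
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # Passing an explicit Host header replaces the default one,
        # which would otherwise be derived from the ip argument.
        conn.request("GET", path, headers={"Host": host_header})
        return conn.getresponse().status
    finally:
        conn.close()
```

For instance, if `status_with_host("192.168.1.10", "blog.example.com")` returns something other than 404 while `status_with_host("192.168.1.10", "192.168.1.10")` returns 404, that would suggest the proxy needs to pass the original host name through to Traefik.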
Moving them to a new cluster node does solve the problem.
But this isn’t an acceptable solution. If, at any time, without being touched, Traefik can break and the only solution is to turn up a new server and migrate all services…
The complete radio silence from everyone except @pike is also quite shocking.
Apparently you didn’t read the scope of the problem above. As Ted so clearly pointed out, the issue is the traefik reverse proxy routing doesn’t seem to be working for him, and I’m worried I have the same problem.
So, very sorry for the confusion. I would assume that some context could have been gleaned from the statements above about the reverse proxy routing being broken. I also don’t have access to the cluster-admin page.