this post was submitted on 15 Mar 2024

I run an old desktop mainboard as my homelab server. It runs Ubuntu smoothly at loads between 0.2 and 3 (whatever unit that is).

Problem:
Occasionally, the CPU load skyrockets above 400 (yes, really), making the machine totally unresponsive. The only fix is the reset button.

Solution:

  • I haven't found the cause, but I think a reboot every few days would prevent it from ever happening. That could be done easily with a crontab line.
  • Alternatively, I would like a dead-simple script running in the background that watches the CPU load and reboots the machine when the load climbs over a given threshold.

--> How could such a cpu-load-triggered reboot be implemented?


edit: I asked ChatGPT to help me create a script that crontab starts every X minutes. The script has a kill threshold that does a kill -9 on the top process, and a higher reboot threshold that ... reboots the machine. Before doing either, or neither, it writes a log line. I hope this will keep my system running, and I will review the log file to see how it fares. Or it might inexplicably break my system. Fun!
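
For readers who want to see the shape of such a script, here is a minimal sketch in Python of the two-threshold idea described in the edit. It is not the script ChatGPT produced. The threshold values, the log path, and the function names are all assumptions made for illustration, and `dry_run` defaults to `True` so that nothing is killed or rebooted until you change it.

```python
import os
import signal
import subprocess
import time

KILL_THRESHOLD = 50.0     # assumed value, tune for your machine
REBOOT_THRESHOLD = 200.0  # assumed value, should be above KILL_THRESHOLD
LOG_PATH = "/var/log/loadguard.log"  # hypothetical log location


def decide(load, kill_at=KILL_THRESHOLD, reboot_at=REBOOT_THRESHOLD):
    """Return 'reboot', 'kill', or 'none' for a 1-minute load average."""
    if load >= reboot_at:
        return "reboot"
    if load >= kill_at:
        return "kill"
    return "none"


def log_line(load, action, now=None):
    """Format one log entry; the timestamp is in local time."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"{stamp} load={load:.2f} action={action}"


def top_pid():
    """PID of the process that ps ranks first by CPU usage."""
    out = subprocess.run(["ps", "-eo", "pid=", "--sort=-pcpu"],
                         capture_output=True, text=True, check=True).stdout
    return int(out.split()[0])


def check_once(dry_run=True, log_path=LOG_PATH):
    """One cron invocation: measure, log, then act unless dry_run is set."""
    load = os.getloadavg()[0]  # 1-minute load average
    action = decide(load)
    with open(log_path, "a") as log:
        log.write(log_line(load, action) + "\n")
    if dry_run or action == "none":
        return action
    if action == "kill":
        os.kill(top_pid(), signal.SIGKILL)  # same effect as kill -9
    else:
        subprocess.run(["systemctl", "reboot"], check=False)
    return action
```

A cron job would call `check_once(dry_run=False)` as root. Note that the CPU figure `ps` sorts by is averaged over each process's lifetime, so the "top process" may not be the one causing the current spike. Reviewing the log in dry-run mode first seems the safer order.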

top 2 comments
[–] h3ndrik@feddit.de 0 points 6 months ago* (last edited 6 months ago)

The answer is to create a short script that periodically queries the load, makes a decision, and then triggers a reboot. Run it as a systemd service and give it the privileges to do the reboot. Useful languages for the script would be Bash or Python.
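
As a rough illustration of that approach, the following Python sketch polls the 1-minute load average and reboots only after several consecutive samples exceed a threshold. The threshold, the polling interval, the "three samples in a row" rule, and the function names are assumptions for the example, not something prescribed in this thread. `os.getloadavg()` and `systemctl reboot` are the only system interfaces used.

```python
import os
import subprocess
import time


def should_reboot(samples, threshold, needed):
    """True if the last `needed` samples all exceed the threshold."""
    recent = samples[-needed:]
    return len(recent) == needed and all(s > threshold for s in recent)


def watch(threshold=100.0, interval=30, needed=3,
          read_load=lambda: os.getloadavg()[0],
          reboot=lambda: subprocess.run(["systemctl", "reboot"], check=False)):
    """Poll the load every `interval` seconds and reboot on sustained overload.

    read_load and reboot are injectable so the loop can be tried out
    without touching the real machine.
    """
    samples = []
    while True:
        samples.append(read_load())
        samples = samples[-needed:]  # keep only the most recent samples
        if should_reboot(samples, threshold, needed):
            reboot()
            return
        time.sleep(interval)
```

The service's entry point would simply call `watch()`. Requiring several samples in a row is one way to avoid rebooting on a short, harmless spike. Whether the script still gets CPU time once the load is in the hundreds is a separate question, so a lower threshold may be more realistic.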

It's a silly way to handle it, though. You're probably quicker and better off solving the actual issue, because this isn't normal behavior. Have a look at the logs, or install monitoring software like netdata to get to the root of it. It's probably some software you installed that is looping, or leaking memory and then swapping and hogging the IO until the OOM killer kicks in. All of that will show up in the logs. And if it's a leak, you'll see the memory graphs slowly rising in netdata.

journalctl -b -1 shows you messages from the previous boot. (To debug after you've pressed reset.) You can use a pastebin service to ask for more help if you can't make sense of the output.

Other solutions: Some server boards have dedicated hardware for this, a watchdog that detects a hung system and resets it.

You can solder a microcontroller (an ESP32 with wifi) to the reset button and program that to be a watchdog.

Edit: In my experience, it usually takes a similar amount of effort either to dig in and solve the underlying problem once and for all, or to write scripts around it as a band-aid. With the band-aid, though, the issue is still there, and you're bound to spend additional time on it once side effects and quirks show up.

[–] cron@feddit.de 0 points 6 months ago

Just as a side note, a high load average can also mean that processes are limited by IO:

Unix systems traditionally just counted processes waiting for the CPU, but Linux also counts processes waiting for other resources -- for example, processes waiting to read from or write to the disk.

Source
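
To check whether a high load comes from IO waits rather than CPU work, one option is to compare the load average with the number of processes in each scheduler state. On Linux, state `R` means runnable and state `D` means uninterruptible sleep, which usually indicates waiting on disk or other IO. The sketch below reads `/proc/loadavg` and `/proc/<pid>/stat`. The function names are made up for this example, and the count is approximate because it looks at processes, not individual threads.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_state(stat_text):
    """Extract the one-letter state from /proc/<pid>/stat text."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')'.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(proc_dir="/proc"):
    """Count processes per state, e.g. {'S': 120, 'R': 2, 'D': 5}."""
    counts = {}
    for name in os.listdir(proc_dir):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_dir, name, "stat")) as f:
                state = parse_state(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        counts[state] = counts.get(state, 0) + 1
    return counts
```

If `count_states()` shows many processes in state `D` while the load is high, the bottleneck is more likely the disk or swap than the CPU.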