Fog Scheduler running at 100% CPU + SSH connection flood between nodes
Hello,
Recently, I noticed that I cannot run any snapins that were associated with the host. I also noticed the FOGScheduler was not working. After investigating, I found several issues.
FOG Version:
Upgraded from 1.15.10 → <several 1.6-beta in between> → currently running 1.6.0-beta.2297
Setup:
- FOG Server: 172.28.1.80
- Storage Node: 172.28.1.89
1) Replicator: falsely reports image files as missing at first, then immediately syncs.
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1.fixed_size_partitions(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1.mbr(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1.minimum.partitions(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1.original.fstypes(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1.original.swapuuids(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1.partitions(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1.shrunken.partitions(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1p1.img(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1p2.img(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1p3.img(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist d1p4.img(storage 1)
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.fixed_size_partitions on storage 1
[04-07-26 9:28:10 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.fixed_size_partitions file to storage 1
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.mbr on storage 1
[04-07-26 9:28:10 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.mbr file to storage 1
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.minimum.partitions on storage 1
[04-07-26 9:28:10 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.minimum.partitions file to storage 1
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.original.fstypes on storage 1
[04-07-26 9:28:10 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.original.fstypes file to storage 1
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.original.swapuuids on storage 1
[04-07-26 9:28:10 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.original.swapuuids file to storage 1
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.partitions on storage 1
[04-07-26 9:28:10 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.partitions file to storage 1
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.shrunken.partitions on storage 1
[04-07-26 9:28:10 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.shrunken.partitions file to storage 1
[04-07-26 9:28:10 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p1.img on storage 1
[04-07-26 9:28:11 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p1.img file to storage 1
[04-07-26 9:28:11 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p2.img on storage 1
[04-07-26 9:28:11 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p2.img file to storage 1
[04-07-26 9:28:11 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p3.img on storage 1
[04-07-26 9:28:11 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p3.img file to storage 1
[04-07-26 9:28:11 am] # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p4.img on storage 1
[04-07-26 9:28:11 am] | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p4.img file to storage 1
[04-07-26 9:28:11 am] | CMD: lftp -e 'set xfer:log 1; set xfer:log-file "/opt/fog/log/fogreplicator.peruswin-audit-1.2.transfer.storage 1.log";set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/peruswin-audit-1.2" "/images/peruswin-audit-1.2";exit' -u fogproject,[redacted] 172.28.1.89
[04-07-26 9:28:11 am] * Started sync for Image peruswin-audit-1.2 - Resource id #1583
[04-07-26 9:28:11 am] | Sync finished - Resource id #602
Observed:
- Files exist on both server and storage under /images/<image>
- Verified with find and lftp
- Image deploys successfully to clients
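One way to repeat the "files exist" check on each node is a small helper like the sketch below. It is only an illustration: `check_image_files` is a hypothetical name, and the file list is copied from the replicator log above, so other images may have a different set of partition files.

```shell
#!/bin/sh
# Report which of the files named in the replicator log exist in an image directory.
# Usage: check_image_files /images/<image>
check_image_files() {
    dir=$1
    missing=0
    for f in d1.fixed_size_partitions d1.mbr d1.minimum.partitions \
             d1.original.fstypes d1.original.swapuuids d1.partitions \
             d1.shrunken.partitions d1p1.img d1p2.img d1p3.img d1p4.img; do
        if [ -e "$dir/$f" ]; then
            echo "present $f"
        else
            echo "MISSING $f"
            missing=$((missing + 1))
        fi
    done
    # Exit status is non-zero when at least one file is absent.
    [ "$missing" -eq 0 ]
}
```

Running `check_image_files /images/peruswin-audit-1.2` on both the server and the storage node should print only `present` lines if the replicator's "File does not exist" messages are false.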
2) SSH spam between nodes
Apr 07 09:42:13 fog sshd[2483177]: error: kex_exchange_identification: Connection closed by remote host
Apr 07 09:42:13 fog sshd[2483177]: Connection closed by 172.28.1.89 port 55330
Apr 07 09:42:13 fog sshd[2483178]: error: kex_exchange_identification: Connection closed by remote host
Apr 07 09:42:13 fog sshd[2483178]: Connection closed by 172.28.1.89 port 55336
Apr 07 09:42:14 fog sshd[2483179]: error: kex_exchange_identification: Connection closed by remote host
Apr 07 09:42:14 fog sshd[2483179]: Connection closed by 172.28.1.80 port 34766
Apr 07 09:42:14 fog sshd[2483180]: error: kex_exchange_identification: Connection closed by remote host
Apr 07 09:42:14 fog sshd[2483180]: Connection closed by 172.28.1.80 port 34768
Observed:
- Happens multiple times per second
- Seen on both server and storage
Fix / Isolation:
- Stopping FOGMulticastManager stops the SSH spam
- Starting it again reproduces the issue
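To put numbers on the flood before and after stopping FOGMulticastManager, the sshd lines can be tallied per source address. The helper below is a rough sketch; `count_closed_by_ip` is a hypothetical name, and it assumes the log format shown above.

```shell
#!/bin/sh
# Tally sshd "Connection closed by <ip>" lines per source address.
# Reads log text on stdin; prints "<count> <ip>" lines, busiest first.
count_closed_by_ip() {
    awk '/sshd\[[0-9]+\]: Connection closed by / {
             # The address is the field right after the word "by".
             for (i = 1; i <= NF; i++)
                 if ($i == "by") { n[$(i + 1)]++; break }
         }
         END { for (ip in n) print n[ip], ip }' | sort -k1,1nr -k2,2
}
```

For example, `journalctl -u ssh --since "10 minutes ago" | count_closed_by_ip` could be run on each node. The unit name is an assumption here; on some distributions it is `sshd` instead of `ssh`.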
3) FOGMulticastManager creates broken PHP session files (storage node)
session_start(): open(... Permission denied)
Observed:
- /var/lib/php/sessions directory is correct: drwx-wx-wt root:www-data
- Session files are created as: -rw------- 1 root root ...
- Apache/PHP-FPM runs as www-data → cannot access them
Isolation:
- Stop: systemctl stop FOGScheduler FOGMulticastManager
- Delete sessions: find /var/lib/php/sessions -type f -name 'sess_*' -delete
- Errors stop
- Start only: systemctl start FOGMulticastManager
- Errors immediately return
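The isolation steps above can be backed up by listing the session files owned by root before and after starting the service. This is a minimal sketch; `sessions_owned_by_uid` is a hypothetical helper, and it relies on the `-maxdepth` and `-uid` tests found in GNU and BSD `find`.

```shell
#!/bin/sh
# List PHP session files in a directory that are owned by a given numeric UID.
# Usage: sessions_owned_by_uid /var/lib/php/sessions 0    (0 = root)
sessions_owned_by_uid() {
    find "$1" -maxdepth 1 -type f -name 'sess_*' -uid "$2" -print
}
```

Because the sessions directory is not readable by ordinary users (drwx-wx-wt), this needs to be run as root, for example `sessions_owned_by_uid /var/lib/php/sessions 0 | wc -l` right after `systemctl start FOGMulticastManager`.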
4) Power Management warnings
Undefined array key "pmAction"
Observed:
- Many hosts have no row in powerManagement table
Fix:
- Disabling Power Management in FOG settings stops the warnings
5) Scheduler tasks do not run
Observed:
- Scheduled tasks do not execute unless FOGScheduler is restarted
- After restart, tasks run, but later scheduler stalls again
- New tasks are not picked up
6) Snapins do not execute
Observed:
- Snapins can be assigned
- Running a snapin on a single host that it is associated with fails
- Running a snapin on a group works
7) High CPU usage (PHP)
php (root) ~100% CPU
Observed:
- High CPU usage on both server and storage
- Drops when stopping FOGScheduler
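To see which php process is the busy one, and what script it is running, the output of `ps` can be filtered by CPU share. The filter below is a sketch only; `busy_php` is a hypothetical name, and it expects input lines in the order pid, user, %CPU, command line.

```shell
#!/bin/sh
# Print php processes at or above a CPU threshold (in percent).
# Reads "pid user pcpu args..." lines on stdin, e.g. from ps.
busy_php() {
    awk -v min="$1" '$3 + 0 >= min + 0 && $4 ~ /(^|\/)php/ { print }'
}
```

For example, `ps -eo pid=,user=,pcpu=,args= | busy_php 50` lists php processes using 50% CPU or more together with their full command line, which should show whether the FOGScheduler service script is the one involved.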
Additional notes
- Manual SSH from server → storage using fogproject works
- FTP (lftp) can list image files correctly
- Installer has been re-run on both nodes after update