
    FOGScheduler running at 100% CPU + SSH connection flood between nodes

      mashina

      Hello,

      Recently I noticed that snapins associated with a host no longer run, and that FOGScheduler had stopped working. After investigating, I found several issues, listed below.

      FOG Version:
      Upgraded from 1.5.10 → <several 1.6-beta in between> → currently running 1.6.0-beta.2297

      Setup:

      • FOG Server: 172.28.1.80
      • Storage Node: 172.28.1.89

      1) Replicator: falsely reports image files as missing at first, then immediately syncs.

      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1.fixed_size_partitions(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1.mbr(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1.minimum.partitions(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1.original.fstypes(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1.original.swapuuids(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1.partitions(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1.shrunken.partitions(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1p1.img(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1p2.img(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1p3.img(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist d1p4.img(storage 1)
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.fixed_size_partitions on storage 1
      [04-07-26 9:28:10 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.fixed_size_partitions file to storage 1
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.mbr on storage 1
      [04-07-26 9:28:10 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.mbr file to storage 1
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.minimum.partitions on storage 1
      [04-07-26 9:28:10 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.minimum.partitions file to storage 1
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.original.fstypes on storage 1
      [04-07-26 9:28:10 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.original.fstypes file to storage 1
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.original.swapuuids on storage 1
      [04-07-26 9:28:10 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.original.swapuuids file to storage 1
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.partitions on storage 1
      [04-07-26 9:28:10 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.partitions file to storage 1
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1.shrunken.partitions on storage 1
      [04-07-26 9:28:10 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1.shrunken.partitions file to storage 1
      [04-07-26 9:28:10 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p1.img on storage 1
      [04-07-26 9:28:11 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p1.img file to storage 1
      [04-07-26 9:28:11 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p2.img on storage 1
      [04-07-26 9:28:11 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p2.img file to storage 1
      [04-07-26 9:28:11 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p3.img on storage 1
      [04-07-26 9:28:11 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p3.img file to storage 1
      [04-07-26 9:28:11 am]   # peruswin-audit-1.2: File does not exist on master node, deleting /images/peruswin-audit-1.2/d1p4.img on storage 1
      [04-07-26 9:28:11 am]  | peruswin-audit-1.2: No need to sync /images/peruswin-audit-1.2/d1p4.img file to storage 1
      [04-07-26 9:28:11 am]  | CMD: lftp -e 'set xfer:log 1; set xfer:log-file "/opt/fog/log/fogreplicator.peruswin-audit-1.2.transfer.storage 1.log";set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/peruswin-audit-1.2" "/images/peruswin-audit-1.2";exit' -u fogproject,[redacted] 172.28.1.89
      [04-07-26 9:28:11 am]  * Started sync for Image peruswin-audit-1.2 - Resource id #1583
      [04-07-26 9:28:11 am]  | Sync finished - Resource id #602
      

      Observed:

      • Files exist on both server and storage under /images/<image>
      • Verified with find and lftp
      • Image deploys successfully to clients
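One way to double-check the "File does not exist" messages independently of the replicator is to compare plain file listings taken on both nodes. The following is only a minimal sketch: the helper name `missing_on_node` and the listing file names are hypothetical, not part of FOG.

```shell
# Hypothetical helper: print names that appear in the first listing but
# not in the second. Each argument is a file with one file name per line.
missing_on_node() {
    a=$(mktemp) b=$(mktemp)
    # comm requires sorted input; -23 keeps lines unique to the first file.
    sort "$1" > "$a"
    sort "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Assuming GNU find, the listings could be produced with `find /images/peruswin-audit-1.2 -type f -printf '%f\n' > master.txt` on the server and the same command (saved as `node.txt`) on the storage node. `missing_on_node master.txt node.txt` printing nothing would mean every file on the server is also present on the node.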

      2) SSH spam between nodes

      Apr 07 09:42:13 fog sshd[2483177]: error: kex_exchange_identification: Connection closed by remote host
      Apr 07 09:42:13 fog sshd[2483177]: Connection closed by 172.28.1.89 port 55330
      Apr 07 09:42:13 fog sshd[2483178]: error: kex_exchange_identification: Connection closed by remote host
      Apr 07 09:42:13 fog sshd[2483178]: Connection closed by 172.28.1.89 port 55336
      Apr 07 09:42:14 fog sshd[2483179]: error: kex_exchange_identification: Connection closed by remote host
      Apr 07 09:42:14 fog sshd[2483179]: Connection closed by 172.28.1.80 port 34766
      Apr 07 09:42:14 fog sshd[2483180]: error: kex_exchange_identification: Connection closed by remote host
      Apr 07 09:42:14 fog sshd[2483180]: Connection closed by 172.28.1.80 port 34768
      

      Observed:

      • Happens multiple times per second
      • Seen on both server and storage

      Fix / Isolation:

      • Stopping FOGMulticastManager stops the SSH spam
      • Starting it again reproduces the issue
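To put a number on the flood and see which peer opens the connections, the sshd log lines can be counted per source address. This is a rough sketch; the function name `count_closed_by_peer` is hypothetical, and it only understands the "Connection closed by <ip> port <n>" format shown above.

```shell
# Hypothetical helper: read sshd log lines on stdin and print
# "<peer ip> <count>" for every "Connection closed by <ip>" entry.
count_closed_by_peer() {
    awk '/sshd\[[0-9]+\]: Connection closed by / {
        # The peer address is the field right after the word "by".
        for (i = 1; i <= NF; i++) if ($i == "by") { n[$(i + 1)]++; break }
    }
    END { for (ip in n) print ip, n[ip] }' | sort
}
```

Possible usage, assuming the SSH unit is called `ssh` (on some distributions it is `sshd`): `journalctl -u ssh --since "10 minutes ago" | count_closed_by_peer`. Running it once with FOGMulticastManager stopped and once with it started should show the difference in counts.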

      3) FOGMulticastManager creates broken PHP session files (storage node)

      session_start(): open(... Permission denied)
      

      Observed:

      • Permissions on the /var/lib/php/sessions directory are correct:

        drwx-wx-wt root:www-data
        
      • Session files are created as:

        -rw------- 1 root root ...
        
      • Apache/PHP-FPM runs as www-data → cannot access them

      Isolation:

      • Stop:

        systemctl stop FOGScheduler FOGMulticastManager
        
      • Delete sessions:

        find /var/lib/php/sessions -type f -name 'sess_*' -delete
        
      • Errors stop

      • Start only:

        systemctl start FOGMulticastManager
        
      • Errors immediately return
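To see at a glance which session files the web server cannot open, the files can be filtered by owner. A small sketch, with a hypothetical helper name `foreign_sessions`; it has to run as root because the sessions directory is not listable by other users.

```shell
# Hypothetical helper: list session files in directory $1 that are NOT
# owned by user $2 (a user name or a numeric UID).
foreign_sessions() {
    find "$1" -type f -name 'sess_*' ! -user "$2"
}
```

For the setup described here the call would be `foreign_sessions /var/lib/php/sessions www-data`. If the list fills up again right after `systemctl start FOGMulticastManager`, `systemctl show -p User FOGMulticastManager` shows which account the service runs under (an empty `User=` means root).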


      4) Power Management warnings

      Undefined array key "pmAction"
      

      Observed:

      • Many hosts have no row in the powerManagement table

      Fix:

      • Disabling Power Management in FOG settings stops the warnings
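The hosts without a power management row could be listed with a query along these lines. This is a sketch under loud assumptions: only the `powerManagement` table name comes from the observation above, while the `hosts` table and the `hostID`, `hostName` and `pmHostID` columns are guesses at the schema and should be checked with `DESCRIBE` first.

```shell
# Hypothetical helper: print a query that lists hosts with no matching
# row in powerManagement.
# ASSUMPTION: hosts(hostID, hostName) and powerManagement.pmHostID are
# guessed names; verify them against the real schema before use.
hosts_without_pm_sql() {
    cat <<'SQL'
SELECT h.hostID, h.hostName
FROM hosts AS h
LEFT JOIN powerManagement AS pm ON pm.pmHostID = h.hostID
WHERE pm.pmHostID IS NULL;
SQL
}
```

It could be run as `hosts_without_pm_sql | mysql -u root -p fog`, where the database name `fog` is also an assumption.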

      5) Scheduler tasks do not run

      Observed:

      • Scheduled tasks do not execute unless FOGScheduler is restarted
      • After a restart, tasks run, but the scheduler later stalls again
      • New tasks are not picked up
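To tell a stalled scheduler apart from a stopped one, the age of its log file can be compared with the service state. A minimal sketch, assuming GNU stat; the helper name `file_age_seconds` is hypothetical.

```shell
# Hypothetical helper: print how many seconds ago file $1 was last
# modified. Relies on GNU stat (-c %Y prints the mtime as epoch seconds).
file_age_seconds() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $(( now - mtime ))
}
```

For example `file_age_seconds /opt/fog/log/fogscheduler.log`, where the log file name is an assumption (only the /opt/fog/log directory appears in the replicator output above). A large value while `systemctl is-active FOGScheduler` still reports `active` would point to a stall rather than a crash.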

      6) Snapins do not execute

      Observed:

      • Snapins can be assigned to hosts
      • Running a snapin on a single host it is associated with fails
      • The same snapin runs when it is deployed to a group

      7) High CPU usage (PHP)

      php (root) ~100% CPU
      

      Observed:

      • High CPU usage on both server and storage
      • Drops when stopping FOGScheduler
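To find out which PHP process is the busy one, the output of `ps` can be filtered for PHP command lines above a CPU threshold. A rough sketch; the function name `busy_php` is hypothetical and it expects the column layout of `ps -eo pid,user,pcpu,args`.

```shell
# Hypothetical helper: read `ps -eo pid,user,pcpu,args` output on stdin
# and print the lines whose command is a php binary using at least
# $1 percent CPU (default 90). The first line is skipped as the header.
busy_php() {
    threshold=${1:-90}
    awk -v t="$threshold" \
        'NR > 1 && $3 + 0 >= t && $4 ~ /(^|\/)php[0-9.]*$/ { print }'
}
```

Usage would be `ps -eo pid,user,pcpu,args | busy_php 90`; the script path after the interpreter in the output shows which service the process belongs to. Note that `pcpu` in ps is averaged over the lifetime of the process, so a freshly restarted service can look busier or calmer than it currently is.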

      Additional notes

      • Manual SSH from server → storage using fogproject works
      • FTP (lftp) can list image files correctly
      • Installer has been re-run on both nodes after update