Replication has stopped after upgrade
-
After digging a little deeper, it appears that all logs in the log viewer stopped at the 29th of last month, except for apache logs. All of the logs for the storage nodes say “null”.
I’ve verified that both MySQL and FTP passwords for everything are correct.
-
@moses As far as I can tell those messages in the apache error logs are not related to the replication issue. But we should still get that fixed!! I remember talking to Tom because I saw the exact same errors on my virtual machine setup here - this is kind of special as I am using a tap network device and FOG wasn’t happy with that. We fixed it but this might have broken it for others. Can you please post the output of
ip addr
on your FOG server?
@Tom-Elliott Should we use /sys/class/net/*/operstate instead of calling ip/ifconfig?
About the replication: Tom and Wayne have been working on this lately, but I don’t think it’s the same issue. Please check if the services/daemons on your FOG server are all running:
ps ax | grep FOG
(maybe they fail on startup/restart?!)
-
@Sebastian-Roth This is all I got as output from ps:
 1341 ?        S      0:00 /usr/bin/php -q /opt/fog/service/FOGTaskScheduler/FO TaskScheduler
 1364 ?        S      0:00 /usr/bin/php -q /opt/fog/service/FOGPingHosts/FOGPingHosts
 9595 ?        Z      0:00 [FOGPingHosts] <defunct>
 9596 ?        Z      0:00 [FOGTaskSchedule] <defunct>
10088 pts/1    S+     0:00 grep --color=auto FOG
So perhaps multiple services are failing on start? Is there any way to find logs that would indicate why?
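The two <defunct> entries above are zombie processes (state Z): they have already exited, but their parent has not collected their exit status yet. A minimal sketch for picking out only those rows, assuming the default `ps ax` column layout shown above (PID, TTY, STAT, TIME, COMMAND); the function name `list_defunct` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only zombie (state Z) rows that mention FOG.
# Expects "PID TTY STAT TIME COMMAND..." lines on stdin, as printed by `ps ax`,
# so it works on live output as well as on a saved copy.
list_defunct() {
    awk '$3 ~ /^Z/ && /FOG/ { print }'
}

# Usage:
#   ps ax | list_defunct
```

This only shows which services died; it does not say why they exited.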
-
@Sebastian-Roth Here’s the output of ip addr:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 9c:b6:54:f0:3e:0d brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.66/24 brd 192.168.1.255 scope global em1
       valid_lft forever preferred_lft forever
    inet6 fe80::9eb6:54ff:fef0:3e0d/64 scope link
       valid_lft forever preferred_lft forever
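For comparison with the ip addr output above, the /sys/class/net/*/operstate files mentioned earlier in the thread can be read directly. This is only a sketch of that idea, not what FOG itself does; the function name `link_states` and its optional directory argument are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<interface> <operstate>" for every interface.
# The optional argument only exists so the function can be pointed at a
# test directory instead of the real /sys/class/net.
link_states() {
    base=${1:-/sys/class/net}
    for dir in "$base"/*; do
        # Skip entries that have no readable operstate file.
        [ -r "$dir/operstate" ] || continue
        printf '%s %s\n' "${dir##*/}" "$(cat "$dir/operstate")"
    done
}

# Usage:
#   link_states
```

The reported values should correspond to the "state" field that ip prints for each interface.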
-
@Sebastian-Roth After updating this morning, the services now appear to be running (they are not running after a restart), but replication still isn’t actually taking place, nor has the log for the Image Replicator changed from my original post.
-
@moses Can you check that the storage nodes all have their proper network interface set correctly? A wrong interface would explain the errors and would also cause issues with the services, as they’re waiting for the interface to become available.
-
@Tom-Elliott I checked the setting in Storage Mgmt > Storage Nodes and they all match what is shown using ifconfig on the nodes through ssh.
-
@moses If you manually stop and start the services, do things work?
-
@Tom-Elliott Nope, though the behavior there is kind of odd:
If I restart, it says “failed”.
If I stop, then start, it says “OK” for both, but nothing in the logs, and an image that is supposed to be replicating is not.
-
@moses So you’re running a variant of Ubuntu? Just guessing.
-
@Tom-Elliott Ubuntu 14.04
-
@moses Can you run:
sudo service vsftpd stop
sudo service FOGImageReplicator stop
sudo service FOGSnapinReplicator stop
sudo service FOGPingHosts stop
sudo service FOGMulticastManager stop
sudo service FOGScheduler stop
sleep 5
sudo service vsftpd start
sudo service FOGImageReplicator start
sudo service FOGSnapinReplicator start
sudo service FOGPingHosts start
sudo service FOGMulticastManager start
sudo service FOGScheduler start
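The same stop, wait, start sequence can be written as a loop, which is easier to rerun. This is a sketch only: the service names are copied from the commands above, while `restart_fog_services`, `SERVICE_CMD` and `RESTART_PAUSE` are hypothetical names introduced here.

```shell
#!/bin/sh
# Service names are copied from the commands above.
FOG_SERVICES="vsftpd FOGImageReplicator FOGSnapinReplicator FOGPingHosts FOGMulticastManager FOGScheduler"

# SERVICE_CMD and RESTART_PAUSE are hypothetical overrides, useful for a dry run.
restart_fog_services() {
    cmd=${SERVICE_CMD:-sudo service}
    # Stop everything first ($cmd is left unquoted on purpose so that
    # "sudo service" splits into two words).
    for svc in $FOG_SERVICES; do
        $cmd "$svc" stop
    done
    sleep "${RESTART_PAUSE:-5}"
    # Then start everything again in the same order.
    for svc in $FOG_SERVICES; do
        $cmd "$svc" start
    done
}

# Dry run that only prints what would be executed:
#   SERVICE_CMD=echo restart_fog_services
```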
-
@Tom-Elliott okay, did that:
administrator@SVR-HQ-IMAGING:~$ sudo service vsftpd stop
vsftpd stop/waiting
administrator@SVR-HQ-IMAGING:~$ sudo service FOGImageReplicator stop
 * Stopping FOG Computer Imaging Solution: FOGImageReplicator [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGSnapinReplicator stop
 * Stopping FOG Computer Imaging Solution: FOGSnapinReplicator [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGPingHosts stop
 * Stopping FOG Computer Imaging Solution: FOGPingHosts [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGMulticastManager stop
 * Stopping FOG Computer Imaging Solution: FOGMulticastManager [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGScheduler stop
 * Stopping FOG Computer Imaging Solution: FOGScheduler [ OK ]
administrator@SVR-HQ-IMAGING:~$ sleep 5
administrator@SVR-HQ-IMAGING:~$ sudo service vsftpd start
vsftpd start/running, process 9607
administrator@SVR-HQ-IMAGING:~$ sudo service FOGImageReplicator start
 * Starting FOG Computer Imaging Solution: FOGImageReplicator [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGSnapinReplicator start
 * Starting FOG Computer Imaging Solution: FOGSnapinReplicator [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGPingHosts start
 * Starting FOG Computer Imaging Solution: FOGPingHosts [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGMulticastManager start
 * Starting FOG Computer Imaging Solution: FOGMulticastManager [ OK ]
administrator@SVR-HQ-IMAGING:~$ sudo service FOGScheduler start
Still no change, however (no replication or log changes).
-
I just updated to the latest, then deleted all images on my slave node, and quickly rebooted both master and slave machines… we’ll see how it goes.
-
@Wayne-Workman See anything on that slave? Trying to determine if this is just an issue somewhere on the master I’m running.
-
@moses It replicated every image back perfectly. There’s something with your setup. Keep in mind it could still be FOG-related though, such as a credentials issue, it could be firewall related, or SELinux related. You might even be out of space on the storage node?? There are many possibilities.
I know you have said you’ve verified the FTP credentials, but let’s make doubly sure? There are instructions here for testing it: https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_FTP
Basically you should ssh into your main and try to open an FTP connection to the other nodes using the user/pass they have set in their respective listing under storage management.
Then do the opposite: from the nodes, try to open an FTP connection to the main using the main’s credentials that are set in its respective Storage Management listing.
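A non-interactive version of that check could look like the sketch below. It uses curl rather than the interactive ftp client, which is a substitution made here and not necessarily what the wiki page describes, and it assumes curl is installed with FTP support. The function name `check_ftp` is hypothetical, and the host, user and password are placeholders for whatever the Storage Management entry contains.

```shell
#!/bin/sh
# Hypothetical helper: attempt an FTP login plus a directory listing with curl.
# Arguments: host, user, password (taken from the node's Storage Management entry).
check_ftp() {
    host=$1
    user=$2
    pass=$3
    if curl --silent --show-error --list-only --connect-timeout 10 \
            --user "$user:$pass" "ftp://$host/" >/dev/null; then
        echo "FTP login to $host OK"
    else
        echo "FTP login to $host FAILED"
        return 1
    fi
}

# Usage (placeholder values, not real credentials):
#   check_ftp 192.0.2.10 fog 'password-from-storage-node-entry'
```

Run it once from the main against each node, then from each node against the main. Note that a password passed on the command line is visible in the process list while the command runs.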
-
@moses Did you find out what’s going wrong with the services? Any errors in the apache log when you try starting the service?
We’ve fixed the bandwidth.php issue. Can you please update to the latest version to see if those errors stop and the bandwidth monitor on the dashboard works for you?
-
Unfortunately, even with some hands-on help, I wasn’t able to determine what the cause was. At this point my best guess is that it’s related to Linux. Even upgrading to a newer distro didn’t help. I’m currently in the process of moving my configuration over to CentOS, once I back up my images.
-
@moses Are the services actually running after you started them by hand?
ps ax | grep FOG
-
If you run top, how much of your CPU do these services use?