Image replication crashes when FTP connection fails
-
FOG Server version 1.5.9
FOG Nodes version: fog-dev-branch 1.5.9.200FOG Server PHP Version:
PHP 7.2.24-0ubuntu0.18.04.13 (cli) (built: Jul 6 2022 12:23:22) ( NTS )
Copyright 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright 1998-2018 Zend Technologies
with Zend OPcache v7.2.24-0ubuntu0.18.04.13, Copyright 1999-2018, by Zend TechnologiesI’m having an issue where the image replication service seemingly crashes while replicating an image that’s located on a node where the network connection is unavailable. I would expect the program to continue on a failure and proceed to the other images but it just stops processing subsequent images. If the network connection (VPN tunnel) is restored the replication service continues on and I see successful log messages indicating that the image doesn’t need to be replicated (no need to sync etc). Alternatively as a workaround we’re able to manually disable replication for the specific image which allows the replication service to proceed on to the other images. If the network connection is down I get the following error in the replication log
[11-07-22 11:20:30 pm] * Type: 1024, File: /var/www/fog/lib/fog/fogftp.class.php, Line: 219, Message: FTP connection failed, Host: 192.168.113.197, Username: fogproject
To explain our configuration a little bit we are an MSP with two offices in different states. We’ve been using fog in our two physical locations for a couple years where one of our offices has the FOG server and the other has a storage node connected via VPN tunnel and a few images replicate between them. We’re now adding storage nodes connected by VPN to our clients’ offices in the hopes of remotely imaging machines at our clients sites with a customer specific image preventing the need to physically transport computers to our office to image them. The issue here is we have a ton of clients and no real way to ensure that a given VPN tunnel will be active. Since a disconnected node crashes the replication service it doesn’t proceed to the next image. Fortunately we caught this early because it seems to be replicating the images in alphabetical order and one of our clients starting with an “A” lost their vpn tunnel meaning all other clients alphabetically after “A” didn’t receive their images. In our use case having a few images that fail image replication isn’t a catastrophic event and would actually be expected due to the inconsistent state of our VPN tunnels. Any number of network outages or firewall problems could cause a node to be unavailable temporarily. If this happens it would be nice if it would just log the error and proceed on to the next images. Admittedly I have never worked with PHP before and don’t know how its error handling works but I poked around in the source code and commented out line 219 of the fogftp.class.php that does a trigger_error (the only place I saw that does this in the repository) with the FTP connection failed message. It subsequently failed in a different file “/packages/web/lib/fog/fogurlrequests.class.php” when opening a socket. I attempted to put a Try/Catch block within the for loop I thought the image replication loop was happening but the FTP timeout fails deeper than where I am looking. At this point I decided it would be best to reach out to you guys. I have undone any code changes I made to the fogftp.class and imagereplicator classes.
Below is what the replication log looks like when an image fails to replicate due to network connectivity problem. I’ve replaced the image name with TESTIMAGE to protect the privacy of the client but just know it starts with “A”. You can see the log begins to replicate TESTIMAGE then fails, the next log entry is Starting image replication. There are many more images alphabetically after the TESTIMAGE that do not get replicated when this happens.
…
…
[11-07-22 11:18:59 pm] * Found Image to transfer to 2 groups
[11-07-22 11:18:59 pm] | Image Name: TESTIMAGE
[11-07-22 11:20:30 pm] * Type: 1024, File: /var/www/fog/lib/fog/fogftp.class.php, Line: 219, Message: FTP connection failed, Host: 192.168.113.197, Username: fogproject
[11-07-22 11:28:43 pm] * Starting Image Replication.
…
…
some time later the replication service runs again and it stops in the same place it did before.[11-07-22 11:28:59 pm] | Image Name: SOMEIMAGE
[11-07-22 11:28:59 pm] | There are no other members to sync to.
[11-07-22 11:28:59 pm] * Found Image to transfer to 2 groups
[11-07-22 11:28:59 pm] | Image Name: TESTIMAGE
[11-07-22 11:30:30 pm] * Type: 1024, File: /var/www/fog/lib/fog/fogftp.class.php, Line: 219, Message: FTP connection failed, Host: 192.168.113.197, Username: fogprojectDo you have any idea how to fix this? If this system works well for us we could conceivably have 100 nodes out in the wild.
-
@londonium said in Image replication crashes when FTP connection fails:
FOG Server version 1.5.9
FOG Nodes version: fog-dev-branch 1.5.9.200Just to be sure, all the nodes are on the exact same version, 1.5.9.200, right?
-
@sebastian-roth Yes all nodes are on the exact same version.
-
I have an update on my issue. I reviewed the fogservice class and found that the storagenodes are in fact checking if they are online before replicating at Line 359 in the fogservice class
I then intentionally took down a vpn tunnel for a different client where I receive the error message output from that online status check.
[11-09-22 5:47:30 pm] * Found Image to transfer to 2 groups
[11-09-22 5:47:30 pm] | Image Name: SomeOtherImage
[11-09-22 5:47:30 pm] | SomeOtherNode server does not appear to be online.
[11-09-22 5:47:30 pm] * Not syncing Image between groupsI am not sure why this error would come up on one client specifically and not another. This would lead me to believe something is tricking the FOG server into thinking the node is online by setting this online flag to true. When then VPN tunnel is down for the client in question I am unable to ping the ip address of the fog node. I don’t think there is anything else on the network with the IP of the storage node that would make it appear online and then subsequently fail an FTP operation.
-
Sorry for the million updates but we just took down a vpn tunnel for a third client and actually ran into the same issue as stated initially with the service crashing. I commented out the trigger_error in the fogftp class again to see the error in fogurlrequests and it says its unable to connect to an IP address that is not the same as the node this image is supposed to be replicating to.
192.168.1.134:21 is not the 10.81.2.117 I would expect to be seeing here. The fact remains the tunnel is down from our own doing and bringing the tunnel up removes this error but it seems strange that this error would say its connecting to the ip of a different unrelated node and maybe give some insight into why some nodes report offline correctly during replication and some try to replicate even though they should fail the offline test.See the log messages below.
[11-09-22 8:35:16 pm] * Found Image to transfer to 2 groups
[11-09-22 8:35:16 pm] | Image Name: 2ndFailedImage
[11-09-22 8:36:47 pm] * Type: 2, File: /var/www/fog/lib/fog/fogurlrequests.class.php, Line: 611, Message: fsockopen(): unable to connect to 192.168.1.134:21 (Connection timed out), Host: 10.81.2.117, Username: fogproject
[11-09-22 8:36:48 pm] * Starting Image Replication. -
@londonium said in Image replication crashes when FTP connection fails:
Sorry for the million updates
Not at all! It’s great you figured out so much details already. Thank you! I will be looking into this soon when I have more time.
-
To give you another update I’ve been reviewing the source code for the image replication and think I found that in the /packages/web/lib/service/fogservice.class.php file the replicateItems() function gathers its storagenode objects from the API route class. From here I think it checks the online flag from a json deserialized response from the API. This gave me the idea to see what the API says about the storage nodes.
I did an API call at the /storagenode route and found that this disconnected storage node is showing online. Another one is offline and correctly reporting its online status as “false” and this particular node doesn’t cause the replication service to crash.
Below is the API response for a node which is offline due to a down vpn tunnel but is showing it is online and as such causes the replication service to crash.
{ "count": 15, "storagenodes": [ { "id": "7", "name": "Some-Offline-Node", "description": "", "isMaster": "1", "storagegroupID": "7", "isEnabled": "1", "isGraphEnabled": "0", "path": "/images", "ftppath": "/images", "bitrate": "", "helloInterval": "", "snapinpath": "/opt/fog/snapins", "sslpath": "/opt/fog/snapins/ssl", "ip": "192.168.113.197", "maxClients": "10", "user": "fogproject", "pass": "somePassword", "key": "", "interface": "ens160", "bandwidth": "0", "webroot": "/fog", "storagegroup": { "id": "7", "name": "SomeName", "description": "", "totalsupportedclients": 10, "enablednodes": [ "7" ], "allnodes": [ "7" ] }, "clientload": 0, "online": true **<---- this is saying its online but cannot be pinged through the down vpn** }
Here is one of the nodes that is offline but the boolean “online” property is set to false. This node gives the message in the log that the node is offline and continues on with the program instead of attempting an ftp connection.
{ "id": "11", "name": "aDifferentNode", "description": "", "isMaster": "1", "storagegroupID": "11", "isEnabled": "1", "isGraphEnabled": "0", "path": "/images", "ftppath": "/images", "bitrate": "", "helloInterval": "", "snapinpath": "/opt/fog/snapins", "sslpath": "/opt/fog/snapins/ssl", "ip": "192.168.1.134", "maxClients": "10", "user": "fogproject", "pass": "somePassword", "key": "", "interface": "ens160", "bandwidth": "0", "webroot": "/fog", "storagegroup": { "id": "11", "name": "someName", "description": "", "totalsupportedclients": 10, "enablednodes": [ "11" ], "allnodes": [ "11" ] }, "clientload": 0, "online": false **<--correctly set to false** }
I will continue investigating the method that sets this online flag but thought this could be helpful to you. Have you ever seen anything like this before?
-
This is an incredibly jank patch to the fogservice.class.php file but I’ve added a function that performs a ping check on the storage node, if it fails it simply logs the ping failure and proceeds on to the other images.
Here is what the log looks like now with it skipping non pingable nodes.
11-10-22 11:47:04 pm] | Image Name: TESTIMAGE [11-10-22 11:47:14 pm] | Node IP: 192.168.113.197 [11-10-22 11:47:14 pm] | Some-Server server could not be pinged. [11-10-22 11:47:14 pm] * Not syncing Image between groups [11-10-22 11:47:14 pm] | Image Name: TEST2 [11-10-22 11:47:14 pm] | There are no other members to sync to. [11-10-22 11:47:14 pm] * Found Image to transfer to 2 groups [11-10-22 11:47:14 pm] | Image Name: TEST3 [11-10-22 11:47:14 pm] | Node IP: 10.1.3.108
The immediate problem has a band-aid fix but it would be nice to know why the fog server thinks the nodes are online when they should be offline.
-
@Londonium Thank you once again! You are doing great. I appreciate you looking into it thus far. I still have not had enough time to create a setup to be able to test and look into the details myself.
Your debugging is very helpful and I am pretty sure you are on the right track. The different “online” state of the two nodes obviously cause this issue only for one of them. While your idea of adding a ping check is probably a good idea I would favor figuring out why you get the “fsockopen(): unable to connect to 192.168.1.134:21 (Connection timed out), Host: 10.81.2.117, Username: fogproject” (obviously unrelated IPs).
Would you be able to add debugging statements to these functions and see if you can figure out where those different IPs are coming from?
https://github.com/FOGProject/fogproject/blob/171d63724131c396029992730660497d48410842/packages/web/lib/fog/storagenode.class.php#L148
https://github.com/FOGProject/fogproject/blob/9fc716769556790f68135f80a920f21bb8b95779/packages/web/lib/fog/fogurlrequests.class.php#L580I would imagine a variable being re-used but not properly re-initialized. Looking through the code I have not spotted the bug yet.
Usually when I debug this kind of issues I simply use
file_put_contents()
to print valuable information to a separate log file. For example:file_put_contents('/var/www/fog/foo.log', print_r($this->get('ip'), 1), FILE_APPEND); // or file_put_contents('/var/www/fog/foo.log', print_r($socket, 1), FILE_APPEND);
On the server you need to create the file and modify ownership and access rights (as root):
touch /var/www/fog/foo.log chown www-data:www-data /var/www/fog/foo.log tail -f /var/www/fog/foo.log
PS: I shall find some more time on the upcoming weekend.
-
@Londonium Diving through the code again I have a feeling this change in fogftp.class.php might have caused this: https://github.com/FOGProject/fogproject/commit/c027a9ed2aa27c45aef9e62c7b1aab3172f4da82#diff-d9ddaf1f03ae05e1c61498e2b053c2c098f45ab710a5ada8ee5b95cdf3dfa9be
As
serialize()
can’t be used on any class in PHP 8 (docs) I switched to useprint_r()
instead. Possibly this is not working as I intended. -
@sebastian-roth
Hi Sebastian,
I agree with you that we should figure out why the fsocketopen() isnt working. I tried adding some debug statements to the functions you mentioned and I get an IP address print out of the node and then a resource ID# for the socket objects. I tried doing a var_dump($socket) and don’t see any additional information on the sockets or anything that gives me insight into the problem. I formatted the output a bit but here’s what that looked like.Resource id #1150 10.178.109.162 Resource id #1153 192.168.16.102 Resource id #1156 10.0.26.66 Resource id #1159 10.20.65.72 Resource id #1162 192.168.42.121 Resource id #1165 192.168.231.21 Resource id #1168 192.168.2.38
I suspect the fsocketopen() is crashing at line 611 when the $socket variable is initialized.
https://github.com/FOGProject/fogproject/blob/9fc716769556790f68135f80a920f21bb8b95779/packages/web/lib/fog/fogurlrequests.class.php#L611When I try to print_r () or var_dump() the $socket variable I get nothing when the program crashes. Is this an issue with PHP fsocketopen() function? I read the documentation on it and it seems like all the correct parameters are supplied. I have very limited understanding of PHP at the moment so I’m not sure of the implications of using serialize() vs print_r().
I’d be curious if this issue can be replicated outside of our network. Thank you for looking into this I know you guys are busy! -
@Londonium I finally found the time to look into this myself and do some testing. Sorry for the delay. Your debugging and posts helped a lot to nail this issue down, so thanks again!
Your VPN tunnel down scenario is kind of special. I triggers when port 21 (FTP) can be reached (socket connection fine and so online status true) but an FTP connection fails. What I mean is that the TCP connection opens but it doesn’t receive the FTP banner information as expected when connecting to a FTP server. I was able to replicate this behavior using
netcat
(listens on port 21 but does not respond to a full FTP connection try). It’s interesting you can trigger this issue by closing a VPN tunnel because to me it seems a TCP connection to port 21 is still being answered despite the VPN tunnel being down - maybe by the VPN router?[11-09-22 8:36:47 pm] * Type: 2, File: /var/www/fog/lib/fog/fogurlrequests.class.php, Line: 611, Message: fsockopen(): unable to connect to 192.168.1.134:21 (Connection timed out), Host: 10.81.2.117, Username: fogproject
I was also able to replicate this scenario and found why it shows two different IPs (when “trigger_error” is commented). On FTP errors the PHP function error_get_last() is used to retrieve information from the last error happening and combines it with current host/username information. But when a FTP connect fails this is not a general PHP error and therefore the “trigger_error” is used to simulate this being a real PHP error. If “trigger_error” is not in place the last PHP error can be anything really and in your case it was the failed fsockopen connect to the prior storage node.
Anyhow, thanks to your topic I was able to improve the replication code (dev-branch 1.5.9.222) and make it skip to the next task/node in case of FTP connect errors.
-
@sebastian-roth
This is great thanks for going through all the trouble to set up a test environment and figure out the problem. I’ll try updating our FOG server to the new dev-branch version and see how it works out. I looked at your code changes and I’m sure this will work just fine.
Its weird that the FTP connection is hung up somewhere. I thought maybe our firewall/router was causing the initiated FTP connections to the dead node but couldn’t be sure since this is the only type that we have including at my home office! Anyways thanks again for your help. We really enjoy using FOG and have big plans for it! -
@londonium said in Image replication crashes when FTP connection fails:
Anyways thanks again for your help.
You are welcome!
We really enjoy using FOG and have big plans for it!
You might consider contributing to the project as you’ve done a great job digging into this issue. Feel free to send me a message if you consider helping out.