Image replication crashes when FTP connection fails
-
Sorry for the million updates but we just took down a vpn tunnel for a third client and actually ran into the same issue as stated initially with the service crashing. I commented out the trigger_error in the fogftp class again to see the error in fogurlrequests and it says its unable to connect to an IP address that is not the same as the node this image is supposed to be replicating to.
192.168.1.134:21 is not the 10.81.2.117 I would expect to be seeing here. The fact remains the tunnel is down from our own doing and bringing the tunnel up removes this error but it seems strange that this error would say its connecting to the ip of a different unrelated node and maybe give some insight into why some nodes report offline correctly during replication and some try to replicate even though they should fail the offline test.See the log messages below.
[11-09-22 8:35:16 pm] * Found Image to transfer to 2 groups
[11-09-22 8:35:16 pm] | Image Name: 2ndFailedImage
[11-09-22 8:36:47 pm] * Type: 2, File: /var/www/fog/lib/fog/fogurlrequests.class.php, Line: 611, Message: fsockopen(): unable to connect to 192.168.1.134:21 (Connection timed out), Host: 10.81.2.117, Username: fogproject
[11-09-22 8:36:48 pm] * Starting Image Replication. -
@londonium said in Image replication crashes when FTP connection fails:
Sorry for the million updates
Not at all! It’s great you figured out so much details already. Thank you! I will be looking into this soon when I have more time.
-
To give you another update I’ve been reviewing the source code for the image replication and think I found that in the /packages/web/lib/service/fogservice.class.php file the replicateItems() function gathers its storagenode objects from the API route class. From here I think it checks the online flag from a json deserialized response from the API. This gave me the idea to see what the API says about the storage nodes.
I did an API call at the /storagenode route and found that this disconnected storage node is showing online. Another one is offline and correctly reporting its online status as “false” and this particular node doesn’t cause the replication service to crash.
Below is the API response for a node which is offline due to a down vpn tunnel but is showing it is online and as such causes the replication service to crash.
{ "count": 15, "storagenodes": [ { "id": "7", "name": "Some-Offline-Node", "description": "", "isMaster": "1", "storagegroupID": "7", "isEnabled": "1", "isGraphEnabled": "0", "path": "/images", "ftppath": "/images", "bitrate": "", "helloInterval": "", "snapinpath": "/opt/fog/snapins", "sslpath": "/opt/fog/snapins/ssl", "ip": "192.168.113.197", "maxClients": "10", "user": "fogproject", "pass": "somePassword", "key": "", "interface": "ens160", "bandwidth": "0", "webroot": "/fog", "storagegroup": { "id": "7", "name": "SomeName", "description": "", "totalsupportedclients": 10, "enablednodes": [ "7" ], "allnodes": [ "7" ] }, "clientload": 0, "online": true **<---- this is saying its online but cannot be pinged through the down vpn** }
Here is one of the nodes that is offline but the boolean “online” property is set to false. This node gives the message in the log that the node is offline and continues on with the program instead of attempting an ftp connection.
{ "id": "11", "name": "aDifferentNode", "description": "", "isMaster": "1", "storagegroupID": "11", "isEnabled": "1", "isGraphEnabled": "0", "path": "/images", "ftppath": "/images", "bitrate": "", "helloInterval": "", "snapinpath": "/opt/fog/snapins", "sslpath": "/opt/fog/snapins/ssl", "ip": "192.168.1.134", "maxClients": "10", "user": "fogproject", "pass": "somePassword", "key": "", "interface": "ens160", "bandwidth": "0", "webroot": "/fog", "storagegroup": { "id": "11", "name": "someName", "description": "", "totalsupportedclients": 10, "enablednodes": [ "11" ], "allnodes": [ "11" ] }, "clientload": 0, "online": false **<--correctly set to false** }
I will continue investigating the method that sets this online flag but thought this could be helpful to you. Have you ever seen anything like this before?
-
This is an incredibly jank patch to the fogservice.class.php file but I’ve added a function that performs a ping check on the storage node, if it fails it simply logs the ping failure and proceeds on to the other images.
Here is what the log looks like now with it skipping non pingable nodes.
11-10-22 11:47:04 pm] | Image Name: TESTIMAGE [11-10-22 11:47:14 pm] | Node IP: 192.168.113.197 [11-10-22 11:47:14 pm] | Some-Server server could not be pinged. [11-10-22 11:47:14 pm] * Not syncing Image between groups [11-10-22 11:47:14 pm] | Image Name: TEST2 [11-10-22 11:47:14 pm] | There are no other members to sync to. [11-10-22 11:47:14 pm] * Found Image to transfer to 2 groups [11-10-22 11:47:14 pm] | Image Name: TEST3 [11-10-22 11:47:14 pm] | Node IP: 10.1.3.108
The immediate problem has a band-aid fix but it would be nice to know why the fog server thinks the nodes are online when they should be offline.
-
@Londonium Thank you once again! You are doing great. I appreciate you looking into it thus far. I still have not had enough time to create a setup to be able to test and look into the details myself.
Your debugging is very helpful and I am pretty sure you are on the right track. The different “online” state of the two nodes obviously cause this issue only for one of them. While your idea of adding a ping check is probably a good idea I would favor figuring out why you get the “fsockopen(): unable to connect to 192.168.1.134:21 (Connection timed out), Host: 10.81.2.117, Username: fogproject” (obviously unrelated IPs).
Would you be able to add debugging statements to these functions and see if you can figure out where those different IPs are coming from?
https://github.com/FOGProject/fogproject/blob/171d63724131c396029992730660497d48410842/packages/web/lib/fog/storagenode.class.php#L148
https://github.com/FOGProject/fogproject/blob/9fc716769556790f68135f80a920f21bb8b95779/packages/web/lib/fog/fogurlrequests.class.php#L580I would imagine a variable being re-used but not properly re-initialized. Looking through the code I have not spotted the bug yet.
Usually when I debug this kind of issues I simply use
file_put_contents()
to print valuable information to a separate log file. For example:file_put_contents('/var/www/fog/foo.log', print_r($this->get('ip'), 1), FILE_APPEND); // or file_put_contents('/var/www/fog/foo.log', print_r($socket, 1), FILE_APPEND);
On the server you need to create the file and modify ownership and access rights (as root):
touch /var/www/fog/foo.log chown www-data:www-data /var/www/fog/foo.log tail -f /var/www/fog/foo.log
PS: I shall find some more time on the upcoming weekend.
-
@Londonium Diving through the code again I have a feeling this change in fogftp.class.php might have caused this: https://github.com/FOGProject/fogproject/commit/c027a9ed2aa27c45aef9e62c7b1aab3172f4da82#diff-d9ddaf1f03ae05e1c61498e2b053c2c098f45ab710a5ada8ee5b95cdf3dfa9be
As
serialize()
can’t be used on any class in PHP 8 (docs) I switched to useprint_r()
instead. Possibly this is not working as I intended. -
@sebastian-roth
Hi Sebastian,
I agree with you that we should figure out why the fsocketopen() isnt working. I tried adding some debug statements to the functions you mentioned and I get an IP address print out of the node and then a resource ID# for the socket objects. I tried doing a var_dump($socket) and don’t see any additional information on the sockets or anything that gives me insight into the problem. I formatted the output a bit but here’s what that looked like.Resource id #1150 10.178.109.162 Resource id #1153 192.168.16.102 Resource id #1156 10.0.26.66 Resource id #1159 10.20.65.72 Resource id #1162 192.168.42.121 Resource id #1165 192.168.231.21 Resource id #1168 192.168.2.38
I suspect the fsocketopen() is crashing at line 611 when the $socket variable is initialized.
https://github.com/FOGProject/fogproject/blob/9fc716769556790f68135f80a920f21bb8b95779/packages/web/lib/fog/fogurlrequests.class.php#L611When I try to print_r () or var_dump() the $socket variable I get nothing when the program crashes. Is this an issue with PHP fsocketopen() function? I read the documentation on it and it seems like all the correct parameters are supplied. I have very limited understanding of PHP at the moment so I’m not sure of the implications of using serialize() vs print_r().
I’d be curious if this issue can be replicated outside of our network. Thank you for looking into this I know you guys are busy! -
@Londonium I finally found the time to look into this myself and do some testing. Sorry for the delay. Your debugging and posts helped a lot to nail this issue down, so thanks again!
Your VPN tunnel down scenario is kind of special. I triggers when port 21 (FTP) can be reached (socket connection fine and so online status true) but an FTP connection fails. What I mean is that the TCP connection opens but it doesn’t receive the FTP banner information as expected when connecting to a FTP server. I was able to replicate this behavior using
netcat
(listens on port 21 but does not respond to a full FTP connection try). It’s interesting you can trigger this issue by closing a VPN tunnel because to me it seems a TCP connection to port 21 is still being answered despite the VPN tunnel being down - maybe by the VPN router?[11-09-22 8:36:47 pm] * Type: 2, File: /var/www/fog/lib/fog/fogurlrequests.class.php, Line: 611, Message: fsockopen(): unable to connect to 192.168.1.134:21 (Connection timed out), Host: 10.81.2.117, Username: fogproject
I was also able to replicate this scenario and found why it shows two different IPs (when “trigger_error” is commented). On FTP errors the PHP function error_get_last() is used to retrieve information from the last error happening and combines it with current host/username information. But when a FTP connect fails this is not a general PHP error and therefore the “trigger_error” is used to simulate this being a real PHP error. If “trigger_error” is not in place the last PHP error can be anything really and in your case it was the failed fsockopen connect to the prior storage node.
Anyhow, thanks to your topic I was able to improve the replication code (dev-branch 1.5.9.222) and make it skip to the next task/node in case of FTP connect errors.
-
@sebastian-roth
This is great thanks for going through all the trouble to set up a test environment and figure out the problem. I’ll try updating our FOG server to the new dev-branch version and see how it works out. I looked at your code changes and I’m sure this will work just fine.
Its weird that the FTP connection is hung up somewhere. I thought maybe our firewall/router was causing the initiated FTP connections to the dead node but couldn’t be sure since this is the only type that we have including at my home office! Anyways thanks again for your help. We really enjoy using FOG and have big plans for it! -
@londonium said in Image replication crashes when FTP connection fails:
Anyways thanks again for your help.
You are welcome!
We really enjoy using FOG and have big plans for it!
You might consider contributing to the project as you’ve done a great job digging into this issue. Feel free to send me a message if you consider helping out.