@sebastian-roth
This is great thanks for going through all the trouble to set up a test environment and figure out the problem. I’ll try updating our FOG server to the new dev-branch version and see how it works out. I looked at your code changes and I’m sure this will work just fine.
Its weird that the FTP connection is hung up somewhere. I thought maybe our firewall/router was causing the initiated FTP connections to the dead node but couldn’t be sure since this is the only type that we have including at my home office! Anyways thanks again for your help. We really enjoy using FOG and have big plans for it!
Posts made by Londonium
-
RE: Image replication crashes when FTP connection fails
-
RE: Image replication crashes when FTP connection fails
@sebastian-roth
Hi Sebastian,
I agree with you that we should figure out why the fsocketopen() isnt working. I tried adding some debug statements to the functions you mentioned and I get an IP address print out of the node and then a resource ID# for the socket objects. I tried doing a var_dump($socket) and don’t see any additional information on the sockets or anything that gives me insight into the problem. I formatted the output a bit but here’s what that looked like.Resource id #1150 10.178.109.162 Resource id #1153 192.168.16.102 Resource id #1156 10.0.26.66 Resource id #1159 10.20.65.72 Resource id #1162 192.168.42.121 Resource id #1165 192.168.231.21 Resource id #1168 192.168.2.38
I suspect the fsocketopen() is crashing at line 611 when the $socket variable is initialized.
https://github.com/FOGProject/fogproject/blob/9fc716769556790f68135f80a920f21bb8b95779/packages/web/lib/fog/fogurlrequests.class.php#L611When I try to print_r () or var_dump() the $socket variable I get nothing when the program crashes. Is this an issue with PHP fsocketopen() function? I read the documentation on it and it seems like all the correct parameters are supplied. I have very limited understanding of PHP at the moment so I’m not sure of the implications of using serialize() vs print_r().
I’d be curious if this issue can be replicated outside of our network. Thank you for looking into this I know you guys are busy! -
RE: Image replication crashes when FTP connection fails
This is an incredibly jank patch to the fogservice.class.php file but I’ve added a function that performs a ping check on the storage node, if it fails it simply logs the ping failure and proceeds on to the other images.
Here is what the log looks like now with it skipping non pingable nodes.
11-10-22 11:47:04 pm] | Image Name: TESTIMAGE [11-10-22 11:47:14 pm] | Node IP: 192.168.113.197 [11-10-22 11:47:14 pm] | Some-Server server could not be pinged. [11-10-22 11:47:14 pm] * Not syncing Image between groups [11-10-22 11:47:14 pm] | Image Name: TEST2 [11-10-22 11:47:14 pm] | There are no other members to sync to. [11-10-22 11:47:14 pm] * Found Image to transfer to 2 groups [11-10-22 11:47:14 pm] | Image Name: TEST3 [11-10-22 11:47:14 pm] | Node IP: 10.1.3.108
The immediate problem has a band-aid fix but it would be nice to know why the fog server thinks the nodes are online when they should be offline.
-
RE: Image replication crashes when FTP connection fails
To give you another update I’ve been reviewing the source code for the image replication and think I found that in the /packages/web/lib/service/fogservice.class.php file the replicateItems() function gathers its storagenode objects from the API route class. From here I think it checks the online flag from a json deserialized response from the API. This gave me the idea to see what the API says about the storage nodes.
I did an API call at the /storagenode route and found that this disconnected storage node is showing online. Another one is offline and correctly reporting its online status as “false” and this particular node doesn’t cause the replication service to crash.
Below is the API response for a node which is offline due to a down vpn tunnel but is showing it is online and as such causes the replication service to crash.
{ "count": 15, "storagenodes": [ { "id": "7", "name": "Some-Offline-Node", "description": "", "isMaster": "1", "storagegroupID": "7", "isEnabled": "1", "isGraphEnabled": "0", "path": "/images", "ftppath": "/images", "bitrate": "", "helloInterval": "", "snapinpath": "/opt/fog/snapins", "sslpath": "/opt/fog/snapins/ssl", "ip": "192.168.113.197", "maxClients": "10", "user": "fogproject", "pass": "somePassword", "key": "", "interface": "ens160", "bandwidth": "0", "webroot": "/fog", "storagegroup": { "id": "7", "name": "SomeName", "description": "", "totalsupportedclients": 10, "enablednodes": [ "7" ], "allnodes": [ "7" ] }, "clientload": 0, "online": true **<---- this is saying its online but cannot be pinged through the down vpn** }
Here is one of the nodes that is offline but the boolean “online” property is set to false. This node gives the message in the log that the node is offline and continues on with the program instead of attempting an ftp connection.
{ "id": "11", "name": "aDifferentNode", "description": "", "isMaster": "1", "storagegroupID": "11", "isEnabled": "1", "isGraphEnabled": "0", "path": "/images", "ftppath": "/images", "bitrate": "", "helloInterval": "", "snapinpath": "/opt/fog/snapins", "sslpath": "/opt/fog/snapins/ssl", "ip": "192.168.1.134", "maxClients": "10", "user": "fogproject", "pass": "somePassword", "key": "", "interface": "ens160", "bandwidth": "0", "webroot": "/fog", "storagegroup": { "id": "11", "name": "someName", "description": "", "totalsupportedclients": 10, "enablednodes": [ "11" ], "allnodes": [ "11" ] }, "clientload": 0, "online": false **<--correctly set to false** }
I will continue investigating the method that sets this online flag but thought this could be helpful to you. Have you ever seen anything like this before?
-
RE: Image replication crashes when FTP connection fails
Sorry for the million updates but we just took down a vpn tunnel for a third client and actually ran into the same issue as stated initially with the service crashing. I commented out the trigger_error in the fogftp class again to see the error in fogurlrequests and it says its unable to connect to an IP address that is not the same as the node this image is supposed to be replicating to.
192.168.1.134:21 is not the 10.81.2.117 I would expect to be seeing here. The fact remains the tunnel is down from our own doing and bringing the tunnel up removes this error but it seems strange that this error would say its connecting to the ip of a different unrelated node and maybe give some insight into why some nodes report offline correctly during replication and some try to replicate even though they should fail the offline test.See the log messages below.
[11-09-22 8:35:16 pm] * Found Image to transfer to 2 groups
[11-09-22 8:35:16 pm] | Image Name: 2ndFailedImage
[11-09-22 8:36:47 pm] * Type: 2, File: /var/www/fog/lib/fog/fogurlrequests.class.php, Line: 611, Message: fsockopen(): unable to connect to 192.168.1.134:21 (Connection timed out), Host: 10.81.2.117, Username: fogproject
[11-09-22 8:36:48 pm] * Starting Image Replication. -
RE: Image replication crashes when FTP connection fails
I have an update on my issue. I reviewed the fogservice class and found that the storagenodes are in fact checking if they are online before replicating at Line 359 in the fogservice class
I then intentionally took down a vpn tunnel for a different client where I receive the error message output from that online status check.
[11-09-22 5:47:30 pm] * Found Image to transfer to 2 groups
[11-09-22 5:47:30 pm] | Image Name: SomeOtherImage
[11-09-22 5:47:30 pm] | SomeOtherNode server does not appear to be online.
[11-09-22 5:47:30 pm] * Not syncing Image between groupsI am not sure why this error would come up on one client specifically and not another. This would lead me to believe something is tricking the FOG server into thinking the node is online by setting this online flag to true. When then VPN tunnel is down for the client in question I am unable to ping the ip address of the fog node. I don’t think there is anything else on the network with the IP of the storage node that would make it appear online and then subsequently fail an FTP operation.
-
RE: Image replication crashes when FTP connection fails
@sebastian-roth Yes all nodes are on the exact same version.
-
Image replication crashes when FTP connection fails
FOG Server version 1.5.9
FOG Nodes version: fog-dev-branch 1.5.9.200FOG Server PHP Version:
PHP 7.2.24-0ubuntu0.18.04.13 (cli) (built: Jul 6 2022 12:23:22) ( NTS )
Copyright 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright 1998-2018 Zend Technologies
with Zend OPcache v7.2.24-0ubuntu0.18.04.13, Copyright 1999-2018, by Zend TechnologiesI’m having an issue where the image replication service seemingly crashes while replicating an image that’s located on a node where the network connection is unavailable. I would expect the program to continue on a failure and proceed to the other images but it just stops processing subsequent images. If the network connection (VPN tunnel) is restored the replication service continues on and I see successful log messages indicating that the image doesn’t need to be replicated (no need to sync etc). Alternatively as a workaround we’re able to manually disable replication for the specific image which allows the replication service to proceed on to the other images. If the network connection is down I get the following error in the replication log
[11-07-22 11:20:30 pm] * Type: 1024, File: /var/www/fog/lib/fog/fogftp.class.php, Line: 219, Message: FTP connection failed, Host: 192.168.113.197, Username: fogproject
To explain our configuration a little bit we are an MSP with two offices in different states. We’ve been using fog in our two physical locations for a couple years where one of our offices has the FOG server and the other has a storage node connected via VPN tunnel and a few images replicate between them. We’re now adding storage nodes connected by VPN to our clients’ offices in the hopes of remotely imaging machines at our clients sites with a customer specific image preventing the need to physically transport computers to our office to image them. The issue here is we have a ton of clients and no real way to ensure that a given VPN tunnel will be active. Since a disconnected node crashes the replication service it doesn’t proceed to the next image. Fortunately we caught this early because it seems to be replicating the images in alphabetical order and one of our clients starting with an “A” lost their vpn tunnel meaning all other clients alphabetically after “A” didn’t receive their images. In our use case having a few images that fail image replication isn’t a catastrophic event and would actually be expected due to the inconsistent state of our VPN tunnels. Any number of network outages or firewall problems could cause a node to be unavailable temporarily. If this happens it would be nice if it would just log the error and proceed on to the next images. Admittedly I have never worked with PHP before and don’t know how its error handling works but I poked around in the source code and commented out line 219 of the fogftp.class.php that does a trigger_error (the only place I saw that does this in the repository) with the FTP connection failed message. It subsequently failed in a different file “/packages/web/lib/fog/fogurlrequests.class.php” when opening a socket. I attempted to put a Try/Catch block within the for loop I thought the image replication loop was happening but the FTP timeout fails deeper than where I am looking. At this point I decided it would be best to reach out to you guys. I have undone any code changes I made to the fogftp.class and imagereplicator classes.
Below is what the replication log looks like when an image fails to replicate due to network connectivity problem. I’ve replaced the image name with TESTIMAGE to protect the privacy of the client but just know it starts with “A”. You can see the log begins to replicate TESTIMAGE then fails, the next log entry is Starting image replication. There are many more images alphabetically after the TESTIMAGE that do not get replicated when this happens.
…
…
[11-07-22 11:18:59 pm] * Found Image to transfer to 2 groups
[11-07-22 11:18:59 pm] | Image Name: TESTIMAGE
[11-07-22 11:20:30 pm] * Type: 1024, File: /var/www/fog/lib/fog/fogftp.class.php, Line: 219, Message: FTP connection failed, Host: 192.168.113.197, Username: fogproject
[11-07-22 11:28:43 pm] * Starting Image Replication.
…
…
some time later the replication service runs again and it stops in the same place it did before.[11-07-22 11:28:59 pm] | Image Name: SOMEIMAGE
[11-07-22 11:28:59 pm] | There are no other members to sync to.
[11-07-22 11:28:59 pm] * Found Image to transfer to 2 groups
[11-07-22 11:28:59 pm] | Image Name: TESTIMAGE
[11-07-22 11:30:30 pm] * Type: 1024, File: /var/www/fog/lib/fog/fogftp.class.php, Line: 219, Message: FTP connection failed, Host: 192.168.113.197, Username: fogprojectDo you have any idea how to fix this? If this system works well for us we could conceivably have 100 nodes out in the wild.