Possible Image and Snapin Replication Problem w/ Working Branch



  • Server
    • FOG Version: v1.5.0 RC-8 v24 w/ Location Plugin
    • OS: CentOS 7
    Description

    We have 2 Storage Nodes running the same FOG version and OS as above, 3 Storage Groups, and 3 Locations. Each Location has 1 storage node in 1 storage group.

    Snapin and Image Replication have not run since the Aug 26th update of the working branch. We’ve removed and re-connected the storage nodes, and we’ve removed, reinstalled and reconfigured the Location Plugin, but the problem persists.

    All images and all snapins are configured to replicate to all storage groups. The main FOG server is the primary for all images and snapins. On the dashboard, Storage nodes are shown to be online and Storage Groups report as expected.

    The Image Replicator log from the main FOG server (the Primary for all Images) shows two kinds of issue in a single pass.
    First, it shows images failing to replicate because the storage nodes are reported as offline (but they aren’t):
    [09-07-17 11:23:23 am] * Starting Image Replication.
    [09-07-17 11:23:23 am] * We are group ID: 1. We are group name: default
    [09-07-17 11:23:23 am] * We are node ID: 1. We are node name: DefaultMember
    [09-07-17 11:23:23 am] * Attempting to perform Group -> Group image replication.
    [09-07-17 11:23:23 am] | Replicating postdownloadscripts
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name:
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] | Replicating postinitscripts
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name:
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Found Image to transfer to 3 groups
    [09-07-17 11:23:23 am] | Image Name: W10Prox64BIOSSysprep
    [09-07-17 11:23:23 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] * Found Image to transfer to 3 groups
    [09-07-17 11:23:23 am] | Image Name: W7ProSp1x32ReamDrivers
    [09-07-17 11:23:23 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] * Found Image to transfer to 3 groups
    [09-07-17 11:23:23 am] | Image Name: W7ProSP1x64ReArmDrivers
    [09-07-17 11:23:23 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] sal2fogsnl01 Server does not appear to be online.
    .
    .
    .
    Second, it shows some images are not configured to replicate:
    [09-07-17 11:23:23 am] | Image Name: Win7ProSP1x64DriversRearm
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Attempting to perform Group -> Nodes image replication.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name: W10Prox64BIOSSysprep
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name: W7ProSp1x32ReamDrivers
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name: W7ProSP1x64ReArmDrivers
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    .
    .
    .

    Note that some of the images are listed twice in one replication pass.

    Similarly, the Snapin Replication Log from the Main FOG Server (Primary for all Snapins) shows the same two issues:
    First that the Storage Nodes are offline:
    [09-07-17 11:23:26 am] * Starting Snapin Replication.
    [09-07-17 11:23:26 am] * We are group ID: 1. We are group name: default
    [09-07-17 11:23:26 am] * We are node ID: 1. We are node name: DefaultMember
    [09-07-17 11:23:26 am] * Attempting to perform Group -> Group snapin replication.
    [09-07-17 11:23:26 am] | Replicating ssl less private key
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name:
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name:
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: -DeliverFogExe
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: -ExtendDisk
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: -Timeout
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: 0-AdminSet
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    .
    .
    .

    And second, that the snapins aren’t configured for replication:
    [09-07-17 11:23:26 am] * Attempting to perform Group -> Nodes snapin replication.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: -DeliverFogExe
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: -ExtendDisk
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: -Timeout
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: 0-AdminSet
    [09-07-17 11:23:26 am] | There are no other members to sync to.

    And, again, some, if not all, Snapins are listed twice in the single log pass.

    This all worked in previous versions of the working branch of v1.5.0 at the end of August.

    In the current system, images and snapins fail to deploy from the storage nodes but work from the main FOG server. It appears the only problem is replication. Our next step is to manually copy files around and test deployment to verify the problem is limited to replication alone.

    Any idea how to proceed?

    Any suggestions would be appreciated.

    Thanks,
    Jim
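To see at a glance which nodes the replicator considers offline, log excerpts like the ones in this post can be tallied with a short shell function. This is only a minimal sketch: the function name `offline_nodes` is made up for illustration, and the log paths in the usage comment are an assumption about where this install keeps its replication logs.

```shell
# Tally "does not appear to be online" reports per node in a replication log.
# Prints "<count> <node>" lines, sorted by node name.
offline_nodes() {
    awk '/Server does not appear to be online/ {
             sub(/^.*\] */, "")                      # drop the "[date time am]" prefix
             sub(/ Server does not appear.*$/, "")   # keep only the node name
             n[$0]++
         }
         END { for (k in n) print n[k], k }' "$1" | sort -k2
}

# Usage (paths are assumptions, adjust to your install):
#   offline_nodes /opt/fog/log/fogreplicator.log
#   offline_nodes /opt/fog/log/fogsnapinrep.log
```

If the same node names show up here while the dashboard reports them online, that points at the replicator's own online check rather than at the nodes.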


  • Moderator

    @sebastian-roth I have plans (in my head) to build automated functional testing for this, but I’m not set up to test replication at the moment.


  • Developer

    @Moderators @Testers Is anyone able to replicate this issue?



  • @tom-elliott

    thanks Tom… we’ve tested FTP and found it working on all nodes…

    Jim


  • Senior Developer

    @wayne-workman while quality assurance is always a good thing, the code I added would not have broken what is being reported here. I added a simple check to find out if it can reach the server on port 21, the ftp port. If it cannot communicate it will report it cannot. If it does communicate it will perform replication tasks.
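The kind of check described here can be approximated from a shell on the main server. The sketch below is not the FOG code (which is PHP); it is a hypothetical stand-in that assumes `bash` and the coreutils `timeout` command are available, and the function name `ftp_port_open` is invented for illustration.

```shell
# Return 0 if a TCP connection to host:port (default 21) succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device; any failure or timeout returns non-zero.
ftp_port_open() {
    host=$1
    port=${2:-21}
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage, run from the main FOG server against each storage node:
#   ftp_port_open roa1fogsnl01 && echo online || echo "does not appear to be online"
```

A success here only shows that the port accepts connections. It says nothing about credentials or about whether a particular file exists on the node.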



  • @wayne-workman

    I prefer to move forward rather than backward so I’ll choose Option B - manually replicate and test that it’s only a replication problem. We’re already working on that. This will work for a few days, but we’ll have problems as soon as we have to upload PCs before imaging.

    I’ll post here if all images and snapins work to clients at each site.

    Jim


  • Moderator

    @jim-graczyk Ok then. I think you’ve found an issue with replication in RC8/working. Without waiting for the developers, you have two options. 1 is to go back to RC 7 where it worked. 2 is to just manually scp the image & snapin changes to the nodes as appropriate and see if that works.

    Here’s how to go back to rc7:

    git checkout 31a61db2c12ebc394ea167f9b37ba6ef4da7ea99
    cd bin
    ./installfog.sh -y
    

    Normally I don’t recommend downgrading, but it looks like no DB changes have happened between then and now, so it should work in this case. (Future readers: it will not work for you.)

    When our developers come back from vacation, hopefully they can resolve the issue.

    AND - I feel it’s time to build some quality-checking for replication - so I’ll be working on that in my free time in the coming weekends so that we can immediately know when this stuff isn’t working in one of the branches.
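For the manual-copy option, it can help to print the copy commands first and review them before running anything. The sketch below substitutes `rsync -a` for the `scp` mentioned in this post, and it only prints commands. The `/images` path, the `root` login and the function name `plan_manual_copy` are assumptions for illustration; `/opt/fog/snapins` is the snapin path that appears in the error log elsewhere in this thread.

```shell
# Hypothetical defaults; override them to match the storage node definitions in FOG.
IMAGE_DIR=${IMAGE_DIR:-/images}
SNAPIN_DIR=${SNAPIN_DIR:-/opt/fog/snapins}
SSH_USER=${SSH_USER:-root}

# Print (do not run) one rsync command per directory per node, for review.
plan_manual_copy() {
    for node in "$@"; do
        printf 'rsync -a %s/ %s@%s:%s/\n' "$IMAGE_DIR" "$SSH_USER" "$node" "$IMAGE_DIR"
        printf 'rsync -a %s/ %s@%s:%s/\n' "$SNAPIN_DIR" "$SSH_USER" "$node" "$SNAPIN_DIR"
    done
}

# Usage:
#   plan_manual_copy roa1fogsnl01 sal2fogsnl01
```

Once the printed commands look right, they can be pasted or piped to a shell on the main server.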



  • @wayne-workman All nodes are already up-to-date (as reported by git pull). We had updated to v24 at 10 am this morning and retested replication before posting this issue to the forum.

    Jim


  • Moderator

    @jim-graczyk I’m looking through the recent commits to the working branch of FogProject: https://github.com/FOGProject/fogproject/commits/working

    Tom did push a commit to correct issues with replication in there. So what I’d recommend is updating to the latest working branch to see if it’s fixed or not.

    If you’re unfamiliar with how to switch branches and update, go to your git repo location on your servers (this needs to be done on each server). Do this first:
    git pull
    Then this:
    git checkout working
    then the usual:
    cd bin
    ./installfog.sh -y



  • @wayne-workman

    I was able to connect to each node’s FTP service with the credentials stored in the storage definition in FOG and transfer files into and out of each server.

    Jim



  • @wayne-workman

    I’m thinking the FTP errors are the result of failed replication, not the cause, but again, what do I know…
    I’m thinking this only because the FTP errors are from FOG client machines attempting to access a snapin that hasn’t replicated.

    I’ll test the FTP connection of each node for files I know are actually on each node and report back…

    Jim


  • Moderator

    @jim-graczyk lots of FTP errors. I’d say check your FTP credentials to be sure they are correct. There’s a test here you can do: https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_FTP#Try_to_get_a_file_with_Windows:



  • @wayne-workman

    I did the tail command you posted with ’ | grep error’ on the end. Here’s what matched:

    [root@Sal1FOGV1 bin]# tail -n 500 /var/log/httpd/error_log | grep error
    [Mon Sep 04 12:06:49.751559 2017] [:error] [pid 110308] [client 192.168.100.171:49805] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/Malwarebytes.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Mon Sep 04 12:25:33.023607 2017] [:error] [pid 110150] [client 192.168.100.171:49721] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/1-Win10NET35.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Mon Sep 04 13:00:58.193160 2017] [:error] [pid 86532] [client 192.168.100.171:49742] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/Malwarebytes.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Mon Sep 04 13:02:51.022016 2017] [:error] [pid 93233] [client 192.168.100.171:49766] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/Office_v2013_PP.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Tue Sep 05 14:31:15.976881 2017] [:error] [pid 89324] [client 192.168.100.196:50414] PHP Warning: ftp_chmod(): SITE CHMOD command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 137
    [Tue Sep 05 15:00:08.147464 2017] [:error] [pid 86534] [client 10.179.100.156:49891] PHP Warning: array_filter() expects parameter 1 to be array, null given in /var/www/html/fog/lib/fog/image.class.php on line 164, referer: http://fogserver/fog/management/index.php?node=image&sub=membership&id=23
    [Tue Sep 05 15:30:25.281071 2017] [:error] [pid 86534] [client 10.179.100.177:58270] PHP Warning: ftp_chmod(): SITE CHMOD command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 137
    [Tue Sep 05 16:20:21.716391 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.716447 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_delete(): Delete operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 244
    [Tue Sep 05 16:20:21.717251 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.717739 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.718304 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.718808 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.719324 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.719736 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.720228 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.720734 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823
    [Tue Sep 05 16:20:21.868833 2017] [:error] [pid 86534] [client 10.179.100.177:42560] PHP Warning: ftp_chmod(): SITE CHMOD command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 137
    [Tue Sep 05 22:18:12.376965 2017] [:error] [pid 6299] [client 192.168.100.196:49729] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/1-Win10NET35.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Tue Sep 05 22:23:05.179099 2017] [:error] [pid 6299] [client 192.168.100.196:49690] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/Office_v2013_Std.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Wed Sep 06 10:06:51.808428 2017] [:error] [pid 101284] [client 192.168.100.196:50025] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/1-Win10NET35.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Wed Sep 06 14:23:46.601396 2017] [:error] [pid 12985] [client 10.179.100.156:54027] PHP Warning: array_filter() expects parameter 1 to be array, null given in /var/www/html/fog/lib/fog/image.class.php on line 164, referer: http://fogserver/fog/management/index.php?node=image&sub=membership&id=24
    [Wed Sep 06 15:01:43.531662 2017] [:error] [pid 12985] [client 10.179.100.176:57724] PHP Warning: ftp_chmod(): SITE CHMOD command failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 137
    [Wed Sep 06 15:44:05.209565 2017] [:error] [pid 188731] [client 192.168.100.171:49720] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/1-Win10NET35.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Wed Sep 06 16:07:37.877958 2017] [:error] [pid 12985] [client 192.168.100.171:49720] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/Malwarebytes.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Wed Sep 06 16:10:13.754976 2017] [:error] [pid 12985] [client 192.168.100.171:49749] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/Office_v2013_PP.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Wed Sep 06 17:23:14.101811 2017] [:error] [pid 158225] [client 192.168.100.171:51205] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/1-Win10NET35.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Wed Sep 06 18:06:31.179787 2017] [:error] [pid 8849] [client 192.168.100.171:49767] PHP Warning: fopen(ftp://…@192.168.100.30/opt/fog/snapins/1-Win10NET35.exe): failed to open stream: FTP server reports 550 Could not get file size.\r\n in /var/www/html/fog/lib/client/snapinclient.class.php on line 618
    [Wed Sep 06 18:29:18.857955 2017] [:error] [pid 243543] [client 10.179.100.156:56865] PHP Warning: ftp_rmdir(): Remove directory operation failed. in /var/www/html/fog/lib/fog/fogftp.class.php on line 823, referer: http://fogserver/fog/management/index.php?node=snapin&sub=edit&id=42
    [root@Sal1FOGV1 bin]#

    Jim
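With this many repeated lines, a per-function count makes the pattern easier to read than the raw output. This is a minimal sketch, assuming the Apache error log format shown in this post; the function name `summarize_php_warnings` is made up for illustration.

```shell
# Count PHP warnings in an Apache error log by the PHP function they name.
# Prints "<count> <function>", highest count first, ties sorted by name.
summarize_php_warnings() {
    awk 'match($0, /PHP Warning: +[a-z_]+\(/) {
             fn = substr($0, RSTART, RLENGTH)
             sub(/^PHP Warning: +/, "", fn)   # strip the label
             sub(/\($/, "", fn)               # strip the trailing "("
             n[fn]++
         }
         END { for (f in n) print n[f], f }' "$1" | sort -k1,1nr -k2
}

# Usage:
#   summarize_php_warnings /var/log/httpd/error_log
```

In the output above, the `fopen` entries all point at files under /opt/fog/snapins on 192.168.100.30, which fits the idea that clients are asking for snapins that never reached that node.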


  • Moderator

    @jim-graczyk said in Possible Image and Snapin Replication Problem w/ Working Branch:

    I will assume that group to group replication replicates from node to node between groups

    Right, but in group -> group replication, it goes from the master of the one group to the master of the other group. After that’s done, then normal replication happens from master-> all other nodes in the master’s group. There’s an article about it here: https://wiki.fogproject.org/wiki/index.php?title=Replication
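The two stages described here can be modelled in a few lines of shell. This is purely an illustration of the ordering, not FOG's implementation: the input format, the function name `replication_plan` and the group names `groupB` and `groupC` are hypothetical, while the node names come from the logs in this thread.

```shell
# stdin: one line per storage group: "<group> <master> [member ...]".
# The first line is treated as the source (primary) group.
# Prints master -> master transfers first, then master -> member transfers.
replication_plan() {
    awk 'NR == 1 { src = $2 }
         { master[NR] = $2; line[NR] = $0 }
         END {
             for (i = 2; i <= NR; i++)
                 print "group->group: " src " -> " master[i]
             for (i = 1; i <= NR; i++) {
                 n = split(line[i], f, " ")
                 for (j = 3; j <= n; j++)
                     print "group->node: " master[i] " -> " f[j]
             }
         }'
}

# Usage with one node per group, as in this thread:
#   printf '%s\n' 'default DefaultMember' 'groupB roa1fogsnl01' 'groupC sal2fogsnl01' | replication_plan
```

With one node per group, only group->group lines come out. That matches the "There are no other members to sync to." messages in the Group -> Nodes pass of the logs.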



  • @wayne-workman

    Disregard my confusion about group to group and group to node replication. I see that now in the log, which accounts for each image and snapin appearing twice in each log. I will assume that group to group replication replicates from node to node between groups and group to node replicates from node to node within a group. If I have that wrong, let me know.

    Jim


  • Moderator

    @jim-graczyk said in Possible Image and Snapin Replication Problem w/ Working Branch:

    This implies to me there is no db connection issue, but what do I know.

    You’re right, but I needed you to check - I don’t want to assume things.

    So, the next place to look for issues is /var/log/httpd/error_log. Look in there for any errors that look related to replication.
    This would tail the last 500 lines:
    tail -n 500 /var/log/httpd/error_log
    You can post logs here also if you find errors.



  • @wayne-workman

    I did a wget http://myfognode/fog/service/getversion.php from my main FOG server where I substituted the IPs of my 3 FOG servers (main and 2 storage) in for myfognode.

    Each returned a file and each response was the correct version number, 24. Also, FOG Settings reports the same, as does the Dashboard page. This implies to me there is no db connection issue, but what do I know.

    I’m not clear on the group->group replication versus the group->node issue you’re talking about. Sooner or later the system needs to replicate to a node, something that the replication service fails to initiate, as far as I can tell. To be clear, I see logs up to Aug 26th describing replication actions, but after that there are no logs, as if from that time on the replication process never saw a need to start any replication actions.

    I didn’t try to delete all the storage groups and recreate them. Is that the next step?

    Jim


  • Moderator

    @jim-graczyk said in Possible Image and Snapin Replication Problem w/ Working Branch:

    Any idea how to proceed?

    Looks like you’re doing group -> group replication only. So there would not be any group -> node replication. Also, snapins use the same code that images use for replication, so that’s why those are broken too.

    I’m going to guess it’s a DB connection problem (I could be wrong) - test it like this on each problem node:
    http://10.0.0.28/fog/service/getversion.php
    Replace the IP address with the node addresses. It should show a version number. If the node has a database connection problem, it’ll say so right there.

    After you check this, we’ll go from there.
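The same check can be run against several nodes from a shell instead of a browser. This is a minimal sketch, assuming `wget` is installed on the machine running it; the function names `version_url` and `check_versions` are invented for illustration, and the URL path is the one given in this post.

```shell
# Build the version-check URL for one node.
version_url() {
    printf 'http://%s/fog/service/getversion.php\n' "$1"
}

# Query each node in turn; wget writes the response body to stdout.
# -T 5 sets a 5 second timeout, -t 1 disables retries.
check_versions() {
    for node in "$@"; do
        printf '%s: ' "$node"
        wget -q -O - -T 5 -t 1 "$(version_url "$node")" || printf 'no response'
        printf '\n'
    done
}

# Usage: pass the address of each node to test, e.g.
#   check_versions 10.0.0.28
```

Every node should print the same version number. Any database error text returned by a node will appear in place of the version.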

