Possible Image and Snapin Replication Problem w/ Working Branch



  • Server
    • FOG Version: v1.5.0 RC-8 v24 w/ Location Plugin
    • OS: CentOS 7
    Description

    We have 2 Storage Nodes running the same FOG version and OS as above, 3 Storage Groups, and 3 Locations. Each Location has 1 storage node in 1 storage group.

    Snapin and Image Replication stopped as of the Aug 26th Working branch update. We’ve removed and re-connected the storage nodes, and we’ve removed, reinstalled and reconfigured the Locations Plugin, but the problem persists.

    All images and all snapins are configured to replicate to all storage groups. The main FOG server is the primary for all images and snapins. On the dashboard, Storage nodes are shown to be online and Storage Groups report as expected.

    The Image Replicator Log from the main FOG server (the Primary for all Images) shows two kinds of issue in a single pass.
    First, it shows images failing to replicate because the storage nodes are reported as offline (but they aren’t):
    [09-07-17 11:23:23 am] * Starting Image Replication.
    [09-07-17 11:23:23 am] * We are group ID: 1. We are group name: default
    [09-07-17 11:23:23 am] * We are node ID: 1. We are node name: DefaultMember
    [09-07-17 11:23:23 am] * Attempting to perform Group -> Group image replication.
    [09-07-17 11:23:23 am] | Replicating postdownloadscripts
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name:
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] | Replicating postinitscripts
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name:
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Found Image to transfer to 3 groups
    [09-07-17 11:23:23 am] | Image Name: W10Prox64BIOSSysprep
    [09-07-17 11:23:23 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] * Found Image to transfer to 3 groups
    [09-07-17 11:23:23 am] | Image Name: W7ProSp1x32ReamDrivers
    [09-07-17 11:23:23 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] * Found Image to transfer to 3 groups
    [09-07-17 11:23:23 am] | Image Name: W7ProSP1x64ReArmDrivers
    [09-07-17 11:23:23 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:23 am] sal2fogsnl01 Server does not appear to be online.
    .
    .
    .
    Second, it shows some images are not configured to replicate:
    [09-07-17 11:23:23 am] | Image Name: Win7ProSP1x64DriversRearm
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Attempting to perform Group -> Nodes image replication.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name: W10Prox64BIOSSysprep
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name: W7ProSp1x32ReamDrivers
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    [09-07-17 11:23:23 am] | Image Name: W7ProSP1x64ReArmDrivers
    [09-07-17 11:23:23 am] | There are no other members to sync to.
    [09-07-17 11:23:23 am] * Not syncing Image between nodes
    .
    .
    .

    Note that some of the images are listed twice in one replication pass.

    Similarly, the Snapin Replication Log from the Main FOG Server (Primary for all Snapins) shows the same two issues:
    First that the Storage Nodes are offline:
    [09-07-17 11:23:26 am] * Starting Snapin Replication.
    [09-07-17 11:23:26 am] * We are group ID: 1. We are group name: default
    [09-07-17 11:23:26 am] * We are node ID: 1. We are node name: DefaultMember
    [09-07-17 11:23:26 am] * Attempting to perform Group -> Group snapin replication.
    [09-07-17 11:23:26 am] | Replicating ssl less private key
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name:
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name:
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: -DeliverFogExe
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: -ExtendDisk
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: -Timeout
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] * Found Snapin to transfer to 3 groups
    [09-07-17 11:23:26 am] | Snapin Name: 0-AdminSet
    [09-07-17 11:23:26 am] roa1fogsnl01 Server does not appear to be online.
    [09-07-17 11:23:26 am] sal2fogsnl01 Server does not appear to be online.
    .
    .
    .

    Second, that the snapins aren’t configured for replication:
    [09-07-17 11:23:26 am] * Attempting to perform Group -> Nodes snapin replication.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: -DeliverFogExe
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: -ExtendDisk
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: -Timeout
    [09-07-17 11:23:26 am] | There are no other members to sync to.
    [09-07-17 11:23:26 am] * Not syncing Snapin between nodes
    [09-07-17 11:23:26 am] | Snapin Name: 0-AdminSet
    [09-07-17 11:23:26 am] | There are no other members to sync to.

    And, again, some, if not all, Snapins are listed twice in the single log pass.

    This all worked in previous versions of the working branch of v1.5.0 at the end of August.

    In our current system, image and snapin deployments fail from the storage nodes but work from the main FOG server. It appears the only problem is replication. Our next step is to manually copy files around and test deployment to verify that the problem is limited to replication alone.

    Any idea how to proceed?

    Any suggestions would be appreciated.

    Thanks,
    Jim



  • @Tom-Elliott

    I have been keeping an eye on this replication issue myself. I’m running v52 of the working branch. Should I wait until RC10 comes out to update? So far I think replication is functioning normally, but I figured I could wait until RC10.



  • @tom-elliott

    Fantastic …

    Updated the lab and replication is working to the Storage Nodes, Group to Group, as it should be AND the Replication Log UI is not only working but provides much more detail.

    Please consider this issue Solved…

    Just an FYI - since we have a script working in the new system, we may or may not move it to the Working branch. It’ll depend on how much pain our script causes us as we add Locations and Storage Nodes versus the fear of instability that is part and parcel of using the Working branch. Our lab will stay up to date on the working branch. At this point, we think we can hold out for the Dev release of RC10.

    Thanks very much.

    Jim


  • Senior Developer

    Working should fix the GUI not displaying the logs for you now too.


  • Developer

    @Jim-Graczyk Try less /opt/fog/log/fogreplicator.log on the FOG server command line.
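
    For a quicker overview than paging through the file, something like the following might help. It is only a sketch: the function name summarize_offline is made up for illustration, and it assumes the log lines look like the excerpts in the original post ("<node> Server does not appear to be online.").

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FOG): count how often each node is
# reported offline in a replicator log.
# Assumes lines of the form:
#   [09-07-17 11:23:23 am] roa1fogsnl01 Server does not appear to be online.
summarize_offline() {
    local log_file="$1"
    grep 'Server does not appear to be online' "$log_file" \
        | sed -E 's/^ *\[[^]]*\] *([^ ]+) Server does not appear.*/\1/' \
        | sort | uniq -c | awk '{print $2 ": " $1}'
}

# Example call, using the log path from the post above:
# summarize_offline /opt/fog/log/fogreplicator.log
```

    Each output line is a node name followed by the number of times the log reported it offline.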


  • Senior Developer

    @jim-graczyk I’ll fix that later. Sorry, that’s something I missed in a jQuery call. The logs are still ‘working’, you just can’t view them from the GUI currently.




  • @tom-elliott

    Updated Lab to v52. Can no longer check replication logs.

    Fog Configuration / Log Viewer gives me a Files: pulldown that is now empty.

    Jim


  • Senior Developer

    @jim-graczyk Switching to dev-branch is as easy as git checkout dev-branch (re-run installer of course too).

    However, reinstalling would put it back into a state where it isn’t working properly again. I don’t expect the current working branch to remain working for too much longer though. Probably by the end of this week I’ll make it into RC-10.



  • @tom-elliott

    Tom,

    I’m using the working branch on my lab setup. I’ll pull it there.

    I’m also OK trying the working branch on my new installation, as long as I can switch back to the Dev branch after pulling the current Working version.

    Is it just a matter of changing the git checkout back to dev branch?

    FYI - My new installation is showing SVN 6079 while the lab is showing SVN 6080 - if that helps any.

    thanks,

    Jim


  • Senior Developer

    @jim-graczyk Can you try on the working branch please? There was an issue with the implementation of finding “isAvailable” nodes, which I’m pretty sure has been corrected.



  • I just wanted to let the FOG team know that I have just completed the build of a new FOG server using v1.5.0 RC9, Dev Branch, with associated storage nodes. We created storage groups and storage nodes after uploading images, installed the Locations Plugin and set locations, then moved our images and snapins to the new FOG installation…

    And we’ve reproduced the same problem we had on the FOG system on which the original posting was based. We have no replication of images or snapins, even though the storage nodes are working as expected.

    We’ve written a bash script to replicate both snapins and images from the main FOG server to the storage nodes, so our installation works. Based on my experience, though, it’s very easy to reproduce this replication problem: just do an installation on fresh servers and the problems ensue.

    Thanks,

    Jim
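
    The script mentioned in the post above was not shared in the thread. Purely as an illustration of that kind of manual workaround, and not the author’s script, a minimal sketch could look like this. The paths /images and /opt/fog/snapins, root SSH key access and the use of rsync are all assumptions; the node names are the ones that appear in the logs in the original post.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the script described in the post above.
# Assumptions: images live in /images and snapins in /opt/fog/snapins on
# every server, rsync is installed, and SSH keys allow root logins.
NODES=(roa1fogsnl01 sal2fogsnl01)   # node names taken from the posted logs
DIRS=(/images /opt/fog/snapins)     # assumed storage paths

# Print (DRY_RUN=1, the default) or run one rsync per node and directory.
replicate_all() {
    local node dir
    for node in "${NODES[@]}"; do
        for dir in "${DIRS[@]}"; do
            local cmd=(rsync -a "${dir}/" "root@${node}:${dir}/")
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "${cmd[*]}"
            else
                "${cmd[@]}"
            fi
        done
    done
}

# Review the printed commands first, then run with DRY_RUN=0 to copy:
# replicate_all
# DRY_RUN=0 replicate_all
```

    Defaulting to a dry run keeps the sketch from copying anything until the printed commands have been checked against the real paths on each server.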


  • Moderator

    @sebastian-roth I have plans (in my head) to build automated functional testing for this, but I’m not set up to test replication at the moment.


  • Developer

    @Moderators @Testers Is anyone able to replicate this issue?



  • @tom-elliott

    thanks Tom… we’ve tested FTP and found it working on all nodes…

    Jim


  • Senior Developer

    @wayne-workman while quality assurance is always a good thing, the code I added would not have broken what is being reported here. I added a simple check to find out if it can reach the server on port 21, the ftp port. If it cannot communicate it will report it cannot. If it does communicate it will perform replication tasks.
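
    To run a similar reachability test by hand from the main server, a rough equivalent might be the following. This is not FOG’s own code, only a sketch: the function name ftp_port_reachable is made up, and it relies on bash’s /dev/tcp redirection and the timeout command from GNU coreutils being available.

```shell
#!/usr/bin/env bash
# Hypothetical check, not FOG's implementation: succeed if a TCP
# connection to host:port (default 21, the FTP port) opens within 5 seconds.
ftp_port_reachable() {
    local host="$1" port="${2:-21}"
    # /dev/tcp/<host>/<port> is a bash feature; timeout(1) is GNU coreutils.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example, using the node names from the logs in the original post:
# for node in roa1fogsnl01 sal2fogsnl01; do
#     if ftp_port_reachable "$node"; then
#         echo "$node: port 21 reachable"
#     else
#         echo "$node: port 21 NOT reachable"
#     fi
# done
```

    A successful connection only shows that something is listening on port 21; it does not prove that an FTP login with the node’s credentials works.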



  • @wayne-workman

    I prefer to move forward rather than backward so I’ll choose Option B - manually replicate and test that it’s only a replication problem. We’re already working on that. This will work for a few days, but we’ll have problems as soon as we have to upload PCs before imaging.

    I’ll post here if all images and snapins work to clients at each site.

    Jim


  • Moderator

    @jim-graczyk Ok then. I think you’ve found an issue with replication in RC8/working. Without waiting for the developers, you have two options. 1 is to go back to RC 7 where it worked. 2 is to just manually scp the image & snapin changes to the nodes as appropriate and see if that works.

    Here’s how to go back to rc7:

    git checkout 31a61db2c12ebc394ea167f9b37ba6ef4da7ea99
    cd bin
    ./installfog.sh -y
    

    Normally I don’t recommend downgrading, but it looks like no DB changes have happened between then and now, so it should work in this case. (Future readers, it will not work for you.)

    When our developers come back from vacation, hopefully they can resolve the issue.

    AND - I feel it’s time to build some quality-checking for replication - so I’ll be working on that in my free time in the coming weekends so that we can immediately know when this stuff isn’t working in one of the branches.



  • @wayne-workman All nodes are already up-to-date (as reported by git pull). We had updated to v24 at 10 am this morning and retested replication before posting this issue to the forum.

    Jim


  • Moderator

    @jim-graczyk I’m looking through the recent commits to the working branch of FogProject: https://github.com/FOGProject/fogproject/commits/working

    Tom did push a commit to correct issues with replication in there. So what I’d recommend is updating to the working branch to see if it’s fixed or not.

    If you’re unfamiliar with how to switch branches and update, go to your git repo location on your servers (this needs to be done on each server). Do this first:
    git pull
    Then this:
    git checkout working
    then the usual:
    cd bin
    ./installfog.sh -y

