Image replication between nodes keeps restarting
-
I’ve got two servers running 1.4.0.
The master storage node is located in Denmark and the other storage node is located in Norway.
The replication works, but every time it finishes replicating an image, it deletes the image on the server in Norway and starts over again. It has been doing this for about a week. Any ideas?
-
This is an interesting issue.
The fog replicator compares the md5sum of a certain number of bytes of each file before it decides if they are different. If the md5sum key is not the same then it makes the decision that the storage node is out of sync and starts replicating the image again.
The fog replicator logs are in /opt/fog/logs. Does the master replication log or the replication log for the Norway server give you an idea why it keeps replicating?
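For anyone who wants to check by hand whether two copies of an image file would look "the same" under a partial-checksum comparison like the one described above, here is a minimal Python sketch. It is not the replicator's actual code: the function names and the default byte count are assumptions made up for illustration, since the exact number of bytes FOG hashes isn't stated here.

```python
import hashlib


def partial_md5(path, num_bytes=10 * 1024 * 1024):
    """Return the MD5 hex digest of the first num_bytes of a file.

    num_bytes is a hypothetical default, not FOG's real value.
    """
    digest = hashlib.md5()
    remaining = num_bytes
    with open(path, "rb") as handle:
        while remaining > 0:
            # Read in small chunks so large images are never loaded whole.
            chunk = handle.read(min(65536, remaining))
            if not chunk:
                break  # file is shorter than num_bytes
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def looks_in_sync(master_path, node_path, num_bytes=10 * 1024 * 1024):
    """True if the leading num_bytes of both files hash to the same MD5."""
    return partial_md5(master_path, num_bytes) == partial_md5(node_path, num_bytes)
```

Run it against a copy of the file from the master and a copy from the storage node. If `looks_in_sync` returns False for a file that has supposedly finished replicating, that would be consistent with the replicator deciding the node is out of sync.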
-
Hmmm, maybe it is trying to replicate too often and that’s the issue.
Any clue where I can see/change how often it tries to replicate an image?
-
@AndersHoeg said in Image replication between nodes keeps restarting:
Any clue where I can see/change how often it tries to replicate an image?
FOG Web Interface -> FOG Configuration -> FOG Settings -> FOG Linux Service Sleep Times -> IMAGEREPSLEEPTIME
FOG Replication logs are also viewable in the web interface:
FOG Web Interface -> FOG Configuration -> Log Viewer -> Image Replicator
-
@Wayne-Workman Thank you so much! It worked. It syncs the images and doesn’t delete them over and over again.
-
I know this is a really old post, and I apologize if I should have started a new topic. I’m experiencing the same symptoms, after only accidentally discovering that replication keeps deleting my images from the nodes. Will increasing the value of FOG Linux Service Sleep Times - IMAGEREPSLEEPTIME stop the image from being deleted?
Basically, would I need to set the value to a point beyond the time it takes FOG to finish replicating? If the value is less than the time it takes FOG to finish replicating, would this explain why the images keep being deleted and replication keeps restarting?
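To put rough numbers on that question, here is a back-of-the-envelope sketch in Python. It only estimates how long one image would take to transfer at a given bandwidth cap and compares that to the sleep time. The function names are made up, the bandwidth unit (kilobits per second) is an assumption, and it ignores protocol overhead, so treat the result as a lower bound rather than what FOG will actually do.

```python
def transfer_seconds(image_bytes, bandwidth_kbps):
    """Estimated seconds to send image_bytes at bandwidth_kbps.

    Assumes kbps means 1000 bits per second and ignores overhead.
    """
    if bandwidth_kbps <= 0:
        raise ValueError("bandwidth_kbps must be positive")
    return image_bytes * 8 / (bandwidth_kbps * 1000)


def sleep_time_is_enough(image_bytes, bandwidth_kbps, sleep_seconds):
    """True if the estimated transfer finishes before the next cycle starts."""
    return transfer_seconds(image_bytes, bandwidth_kbps) < sleep_seconds


# Hypothetical example: a 40 GB image over a 50 Mbit/s link.
estimate = transfer_seconds(40_000_000_000, 50_000)  # 6400.0 seconds
print(estimate, sleep_time_is_enough(40_000_000_000, 50_000, 600))
```

In this made-up example the transfer needs about 6400 seconds, so a sleep time of 600 seconds would be shorter than one replication pass, which is the situation being asked about.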
-
@jgallo If the image replicator is deleting the files when replication hasn’t completed by the next iteration, this is a bug. I thought this was fixed completely somewhere between 1.2 and 1.3… I guess the issue has cropped up again. So yeah, just to see if this is the case, set your IMAGEREPSLEEPTIME to something huge and see if it’s fixed. If that fixes it, then that’s bug confirmed.
-
I increased it to 10800, which is 3 hours, just for the hell of it. Seems to work fine on 1.5.0 RC7, unless replication takes longer than that, LOL. The issue I ran into was that when I try to upload an image I get some weird DHCP issues. So I went ahead and moved to RC9, and then replication doesn’t work. I came across a post about the same issue, and the suggestion was to go back to RC7. So I’m in a bind and not sure what to do. I’m going to start from scratch, add each node, tail the fogreplicator.log file, and see if the same errors occur. I’m getting a “(storage node) server does not appear to be online” message. This only occurs in RC9; as soon as I go back to RC7, replication works fine.
-
@jgallo The only reason I can think of for RC9 being broken is the available checks now in place. However, this would lead me to think it’s a firewall issue. You can fix this by commenting out the new lines in the file. That is the only thing that would be causing the issue from what I can see.
-
@jgallo specifically lines 533 - 549 of
/var/www/fog/lib/service/fogservice.class.php
-
Thank you Tom. I went ahead and made the changes. Worked as soon as a reboot occurred.
-
Mind installing the working branch and seeing if it is working for you now? I looked over the available code, and apparently I made one mistake on a variable for the “domain” element being checked. This variable was not defined anywhere. So I have now set up what should be accurate and just need something to test against.
git checkout working
git pull
cd bin
./installfog.sh -y
-
To see the change I made, look here:
https://github.com/FOGProject/fogproject/commit/a0eca190b1f85e4ea7c7586f683f81fc947bb81f
-
I don’t mind at all. I will install the working branch once the image that is currently replicating is complete. Once I install the working branch, I will upload an image and check if replication is working.
-
I uploaded an image and tried to manually start the replication. I did a tail on the fogreplicator.log file and it seemed like it wasn’t working. So I went to lunch and came back. Looks like it eventually worked. I see the image replicated to my only storage node right now. There are some other weird things occurring, but I will create a separate post as it has to do with my custom fog.man.reg being copied. Thank you for your help.
-
Added another storage node and replication seems to operate normally. I still have the linux sleeptimer at 10800 seconds so I think we are good for now. My question is do I stay on the working branch or do I go back to RC9? I’m not too familiar when changes are made to RC’s and working branches.
-
@jgallo Stay on the working branch until the next RC.
-
So today I got a chance to start adding even more storage nodes. The FOG Server is on the working branch, as is the storage node. Once I’m finished installing the storage node, I go to the console and make changes like enabling the graph and changing the storage node name. I also place a value in the Replication Bandwidth because it won’t let me use 0. When I tail the fogreplicator.log file, it reports:
*Type: 2, File: /var/www/fog/lib/fog/fogftp.class.php, Line: 463, Message: ftp_login(): Login incorrect., Host: 10.225.100.21, Username: fog
Now when I ran the installer, I did type the storage node password from FOG correctly. I know this because the IP shows up under storage. This is awkward because my first two storage nodes worked but were on RC9. Should I keep the storage nodes at RC9 or put the storage nodes on working branch?
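One way to narrow down an ftp_login() failure like this is to try the same FTP login by hand from the master. The snippet below is a minimal sketch using Python’s standard ftplib. The check_ftp_login name and the ftp_factory parameter are made up for illustration, and the password shown in the usage note is a placeholder to replace with whatever the node is actually configured with.

```python
import ftplib


def check_ftp_login(host, username, password, timeout=10, ftp_factory=ftplib.FTP):
    """Try an FTP login and return (ok, message).

    ftp_factory is a hypothetical hook so the check can be exercised
    without a real server; by default it is ftplib.FTP.
    """
    try:
        # ftplib.FTP connects immediately when a host is given.
        ftp = ftp_factory(host, timeout=timeout)
    except ftplib.all_errors as exc:
        return False, f"connect failed: {exc}"
    try:
        ftp.login(username, password)
    except ftplib.all_errors as exc:
        return False, f"login rejected: {exc}"
    finally:
        try:
            ftp.quit()  # polite shutdown
        except ftplib.all_errors:
            ftp.close()  # fall back to dropping the connection
    return True, "login ok"
```

For example, `check_ftp_login("10.225.100.21", "fog", "PASSWORD-GOES-HERE")` uses the host and username from the log line above. A "login rejected" result with a 530 reply would point at the stored credentials rather than at the network or firewall.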
-
@jgallo I’d put everything on the same branch, particularly as it addresses replication directly. The FTP error, however, is not because of the password you add when installing a new node. That password is what’s created for the node automatically. Sometimes, however, it’s possible it creates unusable characters.