Replication Issue
-
@mronh Ok, I have dug through a lot of code in the last two days, found and fixed a couple of issues with replication. All that will be in the next release. Hopefully coming soon. Let me know if you are keen to test those changes beforehand.
-
@Sebastian-Roth Sure, right now I’m using only the server due to this issue.
If everything gets fixed, my summer here in the coming months will be sooooo much easier… hahaha
What do I have to do?
-
@mronh The current changes are on a new branch, replication, which I will merge into working after a first round of feedback.
Not sure if you have ever installed FOG unstable/testing. This is done using git to checkout the current code and install from that.
git clone https://github.com/FOGProject/fogproject/
cd fogproject
git checkout replication
cd bin
./installfog.sh
Important notice: I had to change some of the hashing code too, and therefore nodes on different versions (1.5.4 or working vs. the replication branch) will end up replicating images over and over again. So you need to have all nodes on the replication branch or set up a separate test environment!!
Please make sure you stop replication first (systemctl stop FOGImageReplicator), then update the storage node and after that update the master node.
-
@Sebastian-Roth Will those hashing code changes you made help with Ubuntu servers, specifically 16.04? I remember earlier this summer that there were replication issues with looping due to a hash file not matching, which were resolved to an extent in the working branch. I’m curious because I have many storage nodes and I can switch over from the working branch to replication if your changes help.
-
@JGallo I have tested a fair bit and fixed a couple of issues that still were in the working branch. Also the replication branch is based on working and so it has even more replication issues fixed since 1.5.4!
I can’t promise you this is issue free yet, as I don’t have a test setup with many nodes. But I am sure it’s better than 1.5.4 and the working branch were. So I would be very happy if you’d give it a try and post feedback and maybe logs if you still see issues.
-
@Sebastian-Roth Of course!! I will update server and nodes today to replication branch and get it ready for an image upload. I think the issue was with images being updated and then uploaded to existing images on the FOG server. Replication of a new image definition was fine even to the storage nodes. It will probably be a bit before I can have some concrete information since I don’t have many images that replicate across all nodes since I have storage groups defined in the image.
-
@Sebastian-Roth right. I’ll do it next week then.
this week im on the leash again… oh boy how I hate the end of the year…=/ hahahaha
-
@Sebastian-Roth I switched over to the replication branch and updated all storage nodes along with my FOG server. I uploaded an image and it seems to be working fine for the original image. I haven’t updated the image since it’s very new, but when I have a chance, which will be very shortly since a project I’m currently working on will allow me to update an image on an existing one, I will go ahead and update it and tail the replication log.
I also noticed in a different post that another user did the same thing and tested replication. Looks like the changes in the replication branch have worked. I will update as well once I upload an image to an existing one, to see if the updated image replicates properly to the storage node.
-
@JGallo The more feedback I get on this the better. Looking forward to hearing from you.
-
@Sebastian-Roth Hey man, I made the upgrade on both server and storage (following your previous instructions) and now we have some data to think about. Just to make it clear, I deleted all images in the storage to force a full replication from the beginning.
I’ll upload the logs from server and storage again, but the short story is that I see some “Erro fatal: max-retries exceeded (421 There are too many connections from your internet address.)” and “File size mismatch”
Server Side:
2_1542111667714_SERVER_php7.0-fpm.log
1_1542111667714_SERVER_fogreplicator.log
0_1542111667714_SERVER_error.log
Storage Side:
1_1542111678025_STORAGE_php7.1-fpm.log
0_1542111678024_STORAGE_error.log
Besides that, the improvements are really good: the logs are more accurate and the steps of the algorithm are way more “solidified”. Way to go man!
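For a quick tally of these messages in a replicator log, something like the following could help. This is only a sketch: the function name summarize_replication_errors is made up for illustration and is not part of FOG, and the three patterns are simply the messages quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of FOG): count how many lines of a log file
# contain each of the error messages quoted in this thread.
summarize_replication_errors() {
    log="$1"
    for pattern in "File size mismatch" "File hash mismatch" "max-retries exceeded"; do
        # -F: treat the pattern as a literal string, -c: print the number of
        # matching lines. grep exits non-zero on zero matches, hence "|| true".
        count=$(grep -c -F -- "$pattern" "$log" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

Call it with the path to the log you want to check, for example the fogreplicator.log file from the master node.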
-
@mronh Thanks for testing and reporting back. The first thing that jumps out at me in the logs is the many lines of hash mismatch like this:
File hash mismatch - d1p2.img.002: c8a2b5f37de6e0c7a5eeb0843b9164bac05cc984cada2cfb8da6132ba938bc2a != 7e56e1209070f2b8494e3d60cb6a27c103925bb442056ba43438c456126f027849baf5547ca1e0fec8accc309aae64ba1ae569e8698fe5e8041052cb627ed6b1
See the different length of the hash sums. I am fairly sure the storage node is not updated to the latest replication commit!!
Please check your web directory, maybe there is some link issue and you have two different versions mixed up. Run
ls -al /var/www /var/www/html /var/www/fog
and post the results here.
Besides that, I’d stop replication for now on your master node and maybe try upgrading to the replication branch on the storage node again!
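The length difference between the two hash sums can also be checked mechanically. A minimal sketch, assuming the log line has the form shown above (some text, the first hash, " != ", the second hash); hash_lengths is a hypothetical helper name, not something FOG ships.

```shell
#!/bin/sh
# Hypothetical helper: print the character length of both hashes found in a
# "File hash mismatch" log line of the form "<text> <hashA> != <hashB>".
hash_lengths() {
    line="$1"
    left=${line% != *}      # drop " != <hashB>" from the end
    left=${left##* }        # keep only the last word, i.e. <hashA>
    right=${line##* != }    # keep only what follows " != ", i.e. <hashB>
    printf '%s %s\n' "${#left}" "${#right}"
}
```

In the line quoted above, the first hash is 64 hex characters long and the second is 128, which would be consistent with a 256-bit digest on one side and a 512-bit digest on the other. Which algorithm each FOG version actually uses is not stated in this thread, so treat that reading as an assumption.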
-
@Sebastian-Roth on the server side
ls -al /var/www /var/www/html /var/www/fog
/var/www:
total 20
drwxr-xr-x 4 root root 4096 nov 12 16:06 .
drwxr-xr-x 12 root root 4096 ago 28 13:16 ..
drwxr-xr-x 10 www-data www-data 4096 nov 12 16:14 fog
drwxr-xr-x 2 root root 4096 ago 28 13:22 html
-rw-r--r-- 1 root root 41 out 10 11:10 index.php
/var/www/fog:
total 408
drwxr-xr-x 10 www-data www-data 4096 nov 12 16:14 .
drwxr-xr-x 4 root root 4096 nov 12 16:06 ..
drwxr-xr-x 2 www-data www-data 4096 nov 12 16:06 api
drwxr-xr-x 2 www-data www-data 4096 nov 12 16:06 client
drwxr-xr-x 2 www-data www-data 4096 nov 12 16:06 commons
-rw-r--r-- 1 www-data www-data 370070 nov 12 16:06 favicon.ico
lrwxrwxrwx 1 www-data www-data 13 nov 12 16:06 fog -> /var/www/fog/
drwxr-xr-x 2 www-data www-data 4096 nov 12 16:06 fogdoc
-rw-r--r-- 1 www-data www-data 572 nov 12 16:06 index.php
drwxr-xr-x 13 www-data www-data 4096 nov 12 16:06 lib
drwxr-xr-x 10 www-data www-data 4096 nov 12 16:06 management
drwxr-xr-x 3 www-data www-data 4096 nov 12 16:06 service
drwxr-xr-x 2 www-data www-data 4096 nov 12 16:06 status
/var/www/html:
total 20
drwxr-xr-x 2 root root 4096 ago 28 13:22 .
drwxr-xr-x 4 root root 4096 nov 12 16:06 ..
lrwxrwxrwx 1 root root 13 ago 28 13:22 fog -> /var/www/fog/
-rw-r--r-- 1 root root 10701 ago 28 13:17 index.html
on the storage side
ls -al /var/www /var/www/html /var/www/fog
/var/www:
total 16
drwxr-xr-x 4 root root 4096 nov 12 15:59 .
drwxr-xr-x 13 root root 4096 jul 18 11:56 ..
drwxr-xr-x 10 www-data www-data 4096 nov 12 16:00 fog
drwxr-xr-x 2 root root 4096 jul 18 12:03 html
/var/www/fog:
total 408
drwxr-xr-x 10 www-data www-data 4096 nov 12 16:00 .
drwxr-xr-x 4 root root 4096 nov 12 15:59 ..
drwxr-xr-x 2 www-data www-data 4096 nov 12 15:59 api
drwxr-xr-x 2 www-data www-data 4096 nov 12 15:59 client
drwxr-xr-x 2 www-data www-data 4096 nov 12 15:59 commons
-rw-r--r-- 1 www-data www-data 370070 nov 12 15:59 favicon.ico
lrwxrwxrwx 1 www-data www-data 13 nov 12 15:59 fog -> /var/www/fog/
drwxr-xr-x 2 www-data www-data 4096 nov 12 15:59 fogdoc
-rw-r--r-- 1 www-data www-data 572 nov 12 15:59 index.php
drwxr-xr-x 13 www-data www-data 4096 nov 12 15:59 lib
drwxr-xr-x 10 www-data www-data 4096 nov 12 15:59 management
drwxr-xr-x 3 www-data www-data 4096 nov 12 15:59 service
drwxr-xr-x 2 www-data www-data 4096 nov 12 15:59 status
/var/www/html:
total 20
drwxr-xr-x 2 root root 4096 jul 18 12:03 .
drwxr-xr-x 4 root root 4096 nov 12 15:59 ..
lrwxrwxrwx 1 root root 13 jul 18 12:03 fog -> /var/www/fog/
-rw-r--r-- 1 root root 10701 jul 18 11:56 index.html
I’ll do a git pull on the replication repo and install again on the storage, then report back here
-
@Sebastian-Roth right…look at this
Server side:
git checkout replication
Already on ‘replication’
Your branch is up-to-date with ‘origin/replication’.
Storage side:
git checkout replication
Already on ‘replication’
Your branch is up-to-date with ‘origin/replication’.
-
@Sebastian-Roth I will be updating an image definition this week. I ran into an issue with imaging a lab with storage nodes. I’m testing the solution out today and then I will be updating an image in a storage group that has storage nodes. Should I force the replication or let it run on its own? I’m curious if it matters how the replication gets started.
-
@mronh Can’t see an issue in the output you posted. Can we do a Teamviewer session today? I will be available for the next few hours.
-
@Sebastian-Roth Unfortunately remote sessions are not an option here. The outside traffic is controlled/blocked, which is beyond my jurisdiction =/
-
@mronh Give me 20 minutes to get home and get some commands together to verify that you have the right code running…
-
We figured out that the storage node somehow wasn’t properly updated. Re-running the installer fixed this. Not sure what exactly went wrong, but the logs are looking way better now. We’ll see in the morning. @mronh Please let us know.
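Note that git reporting “up-to-date with ‘origin/replication’” only compares the checkout against the last fetched state of the remote branch; it says nothing about whether the files the installer copied to the web root match that checkout. One way to check that directly is to compare file checksums between the two trees. The following is a rough sketch under assumptions: list_tree_differences and tree_checksums are hypothetical helper names, and the idea that the web sources live under packages/web in the fogproject checkout is an assumption you should verify on your own system.

```shell
#!/bin/sh
# Hypothetical helpers (not part of FOG): compare two directory trees by
# the SHA-256 checksum of every regular file they contain.
tree_checksums() {
    # find does not follow symlinks by default, so a self-referencing link
    # such as "fog -> /var/www/fog/" is not descended into.
    (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

list_tree_differences() {
    a=$(mktemp)
    b=$(mktemp)
    tree_checksums "$1" > "$a"
    tree_checksums "$2" > "$b"
    # diff prints the differing checksum lines and returns non-zero if any
    diff "$a" "$b" && status=0 || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

For example, list_tree_differences /path/to/fogproject/packages/web /var/www/fog (first path assumed) would print the files that differ. Files the installer generates or modifies may legitimately show up, so a handful of differences is not proof of a broken install, while a large number of them would hint at stale code.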
-
@JGallo said in Replication Issue:
Should I force the replication or let it run on its own? I’m curious if it matters how to let the replication start.
What do you mean by forcing the replication?
-
@Sebastian-Roth From my point of view, he can run “service FOGImageReplicator restart” and it will force the replication to do its job; otherwise he will need to wait for the scheduled time of the cron job