Replication to storage nodes not working - Trunk version 4487
-
Updated to latest trunk and all looking good apart from image replication. I have the Location plugin installed on one Normal node in head office plus a storage node (also in head office) and storage nodes at 7 remote sites.
I have 2 storage groups defined - “default” and “local”.
I have a storage node configured for each actual storage node. The head office storage node (called “localstorage”) points to the “local” group - this is for images at head office which don’t need to replicate to other sites.
The remote storage nodes are all set with the “default” storage group.
The “DefaultMember” and “localstorage” nodes are both set as Master Node so should replicate to any storage nodes within the same storage group, correct?
Log viewer shows:
[12-06-15 10:22:37 pm] * Starting Image Replication.
[12-06-15 10:22:37 pm] * We are group ID: #1
[12-06-15 10:22:37 pm] | We are group name: default
[12-06-15 10:22:37 pm] * We have node ID: #1
[12-06-15 10:22:37 pm] | We are node name: DefaultMember
[12-06-15 10:22:37 pm] * Not syncing Image between group(s)
[12-06-15 10:22:37 pm] | Image Name: MyNewShinyImage
[12-06-15 10:22:37 pm] * | I am the only member
The image itself has the Replicate? check box checked and is associated with the default storage group. Oddly if I associate the image with the “local” group replication starts to work.
The correct fog username and password have been added to each storage node under Management Username and Management Password, and I tested that I could FTP to the nodes with these credentials.
The correct IP is set on each node as well. Permissions on the /images directory look ok too - I’ve manually set as fog:root.
It would appear the master FOG server knows it’s the master and can see the correct image, but for some reason cannot see any other nodes within the same storage group.
Ideas anyone?
thanks, Kiweegie.
-
From what I’m reading of the error log, the “not syncing image between groups” is correct, but it’s not replicating to nodes within the same group?
-
I verified the issue and hopefully this is now fixed. It will also go back to allowing successive transfers as that was broken also.
-
@Tom-Elliott Good morning Tom,
Confirm since upgrade to latest release Image Replication is working once more.
A couple of related questions, please, if you don’t mind. Does the FOG version need to be updated on the storage nodes as well, or just on the master node? I’ve only updated the latter and it seems to be working, but I’m double checking.
I noticed that in the FOG Log viewer when checking Image Replicator log that it shows the fog user password for the storage nodes in plain text. Is this by design?
[12-07-15 8:53:02 am] * Started sync for Image NewShinyImage
[12-07-15 8:53:02 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; set net:limit-rate 0:64000;set net:limit-rate 0:64000;set net:limit-rate 0:64000; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/NewShinyImage /images/NewShinyImage; exit' -u fog,<FOG user pw in plain text> 10.223.40.15
[12-07-15 8:53:02 am] * Started sync for Image NewShinyImage
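For anyone pasting these replicator logs into a forum post in the meantime, here is a minimal Python sketch that masks the password before sharing. It only assumes the `-u user,password` form visible in the CMD line above and that the password contains no whitespace; the function name and the sample password are made up for illustration.

```python
import re

# Matches the "-u user,password" argument of an lftp command line.
# Assumption: the password contains no whitespace, as in the log above.
_CRED = re.compile(r"(-u\s+[^,\s]+,)\S+")


def redact_lftp_password(line: str) -> str:
    """Replace the password that follows '-u user,' with a fixed placeholder."""
    return _CRED.sub(r"\1********", line)


# Hypothetical log line with a made-up password.
sample = (
    "| CMD: lftp -e 'mirror -c -R /images/NewShinyImage "
    "/images/NewShinyImage; exit' -u fog,s3cret 10.223.40.15"
)
redacted = redact_lftp_password(sample)
print(redacted)
```

The username and target IP are left alone so the line is still useful for debugging.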
Lastly the fogstorage password which is used on the main server for nodes to connect back to. That can be edited in the GUI under FOG Configuration > FOG Settings > FOG Storage Nodes > FOG_STORAGENODE_MYSQLPASS. What config file if any does this password sit in and is amending in the GUI sufficient? Wondering if a new server has to be provisioned if we can just edit the fogstorage password here on new machine to match what was set during storage node setup. Or on the other hand can the new fogstorage password be added to a config file on the storage nodes?
Thanks again for the help and (exceedingly) quick fix/reply.
regards Kiweegie.
-
I only recently started displaying the command for the replicator services, both as a way for me to ensure the command was starting properly and to give people the command line that would be used if they wanted to debug an issue with replication. That said, for security reasons I will probably remove that element shortly.
The password is always going to be plain text, though. While FTP has some security, the username and password are normally sent “in the clear”, as the protocol was developed at a time when security was not a consideration.
All that said, the fog password is not the same as FOG_STORAGENODE_MYSQLPASS under FOG Configuration > FOG Settings. That one is the password the other storage nodes use. The MySQL password you referenced can be found in one file: Config.class.php will have it, unless you opted to use a different username/password to connect to the MySQL server.
The username/password pair used for ftp/lftp belongs to the node receiving the file or files and is stored with the storage node. No config file is used for this.
-
I’m on SVN Revision 4502 cloud 5662 running CentOS 7
Firewall and SELinux are off for both Master and non-master. I’m seeing the same thing at my site.
The Master Node and non-master node are in the same storage group. Passwords are set correctly, and I’ve reset them manually too.
I can FTP into the remote node fine using the password that shows in the logs. Permissions on /images are fine.
When I manually execute the commands in the logs, nothing happens. No errors, no spike in bandwidth, nothing.
Here are the logs:
[12-07-15 11:12:55 am]
        ___           ___           ___
       /\  \         /\  \         /\  \
      /::\  \       /::\  \       /::\  \
     /:/\:\  \     /:/\:\  \     /:/\:\  \
    /::\-\:\  \   /:/  \:\  \   /:/  \:\  \
   /:/\:\ \:\__\ /:/__/ \:\__\ /:/__/_\:\__\
   \/__\:\ \/__/ \:\  \ /:/  / \:\  /\ \/__/
        \:\__\    \:\  /:/  /   \:\ \:\__\
         \/__/     \:\/:/  /     \:\/:/  /
                    \::/  /       \::/  /
                     \/__/         \/__/
  ###########################################
  #     Free Computer Imaging Solution      #
  #     Credits:                            #
  #     http://fogproject.org/credits       #
  #     GNU GPL Version 3                   #
  ###########################################
[12-07-15 11:12:55 am] Interface Ready with IP Address: 10.51.1.53
[12-07-15 11:12:55 am] Interface Ready with IP Address: acfog.OMITTED.k12.mo.us
[12-07-15 11:12:55 am] * Starting ImageReplicator Service
[12-07-15 11:12:55 am] * Checking for new items every 600 seconds
[12-07-15 11:12:55 am] * Starting service loop
[12-07-15 11:12:55 am] * Starting Image Replication.
[12-07-15 11:12:55 am] * We are group ID: #1
[12-07-15 11:12:55 am] | We are group name: AC-Storage-Group
[12-07-15 11:12:55 am] * We have node ID: #1
[12-07-15 11:12:55 am] | We are node name: AC-Master
[12-07-15 11:12:55 am] * Not syncing Image between group(s)
[12-07-15 11:12:55 am] | Image Name: 6073admin
[12-07-15 11:12:55 am] | I am the only member
[12-07-15 11:12:55 am] * Not syncing Image between group(s)
[12-07-15 11:12:55 am] | Image Name: 7010admin
[12-07-15 11:12:55 am] | I am the only member
[12-07-15 11:12:55 am] * Not syncing Image between group(s)
[12-07-15 11:12:55 am] | Image Name: 7303admin
[12-07-15 11:12:55 am] | I am the only member
[12-07-15 11:12:55 am] * Not syncing Image between group(s)
[12-07-15 11:12:55 am] | Image Name: 8808admin
[12-07-15 11:12:55 am] | I am the only member
[12-07-15 11:12:55 am] * Not syncing Image between group(s)
[12-07-15 11:12:55 am] | Image Name: 9020admin
[12-07-15 11:12:55 am] | I am the only member
[12-07-15 11:12:55 am] * Not syncing Image between group(s)
[12-07-15 11:12:55 am] | Image Name: dell9020instuctional
[12-07-15 11:12:55 am] | I am the only member
[12-07-15 11:12:55 am] * Not syncing Image between group(s)
[12-07-15 11:12:55 am] | Image Name: tecraa10s3501
[12-07-15 11:12:55 am] | I am the only member
[12-07-15 11:12:55 am] * Found Image to transfer to 2 node(s)
[12-07-15 11:12:55 am] | Image name: 6073admin
[12-07-15 11:12:55 am] * Starting Sync Actions
[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/6073admin /images/6073admin; exit' -u fog,OMITTED 10.65.2.20
[12-07-15 11:12:55 am] * Started sync for Image 6073admin
[12-07-15 11:12:55 am] * Found Image to transfer to 2 node(s)
[12-07-15 11:12:55 am] | Image name: 7010admin
[12-07-15 11:12:55 am] * Starting Sync Actions
[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/dell7010admin /images/dell7010admin; exit' -u fog,OMITTED 10.65.2.20
[12-07-15 11:12:55 am] * Started sync for Image 7010admin
[12-07-15 11:12:55 am] * Found Image to transfer to 2 node(s)
[12-07-15 11:12:55 am] | Image name: 7303admin
[12-07-15 11:12:55 am] * Starting Sync Actions
[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/7303admin /images/7303admin; exit' -u fog,OMITTED 10.65.2.20
[12-07-15 11:12:55 am] * Started sync for Image 7303admin
[12-07-15 11:12:55 am] * Found Image to transfer to 2 node(s)
[12-07-15 11:12:55 am] | Image name: 8808admin
[12-07-15 11:12:55 am] * Starting Sync Actions
[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/8808admin /images/8808admin; exit' -u fog,OMITTED 10.65.2.20
[12-07-15 11:12:55 am] * Started sync for Image 8808admin
[12-07-15 11:12:55 am] * Found Image to transfer to 2 node(s)
[12-07-15 11:12:55 am] | Image name: 9020admin
[12-07-15 11:12:55 am] * Starting Sync Actions
[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/9020admin /images/9020admin; exit' -u fog,OMITTED 10.65.2.20
[12-07-15 11:12:55 am] * Started sync for Image 9020admin
[12-07-15 11:12:55 am] * Found Image to transfer to 2 node(s)
[12-07-15 11:12:55 am] | Image name: dell9020instuctional
[12-07-15 11:12:55 am] * Starting Sync Actions
[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/Dell9020BaseImageOct2015 /images/Dell9020BaseImageOct2015; exit' -u fog,OMITTED 10.65.2.20
[12-07-15 11:12:55 am] * Started sync for Image dell9020instuctional
[12-07-15 11:12:55 am] * Found Image to transfer to 2 node(s)
[12-07-15 11:12:55 am] | Image name: tecraa10s3501
[12-07-15 11:12:55 am] * Starting Sync Actions
[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; mirror -c -R --ignore-time -vvv --exclude 'dev/' --exclude 'ssl/' --exclude 'CA/' --delete-first /images/tecraa10s3501 /images/tecraa10s3501; exit' -u fog,OMITTED 10.65.2.20
[12-07-15 11:12:55 am] * Started sync for Image tecraa10s3501
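A log this long is easier to read as a summary of which image went to which directory on which node. The Python sketch below is only an illustration that pattern-matches the "Image name:" and "CMD:" entries in the format shown above; it is not part of FOG, and the function name and regular expression are my own assumptions. It works whether or not the entries sit on separate lines.

```python
import re

# One "Found Image to transfer" entry: the lowercase "Image name:" marker,
# then the mirror source/target paths and the host that follows "-u user,pass".
_ENTRY = re.compile(
    r"Image name:\s+(?P<image>\S+)"
    r".*?mirror\b.*?\s(?P<src>/\S+)\s+(?P<dst>/\S+);\s*exit'"
    r"\s+-u\s+\S+\s+(?P<host>\S+)",
    re.DOTALL,
)


def summarize_sync_commands(log_text: str) -> list:
    """Return (image, source path, target path, host) for each sync command."""
    return [
        (m["image"], m["src"], m["dst"], m["host"])
        for m in _ENTRY.finditer(log_text)
    ]


# Shortened entry in the same format as the log above.
sample = (
    "[12-07-15 11:12:55 am] | Image name: 7010admin "
    "[12-07-15 11:12:55 am] * Starting Sync Actions "
    "[12-07-15 11:12:55 am] | CMD: lftp -e 'set ftp:list-options -a; "
    "mirror -c -R --ignore-time -vvv --exclude 'dev/' --delete-first "
    "/images/dell7010admin /images/dell7010admin; exit' -u fog,OMITTED 10.65.2.20 "
    "[12-07-15 11:12:55 am] * Started sync for Image 7010admin"
)
summary = summarize_sync_commands(sample)
for image, src, dst, host in summary:
    print(f"{image}: {src} -> {host}:{dst}")
```

In the log above this makes two things easy to spot: every command targets 10.65.2.20, and some image names differ from their directory names (7010admin is stored under /images/dell7010admin).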
-
@Wayne-Workman Are the AC-Master and Annex nodes on the same Server? If they are, are they pointing at the same image location?
-
@Tom-Elliott No, they are two geographically separated nodes. They previously had the Location plugin set up for them, but when we started having issues with replication, I uninstalled the Location plugin just to eliminate variables.
Both are FULL server installations, but the Annex node has its /opt/fog/.fogsettings set to:
snmysqluser="fog"
snmysqlpass='OMITTED';
snmysqlhost="10.51.1.53";
Replication worked fine on our previous version - we updated to get some bug fixes and now we have this issue.
-
@Wayne-Workman And the MySQL user fog was actually set up for your database environment? Can you try using the fogstorage user as defined on the master node?
-
@Wayne-Workman Also, is it possible the Annex node already has the files in question?
-
@Tom-Elliott I set up the fog user manually a while back. I’ll switch it to the fogstorage user just so it’s more standard.
I should start with backstory… seems like I always bring it up later… anyways.
This morning, our image builder, who works in the same building as the AC-Master node, uploaded a new image. We had previously had issues with restoring the image in the building where the Annex node is. I wasn’t physically there to see any of it.
So, I found out she uploaded a new image. Via CLI on my building’s FOG server, I grabbed a copy of the image by mounting the remote /images directory to a temp directory using NFS, doing a recursive copy, and then unmounting. I then created an image definition matching the one in their FOG DB.
I was able to successfully restore the image to the right hardware model, an Optiplex 9020, with no problems.
The people at the Annex could not. I compared file sizes for the 9020admin image on both the Master and the non-master nodes. They were identical… which is strange but maybe that will help you figure out what’s going on…
I then walked them through manually (via CLI) just deleting the 9020admin directory on the Annex node and then manually copying it via NFS like I had done.
We were able to deploy the image from the Annex then - but the location plugin is uninstalled at that point so it might have been pulling from AC-Master… don’t know.
-
I just redid the Location plugin setup. This time I enabled the TFTP checkbox on both locations.
-
If the file sizes are the same, this would explain why they were in defunct status. The commands run but have no work to perform. I believe the defuncts you’re seeing are simply because of this. If you’re daring, you could delete one of the images from the Annex node and restart the replicator on the master. Then check your bandwidth and see if things are happening.
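Before deleting anything, it may be worth checking whether the two copies really match. As far as I understand lftp's mirror, with --ignore-time it decides what to transfer from file size, so a file that is the same size but has different content would be skipped. Below is a minimal Python sketch that compares two image directories by size and then by SHA-256. It assumes the node's copy is reachable as a local path (for example through an NFS mount like the one described earlier in the thread); the function names and the example paths are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large image files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_image_dirs(master_dir: str, node_dir: str) -> dict:
    """Classify every file under master_dir against its counterpart in node_dir."""
    master, node = Path(master_dir), Path(node_dir)
    report = {}
    for src in sorted(p for p in master.rglob("*") if p.is_file()):
        rel = src.relative_to(master)
        dst = node / rel
        if not dst.is_file():
            report[str(rel)] = "missing on node"
        elif src.stat().st_size != dst.stat().st_size:
            report[str(rel)] = "size differs"
        elif sha256_of(src) != sha256_of(dst):
            # The case a size-only comparison would never catch.
            report[str(rel)] = "same size, different content"
        else:
            report[str(rel)] = "identical"
    return report
```

Usage would look something like `compare_image_dirs("/images/9020admin", "/mnt/annex/9020admin")`, where the second path is a made-up mount point for the Annex node's /images. Any "same size, different content" entry would point to a copy that replication will not repair by itself.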