Node and Replication Issues

LPetelik

CentOS 7
Version 1.5.4 on masters and all nodes.
I’m moving from one FOG system to another to get the system off of a VM. I have the new master set up, and most things are running well. I have four issues that I’m aware of. I’ll concentrate on two similar ones in this thread and post the others in separate threads. I mention them all in case they are tied together.

1. Two nodes show “A valid database connection could not be made” in the Fog Configuration area. One is a newly built node, the other has been fine for several days prior to today.
2. Replication: it’s checking the other nodes but not replicating. I’m manually moving images, running touch on both .mnt files and fixing ownership and permissions. It seems that one group master is willing to replicate to another group master but not down to the nodes. I do have the masters in the group with the nodes that they won’t replicate to.

For the replication issue: Things I’ve tried- I’ve scoured the /opt/fog/.fogsettings to compare passwords. fog and fogsettings match on all of the nodes. fog matches on the master, fogsettings is blank on the master. They match in the web interface under Fog Settings and under the node settings for each node. I also did a password reset on fog in CentOS.

I’ve tftp’ed into the master from an ssh into the node. I can pass a txt file back and forth to /images.

I did see one message about a password on a node in the replication logs. I fixed it.

I see two sites with their own log files. It looks like it’s comparing files but either finds nothing to send or thinks a node is offline. I do have some servers going offline, but it seems to happen even when they are all online. (Summer work in schools and the waxing of floors are playing havoc with this a bit.)

I’ve also tried resetting the fog password for mysql with this command minus hashes:
#UPDATE users SET uPass = MD5('password') WHERE uName = 'fog';
#exit;

Any advice?

Wayne Workman

@lpetelik Can you post replication log snippets from the masters?

LPetelik

[07-18-18 1:17:40 pm] * Found Image to transfer to 12 s
[07-18-18 1:17:40 pm] | File Name: dev/postinitscripts
[07-18-18 1:17:40 pm] | Airport server does not appear to be online.
[07-18-18 1:17:40 pm] | Bermuda server does not appear to be online.
[07-18-18 1:17:43 pm] | Files do not match on server: Central
[07-18-18 1:17:43 pm] | Deleting remote file: /images/dev/postinitscripts/fog.postinit
[07-18-18 1:17:43 pm] * Starting Sync Actions
[07-18-18 1:17:43 pm] | CMD:
lftp -e 'set xfer:log 1; set xfer:log-file "/opt/fog/log/fogreplicator…transfer.Central.log";set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; set net:limit-total-rate 0:12800000;set net:limit-rate 0:12800000; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/dev/postinitscripts" "/images/dev/postinitscripts"; exit' -u fog,[Protected] 10.23.1.11
[07-18-18 1:17:43 pm] * Started sync for Image dev/postinitscripts
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Robinwood server does not appear to be online.
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Not syncing Image: 7040CafeJune29th2018
[07-18-18 1:17:43 pm] | This is not the primary group.
[07-18-18 1:17:44 pm] * Found Image to transfer to 2 s
[07-18-18 1:17:44 pm] | Image Name: main_instr2
[07-18-18 1:17:44 pm] | Replication already running with PID: 5038
[07-18-18 1:17:44 pm] * Attempting to perform Group -> Nodes image replication.
[07-18-18 1:17:56 pm] * Found Image to transfer to 12 s
[07-18-18 1:17:56 pm] | Image Name: 7040CafeJune29th2018
[07-18-18 1:17:56 pm] | Airport server does not appear to be online.
[07-18-18 1:17:56 pm] | Bermuda server does not appear to be online.
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Robinwood server does not appear to be online.
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:18:09 pm] * Found Image to transfer to 12 s
[07-18-18 1:18:09 pm] | Image Name: main_instr2
[07-18-18 1:18:09 pm] | Airport server does not appear to be online.
[07-18-18 1:18:09 pm] | Bermuda server does not appear to be online.
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Robinwood server does not appear to be online.
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
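
The excerpt above repeats a handful of message types many times, which makes it hard to see at a glance which servers are reported offline and which PIDs keep showing up. A minimal sketch in Python for tallying them, assuming the log lines look exactly like the ones pasted above; the function name `summarize_replication_log` and the regular expressions are illustrative and not part of FOG:

```python
import re
from collections import Counter

# Patterns modeled on the pasted excerpt only; other FOG versions may word
# these messages differently (assumption).
OFFLINE_RE = re.compile(r"\|\s*(?P<server>.+?) server does not appear to be online\.")
RUNNING_RE = re.compile(r"\|\s*Replication already running with PID:\s*(?P<pid>\d+)")
MISMATCH_RE = re.compile(r"\|\s*Files do not match on server:\s*(?P<server>.+?)\s*$")


def summarize_replication_log(lines):
    """Count offline reports, 'already running' PIDs, and file mismatches."""
    offline = Counter()
    running = Counter()
    mismatched = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            offline[match.group("server")] += 1
            continue
        match = RUNNING_RE.search(line)
        if match:
            running[int(match.group("pid"))] += 1
            continue
        match = MISMATCH_RE.search(line)
        if match:
            mismatched[match.group("server")] += 1
    return {"offline": offline, "already_running": running, "mismatched": mismatched}
```

Passing an open log file (any iterable of lines) returns three counters. A server that shows up under `offline` in every cycle is probably really down, while one that appears only occasionally may just be busy.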

LPetelik

@wayne-workman Interestingly, Central is one of the servers telling me that it can’t reach the database. They were able to completely image the site, and then the message appeared in the Fog Configuration window under Kernels. I made quite a few changes in that time but not to Central. A snip of a few of the sites: [screenshot: Screen Shot 2018-07-18 at 1.30.59 PM.png]

Tom Elliott

Issue #1 may be occurring because the nodes are not accessible at the time the check is made. This is likely related to replication in that when a bunch of files are sent, the nodes will become taxed. Sometimes this “taxing” can make it difficult for other requests to be made appropriately.
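
One way to tell a briefly overloaded node from one that is really down is to probe it a few times with a short pause between attempts. A minimal sketch in Python, assuming a plain TCP connection test is a reasonable stand-in; this is not how FOG itself decides a node is online, and the function name `port_is_reachable` and its defaults are hypothetical:

```python
import socket
import time


def port_is_reachable(host, port, attempts=3, delay=2.0, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `attempts` tries."""
    for attempt in range(attempts):
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Pause before retrying, but not after the final attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

For example, `port_is_reachable("10.23.1.11", 21)` would probe the FTP port on the address shown in the replication log. Port 21 is an assumption based on the lftp command in that log.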

Tom Elliott

@lpetelik Issue #2 is already known and, for the most part, seems to behave much better in the working branch. Please install it and let me know if you’re still having issues with overall replication.

Of note: I’m told the larger files tend to not like being checked. The check fails partway through and reports that the large file needs to be deleted and re-replicated, and this can happen over and over again. I believe I know what the problem is (script execution timeout), though I’m not sure how to actually correct it. Even when I forced execution time to be unlimited, something else was breaking the connection (most likely Apache timing out as well).

Tom Elliott

The SQL statement you tried would only impact logging into the GUI. This is (normally) not the same as the ftp fog user/pass pair.

For the FTP-based username/password, each storage node has a “Management Username/Password” field in the Storage Nodes settings. One other place in the GUI also holds this information, but it is used only for kernel updates. (Typically the GUI you’re accessing is on the main node that machines boot to/from. Eventually I hope to make kernel updates more dynamic, but that will take a lot of thought and effort, and I’m more focused on GUI functionality at this point.)

The Kernel Update username/password is found in:

FOG Configuration Page->FOG Settings->TFTP Server->FOG_TFTP_FTP_USERNAME/FOG_TFTP_FTP_PASSWORD
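
Since the replication user logs in over FTP, a quick way to confirm that the Management Username/Password stored for a node actually works is to try an FTP login against that node directly. A minimal sketch using Python’s standard `ftplib`; the helper name `check_ftp_login` and the `ftp_factory` parameter (there only so the function can be exercised without a live server) are hypothetical:

```python
import ftplib


def check_ftp_login(host, user, password, timeout=10, ftp_factory=ftplib.FTP):
    """Return (ok, message) for one FTP login attempt against a storage node."""
    try:
        # ftplib.FTP connects immediately when a host is given.
        ftp = ftp_factory(host=host, timeout=timeout)
    except ftplib.all_errors as exc:
        return False, f"connect failed: {exc}"
    try:
        ftp.login(user=user, passwd=password)
    except ftplib.error_perm as exc:
        # Permanent (5xx) reply, e.g. wrong username or password.
        return False, f"login rejected: {exc}"
    except ftplib.all_errors as exc:
        return False, f"login failed: {exc}"
    else:
        return True, "login ok"
    finally:
        # Always release the connection, even if the session is already broken.
        try:
            ftp.quit()
        except ftplib.all_errors:
            ftp.close()
```

Running this from the master against each node, with the credentials shown in that node’s settings, separates “wrong password” from “node unreachable”, which the replication log alone does not make obvious.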

LPetelik

@tom-elliott I’ll try the update and see if that helps Issue #2. I’ll report back.

I could easily see timing out being a lot of the issue. There are a lot of clients attached and a lot of nodes with several sites imaging. I’ll keep an eye on this and see if it settles down. Wayne mentioned that I needed to lengthen the check-in time for the clients too. Hopefully, that helps.

Thanks!

LPetelik

@tom-elliott When I updated the master to the working branch, the three nodes immediately reconnected. I then updated the nodes too. (Two nodes are offline, will get them updated when they get back online). I also realized that some of the memory in the server wasn’t showing up. I think that is a large part of the problem. Ordering some more now…
