Posts made by LPetelik

LPetelik

@george1421 We just ran the engine update. Thanks for those links. We’ll watch it for a bit to see how it looks. The GUI is a bit slow at the moment, hoping that clears up.

On a side note, while working on this issue, we ran the installer (./installfog.sh), and during the install it picked a new STORAGENODE MYSQLPASS password that had a closing bracket in it. That broke the nodes’ connections to the master, even with the password updated in all the spots it should be. Once we created a new SQL password (on a whim to test, without special characters), updated it in .fogsettings, and reran the installer, the nodes started talking to the master again.
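For anyone hitting the same bracket problem, here is a minimal sketch of generating a replacement password that contains only letters and digits. The function names and the default length are made up for illustration; nothing here is part of FOG itself.

```python
import secrets
import string

# Letters and digits only, so no brackets, quotes, or other characters
# that a shell-style settings file might misread.
SAFE_ALPHABET = string.ascii_letters + string.digits


def make_safe_password(length: int = 24) -> str:
    """Return a random password drawn only from SAFE_ALPHABET."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))


def has_special_characters(password: str) -> bool:
    """Return True if the password contains anything outside SAFE_ALPHABET."""
    return any(ch not in SAFE_ALPHABET for ch in password)


if __name__ == "__main__":
    print(make_safe_password())
```

The second helper can be used to check an existing password before pasting it into .fogsettings.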

I think we are in much better shape. We need to fix our node password issue and start imaging again to know for sure.

LPetelik

@george1421

Here is the top screenshot. It does look like a lot of processes are running and using quite a bit of CPU.

LPetelik

@george1421
Sorry for the late reply.

1. I will need to check on this.
2. 1.5.9, current stable, on all nodes and the master.
3. The master is on regular hardware, and so is one node. The rest of the nodes are on VMs. 24 nodes in total.
4. A bit over 4000.
5. No “too many connections” errors. I am seeing a memory error and will try to address that today.
6. A mix of CentOS and Rocky. We are trying to move all of the nodes to Rocky.

LPetelik

@george1421

Checking on the other answers, might be Monday before I can reply.
2. 1.5.9
3. Master on regular hardware, nodes on VMs. 24 nodes.
4. I will get a more exact answer. It should be near 2500-3000.

I should also mention that the GUI dropping just started about a week ago.

LPetelik

During work hours, our FOG GUI is crashing about every 30 minutes. We use it daily with a pretty heavy client load and maybe 3 to 6 machines imaging at a time. Imaging still seems to work when this happens. Stopping and starting MySQL (“service mysql stop”, then “service mysql start”) seems to fix it temporarily. The server is a Lenovo S30, so old but a workhorse. The master is running CentOS with kernel 3.10.0-1160.76.1.el7. The nodes run in VMs on desktops at each site, at roughly 20 sites.

What we have tried/checked:
We have updated the main server and all the nodes.
We found something online about updating the kernels. We did that and it seemed better yesterday.
(This item has been ongoing for a while: a nuisance that we keep hoping to work on as we have time, but it may be related.) I also suspect we have some data corruption, possibly a duplicate MAC address. I can’t list all the hosts on the Hosts page; we can only search. I was able to export from the Configurations page, but I can’t export hosts from the Hosts page. The CSV is hard to look through because of the sheer volume of clients.
We checked memory usage, it seems good.
We have run the database clean-up commands. They come back pretty clean.
We have checked passwords on the server and nodes.
We have looked through logs and checked on any errors that seem like they could be the cause.
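Since the exported CSV is too large to scan by eye, the duplicate MAC suspicion could be checked with a short script. This is only a sketch: the column names `name` and `mac` are assumptions and would need to match whatever headers the actual export uses.

```python
import csv
import io
from collections import defaultdict

HEX_DIGITS = "0123456789abcdef"


def normalize_mac(mac: str) -> str:
    """Lowercase the MAC and drop separators so 00:11:.. and 00-11-.. compare equal."""
    return "".join(ch for ch in mac.lower() if ch in HEX_DIGITS)


def find_duplicate_macs(csv_file, name_col="name", mac_col="mac"):
    """Return {normalized_mac: [host names]} for every MAC listed more than once.

    name_col and mac_col are hypothetical header names, not the real export format.
    """
    seen = defaultdict(list)
    for row in csv.DictReader(csv_file):
        mac = normalize_mac(row.get(mac_col) or "")
        if mac:
            seen[mac].append(row.get(name_col, ""))
    return {mac: names for mac, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    # Made-up sample rows, only to show the call.
    sample = io.StringIO(
        "name,mac\n"
        "lab-01,00:11:22:AA:BB:CC\n"
        "lab-02,00:11:22:aa:bb:cd\n"
        "lab-03,00-11-22-aa-bb-cc\n"
    )
    for mac, names in find_duplicate_macs(sample).items():
        print(mac, names)
```

For a real export, open the file with `open(path, newline="")` and pass the file object in place of the sample.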

Any suggestions or places we should start looking?

LPetelik

@tom-elliott When I updated the master to the working branch, the three nodes immediately reconnected. I then updated the nodes too. (Two nodes are offline, will get them updated when they get back online). I also realized that some of the memory in the server wasn’t showing up. I think that is a large part of the problem. Ordering some more now…

LPetelik

@wayne-workman Thanks. I changed it to 15 minutes. I’ll see how that does.

LPetelik

@tom-elliott Thanks!

LPetelik

@tom-elliott I’ll try the update and see if that helps Issue #2. I’ll report back.

I could easily see timing out being a lot of the issue. There are a lot of clients attached and a lot of nodes with several sites imaging. I’ll keep an eye on this and see if it settles down. Wayne mentioned that I needed to lengthen the check-in time for the clients too. Hopefully, that helps.

Thanks!

LPetelik

@tom-elliott Sorry, I didn’t mean to make it confusing. I wasn’t sure what might be tied together. I sort of wondered if I had an overall slowness issue and a password issue that I might not be noticing. If you would like me to repost with details removed, I will.

LPetelik

@wayne-workman Interestingly, Central is one of the servers telling me that it can’t reach the database. They were able to completely image the site, then the message appeared in the FOG Configuration window under Kernels. I made quite a few changes in that time, but not to Central. A snip of a few of the sites: [screenshot: Screen Shot 2018-07-18 at 1.30.59 PM.png]

LPetelik

[07-18-18 1:17:40 pm] * Found Image to transfer to 12 s
[07-18-18 1:17:40 pm] | File Name: dev/postinitscripts
[07-18-18 1:17:40 pm] | Airport server does not appear to be online.
[07-18-18 1:17:40 pm] | Bermuda server does not appear to be online.
[07-18-18 1:17:43 pm] | Files do not match on server: Central
[07-18-18 1:17:43 pm] | Deleting remote file: /images/dev/postinitscripts/fog.postinit
[07-18-18 1:17:43 pm] * Starting Sync Actions
[07-18-18 1:17:43 pm] | CMD:
lftp -e 'set xfer:log 1; set xfer:log-file "/opt/fog/log/fogreplicator…transfer.Central.log";set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; set net:limit-total-rate 0:12800000;set net:limit-rate 0:12800000; mirror -c --parallel=20 -R --ignore-time -vvv --exclude ".srvprivate" "/images/dev/postinitscripts" "/images/dev/postinitscripts"; exit' -u fog,[Protected] 10.23.1.11
[07-18-18 1:17:43 pm] * Started sync for Image dev/postinitscripts
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Robinwood server does not appear to be online.
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Replication already running with PID: 29224
[07-18-18 1:17:43 pm] | Not syncing Image: 7040CafeJune29th2018
[07-18-18 1:17:43 pm] | This is not the primary group.
[07-18-18 1:17:44 pm] * Found Image to transfer to 2 s
[07-18-18 1:17:44 pm] | Image Name: main_instr2
[07-18-18 1:17:44 pm] | Replication already running with PID: 5038
[07-18-18 1:17:44 pm] * Attempting to perform Group -> Nodes image replication.
[07-18-18 1:17:56 pm] * Found Image to transfer to 12 s
[07-18-18 1:17:56 pm] | Image Name: 7040CafeJune29th2018
[07-18-18 1:17:56 pm] | Airport server does not appear to be online.
[07-18-18 1:17:56 pm] | Bermuda server does not appear to be online.
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Robinwood server does not appear to be online.
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:17:56 pm] | Replication already running with PID: 24061
[07-18-18 1:18:09 pm] * Found Image to transfer to 12 s
[07-18-18 1:18:09 pm] | Image Name: main_instr2
[07-18-18 1:18:09 pm] | Airport server does not appear to be online.
[07-18-18 1:18:09 pm] | Bermuda server does not appear to be online.
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Robinwood server does not appear to be online.
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
[07-18-18 1:18:09 pm] | Replication already running with PID: 5038
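A log like the one above is easier to read once the repeated lines are counted. The sketch below tallies which nodes were reported offline and which PIDs were reported as already running. It assumes only the line shapes visible in this excerpt; other message types are ignored, and the function name is made up.

```python
import re
from collections import Counter

# "[timestamp] | message" or "[timestamp] * message", as seen in the excerpt.
LINE_RE = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s+[*|]\s+(?P<msg>.*)$")
OFFLINE_RE = re.compile(r"^(?P<node>.+?) server does not appear to be online\.$")
RUNNING_RE = re.compile(r"^Replication already running with PID: (?P<pid>\d+)$")


def summarize(lines):
    """Return (offline_counts, running_pid_counts) for the given log lines."""
    offline, running = Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. the raw lftp command line
        msg = match.group("msg").strip()
        off = OFFLINE_RE.match(msg)
        if off:
            offline[off.group("node")] += 1
            continue
        run = RUNNING_RE.match(msg)
        if run:
            running[int(run.group("pid"))] += 1
    return offline, running


if __name__ == "__main__":
    sample = [
        "[07-18-18 1:17:40 pm] | Airport server does not appear to be online.",
        "[07-18-18 1:17:43 pm] | Replication already running with PID: 29224",
        "[07-18-18 1:17:43 pm] | Replication already running with PID: 29224",
    ]
    offline, running = summarize(sample)
    print("offline:", dict(offline))
    print("already running:", dict(running))
```

A PID that keeps appearing across many passes may point to a transfer that is stuck or simply very slow, which is worth checking on the master with `ps`.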

LPetelik

CentOS 7
Version 1.5.4 on masters and all nodes.
I’m moving from one fog system to another to get the system off of a VM. I have the new master set up, most things are running well. I have four issues that I’m aware of and I’ll concentrate on two similar ones in this thread and the others in separate threads. I mention them just in case they are tied together.

1. Two nodes show “A valid database connection could not be made” in the FOG Configuration area. One is a newly built node; the other had been fine for several days prior to today.
2. Replication: it’s checking the other nodes but not replicating. I’m manually moving images, running touch on both .mntcheck files, and fixing ownership and permissions. It seems that one group master is willing to replicate to another group master but not down to the nodes. I do have the masters in the group with the nodes that they won’t replicate to.

For the replication issue, things I’ve tried: I’ve scoured /opt/fog/.fogsettings to compare passwords. fog and fogsettings match on all of the nodes. fog matches on the master, fogsettings is blank on the master. They match in the web interface under FOG Settings and under the node settings for each node. I also did a password reset on fog in CentOS.
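Comparing settings files by eye across many nodes is error-prone, so a small script can do the comparison. This sketch assumes .fogsettings is a shell-style file of `key='value'` lines, and the key names in the example (`username`, `password`, `snmysqlpass`) are assumptions that should be checked against the real file. It reports only which keys differ, so no passwords are printed.

```python
import shlex


def parse_settings(text: str) -> dict:
    """Parse shell-style key='value' lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            parts = shlex.split(value)  # strips the surrounding quotes
        except ValueError:
            parts = [value]  # unbalanced quote: keep the raw text
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def differing_keys(a: dict, b: dict, keys) -> list:
    """Return the keys whose values differ between the two parsed files."""
    return [k for k in keys if a.get(k) != b.get(k)]


if __name__ == "__main__":
    # Made-up contents; the key names are assumptions, not the confirmed format.
    master = parse_settings("username='fog'\npassword='abc123'\nsnmysqlpass=''\n")
    node = parse_settings("username='fog'\npassword='abc123'\nsnmysqlpass='s3cret'\n")
    print(differing_keys(master, node, ["username", "password", "snmysqlpass"]))
```

To compare real machines, read each node’s file over ssh and pass the text to `parse_settings`.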

I’ve tftp’ed into the master from an ssh into the node. I can pass a txt file back and forth to /images.

I did see one message about a password on a node in the replication logs. I fixed it.

I see two sites with their own log files. It looks like it’s comparing files but finds nothing to send or it thinks a node is offline. I do have some servers going offline but it seems to happen even when they are all online. (Summer work in schools and the waxing of floors is playing havoc with this a bit).

I’ve also tried resetting the fog password for mysql with this command minus hashes:
#UPDATE users SET uPass = MD5('password') WHERE uName = 'fog';
#exit;

Any advice?

LPetelik

CentOS 7
Version 1.5.4 on masters and all nodes.
I’m moving from one fog system to another. I have the new master set up, most things are running well.

Hosts had green dots; after a restart, the green dots turned to exclamation marks. If I delete the FOG client and reinstall it, the dot is green again until the next restart. It seemed fine for several days, but it might just be that I didn’t restart during that time.

I found out this morning that I didn’t have a DNS record for the master. I do now. nslookup is still failing though.
I’ve run the SQL database maintenance steps in the wiki to try to fix this and the slowness.
I’ve tried clicking on the Encrypt data button under the hosts’ records.
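Because nslookup was still failing after the DNS record was added, it can help to test name resolution from the affected machine itself. The sketch below uses the operating system’s resolver, so it also reflects the hosts file and local caching, not just the DNS server. Hostnames are passed on the command line; none are assumed here.

```python
import socket
import sys


def resolve_host(hostname: str):
    """Return the IPv4 address the local resolver gives for hostname, or None."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


if __name__ == "__main__":
    # Usage: python check_dns.py <hostname> [<hostname> ...]
    for name in sys.argv[1:]:
        address = resolve_host(name)
        print(f"{name}: {address if address else 'did not resolve'}")
```

If the name resolves here but nslookup fails, the record may be served from a cache or hosts file rather than from DNS.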

LPetelik

@wayne-workman Thank you. So far we seem to be working through them but we aren’t quite back to normal yet.

LPetelik

@wayne-workman Thanks. We do have the nodes showing now; however, I think the info you gave will be helpful for the nodes pointed to the previous master that are still functional.

Our problem was in DHCP at the location where the master is. I was scouring the nodes and missed changing it at the location where the main server was grabbing its DHCP scope options.

LPetelik

@quazz

Thank you for the advice.

Yes, the node sites have DHCP on the Windows servers pointing to the FOG nodes.

I did see that document after I started the migration a slightly more complicated way. If I can’t make this work right, I’ll redo it all and go with this document.

LPetelik

I forgot to give version info.

The master is on CentOS 7, current build, with FOG 1.5.3 (I do see there is another update).
Nodes are on Windows Server 2012 R2 in Hyper-V.

LPetelik

Hi,

Backstory: We have a master set up on a VM whose web interface runs really slowly. I’ve tried to troubleshoot it for a large part of the year. We also had a problem with the nodes and the size of the VMs for them. The thought was that we’d build a new master on a dedicated machine with a different IP and move the nodes over during the summer. The new master is up. We’re starting to rebuild the nodes. It’s not working quite right.

In the web interface, I can see the nodes under Storage Management but not on the Dashboard or under FOG Configuration, Kernel versions. We are also unable to capture an image. The first client error was tftp related. I have that working now (mismatched tftp and fogstorage passwords, and permissions on /images). Now we’re getting the PXE-T01: File not found error as the client tries to boot. We handle our own DHCP and have it set up the same way as for the other master. I’ve spent a good part of a day looking over settings, some of that time with our networking person. I don’t think we need to set up dnsmasq since we can configure our DHCP options.
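One way to narrow down a PXE-T01 error is to ask the TFTP server for the boot file directly, taking DHCP out of the picture. The sketch below builds a read request as described in RFC 1350 and reports whether the server answers with data or with an error. The server address and file name are supplied on the command line; a name such as undionly.kpxe is only an example and should match whatever the DHCP boot file option is set to.

```python
import socket
import struct
import sys


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", 1) + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def describe_reply(packet: bytes) -> str:
    """Summarize the first packet the TFTP server sent back."""
    if len(packet) < 4:
        return "short packet"
    opcode, code = struct.unpack("!HH", packet[:4])
    if opcode == 3:  # DATA: the second field is the block number
        return f"DATA block {code}: file exists and is readable"
    if opcode == 5:  # ERROR: the second field is the error code
        message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return f"ERROR {code}: {message}"
    return f"unexpected opcode {opcode}"


def probe(server: str, filename: str, timeout: float = 3.0) -> str:
    """Send one read request to UDP port 69 and describe the first reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, 69))
        try:
            packet, _ = sock.recvfrom(1024)
        except socket.timeout:
            return "no reply (server down, wrong address, or firewall)"
    return describe_reply(packet)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python tftp_probe.py <server-ip> <boot-file-name>
    print(probe(sys.argv[1], sys.argv[2]))
```

The probe never acknowledges the data block, so the server will retry briefly and then give up; that is harmless for a one-off check. An "ERROR 1" reply for the configured boot file would suggest the file name or TFTP root is wrong, while a DATA reply would point back at the DHCP options.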

Any advice would be appreciated. Thanks in advance.