FOG with Galera, snapin problems
-
Hey, @Neil messaged me recently about this. Our conversation is below. Thought I’d get this out in the open so more eyes can see it.
Hi Wayne, sorry for messaging you directly, but I was hoping to pick your brains. I work at a university in the UK where we’ve used FOG since its first public release. We use FOG to manage about 2000 computers, which is a third of the university (that’s complicated, don’t ask).
Since moving to the 1.x branch we’ve seen serious performance issues. We’ve done a lot of work to fix them but are now seeing other odd issues. We are now running 1 master node and 2 storage nodes, each connected to a 3-node Galera cluster. We are running nginx and php7.2-fpm on the master node; we haven’t gotten round to doing the same on the storage nodes. Everything is running smoothly from a performance point of view. Galera is being hammered: we are seeing roughly 700 to 1,400 queries a second from FOG.
The issue we are currently seeing is when you deploy snapins via groups. One of our IT suites has 71 machines in it. If we push something out to it, only a third of them get the snapin, and this is consistent no matter how many machines are in the group. An entry appears in the all task list for each, but only 1/3 appear in the active snapin task list. I thought this might be something to do with Galera, so I took one of the nodes out, only to find that the number of machines that get the snapin changes to 50%. Can’t be a coincidence? You seem to be the only person on the FOG forums who’s using Galera, so I wondered if you’ve seen anything like this?
Again sorry to bother you! Neil
Wayne Workman
Neil,
I’m sure you’ve read this on the forums already, but I suggest you significantly increase your FOG client check-in period, maybe to once per 5 minutes, maybe longer.
About the Snapin issues you are seeing with Galera, this is really interesting.
I think that because FOG is querying the database so frequently, changes may be in one of your Galera nodes but not replicated to the others yet.
If you increase the check-in interval, it’s possible the issue gets resolved.
Are you sure you have Galera replicating?
And I assume your Galera cluster is load balanced?
And is FOG set to use the load balancer?
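The 1/3 and 50% figures are what one would expect if each task row were visible on only one node while client reads were spread evenly across all of them. Here is a toy model of that hypothesis. It assumes round-robin reads and a write that never leaves the node it landed on; both are assumptions for illustration, not confirmed Galera or FOG behaviour, and all names are hypothetical.

```python
def clients_seeing_task(num_clients: int, num_nodes: int, write_node: int = 0) -> int:
    """Count clients whose read lands on the node that holds the task row.

    Assumptions (illustrative only): reads are distributed round-robin
    across the nodes, and the write is visible only on write_node.
    """
    return sum(
        1 for client in range(num_clients) if client % num_nodes == write_node
    )


# 71 machines is the group size mentioned in Neil's message.
for nodes in (3, 2, 1):
    seen = clients_seeing_task(71, nodes)
    print(f"{nodes} node(s): {seen} of 71 clients see the task ({seen / 71:.0%})")
```

With 71 clients this model gives 24 clients for three nodes and 36 for two, which is close to the one-third and one-half Neil reported. That would point at writes not reaching every node rather than at ordinary replication delay.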
-
+1 on the check-in time increase. My server struggled with the thousands of clients I have until I did this.
I am interested in this topic otherwise. I am weighing the pros and cons of an external/isolated DB to better handle the number of clients we have.
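To put the check-in advice in rough numbers: the average request rate scales with the number of clients divided by the check-in interval. Here is a back-of-the-envelope sketch. It assumes check-ins are evenly spread over time and uses a 60-second interval purely as an illustrative baseline, since the thread does not state the interval actually in use. The function name is hypothetical.

```python
def checkins_per_second(num_clients: int, interval_seconds: float) -> float:
    """Average client check-ins per second, assuming an even spread."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_clients / interval_seconds


# 2000 clients is the figure from Neil's message; the intervals are examples.
for interval in (60, 300, 600, 900):
    rate = checkins_per_second(2000, interval)
    print(f"interval {interval:>3} s: about {rate:.1f} check-ins per second")
```

Each check-in can trigger several database queries, so the query rate seen by the database will be some multiple of these figures. The multiplier is not something this sketch tries to estimate.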
-
I have an environment with 15500 clients on many different sites and I’m very interested in a discussion or topic like this.
-
Off the top of my head I can think of a few things (not in line with the OP’s question about Galera).
- Definitely increase the client check-in time to 600 or 900 seconds (a guess) if you have over 1000 target computers running the FOG client.
- Move the MySQL database (server) to a dedicated server that can be tuned and targeted for MySQL performance.
- Make sure you have sufficient RAM and vCPUs allocated to the FOG Master node and database server.
- Starting with FOG version 1.5.2, FOG uses php-fpm to process the PHP code instead of the built-in Apache PHP engine. This was done for a few reasons: a dedicated php-fpm engine processes PHP code faster than Apache’s PHP engine, and it frees up Apache to process HTTP requests faster instead of doing both tasks.
- You will probably want to tweak the php-fpm engine to allow more child PHP processes to run; the default is 35 in the FOG 1.5.x series.
- You will probably need one (or more) 10GbE network adapters for both the fog master node and database server. I know on a 1 GbE network we can saturate it with just 3 simultaneous unicast streams.
- If your FOG server is physical, then make sure your disk subsystem is either flash-based or running on a RAID 0 or RAID 10 array with many spindles.
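For the php-fpm point in the list above, a common rule of thumb (not taken from FOG documentation) is to size the child limit, which is the `pm.max_children` setting in the php-fpm pool configuration, from the memory you can spare for PHP divided by the typical memory of one worker. Here is a minimal sketch; the figures are made-up examples and the function name is hypothetical.

```python
def estimate_max_children(ram_for_php_mb: int, avg_worker_mb: int) -> int:
    """Rough upper bound for php-fpm's pm.max_children.

    ram_for_php_mb: memory you are willing to dedicate to PHP workers.
    avg_worker_mb:  measured average memory of one php-fpm worker.
    Both inputs are assumptions you need to measure on your own server.
    """
    if ram_for_php_mb <= 0 or avg_worker_mb <= 0:
        raise ValueError("both arguments must be positive")
    return max(1, ram_for_php_mb // avg_worker_mb)


# Example figures only: 4 GiB set aside for PHP, about 50 MB per worker.
print(estimate_max_children(ram_for_php_mb=4096, avg_worker_mb=50))
```

Measure the real per-worker memory under load before changing the setting. A limit that is too high can push the server into swap, which is worse than queuing requests.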
I have to say that FOG really hasn’t been performance tuned for such a large campus. I know there are some forum members that do have large campuses that are using fog for imaging.
-
Just wanted to add to George’s great list:
- Starting with FOG 1.5.6 I have worked on improving client checkin and storage node communication performance. This is still far from perfect but fixes a couple of really bad performance issues that we still had in 1.5.5.
@Neil said:
An entry appears in the all task list for each, but only 1/3 appear in the active snapin task list.
Can you please check the fog-client log on these machines when this happens? My gut says that it’s more likely the clients not properly finding the snapin task than a DB issue. But that’s just a guess.
-
@Wayne-Workman What are we going to do with this topic? Seems like @Neil doesn’t want to join the discussion. So what do we do?
-
Just close it. Maybe I’ll put together a Galera tutorial one of these days for FOG.