How increase the FOG server performance?

Fernando Gietz

Hi FOGers!

I need help tuning the settings of my FOG server to improve its performance.

Environment:

7000 hosts in the IT rooms
300 IT rooms
9TB of images (increasing)
60 technicians
1 FOG server and 1 storage node

Currently we use an old FOG version (0.30) and it works fine … very fine. But we need to migrate to the latest version.
To do this I installed two FOG servers with the 1.5 RC x version (dev and preproduction environments), but I have performance problems.

1) The web UI works fine until you send a multicast task or want to see the membership of a group [more info here]
2) I don't know if it is normal, but the mysqld process uses 1.3 GB of RAM

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2073 mysql     20   0 3770600 1,372g   3920 S   0,3 11,8   3448:19 mysqld

I use the mytop tool to watch the MySQL performance:

MySQL on localhost (5.5.56-MariaDB)     up 48+03:43:38 [17:36:15]
 Queries: 397.4M  qps:  100 Slow:   953.0         Se/In/Up/De(%):    87/01/01/00 
             qps now:   84 Slow qps: 0.0  Threads:    8 (   1/   0) 86/00/00/00 
 Key Efficiency: 100.0%  Bps in/out: 31.1k/109.1k   Now in/out: 16.5k/144.6k

84 queries per second, isn't that a lot?
3) FOGImageReplicator and FOGSnapinReplicator: if I have only one node, are these two daemons necessary?
4) Can I enable php-fpm to increase the performance [https://forums.fogproject.org/topic/10717/can-php-fpm-make-fog-web-gui-fast]?

george1421

@fernando-gietz Let's get a few more details here.

How many vCPUs does your FOG server have?
Do you use the fog client? If so what is your check in interval?
How many network adapters do you have in this fog server?
Is this fog server virtual or physical?
What kind of disk subsystem do you have? (raid, single disk, ssd,??)

Fernando Gietz

@george1421

How many vCPUs does your FOG server have?
6 vCPU and 12 GB RAM
Do you use the fog client? If so what is your check in interval?
Yes, but it is not installed on all of them. Currently the client is installed on 600 computers. CLIENT CHECKIN TIME = 90
How many network adapters do you have in this fog server?
Two adapters. One for clients and one for the storage.
Is this fog server virtual or physical?
It is virtual.
What kind of disk subsystem do you have? (raid, single disk, ssd,??)
I don't know, but it is not bad; we use the production environment of the university. I can run download tasks at 13 GB/min, so I suppose the disks are not the problem.

OS: RHEL 7 64 bits

george1421

@fernando-gietz It would be interesting to see what top has to say. With 6 vCPUs, it would also be interesting to know how many physical cores your VM host server has. If it has way more than 6, then 6 vCPUs is OK. Otherwise, assigning more vCPUs than necessary will slow down your VM.

My initial reaction is to take your client check-in time to 15 minutes instead of 90 seconds. At 90 seconds you have 600 hosts hitting your FOG server at an average (linearized) rate of 6 to 7 hosts per second. We all know hosts check in at random, so you might have 15 check-ins in one second and 2 check-ins the next. So lengthen your check-in period to 10-15 minutes.
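
As a rough check on those numbers, the average check-in rate is simply the number of clients divided by the interval. A minimal sketch, assuming check-ins are spread evenly over the interval (the function name is only illustrative, it is not part of FOG):

```python
def checkins_per_second(clients: int, interval_seconds: float) -> float:
    """Average check-in rate if clients are spread evenly over the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return clients / interval_seconds


# 600 clients with the current 90 s interval versus a 15 minute interval.
current = checkins_per_second(600, 90)
proposed = checkins_per_second(600, 900)

print(f"90 s interval:  {current:.2f} check-ins/s")
print(f"900 s interval: {proposed:.2f} check-ins/s")
```

Going from 90 to 900 seconds cuts the average rate by a factor of ten. Real check-ins arrive in bursts, so the peaks will be higher than these averages.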

Second, I would definitely enable php-fpm and memcache to see how much it improves your performance. I have only done this on a small scale, but it really helped me with web server responsiveness.

Hopefully your VM host server uses more than one network interface to the building switches. For a university I might expect that they use 10 - 40GbE networking. Also look at what interface your VM is using to connect to your VM host server. If your hypervisor is ESX (vSphere), then ensure you are using the VMXNET3 network interface. That should give you 10G to your vSwitch.

Lastly, you may be at a scale (number of users) where you might consider moving the SQL server off the FOG server and running an independent server specifically configured to run MySQL.

I think I would do the first 2 in the list and check on the 3rd one. Leave moving the MySQL server off the FOG server until last.

Fernando Gietz

top command:

top - 18:41:55 up 48 days,  4:49,  2 users,  load average: 0,19, 0,23, 0,29
Tasks: 282 total,   1 running, 278 sleeping,   0 stopped,   3 zombie
%Cpu(s):  8,2 us,  2,2 sy,  0,0 ni, 89,6 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem : 12138956 total,   177100 free,  2809672 used,  9152184 buff/cache
KiB Swap:  1023996 total,   199544 free,   824452 used.  8521144 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                          
26061 apache    20   0  543340  45800   6768 S  11,3  0,4   6:29.34 httpd                                                            
13607 apache    20   0  700016  47256   8016 S   9,0  0,4  14:19.99 httpd                                                            
16160 apache    20   0  678892  27200   9160 S   7,3  0,2   1:32.28 httpd                                                            
 2073 mysql     20   0 3770600 1,372g   3920 S   6,0 11,8   3452:06 mysqld

atop command:

PRC | sys    0.13s  | user   0.20s  | #proc    285  | #trun	 3  | #tslpi   328  | #tslpu     0  | #zombie    3  | #exit      7  |
CPU | sys       3%  | user      4%  | irq	0%  | idle    593%  | wait	0%  | guest     0%  | curf 2.67GHz  | curscal   ?%  |
cpu | sys	1%  | user      0%  | irq	0%  | idle     99%  | cpu003 w  0%  | guest     0%  | curf 2.67GHz  | curscal   ?%  |
cpu | sys	1%  | user      2%  | irq	0%  | idle     98%  | cpu005 w  0%  | guest     0%  | curf 2.67GHz  | curscal   ?%  |
cpu | sys	1%  | user	1%  | irq	0%  | idle     99%  | cpu004 w  0%  | guest     0%  | curf 2.67GHz  | curscal   ?%  |
cpu | sys	1%  | user	0%  | irq	0%  | idle     99%  | cpu000 w  0%  | guest     0%  | curf 2.67GHz  | curscal   ?%  |
cpu | sys	0%  | user	1%  | irq	0%  | idle     99%  | cpu001 w  0%  | guest     0%  | curf 2.67GHz  | curscal   ?%  |
cpu | sys	0%  | user	0%  | irq	0%  | idle    100%  | cpu002 w  0%  | guest     0%  | curf 2.67GHz  | curscal   ?%  |
CPL | avg1    0.08  | avg5    0.19  | avg15   0.27  |               | csw     5925  | intr    5744  |               | numcpu     6  |
MEM | tot    11.6G  | free  147.2M  | cache   8.5G  | buff    0.1M  | slab  221.8M  | shmem 428.8M  | vmbal   0.0M  | hptot   0.0M  |
SWP | tot     1.0G  | free  194.9M  |               |               |               |               | vmcom   2.9G  | vmlim   6.8G  |
LVM |   Datos-root  | busy	1%  | read	 5  | write	 4  | KiB/w	 8  | MBr/s   0.19  | MBw/s   0.01  | avio 4.56 ms  |
LVM |    Datos-tmp  | busy	0%  | read	 0  | write	 1  | KiB/w	 4  | MBr/s   0.00  | MBw/s   0.00  | avio 1.00 ms  |
DSK |          sda  | busy      1%  | read       5  | write      5  | KiB/w      7  | MBr/s   0.19  | MBw/s   0.01  | avio 4.20 ms  |
NET | transport     | tcpi	10  | tcpo	12  | udpi    1924  | udpo    1920  | tcpao      2  | tcppo      2  | tcprs      3  |
NET | network       | ipi     2102  | ipo     2088  | ipfrw      0  | deliv   2102  |               | icmpi      0  | icmpo      0  |
NET | ens192  ----  | pcki    2108  | pcko    2088  | si  220 Kbps  | so 1754 Kbps  | erri       0  | erro       0  | drpo       0  |
NET | ens224  ----  | pcki       1  | pcko       1  | si    0 Kbps  | so    0 Kbps  | erri	 0  | erro	 0  | drpo	 0  |

The check-in time, what does it check? The computer state? 15 minutes is a lot for us. Note that if you send a multicast task, the computers will shut down at very different moments and some of them will be left out of the task (if you have a multicast timeout of 5 minutes).

george1421

@fernando-gietz I think maybe we are not talking about the same check in time.

Also your CPU usage doesn’t look bad (according to top).

Fernando Gietz

@george1421 We are talking about the same check-in time. What does this check-in time mean?

I am worried about the MySQL performance and its huge RAM usage, 1.3 GB.

 2073 mysql     20   0 3770600 1,372g   3920 S   6,0 11,8   3452:06 mysqld

And when I want to see the membership of a group, Apache uses 100% of a vCPU and it takes two minutes to display the member list.

Is the swap usage normal? About 800 MB of the 1 GB is in use.

george1421

@fernando-gietz said in How increase the FOG server performance?:

We are talking about the same check-in time. What does this check-in time mean?

It tells the client: “Check back with the server every XX seconds to see if there is something for you to do.” So the clients query the FOG server every XX seconds to see if there are snapins to deploy, system rename events, or whatever else you can schedule with the FOG server. My feeling is that the FOG server and MySQL are too busy servicing these client check-ins to do much of anything else. As I suggested, change the check-in time to 900 (15 min) and see if this resolves your problem, or at least makes it easier on the FOG server. If not, you can change it back.

Normally with that much RAM, swap is never used. 800MB does seem like a lot. 1.3GB of RAM for the mysql process does seem like a lot too. Again, lengthen your check-in time and wait 30 minutes to see if the resources free up on your FOG server.

Fernando Gietz

I have restarted the MySQL server and its memory usage has gone down:

8895 mysql     20   0 1300380  93492   9236 S   7,0  0,8   0:05.37 mysqld

I have set the check-in time to 900 seconds.

Fernando Gietz

The activity of the MySQL server is huge. I restarted the server, and after seven minutes:

MySQL on localhost (5.5.56-MariaDB)     up 0+00:07:00 [16:13:04]
 Queries: 38.1k  qps:   93 Slow:     0.0         Se/In/Up/De(%):    94/00/00/00 
             qps now:  102 Slow qps: 0.0  Threads:    5 (   1/   0) 85/01/00/00 
 Key Efficiency: 100.0%  Bps in/out: 13.5k/43.9k   Now in/out: 41.3k/190.2k

      Id      User         Host/IP         DB      Time    Cmd Query or State                                                       
       --      ----         -------         --      ----    --- ----------                                                           
      664      root       localhost       test         0  Query show full processlist                                                
      782      root       localhost        fog         4  Sleep                                                                      
      768      root       localhost        fog        10  Sleep                                                                      
      746      root       localhost        fog        19  Sleep                                                                      
       10      root       localhost        fog       414  Sleep

38k queries??
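
As a sanity check on that number, dividing the query counter by the uptime gives the average rate. A minimal sketch using the figures from the two mytop screens in this thread; the helper name is made up for illustration, and the counters shown by mytop are rounded, so the results are approximate:

```python
def average_qps(total_queries: float, days: int = 0, hours: int = 0,
                minutes: int = 0, seconds: int = 0) -> float:
    """Average queries per second over the given uptime."""
    uptime = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    if uptime <= 0:
        raise ValueError("uptime must be positive")
    return total_queries / uptime


# 38.1k queries after 7 minutes of uptime (second mytop screen).
after_restart = average_qps(38.1e3, minutes=7)
# 397.4M queries after 48 days 03:43:38 of uptime (first mytop screen).
long_run = average_qps(397.4e6, days=48, hours=3, minutes=43, seconds=38)

print(f"after restart: {after_restart:.1f} qps")
print(f"long run:      {long_run:.1f} qps")
```

Both averages come out in the low-to-mid 90s, close to the qps values mytop reports, so the 38k total is about what that steady rate adds up to in seven minutes.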

Fernando Gietz

I have configured MySQL to log the queries, and some of them seem pointless.

180228 16:38:32	  364 Connect	root@localhost as anonymous on fog
		  364 Query	USE `fog`
		  364 Query	SET SESSION sql_mode=''
		  365 Connect	root@localhost as anonymous on fog
		  365 Query	USE `fog`
		  364 Quit	
		  365 Query	SET SESSION sql_mode=''
		  366 Connect	root@localhost as anonymous on fog
		  366 Query	USE `fog`
		  365 Quit	
		  366 Query	SET SESSION sql_mode=''
		  366 Query	SELECT `vValue` FROM `fog`.`schemaVersion`
		  366 Query	SELECT `pName` FROM `plugins`   WHERE `plugins`.`pInstalled`='1' AND `plugins`.`pState`='1'   ORDER BY LOWER(`plugins`.`pName`) ASC
		  366 Query	SELECT `settingValue` FROM `globalSettings`   WHERE `globalSettings`.`settingKey` IN ('FOG_DEFAULT_LOCALE','FOG_HOST_LOOKUP','FOG_MEMORY_LIMIT','FOG_REAUTH_ON_DELETE','FOG_REAUTH_ON_EXPORT','FOG_TZ_INFO','FOG_VIEW_DEFAULT_SCREEN')   ORDER BY LOWER(`globalSettings`.`settingKey`) ASC
		  366 Query	SELECT COUNT(`hosts`.`hostID`) AS `total` FROM `hosts` WHERE `hostPending` = '1' LIMIT 1
		  366 Query	SELECT COUNT(`COLUMN_NAME`)AS`total`FROM`information_schema`.`COLUMNS`WHERE`TABLE_SCHEMA`='fog'AND`TABLE_NAME`='hostMAC'AND`COLUMN_NAME`='hmMAC'
		  366 Query	SELECT COUNT(`hostMAC`.`hmID`) AS `total` FROM `hostMAC` WHERE `hmPending` = '1' LIMIT 1
		  366 Query	SELECT `settingValue` FROM `globalSettings`   WHERE `globalSettings`.`settingKey` IN ('FOG_URL_AVAILABLE_TIMEOUT','FOG_URL_BASE_CONNECT_TIMEOUT','FOG_URL_BASE_TIMEOUT')   ORDER BY LOWER(`globalSettings`.`settingKey`) ASC
		  366 Query	SELECT `globalSettings`.* FROM `globalSettings`  WHERE `settingKey`='FOG_QUICKREG_PENDING_MAC_FILTER'
		  366 Query	SELECT COUNT(`hostMAC`.`hmID`) AS `total` FROM `hostMAC` WHERE `hmMAC` IN ('40:b0:34:39:57:ac') AND `hmPending` IN ('0','') LIMIT 1
		  366 Query	SELECT `hmMAC` FROM `hostMAC`   WHERE `hostMAC`.`hmMAC` IN ('40:b0:34:39:57:ac') AND `hostMAC`.`hmPending` IN ('0','')   ORDER BY `hostMAC`.`hmID` ASC
		  366 Query	SELECT `hmMAC` FROM `hostMAC`   WHERE `hostMAC`.`hmMAC` IN ('40:b0:34:39:57:ac') AND `hostMAC`.`hmIgnoreImaging`='1'   ORDER BY `hostMAC`.`hmID` ASC
		  366 Query	SELECT `hostMAC`.* FROM `hostMAC`  WHERE `hmMAC`='40:b0:34:39:57:ac'
		  366 Query	SELECT `hmHostID` FROM `hostMAC`   WHERE `hostMAC`.`hmPending` IN ('0','') AND `hostMAC`.`hmMAC` IN ('40:b0:34:39:57:ac')   ORDER BY `hostMAC`.`hmID` ASC
		  366 Query	SELECT `hosts`.*,`hostMAC`.*,`images`.*,`os`.*,`imagePartitionTypes`.*,`imageTypes`.*,`hostScreenSettings`.*,`hostAutoLogOut`.*,`inventory`.* FROM `hosts`  LEFT OUTER JOIN `hostMAC` ON `hostMAC`.`hmHostID`=`hosts`.`hostID`  LEFT OUTER JOIN `images` ON `images`.`imageID`=`hosts`.`hostImage`  LEFT OUTER JOIN `os` ON `os`.`osID`=`images`.`imageOSID`  LEFT OUTER JOIN `imagePartitionTypes` ON `imagePartitionTypes`.`imagePartitionTypeID`=`images`.`imagePartitionTypeID`  LEFT OUTER JOIN `imageTypes` ON `imageTypes`.`imageTypeID`=`images`.`imageTypeID`  LEFT OUTER JOIN `hostScreenSettings` ON `hostScreenSettings`.`hssHostID`=`hosts`.`hostID`  LEFT OUTER JOIN `hostAutoLogOut` ON `hostAutoLogOut`.`haloHostID`=`hosts`.`hostID`  LEFT OUTER JOIN `inventory` ON `inventory`.`iHostID`=`hosts`.`hostID`  WHERE `hostID`='7502'  AND `hostMAC`.`hmPrimary` = '1'
		  366 Query	SELECT COUNT(`hookEvents`.`heName`) AS `total` FROM `hookEvents` WHERE `hookEvents`.`heName`='QUEUED_STATES' AND `hookEvents`.`heName` <> '0'
		  366 Query	SELECT COUNT(`hookEvents`.`heName`) AS `total` FROM `hookEvents` WHERE `hookEvents`.`heName`='PROGRESS_STATE' AND `hookEvents`.`heName` <> '0'
		  366 Query	SELECT `taskID` FROM `tasks`  LEFT OUTER JOIN `images` ON `images`.`imageID`=`tasks`.`taskImageID`  LEFT OUTER JOIN `os` ON `os`.`osID`=`images`.`imageOSID`  LEFT OUTER JOIN `imagePartitionTypes` ON `imagePartitionTypes`.`imagePartitionTypeID`=`images`.`imagePartitionTypeID`  LEFT OUTER JOIN `imageTypes` ON `imageTypes`.`imageTypeID`=`images`.`imageTypeID`  LEFT OUTER JOIN `hosts` ON `hosts`.`hostID`=`tasks`.`taskHostID`  LEFT OUTER JOIN `hostMAC` ON `hostMAC`.`hmHostID`=`hosts`.`hostID`  LEFT OUTER JOIN `hostScreenSettings` ON `hostScreenSettings`.`hssHostID`=`hosts`.`hostID`  LEFT OUTER JOIN `hostAutoLogOut` ON `hostAutoLogOut`.`haloHostID`=`hosts`.`hostID`  LEFT OUTER JOIN `inventory` ON `inventory`.`iHostID`=`hosts`.`hostID`  LEFT OUTER JOIN `taskTypes` ON `taskTypes`.`ttID`=`tasks`.`taskTypeID`  LEFT OUTER JOIN `taskStates` ON `taskStates`.`tsID`=`tasks`.`taskStateID`  LEFT OUTER JOIN `nfsGroupMembers` ON `nfsGroupMembers`.`ngmID`=`tasks`.`taskNFSMemberID`  LEFT OUTER JOIN `nfsGroups` ON `nfsGroups`.`ngID`=`nfsGroupMembers`.`ngmGroupID`   WHERE `tasks`.`taskHostID`='7502' AND `tasks`.`taskStateID` IN ('0','1','2','3') AND `hostMAC`.`hmPrimary` = '1'  ORDER BY LOWER(`tasks`.`taskName`) ASC
		  366 Query	SELECT `hostMAC`.* FROM `hostMAC`  WHERE `hmMAC`='40:b0:34:39:57:ac'
		  366 Quit

All of these queries run within one second.
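
To see how many queries each connection issues, the general query log can be tallied by connection id. A minimal sketch, assuming the log layout shown above (an optional timestamp, then the connection id, then `Connect`, `Query` or `Quit`); the function name and the short sample are only for illustration:

```python
import re
from collections import Counter

# Optional "YYMMDD HH:MM:SS" timestamp, then connection id and command.
LINE_RE = re.compile(
    r"^(?:\d{6}\s+\d{1,2}:\d{2}:\d{2})?\s*(\d+)\s+(Connect|Query|Quit)\b"
)


def queries_per_connection(log_lines):
    """Count 'Query' entries per connection id in a general query log."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "Query":
            counts[int(match.group(1))] += 1
    return counts


# The first lines of the log excerpt above, used as sample input.
sample = [
    "180228 16:38:32\t  364 Connect\troot@localhost as anonymous on fog",
    "\t\t  364 Query\tUSE `fog`",
    "\t\t  364 Query\tSET SESSION sql_mode=''",
    "\t\t  365 Connect\troot@localhost as anonymous on fog",
    "\t\t  365 Query\tUSE `fog`",
    "\t\t  364 Quit\t",
    "\t\t  365 Query\tSET SESSION sql_mode=''",
]

counts = queries_per_connection(sample)
for conn_id, total in sorted(counts.items()):
    print(f"connection {conn_id}: {total} queries")
```

Run against the full log file (for example with `open(path)` as the input), this shows which connections account for most of the load.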
