SVN 4972 to SVN 5046 server load
-
189 processes
lsof output.txt -
Updated to SVN 5020; server load is averaging 32 now.
-
What does top show you? That will tell you what is consuming your clock cycles.
-
@george1421 Likely a ton of Apache stuff… it comes down to how the FOG hosts/FOG clients regularly interact with the web interface for instructions…
@Developers it might be worth exploring some sort of other way to provide data to the hosts.
-
```
top - 14:17:35 up 3:08, 1 user, load average: 33.54, 34.93, 33.54
Tasks: 465 total, 22 running, 443 sleeping, 0 stopped, 0 zombie
%Cpu(s): 35.5 us, 4.3 sy, 0.0 ni, 44.0 id, 15.7 wa, 0.0 hi, 0.6 si, 0.0 st
KiB Mem: 24678932 total, 9619308 used, 15059624 free, 122692 buffers
KiB Swap: 4844540 total, 0 used, 4844540 free. 7271580 cached Mem

  PID USER      PR  NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
 1398 mysql     20   0 8730436 336472  7328 S 31.2  1.4 115:48.88 mysqld
14010 www-data  20   0  333584  23144  7052 R 26.2  0.1   0:09.48 apache2
13967 www-data  20   0  327440  16604  5688 S 22.5  0.1   0:11.39 apache2
14066 www-data  20   0  327216  17696  7016 S 16.2  0.1   0:11.28 apache2
13942 www-data  20   0  327184  16116  5456 S 15.9  0.1   0:12.02 apache2
12817 www-data  20   0  333356  22988  7260 R 14.9  0.1   1:13.09 apache2
14045 www-data  20   0  327180  16100  5444 S 14.9  0.1   0:07.29 apache2
13317 www-data  20   0  333616  21536  5724 S 14.6  0.1   0:40.15 apache2
13991 www-data  20   0  333328  21088  5424 R 14.6  0.1   0:08.65 apache2
12793 www-data  20   0  333356  21516  5772 S 14.3  0.1   1:19.05 apache2
13996 www-data  20   0  333324  21148  5384 S 14.3  0.1   0:08.57 apache2
12825 www-data  20   0  333392  22816  7076 R 13.9  0.1   1:24.51 apache2
14211 www-data  20   0  328404  16856  5280 S 13.9  0.1   0:00.42 apache2
13997 www-data  20   0  327180  17644  6984 S 10.9  0.1   0:18.68 apache2
14128 www-data  20   0  327184  16040  5384 S 10.6  0.1   0:04.77 apache2
14171 www-data  20   0  327180  16040  5384 S 10.3  0.1   0:03.64 apache2
14054 www-data  20   0  327216  16132  5444 S  9.9  0.1   0:08.68 apache2
```
-
The top output shows a lot of Apache processes, and they are all doing something, since MySQL also appears to be running pretty hard. So your system IS doing something.
What is going on at the time this snapshot was run? Were you deploying? Do you have a large campus where you have a lot of devices talking back to the fog server?
/var/log/httpd/access_log and /var/log/httpd/error_log (or /var/log/apache2/access.log and /var/log/apache2/error.log on Debian/Ubuntu) might give insight into what is hitting your Apache server.
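One way to see what is hitting Apache is to tally the access log by client and by requested path. The following is only a rough sketch: it assumes the log uses the common/combined log format, and the sample lines and paths in it are made up for illustration, not taken from this server.

```python
import re
from collections import Counter

# Matches the start of a common/combined log format line:
# client, identity, user, [timestamp], "METHOD url ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')

def summarize_access_log(lines, top_n=5):
    """Return the busiest clients and the most requested paths."""
    clients, paths = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        client, _method, url = match.groups()
        clients[client] += 1
        paths[url.split("?", 1)[0]] += 1  # drop the query string
    return clients.most_common(top_n), paths.most_common(top_n)

# Hypothetical sample data; on a real server you would pass the open
# log file instead, e.g. open("/var/log/apache2/access.log").
sample = [
    '10.0.0.5 - - [01/Jan/2016:14:17:35 -0600] "GET /example/checkin.php?id=1 HTTP/1.1" 200 12',
    '10.0.0.5 - - [01/Jan/2016:14:17:36 -0600] "GET /example/checkin.php?id=1 HTTP/1.1" 200 12',
    '10.0.0.6 - - [01/Jan/2016:14:17:36 -0600] "GET /example/checkin.php?id=2 HTTP/1.1" 200 12',
    '10.0.0.7 - - [01/Jan/2016:14:17:37 -0600] "GET /example/index.php HTTP/1.1" 200 512',
    'this line is not a log entry',
]

top_clients, top_paths = summarize_access_log(sample)
print(top_clients)
print(top_paths)
```

If a single path dominates the path tally while the client tally is spread over many addresses, that points at routine host check-ins rather than one misbehaving machine.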
-
Can you please try reupdating? I removed some destruct code a while back, and from what you describe it sounds like the links are completing but not being released. Maybe with it back all will be better?
-
I have one of the largest deployments running SVN, but like I said, this was fine under a previous SVN. One of the earlier changes was to remove some SQL validation checks that were previously used to clean incomplete records from the DB; it seems InnoDB didn’t like the function calls. I wonder if these have been re-added since?
-
@Joseph-Hales I have not added back any invalid entry checks.
-
@Joseph-Hales Thank you so much for the lsof output text file. I feel like this pretty much proves that it’s the clients causing this massive storm of Apache child processes and probably the MySQL load as well. This is why Tom and I are a bit helpless debugging this issue, as we don’t have such a big test environment. I just grepped through the output and found roughly 120 established connections to distinct hosts/clients, and another 50 or so in FIN_WAIT state.
I think Tom is pretty convinced that it is something he’s done in the code. Maybe it is, but I still wonder why clients keep connections open and don’t even close them properly. Maybe this was introduced with the new client? Sorry if I am heading the wrong way here. It’s just me guessing…
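The grep-style count described above can be reproduced with a short script. This is a sketch under assumptions: it expects lines in the usual `lsof -i` TCP layout ending in `local->remote:port (STATE)`, and the sample lines and host names are invented, not copied from the attached file.

```python
import re
from collections import defaultdict

# Matches the tail of an `lsof -i` TCP line, for example:
#   TCP fogserver:http->host-a:49152 (ESTABLISHED)
CONN = re.compile(r'TCP \S+->(\S+):\S+ \((\w+)\)\s*$')

def remote_hosts_by_state(lines):
    """Count distinct remote hosts per TCP state."""
    hosts = defaultdict(set)
    for line in lines:
        match = CONN.search(line)
        if match:
            remote, state = match.groups()
            hosts[state].add(remote)
    return {state: len(remotes) for state, remotes in hosts.items()}

# Hypothetical sample lines for illustration only.
sample = [
    "apache2 14010 www-data 12u IPv4 12345 0t0 TCP fogserver:http->host-a:49152 (ESTABLISHED)",
    "apache2 14010 www-data 13u IPv4 12346 0t0 TCP fogserver:http->host-a:49153 (ESTABLISHED)",
    "apache2 13967 www-data 12u IPv4 12347 0t0 TCP fogserver:http->host-b:50000 (ESTABLISHED)",
    "apache2 13942 www-data 12u IPv4 12348 0t0 TCP fogserver:http->host-c:50001 (FIN_WAIT2)",
    "apache2 13942 www-data  4u IPv6 12349 0t0 TCP *:http (LISTEN)",
]

print(remote_hosts_by_state(sample))
```

Listening sockets have no `->` part, so they are ignored; two connections from the same host count once.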
-
I am only running the legacy client on my hosts.
-
After updating SVN to 5054 and restarting the server completely, it still seems high to me.
```
System Overview
Username        jhales
Web Server      10.200.10.150
TFTP Server     10.200.10.150
Load Average    29.00, 31.61, 20.87
System Uptime   13 min, 0 users
```
-
FYI I can load the pages but they are extremely slow and sometimes fail to load at all.
-
@Joseph-Hales said:
I am only running the legacy client on my hosts.
Thanks for clarifying!! So this idea was really a dead end. That’s good news as I think we can rule out the possibility of it being an issue introduced by the client. Must be the PHP code then I reckon.
-
How about making the check-in interval even longer?
I mean, there is no reason for 10,000 hosts in the middle of the school year to be checking for tasks every 2 minutes… that’s kind of absurd.
Why not set it to 2 hours?
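To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes check-ins are spread evenly over the interval and that each check-in is a single request, which is a simplification; the function name is just for illustration.

```python
def checkins_per_second(hosts, interval_seconds):
    """Average request rate if every host checks in once per interval."""
    return hosts / interval_seconds

every_2_minutes = checkins_per_second(10_000, 2 * 60)
every_2_hours = checkins_per_second(10_000, 2 * 60 * 60)

print(f"2-minute interval: {every_2_minutes:.1f} requests/s")  # 83.3
print(f"2-hour interval:   {every_2_hours:.2f} requests/s")    # 1.39
```

Under those assumptions, going from 2 minutes to 2 hours cuts the average request rate by a factor of 60.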
-
Do they still check in on boot? The reason I ask is that I would like a newly imaged machine to pull snapins as soon as possible.
-
SVN 5058 fixes the main issue or at least makes it manageable.
Username jhales Web Server 10.200.10.150 TFTP Server 10.200.10.150 Load Average 6.79, 13.52, 13.55 System Uptime 1:28, 1 user
-
@Joseph-Hales Not sure…
but is it worth 10,000 hosts hitting one server every few minutes just so the occasional imaged computer gets snapins fast? To me it’s not worth it.
-
We image 20 to 50 PCs a day on average.
-
@Joseph-Hales said:
We image 20 to 50 PCs a day on average.
Well that’s a lot…
You would benefit a lot from having multiple “Full installations” of FOG, configuring them all to point to the main server for MySQL (for one master DB), and then just setting up each one as a storage node. Then you could slowly migrate all the clients to point to their local server. As long as you have a copy of your SSL key (from the main server) on all the servers, you should be good to go.
Dispersing this massive load is ultimately going to be the best way to solve the performance issues.