Not sure if it's a bug or a feature, high FOG server CPU on dashboard
-
@Tom-Elliott The issue was first seen as ESXi reporting high CPU usage on the VM. I connected to the VM, ran top, and sorted by CPU usage. There were a number of running httpd processes. I noticed that I had 3 different browser windows sitting on the dashboard (shame on me). I closed all of the browser windows except one, and all but 4 of the httpd processes went away. Changing the remaining browser to any other page made the httpd processes disappear from the top list; going back to the dashboard brought them back to the top. Closing the browser made the top processes go away; relaunching the browser and logging into the dashboard brought them back. Adding additional browsers sitting on the dashboard brought back the overall CPU consumption.
This specific fog server is in the dev environment so there are no fog clients hitting the server. This fog server is a master replicator server with 2 storage nodes attached. This issue is not related to the replicator since all of the servers are currently in sync.
-
When I have a browser sitting on the dashboard page the following URL is called every second.
192.168.1.43 - - [02/Dec/2015:21:17:41 -0500] "POST /fog/management/index.php?node=home HTTP/1.1" 200 56 "http://192.168.1.88/fog/management/index.php" "Mozilla/5.0 (Windows NT 5.2; rv:42.0) Gecko/20100101 Firefox/42.0"
Based on Wayne’s information I hacked a bit on fog.dashboard.js
Through some trial and error and careful placement of some returns, I found that if I comment out these calls, CPU usage is normal (I understand that by doing this I’m disabling major parts of the dashboard). I can comment them out either individually or together. If I enable either one, 3 httpd processes are launched, each consuming about 9% CPU as reported by top.
UpdateClientCount();
UpdateBandwidth();
Again, I’m not saying this IS an issue; it just could be an issue if you have several techs sitting on the dashboard page. This can be managed by workflow.
-
Hopefully I have made strides to limit the load/cpu usage.
It’s relatively simple and probably not perfect but should be better than before.
It simply uses a timeout of 700 milliseconds.
-
There is no real difference. There are 3 threads running at 8% CPU each, but now I see one thread spike to 11% every few seconds. Strange indeed.
-
So I’d ask if there is any way possible to send the bandwidth info to the client and then let the client draw/render the graph instead of the server. Or is my understanding not accurate with how it works presently?
-
@Wayne-Workman I don’t understand what you mean?
If you make the client do the rendering as you describe, you’re not testing the server, you’re testing the client’s bandwidth.
The way it works, the client does do all the rendering already, but it needs to get the data from the server.
So the first request just starts the read, but it knows nothing at that point. The second request (1 second later) is how it determines the bandwidth (and of course all the subsequent stuff).
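A minimal sketch of that two-sample idea, assuming the server hands back a cumulative byte counter on each poll. The function name and the sample shape (`timeMs`, `bytes`) are hypothetical and not taken from FOG's code:

```javascript
// Hypothetical helper: derive a rate from two readings of a cumulative byte counter.
// The first poll only establishes a baseline; a rate exists from the second poll on.
function bandwidthBytesPerSec(prev, curr) {
  const elapsedSec = (curr.timeMs - prev.timeMs) / 1000;
  if (elapsedSec <= 0) {
    return null; // two distinct samples are needed before anything can be computed
  }
  return (curr.bytes - prev.bytes) / elapsedSec;
}

const firstPoll = { timeMs: 0, bytes: 1000000 };     // baseline, nothing known yet
const secondPoll = { timeMs: 1000, bytes: 1250000 }; // one second later
console.log(bandwidthBytesPerSec(firstPoll, secondPoll)); // 250000
```

Dividing by the measured elapsed time, rather than assuming exactly 1 second, would keep the figure correct if the polling interval were ever changed.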
-
@Tom-Elliott So how does the process you just described use so much CPU?
-
Every second there are two polling requests to the server.
With multiple pages open (let’s just say 2 tabs for now), the second tab adds 2 more polling requests to the server.
So with 3 tabs you are up to 6 polling requests; add 3 more tabs and you’re at 12 polling requests.
The requests are processed by the server, and every second that work is multiplied by the number of tabs open.
It’s just getting DDoS’d at that point. More tabs equals more polling each time.
There isn’t a simple way to handle it, though I could just make the bandwidth a selectable element.
Now mind you, I also do the same type of checks for the client count. So really it’s being hit with 4 polling requests per second per tab. This is why things can get lost in action.
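The arithmetic above can be sketched as a tiny function. This is only an illustration; the function and parameter names are hypothetical:

```javascript
// Hypothetical helper: total polling load on the server from open dashboard tabs.
function requestsPerSecond(openTabs, requestsPerTabPerPoll, pollIntervalSec) {
  return (openTabs * requestsPerTabPerPoll) / pollIntervalSec;
}

console.log(requestsPerSecond(3, 2, 1));  // 6   (3 tabs, 2 requests each, every second)
console.log(requestsPerSecond(6, 2, 1));  // 12  (3 more tabs)
console.log(requestsPerSecond(3, 4, 1));  // 12  (counting the client count checks too)
console.log(requestsPerSecond(3, 2, 10)); // 0.6 (same tabs, polled every 10 seconds)
```

The load grows linearly with the number of tabs and shrinks in proportion to the polling interval.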
-
This is only pure speculation here. But looking in the http access log, as I posted below this page is being requested every 1 second.
POST /fog/management/index.php?node=home
What would happen if it takes longer than 1 second to render the page, before the next POST request comes in? Would increasing this page refresh to 10 seconds have any negative impact? I tried looking for a meta refresh tag in the page, but it looks like the page is being refreshed with JavaScript. I got lost tracing the source of the page refresh. Tom, without digging too deep in your memory, do you know where this page refresh request is coming from, so I can change the refresh to 10 seconds to see if that has any negative impact?
-
@george1421 The request is indeed being handled by jQuery. It’s a literal timeout (setTimeout(<function>,1000)) in /var/www/fog/management/js/fog/fog.dashboard.js.
The functions that trigger the “refreshes” are (as you pointed out) UpdateClientCount and UpdateBandwidth. Their $.ajax calls make the requests, and the timeouts are set in their complete sub functions.
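A rough sketch of that pattern, where the next poll is armed only once the previous request has completed. This is not the code in fog.dashboard.js: `request` and `schedule` are hypothetical stand-ins for the $.ajax call and setTimeout, passed in so the sketch runs without jQuery:

```javascript
// Sketch only: a poller that re-arms itself from the completion callback.
// `request(onComplete)` must call onComplete when the request finishes (success or error).
function startPolling(request, intervalMs, schedule = setTimeout) {
  let stopped = false;

  function poll() {
    if (stopped) return;
    request(function onComplete() {
      // Scheduling here, rather than with a fixed-rate timer, means a slow
      // response delays the next poll instead of letting requests pile up.
      if (!stopped) schedule(poll, intervalMs);
    });
  }

  poll(); // the first request goes out immediately
  return function stop() {
    stopped = true;
  };
}
```

With this shape, a response that takes longer than the interval does not overlap with the next request from the same poller; each tab still runs its own pollers, though.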
Updating to 10 seconds will have a negative impact, not in the sense of polling, but because the actual bandwidth determinations are calculated based on a 1 second interval. Also, the limiters (2 minutes, 10 minutes, 30 minutes, and 1 hour) are set based on 1 second intervals.
So to properly update to 10 second refresh rates (particularly for bandwidth), we would need to recalculate the timings. For example, with 1 second polling the 2 minute limiter is 120 data points; at 10 seconds, it’s 12. So not difficult, just a few coding changes.
I’ll make those edits and hopefully things will be a bit better.
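The limiter recalculation described above amounts to dividing each window by the polling interval. A small sketch, with hypothetical names that are not taken from fog.dashboard.js:

```javascript
// The four limiter windows from the post, expressed in seconds.
const LIMITER_WINDOWS_SEC = {
  "2 minutes": 120,
  "10 minutes": 600,
  "30 minutes": 1800,
  "1 hour": 3600,
};

// Hypothetical helper: how many data points cover a window at a given polling interval.
function samplesPerWindow(windowSec, pollIntervalSec) {
  if (pollIntervalSec <= 0) {
    throw new RangeError("pollIntervalSec must be positive");
  }
  return Math.floor(windowSec / pollIntervalSec);
}

for (const [label, windowSec] of Object.entries(LIMITER_WINDOWS_SEC)) {
  console.log(label, samplesPerWindow(windowSec, 1), samplesPerWindow(windowSec, 10));
}
// 2 minutes 120 12
// 10 minutes 600 60
// 30 minutes 1800 180
// 1 hour 3600 360
```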
-
My intent wasn’t to have you change the official image. I was looking to test to see if it was a positive move or not. This would avoid mucking up something that is working.
But in the back of my head, even for dynamic bandwidth management a 10 second refresh is more than adequate. While it’s “nice to know” information, a 1 second update cycle is very fast. And what will the tech do with that information if bandwidth were to spike to 100MB/s for 3 seconds? Understand this is just my distorted view, but since this is only FYI info, let’s not tax the system too much when it could dedicate those CPU resources to actually pushing the image or managing the client.
-
And I’m done. I added the code to make it functional and accurate for a 10 second span.
-
@Tom-Elliott Well you either fixed the issue or really broke it.
Actually watching top I see the http process pop up in the list but never over 0.4% of the CPU. And it is only a single process that is showing up. So unless something else is impacted by this change, it looks like you nailed it.
-
I consider this issue resolved. Thank you!!
-
Cool. Thanks for the report and hopefully this helps others. Though I still don’t think 1 second was an issue, as most (I imagine) don’t just load the fog management page and leave it sitting on the dashboard.
-
The only reason I visit the home page is for two things - the storage pie chart and bandwidth. That’s just me though.
I’m glad to see this improvement made - it’ll make FOG work a lot better on older, lower-powered systems.