Not sure if it’s a bug or a feature, high FOG server CPU on dashboard

george1421

During some testing [GIT 5596] I had 3 different browser pages open to the FOG dashboard. CPU usage was about 95% on the FOG server, which was not actively doing anything. Looking at the running processes, there were ~15 httpd processes consuming about 9% CPU each (yeah, I know the numbers don’t add up). Moving the browsers to any page other than the dashboard dropped total CPU usage to normal levels. With just one browser sitting idle on the dashboard there were either 4 or 5 httpd processes running at 8% each. IMO this seems a little excessive for mostly static data. I think a 5 or 10 second update cycle would be sufficient for this page.

As I said above, I don’t know if this is expected behavior or there is something else going on.

Wayne Workman

I’ve pointed this out before in another thread.

I isolated it to the bandwidth monitor graph.

george1421

Is there anything we can do about it? As I said, dropping the refresh rate on this graph would be OK with me.

Tom Elliott

I don’t agree.

While bandwidth does constantly poll the nodes, it is by far not the problem with CPU.

What do you mean by “95%”? Is this what you’re seeing on top? Is top realtime updating? What’s the load average?

High cpu can be anything, but the load is what ultimately matters.

Wayne Workman

The load averages are conveniently located on the dashboard, on the left.

Wayne Workman

https://forums.fogproject.org/topic/6020/fog-svn-5020-and-above-cpu-hammered-thread/14?page=2

@Wayne-Workman said:

I’m fairly certain that the FOG Dashboard is causing the high CPU usage.

After moving the dashboard’s js file and refreshing the FOG Dashboard, I saw a 0.02% 1-minute CPU Load average while sitting on the FOG Dashboard page.

mv /var/www/html/fog/management/js/fog/fog.dashboard.js /var/www/html/fog/management/js/fog/fog.dashboard.js.moved

After putting it back and refreshing the FOG Dashboard, I saw a 0.24% 1-minute CPU Load average while sitting on the FOG Dashboard page.

mv /var/www/html/fog/management/js/fog/fog.dashboard.js.moved /var/www/html/fog/management/js/fog/fog.dashboard.js

Opening three tabs with the file in the correct place, the 1-minute CPU Load average was at 0.96%

Also, each FOG Dashboard page creates approximately 2 httpd processes. So with 3 pages open and sitting on the dashboard, I was seeing about 6 httpd instances in top. So, my hypothesis is that organizations with multiple FOG users under 1 main server probably have the FOG Dashboard open all the time, causing the load.

I’m not saying they should close the dashboard or not leave it open all day… it’s probably a recent code change somewhere on the dashboard causing this new issue…

SO… if you all could please test, and just temporarily move the file and then tell everybody to refresh their FOG tabs, just see what the CPU does and please report back.

george1421

@Tom-Elliott The results were initially seen by ESXi reporting high CPU usage on the VM. I connected to the VM, ran top and sorted by CPU usage. There were a number of running httpd processes. I noticed that I had 3 different browser windows sitting on the dashboard (shame on me). I closed all of the browser windows except one. All but 4 of the httpd processes went away. Changing the remaining browser to any other page made the httpd processes disappear from the top list. Going back to the dashboard, they returned to the top processes. Closing the browser caused the top processes to go away; relaunching the browser and logging into the dashboard brought them back. Adding additional browsers sitting on the dashboard brought back the overall CPU consumption.

This specific fog server is in the dev environment so there are no fog clients hitting the server. This fog server is a master replicator server with 2 storage nodes attached. This issue is not related to the replicator since all of the servers are currently in sync.

george1421

When I have a browser sitting on the dashboard page the following URL is called every second.

192.168.1.43 - - [02/Dec/2015:21:17:41 -0500] "POST /fog/management/index.php?node=home HTTP/1.1" 200 56 "http://192.168.1.88/fog/management/index.php" "Mozilla/5.0 (Windows NT 5.2; rv:42.0) Gecko/20100101 Firefox/42.0"

Based on Wayne’s information I hacked a bit on fog.dashboard.js

Through some trial and error and careful placement of some returns I found that if I comment out these calls cpu usage is normal (I understand by doing this I’m disabling major parts of the dashboard). I can either comment them out individually or together. If I enable either one there are 3 httpd processes launched with about 9% CPU consumed by each process, as reported by top.

UpdateClientCount();
UpdateBandwidth();

Again, I’m not saying this IS an issue; it just could be an issue if you have several techs sitting on the dashboard page. This can be managed by workflow.

Tom Elliott

Hopefully I have made strides to limit the load/cpu usage.

It’s relatively simple and probably not perfect but should be better than before.

It simply takes a timeout of 700 milliseconds.

george1421

There is no real difference. There are 3 threads running 8% CPU each. But now I see one thread spike to 11% every few seconds. Strange indeed.

Wayne Workman

So I’d ask if there is any way possible to send the bandwidth info to the client and then let the client draw/render the graph instead of the server. Or is my understanding not accurate with how it works presently?

Tom Elliott

@Wayne-Workman I don’t understand what you mean?

If you make the client do the rendering as you describe, you’re not testing the server, you’re testing the client’s bandwidth.

The way it works, the client does do all the rendering already, but it needs to get the data from the server.

So the first request just starts the read, but it knows nothing at that point. The second request (1 second later) is how it determines the bandwidth (and of course all the subsequent stuff).
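To illustrate why the first request alone cannot produce a number, here is a minimal sketch of deriving a rate from two successive counter readings. The function names and the `bytes` / `timeMs` fields are hypothetical and chosen for illustration; they are not taken from FOG's actual response format.

```javascript
// Hypothetical helper: compute a transfer rate from two byte-counter readings.
// `bytes` and `timeMs` are illustrative field names, not FOG's real payload.
function bytesPerSecond(prev, curr) {
  const elapsedSec = (curr.timeMs - prev.timeMs) / 1000;
  if (elapsedSec <= 0) return null; // no usable time span between readings
  return (curr.bytes - prev.bytes) / elapsedSec;
}

let lastSample = null;

// Feed each reading in as it arrives. The first call returns null because
// there is nothing to compare against yet; later calls return a rate.
function onSample(sample) {
  const rate = lastSample ? bytesPerSecond(lastSample, sample) : null;
  lastSample = sample;
  return rate;
}
```

With readings one second apart, the difference between the two counters is the bytes-per-second figure directly; with a longer gap the same difference has to be divided by the elapsed time.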

Wayne Workman

@Tom-Elliott So how does that process you just described use so much CPU ?

Tom Elliott

Every second there are two polling requests to the server.

With multiple pages (let’s just say 2 tabs for now) you have 2 more polling requests to the server.

So add 3 you are up to 6 polling requests, add 3 more, you’re at 12 polling requests.

The requests are processed by the server, and every second it has to handle that set of requests once for every tab that is open.

It’s just getting DDoS’d at that point. More tabs equals more polling each time.

There isn’t a simple way to handle it, though I could just make the bandwidth a selectable element.

Now mind you, I also do the same type of checks for the client count. So really it’s being hit with 4 polling requests per second per tab. This is why things can get lost in action.
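The arithmetic above can be sketched as a small function. The figures plugged in (2 pollers per tab, 2 requests per poll, a 1 second interval) come from this post; the function and parameter names are made up for illustration.

```javascript
// Rough request-rate estimate for a set of open dashboard tabs.
// All names are illustrative; only the figures come from the thread.
function requestsPerSecond(tabs, pollersPerTab, requestsPerPoll, intervalSec) {
  return (tabs * pollersPerTab * requestsPerPoll) / intervalSec;
}

// One tab: bandwidth + client count, 2 requests each, every second.
const oneTab = requestsPerSecond(1, 2, 2, 1);

// Three tabs at the same 1 second interval.
const threeTabs = requestsPerSecond(3, 2, 2, 1);

// The same three tabs if the interval were stretched to 10 seconds.
const threeTabsSlow = requestsPerSecond(3, 2, 2, 10);

console.log(oneTab, threeTabs, threeTabsSlow);
```

The load grows linearly with the number of tabs and shrinks linearly with the polling interval, which is why lengthening the interval helps regardless of how many techs leave the dashboard open.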

george1421

This is pure speculation, but looking at the HTTP access log, as I posted earlier, this page is being requested every second.

POST /fog/management/index.php?node=home

What would happen if it takes longer than 1 second to render the page, before the next POST request comes in? Would increasing this page refresh to 10 seconds have any negative impact? I tried looking for a meta refresh tag in the page, but it looks like the page is being refreshed with JavaScript. I got lost tracing the source of the page refresh. Tom, without digging too deep in your memory, do you know where this page refresh request is coming from so I can change the refresh to 10 seconds to see if that has any negative impact?

Tom Elliott

@george1421 The request is indeed being handled by jquery. It’s a literal timeout (setTimeout(<function>,1000)) in /var/www/fog/management/js/fog/fog.dashboard.js.

The functions that call the “refreshes” are (as you pointed out) UpdateClientCount and UpdateBandwidth. The $.ajax calls in those functions are the callers, and the timeouts are set in their complete callbacks.
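For anyone following along, the general pattern described here (re-arming a timer from the request's completion callback) looks roughly like the sketch below. This is not the code in fog.dashboard.js: `startPolling`, `request` and `schedule` are hypothetical names, and `request` stands in for a $.ajax call whose `complete` callback would trigger the next timeout.

```javascript
// Sketch of a self-rescheduling poller: the next poll is queued only after
// the previous request has completed, so slow responses never stack up.
// `request(done)` is a stand-in for an AJAX call; `schedule` defaults to
// setTimeout but can be swapped out (for example, in a test).
function startPolling(request, intervalMs, schedule = setTimeout) {
  let stopped = false;

  function tick() {
    if (stopped) return;
    request(function onComplete() {
      // Plays the role of the `complete` callback: re-arm the timer here.
      schedule(tick, intervalMs);
    });
  }

  tick(); // fire the first request immediately
  return function stop() {
    stopped = true;
  };
}
```

Changing the refresh rate in a setup like this is a one-argument change (`1000` to `10000`), but anything computed from the samples has to be adjusted to the new interval as well.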

Updating to 10 seconds will have a negative impact, not in the sense of polling, but the actual bandwidth determinations are calculated based on a 1 second interval. Also, the limiters (2 minutes, 10 minutes, 30 minutes, and 1 hour) are set based on 1 second intervals.

So to properly update to 10 second refresh rates (particularly for bandwidth), we would need to recalculate the timings. For example, at a 1 second interval the 2 minute limiter is 120 intervals; at 10 seconds, it’s 12. So not difficult, just a bit of coding changes.
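The limiter recalculation can be sketched as follows, assuming each limiter is simply a cap on how many samples the graph keeps. The window lengths are the ones listed above; `pointsForWindow` and `pushSample` are hypothetical helpers, not functions from the dashboard code.

```javascript
// Number of samples that fit in a time window at a given polling interval.
function pointsForWindow(windowSec, intervalSec) {
  return Math.floor(windowSec / intervalSec);
}

// Append a sample and drop the oldest ones beyond the cap.
function pushSample(series, value, maxPoints) {
  series.push(value);
  while (series.length > maxPoints) {
    series.shift();
  }
  return series;
}

// The four limiters mentioned in the thread, in seconds.
const windowsSec = [120, 600, 1800, 3600];

// Caps at a 1 second interval versus a 10 second interval.
const capsAt1s = windowsSec.map((w) => pointsForWindow(w, 1));
const capsAt10s = windowsSec.map((w) => pointsForWindow(w, 10));

console.log(capsAt1s, capsAt10s);
```

Deriving the caps from the interval, rather than hard-coding 120, means the interval can be changed again later without touching the limiter values.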

I’ll make those edits and hopefully things will be a bit better.

george1421

My intent wasn’t to have you change the official image. I was looking to test to see if it was a positive move or not. This would avoid mucking up something that is working.

But in the back of my head, even for dynamic bandwidth management a 10 second refresh is more than adequate. While it’s “nice to know” information, a 1 second update cycle is very fast. And what will the tech do with that information if bandwidth were to spike to 100MB/s for 3 seconds? Understand this is just my distorted view, but since this is only FYI info, let’s not tax the system too much where it could dedicate those CPU resources to actually pushing the image or managing the client.

Tom Elliott

And I’m done. I added the code to make it functional and accurate for a 10 second span.

george1421

@Tom-Elliott Well you either fixed the issue or really broke it.

Actually watching top I see the http process pop up in the list but never over 0.4% of the CPU. And it is only a single process that is showing up. So unless something else is impacted by this change, it looks like you nailed it.

george1421

I consider this issue resolved. Thank you!!
