Another slow Deployment question

Arsenal101

Hey all,

We have been noticing some random problems with our fog server lately. It seems that at random (more often than not) the fog server will negotiate speeds of around 60MB per-minute while other computers (of the exact same hardware) will get speeds of 6GB per-minute. And to add to the equation, if we reboot the computer half way through it will sometimes catch a fast speed and image at 6Gig per-minute other times it will still download at 60MB per-minute.

Things I have tried/checked

Checked all switches from fog to the computer: from the 1gb card in the fog server to a 10gb copper port, over a 40gb trunk to 10gb fiber, to a 10gb switch, to the 1gb NIC in the end computer. Everything that I can check is running 1gb or 10gb full duplex.
Tried swapping kernels on the PXE menu.
Checked all switchports to the end PC for errors/dropped packets/etc…
Checked IO stats on the fog server to see if there were any HDD errors…

I am at a loss, I don’t understand why these would be going so slow. We have never in all years previous had fog run this slow.

I am on Fog 1.2.0 SVN 4301 (master) + FOG 0.32 (storage)
Image cap is 10 per server. Nothing crazy. nothing else fancy.

george1421

Wow that is very strange indeed. If I did the math correctly, 60MB/min translates to 8Mb/s (which is just below a 10Mb/s network link). I’m only looking for parallels here.
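The conversion can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes decimal units (1 MB = 8 Mb, 1 GB = 1000 MB).

```python
def mb_per_min_to_mbit_per_s(mb_per_min: float) -> float:
    """Convert megabytes per minute to megabits per second (decimal units)."""
    return mb_per_min * 8 / 60


slow = mb_per_min_to_mbit_per_s(60)        # the slow case: 60 MB/min
fast = mb_per_min_to_mbit_per_s(6 * 1000)  # the fast case: 6 GB/min

print(f"60 MB/min = {slow:.0f} Mb/s")
print(f"6 GB/min  = {fast:.0f} Mb/s")
```

The slow figure works out to 8 Mb/s and the fast one to 800 Mb/s, so the fast case is close to what a 1GbE link can carry while the slow case looks more like a 10Mb/s link.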

Transfer rates are determined by link speed, image compression ratio, and target system CPU performance (to decompress the image).

Is the master node and storage node located at the same site? Are they comparable machines (performance and link speed)?

Just for clarity, when we talk about updating kernels, that is for the FOS client. These kernels are kept in /var/www/html/fog/service/ipxe. The latest kernels (that I’m aware of) are 4.6.2.

For system demographics
Are you deploying to the same model computers and see this speed difference?

Are you deploying the same image to these same model computers seeing the speed differences?

For clarity on the svn/git number, what are the numbers on/near the fog cloud on the fog web management gui?

Arsenal101

Master and storage node are physically right next to each other. Same IP subnet and plugged into the same switch. The port on the switch side is actually a 10gb port but obviously negotiated at 1gb due to the NIC in the FOG server. The storage node is older. Master is an Intel MB with a Core 2 Duo processor and 4gb of RAM. Storage is an Intel board with a Celeron D processor and 2gb of RAM (don’t judge! they are recycled).

Where can I find newer kernel updates? The kernel update page only has up to 4.1.2 and the SVN I am on comes with 4.1.3. Also, where can I change the “default” kernel? All of our machines are pointed to 4.1.2… unless I specify the kernel at the machine level, I am not sure how to change the “default”.

I can have 2 computers right next to each other with identical hardware and one may get the top speed of around 6GB per min and the other would get the 60MB… It seems random, but it seems to negotiate 60MB per min more often than the 6GB…

The SVN/GIT I am on is 4301 (little numbers next to FOG in the cloud at the top of the page)

george1421

@Arsenal101

You can get the kernels from here
https://fogproject.org/kernels/bzImage
https://fogproject.org/kernels/bzImage32

It appears you are on an old version of the 1.2.0 trunk then which I think is near the beginning of where the trunk forked off of 1.2.0 stable. The current trunk release is somewhere north of 8100.

I’m curious how you have the storage node sitting right next to the FOG server. Are you using the location plugin to direct the clients to the storage node? If the master node and the slave node are in the same storage group they should have the same content because of replication.

Is it possible that when you have the slow speeds, one of the servers is consistently the one under-performing? Have you confirmed on the switch side and the Linux host side that both servers are indeed negotiating 1GbE link speeds?
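For the Linux host side, one way to check is to read what the kernel reports under /sys/class/net. The sketch below is only an illustration: the helper names are made up, and "eth0" is a placeholder for whatever the interface is actually called on each server.

```python
from pathlib import Path


def link_status(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Return the negotiated speed (Mb/s) and duplex the kernel reports for iface."""
    base = Path(sysfs_root) / iface
    speed = int((base / "speed").read_text().strip())
    duplex = (base / "duplex").read_text().strip()
    return {"iface": iface, "speed_mbps": speed, "duplex": duplex}


def is_gigabit_full_duplex(status: dict) -> bool:
    """True when the link came up at 1000 Mb/s or faster, full duplex."""
    return status["speed_mbps"] >= 1000 and status["duplex"] == "full"


IFACE = "eth0"  # placeholder: replace with the real interface name

try:
    status = link_status(IFACE)
except (OSError, ValueError) as exc:
    # Missing interface, link down, or a driver that does not report speed.
    print(f"could not read link status for {IFACE}: {exc}")
else:
    verdict = "OK" if is_gigabit_full_duplex(status) else "NOT 1GbE full duplex"
    print(status, verdict)
```

Running this on both the master and the storage node, ideally while a slow deployment is in progress, would show whether either one has dropped to 100 or 10 Mb/s.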

You might want to consider upgrading one or both of those to the latest trunk (I’d hold off a day or so, there seem to be a number of upgrade errors today).

Sebastian Roth

@Arsenal101 I am really wondering why you didn’t mention if this is multicast or unicast…

Quazz

Have you tried using different ethernet cables?

I don’t know much about networking, but perhaps something’s going wrong with the autonegotiation function?

Arsenal101

@george1421 Thanks, I don’t know why I didn’t see those pages before!

The master and the storage node are in the same storage group. They just happen to be two physical machines that are placed right next to each other. We did it more to spread the network overhead than to just up the client limit on the master server. I did verify that the master and the storage node both have the same images and the master is definitely replicating to the storage node.

I could try to unplug the storage server since that machine is the oldest and has the most likelihood of wonking things up.

I have verified that there is a 1gb negotiated connection all the way to the end PC, and it’s a good auto-negotiated connection. I didn’t have to tell the port to go to 1gb full duplex, it did it on its own.

I did think about upgrading to the latest SVN but I wanted to hold back as we are right in the middle of the Summer imaging projects which is about ~400ish machines and has to be done ASAP, so if I screwed it up we would be screwed for the summer…

@Quazz I would agree with you if we were having problems with multiple different pieces of hardware all at the same station and cable… but it’s sporadic. It’s super random. We could have one PC imaging at 60MB per min and then give it a reboot and it cranks at 6GB per min…

@Sebastian-Roth We are unicast. We haven’t done much with multicast. We find it’s just as easy to group PCs and deploy an image to the group or one by one. Are there any advantages to multicast over unicast?

Wayne Workman

@Arsenal101 I have some questions for you.

Are all imaging tasks experiencing the slowness? Or just a few?

I ask because if your maximum client limit is set really high, you’ll see a major slowdown on the hosts that start up last. This is due somewhat to network limits, but the more you have going, the less you actually get out of that 1Gbps link, due to HDD seek times on the servers. The more that are going at once, the more lag, because the server HDDs just can’t keep up with all that unicasting and seeking. Also, if a host starts out slow, it stays slow unless rebooted, as you described.

How fast does ONE client all by itself image?

Additionally, my questions and what I’ve stated are in-line with what you described:

if we reboot the computer half way through it will sometimes catch a fast speed and image at 6Gig per-minute other times it will still download at 60MB per-minute.

If you think this is your case, turn down the maximum connections, way down (you want a solution, right?).
At work, I have built a 9-server strong FOG system. Each node has its maximum connections limited to 3, but honestly it’d probably perform even better if I set them to 2.

I’d recommend you set both nodes maximum connections to 2 and give it a whirl. I think you’ll be very surprised at the increase in computers imaged given an allotted time frame.
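As a rough back-of-the-envelope sketch of why fewer, faster clients can beat many slow ones, the arithmetic below uses only the two rates reported in this thread. The 20 GB image size is a made-up figure for illustration, and nothing here shows that a lower limit will actually produce the fast rate; that part is still the hypothesis to test.

```python
SLOW_MB_PER_MIN = 60        # slow rate reported in this thread
FAST_MB_PER_MIN = 6 * 1000  # fast rate reported in this thread (6 GB/min, decimal)
IMAGE_GB = 20               # hypothetical image size, not from the thread


def aggregate_mb_per_min(clients: int, per_client_mb_per_min: float) -> float:
    """Total data moved per minute when every client runs at the same rate."""
    return clients * per_client_mb_per_min


def minutes_per_image(image_gb: float, mb_per_min: float) -> float:
    """Time for one client to pull one image at a given rate."""
    return image_gb * 1000 / mb_per_min


ten_slow = aggregate_mb_per_min(10, SLOW_MB_PER_MIN)
one_fast = aggregate_mb_per_min(1, FAST_MB_PER_MIN)

print(f"10 clients at the slow rate: {ten_slow:.0f} MB/min in total")
print(f"1 client at the fast rate:   {one_fast:.0f} MB/min in total")
print(f"one image, slow: {minutes_per_image(IMAGE_GB, SLOW_MB_PER_MIN):.0f} min")
print(f"one image, fast: {minutes_per_image(IMAGE_GB, FAST_MB_PER_MIN):.1f} min")
```

Even a single client at the fast rate moves ten times as much data per minute as ten clients stuck at the slow rate, which is the reasoning behind trying a much lower limit.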

Arsenal101

@Wayne-Workman It’s not all that are experiencing the slowness, but it’s most. We could have 18 computers running at one time and they all could be running at 60MB-PM, then we can fire up number 19 and have it crank at 6GB-PM! Random… if we restarted it somewhere in the middle it could get 60 or it could get 6 again… hit or miss…

FOG comes with the maximum client limit set to 10 so I figured that was a pretty good benchmark of where it should be. I could try turning it down. I don’t want to go as low as 3 or 2 though, otherwise it would be a ton of work to add to our summer plans.

Would it be better to multicast the image instead of unicast it?

We have set up one computer at a time and it will image at 60MB-PM… number two we set up could crank out at 6GB-PM, or it could get 60MB-PM… See what I mean? It almost seems like a negotiating problem and not an overhead problem.

george1421

@Arsenal101 How does the FOS engine (that runs on the target computer) know which storage node to connect to?

As for multicasting, there are pitfalls there too. If all of your clients are on the same subnet as your fog servers then it works pretty smoothly. If you have to cross VLANs/routers then you add complexity to your setup.

Tom Elliott

@george1421 Don’t forget about the weakest link in the case of multicast.

If one client is pulling the data in at 50MB/s due to a cabling issue, or is on a different speed of switch (think all systems on a gig network, but this one has a 10/100MB switch connected to it), every other client in the multicast session is held back to that speed.

Arsenal101

@george1421 I am not really sure. I thought all that was handled automatically and it just filled the first 10 slots on the master node and then started filling slots on the storage node.

We will probably continue to stay away from multicast then, we would have to route to get to the subnet the current devices are imaging on.

Should I set up the location plugin so that each location knows what server to pull from?

We have 5 locations: 4 schools and one SAU. All of the schools are connected with 10gb multimode fiber, which is then converted to copper to 1gb switches from there. We image in the computer labs/libraries that are hardwired to HP ProCurve 2910al switches. All 1gb. So I am confident that there is no 10 or 100mb switch in the way.

At our high school it is a possibility that it is cabling; it is old and I think Cat 3? It could be 5 though, I am not sure. I was ruling out cabling, though, because one station could be imaging at 60MB and as soon as you reboot it, it cranks at 6GB.

We can try it at our Elementary school which is wired with relatively new Cat 6 cabling. So that should weed any cabling issues out.

Arsenal101

On a 100% side note: if I don’t define a kernel on the “Hosts” page, the machine defaults to bzImage4.1.2. Is there anywhere I can change which bzImage it chooses if nothing is defined?

Arsenal101

@Arsenal101 Ignore me sorry!.. Nothing a simple search wouldn’t have solved…

Quazz

Do you have storage nodes set up at each location? If so, it is probably best to use the location plugin yes.

george1421

This is going to sound abstract.

You have 400 computers to image over the summer.

Why not take 2 machines that are the same model and of moderate performance and give them a “field promotion” to FOG server? (Hint: “field promotion” comes from the military, when an officer dies in battle and a private is put in charge in the field.) Set these two moderate performance systems up as a pair of FOG 1.2.0 trunk version servers. The trunk version of FOG/FOS will give you better performance than your current setup. It will also tell you if your slowdown is in FOG or somewhere on your network. At the end of your imaging task just reimage these two servers as desktops using your old fog server and build a plan for what to do next.

Arsenal101

@Quazz right now we have a master and a storage node at the same location. Same IP Subnet Same switch.

Wayne Workman

@Arsenal101 said in Another slow Deployment question:

FOG comes with the maximum client limit set to 10 so I figured that was a pretty good benchmark of where it should be.

It’s not, I want the default set to 3, actually.

Wayne Workman

@Arsenal101 Why can’t you set both nodes to 2 and try it? I am not understanding this. You’re not ruling out possibilities. This problem is so simple, and turning down the maximum connections will almost surely solve this.

Sebastian Roth

@Arsenal101 said:

Are there any advantages to multicast over unicast?

YES! If the network is set up properly you can send the image to two, ten or 50 machines without a major slowdown using multicast, because the data is sent only once over the network and all the clients “hear” it. Think of it like a telephone conference. If you want to tell the exact same thing to ten different people, you are better off getting together or holding a conference call, rather than calling them one by one and telling the same story over and over again…
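The difference can be put into a very idealized sketch. It ignores disk seek times, protocol overhead and decompression on the clients, and every number in it is hypothetical (a 20 GB image, a server that can push 6 GB/min). The multicast function also folds in the weakest-link caveat raised earlier in this thread: the session is paced by the slowest receiver.

```python
def unicast_minutes(image_mb: float, clients: int, server_mb_per_min: float) -> float:
    """Each client gets its own copy, so the server sends the image `clients` times."""
    return clients * image_mb / server_mb_per_min


def multicast_minutes(image_mb: float, client_rates_mb_per_min: list,
                      server_mb_per_min: float) -> float:
    """The image is sent once, paced by the slowest receiver (or the server)."""
    pace = min(server_mb_per_min, min(client_rates_mb_per_min))
    return image_mb / pace


IMAGE_MB = 20_000  # hypothetical 20 GB image
SERVER = 6_000     # hypothetical usable server rate in MB/min

ten_healthy = [6_000] * 10
one_bad = [6_000] * 9 + [60]  # one client stuck at 60 MB/min

print(f"unicast, 10 clients:            {unicast_minutes(IMAGE_MB, 10, SERVER):.1f} min")
print(f"multicast, 10 healthy clients:  {multicast_minutes(IMAGE_MB, ten_healthy, SERVER):.1f} min")
print(f"multicast, one slow client:     {multicast_minutes(IMAGE_MB, one_bad, SERVER):.1f} min")
```

Under these assumptions multicast finishes ten healthy clients in the time unicast needs for one, but a single slow receiver drags the whole session down to its rate.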
