Very slow unicast deploy when imaging 2 or more machines

Haz

Server

FOG Version: 1.3.0-RC-25
OS: Ubuntu 16.04 LTS

We have recently upgraded our Fog server to the latest version 1.3.0 (previously running 0.32 on Ubuntu 12.04! Old I know, but it worked fine for what we needed) and I am experiencing extremely slow deploy speeds when I image more than 1 machine at a time (between 300MB/M and 800MB/M). I understand that if I add the hosts to a group and assign a Multicast task to the group, then the speeds will be much better, but this doesn’t really tie in well with how we work, and separate unicast deploys on our old Fog server used to work fine with no bottlenecking.

I am just wondering - is this normal? A few factors I suspect could be affecting this:

Could it be something to do with my current Kernel? I am using the default one at the moment, but maybe somebody knows of a different Kernel which copes better with a few unicast deploys at roughly the same time?
The image it is affecting most currently is a Windows 7 build with a bit of extra software, and also a recovery partition, so the image type is “Single disk, multiple partition (non resizeable)”. It is quite an old image now (around 20GB), and was captured from a deploy from our old Fog server. Perhaps the image is a bit messy and needs re-building?
Could it be related to the image compression option when initially creating the image? This is a completely new feature to me, so I have been leaving it at the default “6”. Space is not really an issue with our new Fog server, so if I need to pull this value down to “1”, it wouldn’t be a problem.

When the new server was first set and working, I tested a Windows 10 image capture and deploy (the main reason I upgraded to Fog 1.3.0 is for the Windows 10 support as it is becoming much more popular with our customers now) and I was blown away with how quick it was! The capture speed was hovering around 6GB/M and when it was deploying it was topping around 9GB/M. The 10GB Windows 10 image was deployed in 1 minute 12 seconds - that’s crazy fast compared to the old server!

Any ideas or suggestions on this would be greatly appreciated.

I would just like to send a big thank you to @Jaymes-Driver @Tom-Elliott for a previous discussion on setting up DNSMASQ on a (very) old thread. Although I never got around to setting it up 100% on the old server, I am happy to say it’s working perfectly on my new server. I revisited the old thread a few times recently, and with your guidance and suggestions I was able to get it going. Many thanks once again!

Wayne Workman

So if you’re deploying to just one host, you get 9GB/M but if you are deploying 2 hosts at once you get 300-800MB/M ?

That doesn’t make a lot of sense, these numbers just don’t align.

First, you need to have tests where you change just one thing, and not many things. I don’t know how you tested and came to these numbers, but the test probably wasn’t consistent which is why the numbers make no sense. Here’s what I would do:

Pick an image and stick with that.
Pick 2 computers of the same model and stick with them.
Don’t move the computers, use the same drops for both tests.
Image one computer, note the ending speeds towards the end of imaging.
Image the other computer, note the ending speeds again for this. Write these things down.
Then for the last phase, image both at the same time, write down the ending speeds for each.
Think about the results, share the results. Also share if their physical network links were 100Mbps or 1Gbps.

Also,
It may not be the FOG Server. It could be that the two computers you imaged at once are on a 100Mbps switch, and all the ports on that switch share a 100Mbps uplink to the next-tier device. It could also be that these two computers you imaged at the same time are just old, and that’s as fast as they can go. Cheap little netbooks get terrible speeds because of their low-power economy processors.

This new server, you’re sure it has a SATA 3 (6Gbps) disk in it? How much RAM does it have? Is it at least dual-core ?
Do you have it in a VM, if so, perhaps the VM Host’s storage was just very busy at the moment?

Also, a compression of 1 will most likely degrade your speeds, and a compression of 9 would also degrade your speeds. The compression is what balances network and CPU. If you have super-fast CPUs but a super-slow network, it makes sense to put compression on 9. If you have a super-fast network and super-slow CPUs, it makes sense to put compression on 1. Most people aren’t in these two extremes but are somewhere in the middle. 6 has been found to be the best balance for a 1Gbps network and SATA3 disks with modern Core i5+ processors. Refer to this for community research on compression: https://wiki.fogproject.org/wiki/index.php?title=Image_Compression_Tests
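The CPU-versus-size trade-off described here can be seen on a small scale with Python's zlib module. This is only a rough sketch: zlib levels 1, 6 and 9 stand in for the FOG compression setting, the sample payload is made up, and the timings on your machine will differ. It is not how FOG itself compresses images.

```python
import time
import zlib


def compare_levels(data: bytes, levels=(1, 6, 9)):
    """Return {level: (compressed_size_bytes, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results[level] = (len(compressed), elapsed)
    return results


# Hypothetical sample payload: repetitive text standing in for image data.
sample = b"FOG image block 0123456789 " * 200_000

for level, (size, seconds) in compare_levels(sample).items():
    print(f"level {level}: {size} bytes in {seconds:.3f}s")
```

Running it against a chunk of real data from one of your own images would give a more honest picture than the repetitive sample used here.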

Could it be related to the image compression option when initially creating the image?

Maybe, doubt it. If this image deployed fine before, it would not suddenly get terrible after being migrated to the new server - unless you deployed/captured it at a compression of 1 or 9 instead of just transferring the files from old->new.

Could it be something to do with my current Kernel?

Unlikely.

george1421

I’m personally a bit suspicious of the 9GB/min simply because the theoretical maximum for a single GbE link is 7.5GB/min. I realize the number given by partclone is a bit misleading because it includes decompression transfer rates too. In practice I see about 6GB/Min on a typical basis. It does vary some based on target computer and network utilization. But 6GB/min is a nice round number. To get the average transfer speeds you really need to look at the transfer rate at the 1 min mark. This is where the transfer settles down and is consistent.

Just so I don’t get off track here: 300MB/min is 5MB/sec (about 40Mb/s, or a little under half of a 100Mb/s link) and 800MB/min is about 13MB/s (roughly the speed of a 100Mb/s link). I’m not drawing any conclusions here, only setting the scope of the transfer rates.
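The unit conversions in this post can be checked with a few lines of Python. This is a minimal sketch assuming decimal units (1 GB = 1000 MB, 1 byte = 8 bits) and ignoring protocol overhead; the function names are just illustrative.

```python
def mb_per_min_to_mbit_per_s(mb_per_min: float) -> float:
    """Convert megabytes per minute to megabits per second (1 byte = 8 bits)."""
    return mb_per_min * 8 / 60


def link_max_gb_per_min(link_mbit_per_s: float) -> float:
    """Theoretical ceiling of a link in gigabytes per minute (1 GB = 1000 MB)."""
    return link_mbit_per_s / 8 * 60 / 1000


print(f"{link_max_gb_per_min(1000):.1f} GB/min")       # 7.5 GB/min for GbE
print(f"{mb_per_min_to_mbit_per_s(300):.1f} Mb/s")     # 40.0 Mb/s
print(f"{mb_per_min_to_mbit_per_s(800):.1f} Mb/s")     # 106.7 Mb/s
```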

Now you might want to tell us a bit more about this new FOG server. What type of hard disk subsystem do you have in this FOG server? Is it just a single spinning hard disk? What is the RPM it runs at? I can see a slow disk subsystem doing this (assuming it’s not the network). On a single unicast deployment the disk image is read sequentially from the hard drive. During 2 unicast streams the hard drive has to move between the two data streams to feed both, so the disk subsystem can’t optimize the reads (simply because it doesn’t know where the next request will come from). I’m not saying this IS the problem, but it could be one of the issues.

I would follow Wayne’s guidance on checking the network side. If you can’t find it externally then let’s look a bit closer at the FOG server.

Also, how much free RAM does the FOG server have when it’s fully running? (top) will give you an idea here.
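Besides top (or free -m), the same figure can be read from /proc/meminfo on the server. Below is a small sketch, assuming a Linux kernel recent enough to report MemAvailable (it falls back to MemFree otherwise); the helper names are made up for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: kibibytes}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_mib(meminfo: dict) -> float:
    """Memory usable without swapping, in MiB; falls back to MemFree."""
    kib = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    return kib / 1024


# On the FOG server itself (Linux only):
# with open("/proc/meminfo") as f:
#     print(f"{available_mib(parse_meminfo(f.read())):.0f} MiB available")
```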

Haz

Thank you both for your replies, your input on this is very much appreciated.
So I think maybe I should have done a little more testing on this before posting my original question, as Wayne has suggested.

I have found that these speed issues only occur when “staggering” deployments on multiple machines. Let me explain from the start -

I have tested deploy of a Windows 7 multiple partition non resizeable image to an Optiplex 780 SFF on all deploy stations that we have, just to eliminate a possible faulty Ethernet cable. All stations were fine with speeds hovering around 5.5GB/m - 6GB/m. Also, no pesky 100Mbps switches anywhere on the network - all are 1000Mbps.
I then used 2 identical machines to the one above and used station 1 and 2 to deploy the same image to each machine, one at a time. All results fine, same as the above.
I then imaged them both at the same time on the same stations. While the initial start speed was slightly slower (3.6GB/m) they eventually got back up to speed by the time the second partition was deploying, topping around 6.3GB/m.
I then added a 3rd machine to station 3, all same specs, same image etc. Again, slightly slow start to begin with but gradually gathering speed up to about 4.7GB/m in the end.

I repeated the process above with all 3 machines, however I “staggered” the deploy task to give around 1 minute between each one before the deploy starts.
This is where the speed issues began.
The first started normally around 6GB/m, but when the second machine started roughly 1 minute later, it was getting no more than 500MB/m for the first 15 minutes of deploying. The third machine started a minute after again, with no more than 350MB/m for the WHOLE deploy.

You both may be thinking this is a rather strange method of imaging and using FOG, but that is just how our routine has always been; air compressor in the machine to remove all dirt/dust, fit the required RAM, HDD, add-in cards etc for the customers requirements, begin image through FOG, then onto the next machine. Rinse and repeat. So by the time you have built the next system and want to image it, the first system is only around half way through its deploy.

I am not sure if FOG was designed to meet requirements of this nature. It’s just odd that our old 0.32 version could handle this routine quite happily.

Now, onto the FOG server specs -
Optiplex 380 SFF
Intel Core 2 Duo E7500 2.93Ghz
6GB RAM
I’m pretty sure it’s SATA2 on that board (correct me if I’m wrong)
I’m not running VM, it’s dedicated.

The old 0.32 server is an ancient Dell Precision 470 workstation, and I’m pretty sure this would have been SATA2 as well, if not lower.

@george1421 said in Very slow unicast deploy when imaging 2 or more machines:

Also, how much free RAM does the FOG server have when it’s fully running? (top) will give you an idea here.

I’m not sure how to check this, and I think you may have tried giving me a link here to find out but it hasn’t displayed properly?

Thanks again for your input so far on this.

Tom Elliott

@Haz I’m assuming you’re using “Partimage” for the images then?

I would almost think this is part of the problem. Partimage does native decompression of the gzipped data, whereas with Partclone we’re passing the data through a much-improved pipeline.

Maybe try re-uploading the image after it is on one system and give it a go? I’m just grasping at thoughts, and while it shouldn’t seem to impact things, we’re essentially (as I’m hearing it) doing two times the work on the Partimage files.

george1421

@Haz said in Very slow unicast deploy when imaging 2 or more machines:

Well that is a pretty comprehensive test. I’m still thinking it could be that single sata disk that is causing a problem.

One question comes to mind: when you do a 1 minute stagger start, after unit 1 finishes does unit 2 increase from 500MB/m up to 6GB/m, or does it stay at 500MB/m for the duration of the imaging?

<edit> OK, second question: can you get the details (model number) of that hard drive? If you are using a Dell then you can get this info from the BIOS screens.

Haz

Thank you again for the replies.

@Tom-Elliott said in Very slow unicast deploy when imaging 2 or more machines:

I’m assuming you’re using “Partimage” for the images then?

As far as I can remember, it is Partimage. I am out of the office now and will not be back until tomorrow (UK time) so will confirm it then. It is just the default imaging software that came with my installation (I installed FOG 1.2 from scratch and then used git pull to update to the latest version). If it does turn out to be this, would somebody be able to point me towards some instructions on how to install it to work with FOG?

@george1421 said in Very slow unicast deploy when imaging 2 or more machines:

I’m still thinking it could be that single sata disk that is causing a problem.

I am also starting to think the same thing. I’ll get the model number for you tomorrow but I can tell you it’s a 160GB Western Digital. Sometimes the performance on these drives can be pretty horrible, and simply running the WD Drive Utilities on the HDD in question can work wonders for drive performance, even when the drive is completely unreadable.

Yes, unit 2 does slowly increase in speed after unit one has finished imaging. Very slowly.

Just to add some confusion to this matter, I am actually going to be building a brand new FOG server for when we move premises in a week’s time. The current one was just a test run really, and I wanted to try and iron out any potential problems before we get into our new office. I’ll be using a Precision T3500 with at least 1x Xeon processor in and I’m pretty sure the board is SATA3 too. I’ll be very mindful of the HDD I choose this time around…

I’ll obviously update everyone on progress of the current server and will let you know how the Precision compares once it is built.

george1421

@Haz In reference to Tom’s post: was the image recorded with FOG 1.2.0 or newer, or is this an old 0.3x image? A newer version of FOG with an image captured on 1.2.0+ will net better results than an older captured image with newer FOG.

If you have the option (if you only have a single disk) use an SSD instead of a HDD. The cost is trivial and the speed is great. Your specs on the 380 are sufficient, just the HDD is in question (in my mind only). FWIW I was able to deploy ~4.5GB/m for a dual unicast deployment using a dual-core Celeron in an Intel NUC running off an SSD. So I know it’s possible to get pretty good speeds out of a low-end system. The FOG server itself is not really taxed much at all during deployment. The only thing the FOG server does during a deployment is move data from disk to the network. The target computer does all of the heavy lifting during imaging.

Wayne Workman

@george1421 said in Very slow unicast deploy when imaging 2 or more machines:

I’m personally a bit suspicious of the 9GB/min simply because the theoretical maximum for a single GbE link is 7.5GB/min.

That speed is not transfer rate, it’s write rate, the speed at which partclone is writing to disk. If you want to see actual network speeds, install iftop on the FOG server, and then run iftop -n during imaging on the server.
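If iftop is not available, a rough equivalent is to sample the interface byte counters in /proc/net/dev twice and work out the rate. The sketch below assumes a Linux server; the interface name eth0 in the usage comment is a placeholder, so substitute whatever your FOG server's NIC is called.

```python
def tx_bytes(proc_net_dev: str, iface: str) -> int:
    """Return the transmitted-bytes counter for iface from /proc/net/dev text."""
    for line in proc_net_dev.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == iface:
            # After the colon: 8 receive fields, then 8 transmit fields.
            return int(counters.split()[8])
    raise ValueError(f"interface {iface!r} not found")


def mbit_per_s(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Average rate in megabits per second between two counter samples."""
    return (bytes_after - bytes_before) * 8 / seconds / 1_000_000


# On the FOG server during a deploy (Linux only, interface name assumed):
# import time
# first = tx_bytes(open("/proc/net/dev").read(), "eth0")
# time.sleep(5)
# second = tx_bytes(open("/proc/net/dev").read(), "eth0")
# print(f"{mbit_per_s(first, second, 5):.0f} Mb/s leaving the server")
```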

Wayne Workman

@Haz said in Very slow unicast deploy when imaging 2 or more machines:

I repeated the process above with all 3 machines, however I “staggered” the deploy task to give around 1 minute between each one before the deploy starts.
This is where the speed issues began.
The first started normally around 6GB/m, but when the second machine started roughly 1 minute later, it was getting no more than 500MB/m for the first 15 minutes of deploying. The third machine started a minute after again, with no more than 350MB/m for the WHOLE deploy.

That is what I’ve found too, honestly. I believe this is normal. A while back, I argued for changing the default Max Clients setting to 2 or 3 instead of 10. Why? Because limiting how many are going at once to just 2 or 3 prevents one computer from taking 2 hours to image. I’d recommend you limit your max clients to 2 or 3, and you will most likely see the same decrease in overall duration for imaging a lab of 30 computers as I did. Computers that try to image when 3 are already going will simply wait in line for their turn via FOG’s queuing system (you don’t have to do anything for this), and that’s perfectly fine.
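The queuing behaviour can be pictured with a toy model. This is a sketch only, not FOG's actual scheduler: it assumes every host is queued at time zero, that each deploy takes a fixed time when it gets a slot, and that hosts are served first come, first served.

```python
import heapq


def finish_times(durations, max_clients):
    """Finish time of each task when at most max_clients run at once (FIFO)."""
    slots = []  # min-heap of times at which a running slot frees up
    finished = []
    for duration in durations:
        if len(slots) < max_clients:
            start = 0.0  # a slot is free straight away
        else:
            start = heapq.heappop(slots)  # wait for the earliest slot to free
        end = start + duration
        heapq.heappush(slots, end)
        finished.append(end)
    return finished


# Four hypothetical 10-minute deploys with Max Clients set to 2.
print(finish_times([10, 10, 10, 10], max_clients=2))  # [10.0, 10.0, 20.0, 20.0]
```

The later hosts wait for a slot, but each one then runs at full speed rather than dragging the others down.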

Haz

@george1421 said in Very slow unicast deploy when imaging 2 or more machines:

Was the image recorded with FOG 1.2.0 or newer or is this an old 0.3x image?

The image I have been using for these tests was captured using the new 1.3.0. I was thinking of transferring our old images to the new server, however I didn’t have much faith in it actually working or at best there would be issues somewhere down the line so I just captured from fresh.
Using an SSD was discussed, and I noticed an old thread here where somebody was thinking about using one, but I don’t think the results were actually shared. Now that you have clarified this for me I believe I will be going down the SSD route. Thanks.

@Wayne-Workman said in Very slow unicast deploy when imaging 2 or more machines:

That speed is not transfer rate, it’s write rate, the speed at which partclone is writing to disk.

Thanks for confirming this. I was getting rather confused after the server was continuing to display this speed after you stated it was not possible!

@Wayne-Workman said in Very slow unicast deploy when imaging 2 or more machines:

I’d recommend you limit your max clients to 2 or 3

That is a good idea. It will be a much better option for hosts to simply wait in line so that they can image at full speed rather than pull the rest of the hosts down. Considering the vast speed increase of 1.3.0 on just 1 or 2 hosts, this will be just as fast, if not faster than our old method on 0.32 anyway.

Many thanks to all of you for your input and suggestions, great help as usual.
Keep up the good work!

Harrison

Quazz

This thread is strange to me, because we can unicast like 7 hosts with barely any speed lost (7-8GB/min). Our Fog server has a refurbished HDD, so I don’t know if putting an SSD in your FOG server would help much.

Sounds more like your clients are being very slow for some reason (or network derping)

george1421

@Quazz said in Very slow unicast deploy when imaging 2 or more machines:

This thread is strange to me, because we can unicast like 7 hosts with barely any speed lost (7-8GB/min)

Please describe your FOG server then (hardware, and number of NICs connected to your LAN from the FOG server).

Quazz

@george1421 It’s an HP Elite 6200 I think? Core i5 3470 or something, 6GB RAM, refurbished 5200RPM HDD, one 1gbps NIC, nothing fancy.

george1421

@Quazz Hmm, it’s less than I expected. But impressive speeds nonetheless.

Quazz

@george1421 I should note, that we don’t use FOG client and storage nodes or anything, so it literally only does network boot menu and imaging, which might explain it a bit.

Wayne Workman

@Quazz I’ve not had the same luck you’ve had. Our bottleneck was always the network connection to the FOG server, I could clearly see the NIC being maxed out by watching iftop -n

george1421

@Wayne-Workman Create a LAG trunk of 2 or more links. It’s not foolproof but will help if you are using a static LAG or LACP (802.3ad).

Depending on the algorithm used, it will select a LAG channel based on a hash of the target host’s IP or MAC. This will spread the traffic out. I have to say it carefully because LAGs don’t multiplex the traffic over all links; they only allocate the traffic between two devices to different links.
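To illustrate why a LAG spreads hosts across links rather than speeding up any single transfer, here is a deliberately simplified hash. Real switches and bonding drivers use their own hash policies; the XOR-of-last-MAC-byte rule and all MAC addresses below are assumptions for the sake of the example.

```python
def pick_link(src_mac: str, dst_mac: str, link_count: int) -> int:
    """Simplified layer-2 hash: XOR the last byte of each MAC, modulo link count."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % link_count


server = "00:11:22:33:44:10"  # hypothetical FOG server MAC
clients = ["00:11:22:33:44:a1", "00:11:22:33:44:a2", "00:11:22:33:44:a3"]

for client in clients:
    # The same server/client pair always hashes to the same link.
    print(client, "-> link", pick_link(server, client, 2))
```

With these made-up addresses the three clients land on links 1, 0 and 1, so two of them still share one physical link.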

Wayne Workman

@Haz The thread has gone off topic and has been forked to here:
https://forums.fogproject.org/topic/9085/host-already-registered-and-memtest-issue
