Mysterious Data Rate Slowdowns on Deploy
I’ve been banging my head on this for the last week, and any insight would be greatly appreciated. The issue is that when I deploy an image on a standalone gigabit network, the data rates start out as I would expect for this type of network (4.5 GB/min, i.e. roughly 77 MB/sec), but at some point they just drop to 4-7 MB/sec. That point is ALWAYS the same for a given image. Deploying by multicast or unicast makes no difference. I even tried to unicast three machines with a one-minute staggered start: when the first machine hit the point in the image where it slowed down, the other two kept going at full speed until they each individually hit the same point in the image. I’m using FOG 0.32.
I have no problems uploading the image; that goes all the way through without a slowdown. I’m doing a multi-partition, single-disk image of a Windows 7 installation. I have taken images from three different machines in various stages of configuration (and therefore of different sizes). I always do a Windows check disk (with surface scan) and defrag before I image. Each of the three images has a different point where the deployment slows down, but it’s always fairly early on during the second partition: 13.4 GB, 19.0 GB, and 21.8 GB into a roughly 126-148 GB partition.
I started with the server as a Dell OptiPlex 380 with 2 GB RAM, but then converted one of the 50 new HP Pro 6300s (4 GB RAM) that I’m trying to image into the current server. I’ve tried CentOS 6.3 and Ubuntu 12.04 LTS. I’m using a Netgear ProSafe GS108T gigabit switch, but I also tested with a Netgear 10/100 unmanaged switch (FS108) and got the same results. Cables have been changed as well.
I’ve downloaded the 10 or 12 newest pre-compiled kernels in the repository as well as the kernel that ships with 0.32. No difference. I have tried to compile my own, but I keep getting kernel panics, so I’ve abandoned that approach for now.
The one thing I have not changed is the server disk that I purchased for this project, a Seagate Barracuda ST2000DM001 (2 TB, SATA-600) using LVM for the (EXT4) partitions. I have checked the drive and the file system a number of times. I even tried the official Seagate program, SeaTools, to check the drive. I have not found any errors with it. I’m going to try the drive that came with the HP Pro 6300 without LVM, just for the hell of it, since I can’t think of anything else to change. I’ll update when that’s done.
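For reference, a rough way I could sanity-check the drive’s sequential read speed is to time dd against one of the image files (the path below is hypothetical, and the numbers will be inflated if the file is already in the page cache):

```shell
#!/bin/sh
# Rough sequential-read check of the image store. A healthy SATA drive
# should sustain well over 100 MB/s here; numbers in the 4-7 MB/s
# range would point straight at the disk.
seq_read() {
    # dd prints its throughput summary on the last line of stderr
    dd if="$1" of=/dev/null bs=1M 2>&1 | tail -n 1
}

# Example (image path is hypothetical):
#   seq_read /images/win7lab/d1p2.img
#
# SMART attributes (smartmontools package); reallocated or pending
# sectors can slip past a plain surface scan:
#   sudo smartctl -A /dev/sda | grep -Ei 'realloc|pending'
```

Dropping the page cache first (echo 3 > /proc/sys/vm/drop_caches, as root) keeps the number honest.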
Any other suggestions?
There’s a right mixed bag of stuff going on here… hard to pinpoint problems of this nature. A thought did occur to me as I was reading, though: what are the network interface cards on the client end, and the server end for that matter? And do you use any form of traffic management?
As a test, take a client machine, disable the onboard NIC if that’s what you’re using, and put in an alternative NIC. Try to find one that’s not the same chipset, purely so we can eliminate the NIC as a cause. If the card in the server is the same, you’ll need to change it as well to test, and you’ll have to mess about with IP addresses etc. in Linux… but anyhow, it should at the very least eliminate it from the possible causes. I have my suspicions that it’s not actually a software issue, but the NICs are where I’d go looking first.
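To save some digging, something like this shows the chipset, driver, and negotiated link for an interface (the eth0 name is an assumption, and lspci/ethtool may need installing):

```shell
#!/bin/sh
# Show the NIC chipset, driver, and negotiated link for an interface.
# A silent renegotiation down to 100 Mb or half duplex mid-transfer
# would look a lot like the slowdown described here.
nic_info() {
    iface=${1:-eth0}   # assumed interface name
    command -v lspci   >/dev/null && lspci | grep -i ethernet
    command -v ethtool >/dev/null && ethtool -i "$iface" 2>/dev/null
    command -v ethtool >/dev/null && ethtool "$iface" 2>/dev/null | grep -E 'Speed|Duplex'
    return 0
}

nic_info eth0
```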
[quote=“Matt Harding, post: 11231, member: 1207”]So the disk is the common component here? You used it on the Dell and moved it to the new system when you tried that out?[/quote]
That is correct. I did a fresh install of the OS and FOG just in case. Thinking that it might be the drive, even though it never produced any errors, I just went through the whole install process with a different drive. It made no difference. I tried again, this time without LVM (I used primary partitions) and with Ext3 instead of Ext4. It made no difference. At this point, every single bit of hardware and software has been replaced without any change in the result. The only constant left is the client machines. So… I pulled two identical XP machines out of storage to image and restore. This actually worked fine, but the images were just a fraction of the size of the HP images I’ve been handling, so I can’t honestly say it was a fair test.
[quote=“Matt Harding, post: 11231, member: 1207”]To be honest, if you’re using only one disk to push images out and it’s a standard SATA3 drive, I’d be very surprised if you didn’t see the data rate dip after a while, but it’s curious that when you stagger machines, each starts off well and then dips. Is it dropping off at the same point no matter which machine or how many you’re doing at once? If that’s the case, I suspect the image is being cached in RAM as disk reads normally are, and you’re hitting a limit somewhere… either with the drive’s ability to sustain pushing the data out, or the interface used, etc.[/quote]
I suspected it might be something like you described, but I did swap my gigabit switch for a 10/100 megabit switch early on in my testing. Obviously the data rates were a lot slower, but the deploy still dropped at the same point for the image I was testing. I don’t think I reimaged with the 10/100 switch, though.
During this last reconfiguration I noted above, I pulled a new image while I had top running. To be honest, I had never paid that much attention while uploading an image, since I wasn’t experiencing any slowdowns during that operation. This time I did, and I noticed something.

When I started the upload to the server, the load average was around 0.75, bouncing between 0.6 and 1.2, with the top processes being 3-5 different nfsd processes. The bandwidth graph (receive) was pretty jagged, and then it suddenly flatlined at 0 MB/s at the same time that all the nfsd processes disappeared from top. The load average also began to drop to idle (0.06-0.13). A ps -A showed the nfsd processes were still running; they just weren’t doing anything. About 20 seconds later, the bandwidth graph jumped back up to its previous levels and the nfsd processes were active again. Ten seconds after that, it flatlined again and stayed flat at zero, with nfsd also inactive. The graph showed a brief 5-6 MB/s spike every 40 seconds or so before returning to zero.

Interestingly, the client still showed data being copied (without any slowdown in data rate; it was around 3 GB/min) and continued to do so until it reached the end of the partition 45 minutes later. The total size of that partition was 126 GB. I did note that these two slowdowns happened at 19.6 and 21.9 GB copied. I then deployed that image to a new machine, and it followed the exact same pattern, with the slowdowns occurring at the exact same points in the image (19.6 and 21.9 GB). The only difference is that on deploy the data rate does drop (as reported by the client), so what took 45 minutes to upload takes several hours to complete.
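In case it helps anyone reading along, one way to confirm an nfsd stall (beyond watching top) is to sample the kernel’s cumulative RPC counter; this is just a sketch, and /proc/net/rpc/nfsd only exists on a box running the kernel NFS server:

```shell
#!/bin/sh
# Print the cumulative NFS RPC call count, or -1 if the kernel NFS
# server isn't running here. If the count stops advancing between
# samples during the flatline, the server really has stopped serving.
nfsd_calls() {
    [ -r /proc/net/rpc/nfsd ] || { echo -1; return 0; }
    awk '/^rpc/ {print $2}' /proc/net/rpc/nfsd
}

# Sample every 5 seconds during a deploy, e.g.:
#   while sleep 5; do echo "$(date +%T) $(nfsd_calls)"; done
nfsd_calls
```

The “th” line in the same file shows how busy the nfsd threads are; on CentOS the thread count can reportedly be raised via RPCNFSDCOUNT in /etc/sysconfig/nfs, though I haven’t tried that yet.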
Any ideas what might be going on?
So the disk is the common component here? You used it on the Dell and moved it to the new system when you tried that out?
To be honest, if you’re using only one disk to push images out and it’s a standard SATA3 drive, I’d be very surprised if you didn’t see the data rate dip after a while, but it’s curious that when you stagger machines, each starts off well and then dips. Is it dropping off at the same point no matter which machine or how many you’re doing at once? If that’s the case, I suspect the image is being cached in RAM as disk reads normally are, and you’re hitting a limit somewhere… either with the drive’s ability to sustain pushing the data out, or the interface used, etc.
Out of curiosity, when you’re pushing an image out to machines, run top on the server and see what the processor is doing. Is it flat out?
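While top is running, the dirty page counters are worth a glance too; a small sketch (Linux-only, reads /proc/meminfo):

```shell
#!/bin/sh
# Report how much data the page cache is holding dirty (not yet
# written back). A big Dirty spike followed by a long Writeback
# plateau right when the transfer flatlines would support the
# caching theory.
dirty_state() {
    awk '/^(Dirty|Writeback):/ {printf "%s %s kB  ", $1, $2} END {print ""}' /proc/meminfo
}

# e.g.:  while sleep 5; do echo "$(date +%T) $(dirty_state)"; done
dirty_state
```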
Also, I forgot to mention that I booted one of the client machines with a CentOS Live CD, formatted the local drive with Ext4, mounted the server over NFS, and copied one of the image files without a problem.
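Next time it stalls, I’ll also watch TCP retransmits on the server, to separate a network problem from a disk/NFS one; a sketch using Linux’s /proc/net/snmp counters:

```shell
#!/bin/sh
# Print the kernel's cumulative TCP retransmitted-segment count
# (RetransSegs). A jump during the flatline would point at the
# network; a flat count would leave the disk/NFS side as the suspect.
tcp_retrans() {
    [ -r /proc/net/snmp ] || { echo -1; return 0; }
    # First Tcp: line is the header, second holds the values.
    awk '/^Tcp:/ { if (!col) { for (i = 1; i <= NF; i++) if ($i == "RetransSegs") col = i } else print $col }' /proc/net/snmp
}

tcp_retrans
```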