Painfully slow image upload / hung_task_timeout
I’ve had an Ubuntu 12.04 LTS FOG server running for a few weeks now, and I have a nice collection of XP and 7 images for the different deployments we use. I am currently uploading my second custom image for a tablet some of our doctors use: an HP EliteBook 2760p running Windows 7. My first image was this system running Office 2007; the current image is running Office 2013.
Now to the issue: This most recent image upload is failing miserably. I am getting this message every few minutes on the client. It looks like it is trying to overlay the message on top of the progress screen.
[CODE]task pigz:307 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message[/CODE]
In the web task management, I still see the queued task, but it is currently 1h out of 16h complete (5%) and it has uploaded 1.33GiB of 20.56GiB (0.33 MiB/sec)
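For anyone who wants to sanity-check those numbers, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted from the task page; the function names are made up for illustration.

```python
# Figures quoted from the FOG task page in this post.
TOTAL_GIB = 20.56
DONE_GIB = 1.33
REPORTED_MIB_PER_SEC = 0.33


def average_rate_mib_s(done_gib, elapsed_hours):
    """Average throughput in MiB/s for done_gib copied in elapsed_hours."""
    return done_gib * 1024 / (elapsed_hours * 3600)


def hours_to_finish(total_gib, rate_mib_s):
    """Hours needed to move total_gib at a steady rate_mib_s."""
    return total_gib * 1024 / rate_mib_s / 3600


if __name__ == "__main__":
    rate = average_rate_mib_s(DONE_GIB, 1)
    total = hours_to_finish(TOTAL_GIB, REPORTED_MIB_PER_SEC)
    print(f"average rate so far: {rate:.2f} MiB/s")
    print(f"whole image at the reported rate: {total:.1f} h")
```

The two results land close to the 0.33 MiB/sec and 16h the web UI shows, so the estimate itself looks consistent; it is the transfer rate that is the problem.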
The most recent change I’ve made to the server is adding disks; however, this is not the first image I have uploaded on the new disks. I’ve included a screenshot of my System Monitor/File Systems tab. As you can see, I have plenty of space.
They work great for FOG. I have no complaints with the FOG portion, and I managed to get everything working with the free and trial software. My management software, however, REFUSES to work or communicate, probably because of the ports it operates on.
Always been more of a Netgear guy myself… what models are the Ciscos?
I use D-Link DGS-1248T switches in my backbones. They are managed switches, so I can enable or disable features or ports on the fly.
I also have some newer Cisco switches… unfortunately I don’t speak highly of them. They were purchased on contract for a new addition to my building. When they were set up, the contractor did all the work, and now they aren’t on my contract or on my service license, so I CAN’T SERVICE THEM OR EDIT SETTINGS, THANKS CISCO!!! The switches themselves work GREAT, but I can’t enable or disable the features that get in the way.
Yep, I just recently did a FOG deploy to ~30 machines our organization was getting rid of and giving to the public. I made a hardware-independent XP image and pushed it to those machines all in one go, and it took a while for me too. I had all of them plugged into an old 3COM 10/100 switch. I knew I had taken that switch offline for a reason lol. Later, when I had another machine that I needed to add to the group, I imaged it alone and the speed was much higher. My guess is that it comes down to a combination of the switch’s actual throughput and its onboard memory. Now I know this isn’t usually much of a factor for switches, but later I sent another image operation through a newer 10/100 switch and it went even faster (differences in host hardware aside).
So clearly we should always have the fastest switches we can afford in our network’s backbone, and only allow a 10/100 (if we have to) where it is supporting just a couple of workstations. But I wonder if there are any other specifics we can pick up on, such as brand, model series, or other specifications that tend to choke FOG up?
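To put rough numbers on the switch speed question, here is a small Python sketch of the best-case transfer time at different link speeds. The 20.56 GiB image size is borrowed from the upload described elsewhere in this thread, and the function name is just for illustration. It ignores protocol overhead, compression, disk speed, and everything else, so treat the results as ceilings rather than predictions.

```python
def best_case_minutes(image_gib, link_mbit_s):
    """Minimum transfer time in minutes at full line rate, with no overhead."""
    bits = image_gib * 1024**3 * 8
    return bits / (link_mbit_s * 1_000_000) / 60


if __name__ == "__main__":
    # Example image size taken from this thread; swap in your own.
    image_gib = 20.56
    for label, speed in (("10 Mbit/s", 10), ("100 Mbit/s", 100), ("1 Gbit/s", 1000)):
        print(f"{label}: {best_case_minutes(image_gib, speed):.1f} min")
```

Real transfers will be slower than these figures, and an overloaded or misbehaving switch can fall far below its rated speed, which may be why an old 10/100 unit hurts so much more than the raw numbers suggest.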
AWESOME!!! Gotta love the easy fixes. Sure, they cause you a lot of heartache, but when it comes down to it, replacing a switch is not that big of a deal to me :)
Glad we could help!
Holy snog. It was an archaic 10/100 switch that was holding this whole thing up. Switched it out with gigabit and it’s fine. This does concern me a little bit because we have some off-site machines I was hoping to FOG at a later date. That is a bridge I’ll cross when I come to it.
Hopefully you can find the snag and get it off the ground again. If you want to eliminate the switches as a possibility, use a small home network router to dole out DHCP, set up a test network, and see if the speed increases. If it does, there may be something in your network causing the limitation, and it may not have anything to do with the image or the hardware.
If the imaging goes off without a hitch on a small network, you can image your machines off the network and then deploy them in the environment if that makes the job faster, then you can troubleshoot the issues when time allows.
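If you want a number to compare between the test network and the real one, a raw TCP throughput check is one way to get it. Below is a minimal Python sketch of the idea. As written it runs both ends on the same machine over loopback just to show the mechanics, so it says nothing about your switches yet; to test the network you would run the receiving half on the FOG server and the sending half on a client. The function names and the 16 MiB default are arbitrary choices for the example, and a dedicated tool such as iperf will do this job more thoroughly.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes sent per call


def _sink(server_sock, result):
    """Accept one connection and count every byte received until EOF."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            total += len(data)
    result["bytes"] = total


def measure_throughput_mib_s(total_mib=16, host="127.0.0.1"):
    """Send total_mib of zeros to a local receiver and return MiB/s."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=_sink, args=(server, result))
    receiver.start()

    payload = b"\0" * CHUNK
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        for _ in range(total_mib * 1024 * 1024 // CHUNK):
            client.sendall(payload)
    receiver.join()  # wait until the receiver has seen EOF
    elapsed = time.perf_counter() - start
    server.close()
    return result["bytes"] / (1024 * 1024) / elapsed


if __name__ == "__main__":
    print(f"{measure_throughput_mib_s():.1f} MiB/s over loopback")
```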
Actually it’s good you say that. I am currently doing exactly what you suggested, because I reverted to my initial FOG server installation with a single 140GB drive and the problem still persists. I am fairly certain this has nothing to do with FOG and something to do with a network change. I’m testing on multiple computers now to see if it could be either a bad NIC in my test PC or the part of the network where the imaged computer resides.
The bad part is I just got 15 new machines delivered that are supposed to get deployed next week.
Since you upgraded, try a different image. Build one virtually, push it up, and pull it down. It doesn’t have to be extravagant, but it should be somewhat close in size to your current images. If you want to really test, make a few images of different sizes and see where it hits its snag. Maybe it’s part of your image setup that is causing the problem and not FOG’s fault. I’m not saying that it’s NOT FOG’s fault, but try a different image to see if you can re-create the issue.
I have a RAID 5 server set up running the latest FOG 0.33b. It’s an outdated server from our production environment, but it serves the purpose well, and I don’t experience issues with upload or download speeds. My images sit around 30-45 gigs, depending on whether it is the student or the staff image.
Update: So I completed my install. Now FOG is installed on a RAID 5 with four disks, all 140GB SAS drives, giving me 410ish GB of effective space. After restoring all my images and such, it is still taking 8-12 hours per image to deploy or upload.
I am just going to assume FOG and my cpu/controller combo do not get along and go back to a single disk solution.
If anyone has any ideas, I’ll keep these disks around so I can swap back to this config and test.
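As a side note on the capacity figures in this thread, RAID 5 usable space follows the usual rule of thumb of (number of disks minus one) times the disk size, since one disk’s worth of space goes to parity. A small Python sketch, with function names made up for the example:

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 5 array: one disk's worth is lost to parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


def gb_to_gib(gb):
    """Convert decimal gigabytes (10^9 bytes) to binary gibibytes (2^30 bytes)."""
    return gb * 1000**3 / 1024**3


if __name__ == "__main__":
    # Disk layouts mentioned in this thread.
    for count, size in ((3, 72), (4, 140), (4, 146)):
        usable = raid5_usable_gb(count, size)
        print(f"{count} x {size} GB -> {usable} GB usable ({gb_to_gib(usable):.0f} GiB)")
```

Filesystem overhead and the way the OS reports sizes will shave the number down a little further, so a figure slightly below the computed one is expected.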
Please let me know if any more detailed information would be helpful.
Currently I have two separate logical partitions. The OS and FOG are installed on a single SAS drive. The images are stored on a separate RAID 5 containing three 72GB SAS drives.
Anyway, my solution now is backing up my system and reinstalling Linux fresh. This time I am going with four 146GB SAS drives in a RAID 5 from the start. We’ll see if it was the drives or the multiple logical partitions that caused the issue. I was outgrowing that space anyway.
Great tool guys, I’ll update when I see how it runs with this new environment.
What type of “array of disks” is it set up as? RAID 1/5/6/10?
RAID is good for redundancy, but remember that writes are roughly twice as expensive, especially in RAID 1/10, where every write is mirrored to another drive. The bottleneck probably isn’t the drives’ write cycles themselves but the CPU load needed to make sure those tasks are performed.
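To put rough numbers on the write cost, here is a Python sketch using the commonly quoted rule-of-thumb write penalties per RAID level (back-end I/Os per logical write). The 150 IOPS per disk is a made-up placeholder, not a measurement of any drive in this thread, and the function name is just for illustration.

```python
# Commonly quoted rule-of-thumb write penalties (disk I/Os per logical write).
# These apply to small random writes; large sequential writes usually fare better.
WRITE_PENALTY = {
    "RAID 0": 1,
    "RAID 1": 2,
    "RAID 10": 2,
    "RAID 5": 4,   # read data, read parity, write data, write parity
    "RAID 6": 6,
}


def effective_write_iops(iops_per_disk, disk_count, level):
    """Rough write IOPS an array can sustain for a given RAID level."""
    return iops_per_disk * disk_count / WRITE_PENALTY[level]


if __name__ == "__main__":
    ASSUMED_IOPS_PER_DISK = 150  # hypothetical figure for illustration only
    for level in ("RAID 1", "RAID 10", "RAID 5", "RAID 6"):
        disks = 2 if level == "RAID 1" else 4
        iops = effective_write_iops(ASSUMED_IOPS_PER_DISK, disks, level)
        print(f"{level} with {disks} disks: about {iops:.0f} write IOPS")
```

Whether the parity work lands on the CPU or on the controller depends on the hardware, so this only shows the relative cost between levels, not where the time actually goes.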
Update: After waiting 18 hours, the image finally completed. I am now pulling a different, previously working image back down to the device, and it is estimating 5 hours to complete. I should also note that the pigz timeout error is not present in this deployment, only in the previous upload task.
What happened to my 12-minute imaging? What could this new array of disks have done that would kill performance this immensely?
Thanks for any input.