Capturing image always hangs fog 1.4.4

MotherFogger

Server

FOG Version: 1.4.4
OS: Ubuntu 16.04

Client

Service Version:
OS: W7X64

Description

Hello, I recently set up a FOG 1.4.4 server on a VMware ESXi host running Ubuntu 16.04. I set up a base image on a system (Dell T1700, W7X64), and every time I try to capture it, it hangs at various percentages and the transfer rate drops from about 7 GB/min steadily all the way down to almost nothing. I have played with compression settings and image formats, and while some get me to a higher percentage than others, none of them have gotten me past 45% complete. I am running through an unmanaged gigabit switch connected directly to the closet, but I also tried connecting the computer I'm capturing from directly, and that didn't help either. I tried searching for best practices on capturing an image with FOG 1.4.4, but I haven't had any luck finding concrete information. Any help you can provide would be appreciated.

Thank you

Tom Elliott

What’s the compression rating set to? What’s the image manager set to?

MotherFogger

Currently set to a compression of 7, but I've tested most of them between 0 and 22. I'm using Partimage right now, but I've tried Partclone Gzip and Partclone Zstd, as well as the 200 MB split options.

Tom Elliott

@MotherFogger Partimage doesn’t do/mean anything as captures are now handled by partclone always. Just want to know what the compression medium is/was. (Gzip/Zstd etc…)

Any way we could get you to try to capture from another machine? This hanging seems strange in general, but maybe there’s a problem with the system you’re working with (considering you can only get to 45% captured). I suppose this could also be a disk space problem on the FOG Server? (If they always get to the same amount then hang).

MotherFogger

@Tom-Elliott I'm setting up a new client to be captured right now. I will be using Partclone Gzip with a compression of 7 (unless you think there are more optimal settings). I don't believe it's a space issue on the server: it has a 500 GB partition, and I'm capturing the image as single disk resizable, which it reports as a 7.9 GB image. The drive I'm pulling from is a 320 GB drive, so even if it did capture the whole drive as raw, it should still have enough room for at least one image. I will report back once I test this new client capture.

Thank you

Tom Elliott

@MotherFogger I’d actually say PartClone ZSTD compression at 19 for best results. A minor loss of speed in capture for the compression (but not as much as gzip on 9) and much better overall results on deployment as well as smaller on server disk size.

MotherFogger

@Tom-Elliott Ran the capture with Partclone Zstd and a compression of 19. Things started off slowly (averaged 1 GB/min for the first few minutes compared to 7 GB/min normally), and shortly afterwards speeds began to fall off fast. The first 10 minutes got me to 30% captured, the next hour was spent getting to 42%, and after the speeds dropped below the 50 MB/min mark, I killed the task. A buddy who also uses FOG suggested I change my fog_boot_exit_type setting from sanboot to grub. I'm running a new capture now, but it's looking identical to the last one I ran.

Tom Elliott

@MotherFogger This leads me to think there’s a problem with the FOG server or with the network the data is being transferred over. I am 100% certain changing the boot exit type won’t have any effect on capturing or deploying an image.

Wayne Workman

@MotherFogger said in Capturing image always hangs fog 1.4.4:

the transfer rate drops from about 7 GB/min steadily all the way down to almost nothing.

When the transfer rate is just slowly dropping forever, what you are seeing is an average calculated from elapsed time, remaining time, and overall completion. It’s a thing that partclone does (and is sort of stupid). In reality, when this drop-off begins happening, there’s actually nothing transferring at all to/from that particular host you’re working with.
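To illustrate why an averaged rate keeps falling after a stall, here is a minimal sketch. It assumes the displayed rate is simply data copied so far divided by elapsed time, which is a simplification and not necessarily partclone's exact formula. The `avg_rate` helper and the 70 GB figure are hypothetical, chosen only to match a 7 GB/min start.

```shell
#!/bin/sh
# avg_rate GB_COPIED ELAPSED_MIN
# Prints the average rate in GB/min, assuming rate = copied / elapsed.
# (Assumption for illustration; not taken from partclone's source.)
avg_rate() {
    awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.2f\n", gb / min }'
}

# Hypothetical run: 70 GB copied in the first 10 minutes, then nothing more.
# The copied amount stays fixed while elapsed time grows, so the average decays.
for minutes in 10 20 40 80; do
    echo "after ${minutes} min: $(avg_rate 70 "$minutes") GB/min"
done
```

The copied amount never changes in the loop, yet the printed rate halves each time the elapsed time doubles. That is the same slow decline seen on the partclone screen while no data is moving.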

Check the free space on your server. Run df -h and look for partitions at 99% or 100% usage, or close to full. Try capturing from a different machine as a control, even a different model, and see how that goes.
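The free-space check can be scripted roughly as follows. This is a sketch, not part of FOG: the `full_filesystems` helper and the 90% threshold are assumptions for illustration. It reads `df -P` output (the POSIX single-line format) so the usage column is always the fifth field.

```shell
#!/bin/sh
# full_filesystems [THRESHOLD]
# Reads `df -P` style output on stdin and prints "mountpoint usage%"
# for every filesystem at or above THRESHOLD percent (default 90).
# Hypothetical helper; mount points containing spaces are not handled.
full_filesystems() {
    threshold=${1:-90}
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign
        if (use + 0 >= t) print $6, $5
    }'
}

# On the FOG server: list anything that is nearly full.
df -P | full_filesystems 90
```

No output means nothing is at or above the threshold. The partition holding the image store is the one that matters most for captures.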

Keep trying different things - keep troubleshooting the problem, you’ll find the issue. Might even be a duplicate IP or something dumb like that. Could be a bad port on the switch, it could be anything is the point I’m making. You have to just keep troubleshooting with various tests to see what works and what doesn’t to isolate the problem.

MotherFogger

@Wayne-Workman All drives have between 98 and 100% free space, plenty for storing an image. Is there a way I can check on the server to see what information is currently being transferred? I've tried with multiple hosts, multiple images, multiple cables, etc. I don't think it's a duplicate IP or the port on the switch, though I am looking into a possible network config issue. Tomorrow I'm going to try creating a VM on my farm and see if capturing it directly from there solves this issue.

Sebastian Roth

@MotherFogger As Wayne already mentioned, the transfer rate is just an average based on elapsed time. Please pay attention to the actual bytes being read/transferred in the partclone view. Do those still rise?

Are you able to ping that client or connect via SSH when the transfer rate drops?

It would also be interesting to see if there are still any packets being transferred between the server and that client when this is happening. Would you please capture a packet dump shortly after the transfer rate starts to drop? Run the following command on your FOG server: tcpdump -w /tmp/dump.pcap ip x.x.x.x (put in the client’s IP address instead of x.x.x.x). After maybe 20 to 30 seconds, stop the dump (Ctrl-C), upload the dump.pcap file, and post a link here. In case you don’t want to upload it publicly, I’ll send you my mail address as a private message here in the forum as well. Whichever you like.

MotherFogger

@Sebastian-Roth Thank you. Yes, I am still able to ping clients even when the transfer looks like it's stopped. I let a capture run all night, and it took almost 18 hours, but it did eventually complete. I captured the log, but don't exactly know how to extract it for upload. If this was a Windows server I could just xcopy it. What's the best way to get the dump off the server and onto another location?

Sebastian Roth

@MotherFogger There is SCP (e.g. use WinSCP from one of your Windows clients) to copy files from the FOG server to your client. Or use FTP (e.g. FileZilla). For both you should be able to login using the fog user account. Find the password in the FOG web GUI in FOG Configuration -> FOG Settings -> TFTP Server -> FOG_TFTP_FTP_PASSWORD (use the eye symbol)…

MotherFogger

@Sebastian-Roth I appreciate the quick response. I just sent the dump off to your email. Thanks for all the help; hopefully we can get this issue tracked down and resolved quickly.

Sebastian Roth

@MotherFogger Ok, this looks really strange. I see thousands of TCP Dup ACK packets for TCP port 2049 - which is NFS! After about 30 seconds those Dup ACKs stop (maybe because the server - NFS service - responded) and we start over again with TCP Dup ACKs for further NFS data plus TCP Window Update packets.

To me this looks like there are packets lost or heavily delayed somewhere along the way causing the drop of the speed because TCP needs to resend packets over and over again. Though I am not sure if those packets are lost within the physical network or maybe it has to do with the ESX server?

Did you modify the FOG server NFS service config by hand? Please post the full content of /etc/exports file here.

I am sorry, I got the tcpdump syntax wrong. Should’ve been tcpdump -w /tmp/dump.pcap host x.x.x.x (then we would have seen packets going both ways)
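A timed capture using the corrected host filter could be sketched like this. The `capture_cmd` helper, the 30-second default, and the 192.0.2.10 address are illustrative assumptions. The helper only prints the command so it can be reviewed first, and it relies on the coreutils timeout utility to stop tcpdump instead of Ctrl-C.

```shell
#!/bin/sh
# capture_cmd CLIENT_IP [SECONDS] [OUTFILE]
# Prints a tcpdump command that records traffic to AND from CLIENT_IP
# (the "host" filter matches both directions) and stops after SECONDS.
# Hypothetical helper; it builds the command line but does not run it.
capture_cmd() {
    client_ip=$1
    seconds=${2:-30}
    outfile=${3:-/tmp/dump.pcap}
    printf 'timeout %s tcpdump -w %s host %s\n' "$seconds" "$outfile" "$client_ip"
}

# Dry run: show the command for an example client address.
capture_cmd 192.0.2.10

# To actually capture, run the printed command as root on the FOG server
# once the transfer rate has started to drop.
```

Keeping the capture short limits the size of dump.pcap while still covering enough of the stall to show retransmissions in both directions.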

MotherFogger

@Sebastian-Roth So I did create a VM on my farm and ran a deploy to that. The imaging process took under a minute and was very smooth, no issues at all. I have not modified any NFS configs; I assume they're all default/auto-managed. I have a copy of the exports file. Would you like me to re-run the tcpdump with the new syntax?

https://ufile.io/fpcep - Exports file

Sebastian Roth

@MotherFogger Interesting to hear that it’s all working fine when you stay within the ESX VM environment. As well the NFS config (exports) looks good to me.

What’s in between the client and the FOG server? Some kind of router or layer 3-7 switch that might interfere here? Yes, please take another packet dump with the new syntax. The dump file may grow a little bigger this time, but it’s definitely worth it. If it grows to 5 MB or more, you might upload the file somewhere and send me the link via mail.
