Capturing image always hangs fog 1.4.4
-
@MotherFogger The Partimage setting doesn’t do/mean anything, as captures are now always handled by partclone. I just want to know what the compression method is/was (Gzip, Zstd, etc.).
Any way we could get you to try to capture from another machine? This hanging seems strange in general, but maybe there’s a problem with the system you’re working with (considering you can only get to 45% captured). I suppose this could also be a disk space problem on the FOG Server? (If they always get to the same amount then hang).
-
@Tom-Elliott I’m setting up a new client to be captured right now. I will be using partclone gzip with a compression of 7 (unless you think there are more optimal settings). I don’t believe it’s a space issue on the server: it has a 500 GB partition on it, and I’m capturing the image as single disk resizable; it says it’s a 7.9 GB image. The drive I’m pulling from is a 320 GB drive, so even if it did capture the whole drive as raw, it should still have enough room for at least one image. I will report back once I test this new client capture.
Thank you
-
@MotherFogger I’d actually say PartClone ZSTD compression at 19 for best results. A minor loss of speed in capture for the compression (but not as much as gzip on 9) and much better overall results on deployment as well as smaller on server disk size.
-
@Tom-Elliott Ran the capture with partclone ZSTD and a compression of 19. Things started off slowly (averaging 1 GB/min for the first few minutes, compared to 7 GB/min normally), and shortly afterwards speeds began to fall off fast. The first 10 minutes got me to 30% captured, the next hour was spent getting to 42%, and once the speeds dropped below the 50 MB/min mark, I killed the task. A buddy of mine who also uses FOG suggested I change my fog_boot_exit_type setting from sanboot to grub. I’m running a new capture now, but it’s looking identical to the last one I ran.
-
@MotherFogger This leads me to think there’s a problem with the FOG server or with the network the data is being transferred over. I am 100% certain changing the boot exit type won’t have any effect on capturing/deploying an image at all.
-
@MotherFogger said in Capturing image always hangs fog 1.4.4:
the transfer rate drops from about 7gb/min steadily all the way down to almost nothing.
When the transfer rate just keeps slowly dropping, what you’re seeing is an average calculated from elapsed time, remaining time, and overall completion. It’s a thing that partclone does (and is sort of stupid). In reality, when this drop-off begins, nothing is actually transferring to or from that particular host.
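To illustrate why the displayed rate keeps sinking while nothing moves, here is a rough sketch. It assumes the rate is simply the data copied so far divided by elapsed time, which is a simplification of what is described above. The helper avg_rate and all the numbers are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: average rate in MB/min = data copied so far / elapsed minutes.
avg_rate() {
    awk -v mb="$1" -v min="$2" 'BEGIN { printf "%.1f\n", mb / min }'
}

# Made-up scenario: 42000 MB copied in the first 6 minutes, then the transfer stalls.
avg_rate 42000 6     # transfer still moving
avg_rate 42000 60    # same byte count, 54 minutes after the stall
avg_rate 42000 600   # same byte count, hours after the stall
```

The three calls print 7000.0, 700.0, and 70.0: the byte count never changes after the stall, yet the average keeps falling as the clock runs.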
Check the free space on your server:
df -h
and look for partitions with 99% or 100% usage, or close to full.
Try capturing from a different machine as a control, even a different model, and see how that goes. Keep trying different things and keep troubleshooting the problem; you’ll find the issue. It might even be a duplicate IP or something dumb like that. It could be a bad port on the switch. It could be anything, is the point I’m making. You just have to keep troubleshooting with various tests to see what works and what doesn’t to isolate the problem.
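If eyeballing the df output is error-prone, something like the following sketch can flag nearly full filesystems. It assumes the POSIX df -P column layout (capacity in column 5, mount point in column 6) and mount points without spaces. The function name nearly_full and the 90% default are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: reads `df -P` output on stdin and prints every mount
# point whose usage is at or above the threshold (default 90 percent).
nearly_full() {
    threshold="${1:-90}"
    awk -v t="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "99%" -> "99"
            if (use + 0 >= t) print $6
        }'
}

# Usage on the FOG server (prints nothing if no filesystem is that full):
#   df -P | nearly_full 95
```

No output means no filesystem reached the threshold, which would point away from disk space as the cause.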
-
@Wayne-Workman All drives have between 98 and 100% free space, plenty for storing an image. Is there a way I can check on the server to see what information is currently being transferred? I’ve tried with multiple hosts, multiple images, multiple cables, etc. I don’t think it’s a duplicate IP or the port on the switch, though I am looking into a possible network config issue. Tomorrow I’m going to try to create a VM on my farm and see if capturing it directly from there solves this issue.
-
@MotherFogger As Wayne already mentioned, the transfer rate is just an average based on elapsed time. Please pay attention to the actual bytes being read/transferred in the partclone view. Do those still rise?
Are you able to ping that client or connect via SSH when the transfer rate drops? It would also be interesting to see whether any packets are still being transferred between the server and that client when this is happening. Would you please capture a packet dump shortly after the transfer rate starts to drop? Run the following command on your FOG server:
tcpdump -w /tmp/dump.pcap ip x.x.x.x
(put in the client’s IP address instead of x.x.x.x). After maybe 20 to 30 seconds, stop the dump (Ctrl-C), upload the dump.pcap file, and post a link here. In case you don’t want to upload it publicly, I’ll send you my mail address as a private message here in the forum as well. Whichever you like.
-
@Sebastian-Roth Thank you. Yes, I am still able to ping clients even when the transfer looks like it’s stopped. I let a capture run all night; it took almost 18 hours, but it did eventually complete. I captured the log, but don’t exactly know how to extract it for upload. If this was a Windows server I could just xcopy it. What’s the best way to get the dump off the server and onto another location?
-
@MotherFogger There is SCP (e.g. use WinSCP from one of your Windows clients) to copy files from the FOG server to your client. Or use FTP (e.g. FileZilla). For both you should be able to log in using the fog user account. Find the password in the FOG web GUI in FOG Configuration -> FOG Settings -> TFTP Server -> FOG_TFTP_FTP_PASSWORD (use the eye symbol)…
-
@Sebastian-Roth I appreciate the quick response, just sent the dump off to your email. Thanks for all the help, hopefully we can get this issue tracked down and resolved quickly.
-
@MotherFogger Ok, this looks really strange. I see thousands of TCP Dup ACK packets for TCP port 2049 - which is NFS! After about 30 seconds those Dup ACKs stop (maybe because the server - NFS service - responded) and we start over again with TCP Dup ACKs for further NFS data plus TCP Window Update packets.
To me this looks like packets are being lost or heavily delayed somewhere along the way, which causes the drop in speed because TCP needs to resend packets over and over again. Though I am not sure whether those packets are lost within the physical network or whether it has to do with the ESX server.
Did you modify the FOG server NFS service config by hand? Please post the full content of the /etc/exports file here.
I am sorry, I got the tcpdump syntax wrong. Should’ve been
tcpdump -w /tmp/dump.pcap host x.x.x.x
(then we would have seen packets going both ways)
-
@Sebastian-Roth So I did create a VM on my farm and ran a deploy to that. The imaging process took under a minute and was very smooth, no issues at all. I have not modified any NFS configs; I assume they’re all default/auto-managed. I have a copy of the exports. Would you like me to re-run the tcpdump with the new syntax?
https://ufile.io/fpcep - Exports file
-
@MotherFogger Interesting to hear that it’s all working fine when you stay within the ESX VM environment. As well the NFS config (exports) looks good to me.
What’s in between the client and the FOG server? Some kind of router or layer 3-7 switch that might interfere here? Yes, please take another packet dump with the new syntax. The packet dump file may grow a little bigger this time, but it’s definitely worth it. If it grows to 5 MB or more, you might upload the file somewhere and send me the link via mail.
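If stopping the dump by hand is inconvenient, a time-limited capture could look roughly like the sketch below. It assumes GNU coreutils timeout is installed on the FOG server. capture_cmd is a hypothetical helper, and 192.0.2.10 is only a placeholder for the client’s IP address. The helper just prints the command so it can be checked before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: prints a tcpdump command line that captures traffic to
# and from one host and stops by itself after a number of seconds.
#   $1 = client IP, $2 = seconds (default 30), $3 = output file
capture_cmd() {
    client_ip="$1"
    seconds="${2:-30}"
    out="${3:-/tmp/dump.pcap}"
    # `timeout -s INT` sends SIGINT when the time is up, like pressing Ctrl-C.
    printf 'timeout -s INT %s tcpdump -w %s host %s\n' "$seconds" "$out" "$client_ip"
}

# Review the command first, then run it as root, for example:
#   capture_cmd 192.0.2.10
#   sudo sh -c "$(capture_cmd 192.0.2.10)"
```

The host filter matches packets in both directions, which is the point of the corrected syntax.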