Image Capture seems to hang on "Cloning Successful"
-
I made sure to download the latest versions of the kernels yesterday as part of my installation, as that was a problem I had run into in the past. So… 5.15.34, it looks like.
I don’t have an answer for this one. When I make my next attempt I’ll make sure to take a picture of it. Hopefully, the next attempt will be shortly.
There was 116 GB used during the last attempt, which took 45 minutes. All of that is on one partition; the other major partition (excluding the system-built recovery partitions) was completely empty until last night, when I started copying the rest of the files over.
The capture rate tends to start at 20ish GB/min, and steadily falls until it hovers around 1 GB/min.
The target computer is a physical laptop with 2 TB NVMe drive.
Hm. As for the FOG Clients… I’d guess 0, or nearly 0. Fresh install, different IP from the server that just failed that I couldn’t figure out. Even if all the previous clients were connected, it would be nearly 0 at the moment, as most of the laptops are put away. 100 at max load though.
-
@flipwalker ok good start on the answers.
A decline in speed from the initial 20 GB/min to an average of 1 GB/min is normal, as buffers fill up and then network congestion sets in. Still, 1 GB/min is a bit low for a well-managed 1GbE network with a modern target computer. I would expect around 5.2 GB/min or better. A 116 GB image is pretty large, and should take about 23 minutes by my estimate.
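The arithmetic behind that estimate can be checked with a small sketch. It uses only the figures quoted in this thread (116 GB, 45 minutes, 5.2 GB/min); the function names are made up for illustration.

```python
# Figures quoted in the thread (GB and minutes); helper names are illustrative.
IMAGE_GB = 116
OBSERVED_MINUTES = 45
EXPECTED_RATE_GB_PER_MIN = 5.2


def minutes_at_rate(size_gb: float, rate_gb_per_min: float) -> float:
    """Time needed to move size_gb at a steady rate."""
    return size_gb / rate_gb_per_min


def average_rate(size_gb: float, minutes: float) -> float:
    """Average rate actually achieved over the whole capture."""
    return size_gb / minutes


expected_minutes = minutes_at_rate(IMAGE_GB, EXPECTED_RATE_GB_PER_MIN)
observed_rate = average_rate(IMAGE_GB, OBSERVED_MINUTES)

print(f"expected duration: {expected_minutes:.1f} min")
print(f"observed average rate: {observed_rate:.2f} GB/min")
```

So the 45-minute capture averaged roughly half the rate that would be expected here, even though the on-screen rate hovered near 1 GB/min for much of the run.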
You don’t have enough computers with FOG clients hitting the server to overload the network connection or the server.
It would be useful to know whether partclone is capturing an NTFS or a raw image.
Also, in the image definition for this computer, what is the zstd compression number? 11 is a good balance between image size and compression time. If you have it cranked up to 22, then I understand why it’s taking a long time to compress the image.
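The size versus time trade-off of a compression level can be seen with a quick experiment. This is only a sketch: it uses Python's standard-library zlib as a stand-in, not zstd, so the level scale (1 to 9) and the numbers do not map onto FOG's zstd settings, and the sample data is synthetic.

```python
import time
import zlib

# Synthetic, compressible sample data; a real disk image will behave differently.
sample = b"".join(b"block %d of some repetitive disk content\n" % i for i in range(200_000))


def measure(level: int) -> tuple:
    """Return (compressed size in bytes, seconds taken) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(sample, level)
    elapsed = time.perf_counter() - start
    # Round-trip check so a size is never reported for corrupt output.
    assert zlib.decompress(compressed) == sample
    return len(compressed), elapsed


results = {level: measure(level) for level in (1, 6, 9)}
for level, (size, seconds) in results.items():
    print(f"level {level}: {size} bytes in {seconds:.3f} s")
```

Higher levels generally trade more CPU time for a smaller output, which is why a very high setting can make the capture CPU-bound rather than network-bound.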
Have you tried to rule out your infrastructure as a problem by trying to image a target computer on the same network switch as the server?
You are running on a Proxmox server; how busy is that server? Do you have a single network connection on that server, or do you have a LAG trunk group set up to add additional bandwidth to the server?
-
The reason the speed seems odd to me is that speeds were significantly faster with the old server. A year or two ago, I was able to capture a similar image in 15-20 minutes. I wish I had more current history, but I’ve been out of control of this server/image process for a year or more. As I’m the one who originally set it up, when it broke it got dropped right back into my lap.
The compression number was left as default, which shows as 6 currently.
I have not tried to image in the server room, that’s something I could attempt, and I won’t try to argue against infrastructure, as part of this unfolding drama was a series of power failures about 2 weeks ago.
The Proxmox server is not busy. It has one secondary Domain Controller for Active Directory, and now this image server. The goal is to get 1-2 more servers built onto Proxmox, but currently, the old stack is still primary. Even when everything is on this server it should be a minor load, as it’s an isolated network supporting 4 classrooms with a maximum of 20 laptops per classroom.
I am currently running on a single NIC from this server though. I need to figure out that issue soon, but I hope it’s not already an issue. I wish it was as easy for me to get the equipment I need as some people seem to think it is. eyeroll But that’s not your issue; I just need to either find some more Ethernet NICs, or get my superiors to spring for fiber (which is what 2 of the 4 NICs in the server are, yay for hand-me-down equipment).
Okay, so sounds like the next step is to try capturing from the server room and check to see if it’s raw image capture. Hopefully, my robocopy finishes soon and I can try that.
As a side note, thank you for your help. I wish I had better answers for you, but I’m a completely self-taught admin, and as such my knowledge has huge holes in it, as I learn what I need to put out whatever fire happens to be dropped in my lap.
-
@flipwalker As you can tell from my questions, I haven’t been able to narrow in on a single problem so they are a bit all over the place.
If we look at building a truth table we still don’t know a lot. The problems I see:
- Could be the vm host server CPU or underlying disk infrastructure
- Could be the physical network interface on the vm host server
- Could be the virtual network interface on the vm
- Could be the networking infrastructure
- Could be the network jack or cable on the target end
- Could be the target computer
There are tests we can run to rule out disk subsystem and network issues. Swapping out the target computer for a different model would also show whether we can get better performance from the same network jack.
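As one example of a disk subsystem check, a rough sequential write test can be sketched in Python. This is an illustration only and not necessarily the test meant above; the function name, sizes, and target directory are assumptions, and a dedicated benchmarking tool will give more trustworthy numbers.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory: str, total_mb: int = 256, chunk_mb: int = 4) -> float:
    """Write total_mb of data to a temporary file in `directory` and return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(total_mb // chunk_mb):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # force data to the disk, not just the page cache
        elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    # tempfile.gettempdir() is a placeholder; point this at the image storage directory.
    rate = sequential_write_mb_per_s(tempfile.gettempdir(), total_mb=64)
    print(f"sequential write: {rate:.0f} MB/s")
```

If the figure measured on the server's image storage sits far above what a 1GbE link can deliver (about 125 MB/s at most), the disk is unlikely to be the bottleneck.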
-
I understand completely. I’ll try to provide feedback as I can.
-
I would personally doubt that the VM host’s CPU or disks are the issue. The server is quite overpowered for what it’s being asked to do (72 x Intel Xeon CPU E5-2697 v4 @ 2.30GHz (2 Sockets), 256GB RAM, RAID10 array of drives).
-
The physical network interface on the host could be the issue, or at least I certainly can’t think of a way to eliminate it.
-
Again, I can’t eliminate the VM’s virtual network interface, as I’m far from an expert in Proxmox.
-
The networking infrastructure I think I can at least put as a low probability. I’m currently running another image, and it’s showing the same symptoms. For instance, it went to 15% complete in 20 seconds, then hung on the same block for over 3 minutes before completing a couple more percent and hanging again.
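To put numbers on hangs like these, progress readings noted by hand could be fed through a small stall finder. This is a sketch with made-up readings shaped like the run described above; the function name and the 60-second threshold are arbitrary choices.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (seconds since start, percent complete)


def find_stalls(samples: List[Sample], min_stall_s: float = 60.0) -> List[Tuple[float, float, float]]:
    """Return (start_s, end_s, percent) for each stretch without progress of at least min_stall_s."""
    stalls = []
    stall_start = None
    for (prev_t, prev_pct), (t, pct) in zip(samples, samples[1:]):
        if pct <= prev_pct:
            # No progress between these two readings: open a stall if none is open.
            if stall_start is None:
                stall_start = prev_t
        else:
            # Progress resumed: close the open stall if it lasted long enough.
            if stall_start is not None and prev_t - stall_start >= min_stall_s:
                stalls.append((stall_start, prev_t, prev_pct))
            stall_start = None
    # A stall still open at the last reading counts too.
    if stall_start is not None and samples[-1][0] - stall_start >= min_stall_s:
        stalls.append((stall_start, samples[-1][0], samples[-1][1]))
    return stalls


# Hypothetical readings: 15% after 20 s, then no movement for about 3 minutes.
readings = [(0, 0.0), (20, 15.0), (80, 15.0), (140, 15.0), (200, 15.0), (230, 17.0), (260, 17.0)]
print(find_stalls(readings))
```

Comparing where the stalls fall across several captures would show whether they recur at the same percentage (pointing at the disk contents) or at varying points (pointing at the network or server).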
-
I think I can eliminate the network jack and cable on the target end as well, as I’ve just moved imaging locations with the same result.
-
The target computer is possible; I could try another laptop.
Here are two images illustrating what I’m talking about. Second image was taken just after I noticed it was going again, and then it stopped shortly after I took it.
-
-
@flipwalker OK, good, you’ve eliminated a large chunk of what could be wrong. Testing a different computer of the same model, and one of a different model, on the same switch as the Proxmox server will again rule out several possibilities. This is more of a hunt for where the problem isn’t, so that what is left is easier to debug.
My intuition is telling me there is either a bad spot or corruption on the disk, where partclone needs to slow down to read it correctly, or a bottleneck on the server’s network adapter. But I would also expect the other VMs’ responsiveness to be impacted if it were the server NIC.
-
So I’ve been doing some more experimentation. I just tried a different laptop of the same model. I deployed the image that I had successfully captured (though very slowly). It deployed okay. Not as fast as I’d hoped, but it seemed fairly steady at least.
I then made a few minor changes (that I had forgotten initially) and attempted a recapture, putting it back on zstd. This went blisteringly fast, up to 97%. As in, within 5 minutes it was at 97%. Then it hung again and took another 10-15 minutes to ‘complete’, but then hung on the ‘Cloning Successful’ message. I do think I noticed something new: I don’t THINK it had attempted the 500MB recovery partition. I think it had captured the first small system partition and the second ‘main’ partition with the OS, but I don’t think it got to partition 3 (recovery) or the empty partition 4 for some reason.
Regardless, I powered the laptop down and rebooted it to make sure it would. I noticed that partitions 2 and 4 hadn’t been resized back to the original, so I expanded them back out, shut down again and I’m currently trying a new capture, this time using Gzip, to see if that works. Currently, I’m 21 minutes in, and only 43% on the main partition. It’s doing the same ‘hang’ for long periods of time with 0 progress. Update: It also hung at the same point.
Does any of this help?
Edit: So I’ve /maybe/ started to narrow it down. I went back to zstd, but I decided to delete the ‘recovery’ partition as unneeded. I’ve now captured 2-3 times, in about 3-4 minutes each, except I’m still sticking on ‘Cloned Successfully’ before it hits the last partition (currently empty). I’ve tried doing it both as resizable and non-resizable. But the speed is what I’d expect to see. Does this point to anything?
-
@flipwalker said in Image Capture seems to hang on "Cloning Successful":
but then hung on the ‘Cloning Successful’ message
Was this message posted on the blue partclone screen, or was it on the black and white text screen? I need to know what printed ‘Cloning Successful’ so I can see what comes next in the capture script.
-
@george1421
It was black and white under the blue cloning screen, as the blue screen moved up one line, and on the bottom, there was a black and white message.
-
Any more ideas? I just went back in and deleted everything other than system created partitions and it’s still giving me this screen. Speed is where I’d expect it, but it refuses to complete the process.
-
New update. Because I was frustrated I just let the laptop sit on that screen as I worked on other things. I’m not sure how long it took, but it was eventually completed. However, when I look at the FOG image list, I’m confused again.
The presysprep image is reporting as 442 GB (should be around 30ish used, with the full partition being 442ish). I understand that this would be the client’s size, but shouldn’t it have shrunk the partition down to minimum size with resizing turned on? (like the image above it, which technically has much more data, but is reporting as a smaller size).
I’m sorry for all the questions, but I feel like I’m missing something obvious here.
-
@flipwalker In your last picture, see where it says Filesystem: NTFS. FOG can shrink those partitions down. If it says Filesystem: RAW, that is a block copy of the data, where the captured image size should exactly match the size of the disk you have. If your system has BitLocker enabled, or uses LVM under Linux, then FOG can’t do anything with that filesystem other than a RAW (dd) copy of the disk.
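The difference between the two capture modes can be put as a rough rule of thumb. The sketch below is a simplification for illustration, not FOG's actual logic: it assumes a recognised filesystem is captured by used data only and ignores compression, and the function name is made up.

```python
def expected_capture_gb(filesystem: str, partition_gb: float, used_gb: float) -> float:
    """Rough uncompressed data volume for one partition.

    Assumption for illustration: a recognised filesystem is captured by used
    data only, while RAW is a block-for-block copy of the whole partition.
    """
    if filesystem.upper() == "RAW":
        return partition_gb
    return used_gb


# Approximate figures from the thread: a 442 GB partition holding about 30 GB.
print(expected_capture_gb("NTFS", 442, 30))
print(expected_capture_gb("RAW", 442, 30))
```

By this rule of thumb, a 442 GB result for a partition with about 30 GB in use looks like a whole-partition figure rather than a used-data figure.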
-
No BitLocker enabled, so I’m not sure why it’s not resizing that partition. I’m in the middle of another image capture attempt, and though it captured the OS partition quickly (and then took 15ish minutes to move on to the next one after the ‘Cloned Successfully’ message), it’s back to the thing where it has stalled for no apparent reason and dropped back down to miserable speeds.
I am entirely mystified about what is going on. It doesn’t seem to be consistent in what it’s doing. sigh Is there something else I can grab to help identify what’s going on?