Image Capture seems to hang on "Cloning Successful"
-
The reason the speed seems odd to me is that speeds were significantly faster with the old server. A year or two ago, I could capture a similar image in 15-20 minutes. I wish I had more current history, but I haven’t been in control of this server or the imaging process for a year or more. Since I’m the one who originally set it up, when it broke it got dropped right back into my lap.
The compression number was left as default, which shows as 6 currently.
I have not tried to image in the server room, that’s something I could attempt, and I won’t try to argue against infrastructure, as part of this unfolding drama was a series of power failures about 2 weeks ago.
The Proxmox server is not busy. It has one secondary Domain Controller for Active Directory, and now this image server. The goal is to get 1-2 more servers built onto Proxmox, but currently, the old stack is still primary. Even when everything is on this server it should be a minor load, as it’s an isolated network supporting 4 classrooms with a maximum of 20 laptops per classroom.
I am currently running this server on a single NIC, though. I need to sort that out soon, and I hope it isn’t already causing problems. I wish it were as easy for me to get the equipment I need as some people seem to think it is (eyeroll). But that’s not your problem; I just need to either find some more Ethernet NICs or get my superiors to spring for fiber (which is what 2 of the 4 NICs in the server are, yay for hand-me-down equipment).
Okay, so sounds like the next step is to try capturing from the server room and check to see if it’s raw image capture. Hopefully, my robocopy finishes soon and I can try that.
As a side note, thank you for your help. I wish I had better answers for you, but I’m a completely self-taught admin, and as such my knowledge has huge holes in it, as I learn what I need to put out whatever fire happens to be dropped in my lap.
-
@flipwalker As you can tell from my questions, I haven’t been able to narrow in on a single problem so they are a bit all over the place.
If we look at building a truth table we still don’t know a lot. The problems I see:
- Could be the vm host server CPU or underlying disk infrastructure
- Could be the physical network interface on the vm host server
- Could be the virtual network interface on the vm
- Could be the networking infrastructure
- Could be the network jack or cable on the target end
- Could be the target computer
There are tests we can run to rule out disk subsystem and network issues. One quick check is swapping the target computer for a different model, just to see if we get better performance from the same network jack.
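The disk-subsystem test mentioned above could be approximated with a short script run on the FOG server VM. This is a minimal sketch, not a FOG tool: the function name `write_throughput_mib_s` and the default sizes are made up for illustration, and it assumes Python 3 is available on the server. It writes a temporary file, forces it to disk, and reports the rate in MiB/s.

```python
import os
import tempfile
import time


def write_throughput_mib_s(directory, total_mib=64, chunk_mib=4):
    """Write about total_mib of data to a temp file in `directory`; return MiB/s.

    Hypothetical helper for illustration only, not part of FOG.
    """
    chunk = os.urandom(chunk_mib * 1024 * 1024)  # random data defeats easy compression
    chunks = total_mib // chunk_mib
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter()
        for _ in range(chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the disk, not just the cache
        elapsed = time.perf_counter() - start
    # The temp file is deleted automatically when the "with" block exits.
    return (chunks * chunk_mib) / elapsed
```

Calling `write_throughput_mib_s("/images")` would exercise the image store, assuming images live under `/images` (adjust the path to your setup). A result far below what the RAID10 array should sustain would point toward the VM disk layer rather than the network.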
-
I understand completely. I’ll try to provide feedback as I can.
-
I would personally doubt that this is the issue. The server is quite overpowered for what it’s being asked to do (72 x Intel Xeon CPU E5-2697 v4 @ 2.30GHz (2 Sockets), 256GB RAM, RAID10 array of drives).
-
This could be the issue, or at least I certainly can’t think of a way to eliminate it.
-
Again, I can’t eliminate it, as I’m far from an expert in Proxmox.
-
This I think I can at least put as a low probability. I’m currently running another image, and it’s showing the same symptoms. For instance, it went to 15% complete in 20 seconds. Then hung on the same block for over 3 minutes before completing a couple more percent and hanging again.
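To make the burst-then-stall pattern easier to compare between runs, the progress could be noted by hand as (seconds, percent) pairs and summarized. The sketch below is only an illustration: `find_stalls` is a hypothetical helper, and the timestamps after the first 20 seconds are approximations of the run described above, not measured values.

```python
def find_stalls(samples, min_stall_seconds=60):
    """Return (start_s, end_s, percent) for each period with no progress.

    `samples` is a time-ordered list of (elapsed_seconds, percent_complete)
    tuples, e.g. written down while watching the partclone screen.
    """
    stalls = []
    stall_start = None
    for (t_prev, p_prev), (_t_cur, p_cur) in zip(samples, samples[1:]):
        if p_cur == p_prev:
            # No progress between these two samples: open a stall if none is open.
            if stall_start is None:
                stall_start = t_prev
        else:
            # Progress resumed: close the open stall if it lasted long enough.
            if stall_start is not None and t_prev - stall_start >= min_stall_seconds:
                stalls.append((stall_start, t_prev, p_prev))
            stall_start = None
    # A stall that is still running at the last sample also counts.
    if stall_start is not None and samples[-1][0] - stall_start >= min_stall_seconds:
        stalls.append((stall_start, samples[-1][0], samples[-1][1]))
    return stalls


# Approximation of the run above: 15% after 20 s, then roughly 3 minutes stuck.
observed = [(0, 0), (20, 15), (200, 15), (215, 17)]
print(find_stalls(observed))  # [(20, 200, 15)]
```

If the stalls land at the same percentage on every run, that would suggest something tied to a location on the source disk; stalls at different points each time would fit a network or server-side cause better.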
-
I think I can eliminate this as well, as I’ve just moved locations of imaging with the same result.
-
This is possible, I could try another laptop.
Here are two images illustrating what I’m talking about. Second image was taken just after I noticed it was going again, and then it stopped shortly after I took it.
-
-
@flipwalker OK, good, you’ve eliminated a large chunk of what could be wrong. Testing a different computer of the same model, and then a different model, on the same switch as the Proxmox server will again rule out several possibilities. This is more of a hunt for where the problem isn’t, so that what is left is easier to debug.
My intuition points to two possibilities: a bad spot or corruption on the disk, where partclone needs to slow down to read it correctly, or a bottleneck on the server’s network adapter. If it were the server NIC, though, I would expect the other VMs’ responsiveness to be impacted as well.
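One way to look for a slow region on the source disk is to read it sequentially in fixed-size chunks and note which reads take unusually long. The sketch below assumes the laptop is booted into a live Linux environment with Python 3; `find_slow_reads` and its thresholds are hypothetical, and reading a raw device such as a whole disk normally requires root.

```python
import time


def find_slow_reads(path, chunk_bytes=4 * 1024 * 1024, slow_seconds=1.0):
    """Read `path` start to end; return (byte_offset, seconds) for slow chunks.

    Hypothetical helper for illustration only. `path` can be a regular
    file or a block device.
    """
    slow = []
    offset = 0
    # buffering=0 gives unbuffered reads, so each read() is timed on its own.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.perf_counter()
            data = f.read(chunk_bytes)
            elapsed = time.perf_counter() - start
            if not data:  # end of file or device
                break
            if elapsed >= slow_seconds:
                slow.append((offset, elapsed))
            offset += len(data)
    return slow
```

Slow chunks clustered at the same offsets on repeated runs would support the bad-spot theory. Data already held in the operating system’s cache reads quickly regardless of disk health, so a first pass after a fresh boot is the most telling.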
-
So I’ve been doing some more experimentation. I just tried a different laptop of the same model. I deployed the image that I had successfully captured (though very slowly). It deployed okay. Not as fast as I’d hoped, but it seemed fairly steady at least.
I then made a few minor changes (that I had forgotten initially) and attempted a recapture, putting it back on zstd. This went blisteringly fast up to 97%: within 5 minutes, it was at 97%. Then it hung again and took another 10-15 minutes to ‘complete’, but then hung on the ‘Cloning Successful’ message. I do think I noticed something new: I don’t THINK it had attempted the 500MB recovery partition. I think it had captured the first small system partition and the second ‘main’ partition with the OS, but I don’t think it got to partition 3 (recovery) or the empty partition 4 for some reason.
Regardless, I powered the laptop down and rebooted it to make sure it would. I noticed that partitions 2 and 4 hadn’t been resized back to the original, so I expanded them back out, shut down again and I’m currently trying a new capture, this time using Gzip, to see if that works. Currently, I’m 21 minutes in, and only 43% on the main partition. It’s doing the same ‘hang’ for long periods of time with 0 progress. Update: It also hung at the same point.
Does any of this help?
Edit: So I’ve /maybe/ started to narrow it down. I went back to zstd, but I decided to delete the ‘recovery’ partition as unneeded. I’ve now captured 2-3 times, in about 3-4 minutes each, except I’m still sticking on ‘Cloned Successfully’ before it hits the last partition (currently empty). I’ve tried it both as resizable and as non-resizable. The speed is what I’d expect to see, though. Does this point to anything?
-
@flipwalker said in Image Capture seems to hang on "Cloning Successful":
but then hung on the ‘Cloning Successful
Was this message posted on the blue partclone screen or on the black and white text screen? I need to know what posted ‘Cloning Successful’ so I can see what comes next in the capture script.
-
@george1421
It was black and white under the blue cloning screen, as the blue screen moved up one line, and on the bottom, there was a black and white message.
-
Any more ideas? I just went back in and deleted everything other than system created partitions and it’s still giving me this screen. Speed is where I’d expect it, but it refuses to complete the process.
-
New update. Because I was frustrated I just let the laptop sit on that screen as I worked on other things. I’m not sure how long it took, but it was eventually completed. However, when I look at the FOG image list, I’m confused again.
The presysprep image is reporting as 442 GB (should be around 30ish used, with the full partition being 442ish). I understand that this would be the client’s size, but shouldn’t it have shrunk the partition down to minimum size with resizing turned on? (like the image above it, which technically has much more data, but is reporting as a smaller size).
I’m sorry for all the questions, but I feel like I’m missing something obvious here.
-
@flipwalker In your last picture, you can see where it says Filesystem: NTFS. FOG can shrink those partitions down. If it says Filesystem: RAW, that is a block copy of the data, and the captured image size should exactly match the size of the disk you have. If your system has BitLocker enabled, or is using LVM under Linux, then FOG can’t do anything with that filesystem other than a RAW (dd) copy of the disk.
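The rule of thumb above can be written out with the numbers from this thread (a partition of about 442 GB with roughly 30 GB used). This is only a sketch of the reasoning, not how FOG computes the value it displays; `expected_client_size_gb` is a hypothetical name.

```python
def expected_client_size_gb(filesystem, partition_gb, used_gb):
    """Rough size a captured partition should report, per the rule above.

    Hypothetical helper for illustration only.
    """
    if filesystem.upper() == "RAW":
        # Block-for-block copy: every sector is captured, used or not.
        return partition_gb
    # A filesystem the capture can interpret (e.g. NTFS): only used data counts.
    return used_gb


print(expected_client_size_gb("NTFS", 442, 30))  # 30
print(expected_client_size_gb("RAW", 442, 30))   # 442
```

By that reasoning, an image reporting about 442 GB looks like the RAW case, even though the screenshot shows NTFS, which is the mismatch worth chasing.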
-
No BitLocker enabled, so I’m not sure why it’s not resizing that partition. I’m in the middle of another image capture attempt. It captured the OS partition quickly (and then took 15-ish minutes to move on to the next one after the ‘Cloned Successfully’ message), but now it’s back to stalling for no apparent reason and has dropped back down to miserable speeds.
I am entirely mystified about what is going on. It doesn’t seem to be consistent in what it’s doing (sigh). Is there something else I can grab to help identify what’s going on?