Imaging Jobs Freezing
-
@atarone Let’s take down that LAG group. For testing, let’s keep it simple: just a single GbE link. We have to start eliminating where the issue isn’t, to find out what’s left.
-
@george1421 I took down the LAG and connected one Cable to a different port on our switch and I am still getting the same issue. The target freezes after about 1 minute. Let me know what you think.
Thanks,
Anthony
-
@atarone Well, let’s see if we can identify what we know so far. Please correct any assumptions I’ve made.
- Different images will pause/freeze going to the same target.
- The same image will freeze going to different targets.
- Both the FOG server and target remain “on the net” and are pingable.
- You can reset the process by rebooting the target computer.
- It’s a physical server (no intervening hypervisor to deal with).
- We’ve ruled out any strangeness with the LAG.
I just thought of a process (not a solution) to test your system. It’s a bit out there, but it will tell us whether FOS is freezing or whether it’s operational and just partclone is freezing on us.
- Schedule a deployment to this target computer, but select the debug check box before submitting the task.
- PXE boot the target computer. You will see a few pages of commands on the target computer, just press enter a few times to get past them.
- On the target computer you should be dropped to a command prompt
- At that command prompt key in
ip addr show
and record the IP address of the FOS system.
- Give root a password in FOS with
passwd
and use a simple password like hello. Don’t worry: since FOS executes out of memory, this change is gone after a reboot.
- Now that you know the IP address of the target computer and have set root’s password, you should be able to connect to the target computer using putty (from a Windows computer).
- Connect to the target computer using putty and leave the session open.
- Now back on the console of the target computer key in the master script called
fog
- At each step in the process the script will pause waiting for an enter keypress. Do this until partclone freezes.
- Once partclone freezes go back to your putty session and key in
ls -la /images
and see if you get a response.
This will tell us if the target computer can still reach the images stored on the FOG server. Once we know this we can choose a direction.
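The last check can be wrapped in a time limit so that the putty session does not hang too if the /images mount has stalled. This is only a minimal sketch: it assumes a `timeout` command is available in the FOS shell (not confirmed in this thread), and `check_images` is a hypothetical helper name, not part of FOG.

```shell
# Hypothetical helper, not part of FOG: report whether a directory listing
# completes within a time limit. Assumes `timeout` exists in this shell.
check_images() {
    dir=${1:-/images}   # /images is the path used in the steps above
    limit=${2:-10}      # seconds to wait before giving up (arbitrary choice)
    if timeout "$limit" ls -la "$dir" >/dev/null 2>&1; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}
```

Running `check_images` with no arguments once partclone has frozen would print reachable or unreachable instead of blocking on a stalled listing.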
-
@george1421 The only assumption that is incorrect is resetting by reboot does not work. When you reboot the target, it continues to try and boot off of the hard drive.
I will try the process you outlined and get back with you.
Thanks,
Anthony
-
@atarone said in Imaging Jobs Freezing:
incorrect is resetting by reboot does not work
I guess what I was getting at is that you can continue imaging if you only reset the target computer. Rebooting the fog server is not required to reimage (any) computer again.
-
@george1421 The imaging never continues. I can cancel the job in the GUI, re-schedule the task, reboot target and it will start, but lock up around the same time again. Sorry for the confusion.
Thanks,
Anthony
-
@atarone Is it ONLY this system, or multiple systems having the issue?
Sorry if this was already answered, I have been quite busy this week.
-
@Tom-Elliott No worries it has been crazy here too. This is happening on multiple systems.
-
@george1421
@Tom-Elliott I followed these steps, and when PartClone freezes I lose SSH connectivity to the target and pings to it time out. I checked the switch that it is connected to, and the port stays up and error free. Changing cables makes no difference. Could we be hitting a bug or driver error? Please let me know your thoughts.
Thanks,
Anthony
-
I don’t see you actually say that you’ve tried using more than the one image.
And have you tried manually copying the image from the server to another hard drive?
-
@sudburr This issue occurs with different images to different devices. I am using this device/image combination because it is the smallest image and the most critical one I have. I can copy images via SCP from the server to my workstation.
Thanks,
Anthony
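A related check that can be run on the FOG server itself is to read every file of an image from start to finish and discard the data. A read error or a stall would point at the server’s storage rather than the network. This is a rough sketch only; `read_image` is a hypothetical helper, and the directory you pass it is whatever image folder you want to test.

```shell
# Hypothetical helper: read every file under an image directory and discard
# the data, to confirm the disk can deliver the whole image without errors.
read_image() {
    dir=$1
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 2; }
    # find returns non-zero if any cat invocation fails to read a file
    if find "$dir" -type f -exec cat {} + >/dev/null; then
        echo "read OK"
    else
        echo "read FAILED"
        return 1
    fi
}
```

Timing the call (for example with `time`) also gives a rough idea of how fast the server can read the image locally.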
-
@atarone Sorry I’ve been unavailable almost all day.
OK, so your target is “lunching-out”. You lose your ssh session and the system is unpingable. So it’s sounding like the FOS kernel is crashing or there is a network issue.
Your network is 100% GbE including the link to the workstation.
FOS is a multi-tasking, multi-user OS. You should not be able to take it down. A single thread or task may freeze but the OS should keep running. A hardware issue will take down a multi-tasking OS.
In your picture you are deploying to an NCR device. Are these the only devices you are deploying to?
I can say the test I set up did cover all of the bases. It didn’t give us an answer other than that the OS is freezing.
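To pin down the order in which things fail, a timestamped probe can run from another machine while the deployment is in progress. This is a rough sketch, assuming a POSIX shell on the observing machine; `probe` is a hypothetical helper, TARGET_IP is a placeholder, and the ping flags in the commented example are the Linux iputils ones.

```shell
# Hypothetical helper: run any probe command and print a timestamped
# up/down line, so the moment a service stops answering is on record.
probe() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then state=up; else state=down; fi
    printf '%s %s %s\n' "$(date '+%H:%M:%S')" "$label" "$state"
}

# Example loop (TARGET_IP is a placeholder for the address from ip addr show):
# while sleep 1; do probe ping ping -c 1 -W 1 "$TARGET_IP"; done
```

Comparing the first "down" timestamp with the moment partclone stops updating on the console would show whether the network drops before, after, or together with the freeze.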
-
@george1421 Not a problem. I have been out most of the day myself. Yes, once PartClone freezes I lose all connectivity to the target. We are GbE with the exception of the NCR kiosk, which I think is only 10/100. But other images I deploy to other PCs are GbE all the way through, and I still have the issue. I am using the NCR because it is the most critical at this point and it’s the smallest image, so it is easier to troubleshoot with.
-
@atarone Well, this is a bit challenging. I have to think it’s something in your environment because (to this point) no one else has reported this issue.
I have two thoughts on this.
- Put the target computer on the same switch as the FOG server for testing. This will (should) eliminate any off core switch networking issues.
- It’s still not clear in my mind that FOS is actually freezing. What we do know is that the console session is locked because partclone is waiting for data, and that the network interface went offline because you can’t communicate with it.
With a traditional Linux OS in command line mode there are multiple consoles enabled, and you can switch between them using the Alt-Fx keys (I think). In the AM I’ll boot FOS into debug mode to see if FOS supports multiple consoles. If I can switch to another console, then we might be able to gain access to a command prompt. If that’s the case, then FOS is running and just the network subsystem went offline. I’m not sure what that will tell us other than it’s not a FOS-specific issue.
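If a second console does turn out to be reachable, the kernel’s view of the network interface can be read from sysfs, which would help separate a dead link from a dead OS. This is a minimal sketch, assuming FOS exposes the standard Linux /sys/class/net tree; `link_state` is a hypothetical helper and eth0 is only an example interface name.

```shell
# Hypothetical helper: print the kernel's operational state for an interface,
# or "missing" if the interface (or sysfs) is not present.
link_state() {
    iface=$1
    f="/sys/class/net/$iface/operstate"
    if [ -r "$f" ]; then
        cat "$f"        # typically up, down, or unknown
    else
        echo "missing"
    fi
}
```

For example, `link_state eth0` followed by `dmesg | tail` on that console might show whether the driver logged a reset or link-down message around the time of the freeze.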
-
Try taking the NCR device and any other non Gb device off the network then try again.
Are you able to isolate the FOG server to its own subnet/vlan and work within it with purely Gb devices?
-
@sudburr I tried that before starting this thread, as I thought I was having a network problem. The FOG server already resides in its own vlan; we set it up that way on day one.
Thanks,
Anthony
-
@atarone said in Imaging Jobs Freezing:
The FOG server already resides in its own vlan; we set it up that way on day one
And if you put a target computer on the same vlan on the same switch does it freeze?
-
@george1421 I thought it was my environment before I opened this forum topic. To this point I have tried the following without any success:
1.) Imaging different targets (different PC, different image)
2.) Removing the bond interface on the server and using a single GbE link
3.) Putting the target on the same switch as the server
4.) Using a completely different switch
5.) Removing non-GbE devices
I am not opposed to standing up a totally new server and testing, but it will take some time. This worked without any issue until this past Tuesday. We updated to FOG 1.4.0 in late May or early June and were imaging everything without issue. When we hit the issue, we updated Linux and tested, then updated to FOG 1.4.2 and tested. As a side note that may have nothing to do with this, my server is still saying I am not running the latest version of FOG. I know there was a forum post that addressed that issue, but I just thought I would throw it out there.
Thanks,
Anthony
-
@george1421 Yes. All imaging is done in the same vlan. So the target and server are on the same vlan.
-
@atarone Well, I really think we are at the point I call divide and conquer. We’ve tested about all I can think of with this current setup. Spinning up a new FOG server on a desktop (if you don’t have the hardware) would be the next step. Place the new FOG server on a dedicated switch (unmanaged will do), plug the unmanaged switch into your business switch, and plug a target computer into the unmanaged switch. See if that setup will image correctly. If that works, then attempt to image a computer across your campus (again with the test FOG server). For the sake of this testing, just capture a new image to this new FOG server. It doesn’t have to actually run on the target. Our goal here is to get the image deployed completely. If that works, then use a known image and deploy that. We are at the point where we may have to go with a greenfield approach to finding the root of the issue.