Imaging Jobs Freezing
-
@sudburr This issue occurs with different images deployed to different devices. I am using this device/image combination because it is the smallest image and the most critical one I have. I can copy images via SCP from the server to my workstation.
Thanks,
Anthony
-
@atarone Sorry I’ve been unavailable almost all day.
OK so your target is “lunching-out”. You lose your ssh session and the system is unpingable. So it’s sounding like the FOS kernel is crashing or there is a network issue.
Your network is 100% GbE including the link to the workstation.
FOS is a multi-tasking, multi-user OS. You should not be able to take it down. A single thread or task may freeze but the OS should keep running. A hardware issue will take down a multi-tasking OS.
In your picture you are deploying to an NCR device. Are these the only devices you are deploying to?
I can say the test I set up did cover all of the bases. It didn’t give us an answer other than that the OS is freezing.
-
@george1421 Not a problem. I have been out most of the day myself. Yes, once PartClone freezes I lose all connectivity to the target. We are GbE with the exception of the NCR kiosk, which I think is only 10/100. But other images I deploy to other PCs are GbE all the way through and I still have the issue. I am using the NCR because it is the most critical at this point, and it’s the smallest image, so it is easier to troubleshoot with.
-
@atarone Well, this is a bit challenging. I have to think it’s something in your environment because (to this point) no one else has reported this issue.
I have two thoughts on this.
- Put the target computer on the same switch as the FOG server for testing. This will (should) eliminate any off-core-switch networking issues.
- It’s still not clear in my mind that FOS is actually freezing. What we do know is that the console session is locked because partclone is waiting for data, and the network interface went offline because you can’t communicate with it.
With a traditional Linux OS in command line mode there are multiple consoles enabled and you can switch between them using the Ctrl+Alt+Fx keys (I think). In the AM I’ll boot FOS into debug mode to see if FOS supports multiple consoles. If I can switch to another console then we might be able to gain access to a command prompt. If that’s the case then FOS is running and just the network subsystem went offline. I’m not sure what that will tell us other than it’s not a FOS-specific issue.
-
Try taking the NCR device and any other non-Gb device off the network, then try again.
Are you able to isolate the FOG server to its own subnet/vlan and work within it with purely Gb devices?
-
@sudburr I tried that before starting this thread, as I thought I was having a network problem. The FOG server already resides in its own vlan, we set it up that way on day one.
Thanks,
Anthony
-
@atarone said in Imaging Jobs Freezing:
The FOG server already resides in its own vlan, we set it up that way on day one
And if you put a target computer on the same vlan on the same switch does it freeze?
-
@george1421 I thought it was my environment before I opened this forum topic. To this point I have tried the following without any success:
1.) Imaging different targets (different PC, different image)
2.) Removing the bond interface on the server and using a single GbE link
3.) Putting the target on the same switch as the server
4.) Using a completely different switch
5.) Removing non-GbE devices
I am not opposed to standing up a totally new server and testing, but it will take some time. This worked without any issue until this past Tuesday. We updated to FOG 1.4.0 in late May or early June and were imaging everything without issue. When the issue appeared, we updated Linux and tested, then updated to FOG 1.4.2 and tested. As a side note that may have nothing to do with this, my server still says I am not running the latest version of FOG. I know there was a forum post that addressed that issue, but I thought I would throw it out there.
Thanks,
Anthony
-
@george1421 Yes. All imaging is done in the same vlan. So the target and server are on the same vlan.
-
@atarone Well, I really think we are at the point I call divide and conquer. We’ve tested about all I can think of with this current setup. Spinning up a new FOG server on a desktop (if you don’t have the hardware) would be the next step. Place the new FOG server on a dedicated switch (unmanaged will do), plug the unmanaged switch into your business switch, and plug a target computer into the unmanaged switch. See if that setup will image correctly. If that works, then attempt to image a computer across your campus (again with the test FOG server). For the sake of this testing just capture a new image to this new FOG server. It doesn’t have to actually run on the target. Our goal here is to get the image deployed completely. If that works, then use a known image and deploy that. We are at the point where we may have to go with a greenfield approach to finding the root of the issue.
-
@george1421 I will get a lab environment set up and report back what I find. Just for the sake of argument, I did perform a capture task with the NCR with all of the normal settings and the capture task was successful.
Thanks,
Anthony
-
@george1421 I tried re-installing FOG without any success. I have a lab environment at home so I will re-create and test there. The only difference will be that the FOG server will be a VM instead of physical. I will report later next week what I find.
Thanks,
Anthony
-
@george1421
@Tom-Elliott
I set up a FOG server in a test environment. Different switch, different server, etc., and the imaging jobs are still freezing and locking up. What could be causing the client to lock up?
Thanks,
Anthony
-
@atarone Reading all the things you have already tried I think we need to tackle this from a different angle to get a step forward. So I modified the most current init.xz image to spawn a second virtual terminal. Usually FOS does not spawn a second one because there is no need to. In this case I thought this might be helpful for the first time.
On your test FOG server, download init.xz here and put it in
/var/www/fog/service/ipxe
(move the original out of the way).
Then schedule a deploy task for your test client and boot it up. It should boot just as it used to, and while FOG sets things up for imaging you should be able to switch between virtual terminals one and two with the well-known keys Ctrl+Alt+F2 and Ctrl+Alt+F1. On the second terminal hit ENTER once and you get to a shell. Here you are free to run commands while FOG is doing its thing on VT1.
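For anyone following along, the file swap could look roughly like the sketch below. `swap_init` is a hypothetical helper name, not part of FOG. The default directory is simply the path given above, and the sketch assumes you have write access to it.

```shell
#!/bin/sh
# Sketch: swap in a replacement init.xz while keeping the original.
# swap_init is a hypothetical helper; the default directory is the
# path mentioned in the post above.
swap_init() {
    new_init="$1"                                # the downloaded init.xz
    ipxe_dir="${2:-/var/www/fog/service/ipxe}"   # directory FOG serves it from

    [ -f "$new_init" ] || { echo "missing: $new_init" >&2; return 1; }

    # Move the original out of the way once; never overwrite an existing backup.
    if [ -f "$ipxe_dir/init.xz" ] && [ ! -e "$ipxe_dir/init.xz.orig" ]; then
        mv "$ipxe_dir/init.xz" "$ipxe_dir/init.xz.orig" || return 1
    fi

    cp "$new_init" "$ipxe_dir/init.xz"
}
```

Going back to the previous init later is then a matter of moving `init.xz.orig` back over `init.xz`.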
I’d start with getting VT2 ready but then switching back to VT1 and watching partclone. After freezing, are you still able to switch back to VT2? If yes, what do you see when running
dmesg
for example? If you cannot switch back to VT2 then … (I am still thinking about what to do next then!).
-
@Sebastian-Roth said in Imaging Jobs Freezing:
Usually FOS does not spawn a second one because there is no need to. In this case I thought this might be helpful for the first time.
Really nice, we could have used that for the switch console test, to see if the target was freezing or just the session was hanging. Thanks!!
-
@Sebastian-Roth
@george1421 So I updated init.xz with the one created by Sebastian-Roth and now images deploy, but Windows no longer boots after imaging. So I did a capture job, thinking that the image was just messed up. It did the capture but returned errors trying to update the database using the username “fog”. I tested that login and it is still the default. What password is it trying to use, and could this be the root cause of my deployments freezing?
Thanks,
Anthony
-
@atarone said in Imaging Jobs Freezing:
So I updated init.xz with the one created by Sebastian-Roth and now images deploy
This all sounds totally weird! Is this in your small simple test environment? The only thing I did was download the most current init.xz and add this one line to /etc/inittab:
tty2::askfirst:-/bin/bash
From my point of view it’s impossible that this change is actually solving your freezing problem!
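To illustrate how small that change is, here is a sketch of adding the entry to an inittab file. It assumes a BusyBox-style init, which the `askfirst` action suggests. `add_vt2` is a hypothetical helper, and the path you pass would be the inittab inside an unpacked copy of the init image; the unpacking itself is not shown.

```shell
# Sketch: append a second-terminal entry to a BusyBox-style inittab.
# add_vt2 is a hypothetical helper; pass the path of the inittab file.
add_vt2() {
    inittab="$1"
    entry='tty2::askfirst:-/bin/bash'

    # BusyBox fields are id:runlevels:action:process (runlevels unused).
    # "askfirst" waits for ENTER before starting the process, and the
    # leading "-" requests a login shell.
    # Append only if the exact line is not already present.
    grep -qxF -- "$entry" "$inittab" || printf '%s\n' "$entry" >> "$inittab"
}
```

The entry only adds a shell on tty2 and leaves the imaging scripts on tty1 untouched.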
but Windows no longer boots after imaging.
Which error do you see? Please post a picture!
It did the capture but returned errors trying to update the database using the username “fog”. I tested that login and it is still the default. What password is it trying to use and could this be the root cause of my deployments freezing?
Check this wiki page about database password. Again I highly doubt that this could cause the freezing!
-
This has been the weirdest issue since day one haha. This is all being done in the production environment. I got the password stuff sorted out. Below is what the target is doing. This occurs anytime FOG turns over booting to the OS.
-
@atarone I am not really sure where this is taking us. Would you mind testing with the provided init.xz file (second virtual terminal) over and over till you run into the freezing issue again? Then see if you can still switch terminals. If freezes don’t occur anymore then you might want to go back to the init.xz you had before and try again. Maybe there was something screwed up with that initrd you had. It would be good to rule that out.
I am not sure about your other issue. It looks like the deploy is not working properly. Is this a completely new image you uploaded freshly? Could you please run a debug deploy task on this client again (like when you schedule a normal deploy, but tick the checkbox for debug). When you get to the shell run
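If VT2 is still reachable after a freeze, it may help to narrow the kernel log down to lines that usually relate to the network card. A minimal sketch: `net_lines` is a hypothetical helper, and the keyword list is only a guess at what typical driver messages contain, not a definitive set.

```shell
# Sketch: keep only kernel log lines that usually relate to the NIC.
# net_lines is a hypothetical helper; it reads log text on stdin.
net_lines() {
    # "|| true" keeps the exit status clean when nothing matches.
    grep -iE 'eth[0-9]|link (is )?(up|down)|carrier|timed out|reset' || true
}
```

On VT2 this would be used as `dmesg | net_lines | tail -n 20`. If the `ip` tool is present in the FOS image (an assumption), `ip link show` would also show whether the interface still reports a link.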
fdisk -l /dev/sda
and post a clear picture of that here.
-
@Sebastian-Roth Below are the screenshots from fdisk -l.
I will be running the tests shortly with old init.xz and will get back to you.
Thanks,
Anthony