Imaging stops after client boots up
-
Rogue DHCP?
Perhaps FOG’s static IP got assigned to something else by accident?
-
You know, you could capture the traffic on the FOG machine, but depending on how far imaging gets, it might be too much data to hold in RAM.
There’s an article in the wiki called TCPDump. You could run that either on the FOG machine, or put a hub (not a switch) between the FOG server and your switch and capture the traffic on a separate device.
Kill the capture moments after the failure. Open the resulting file, and look down at the bottom… see what’s going on…
This is starting to sound a lot like a network issue… (still thinking rogue DHCP, possibly) because you had a working FOG server, and then all of a sudden, everything is not working right… And I assume if you made changes on the FOG server prior to these problems, you would have brought that up here.
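If it helps, here is a rough sketch for checking a capture for more than one DHCP responder. It assumes the capture was saved in the classic pcap format on an Ethernet link (for example with something like `tcpdump -i eth0 -w issue.pcap port 67 or port 68`, where the interface and file names are only placeholders). The function name is made up for illustration, and the parser deliberately ignores VLAN tags, IP fragments and pcapng files.

```python
import struct
from collections import Counter


def dhcp_reply_sources(pcap: bytes) -> Counter:
    """Count BOOTP/DHCP replies per (source MAC, source IP) in a classic pcap file."""
    magic = pcap[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"  # file written on a little-endian machine
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")

    linktype = struct.unpack(endian + "I", pcap[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled in this sketch")

    sources = Counter()
    pos = 24  # skip the 24-byte global header
    while pos + 16 <= len(pcap):
        # Per-packet header: ts_sec, ts_usec, incl_len, orig_len
        incl_len = struct.unpack(endian + "I", pcap[pos + 8:pos + 12])[0]
        frame = pcap[pos + 16:pos + 16 + incl_len]
        pos += 16 + incl_len

        # Need Ethernet (14) + minimal IPv4 (20) + UDP (8) + first BOOTP byte
        if len(frame) < 43 or frame[12:14] != b"\x08\x00":
            continue  # too short or not IPv4
        ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
        if frame[14 + 9] != 17:
            continue  # not UDP
        udp = 14 + ihl
        if len(frame) < udp + 9:
            continue
        src_port = struct.unpack("!H", frame[udp:udp + 2])[0]
        bootp_op = frame[udp + 8]  # 1 = request, 2 = reply
        if src_port == 67 and bootp_op == 2:
            mac = frame[6:12].hex(":")  # needs Python 3.8+
            ip = ".".join(str(b) for b in frame[26:30])
            sources[(mac, ip)] += 1
    return sources
```

Calling `dhcp_reply_sources(open("issue.pcap", "rb").read())` returns a counter keyed by (MAC, IP). More than one key is not proof of a rogue server (relays and proxyDHCP also answer from port 67), but any address you don’t recognise is worth a closer look.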
-
You should also try to run the script found in this post:
[url]http://fogproject.org/forum/threads/lets-make-scripts.12551/#post-46119[/url]
It’s an info-gathering script. You can post the results here in a .zip file, but I also encourage you to look through it yourself, too.
-
Rogue DHCP: that was my first thought too. We have had a similar situation before (not with FOG), so we went and tested for it. As far as our tests went, that is not the problem here. For the fun of it, we are now running lots of tests on different hardware to see whether any pattern shows up, and the latest test machines showed this: once the problem occurs on a box, nothing changes it (10-12 reboots, no change on the same box). BUT, because there is always a but… when it gets stuck after the .xz file loads, I get a root@146-like prompt, and from there I can ALWAYS carry on with the imaging by starting ./linuxrc. Always! So if this is a network issue at all, it is only in the boot process of the OS that iPXE downloads. That may be a clue.
We are now collecting motherboard info to list working and non-working machines, and we still keep finding machines in both groups.
I even replaced the FOG server machine with another one (not the best solution, but I moved the disk to other hardware). Same issue: sometimes it works, sometimes not. Pretty frustrating.
I will go through the suggestions you made above as soon as I am able, but as always, we have many other things to do as well. And since we were past the testing phase and this system went into production, it is not ideal to fall back to a previous version or restore a backup; but I think I will not be able to avoid that…
-
What version of FOG are you running? What OS distribution?
Also, before you do any more of these labor-intensive tests, can you try to re-run your installer?
-
I ran troubleshoot.sh just now, with a small change to the host IP to make it less… well, public. Here is the log it generated.
-
[quote=“Foglalt, post: 46977, member: 26236”]I ran troubleshoot.sh just now, with a small change to the host IP to make it less… well, public. Here is the log it generated.
[url]http://pastebin.com/1PKpRzmN[/url][/quote]
The first thing I noticed is that a ton of your services aren’t even running/reporting (unless you deleted this from the output??):
[CODE]----------------------RPCBind status below
----------------------NFS status below
----------------------Firewall status below
----------------------FOGMulticastManager status below
----------------------FOGSnapinReplicator status below
----------------------FOGImageReplicator status below
----------------------FOGScheduler status below
----------------------Installation log below
[/CODE]
Can you check your firewall?
Can you make sure NFS is running?
Can you make sure RPCBind is running?
Also, the FOG services should be running, too.
And your ownership for things in /images is not uniform; however, it shouldn’t matter considering everything has 777 permissions.
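To spot the silent sections quickly, something like the sketch below could be run over the report text. It assumes the report uses header lines shaped like the ones quoted above (a run of dashes, a name, then the word “below”); the function name is only illustrative. Keep in mind that an empty section only shows the script printed nothing there, which is not necessarily the same as the service being down.

```python
import re

# Header lines look like "----------------------NFS status below"
HEADER = re.compile(r"^-{5,}\s*(.+?)\s+below\s*$")


def empty_sections(report: str) -> list:
    """Return the names of report sections that have no output under them."""
    empty = []
    current = None      # name of the section we are currently inside
    has_body = False    # whether that section has any non-blank line
    for line in report.splitlines():
        match = HEADER.match(line.strip())
        if match:
            if current is not None and not has_body:
                empty.append(current)
            current, has_body = match.group(1), False
        elif line.strip():
            has_body = True
    # The last section has no following header to close it
    if current is not None and not has_body:
        empty.append(current)
    return empty
```

Fed the excerpt above, this would list every one of those headers, since none of them has any output under it.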
-
Ask anything! All questions are welcome and may bring us closer to a solution. By the way, some of the things you mentioned looked strange to me too (I was puzzled when I saw them before posting the log, but I thought I might just not know enough about what is normal and what is not).
- Firewall: iptables is completely empty (iptables -L shows only the default output)
- NFS can be reached (as I said, linuxrc can read from it when started manually)
- rpcbind has processes (well, I am not a pro at all of this, so I only checked for processes; what should I look for?)
- FOG services: well, it surprised me to see this, I hadn’t noticed. How could I have managed that? It was a stock installation, so what went wrong without any notice…!? OK, so how do I redo them without making a mess?
- Ownership of the image files is not uniform, yes, because after testing went OK, some images were copied over from the old storage for further testing of new images against old ones.
As for the question about removed parts of the log: nothing was removed, only some personal info is hidden.
-
Can you take a picture of the error you’re seeing on clients? Or, the screen it sits on? Post the picture here, please.
-
Well, actually nothing special (sorry, today I had zero time to test and take a photo). After the init files are downloaded (the .xz, for example), it drops to a normal-looking root prompt. (I will take a photo next week, by the way.)
-
Well, I am sure what I did is not a debug-friendly solution, but I had to get my FOG server back to a working state. So I replaced the machine, did a full reinstall, and redid the personalized changes on the server, and then, ta-da, everything works like a charm. (By the way, I am keeping both the current FOG server and the buggy one so I can do a bit-by-bit comparison and see what the difference is. Until then we will know nothing, unfortunately.)
Wayne, I really appreciate your attention. If I find any detectable and, best case, reproducible reason for my issue, I will post it and send a bug report if needed.
Best
-
UPDATE: sad news, especially for me. Case reopened: the fresh installation, on a different box, with absolutely zero modifications (only PXE menu access was set), shows THE SAME ISSUE… Which means I smell some bug in the infrastructure that the old version was resistant to (or the change came after v1.2 was installed).
I would welcome any debugging methods to find where the problem lies (surely not in the basic FOG setup, and probably not in the server hardware, as two totally different installs show the same problem).
Any suggestions?
-
Try Trunk? [url]http://jbob.io/wiki/index.php/Upgrade_to_trunk[/url]
Couldn’t hurt at this point.
-
Tom thinks the drive HAD a GPT, but the update made it think it has MBR and that broke it. And Windows doesn’t properly remove GPT fragments from the disk.
You can use fixparts to repair it.
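Before running fixparts, a quick way to see whether a disk really carries leftover GPT data is to look for the GPT header signature where it normally lives. The sketch below is only an illustration: it works on a bytes object (for example a small disk image), assumes 512-byte sectors, and the function names are made up. On a real disk you would read just the first two sectors and the last one rather than the whole device.

```python
GPT_SIGNATURE = b"EFI PART"  # first 8 bytes of a GPT header


def gpt_remnants(disk: bytes, sector_size: int = 512) -> dict:
    """Report where GPT header signatures appear and whether the MBR is protective."""
    mbr = disk[:512]
    has_mbr = mbr[510:512] == b"\x55\xaa"  # MBR boot signature
    # Partition type byte of each of the four primary MBR entries
    types = [mbr[446 + 16 * i + 4] for i in range(4)] if has_mbr else []
    last = len(disk) - sector_size  # the backup GPT header sits in the last sector
    return {
        "protective_mbr": 0xEE in types,
        "primary_gpt": disk[sector_size:sector_size + 8] == GPT_SIGNATURE,
        "backup_gpt": disk[last:last + 8] == GPT_SIGNATURE,
    }


def looks_like_stale_gpt(disk: bytes, sector_size: int = 512) -> bool:
    """True when GPT headers exist but the MBR does not announce a GPT disk."""
    found = gpt_remnants(disk, sector_size)
    return not found["protective_mbr"] and (found["primary_gpt"] or found["backup_gpt"])
```

If `looks_like_stale_gpt` returns True for an image of the target disk, that would at least be consistent with the leftover-GPT theory; it does not show that this is what stops the imaging.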
-
Trunk: at first I was not sure what you were talking about; to me, trunk means something different. So you suggest moving to the latest version?
GPT: hm… so you mean that if we run fixparts, we can try a deploy and see if it helps? Well, it is always worth a try. My problem is that not ALL deploys fail! I hate random errors, because there are too many factors to pin down, and users don’t like it when I say “wait a while, we may solve it. Or not.”
In the past we had an MBR issue with never-used disks that came with the dealer’s recovery on them. We wiped the MBR and the deploy was OK. But that was not the same: it was version 0.32, and the actual deployment died with an error. This time it is 1.2, and after init.xz and the other file are loaded, it drops back to a prompt, from which point we can carry on with ./linuxrc. (If a GPT issue is in play, do you think it would get stuck like this?)
We will try fixparts tomorrow somehow (I have never used it as far as I remember; I hope nothing extra is needed to fix it).
-
You’d use a debug deployment to try it.
-
Well, it won’t let me rest. Can you enlighten me a bit? Surely you know this better. The boot process gets stuck, so maybe the key is somewhere here. The machine boots up, iPXE configures the environment, then fetches boot.php from the URL it should. As far as my limited knowledge goes, it gets the task list and then generates the boot menu. If I understand correctly, when the machine has a task to do, that menu is an immediate deployment. Since no deployment starts and I only get a prompt, could it be that the task list or the “boot menu” generation somehow fails?
Of course I do see active tasks, so the problem is not on this end (the server). When boot.php generates the “menu”, what exactly is the next step of booting? (Is there a step-by-step list to help me understand?)
-
The easiest way to understand what the menu is telling your client is to open the boot menu in a browser. It’s generated as a plain text file accessible via URL.
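As a rough sketch of the same check done from a script: the helper below builds the URL and picks the interesting lines out of the returned text. The `/fog/service/ipxe/boot.php` path and the `mac` query parameter are assumptions based on a default install, so adjust them to whatever your iPXE client actually requests; both function names are made up for illustration.

```python
from urllib.parse import urlencode


def boot_menu_url(server: str, mac: str, web_root: str = "/fog") -> str:
    """Build the boot.php URL for one client (path and parameter are assumptions)."""
    return f"http://{server}{web_root}/service/ipxe/boot.php?" + urlencode({"mac": mac})


def summarize_ipxe_script(text: str) -> dict:
    """Pull out the lines that show whether the client gets a menu or a task."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {
        "is_ipxe_script": bool(lines) and lines[0].startswith("#!ipxe"),
        "kernel_lines": [line for line in lines if line.startswith("kernel ")],
        "has_menu": any(line.startswith("menu") for line in lines),
    }
```

Fetching the text with `urllib.request.urlopen(boot_menu_url("192.168.1.10", "00:11:22:33:44:55")).read().decode()` (server address and MAC are placeholders) and passing it to `summarize_ipxe_script` shows at a glance whether the server hands the client a menu or a kernel line with task arguments.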
-
I forgot to post a screenshot of the “stop point” where things hang after imaging stops. As I said, there is no error at all; it just stops here. I also looked a bit into the MBR/GPT possibility: I put some old HDDs into a failing machine to see what happens (no GPT has ever touched those disks, and this was a lot faster than removing something I have never seen).
What do you think?
-
Would you be willing to upgrade to the Trunk version?
Read more here:
https://wiki.fogproject.org/wiki/index.php/Upgrade_to_trunk
This method (IMHO) is the simplest.
https://wiki.fogproject.org/wiki/index.php/SVN