FOG Pxe boots, doesn't register as having image to deploy
-
Hi,
So as I was fogging a classroom yesterday, I imaged about 7 or so at a time. When I got to my last 7, they wouldn’t pull an image.
It seems they will boot to fog on pxe, bring up the pxe menu, tell me they’re registered (Registered as specific hostname!) and even though they’re set to pull the image, they won’t. I have the bios/uefi exit type set to the wrong one so it won’t boot to HDD and it will auto go to a black boot screen with nothing but a flashing cursor. If I select deploy image, the screen goes black for half a second, then back to the menu again.
In tasks, these 7 show they are waiting to deploy - but they aren’t populating the storage section in active tasks (as the other 23 did - but those 23 finished fine.)
I restarted the server and checked permissions on the storage folder (/images, mounted from a RAID10); everything checks out and it’s mounted correctly.
TIA!
Scott
-
I guess first of all lets collect a bit more information.
- Were you using multicast or unicast imaging?
- What’s unique about the last group of 7 computers?
- Are these 7 computers uefi or bios mode firmware?
- Are these last 7 computers all of the same make and model? If so what are they? (my guess is Lenovo)
- When you select deploy image, I assume you see both bzImage and init.xz being transferred to the target computer, the screen goes black and then you see the FOG iPXE menu again, is that correct?
- What version of FOG are you using? What about the FOS (FOG) kernel?
I don’t think (at this moment) that your issue is the fog server. I’m betting on the target computer firmware or something unique about this hardware.
-
Unicast, nothing was unique about these 7 however. I thought there was a hiccup on the server - here was my process:
Deploy base win10 1803 image, with fog connecting to fog-srv through wifi. Setupcomplete added a wifi profile, reenabled fog, and rebooted. Rename and join domain & reboot. Deploy of 10 or so snapins (chrome, firefox, java, adobe reader, 7zip, few more wifi profiles, centrastage agent, testing browsers, vlc) reboot.
I had a weird machine in the batch before these 7 that was stuck at renaming - kept getting in a reboot loop. best thing I could find in the log was something about failure to authenticate after it got the cert…but kept trying to figure that one out, then went back to the 7 in question.
They’re all dell latitude 3340 or something similar. I’ve got about 200 of them. Successfully did about 60 before these 7. They’re all UEFI but booting to legacy network, with a UEFI image.
Honestly can’t see much when I select deploy image from the menu because its so fast. I usually don’t boot to menu and select that, I create the task from the webserver and just boot to PXE, then slide them back in their carts watching the LED’s to see when they finish (tasked with shutdown)
I installed everything this past weekend so 1.5.4 and I can’t tell you the kernel right now…its built on linux mint 18.3 newest kernel for it 4.13 or so because 4.15 breaks my raid drivers.
I keep having an issue occasionally where if I boot ~10, the first 6 (max) will boot into the blue imaging screen. 2 or so more will get to waiting in line, 2 more will get stuck at loading boot file, then the ones that image all finish imaging and get to updating database and fail…I thought the issue was with a crappy netgear switch I was using, started smaller deployment groups and that seemed to work but almost right before this batch I had that issue - had to restart then too. Was successful in imaging 7 or 8 more though.
-
@p4cm4n said in FOG Pxe boots, doesn't register as having image to deploy:
Boy there is so much going on here… Please understand I don’t know your environment so I have to build a picture of it in my mind with the relevant information.
Deploy base win10 1803 image, with fog connecting to fog-srv through wifi. Setupcomplete added a wifi profile, reenabled fog, and rebooted. Rename and join domain & reboot. Deploy of 10 or so snapins (chrome, firefox, java, adobe reader, 7zip, few more wifi profiles, centrastage agent, testing browsers, vlc) reboot.
How are you deploying the image over wifi since FOS does not support wifi by design?
They’re all dell latitude 3340 or something similar. I’ve got about 200 of them. Successfully did about 60 before these 7. They’re all UEFI but booting to legacy network, with a UEFI image.
Do these 7 have the same firmware release as the previous 60? Just considering the XX40 series laptops are about 4+ years old and the XX30 series of laptops were the first to support pxe booting over uefi.
Honestly can’t see much when I select deploy image from the menu because its so fast. I usually don’t boot to menu and select that, I create the task from the webserver and just boot to PXE, then slide them back in their carts watching the LED’s to see when they finish (tasked with shutdown)
There may be value in taking a 120fps video with a cell phone to see what is going on before it resets. I have seen this before: bzImage and init.xz get transferred to the target computer, then something goes wrong in the hand-off to bzImage and the iPXE menu is redisplayed.
I installed everything this past weekend so 1.5.4 and I can’t tell you the kernel right now…its built on linux mint 18.3 newest kernel for it 4.13 or so because 4.15 breaks my raid drivers.
This question is about which version of bzImage is installed in FOG. That is the FOS Linux kernel. FOG has its own custom Linux kernel built for fast imaging; that is the kernel I’m asking about. Since you have FOG 1.5.4, your kernel is either 4.16.x or 4.17.0. Both of those have an issue creating MBR/GPT partitions in that it takes several minutes to create that structure. The current recommendation is to downgrade to 4.15.2 for FOS. You can do this from the FOG Settings->FOG kernel menu. This isn’t your problem here, but it may be a problem based on certain hardware configurations.
Now there may be something that you should do to address another found issue after FOG 1.5.4 was released. This could potentially be the root of your issue with the fog server causing imaging to stop (again I don’t have a clear picture so this is a bit of shotgunning it too).
Search for the php-fpm configuration file (www.conf). It should be in the /etc directory but could be in a number of places depending on the version of php that’s installed. Search for the config file with this command.
find /etc -name www.conf
Edit that file and ensure the below values are set correctly.
pm.max_children = 35
pm.max_requests = 2000
php_admin_value[memory_limit] = 256M
Once they are updated reboot the FOG server. These settings will help with an out of memory condition on the FOG server.
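To double-check the edit before rebooting, something like the following could be used. This is only a sketch: `show_fpm_settings` is a made-up helper name, not part of FOG or php-fpm, and it simply prints the active (uncommented) lines for the three settings named above from whichever www.conf you pass it.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with FOG): print the active lines for
# pm.max_children, pm.max_requests and php_admin_value[memory_limit]
# from the php-fpm pool file given as the first argument.
show_fpm_settings() {
    conf="$1"
    # Lines commented out with ';' do not match, because the pattern only
    # allows leading whitespace before the setting name.
    grep -E '^[[:space:]]*(pm\.max_children|pm\.max_requests|php_admin_value\[memory_limit\])[[:space:]]*=' "$conf"
}
```

Usage would be along the lines of `show_fpm_settings /path/printed/by/find/www.conf`. If a setting is missing from the output, it is either commented out or not present in that file.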
I keep having an issue occasionally where if I boot ~10, the first 6 (max) will boot into the blue imaging screen. 2 or so more will get to waiting in line, 2 more will get stuck at loading boot file, then the ones that image all finish imaging and get to updating database and fail…
I’d have to see the actual error message here, but the updating to database is typically incorrect. But if it’s the typical condition where this happens, it would happen on every image, not just random deployments.
I thought the issue was with a crappy netgear switch I was using, started smaller deployment groups and that seemed to work but almost right before this batch I had that issue - had to restart then too. Was successful in imaging 7 or 8 more though.
Just for a point of reference, 3 unicast images sent at the same time will flood a 1GbE link between your FOG server and your network switch. A typical single unicast stream will give you a deployment rate of about 6GB/min. I tell you this so you can compare against a single unicast stream in your network. If you are getting below 3GB/min (according to the partclone screen) you might need to look a bit more into your network configuration.
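For a rough sense of where those numbers come from, here is a back-of-the-envelope calculation. It is a sketch only: the function names are made up, it uses decimal units, and it ignores protocol overhead and image compression, so the partclone figure on screen will not match it exactly.

```shell
#!/bin/sh
# Raw ceiling of a link in GB/min, given its speed in Mbit/s.
# Mbit/s -> MB/s (divide by 8) -> MB/min (times 60) -> GB/min (divide by 1000).
link_gb_per_min() {
    awk -v mbit="$1" 'BEGIN { printf "%.2f\n", mbit / 8 * 60 / 1000 }'
}

# Even share of that ceiling when several unicast streams run at once.
share_gb_per_min() {
    awk -v mbit="$1" -v n="$2" 'BEGIN { printf "%.2f\n", mbit / 8 * 60 / 1000 / n }'
}

link_gb_per_min 1000       # prints 7.50
share_gb_per_min 1000 3    # prints 2.50
```

So a single stream at about 6GB/min already uses most of a 1GbE link, and with three streams each one's share of the raw ceiling is about 2.5GB/min.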
-
First off, THANKS! I kept wondering about why it took so long to do that. Thought cuz the initial machine I was using had been used as a loaner and kept having new OS’s put on a lot. Thought the partition table was screwed up.
Okay a little more topology info.
I have a linux mint 18.3 with two nics. one on the internet enabled school network for non-booting purposes. It allows for snapins, client management, etc.
The other nic is for booting (broadcom, not sure the model) connected to 1gigabit netgear unmanaged switch. For whatever reason, i keep having network issues. whether its the building i’m in and the network hardware in place, or just my crappy ‘sitting around for 3 years doing nothing so might as well put linux on it’ hardware…the internet nic drops out after certain periods and requires me to add routing and gateway information again. not sure why.
These 7 machines have the same baseline firmware, because as a whole, most of these machines haven’t been modified since their initial deployment. So I’m not sure if they’re booting to uefi (don’t think they are) I know the PXE booting that I’m selecting is of the legacy boot options. I generally use legacy, not uefi boot mode. The win10 image is efi though.
I will attempt to do best video I can, and see what comes up. I created a VM environment to test with at home right now and see if its also server related or machine related.
I’ll also test the settings from PHP and report back, as well as kernel downgrade.
So i can’t give you an EXACT wording, but after the blue imaging screen, after the setting of the partition labels and types and uuids and such, it has a line something like updating database…Failed!. The images still work, however the tasks menu still doesn’t show it having been entirely completed (usually its stuck at a progress bar of whenever it lost connectivity.) This is probably related to the network being saturated though. I will keep the imaging at 6 or under for the time being, and set the storage limit to 4…time taken isn’t necessarily an issue as they’re completing in 2:04 for a 500GB (~15GB used) partition.
I will be back at that site on Tuesday here. Honestly, if an issue arises I may just reinstall or attempt timeshift. I was able to successfully image that one of those same machines from a fog VM here just now.
-
@p4cm4n said in FOG Pxe boots, doesn't register as having image to deploy:
updating database…Failed!. The images still work, however the tasks menu still doesn’t show it having been entirely completed (usually its stuck at a progress bar of whenever it lost connectivity.)
Typically when we see this error it is because someone has messed with the linux user fog account. This is not the default webui admin fog user, but the linux service account called fog. If someone changes that password then you will get an update database failed. You can confirm this by looking in /images/dev directory. If there are subdirectories in there based on a mac address then this is typically an indicator that someone has changed the fog service account.
-
@george1421
It’s interesting you mention the fog user account - for some reason on roughly half of the installs of fog I did to mess with, something would happen to permissions on storage, or files and I learned to just do a sudo passwd fog and set to password.
the database wouldn’t fail typically - only when that level of high saturation would occur. its almost as if certain services would hiccup or hang. a reboot would usually cure the issue.
-
@p4cm4n said in FOG Pxe boots, doesn't register as having image to deploy:
I learned to just do a sudo passwd fog and set to password.
One should never mess with the fog user account. That account should only be managed (i.e. password changed) by the fog installer. If you get the password out of sync with the webui you WILL have upload and kernel update issues.
I put together a tutorial on how to resync the fog service account user if you need it: https://forums.fogproject.org/topic/11203/resyncing-fog-s-service-account-password
-
okay i’m running.
not sure the kernel thing worked, i read i can do debug and uname -a and see, but not worried about it, that it takes 3-5 min for gpt tables…since i can image
so heres what i did - i originally followed your tutorial, but the .fogsettings password was encrypted. copying it and pasting in encrypted state caused the client to be imaged to fail at TFTP loading. so heres what i did.
i went into .fogsettings and changed the password to password. i then went into the tutorial places and changed accordingly, lastly running the installer. then, the client machine is imaging as i type.
unsure if running the installer fixed it, or if the password stuff fixed it. honestly considering i’d done so many before that i find it difficult to believe it was password related at this point, but who knows. when i changed password before, i changed it in tftp and storage but never run the installer. i initially changed it because i got that ftp error on trying to create an image.
thanks man!
only a couple hundred to go lol
-
@p4cm4n OK now that you have the password parts worked out, you should be able to go into FOG Settings->FOG Kernel and from there you can downgrade to the 4.15.2 kernels, just download both bzImage and bzImage32 for the PC platform. From there it will download the kernels. You can check by going on the FOG server console to
cd /var/www/html/fog/service/ipxe
file bzImage
That should print out version 4.15.2. If that is the case you are golden. FWIW, you CAN do this while you are imaging as long as you don’t currently have a system pxe booting.
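If you want just the version number for both kernels instead of reading the full `file` output, a small wrapper along these lines should do it. This is a sketch: `extract_version` is a made-up name, and it assumes `file` reports the kernel with a "version X.Y.Z" token, as it normally does for a Linux bzImage.

```shell
#!/bin/sh
# Hypothetical helper: read `file` output on stdin and print only the
# dotted number that follows the word "version".
extract_version() {
    sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p'
}

# Directory taken from the commands above; adjust if your web root differs.
IPXE_DIR=/var/www/html/fog/service/ipxe
for k in bzImage bzImage32; do
    if [ -f "$IPXE_DIR/$k" ]; then
        printf '%s: %s\n' "$k" "$(file "$IPXE_DIR/$k" | extract_version)"
    fi
done
```

After the downgrade, both lines should show 4.15.2. A blank version means `file` did not recognise the file as a kernel image, which is worth a second look.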