Best posts made by george1421

george1421

@tlehrian Ok here is what we are seeing. When FOS Linux boots, the nvme drives initialize (become ready to the os) at different times. Sometimes drive A is ready first and other times drive B is ready first. When linux boots, whichever drive inits first becomes /dev/nvme0n1 and the second one becomes /dev/nvme1n1. This is not an issue with FOG or linux, it’s an issue between the linux OS and the hardware.

So what we need is to run a utility with a few commands to help us detect which drive is which in each state. The switch is totally random so we can’t predict the order using linux. So what I need you (as a tester) to do is to pxe boot the target computer multiple times and record the settings when the drives are in the normal order and then in the reversed order. If we are lucky you will see this swap within 10 pxe boots.

Here is what I want you to do:

1. Download this updated init from here: https://fogproject.org/inits/init_nvme-cli.xz
2. In /var/www/html/fog/service/ipxe, rename the original init.xz to init.xz.sav.
3. Move the downloaded file to that directory and save it as init.xz.
4. Pick one of these dual drive nvme computers and schedule a deploy task to it. Before you hit the schedule task button, tick the debug checkbox, then schedule the task.
5. PXE boot the target computer. After a few screens of text where you need to press enter to clear, you will be dropped to the FOS Linux command prompt.
6. At the FOS Linux command prompt run lsblk to note the size and order of the nvme disks. Use disk size as your guide in determining the order. This is state 1.
6.1 You can use these steps if you want to set up remote debugging. It’s easier to copy and paste commands from putty. You don’t need to, it’s just one option.
6.2 At the FOS Linux command prompt key in ip addr show and collect the IP address of the FOS Linux computer.
6.3 Give root a password with passwd. Just give it a simple password like hello. The password will be reset on the next reboot, so don’t worry.
6.4 From a windows computer use putty to ssh into the FOS Linux computer. Log in as root with the password you created in step 6.3.
7. At the FOS Linux command prompt key in nvme list and post the results here. If the nvme command isn’t known then the downloaded init is not in the right spot.
8. Key in the following commands and post the results here: nvme id-ctrl /dev/nvme0n1 -H and nvme id-ctrl /dev/nvme1n1 -H (I’m guessing at the names since I don’t have a dual nvme system, so use the names from the lsblk command above).
9. Now reboot the FOS Linux computer with ctrl-alt-del or key in reboot at the FOS Linux command prompt. The system should PXE boot right back into FOS Linux in debug mode.
10. Use the lsblk command to determine the disk order. We are looking for the order of the drives when they switch places. If you can’t get them to switch with a reboot, power off the system instead to see if that gets them to switch. The key is to capture the output of the nvme commands in both states.
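
To make the two states easier to compare side by side, the lsblk output can be boiled down to one line per nvme drive. This is only a rough sketch: it assumes the lsblk in the debug init supports the -d, -n and -o options, and the function name nvme_order is made up for illustration.

```shell
# nvme_order: read `lsblk -dn -o NAME,SIZE` style lines on stdin and print
# "SIZE /dev/NAME" for every nvme disk. Sorting on the size column keeps the
# drives in the same position in both states, so a name swap is easy to spot.
nvme_order() {
    awk '$1 ~ /^nvme/ { print $2, "/dev/" $1 }' | sort
}

# On the FOS Linux debug prompt this would be run as:
#   lsblk -dn -o NAME,SIZE | nvme_order
```

If the device name printed next to a given size changes between two boots, that is the swapped state to capture.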

george1421

@Luc-Novales Ok, let’s take a step back because I haven’t seen a picture yet from iPXE. Chain booting from syslinux to iPXE won’t work because you say that iPXE is hanging. Using syslinux to call boot.php won’t work either, since boot.php will make a menu for iPXE and not for syslinux. I suspect that pxelinux.0 (syslinux) will not like an iPXE style menu.

Let’s first set dhcp option 67 to ipxe.kkpxe (this exact boot file) and pxe boot the target system. Using a mobile phone, take a picture of the error produced by iPXE (I know you have done this many times, but I want to see the error as well as the context of the error).

FOS Linux: This is the capture engine that runs on the target computer to capture and deploy images on the target hardware. Normally it is sent to the target hardware using iPXE via the menu produced by the boot.php page. FOS Linux is constructed out of 2 parts: bzImage32 is the 32 bit Linux kernel and init_32.xz is the 32 bit virtual hard drive. It’s iPXE’s job to transfer them to the target computer.

Today we have a usb flash drive option that uses a grub boot loader to transfer bzImage32 and init_32.xz to the target computer.

Now with that said, we should be able to duplicate that grub boot menu using syslinux (pxelinux.0). It’s not an ideal solution, but if you have no choice, sometimes you have to do what you have to do to make things work. So if we can’t get things working with iPXE, then we can attempt to use syslinux to at least get FOS Linux to the target computer.

I know we are focusing on the CPU in this thread, but something I have to ask: how much RAM does this device have?

Only for reference so I don’t forget and lose it.

default menu.c32
prompt 0
timeout 300
ONTIMEOUT local

MENU TITLE FOG PXE Menu

LABEL 1. FOG Image Deploy/Capture
        MENU LABEL 1. FOG Image Deploy/Capture
        KERNEL bzImage32
        APPEND loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=275000 keymap= web=$myfogip/fog/ boottype=usb consoleblank=0 rootfstype=ext4

LABEL 2. Perform Full Host Registration and Inventory
        MENU LABEL 2. Perform Full Host Registration and Inventory
        KERNEL bzImage32
        APPEND loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=275000 keymap= web=$myfogip/fog/ boottype=usb consoleblank=0 rootfstype=ext4 mode=manreg

LABEL 3. Quick Registration and Inventory
        MENU LABEL 3. Quick Registration and Inventory
        KERNEL bzImage32
        APPEND loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=275000 keymap= web=$myfogip/fog/ boottype=usb consoleblank=0 rootfstype=ext4 mode=autoreg

LABEL 4. Client System Information (Compatibility)
        MENU LABEL 4. Client System Information (Compatibility)
        KERNEL bzImage32
        APPEND loglevel=4 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=275000 keymap= web=$myfogip/fog/ boottype=usb consoleblank=0 rootfstype=ext4 mode=sysinfo

LABEL 5. FOG Debug Kernel
        MENU LABEL 5. FOG Debug Kernel
        KERNEL bzImage32
        APPEND loglevel=7 initrd=init_32.xz root=/dev/ram0 rw ramdisk_size=275000 keymap= boottype=usb consoleblank=0 rootfstype=ext4 isdebug=yes

george1421

Just as a follow up on this. The bzImage32 is configured to require a 686 as the minimum processor: https://github.com/FOGProject/fos/blob/master/configs/kernelx86.config#L251

I’ve reconfigured my build environment and changed the minimum processor to 486, and I’m rebuilding the bzImage32 file for i486. Understand two things. 1. This is not an official FOG Project kernel since it’s not coming from one of the Developers. 2. I have no idea if it will work because I don’t have a 486 machine to test it on.
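
For anyone who wants to check which processor family a given kernel config is built for, something like the following should do it. Treat it as a sketch: the symbol names (CONFIG_M486, CONFIG_M586, CONFIG_M686 and friends) are the x86 processor family options as I understand them, and the function name cpu_family is made up.

```shell
# cpu_family: read a kernel .config on stdin and print the processor family
# symbol that is switched on, for example CONFIG_M686=y or CONFIG_M486=y.
# Lines such as "# CONFIG_M486 is not set" are ignored.
cpu_family() {
    grep -E '^CONFIG_M(486|586|586TSC|586MMX|686)=y$'
}

# Example, run from a checkout of the fos repository:
#   cpu_family < configs/kernelx86.config
```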

When it’s done compiling I’ll post a link so you can download it and try it.

Also on my syslinux configuration: it’s been almost 8 years since I’ve worked with syslinux, so I guessed at most of the entries. You WILL need to replace every occurrence of the variable $myfogip with the IP address of your fog server to make it work correctly. It will take about 20 minutes to finish the new kernel build.
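
The $myfogip replacement can be done in one pass with sed instead of editing each APPEND line by hand. A minimal sketch, where the function name set_fog_ip and the file names in the usage comment are placeholders, not anything FOG ships:

```shell
# set_fog_ip: read a syslinux config on stdin and replace every literal
# $myfogip with the address passed as the first argument.
set_fog_ip() {
    sed 's/[$]myfogip/'"$1"'/g'
}

# Example (both file names are hypothetical):
#   set_fog_ip 192.168.1.10 < fog-menu.template > pxelinux.cfg/default
```

The [$] bracket keeps sed from treating the dollar sign as an end-of-line anchor.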

george1421

@tlehrian Ok so this IS something that the Linux kernel developers are going to have to address. It’s not something that only impacts FOG, but all distros of Linux.

george1421

While I don’t use snapins in my environment, I suspect that the batch file has caused a pop-up message to be displayed. Since the applications are being installed in a hidden window, no one is there to acknowledge the pop-up message, so the task is stuck.

So how would you go about debugging this? I would copy the batch file to the target computer and run it as Administrator. See if it executes without any prompts.

george1421

@Sebastian-Roth Are we running into the mount timeout issue we’ve tried to debug before? I also wonder if we could add a bit more debugging information to that unable to find image store message, similar to what we did for when the path was set incorrectly.

Let me see if I can come up with a patched fog.upload.

@pnwbsi if you are willing to help us test this, hopefully we can squash this issue once and for all.

george1421

If someone changed the linux password for fogproject, or messed with the settings for that account within the web ui, that error will be thrown.

Work through this document: https://forums.fogproject.org/topic/11203/resyncing-fog-s-service-account-password

george1421

What do you have configured for dhcp option 67?

Also do you have a linux or windows 2012 or newer dhcp server?

george1421

@rogalskij fwiw the relevant lines showing which interface is being used are here:

Command: /usr/local/sbin/udp-sender --interface em1 --min-receivers 3 --max-wait 1200 --portbase 56590 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/BaseImage/d1p1.img;/usr/local/sbin/udp-sender --interface em1 --min-receivers 3 --max-wait 10 --portbase 56590 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/BaseImage/d1p2.img;

Task started
Udp-sender 20120424
Using mcast address 238.155.1.70
UDP sender for /images/BaseImage/d1p1.img at x.x.x.x on em1 
Broadcasting control to 224.0.0.1

george1421

So the first question I would have is: Are the target systems and the FOG server on the same subnet (vlan)?

george1421

@rogalskij ok so here is where we are:

It’s not the image because it deploys correctly using unicast
We know the installed network adapters and em1 is the correct network adapter, it has an ip address and is currently up
The ps command shows that udp-sender should be using network interface em1
The target computers and fog server are on the same vlan so no additional infrastructure work is needed.
At least some of the multicasts are getting through since the clients are able to check in and the stream starts.
It appears to hang at the partclone screen

We still don’t know if the infrastructure is setup correctly for multicasting (i.e. igmp snooping is enabled on vlan 1).
We don’t know if the multicast settings are right in the fog configuration.
We don’t know if the fog server’s firewall has been enabled but multicasts not allowed.

george1421

@Sebastian-Roth from a previous post:

Additionally, the output of the command you specified “sudo ps aux|grep udp-sender” is:

root 13864 0.0 0.0 115300 1480 ? S Aug30 0:00 sh -c /usr/local/sbin/udp-sender --interface em1 --min-receivers 3 --max-wait 1200 --portbase 56590 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/BaseImage/d1p1.img;/usr/local/sbin/udp-sender --interface em1 --min-receivers 3 --max-wait 10 --portbase 56590 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/BaseImage/d1p2.img;
root 14393 0.0 0.0 8688 660 ? S Aug30 0:00 /usr/local/sbin/udp-sender --interface em1 --min-receivers 3 --max-wait 10 --portbase 56590 --full-duplex --ttl 32 --nokbd --nopointopoint --file /images/BaseImage/d1p2.img
root 31094 0.0 0.0 112708 992 pts/0 S+ 11:39 0:00 grep --color=auto udp-sender

There appear to be some stale multicast tasks running since 30-Aug.
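
A quick way to pull just the process IDs out of that ps output is sketched below. It assumes the usual ps aux column layout (PID in column 2, command starting in column 11), and the function name udp_sender_pids is made up.

```shell
# udp_sender_pids: read `ps aux` output on stdin and print the PID of every
# process whose command line mentions udp-sender. The `sh -c` wrapper is
# included on purpose; the grep process that did the searching is skipped.
udp_sender_pids() {
    awk '/udp-sender/ && $11 != "grep" { print $2 }'
}

# Example on the FOG server:
#   ps aux | grep udp-sender | udp_sender_pids
```

This only lists the candidates. Check the start date of each one before deciding whether it is really a leftover.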

george1421

@FogNewUser Ok this one almost tells me what I need to know, from the picture perspective it would be handy to see more to the left edge to get the complete interface name.

But from what I see you have the built in network adapters (knowing it’s a P710 helps with the guessing) and maybe 3 riser card network adapters. At this point I don’t see any interfaces with IP addresses, so it looks like you have a bunch of unconfigured interfaces.

From the screen shot it looks like you have a debian based OS (could be ubuntu, which is debian based). So you will need to use the network manager, if you have the gui loaded, to define an IP address for your fog server. IF you have fog installed AND it was working, you can refer to the file /opt/fog/.fogsettings to see what the IP address of the fog server was when FOG was installed. You need to move it back to that IP address or you will have to make several changes to the fog configuration, since the IP address is hardcoded in a few spots.
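
Reading the old address out of /opt/fog/.fogsettings could look roughly like this. It is a sketch under two assumptions that should be checked against the real file: that each setting is stored as key='value' on its own line, and that the address is kept under a key named ipaddress. The function name fog_setting is made up.

```shell
# fog_setting: print the value stored for KEY ($1) in a .fogsettings style
# file ($2). Assumes one key='value' pair per line with single quotes.
fog_setting() {
    sed -n "s/^$1='\\(.*\\)'\$/\\1/p" "$2"
}

# Example (the key name ipaddress is an assumption):
#   fog_setting ipaddress /opt/fog/.fogsettings
```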

george1421

@Redbob This reminds me of a communication problem. You should not get a connection reset…

It kind of makes me think it’s a spanning tree issue or a faulty cable. But if everything is the same other than the bios/uefi switch on the same computer, it really can’t be communications. If it is a communication issue, can you confirm that the switch the pxe booting client is on is using one of the fast spanning tree protocols like Fast-STP, RSTP, MSTP, port fast, etc? Another test would be to put a dumb unmanaged switch between the pxe booting computer and the building switch. If it works with the dumb switch then it’s a spanning tree issue with the building switch.

Is the firmware up to date on this target computer?

The other thing to test is to put the pxe booting computer on the same subnet and same switch as the FOG server. This would rule out any devices in between the two causing this issue. It still makes me think of a communication problem because it’s failing at random, different steps in the booting process.

george1421

@David-Osinski said in Multicast off of bond:

--mcast-rdv-address 10.10.10.74

This is what bugs me a bit. The rendezvous address should be a multicast address. That is where all of the multicast clients go to find each other. I would expect that to be at least 224.0.0.1

udp-sender --interface rope --min-receivers 1 --max-wait 600 --mcast-rdv-address 10.10.10.74 --portbase 56854 --full-duplex --ttl 32 --nokbd --nopointopoint

When you don’t define a multicast address, the address is created by udp-sender as a composite of the FOG server’s IP address. I think it should be something like 239.10.10.74, though possibly the second octet is something different. But the data channel IS calculated, so it’s important for the rendezvous address to be defined so the targets can locate the data channel.
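
For reference, IPv4 multicast addresses are the ones in 224.0.0.0/4, meaning the first octet is between 224 and 239. A small check along those lines, as a sketch (the function name is_multicast is made up and it only looks at the first octet):

```shell
# is_multicast: succeed when the IPv4 address in $1 has a first octet in the
# multicast range 224-239, fail otherwise (including for non-numeric input).
is_multicast() {
    first=${1%%.*}
    case $first in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$first" -ge 224 ] && [ "$first" -le 239 ]
}

# Example:
#   is_multicast 10.10.10.74 || echo "not a multicast address"
```

By this test 10.10.10.74 is a plain unicast address, which is why it looks out of place as a rendezvous address.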

george1421

Off the top of my head I can think of a few things (not in line with the OPs question about Galera).

Surely increase the client check in time to 600 or 900 seconds (guess) with over 1000 target computers running the FOG client.
Move the mysql database (server) to a dedicated server that can be tuned and targeted for mysql performance.
Make sure you have sufficient RAM and vCPUs allocated to the FOG Master node and database server.
Starting with FOG version 1.5.2, FOG started using php-fpm to process the php code instead of the built in apache php engine. This was done for a few reasons. A dedicated php-fpm engine processes php code faster than apache’s php engine. This also frees up apache to process http requests faster instead of doing both tasks.
You will probably want to tweak the php-fpm engine to allow more children php processes to run. The default is 35 in the FOG 1.5.x series.
You will probably need one (or more) 10GbE network adapters for both the fog master node and database server. I know on a 1 GbE network we can saturate it with just 3 simultaneous unicast streams.
If your FOG server is physical, then make sure your disk subsystem is either flash based or running on a raid array 0 or 10 with many spindles.
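
On the php-fpm point, the limit on child processes is the pm.max_children directive in the pool config file. Before changing it, it helps to see what is currently set. A sketch, assuming you already know which pool file your distro uses (the path in the usage comment is only an example, and the function name fpm_max_children is made up):

```shell
# fpm_max_children: print the active pm.max_children value from the php-fpm
# pool config file given as $1. Lines commented out with ; are ignored.
fpm_max_children() {
    sed -n 's/^[[:space:]]*pm\.max_children[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Example (the path varies by distro and php version):
#   fpm_max_children /etc/php-fpm.d/www.conf
```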

I have to say that FOG really hasn’t been performance tuned for such a large campus. I know there are some forum members that do have large campuses that are using fog for imaging.

george1421

@Sebastian-Roth @Jay-Bosworth Understand I’ve only read the last 2 posts, so I’m not sure where the thread is headed.

BUT I can offer a comment. During Windows golden image development I’ve used a modified post download script (similar to the one for sending the drivers to the target computer) to patch the unattend.xml and replace dlls in a previously captured image. I only did this during golden image development or until the next time I built a golden image.

The principles of this are outlined in this article: https://forums.fogproject.org/topic/11126/using-fog-postinstall-scripts-for-windows-driver-injection-2017-ed

The fog.copydrivers script could be modified to just copy over the one needed file from the fog server. The section to change is this one, where clientdriverpath is the destination path and remotedriverpath is the path of the files to copy.

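# Destination folder on the mounted Windows partition and source folder
# holding the driver package for this model, OS and architecture
dots "Preparing Drivers"
clientdriverpath="/ntfs/Drivers"
remotedriverpath="/images/drivers/$machine/$osn/$arch"

debugPause

# Stop here when there is no driver package for this machine
if [[ ! -d "${remotedriverpath}" ]]; then
    echo "failed";
    echo " ! Driver package not found for ${machine}/$osn/$arch ! ";
    debugPause;
    return;
fi
echo "Ready";

debugPause

# Create the destination folder if it does not exist yet
[[ ! -d "$clientdriverpath" ]] && mkdir -p "$clientdriverpath" >/dev/null 2>&1
echo -n "In Progress"

# Copy the driver package over to the target disk
rsync -aqz "$remotedriverpath" "$clientdriverpath" >/dev/null 2>&1

# $? still holds the exit status of the rsync above
[[ ! $? -eq 0 ]] && handleError "Failed to download driver information for [$machine/$osn/$arch]"

debugPause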
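
A cut down version that copies just one file might look like the sketch below. This is not the actual fog.copydrivers code: the function name copy_one_file and both paths in the usage comment are hypothetical, and it uses plain cp rather than rsync to keep it short.

```shell
# copy_one_file: copy a single file ($1) into a destination folder ($2),
# creating the folder first when it does not exist. Returns non-zero when
# the source file is missing or the copy fails.
copy_one_file() {
    src=$1
    destdir=$2
    [ -f "$src" ] || { echo "source file not found: $src" >&2; return 1; }
    mkdir -p "$destdir" || return 1
    cp -f "$src" "$destdir/"
}

# Example inside a post download script (paths are made up):
#   copy_one_file "/images/drivers/unattend.xml" "/ntfs/Windows/Panther" \
#       || handleError "Failed to copy unattend.xml"
```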

george1421

@xburnerx00 I would still consider replacing that HDD with a new SSD, just because after around 4-5 years of daily use the drives do fail. If you have the budget, replace it now and avoid having to replace it later. Plus even a cheap ssd will make that laptop seem new again.

george1421

@Dinesh The problem is that your dhcp server isn’t telling the pxe booting client the ip address of the fog server. I have the idea that you have 2 dhcp servers on your network. One is configured correctly: it tells the PXE rom the name of the boot server and the boot file name, which is why iPXE is starting. But when iPXE starts it sends out a dhcp request again, and the second dhcp server answers without the boot server information. That is why iPXE asks.

If you look in the picture, the IP addresses coming from your dhcp server look… strange. The pxe booting computer gets an IP address of 192.168.6.14, but the gateway is 192.168.6.15. This is not something I would expect to see.

george1421

@Junkhacker @Sebastian-Roth

I was able to get the OP going by doing this and that.

We are not sure if it was this or that that got the kernel to boot. What I did was unlock the max CPUs (which was capped in the kernel), and I also enabled almost all of the ACPI modules in the kernel. We also tried the acpi_osi=Linux kernel parameter.

We ruled out the acpi_osi=Linux kernel parameter fixing the issue, so it must be something I enabled in the kernel. Tomorrow AM I’m going to reset the kernel environment and only unlock the max CPUs. The OP is going to test that new kernel to see whether it was unlocking the max CPUs or the acpi modules I enabled.

Either way I’ll report where we ended up and which kernel change fixed the issue. I have also seen other recent CPU stalls like this that were fixed by setting acpi=off, so we may need to move whatever fixed the issue into the main kernel build because new hardware/cpus may require it.
