DHCP problems on storage nodes
-
@greg-plamondon Well I think we need to get better debugging information into the inits to see exactly what is going wrong.
There is a wiki page that explains how to unpack the inits. Copy the init.xz (virtual hard drive) from /var/www/html/fog/service/ipxe to /tmp on your fog server then follow the instructions in the wiki: https://wiki.fogproject.org/wiki/index.php?title=Modifying_the_Init_Image
cp /var/www/html/fog/service/ipxe/init.xz /tmp
cd /tmp
xz -d init.xz
mkdir initmountdir
mount -o loop init initmountdir
Then change into /tmp/initmountdir/etc/init.d
I’m going to work on updating S40network to add more info so we know what it’s doing (wrong).
-
This is the updated S40network file.
#!/bin/bash
#
# Start the network....
#
if [[ -n $has_usb_nic ]]; then
    echo "Please unplug your device and replug it into the usb port"
    echo -n "Please press enter key to connect [Enter]"
    read -p "$*"
    echo "Sleeping for 5 seconds to allow USB to sync back with system"
    sleep 5
fi
# Enable loopback interface
echo -e "auto lo\niface lo inet loopback\n\n" > /etc/network/interfaces
/sbin/ip addr add 127.0.0.1/8 dev lo
/sbin/ip link set lo up
sleep 10
# Generated a sorted list with primary interfaces first
read p_ifaces <<< $(/sbin/ip -0 -o addr show | awk -F'[: ]+' 'tolower($0) ~ /link[/]?ether/ && tolower($0) ~ /'$mac'/ {print $2}' | tr '\n' ' ')
read o_ifaces <<< $(/sbin/ip -0 -o addr show | awk -F'[: ]+' 'tolower($0) ~ /link[/]?ether/ && tolower($0) !~ /'$mac'/ {print $2}' | tr '\n' ' ')
ifaces="$p_ifaces $o_ifaces"
for iface in $ifaces; do
    echo "Starting $iface interface and waiting for the link to come up"
    echo -e "auto $iface\niface $iface inet dhcp\n\n" >> /etc/network/interfaces
    /sbin/ip link set $iface up
    # Wait till the interface is fully up and ready (spanning tree)
    timeout=0
    linkstate=0
    until [[ $linkstate -eq 1 || $timeout -ge 35 ]]; do
        let timeout+=1
        linkstate=$(/bin/cat /sys/class/net/$iface/carrier)
        [[ $linkstate -eq 0 ]] && sleep 1 || break
    done
    [[ $linkstate -eq 0 ]] && echo "No link detected on $iface for $timeout seconds, skipping it." && continue
    for retry in $(seq 3); do
        echo "## Bringing up interface $iface ##"
        /sbin/udhcpc -i $iface --now
        ustat="$?"
        echo "## Calling the fog server ${web}/index.php ##"
        curl -Ikfso /dev/null "${web}"/index.php --connect-timeout 5
        cstat="$?"
        # If the udhcp is okay AND we can curl our web
        # we know we have link so no need to continue on.
        # NOTE: the link to web is kind of important, just
        # exiting on dhcp request is not sufficient.
        if [[ $ustat -eq 0 && $cstat -eq 0 ]]; then
            echo "## We have an IP address on $iface and the Master FOG server responded to our query ##"
        fi
        [[ $ustat -eq 0 && $cstat -eq 0 ]] && exit 0
        if [[ $ustat -eq 1 ]]; then
            echo "## DHCP failed on $iface ##"
        fi
        if [[ $cstat -eq 1 ]]; then
            echo "## The Master FOG server failed responded to our query ##"
        fi
        echo "Either DHCP failed or we were unable to access ${web}/index.php for connection testing."
        sleep 1
    done
    echo "No DHCP response on interface $iface, skipping it."
done
# If we end up here something went wrong as we do exit the script as soon as we get an IP!
if [[ -z $ifaces ]]; then
    echo "No network interfaces found, your kernel is most probably missing the correct driver!"
else
    echo "Failed to get an IP via DHCP! Tried on interfaces(s): $ifaces"
fi
echo "Please check your network setup and try again!"
[[ -z $isdebug ]] && sleep 60 && reboot
echo "Press enter to continue"
read
exit 1
My comments have a double pound sign (##) on both sides. Once you update the S40network file, you need to repack the inits and then move the repacked init.xz to your storage node at this test location. Understand this is only a test init so that we can find out what is going on. Keep a copy of your untouched init file so you can put it back after the debugging is over.
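For the repack step, here is a rough sketch of what I would run, not a tested procedure. It assumes the image was unpacked under /tmp as in the commands earlier in this post. The `run` and `repack_init` helpers and the `DRY_RUN` variable are names made up for this example, and the `xz -C crc32 -9` invocation is from memory of the wiki page, so check it against the wiki before relying on it. With `DRY_RUN` set the commands are only printed, which lets you review them before running anything as root.

```shell
#!/bin/bash
# run CMD...: execute the command, or only print it when DRY_RUN is set.
# (Hypothetical helper, only for this example.)
run() {
    if [[ -n $DRY_RUN ]]; then
        echo "$*"
    else
        "$@"
    fi
}

# repack_init WORKDIR DEST: unmount the loop-mounted image, recompress it,
# and copy the resulting init.xz to DEST.
repack_init() {
    local workdir=$1 dest=$2
    run umount "$workdir/initmountdir" || return 1
    # Assumption: crc32 is used because the kernel's xz decompressor
    # does not handle xz's default integrity check.
    run xz -C crc32 -9 "$workdir/init" || return 1
    run cp "$workdir/init.xz" "$dest"
}

# Example (prints the commands only):
#   DRY_RUN=1 repack_init /tmp /var/www/html/fog/service/ipxe/init.xz
```

Drop the `DRY_RUN=1` prefix and run it as root once the printed commands look right for your paths.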
My intuition is still telling me this could be a spanning tree issue, even though you said you checked that. Your debug FOS system has access so that doesn’t specifically have the same conditions as a physical system. I guess you could always try to image a virtual machine at the remote location and see if it works (even before tweaking the inits). In the case of a VM it will not drop the link on the physical switch because the VM is connected to a vswitch.
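To test the spanning tree theory on its own, the link-wait part of the script can be pulled out into a small function and run by hand from a debug shell. This is only a sketch: `wait_for_carrier` is a name invented here, and the defaults (35 checks, 1 second apart) simply mirror the loop in the script. If a physical machine needs many more checks than the VM before the link shows up, that would point at the switch port.

```shell
#!/bin/bash
# wait_for_carrier FILE [TRIES] [DELAY]
# Polls a carrier file (normally /sys/class/net/<iface>/carrier) until it
# reads 1. Returns 0 once link is seen, 1 if TRIES checks pass without it.
wait_for_carrier() {
    local carrier_file=$1 tries=${2:-35} delay=${3:-1}
    local n state
    for ((n = 1; n <= tries; n++)); do
        # Reading carrier can fail while the interface is down; treat as 0.
        state=$(cat "$carrier_file" 2>/dev/null) || state=0
        if [[ $state == 1 ]]; then
            echo "link up after $n check(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "no link after $tries check(s)"
    return 1
}

# Example on a real interface (eth0 is an assumption):
#   wait_for_carrier /sys/class/net/eth0/carrier 60 1
```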
-
OK, I have the S40network modified and the init repacked. Do I need to run the same command for testing?
-
I am getting an error in the S40network file.
-
@greg-plamondon Whelp, that’s why I’m not a programmer
Here is what needs to be fixed, sorry.
This is the bad line
if [ $ustat -eq 0 && $cstat -eq 0 ]; then
This is what it should have been
if [[ $ustat -eq 0 && $cstat -eq 0 ]]; then
Aw, crud, and then the next errors you will find are a few lines down.
if [ $ustat -eq 1 ]; then
    echo "## DHCP failed on $iface ##"
fi
if [ $cstat -eq 1 ]; then
    echo "## The Master FOG server failed responded to our query ##"
fi
These need to have the double brackets too:
if [[ $ustat -eq 1 ]]; then
    echo "## DHCP failed on $iface ##"
fi
if [[ $cstat -eq 1 ]]; then
    echo "## The Master FOG server failed responded to our query ##"
fi
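For anyone wondering why the single-bracket version breaks: inside `[ ... ]` the `&&` is read by the shell as "end of this command", so `[` never sees its closing `]`. The sketch below shows the two forms that do work. The function names are made up for illustration. It also reports any non-zero exit code rather than only 1, because as far as I know curl uses codes such as 7 (could not connect) and 28 (timed out) for the failures we care about, so a check for exactly 1 would probably stay silent on them.

```shell
#!/bin/bash
# net_ok USTAT CSTAT: succeed only when both exit codes are 0.
# && is allowed inside [[ ]] because [[ ]] is shell syntax, not a command.
net_ok() {
    local ustat=$1 cstat=$2
    [[ $ustat -eq 0 && $cstat -eq 0 ]]
}

# The same check with plain [ ]: two separate tests joined by the shell's &&.
net_ok_posix() {
    [ "$1" -eq 0 ] && [ "$2" -eq 0 ]
}

# describe_failure USTAT CSTAT: report any non-zero code, not only 1.
describe_failure() {
    local ustat=$1 cstat=$2
    [[ $ustat -ne 0 ]] && echo "DHCP failed (exit $ustat)"
    [[ $cstat -ne 0 ]] && echo "web check failed (exit $cstat)"
    return 0
}

# Example:
#   net_ok "$ustat" "$cstat" || describe_failure "$ustat" "$cstat"
```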
-
-
My question is: if the connection is good (“## We have an IP address on eth0 and the Master FOG server responded to our query ##”), why does it then disconnect the eth0 interface and attempt to obtain an IP when it already has one that is working?
-
now if I issue a: /etc/init.d/S40network restart
I get this:
-
@greg-plamondon That looks perfect. If you now key in
fog
you can single step through deployment. Or just cancel the task on the fog server and then pxe boot this again; the vm should image. At least from a networking point of view everything is golden. Is there any chance to do this on one of the broken systems?
You will see that if there is an error, it will loop through this code 3 times and then give up. In this case it only went through once because it worked.
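The "try 3 times, then give up" behaviour boils down to a loop like the one below. This is a generic sketch, not the code in S40network: `retry` is a hypothetical helper that runs whatever command it is given and stops at the first success.

```shell
#!/bin/bash
# retry TRIES CMD...: run CMD up to TRIES times, stopping at the first success.
# Returns 0 on success, 1 if every attempt failed.
retry() {
    local tries=$1 attempt
    shift
    for ((attempt = 1; attempt <= tries; attempt++)); do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        echo "attempt $attempt of $tries failed"
    done
    return 1
}

# Example (interface name is an assumption):
#   retry 3 /sbin/udhcpc -i eth0 --now
```

A working network gives a single "succeeded on attempt 1" line, which matches the one pass seen here.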
-
hmmm
-
@george1421
here is a video:
Youtube FOG
-
@greg-plamondon just for clarity is this on the VM or a physical host?
-
this is a VM on the same host as the fogserver.
The physical PCs do the same thing; I just can’t get screen-caps of them.
-
@greg-plamondon looking at your video, it almost appears that there are 2 scripts running to start the networking. We see clearly that S40network is executing because it has our ## messages. But the second script (unknown at this time) doesn’t print out our ## messages. That tells me there is some other code, not in S40network, that is trying to reinit the network adapter. I guess I need to do a bit more digging here.
-
oops, lol, I moved the S40network to S40network.old
-
@george1421
OK, removed the S40network.old, new video
-
@greg-plamondon It’s the same issue again, a second set of dhcp functions is being called. Did you take the S40network.old out of the init.d directory? Actually you can delete it since you saved the original init.xz image anyway.
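A possible reason a leftover S40network.old would matter: the stock Buildroot startup script (rcS) runs every file matching /etc/init.d/S??*, and that pattern also matches S40network.old. I have not confirmed that the FOS init uses the stock rcS, so treat that as an assumption. The sketch below lists what such a loop would pick up; `list_start_scripts` is a name invented for this example.

```shell
#!/bin/bash
# list_start_scripts DIR: print the names a Buildroot-style rcS loop
# (for i in DIR/S??*) would pick up, in glob order.
list_start_scripts() {
    local dir=$1 f
    for f in "$dir"/S??*; do
        [[ -e $f ]] || continue   # pattern matched nothing
        basename "$f"
    done
}

# Example against the unpacked init:
#   list_start_scripts /tmp/initmountdir/etc/init.d
```

If S40network.old shows up in that list, moving it out of init.d (or deleting it) should stop it from running.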
-
@george1421
Yes, it’s removed.
-
I don’t know what the difference is between the init.xz that is on the main fogserver and the one that is on the storage node, but I copied the init.xz from the main fogserver to the storage node and it works now. What gives?
-
@greg-plamondon There is/was a change in the init.xz files between FOG 1.4.4 and 1.5.0RCx. The 1.5.0RCx now supports both http and https transactions throughout, where 1.4.4 kind of - sort of - supported https transactions, but not always.
When you upgraded to 1.5.0 on your main server, did you upgrade all of the storage nodes in your fleet? The fog kernel and init (bzImage and init.xz) need to match the version of FOG that is installed.
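A quick way to spot a mismatch like this is to compare checksums of the files on the main server and on each storage node. The sketch below is a minimal example that assumes `sha256sum` is available and that you have copied the node's file somewhere local, or that you run `file_sha256` on each box and compare the output by eye. The function names are made up for illustration.

```shell
#!/bin/bash
# file_sha256 FILE: print only the SHA-256 digest of FILE.
file_sha256() {
    sha256sum "$1" | awk '{print $1}'
}

# inits_match A B: succeed when the two files have the same digest.
inits_match() {
    [[ $(file_sha256 "$1") == "$(file_sha256 "$2")" ]]
}

# Example (the second path is a hypothetical copy fetched from the node):
#   inits_match /var/www/html/fog/service/ipxe/init.xz /tmp/node-init.xz \
#       && echo "same init" || echo "init differs"
```

The same check applies to bzImage.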