Clients PXE booting from another subnet

jgiovann · Dec 3, 2018, 1:33 AM

Greetings,

I’ve successfully set up a FOG server with a dnsmasq service running, i.e. there is an existing production DHCP server (RHEL 7.4).

Clients in the same subnet as the FOG server (192.168.0.0/24) can communicate with the FOG server (192.168.0.87) when PXE booting.
However, clients on another subnet (192.168.2.0/24) are not able to communicate with the FOG server when PXE booting.
What additional options need to be configured on the DHCP server (not the FOG server) to direct/help clients in the 192.168.2.0/24 subnet to PXE boot off the FOG server?

Going back to the installation script I can see the following suggestion:

On a Linux DHCP server you must set: next-server and filename

I can provide more details as required.

John

Sebastian Roth · Dec 3, 2018, 5:57 AM

@jgiovann PXE booting is done via so-called broadcasts. A client sends a DHCP request to a broadcast address (everyone in its network); the DHCP server and the DHCP proxy (dnsmasq) see this and answer. This broadcast is restricted to the local subnet by definition.

So I am wondering how your clients in the 192.168.2.0/24 subnet get IP addresses. Either there is another DHCP server, or you have so-called IP helpers/DHCP relay configured to forward the DHCP packets from that other subnet to your DHCP server. I guess the latter is the case, as you would have told us about a second DHCP server. Then you need to set up the IP helpers/DHCP relay to also forward the DHCP packets to your FOG server where dnsmasq is running.

On the other hand, I am wondering why you set things up using dnsmasq at all. You should be able to adjust your existing RHEL DHCP server config easily to PXE boot clients without needing to fiddle with dnsmasq at all. Check out the example configs in our wiki: https://wiki.fogproject.org/wiki/index.php?title=BIOS_and_UEFI_Co-Existence#Example_1 (most important parts are next-server -> should point to FOG server IP; class/match/filename blocks -> just use as we have them in the example)

george1421 · Dec 3, 2018, 1:05 PM

Since you are running dnsmasq, dnsmasq is supplying the PXE boot information for the local subnet. As Sebastian says, DHCP (PXE booting) relies on broadcast messages to communicate. Because your remote subnet (192.168.2.0/24) is isolated by a router, those broadcast messages are filtered out by your router. I’m assuming that your DHCP server and FOG server are on the 192.168.0.0/24 subnet, AND that your DHCP server is issuing IP addresses for both the 192.168.0.0/24 and 192.168.2.0/24 subnets. If this is the case, you’ve probably set up a dhcp-helper / dhcp-relay service on the router between the subnets. That service forwards DHCP requests between your subnets.

What you need to do is add the dnsmasq IP address as the last host in that dhcp-helper service. That way the dhcp-helper service will send DHCP requests to both your main DHCP server and the dnsmasq service, so the dnsmasq service knows to respond to a boot request on the remote subnet (192.168.2.0/24).

If you can’t get this working, then we can have you capture a pcap of the PXE booting on the remote subnet to see what is really going down the wires. But adding the dnsmasq IP address to the dhcp-helper list should resolve your problem.
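
To make the subnet boundary concrete, here is a minimal Python sketch that checks whether a client shares a subnet with the FOG server, which is roughly the condition under which its DHCP broadcast can be seen without a relay. The /24 prefix and the server address come from this thread; the client address 192.168.0.50 is a made-up example, and the function name same_subnet is just an illustration.

```python
import ipaddress

def same_subnet(client_ip: str, server_ip: str, prefix_len: int = 24) -> bool:
    """Return True if server_ip lies inside the client's subnet."""
    # strict=False lets us build the network from a host address.
    client_net = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(server_ip) in client_net

fog_server = "192.168.0.87"
# 192.168.0.50 is a hypothetical local client; 192.168.2.89 is on the remote subnet.
for client in ("192.168.0.50", "192.168.2.89"):
    if same_subnet(client, fog_server):
        print(f"{client}: broadcasts can reach {fog_server} directly")
    else:
        print(f"{client}: a DHCP relay/IP helper must forward requests to {fog_server}")
```

Any client for which this returns False depends on the router's relay list to reach either the production DHCP server or dnsmasq.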

jgiovann · Dec 4, 2018, 6:07 AM

Thanks for the replies so far. Sebastian, I thought I’d trial the FOG server on a non-production machine first, not to mention that the server that runs the DHCP service also runs other critical services.

George, the tips for setting up a dhcp-helper list are very useful. Admittedly I’ll need to work with the network admin to resolve this.

I’ll be happy to provide an update once some progress is made.

jgiovann · Dec 5, 2018, 6:42 AM

@Sebastian-Roth

Looks like your advice was very useful. In the end I found the following setting allowed the client (from the 192.168.2.0/24 subnet) to boot into the FOG menu (192.168.0.87):

    next-server 192.168.0.87;
    # Select the correct PXE boot file depending on whether Legacy or UEFI booting is requested
    class "pxeclients" {
            match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
            if option architecture-type = 00:07 {
                    filename "intel.efi";
            } else {
                    filename "undionly.kkpxe";
            }
    }

The only issue now is that I’m getting the following message when attempting to do a Quick Registration of the client:

udhcpc: sending discover
udhcpc: sending discover
udhcpc: sending discover
udhcpc: no lease, failing
Either DHCP failed or we were unable to access http://192.168.0.87/fog//index.php for connection testing
No DHCP responce on interface eth0, skipping it.
Failed to get an IP via DHCP! Tried on interfaces(s): eth0
Please check your network setup and try again!
Press enter to continue

Note the DHCP server is not running on the FOG server but the production server.

Perhaps I need to do some more tweaking ?

Sebastian Roth · Dec 5, 2018, 6:51 AM

@jgiovann You need to push a gateway/router to those clients in the other subnet. Add option routers 192.168.2.X; to your DHCP config. There must be a router/gateway between those two networks and you need to specify its IP from the 192.168.2.0/24 side.

jgiovann · Dec 6, 2018, 6:01 AM

@Sebastian-Roth Thanks for the recommendation. Admittedly the option you’ve mentioned is already in the DHCP config

subnet 192.168.2.0 netmask 255.255.255.0 {
        option routers             192.168.2.1;
        option subnet-mask         255.255.255.0;

The target machine correctly obtains an IP address (statically assigned) when PXE booting into the FOG menu. A truncated snippet of the log files is shown here

Dec 06 16:40:01  DHCPOFFER on 192.168.2.89 to 48:4d:7e:d5:66:a5 via 192.168.2.1
Dec 06 16:40:02  DHCPDISCOVER from 48:4d:7e:d5:66:a5 via 192.168.2.1
Dec 06 16:40:02  DHCPOFFER on 192.168.2.89 to 48:4d:7e:d5:66:a5 via 192.168.2.1
Dec 06 16:40:04  DHCPREQUEST for 192.168.2.89 (192.168.0.20) from 48:4d:7e:d5:66:a5 via 192.168.2.1
Dec 06 16:40:04  DHCPACK on 192.168.2.89 to 48:4d:7e:d5:66:a5 via 192.168.2.1

The problem is that when one attempts to register the host, there is no response on interface eth0.

Perhaps I should step back and not run dnsmasq in the first place, i.e. simplify the setup ? In fact if I stop the dnsmasq service (on the FOG server), I can boot into the FOG menu
The log files on the FOG server don’t provide any details as to what is going on here. Perhaps I need to tweak the FOG server configuration to explicitly tell it where to find the DHCP server and other relevant details ?

Sebastian Roth · Dec 6, 2018, 7:32 AM

@jgiovann said in Clients PXE booting from another subnet:

Perhaps I should step back and not run dnsmasq in the first place, i.e. simplify the setup ? In fact if I stop the dnsmasq service (on the FOG server), I can boot into the FOG menu

Yes, as you seem to have your other DHCP server set up correctly now, you don’t need dnsmasq anymore. I don’t think it causes the issue you have right now, but it is better to disable the service.

About the issue on registration. On boot the client checks if it can reach the FOG server via HTTP. You can manually do that. Boot up a Windows client in the 192.168.2.0 network and open the FOG web UI URL in the browser. Does it work?

jgiovann · Dec 6, 2018, 11:41 PM

@Sebastian-Roth

Getting closer …

When I enter the URL http://192.168.0.87//fog//management/index.php from a client in the 192.168.2.0/24 subnet (as reported by the host registration step) I get the default FOG Project login screen (not able to attach a screenshot). i.e. I can connect to the URL

Note:

The URL is redirected from http://192.168.0.87//fog//management/index.php to http://192.168.0.87//fog//management/index.php
Are the double slashes significant in the URL ?
I also checked for TCP connections on the FOG server. There are no TCP connections on port 80 via IPv4. The primary DHCP server is configured for IPv4 (not IPv6).

# netstat -ant | grep -v 127.0.0.1 | head -15
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:60313           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:20048           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:38871           0.0.0.0:*               LISTEN     
tcp        0      0 192.168.0.87:22         192.168.2.12:44596      ESTABLISHED
tcp        0      0 192.168.0.87:50226      192.168.0.216:389       ESTABLISHED
tcp        0      0 192.168.0.87:22         192.168.2.12:39532      ESTABLISHED
tcp        0      0 192.168.0.87:22         192.168.2.12:39414      ESTABLISHED
tcp6       0      0 ::1:25                  :::*                    LISTEN     

# netstat -ant6 | grep -v 127.0.0.1 | head -15
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp6       0      0 ::1:25                  :::*                    LISTEN     
tcp6       0      0 :::443                  :::*                    LISTEN     
tcp6       0      0 :::56638                :::*                    LISTEN     
tcp6       0      0 :::2049                 :::*                    LISTEN     
tcp6       0      0 :::39500                :::*                    LISTEN     
tcp6       0      0 :::111                  :::*                    LISTEN     
tcp6       0      0 :::80                   :::*                    LISTEN     
tcp6       0      0 :::20048                :::*                    LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
tcp6       0      0 192.168.0.87:80         192.168.2.12:34100      TIME_WAIT  
tcp6       0      0 192.168.0.87:80         192.168.0.87:53770      TIME_WAIT  
tcp6       0      0 192.168.0.87:80         192.168.0.87:53604      TIME_WAIT  
tcp6       0      0 192.168.0.87:80         192.168.2.12:34116      TIME_WAIT

I can provide more details.

Sebastian Roth · Dec 7, 2018, 7:57 AM

@jgiovann said in Clients PXE booting from another subnet:

The URL is redirected from http://192.168.0.87//fog//management/index.php to http://192.168.0.87//fog//management/index.php

From what I see the URLs are exactly the same. So how would there be a redirect? Please post again those URLs. We do redirecting in some cases but I can’t think of that causing the issue for you.

Are the double slashes significant in the URL ?
No, they should not necessarily be there, but we have had those in the scripts for a long time and they did not seem to cause trouble.

I have just had a look at the scripts again and I wonder why it is showing the URL including that “/management/” part. From what I remember that should not be the case. I cannot remember off the top of my head if it’s in the storage node or general FOG settings. I think it’s the latter. Please check if you have changed those. The web root is usually just fog…

Sebastian Roth · Dec 7, 2018, 10:36 AM

@jgiovann Ah sorry, I just saw that I had misread one of your posts. Looking back at the older ones I see that your client seemed to try to connect to http://192.168.0.87/fog//index.php, which is the right URL, not the one you posted last…

Now that I think of it, it’s probably best that you test this URL again and watch the Apache access logs. On your FOG server run tail -f /var/log/apache2/access.log (Debian/Ubuntu) or tail -f /var/log/httpd/access_log (CentOS/Fedora/RHEL), hit ENTER twice to mark where the log currently ends, and then open http://192.168.0.87/fog//index.php in your browser from the client in the 192.168.2.0/24 network. You should see the request coming in. Now quickly PXE boot another client that you want to register, choose quick register and keep an eye on the access logs while it boots up. Keep hitting ENTER on the access log and see if you can find an entry from that client. If you don’t see the client’s request during PXE boot, we probably need to check whether you have some weird firewall rules blocking only some of the 192.168.2.0/24 clients?!
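
If watching tail -f by eye is awkward, the same check can be sketched in Python: filter the access log for one client IP and list what it requested. This is only a rough sketch assuming the usual Apache combined log format; the two sample lines are made up for illustration (they are not real log output), and requests_from is a hypothetical helper name.

```python
import re

# Matches the start of an Apache combined-format line:
# client IP, two ignored fields, [timestamp], "METHOD path ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)'
)

def requests_from(lines, client_ip):
    """Return (time, method, path) for every request made by client_ip."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("ip") == client_ip:
            hits.append((m.group("time"), m.group("method"), m.group("path")))
    return hits

# Illustrative lines only; on the server you would read the real access log instead.
sample = [
    '192.168.2.12 - - [07/Dec/2018:10:40:12 +1100] "GET /fog/management/index.php HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.168.2.89 - - [07/Dec/2018:10:41:03 +1100] "GET /fog/service/ipxe/bzImage HTTP/1.1" 200 8430224 "-" "iPXE/1.0.0+"',
]

for when, method, path in requests_from(sample, "192.168.2.89"):
    print(when, method, path)
```

To run it against the live log, replace sample with open("/var/log/httpd/access_log") (or the Debian/Ubuntu path mentioned above). An empty result for the client after it has left the iPXE stage would suggest its requests never reached Apache.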

jgiovann · Dec 12, 2018, 5:43 AM

@Sebastian-Roth I rebuilt the server from scratch using the latest stable version of FOG (1.5.5). Also checked the firewall between the 2 subnets - there are no rules blocking communication between the client and FOG server. I also disabled SELinux and stopped the firewall on the FOG server (for the time being).

Continuous pings to the target client show that the interface is correctly assigned an IP address after it boots into the FOG menu. However as soon as the registration process is launched, the client loses connectivity and is no longer able to communicate with the FOG server

64 bytes from 192.168.2.89: icmp_seq=85 ttl=64 time=40.2 ms
64 bytes from 192.168.2.89: icmp_seq=86 ttl=64 time=28.0 ms
64 bytes from 192.168.2.89: icmp_seq=87 ttl=64 time=0.301 ms
... (at this point the registration process is launched)
From 192.168.2.12 icmp_seq=128 Destination Host Unreachable
From 192.168.2.12 icmp_seq=129 Destination Host Unreachable
From 192.168.2.12 icmp_seq=130 Destination Host Unreachable

I’m attaching snippets of both the error log

[Wed Dec 12 10:36:40.973608 2018] [core:notice] [pid 5259] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Wed Dec 12 16:12:01.720797 2018] [mpm_prefork:notice] [pid 5259] AH00170: caught SIGWINCH, shutting down gracefully
[Wed Dec 12 16:12:34.842582 2018] [core:notice] [pid 5266] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
[Wed Dec 12 16:12:34.852140 2018] [suexec:notice] [pid 5266] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Wed Dec 12 16:12:34.954624 2018] [auth_digest:notice] [pid 5266] AH01757: generating secret for digest authentication ...
[Wed Dec 12 16:12:34.956680 2018] [lbmethod_heartbeat:notice] [pid 5266] AH02282: No slotmem from mod_heartmonitor
[Wed Dec 12 16:12:35.121620 2018] [mpm_prefork:notice] [pid 5266] AH00163: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_fcgid/2.3.9 PHP/5.6.39 configured -- resuming normal operations
[Wed Dec 12 16:12:35.121645 2018] [core:notice] [pid 5266] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'

… and the access_log (filtered by the target client machine)

192.168.2.89 - - [12/Dec/2018:16:08:04 +1100] "GET /fog/service/ipxe/init.xz HTTP/1.1" 200 19286132 "-" "iPXE/1.0.0+ (960d1)"
192.168.2.89 - - [12/Dec/2018:16:12:41 +1100] "POST /fog/service/ipxe/boot.php HTTP/1.1" 200 2701 "-" "iPXE/1.0.0+ (960d1)"
192.168.2.89 - - [12/Dec/2018:16:12:42 +1100] "GET /fog/service/ipxe/bg.png HTTP/1.1" 200 21280 "-" "iPXE/1.0.0+ (960d1)"
192.168.2.89 - - [12/Dec/2018:16:14:54 +1100] "GET /fog/service/ipxe/bzImage HTTP/1.1" 200 8430224 "-" "iPXE/1.0.0+ (960d1)"
192.168.2.89 - - [12/Dec/2018:16:14:54 +1100] "GET /fog/service/ipxe/init.xz HTTP/1.1" 200 19286132 "-" "iPXE/1.0.0+ (960d1)"

I double checked the URL redirect. I meant to say that http://192.168.0.87//fog//index.php is re-directed to http://192.168.0.87//fog//management/index.php. Is this the correct URL ? The re-directed URL is the management login page.

As a final option, could I add a 2nd network interface on the FOG server which has an IP address in the 192.168.2.0/24 subnet ?

Sebastian Roth · Dec 12, 2018, 7:45 AM

@jgiovann said in Clients PXE booting from another subnet:

Continuous pings to the target client show that the interface is correctly assigned an IP address after it boots into the FOG menu. However as soon as the registration process is launched, the client loses connectivity and is no longer able to communicate with the FOG server.

I have seen clients receive a different IP address at different stages of the PXE boot process. In that whole process the client requests an address from the DHCP server three times: first the PXE ROM of your network card, second iPXE, and last the Linux kernel. There should be no difference in the DHCP information the client gets for each of those three stages, but you never know. Maybe there is another rogue DHCP server in your network, or a replicating DHCP server setup that is playing tricks.

Knowing exactly what DHCP information is sent is key here, I suppose. Set up a mirror port to capture the client port traffic using Wireshark.

If that is asking too much of you, we could maybe do a TeamViewer session. The other thing you could check is when exactly the ping stops. It should stop and pick up again several times if the IP does not change. Check your DHCP logs or leases to see which IP it receives. Also pay attention during boot-up; the Linux part should show the IP it gets as well.

Then see if you can find the HTTP request made by the Linux FOS client after DHCP. It should be a so-called HTTP HEAD request.
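
Checking the DHCP logs for a changing address can be sketched in Python as well. The snippet below groups the addresses handed out per MAC, using the dhcpd log lines quoted earlier in this thread as input. It assumes the "DHCPOFFER on IP to MAC" / "DHCPACK on IP to MAC" line shape seen in those lines; addresses_by_mac is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the offer/ack lines as they appear in the dhcpd log quoted in this thread.
LEASE_EVENT = re.compile(
    r"(?P<event>DHCPOFFER|DHCPACK) on (?P<ip>\d+\.\d+\.\d+\.\d+) "
    r"to (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)

def addresses_by_mac(log_lines):
    """Map each client MAC to the set of IPs it was offered or acknowledged."""
    seen = defaultdict(set)
    for line in log_lines:
        m = LEASE_EVENT.search(line)
        if m:
            seen[m.group("mac").lower()].add(m.group("ip"))
    return dict(seen)

log_lines = [
    "Dec 06 16:40:01  DHCPOFFER on 192.168.2.89 to 48:4d:7e:d5:66:a5 via 192.168.2.1",
    "Dec 06 16:40:02  DHCPDISCOVER from 48:4d:7e:d5:66:a5 via 192.168.2.1",
    "Dec 06 16:40:04  DHCPACK on 192.168.2.89 to 48:4d:7e:d5:66:a5 via 192.168.2.1",
]

for mac, ips in addresses_by_mac(log_lines).items():
    status = "consistent" if len(ips) == 1 else "CHANGED between requests"
    print(mac, sorted(ips), status)
```

A MAC that maps to more than one address across the three boot stages would point towards a second DHCP server or a lease problem; a single address, as in the lines above, points elsewhere.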

jgiovann · Dec 18, 2018, 5:53 AM

@Sebastian-Roth It turns out I was getting the same issue even when the client and FOG server resided on the same subnet. However, it was not a problem in a virtual environment.

… after capturing port traffic with Wireshark and doing some investigation, it turns out that turning off the spanning-tree protocol on the provisioning port allowed the client to register with the FOG server.

In a real production network (where turning off spanning tree is not allowed), is it therefore possible to re-configure the FOG server to wait longer or retry more times before it gives up the registration retry loop ?

Sebastian Roth · Dec 18, 2018, 7:39 AM

@jgiovann Great that you figured out this is a spanning tree thing! It seems I was so focused on the idea that this might be a routing problem that I didn’t notice the message udhcpc: no lease, failing.

In a real production network (where turning off spanning tree is not allowed), is it therefore possible to re-configure the FOG server to wait longer or retry more times before it gives up the registration retry loop ?

I don’t think the problem can be solved in the registration retry loop; we’d need to tell the DHCP client to wait longer. But from my point of view you should be able to solve this by setting the client ports to “port fast”. There are different names for this, but what it essentially does is disable spanning tree for particular ports where you know for sure that only clients are connected and no other switches. On those ports you never need spanning tree, because a single client connected to a port can never cause a loop (which is what spanning tree was invented to prevent)!
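
The timing mismatch behind this can be shown with some rough arithmetic in Python. The numbers are assumptions, not measurements from this thread: 15 seconds is the classic 802.1D default forward delay, applied once for the listening state and once for the learning state, and the 3-second spacing between discovers is a guess. The thread only shows that udhcpc sent three discovers before giving up.

```python
# Assumed values, not taken from the thread:
STP_FORWARD_DELAY_S = 15           # classic 802.1D default forward delay per state
STP_STATES_BEFORE_FORWARDING = 2   # listening, then learning

def port_blocked_seconds(forward_delay=STP_FORWARD_DELAY_S,
                         states=STP_STATES_BEFORE_FORWARDING):
    """Seconds a port without port fast drops traffic after link-up."""
    return forward_delay * states

def dhcp_window_seconds(tries, seconds_between_tries):
    """Rough time the DHCP client keeps trying before it gives up."""
    return tries * seconds_between_tries

blocked = port_blocked_seconds()
# Three discovers are visible in the error output; the 3 s spacing is an assumption.
window = dhcp_window_seconds(tries=3, seconds_between_tries=3)

print(f"port forwards traffic after about {blocked} s")
print(f"DHCP client gives up after about {window} s")
if window < blocked:
    print("the client gives up before the port starts forwarding")
```

Under these assumptions every discover is sent while the port is still blocked, which matches the "no lease, failing" message. With port fast (or the vendor's equivalent) the port forwards almost immediately, so the first discover already gets through.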
