FOG Multicast issue - stuck on starting to restore image - version 7547
-
I tried running
/opt/fog/.fogsettings --log /opt/fog/log/multicast.log --ttl 1 --nopointopoint --portbase 9000
but the same thing happened. I can see the client connecting, but when the client presses any key nothing happens. The only thing that changed is that when the client presses ctrl-c I seere-xmits=0000000 ( 0.0%) slice=0112 - 0
.
display on the server. That didn’t happen with the previous command.Also, I do not have iptables, selinux, or firewalld running so it isn’t a rule issue I believe.
Here is what was in the multicast log.
11:33:51.058257 Disconnecting #0 (192.168.1.127) 11:33:51.058848 Transfer complete. 11:33:51.058925 [06-08-16 11:33:54 am] | 0 tasks to be cleaned [06-08-16 11:33:54 am] | 1 task found [06-08-16 11:34:04 am] | 0 tasks to be cleaned [06-08-16 11:34:04 am] | 1 task found [06-08-16 11:34:14 am] | 0 tasks to be cleaned [06-08-16 11:34:14 am] | 1 task found [06-08-16 11:34:24 am] | 0 tasks to be cleaned [06-08-16 11:34:24 am] | 1 task found [06-08-16 11:34:34 am] | 0 tasks to be cleaned [06-08-16 11:34:34 am] | 1 task found [06-08-16 11:34:44 am] | 0 tasks to be cleaned [06-08-16 11:34:44 am] | 1 task found [06-08-16 11:34:54 am] | 0 tasks to be cleaned [06-08-16 11:34:54 am] | 1 task found [06-08-16 11:35:04 am] | 0 tasks to be cleaned [06-08-16 11:35:04 am] | 1 task found [06-08-16 11:35:14 am] | 0 tasks to be cleaned [06-08-16 11:35:14 am] | 1 task found [06-08-16 11:35:24 am] | 0 tasks to be cleaned [06-08-16 11:35:24 am] | 1 task found 11:35:27.212477 Using mcast address 232.168.1.3 11:35:27.212572 UDP sender for /opt/fog/.fogsettings at 192.168.1.3 on eth0 11:35:27.212588 Broadcasting control to 192.168.1.255 [06-08-16 11:35:34 am] | 0 tasks to be cleaned [06-08-16 11:35:34 am] | 1 task found [06-08-16 11:35:44 am] | 0 tasks to be cleaned [06-08-16 11:35:44 am] | 1 task found [06-08-16 11:35:54 am] | 0 tasks to be cleaned [06-08-16 11:35:54 am] | 1 task found [06-08-16 11:36:04 am] | 0 tasks to be cleaned [06-08-16 11:36:04 am] | 1 task found 11:36:11.950410 Starting transfer: 00000009 11:36:12.028143 Timeout notAnswered=[0] notReady=[0] nrAns=0 nrRead=0 nrPart=1 avg=10000 11:36:14.094665 Timeout notAnswered=[0] notReady=[0] nrAns=0 nrRead=0 nrPart=1 avg=10000 [06-08-16 11:36:14 am] | 0 tasks to be cleaned [06-08-16 11:36:14 am] | 1 task found 11:36:20.161200 Timeout notAnswered=[0] notReady=[0] nrAns=0 nrRead=0 nrPart=1 avg=10000 [06-08-16 11:36:24 am] | 0 tasks to be cleaned [06-08-16 11:36:24 am] | 1 task found 11:36:26.227781 Timeout notAnswered=[0] notReady=[0] nrAns=0 nrRead=0 nrPart=1 avg=10000 11:36:26.894416 Disconnecting #0 (192.168.1.127) 11:36:26.894806 Transfer complete. 11:36:26.894865 [06-08-16 11:36:34 am] | 0 tasks to be cleaned [06-08-16 11:36:34 am] | 1 task found [06-08-16 11:36:44 am] | 0 tasks to be cleaned```
-
@arainero the logs say there was a timeout. I suspect a network problem.
Is it possible for you to connect the fog server and a Linux client via a mini switch (live boot) and try? If fog isn’t doing dhcp then you will need to configure a static ip on the live boot before testing. This test will eliminate your network as a variable to test just the server. If it succeeds then the issue is a network one. If it fails then it Is a server one.
-
@Wayne-Workman
When you say live boot what do you mean exactly? I currently have my pfsense router handling DHCP and network-boot requests.What is the best way to have FOG work when removing the router from this situation? I know I can just assign each server their own IP and do a direct connection, but I don’t think the network boot would work then, right?
-
@arainero Maybe it’s not the best idea given your setup.
I’m willing to take a look via TeamViewer around 4:30 Central Standard if you are free/around/being paid.
-
I would love to but I won’t be available until around 6:30 pm central. Would tomorrow work better for you?
-
@arainero 6:30 works for me. Shoot me a PM then.
-
After running some tests with @Wayne-Workman some progress was made!
The udp-sender test from the wiki wasn’t working until we forced the command to utilize the broadcast address on the FOG server NIC. After doing that the udp test worked.
After doing that we changed the broadcast address under the multicast settings in FOG Settings. We no longer got the “blue screen” failure, but a different type of failure and more activity in the logs.
Here is what the multicast log shows:
[06-22-16 8:37:55 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:55 pm] | 1 task found [06-22-16 8:37:55 pm] | 0 tasks to be cleaned [06-22-16 8:37:45 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:45 pm] | 1 task found [06-22-16 8:37:45 pm] | 0 tasks to be cleaned [06-22-16 8:37:35 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:35 pm] | 1 task found [06-22-16 8:37:35 pm] | 0 tasks to be cleaned Broadcasting control to 224.0.0.1 UDP sender for (stdin) at 192.168.1.3 on eth0 Udp-sender 20120424 cat: write error: Broken pipe Could not broadcast data packet to 192.168.1.255:64478 (Permission denied) Starting transfer: 00000009 New connection from 192.168.1.122 (#5) 00000009 New connection from 192.168.1.123 (#4) 00000009 New connection from 192.168.1.124 (#3) 00000009 [06-22-16 8:37:25 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:25 pm] | 1 task found [06-22-16 8:37:25 pm] | 0 tasks to be cleaned New connection from 192.168.1.120 (#2) 00000009 New connection from 192.168.1.121 (#1) 00000009 New connection from 192.168.1.119 (#0) 00000009 [06-22-16 8:37:15 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:15 pm] | 1 task found [06-22-16 8:37:15 pm] | 0 tasks to be cleaned [06-22-16 8:37:05 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:05 pm] | 1 task found [06-22-16 8:37:05 pm] | 0 tasks to be cleaned [06-22-16 8:36:55 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:36:55 pm] | 1 task found [06-22-16 8:36:55 pm] | 0 tasks to be cleaned [06-22-16 8:36:45 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:36:45 pm] | 1 task found [06-22-16 8:36:45 pm] | 0 tasks to be cleaned [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task has started. [06-22-16 8:36:35 pm] | CMD: cat /images/6_16_16/d1p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 64478 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d1p2.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 64478 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d2p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 64478 --full-duplex --ttl 32 --nokbd --nopointopoint; [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task sending on base port: 64478 [06-22-16 8:36:35 pm] | Task (3) 6 client(s) found. [06-22-16 8:36:35 pm] | Task (3) /images/6_16_16 image file found. [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task has been cleaned. Broadcasting control to 224.0.0.1 UDP sender for (stdin) at 192.168.1.3 on eth0 Udp-sender 20120424 [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task is new! [06-22-16 8:36:35 pm] | 1 task found [06-22-16 8:36:35 pm] | 0 tasks to be cleaned [06-22-16 8:36:25 pm] | Sleeping for 10 seconds to ensure tasks are properly submitted [06-22-16 8:36:15 pm] * No tasks found! [06-22-16 8:36:05 pm] * No tasks found! [06-22-16 8:35:55 pm] * No tasks found! [06-22-16 8:35:45 pm] * No tasks found! [06-22-16 8:35:35 pm] * No tasks found! [06-22-16 8:35:25 pm] * No tasks found! [06-22-16 8:35:15 pm] * No tasks found! [06-22-16 8:35:05 pm] * No tasks found! [06-22-16 8:34:55 pm] * No tasks found! [06-22-16 8:34:45 pm] | Task (2) Multi-Cast Task has been cleaned as cancelled. [06-22-16 8:34:45 pm] | Task (2) Multi-Cast Task is no longer running. [06-22-16 8:34:45 pm] | 1 task found [06-22-16 8:34:45 pm] | 0 tasks to be cleaned [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task has started. [06-22-16 8:34:35 pm] | CMD: cat /images/6_16_16/d1p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d1p2.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d2p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint; cat: write error: Broken pipe bind socket to 192.168.1.3:9001 (Address already in use) Udp-sender 20120424 cat: write error: Broken pipe bind socket to 192.168.1.3:9001 (Address already in use) Udp-sender 20120424 [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task sending on base port: 9000 : Broken pipe cat: write error[06-22-16 8:34:35 pm] | Task (2) 6 client(s) found. bind socket to 192.168.1.3:9001 (Address already in use) Udp-sender 20120424 [06-22-16 8:34:35 pm] | Task (2) /images/6_16_16 image file found. [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task has been cleaned. [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task is new! [06-22-16 8:34:35 pm] | 1 task found [06-22-16 8:34:35 pm] | 0 tasks to be cleaned [06-22-16 8:34:25 pm] | Sleeping for 10 seconds to ensure tasks are properly submitted [06-22-16 8:34:15 pm] * No tasks found! [06-22-16 8:34:05 pm] * No tasks found! [06-22-16 8:33:55 pm] * No tasks found! [06-22-16 8:33:45 pm] * No tasks found! [06-22-16 8:33:35 pm] * No tasks found! [06-22-16 8:33:25 pm] * No tasks found! [06-22-16 8:33:15 pm] * No tasks found! [06-22-16 8:33:05 pm] * No tasks found! [06-22-16 8:32:55 pm] * No tasks found! [06-22-16 8:32:45 pm] * No tasks found! [06-22-16 8:32:35 pm] * No tasks found! [06-22-16 8:32:25 pm] * No tasks found! [06-22-16 8:32:15 pm] * No tasks found! [06-22-16 8:32:05 pm] | Task (1) Multi-Cast Task has been cleaned as cancelled. [06-22-16 8:32:05 pm] | Task (1) Multi-Cast Task is no longer running. [06-22-16 8:32:05 pm] | 1 task found [06-22-16 8:32:05 pm] | 0 tasks to be cleaned [06-22-16 8:31:55 pm] | Task (1) Multi-Cast Task has started. [06-22-16 8:31:55 pm] | CMD: cat /images/6_16_16/d1p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 2 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d1p2.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 2 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d2p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 2 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;
-
Particularly, these lines from the log are most interesting:
cat: write error: Broken pipe Could not broadcast data packet to 192.168.1.255:64478 (Permission denied) Starting transfer: 00000009
Also, the invalid session is concerning as well.
-
Looking at the source code from here, this is where the error comes from:
https://www.udpcast.linux.lu/downloads.htmlsenddata.c
static int sendRawData(int sock, struct net_config *config, char *header, int headerSize, unsigned char *data, int dataSize) { struct iovec iov[2]; struct msghdr hdr; int packetSize; int ret; iov[0].iov_base = header; iov[0].iov_len = headerSize; iov[1].iov_base = data; iov[1].iov_len = dataSize; hdr.msg_name = &config->dataMcastAddr; hdr.msg_namelen = sizeof(struct sockaddr_in); hdr.msg_iov = iov; hdr.msg_iovlen = 2; initMsgHdr(&hdr); packetSize = dataSize + headerSize; rgWaitAll(config, sock, config->dataMcastAddr.sin_addr.s_addr, packetSize); ret = sendmsg(sock, &hdr, 0); if (ret < 0) { char ipBuffer[16]; udpc_fatal(1, "Could not broadcast data packet to %s:%d (%s)\n", getIpString(&config->dataMcastAddr, ipBuffer), getPort(&config->dataMcastAddr), strerror(errno)); } return 0; }
-
I tried a few different variations of changing the multicast address with no luck. I also tried a different kernel to see if that would help either, but it didn’t appear to.
@Developers anything else you suggest I try to do or provide on this post that will help?
-
Could not broadcast data packet to 192.168.1.255:64478 (Permission denied)
Well, you might want to try disabling your firewall for the moment to see if it works…
-
@Sebastian-Roth thank you for the suggestion. However, selinux and iptables on the server have both been turned off since FOG was installed. I checked to make sure firewalld was not running, but it is not available in CentOS 6.5. Is there a different service I should check to make sure it is not restricting it?
I don’t believe it is router/firewall related since I’ve never had issues with multicasting with it in the past and all inter LAN traffic is allowed. I am going to triple check to make sure nothing on the router would impede multicast traffic.
Also, is using 192.168.1.255 an appropriate setting to use for the multicast address under multicast settings? It was the only address that would work with the manual test from the wiki which was odd. When left on default (which is 0 or using 0.0.0.0) it wouldn’t work.
-
@arainero Have you done the multicast tests by hand yet? https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_Downloading_-_Multicast#Testing_Multicast
What happens if you leave the multicast address setting empty? In theory it should not pass the
--mcast-data-address
to the udp-sender tool and the tool should use the address it thinks is correct. Don’t set0
or0.0.0.0
but make it empty. -
@Sebastian-Roth We did run the manual test, when we leave the destination address blank, it doesn’t work and says it’s sending to 2xx.xxx.xxx.xxx something (can’t remember).
-
As you can see here multicast has it’s very own “address space”. It’s designed to use those addresses to work properly I suppose. I guess you can use different addresses if you really know what you are doing. But using a subnet broadcast address does not sound right to me.
-
All the mcast-data-address does, to my knowledge, is tell the client what address to try to grab the UDP packets from. It’s still going to use multicast addressing to grab it. So I think @Sebastian-Roth is correct. The mcast-data-address should NOT be a broadcast/network address, but the direct location to the host trying to send the data to begin with. The mcast-rdv-address will allow you to define the actual multicast address to use for the session, and should remain in the 224 pool (though there could be any number of reasons you need to change it to something else.)
Then again, I’m a total idiot when it comes to fully understand UDP and its actions. What I do know mcast-data-address will try to derive that ip address’s multicast address from where it’s pointing. This is NOT where you want to be trying to define the multicast network it’s going across.
To quote the man page in regards to these two separate arguments:
–mcast-data-address address
Uses the given address for multicasting the data. If not specified, the program will automatically derive a multicast address from its own IP (by keeping the last 27 bits of the IP and then prepending 232).–mcast-rdv-address address
Uses a non-standard multicast address for the control (rendez-vous) connection. This address is used by the sender and receivers to “find” each other. This is not the address that is used to transfer the actual data.
By default “mcast-rdv-address” is the Ethernet broadcast address if “ttl” is 1, and 224.0.0.1 otherwise. This setting should not be used except in very special situations, such as when 224.0.0.1 cannot be used for policy reasons. -
Ok then. All that makes sense, I learn something new every day here.
However - manually doing a udp-cast test without specifying an address failed, while specifying the broadcast address for this fog server’s broadcast domain worked.
Ideas?
-
I need details on the exact steps.
For example, if you setup the client to receive data BEFORE the udp-sender was established, the client side will NEVER connect to anything.
-
I’m sure that @arainero would be happy to re-do the test, and provide exact steps.
-
Sorry for the delay, here are the exact steps:
For testing 1 client:
- Boot one computer into debug
- Enter
udp-sender --file /opt/fog/.fogsettings --log /opt/fog/log/multicast.log --ttl 1 --nopointopoint
into server and press enter. I now see
[root@fogserv ~]# udp-sender --file /opt/fog/.fogsettings --log /opt/fog/log/multicast.log --ttl 1 --nopointopoint Udp-sender 20120424 Using full duplex mode
- Go to client computer and type this
udp-receiver
. On the client this immediately shows
Udp-receiver 20120424 UDP receiver for (stdout) at 192.168.1.118 on eth0 received message, cap=00000009 Connected as #0 to 192.168.1.3 Listening to multicast on 232.168.1.3 Press any key to start receiving data!
On the server this immediately shows
New connection from 192.168.1.118 (#0) 00000009 Ready. Press any key to start sending data.
- I press enter on the server. Nothing happens.
- I now press enter on the client. I get
Sending go signal
Nothing happens on the server. I now press enter on the server and still nothing happens.
I CTRL-C out on the client and see
[root@fogserv ~]# re-xmits=0000000 ( 0.0%) slice=0112 - 0
appear on the server. I CTRL-C on server and I am back to the prompt.
If I repeat the steps to step 4 and hit enter on the client first instead of the server I get the same results.
If I do the above steps again but use
udp-sender --file /opt/fog/.fogsettings --log /opt/fog/log/multicast.log --ttl 1 --nopointopoint --mcast-data-address 192.168.1.255
instead of the wiki command and hit enter on the client first I see the fog settings file displayed on the client. If I repeat that and hit enter on the server first I see the settings displayed. This also works with a 2nd client computer and does not matter if I hit enter on the client or server first.
The
--mcast-data-address 192.168.1.255
seems to be the magic argument that makes it “work” for the test scenario.Was this enough to determine anything or would you want me to do anything else to help narrow it down? If anyone wants to take a look through teamviewer or what-have-you I am more than open.
Thanks again everyone and enjoy the 4th!