FOG Multicast issue - stuck on starting to restore image - version 7547
-
@arainero 6:30 works for me. Shoot me a PM then.
-
After running some tests with @Wayne-Workman some progress was made!
The udp-sender test from the wiki wasn’t working until we forced the command to utilize the broadcast address on the FOG server NIC. After doing that the udp test worked.
After doing that we changed the broadcast address under the multicast settings in FOG Settings. We no longer got the “blue screen” failure, but a different type of failure and more activity in the logs.
Here is what the multicast log shows:
[06-22-16 8:37:55 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:55 pm] | 1 task found [06-22-16 8:37:55 pm] | 0 tasks to be cleaned [06-22-16 8:37:45 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:45 pm] | 1 task found [06-22-16 8:37:45 pm] | 0 tasks to be cleaned [06-22-16 8:37:35 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:35 pm] | 1 task found [06-22-16 8:37:35 pm] | 0 tasks to be cleaned Broadcasting control to 224.0.0.1 UDP sender for (stdin) at 192.168.1.3 on eth0 Udp-sender 20120424 cat: write error: Broken pipe Could not broadcast data packet to 192.168.1.255:64478 (Permission denied) Starting transfer: 00000009 New connection from 192.168.1.122 (#5) 00000009 New connection from 192.168.1.123 (#4) 00000009 New connection from 192.168.1.124 (#3) 00000009 [06-22-16 8:37:25 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:25 pm] | 1 task found [06-22-16 8:37:25 pm] | 0 tasks to be cleaned New connection from 192.168.1.120 (#2) 00000009 New connection from 192.168.1.121 (#1) 00000009 New connection from 192.168.1.119 (#0) 00000009 [06-22-16 8:37:15 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:15 pm] | 1 task found [06-22-16 8:37:15 pm] | 0 tasks to be cleaned [06-22-16 8:37:05 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:37:05 pm] | 1 task found [06-22-16 8:37:05 pm] | 0 tasks to be cleaned [06-22-16 8:36:55 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:36:55 pm] | 1 task found [06-22-16 8:36:55 pm] | 0 tasks to be cleaned [06-22-16 8:36:45 pm] | Task (3) Multi-Cast Task is already running PID 8514 [06-22-16 8:36:45 pm] | 1 task found [06-22-16 8:36:45 pm] | 0 tasks to be cleaned [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task has started. [06-22-16 8:36:35 pm] | CMD: cat /images/6_16_16/d1p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 64478 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d1p2.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 64478 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d2p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 64478 --full-duplex --ttl 32 --nokbd --nopointopoint; [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task sending on base port: 64478 [06-22-16 8:36:35 pm] | Task (3) 6 client(s) found. [06-22-16 8:36:35 pm] | Task (3) /images/6_16_16 image file found. [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task has been cleaned. Broadcasting control to 224.0.0.1 UDP sender for (stdin) at 192.168.1.3 on eth0 Udp-sender 20120424 [06-22-16 8:36:35 pm] | Task (3) Multi-Cast Task is new! [06-22-16 8:36:35 pm] | 1 task found [06-22-16 8:36:35 pm] | 0 tasks to be cleaned [06-22-16 8:36:25 pm] | Sleeping for 10 seconds to ensure tasks are properly submitted [06-22-16 8:36:15 pm] * No tasks found! [06-22-16 8:36:05 pm] * No tasks found! [06-22-16 8:35:55 pm] * No tasks found! [06-22-16 8:35:45 pm] * No tasks found! [06-22-16 8:35:35 pm] * No tasks found! [06-22-16 8:35:25 pm] * No tasks found! [06-22-16 8:35:15 pm] * No tasks found! [06-22-16 8:35:05 pm] * No tasks found! [06-22-16 8:34:55 pm] * No tasks found! [06-22-16 8:34:45 pm] | Task (2) Multi-Cast Task has been cleaned as cancelled. [06-22-16 8:34:45 pm] | Task (2) Multi-Cast Task is no longer running. [06-22-16 8:34:45 pm] | 1 task found [06-22-16 8:34:45 pm] | 0 tasks to be cleaned [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task has started. [06-22-16 8:34:35 pm] | CMD: cat /images/6_16_16/d1p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d1p2.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d2p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 6 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint; cat: write error: Broken pipe bind socket to 192.168.1.3:9001 (Address already in use) Udp-sender 20120424 cat: write error: Broken pipe bind socket to 192.168.1.3:9001 (Address already in use) Udp-sender 20120424 [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task sending on base port: 9000 : Broken pipe cat: write error[06-22-16 8:34:35 pm] | Task (2) 6 client(s) found. bind socket to 192.168.1.3:9001 (Address already in use) Udp-sender 20120424 [06-22-16 8:34:35 pm] | Task (2) /images/6_16_16 image file found. [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task has been cleaned. [06-22-16 8:34:35 pm] | Task (2) Multi-Cast Task is new! [06-22-16 8:34:35 pm] | 1 task found [06-22-16 8:34:35 pm] | 0 tasks to be cleaned [06-22-16 8:34:25 pm] | Sleeping for 10 seconds to ensure tasks are properly submitted [06-22-16 8:34:15 pm] * No tasks found! [06-22-16 8:34:05 pm] * No tasks found! [06-22-16 8:33:55 pm] * No tasks found! [06-22-16 8:33:45 pm] * No tasks found! [06-22-16 8:33:35 pm] * No tasks found! [06-22-16 8:33:25 pm] * No tasks found! [06-22-16 8:33:15 pm] * No tasks found! [06-22-16 8:33:05 pm] * No tasks found! [06-22-16 8:32:55 pm] * No tasks found! [06-22-16 8:32:45 pm] * No tasks found! [06-22-16 8:32:35 pm] * No tasks found! [06-22-16 8:32:25 pm] * No tasks found! [06-22-16 8:32:15 pm] * No tasks found! [06-22-16 8:32:05 pm] | Task (1) Multi-Cast Task has been cleaned as cancelled. [06-22-16 8:32:05 pm] | Task (1) Multi-Cast Task is no longer running. [06-22-16 8:32:05 pm] | 1 task found [06-22-16 8:32:05 pm] | 0 tasks to be cleaned [06-22-16 8:31:55 pm] | Task (1) Multi-Cast Task has started. [06-22-16 8:31:55 pm] | CMD: cat /images/6_16_16/d1p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 2 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d1p2.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 2 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;cat /images/6_16_16/d2p1.img | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 2 --max-wait 60 --mcast-data-address 192.168.1.255 --portbase 9000 --full-duplex --ttl 32 --nokbd --nopointopoint;
-
Particularly, these lines from the log are most interesting:
cat: write error: Broken pipe Could not broadcast data packet to 192.168.1.255:64478 (Permission denied) Starting transfer: 00000009
Also, the invalid session is concerning as well.
-
Looking at the source code from here, this is where the error comes from:
https://www.udpcast.linux.lu/downloads.htmlsenddata.c
static int sendRawData(int sock, struct net_config *config, char *header, int headerSize, unsigned char *data, int dataSize) { struct iovec iov[2]; struct msghdr hdr; int packetSize; int ret; iov[0].iov_base = header; iov[0].iov_len = headerSize; iov[1].iov_base = data; iov[1].iov_len = dataSize; hdr.msg_name = &config->dataMcastAddr; hdr.msg_namelen = sizeof(struct sockaddr_in); hdr.msg_iov = iov; hdr.msg_iovlen = 2; initMsgHdr(&hdr); packetSize = dataSize + headerSize; rgWaitAll(config, sock, config->dataMcastAddr.sin_addr.s_addr, packetSize); ret = sendmsg(sock, &hdr, 0); if (ret < 0) { char ipBuffer[16]; udpc_fatal(1, "Could not broadcast data packet to %s:%d (%s)\n", getIpString(&config->dataMcastAddr, ipBuffer), getPort(&config->dataMcastAddr), strerror(errno)); } return 0; }
-
I tried a few different variations of changing the multicast address with no luck. I also tried a different kernel to see if that would help either, but it didn’t appear to.
@Developers anything else you suggest I try to do or provide on this post that will help?
-
Could not broadcast data packet to 192.168.1.255:64478 (Permission denied)
Well, you might want to try disabling your firewall for the moment to see if it works…
-
@Sebastian-Roth thank you for the suggestion. However, selinux and iptables on the server have both been turned off since FOG was installed. I checked to make sure firewalld was not running, but it is not available in CentOS 6.5. Is there a different service I should check to make sure it is not restricting it?
I don’t believe it is router/firewall related since I’ve never had issues with multicasting with it in the past and all inter LAN traffic is allowed. I am going to triple check to make sure nothing on the router would impede multicast traffic.
Also, is using 192.168.1.255 an appropriate setting to use for the multicast address under multicast settings? It was the only address that would work with the manual test from the wiki which was odd. When left on default (which is 0 or using 0.0.0.0) it wouldn’t work.
-
@arainero Have you done the multicast tests by hand yet? https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_Downloading_-_Multicast#Testing_Multicast
What happens if you leave the multicast address setting empty? In theory it should not pass the
--mcast-data-address
to the udp-sender tool and the tool should use the address it thinks is correct. Don’t set0
or0.0.0.0
but make it empty. -
@Sebastian-Roth We did run the manual test, when we leave the destination address blank, it doesn’t work and says it’s sending to 2xx.xxx.xxx.xxx something (can’t remember).
-
As you can see here multicast has it’s very own “address space”. It’s designed to use those addresses to work properly I suppose. I guess you can use different addresses if you really know what you are doing. But using a subnet broadcast address does not sound right to me.
-
All the mcast-data-address does, to my knowledge, is tell the client what address to try to grab the UDP packets from. It’s still going to use multicast addressing to grab it. So I think @Sebastian-Roth is correct. The mcast-data-address should NOT be a broadcast/network address, but the direct location to the host trying to send the data to begin with. The mcast-rdv-address will allow you to define the actual multicast address to use for the session, and should remain in the 224 pool (though there could be any number of reasons you need to change it to something else.)
Then again, I’m a total idiot when it comes to fully understand UDP and its actions. What I do know mcast-data-address will try to derive that ip address’s multicast address from where it’s pointing. This is NOT where you want to be trying to define the multicast network it’s going across.
To quote the man page in regards to these two separate arguments:
–mcast-data-address address
Uses the given address for multicasting the data. If not specified, the program will automatically derive a multicast address from its own IP (by keeping the last 27 bits of the IP and then prepending 232).–mcast-rdv-address address
Uses a non-standard multicast address for the control (rendez-vous) connection. This address is used by the sender and receivers to “find” each other. This is not the address that is used to transfer the actual data.
By default “mcast-rdv-address” is the Ethernet broadcast address if “ttl” is 1, and 224.0.0.1 otherwise. This setting should not be used except in very special situations, such as when 224.0.0.1 cannot be used for policy reasons. -
Ok then. All that makes sense, I learn something new every day here.
However - manually doing a udp-cast test without specifying an address failed, while specifying the broadcast address for this fog server’s broadcast domain worked.
Ideas?
-
I need details on the exact steps.
For example, if you setup the client to receive data BEFORE the udp-sender was established, the client side will NEVER connect to anything.
-
I’m sure that @arainero would be happy to re-do the test, and provide exact steps.
-
Sorry for the delay, here are the exact steps:
For testing 1 client:
- Boot one computer into debug
- Enter
udp-sender --file /opt/fog/.fogsettings --log /opt/fog/log/multicast.log --ttl 1 --nopointopoint
into server and press enter. I now see
[root@fogserv ~]# udp-sender --file /opt/fog/.fogsettings --log /opt/fog/log/multicast.log --ttl 1 --nopointopoint Udp-sender 20120424 Using full duplex mode
- Go to client computer and type this
udp-receiver
. On the client this immediately shows
Udp-receiver 20120424 UDP receiver for (stdout) at 192.168.1.118 on eth0 received message, cap=00000009 Connected as #0 to 192.168.1.3 Listening to multicast on 232.168.1.3 Press any key to start receiving data!
On the server this immediately shows
New connection from 192.168.1.118 (#0) 00000009 Ready. Press any key to start sending data.
- I press enter on the server. Nothing happens.
- I now press enter on the client. I get
Sending go signal
Nothing happens on the server. I now press enter on the server and still nothing happens.
I CTRL-C out on the client and see
[root@fogserv ~]# re-xmits=0000000 ( 0.0%) slice=0112 - 0
appear on the server. I CTRL-C on server and I am back to the prompt.
If I repeat the steps to step 4 and hit enter on the client first instead of the server I get the same results.
If I do the above steps again but use
udp-sender --file /opt/fog/.fogsettings --log /opt/fog/log/multicast.log --ttl 1 --nopointopoint --mcast-data-address 192.168.1.255
instead of the wiki command and hit enter on the client first I see the fog settings file displayed on the client. If I repeat that and hit enter on the server first I see the settings displayed. This also works with a 2nd client computer and does not matter if I hit enter on the client or server first.
The
--mcast-data-address 192.168.1.255
seems to be the magic argument that makes it “work” for the test scenario.Was this enough to determine anything or would you want me to do anything else to help narrow it down? If anyone wants to take a look through teamviewer or what-have-you I am more than open.
Thanks again everyone and enjoy the 4th!
-
Should I update to trunk and do the same tests or hold off on that for now?
-
@arainero What about using the options you see in the logs?
cat /path/to/file | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 1 --max-wait 600 --portbase 65102 --full-duplex --ttl 32 --nokbd --nopointopoint
-
If I use this modifed version of that line from the log it works.
cat /opt/fog/.fogsettings | /usr/local/sbin/udp-sender --interface eth0 --min-receivers 1 --max-wait 600 --full-duplex --nokbd --nopointopoint --mcast-data-address 192.168.1.255
I had to remove portbase and the ttl flags and add the --mcast-data-address 192.168.1.255 flag. After doing that I was able to receive the fog settings files.
I tried the command with just removing the portbase and ttl flags and not adding --mcast-data-address 192.168.1.255, but that didnt work.
-
@arainero Why did you have to remove
--ttl
and--portbase
?? I still think something is wrong with the network setup. From the udp-cast man page I get that it would default to use 232.168.1.x (where x would be the last byte of the server IP address). Possibly your switch is layer 3 switch and does not allow those “out of range” IPs? -
@Sebastian-Roth I removed those to test to see if different scenarios would work. I tested with --mcast-data-address 232.168.1.255, 232.168.1.1, 232.168.1.0, etc, and those worked too.
I actually added --ttl 32 and that worked. It’s just when I add --portbase 65102 it stops working. I tried a few different port variations with no luck.
The switches in the letwork are layer 2. Multicast used to work fine and there have been no changes to the network. I will double check the switch configurations again though.