Multicast problem: FOG 0.32 on CentOS 5.6
-
Hello.
I’m having a problem with multicasting on FOG 0.32, CentOS 5.6 (Final) and Windows 7 machines (Dell OptiPlex GX620).
It’s a rather strange problem. I’m trying to multicast to 145 machines, and the first small partition multicasts just fine, but when the second partition is about to start, the machines just sit and wait on the “please wait” screen.
[B]Now the log in /opt/fog/log/ shows the following:[/B]
[SIZE=2]([COLOR=#ff0000]This is the multicast.log.udpcast.28 log, not the one called just “multicast.log”[/COLOR])[/SIZE]
Udp-sender 2007-12-28
Using mcast address 236.21.238.31
UDP sender for (stdin) at 172.21.238.31 on eth0
Broadcasting control to 224.0.0.1
New connection from 172.21.238.123 (#0) 00000009
New connection from 172.21.238.156 (#1) 00000009
New connection from 172.21.238.152 (#2) 00000009
New connection from 172.21.238.151 (#3) 00000009
New connection from 172.21.238.136 (#4) 00000009
etc etc…
[B]Then it shows:[/B]
Starting transfer: 00000009
bytes= 97 552 re-xmits=0000001 ( 1.4%) slice=0066 73 709 551 615 - 132
bytes= 193 648 re-xmits=0000001 ( 0.7%) slice=0066 73 709 551 615 - 131
bytes= 289 744 re-xmits=0000001 ( 0.5%) slice=0066 73 709 551 615 - 131
bytes= 385 840 re-xmits=0000001 ( 0.3%) slice=0066 73 709 551 615 - 131
bytes= 481 936 re-xmits=0000001 ( 0.3%) slice=0066 73 709 551 615 - 131
bytes= 578 032 re-xmits=0000001 ( 0.2%) slice=0066 73 709 551 615 - 131
bytes= 674 128 re-xmits=0000001 ( 0.2%) slice=0066 73 709 551 615 - 130
etc etc…
[B]And then the interesting bits happen:[/B]
bytes= 25 370 800 re-xmits=0000001 ( 0.0%) slice=0066 73 709 551 615 - 130
bytes= 25 375 168 re-xmits=0000001 ( 0.0%) slice=0066 73 709 551 615 - 133
bytes= 25 375 612 re-xmits=0000001 ( 0.0%) slice=0066 73 709 551 615 - 132
Timeout notAnswered=[2,4,7,8,10,13,14,15,18,19,20,21,63,106,110,111,114,115,118,119,120,121,122,124,127,128,129,131,133,134,135,136,138,139,140,142,143,145] nrAns=108 nrRead=108 nrPart=146 avg=3661
Disconnecting #24 (172.21.239.10)
Disconnecting #89 (172.21.238.240)
Disconnecting #78 (172.21.238.143)
Disconnecting #28 (172.21.238.100)
Disconnecting #23 (172.21.238.227)
Disconnecting #31 (172.21.238.132)
Disconnecting #25 (172.21.238.111)
Disconnecting #22 (172.21.238.226)
etc etc…
[B]And then follows:[/B]
Disconnecting #126 (172.21.238.174)
Disconnecting #130 (172.21.238.119)
Disconnecting #131 (172.21.238.194)
Bad command 0300
Bad command 0300
Bad command 0300
Bad command 0300
Bad command 0300
Bad command 0300
etc etc…
[B]After that I’m getting:[/B]
Dropping client #2 because of timeout
Disconnecting #2 (172.21.238.152)
Dropping client #4 because of timeout
Disconnecting #4 (172.21.238.136)
Dropping client #7 because of timeout
Disconnecting #7 (172.21.238.149)
Dropping client #8 because of timeout
Disconnecting #8 (172.21.238.138)
Dropping client #10 because of timeout
Disconnecting #10 (172.21.238.205)
Dropping client #13 because of timeout
Disconnecting #13 (172.21.238.104)
etc etc…
[B]Almost at the end it says:[/B]
Dropping client #142 because of timeout
Disconnecting #142 (172.21.238.120)
Dropping client #145 because of timeout
Disconnecting #145 (172.21.238.173)
Transfer complete.^G
Disconnecting #0 (172.21.238.123)
Disconnecting #1 (172.21.238.156)
Disconnecting #3 (172.21.238.151)
Disconnecting #5 (172.21.238.140)
Disconnecting #6 (172.21.238.129)
Disconnecting #9 (172.21.238.130)
Disconnecting #12 (172.21.238.207)
Disconnecting #16 (172.21.238.211)
Disconnecting #27 (172.21.239.8)
[B]And then finally:[/B]
Udp-sender 2007-12-28
Using mcast address 236.21.238.31
UDP sender for (stdin) at 172.21.238.31 on eth0
Broadcasting control to 224.0.0.1
Any ideas what could be wrong? It complains about timeouts, although all the machines started up within 20 minutes, from the first to the last. After reading [URL=‘http://fogproject.org/forum/threads/multicast-timeout.529/’]this[/URL] post I also checked the UDPSENDER_MAXWAIT setting in Config.php and saw that it is set to 0 (so I guess that means it will wait forever and never time out).
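For anyone reading a log like the one above, a small script can map the client numbers in the "Timeout notAnswered=[...]" line back to IP addresses, which makes it easier to see whether the silent machines have something in common (same switch, same room, same address range). This is only a sketch: the regular expressions are modelled on the lines quoted in this post rather than on any documented udp-sender log format, and the function name unanswered_clients is made up for illustration.

```python
import re

# Patterns modelled on the log lines quoted above (an assumption, not a
# documented udp-sender format).
CONNECT_RE = re.compile(r"New connection from (\d+\.\d+\.\d+\.\d+)\s+\(#(\d+)\)")
TIMEOUT_RE = re.compile(r"Timeout notAnswered=\[([\d,]*)\]")


def unanswered_clients(log_text):
    """Return sorted (client_number, ip) pairs for clients listed as notAnswered.

    The ip is None when the log has no "New connection" line for that client.
    """
    # Remember which IP registered under which client number.
    ip_by_client = {}
    for ip, number in CONNECT_RE.findall(log_text):
        ip_by_client[int(number)] = ip

    # Collect every client number mentioned in any Timeout line.
    silent = set()
    for group in TIMEOUT_RE.findall(log_text):
        silent.update(int(n) for n in group.split(",") if n)

    return [(n, ip_by_client.get(n)) for n in sorted(silent)]
```

To use it, read the whole log file into a string (for example the multicast.log.udpcast.28 file mentioned above), pass it to unanswered_clients(), and print the pairs it returns.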
-
What does your CPU usage look like? With unicasting, FOG pushes a compressed image file, which the client uncompresses as it receives it; with multicasting, FOG uncompresses the image files on the server and then pushes them out. So it could be an issue of the server failing to uncompress that second partition. Have you been able to unicast this image? Or multicast to a smaller number of hosts?
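Watching top on the server during a multicast is the quickest check. If you would rather log a number, here is a rough sketch that samples the aggregate "cpu" line of /proc/stat twice and reports the share of non-idle time in between. It assumes a Linux server, it counts iowait as not busy (a choice made here, since the question is whether decompression is CPU-bound), and the function names are invented for this example.

```python
import time


def parse_cpu_line(line):
    """Split the aggregate "cpu" line of /proc/stat into (idle, total) ticks."""
    fields = [int(value) for value in line.split()[1:]]
    idle = fields[3]  # fourth counter is idle time
    if len(fields) > 4:
        idle += fields[4]  # fifth counter is iowait, treated as not busy here
    # Only the first eight counters are summed: the guest counters that may
    # follow are already included in user/nice.
    return idle, sum(fields[:8])


def busy_percent(before, after):
    """Percentage of non-idle CPU time between two "cpu" lines."""
    idle_0, total_0 = parse_cpu_line(before)
    idle_1, total_1 = parse_cpu_line(after)
    elapsed = total_1 - total_0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (elapsed - (idle_1 - idle_0)) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as handle:
        return handle.readline()


def sample_busy_percent(interval=5.0):
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    first = read_cpu_line()
    time.sleep(interval)
    return busy_percent(first, read_cpu_line())
```

Calling sample_busy_percent() in a loop while the second partition is supposed to start would show whether the server is pegged near 100% at that point.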
-
I’m not able to check the CPU usage at the moment, but I did first try to multicast the image to 2 computers, and that worked fine.
I rejoiced and tried all of them again, but the issue came back. I then tried sending it to 25 computers, and that worked, but it was going to take 6-7 hours: the image would send around 100-200MB, freeze for around a minute, and then send more… The computer I host FOG on is rather old, so if multicasting is a CPU-heavy job for the FOG server then your theory might indeed fit what I saw.
For the fun of it I’m going to get some fresh new hardware and try it on that to see what results it yields.
Unicasting the image went fine, btw.
-
Another thing that slows down multicasts is the performance of the hosts themselves. During a multicast the FOG server sends out the image in chunks and waits for confirmation from each host that it is ready for the next chunk, so if one host has a bad hard drive, or any other problem receiving the chunk and writing it to disk, it will slow down the entire session.
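To put a rough number on that, here is a toy model of the behaviour described above: every chunk costs the send time plus the write time of the slowest host, because the server waits for all confirmations before moving on. It is a deliberate simplification and not how udpcast actually schedules its slices; the function name and the figures in the example are made up.

```python
def session_seconds(chunk_count, send_seconds, host_write_seconds):
    """Estimate session length when every chunk waits for the slowest host.

    chunk_count        -- number of chunks in the image (hypothetical figure)
    send_seconds       -- time to send one chunk
    host_write_seconds -- per-host time to receive and write one chunk
    """
    if not host_write_seconds:
        raise ValueError("at least one host is required")
    slowest = max(host_write_seconds)
    return chunk_count * (send_seconds + slowest)


# 25 healthy hosts versus the same group with one failing disk.
healthy = [0.5] * 25
one_bad = [0.5] * 24 + [4.0]
print(session_seconds(1000, 0.5, healthy))  # every chunk waits 0.5 s for writes
print(session_seconds(1000, 0.5, one_bad))  # every chunk now waits 4.0 s
```

In this model one bad host out of 25 makes the whole session several times longer, so it can be worth removing suspect machines from the group and imaging them separately.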