Multicast problem: FOG 0.32 on CentOS 5.6



  • Hello.

    I’m having a problem with multicasting on FOG 0.32, CentOS 5.6 (Final) and Windows 7 machines (Dell OptiPlex GX620).

    It’s a rather strange problem. I’m trying to multicast to 145 machines, and the first small partition multicasts just fine, but when the second partition is about to start, the clients just sit and wait on the “please wait” screen.

    Now the log in /opt/fog/log/ shows the following:
    (This is the multicast.log.udpcast.28 log, not the one called just “multicast.log”.)

    Udp-sender 2007-12-28
    Using mcast address 236.21.238.31
    UDP sender for (stdin) at 172.21.238.31 on eth0
    Broadcasting control to 224.0.0.1
    New connection from 172.21.238.123 (#0) 00000009
    New connection from 172.21.238.156 (#1) 00000009
    New connection from 172.21.238.152 (#2) 00000009
    New connection from 172.21.238.151 (#3) 00000009
    New connection from 172.21.238.136 (#4) 00000009 etc etc…

    Then it shows:

    Starting transfer: 00000009
    bytes= 97 552 re-xmits=0000001 ( 1.4%) slice=0066 73 709 551 615 - 132
    bytes= 193 648 re-xmits=0000001 ( 0.7%) slice=0066 73 709 551 615 - 131
    bytes= 289 744 re-xmits=0000001 ( 0.5%) slice=0066 73 709 551 615 - 131
    bytes= 385 840 re-xmits=0000001 ( 0.3%) slice=0066 73 709 551 615 - 131
    bytes= 481 936 re-xmits=0000001 ( 0.3%) slice=0066 73 709 551 615 - 131
    bytes= 578 032 re-xmits=0000001 ( 0.2%) slice=0066 73 709 551 615 - 131
    bytes= 674 128 re-xmits=0000001 ( 0.2%) slice=0066 73 709 551 615 - 130 etc etc…

    And then the interesting bits happen:

    bytes= 25 370 800 re-xmits=0000001 ( 0.0%) slice=0066 73 709 551 615 - 130
    bytes= 25 375 168 re-xmits=0000001 ( 0.0%) slice=0066 73 709 551 615 - 133
    bytes= 25 375 612 re-xmits=0000001 ( 0.0%) slice=0066 73 709 551 615 - 132
    Timeout notAnswered=[2,4,7,8,10,13,14,15,18,19,20,21,63,106,110,111,114,115,118,119,120,121,122,124,127,128,129,131,133,134,135,136,138,139,140,142,143,145] nrAns=108 nrRead=108 nrPart=146 avg=3661
    Disconnecting #24 (172.21.239.10)
    Disconnecting #89 (172.21.238.240)
    Disconnecting #78 (172.21.238.143)
    Disconnecting #28 (172.21.238.100)
    Disconnecting #23 (172.21.238.227)
    Disconnecting #31 (172.21.238.132)
    Disconnecting #25 (172.21.238.111)
    Disconnecting #22 (172.21.238.226) etc etc…

    And then follows:

    Disconnecting #126 (172.21.238.174)
    Disconnecting #130 (172.21.238.119)
    Disconnecting #131 (172.21.238.194)
    Bad command 0300
    Bad command 0300
    Bad command 0300
    Bad command 0300
    Bad command 0300
    Bad command 0300 etc etc…

    After that I’m getting:

    Dropping client #2 because of timeout
    Disconnecting #2 (172.21.238.152)
    Dropping client #4 because of timeout
    Disconnecting #4 (172.21.238.136)
    Dropping client #7 because of timeout
    Disconnecting #7 (172.21.238.149)
    Dropping client #8 because of timeout
    Disconnecting #8 (172.21.238.138)
    Dropping client #10 because of timeout
    Disconnecting #10 (172.21.238.205)
    Dropping client #13 because of timeout
    Disconnecting #13 (172.21.238.104) etc etc…

    Almost at the end it says:

    Dropping client #142 because of timeout
    Disconnecting #142 (172.21.238.120)
    Dropping client #145 because of timeout
    Disconnecting #145 (172.21.238.173)
    Transfer complete.^G
    Disconnecting #0 (172.21.238.123)
    Disconnecting #1 (172.21.238.156)
    Disconnecting #3 (172.21.238.151)
    Disconnecting #5 (172.21.238.140)
    Disconnecting #6 (172.21.238.129)
    Disconnecting #9 (172.21.238.130)
    Disconnecting #12 (172.21.238.207)
    Disconnecting #16 (172.21.238.211)
    Disconnecting #27 (172.21.239.8)

    And then finally:

    Udp-sender 2007-12-28
    Using mcast address 236.21.238.31
    UDP sender for (stdin) at 172.21.238.31 on eth0
    Broadcasting control to 224.0.0.1

    Any ideas what could be wrong? It complains about timeouts, although the machines all started up within 20 minutes from the first to the last. Following this post (http://fogproject.org/forum/threads/multicast-timeout.529/), I’ve also checked the UDPSENDER_MAXWAIT setting in Config.php and I see that it’s set to 0 (so I guess that means it will wait forever and not time anything out).
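To see which machines stopped answering, the slot numbers in the "Timeout notAnswered=[...]" line can be matched against the "New connection from ... (#N)" lines. In the log above, 38 slots are listed as not answered and nrAns=108, which adds up to nrPart=146. The following is only a rough sketch: the function name `parse_udpcast_log` and the shortened sample text are made up for illustration, and the regular expressions assume the log format shown in this post.

```python
import re

def parse_udpcast_log(text):
    """Map slots to IPs and extract the unanswered slots from udp-sender output.

    Assumes the line formats seen in this thread; other udpcast versions may differ.
    """
    slots = {}
    for m in re.finditer(r"New connection from (\d+\.\d+\.\d+\.\d+)\s+\(#(\d+)\)", text):
        slots[int(m.group(2))] = m.group(1)

    unanswered, nr_ans, nr_part = [], None, None
    m = re.search(r"Timeout notAnswered=\[([\d,]*)\].*?nrAns=(\d+).*?nrPart=(\d+)", text)
    if m:
        unanswered = [int(s) for s in m.group(1).split(",") if s]
        nr_ans, nr_part = int(m.group(2)), int(m.group(3))
    return slots, unanswered, nr_ans, nr_part

# Shortened, hypothetical sample in the same format as the log above.
sample = """\
New connection from 172.21.238.123 (#0) 00000009
New connection from 172.21.238.156 (#1) 00000009
New connection from 172.21.238.152 (#2) 00000009
Timeout notAnswered=[2] nrAns=2 nrRead=2 nrPart=3 avg=3661
"""

slots, unanswered, nr_ans, nr_part = parse_udpcast_log(sample)
for slot in unanswered:
    # Print each silent slot with the IP it connected from, if known.
    print(slot, slots.get(slot, "unknown"))
```

Run against the full multicast.log.udpcast.28 file, this would give a list of IPs to check first (cabling, switch ports, disks).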



  • Another thing that slows down multicasts is the performance of the hosts themselves. During a multicast the FOG server sends out the image in chunks and waits for confirmation from each host that it is ready for the next chunk, so if one host has a bad hard drive or any other problem receiving and writing a chunk to disk, it will slow down the entire session.
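A toy model may make the point clearer. If the sender waits for every host before moving on, each chunk costs as much time as the slowest host needs for it. This is a deliberately simplified sketch, not how udp-sender is implemented (it ignores retransmissions and timeouts that drop clients); the host names and timings are invented.

```python
def session_time(chunk_times_per_host):
    """Total session time when every chunk waits for the slowest host.

    chunk_times_per_host maps a host name to the seconds it needs per chunk;
    all hosts are assumed to receive the same number of chunks.
    """
    per_host = list(chunk_times_per_host.values())
    n_chunks = len(per_host[0])
    return sum(max(times[i] for times in per_host) for i in range(n_chunks))

# Hypothetical timings: two healthy hosts and one with a slow disk.
healthy = {"host-a": [1.0, 1.0, 1.0], "host-b": [1.0, 1.0, 1.0]}
with_slow = dict(healthy, **{"slow-disk": [1.0, 5.0, 5.0]})

print(session_time(healthy))    # 3.0
print(session_time(with_slow))  # 11.0
```

One slow host out of three nearly quadruples the session time in this model, even though the other two could keep up.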



  • I’m not able to check the CPU usage at the moment, but I did try to multicast the image to 2 computers first and that worked fine.
    I rejoiced and tried all of them again, but the issue came back.

    I then tried to send it to 25 computers and that worked, but it was going to take 6-7 hours. The server would send out around 100-200MB, freeze for about a minute and then send more… The computer I run FOG on is rather old, so if multicasting is a CPU-heavy job for the FOG server, your theory might indeed fit what I saw.

    For the fun of it I’m going to get some new fresh hardware and try it on that to see what results that yields.

    Unicasting the image went fine btw ;)



  • What does your CPU usage look like? With unicasting, FOG pushes a compressed image file, which the client decompresses as it receives it; with multicast, FOG decompresses the image files itself and then pushes them out. So it could be an issue of failing to decompress that second partition. Have you been able to unicast this image? Or maybe multicast to a smaller number of hosts?
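If top is not at hand during a session, CPU load on the server can also be estimated from two snapshots of the first line of /proc/stat. The following is a minimal sketch under some assumptions: the helper name `cpu_busy_percent` is made up, only the first eight counters (user through steal) are used, and idle plus iowait is treated as "not busy". The sample lines are invented numbers, not taken from a real server.

```python
def cpu_busy_percent(stat_before, stat_after):
    """Busy CPU percentage between two snapshots of the 'cpu' line in /proc/stat."""
    def totals(line):
        # Fields after the 'cpu' label: user nice system idle iowait irq softirq steal
        fields = [int(x) for x in line.split()[1:9]]
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
        return sum(fields), idle

    total_a, idle_a = totals(stat_before)
    total_b, idle_b = totals(stat_after)
    delta_total = total_b - total_a
    if delta_total <= 0:
        return 0.0
    return 100.0 * (delta_total - (idle_b - idle_a)) / delta_total

# Hypothetical snapshots; on the server, read the first line of /proc/stat
# twice, a few seconds apart, while the multicast is running.
before = "cpu  100 0 50 800 50 0 0 0"
after = "cpu  200 0 100 850 50 0 0 0"
print(cpu_busy_percent(before, after))  # 75.0
```

A value that stays near 100 while the clients sit on the “please wait” screen would support the decompression theory; a low value would point back at the network or the hosts.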

