Another slow Deployment question



  • Hey all,

    We have been noticing some random problems with our FOG server lately. Seemingly at random (more often than not), the FOG server will negotiate speeds of around 60MB per minute, while other computers (of the exact same hardware) will get speeds of 6GB per minute. And to add to the equation, if we reboot the computer halfway through, it will sometimes catch a fast speed and image at 6GB per minute; other times it will still download at 60MB per minute.

    Things I have tried/checked

    Checked all switches from FOG to computer: from the 1Gb card in the FOG server, to a 10Gb copper port, over a 40Gb trunk, to 10Gb fiber, to a 10Gb switch, to the 1Gb NIC in the end computer. Everything I can check is running 1Gb or 10Gb full duplex.
    Tried swapping kernels on the PXE menu
    Checked all switch ports to the end PC for errors/dropped packets/etc…
    Checked iostat on the FOG server to see if there were any HDD errors…

    I am at a loss; I don’t understand why these would be going so slow. We have never, in all the years previous, had FOG run this slow.

    I am on Fog 1.2.0 SVN 4301 (master) + FOG 0.32 (storage)
    Image cap is 10 per server. Nothing crazy, nothing else fancy.


  • Moderator

    While I still stand by what I’ve already said: just try multicast and see if it works for deploying the same image to several machines at once. Stick the computers into a group, and initiate the multicast from the group.


  • Developer

    @Quazz said:

    Only downside is reduced flexibility, that is to say if you have 20 computers and they need 7 different images, multicast won’t really be that useful.

    While you are right that you can’t deploy different images via multicast, I don’t see this as a fair argument. A network/switch can handle unicast and multicast at the same time, so it’s not an either-or thing; use whichever is appropriate for the task you want to run. Multicasting when you actually want different images is stupid, and unicasting when you want the same image is not very wise either.

    Other than that, multicast is pretty great. Is there any word on the state of the torrent mechanic or is that on hold/abandoned? I personally think multicast would be better than the torrenting from a network saturation/resources perspective, but maybe I’m wrong?

    There has been a discussion on and off on the forum. Just search for ‘torrent’ and I am sure you’ll find it.


  • Moderator

    @Sebastian-Roth Only downside is reduced flexibility, that is to say if you have 20 computers and they need 7 different images, multicast won’t really be that useful.

    Other than that, multicast is pretty great. Is there any word on the state of the torrent mechanic or is that on hold/abandoned? I personally think multicast would be better than the torrenting from a network saturation/resources perspective, but maybe I’m wrong?


  • Developer

    @Arsenal101 said:

    Maybe at some point I could convince my boss to throw a 10Gb NIC and some decent hardware for our FOG server in the budget.

    As George already said, this is not very wise. I might add that I am pretty sure multicast would solve all of this. I don’t understand why everyone is so afraid to get multicast running. What’s the drawback that I don’t seem to see? Please tell me…


  • Senior Developer

    @Wayne-Workman Throughput is already displayed on the dashboard (granted, not per host), but if the network is NOT saturated, as this line of reasoning suggests, you should see “plenty” of available bandwidth on the network there.

    The part that bothers me is that this still feels more like a networking issue than a seek/IO issue. While I do understand IO being a part of this, your network is most likely the first culprit, primarily considering the mount point is used across the network to begin with.


  • Moderator

    @george1421 We can get an exact metric if we turn on FTP_Image_Size on the server; it’ll display the image size. Then we can use the total time elapsed during imaging to calculate throughput.
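
    The calculation Wayne describes is simple division; as a quick sketch in Python (using the example figures from the video discussed elsewhere in this thread; the small gap versus Partclone’s displayed 2.29GB/min comes from task overhead included in the elapsed time):

    ```python
    # Throughput from FTP_Image_Size and elapsed imaging time.
    # Example figures: a 31.7GB image that took 13 minutes 31 seconds to deploy.

    image_size_gb = 31.7
    elapsed_min = 13 + 31 / 60

    throughput = image_size_gb / elapsed_min
    print(f"{throughput:.2f} GB/min")  # ~2.35 GB/min
    ```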


  • Moderator

    @Wayne-Workman said in Another slow Deployment question:

    @george1421 Take this video for example, look at the video @ 5:34

    We see the space used on the image is 31.7GB, and the write speed is 2.29GB/min.

    We see elapsed time is 13 minutes and 31 seconds.

    31.7 divided by 2.29 = 13.84, or 13 minutes and 50 seconds.

    Therefore the rate that Partclone displays is write speed (or read speed), not network transfer rate.

    Wayne, I fully agree with you and understand that the Partclone display is not actual network usage, but it’s the best metric we have without getting into too much tech. So the point is: it’s not accurate, but it’s the best we have (like the Windows Performance Index; at least it’s some metric we can use as a baseline).

    My testing used an Intel NUC as the FOG server, deploying to an e6400 with an HDD and an SSD. You can see the speed difference from just switching the target from HDD to SSD with everything else being the same.

    https://forums.fogproject.org/topic/6373/fog-mobile-deployment-server-for-200-usd-finished/3

    “I replaced the Seagate rotating hard drive in the e6400 with a Crucial MX100 256GB SSD I had laying around. I again redeployed the same image as in the previous tests; this time the transfer rate was 7.8GB/m (130MB/s {!!faster than wire speed!!}) according to Partclone, as compared to 5.1GB/m with a rotating disk in the target. I booted the e6400 back into debug mode and ran hdparm -Tt /dev/sda; hdparm reported 242MB/s for buffered disk reads, as compared to 80MB/s with rotating media.”



  • @george1421 Pretty much. It would just be cool to say we have 10Gb to the desktop… An SSD drive would be sweet though!


  • Moderator

    @Arsenal101 said in Another slow Deployment question:

    If we could ever get 10Gb to the end PC, the only thing slowing the process down would be the human factor!

    While 10G to the desktop would be really nice, it’s not necessary and a bit of a waste, because I would suspect that on the target end the disk or CPU is your limiting factor, not the network. For the server, managing multiple data streams, I can see the network and then the disk subsystem being the bottleneck.



  • @Wayne-Workman Sorry I concatenated @Sebastian-Roth’s and your reply in my head.


  • Moderator

    @Arsenal101 I was not suggesting multicast? But ok.



  • Great replies, guys! I thought that the rate was always the network transfer speed and not the write speed…

    @Wayne-Workman I am in no way discrediting your method; I just don’t think it’s the best setup for the needs we have right now. I want to stay away from multicast for now, as we don’t have our layer 3 switch/router configured to pass broadcast packets. We have had several issues with staff looping the network and bringing down everything instead of just one building (this was also before we turned spanning tree on), and I know my boss won’t be a fan of increasing the broadcast domain even if it was just for the summer.

    On top of that, we don’t have the time to build more storage nodes for FOG, so it’s kind of a trade-off: yes, they image slower, but at the same time you’re doing 10 at a time per storage node (20 at a time in my case: 1 master + 1 storage), and 800MB per minute is plenty acceptable for what we need it for.

    @george1421 Maybe at some point I could convince my boss to throw a 10Gb NIC and some decent hardware for our FOG server in the budget. I could try some of the stuff you are suggesting, and that would really boost the speed right up.

    If we could ever get 10Gb to the end PC, the only thing slowing the process down would be the human factor!


  • Moderator

    @george1421 Take this video for example, look at the video @ 5:34

    We see the space used on the image is 31.7GB, and the write speed is 2.29GB/min.

    We see elapsed time is 13 minutes and 31 seconds.

    31.7 divided by 2.29 = 13.84, or 13 minutes and 50 seconds.

    Therefore the rate that Partclone displays is write speed (or read speed), not network transfer rate. The image is sent across the network in compressed form, so there is much more room available on the network. Once network saturation is reached, it’s reached. However, if you continue to pile on imaging tasks, eventually the HDD seek time is maxed out, and you actually start losing network saturation. This is my reasoning for setting Max Clients to 2 or 3, not 10.

    The idea is to saturate the available bandwidth for imaging tasks without exceeding the HDD’s ability to keep up. If seek times push the HDD past the point where it can keep up, then you’re going backwards.

    The optimal setting for max clients should saturate the network fully, but not exceed the HDD’s ability to seek and keep up with the full network speeds.
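
    The division above can be checked mechanically; a minimal sketch (sizes in GB, rate in GB/min):

    ```python
    # Verify that image size divided by the Partclone rate roughly
    # reproduces the elapsed time shown on screen (13:31, plus overhead).

    used_gb = 31.7        # space used on the image
    rate = 2.29           # write speed Partclone displayed, GB/min

    t = used_gb / rate               # 13.84... minutes
    minutes = int(t)
    seconds = int((t - minutes) * 60)
    print(minutes, seconds)          # 13 50
    ```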


  • Moderator

    @george1421 I didn’t suggest one at a time, I suggested two at a time.

    I can test this in the lab and give hard numbers. I know this because I’ve had max clients set to 10 before and it was terrible; I changed it to 2 and now it’s performing great.

    Also, I’m pretty sure the figures Partclone displays are read/write speed, not network transfer speed.


  • Moderator

    Just thinking about the numbers here.

    Let’s say you have a single unicast image being sent, and that transfer goes at 6GB/m; that translates to about 100MB/s (near the theoretical limit of a GbE network). So we know that 6GB/m is near the fastest we can go on a GbE network. (I know there are other factors here, like compression ratio, target system performance, and so on; I’m just talking in general terms.)

    So for a standard 25GB fat client at 6GB/m, it should take just a tad over 4 minutes to image that system (25/6 = 4.1m).

    Now the OP can image 10 machines at 800MB/m, or about 13MB/s, each. To deploy a 25GB image, it should take about 31 minutes to net 10 systems.

    If we serialized the deployment and only deployed one system at a time, with a 4-minute deployment we should be able to deploy 6 systems per 30 minutes, which does not beat the 10 machines at 800MB/m. If we allowed dual unicast deployments per imaging cycle, we should still be able to achieve 10 systems per 30 minutes.
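
    Those estimates can be reproduced in a few lines. This idealized version ignores per-machine setup and reboot overhead, which is presumably why the figures above round the serialized count down to 6:

    ```python
    # Deployment-time estimates for a 25GB image (rates in GB/min).

    image_gb = 25

    # One unicast stream near GbE wire speed (~100 MB/s):
    t_single = image_gb / 6.0
    print(f"single stream: {t_single:.1f} min/machine")      # ~4.2 min

    # Ten simultaneous streams at 800MB/min each:
    t_ten = image_gb / 0.8
    print(f"ten streams: {t_ten:.1f} min for 10 machines")   # ~31.2 min

    # Idealized serialized count in a 30-minute window:
    print(int(30 // t_single))                               # 7, before overhead
    ```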

    So how could we go faster?

    I might start by creating a bonded network connection with maybe 3 or 4 links. Adding network bonding to the equation adds some processing overhead; to offset this, just add more links to the LAG (more than just one additional). This will spread the network load over multiple links. (Actually, since 10G is available, I would just jump to a 10G adapter; then a LAG is not needed.)

    Once the network bottleneck was eliminated, I would then probably install an SSD in the server to host the images. If you think about it, with 10 systems all at different parts of the download, those drive heads are bouncing all around the platter to service the data requests. Moving the images to an SSD on the FOG server will eliminate the drive thrashing. The FOG server CPU really doesn’t come into play here; the only thing the FOG server is doing is moving data from the hard drive to the network adapter. There is not a lot of computational power required. It’s all the network and disk subsystems that are under load.


  • Moderator

    @Arsenal101 Even 800MB/min is slow. If you limit each storage node to 2, you’ll get 7GB/min on two at a time, and get more done in a day.



  • I think I may have figured it out. After I was able to upgrade the master server to the latest trunk, we were still having trouble. One of my coworkers noticed that the machines that were going slow were loading from and connecting to the storage node (version 0.32), so I unplugged it and removed it from the master server, and that seemed to speed everything up. I can now image 10 machines at 800MB per minute, which for me is more than acceptable.

    My best guess is that it was a combination of the master server offloading work to the storage server, which is not very good hardware, on top of the fact that it was deploying a Windows 10 image with FOG version 0.32, which has no idea Windows 10 ever existed… So I am planning on building a few more storage nodes, since they are rather simple to build, and we should be able to image 20 machines at 800MB per minute (in theory).

    Thanks for all the help and suggestions, guys!


  • Moderator

    @Arsenal101 You don’t need to set up a storage server for each subnet, but your network has to be set up to allow multicasting.

    I would go with my suggestion (of course): set up 2 borrowed systems for FOG, get that working, and then test a multicast deployment to a remote subnet. If that works, you are golden; if not, you still have two newer FOG servers running the latest trunk build.

    For multicasting, your router needs to allow directed broadcasts between the subnets, and you should have IGMP snooping enabled on all VLANs that would have multicast clients. This is typically set on each switch that would be part of the multicast conversation.

    From an analytical side, you’ve tested what I would have tested to identify the performance issue. Unless the performance issue is area-specific, I would focus on the areas in common, like the datacenter network and the FOG servers; that is the only thing in common at this point (in my mind).


  • Moderator

    @Arsenal101 You’ve still not tried my suggestion?

