Slowdown Unicast and Multicast after upgrading FOG Server



  • @Sebastian-Roth
    We did two re-captures now. One with 1.5.3 binaries and one with current binaries from dev-branch 1.5.7.112.

    We saw no improvement in deploying with 1.5.7. The speed with the 1.5.7.112 binaries is around 5 GB/min.
    Luckily, the 1.5.3 binaries bring us back up to 13 GB/min. That's the speed we had before.

    For now we will stick with the 1.5.3 binaries running behind FOG 1.5.7.112.
    If there are improvements, please let us know so we can test them.



  • @mp12 Just curious if you ever figured out the source of your slowdown? I too experienced the same major slowdown after upgrading to a newer dev branch to fix some issues we had (I am now on 1.5.7.102). I went from roughly 13 GB per minute to 2 GB or slower per minute. Very frustrating. I tried capturing the image a couple of different times with different compression settings and a bunch of other things. Tried multiple Dell OptiPlex models: 9010, 9020, 7050, and all of them display the same slowness. Got this issue on two separate servers (we have two campuses at our college, so two different servers). It almost feels like the different kernels did this. We were on “5.1.16 mac nvmefix” but then upgraded to the “4.19.101” which came with the 1.5.7.102 install.

    I am interested in any fix for this as my desktop support team is very frustrated at the moment. I am happy to test out any theories to help this along. Would hate for others to run into this as well.


  • Moderator

    @Sebastian-Roth The 11 retransmits are something, but really not much. Look at the testing I did on a Dell 790: https://forums.fogproject.org/topic/10459/can-you-make-fog-imaging-go-fast/4 For a single iperf there were no retransmits, but as soon as I added a second iperf running at the same time the retransmits shot up into the hundreds and the throughput dropped off accordingly.


  • Developer

    @mp12 said in Slowdown Unicast and Multicast after upgrading FOG Server:

    Now we are capturing a new image with the 1.5.3 binaries. The above error may have something to do with a mismatching partclone version.

    Yeah, definitely. Need to re-capture the image.

    Thanks for the information. Good to know you have a fleet of the exact same machines; I am sure we will figure out what is causing the slow speed and fix it.



  • @Sebastian-Roth @Quazz @george1421
    We only have computers with the following specs:

    • Dell 9010 (BIOS A30)
    • i7-3770
    • 16GB RAM
    • Samsung SSD 860 EVO 500GB

    Ran ten iperf3 tests in a row and got an average of 9.7 retries. Is that really so bad?
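    Averaging the Retr column across runs can be scripted rather than counted by hand. A minimal sketch using awk on saved iperf3 sender summaries (the two sample lines are the sender summaries posted in this thread):

```shell
# Collect the "... sender" summary line of each iperf3 run into one file, then
# average the Retr column (second-to-last field) with awk.
cat > /tmp/iperf_runs.txt <<'EOF'
[  5]   0.00-10.04  sec  1.10 GBytes   939 Mbits/sec   11             sender
[  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec   11             sender
EOF
awk '/sender/ { sum += $(NF-1); n++ } END { printf "average retries: %.1f\n", sum/n }' /tmp/iperf_runs.txt
```

    With real runs, append each test's output with `iperf3 -c SERVER | grep sender >> /tmp/iperf_runs.txt`.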

    We also tried the binaries from 1.5.3 through 1.5.6 with the old image and received the following error:
    read image_hdr device_size error

    Now we are capturing a new image with the 1.5.3 binaries. The above error may have something to do with a mismatching partclone version. Hopefully the deploy speed will be back at 12 GB/min. If so, we will try to capture an image with the current binaries.


  • Developer

    @george1421 @mp12 Use init and kernel binaries from the same archive, as you will run into kernel panics quite easily if you mix them.

    @mp12 I am wondering if you see the same slowness on many different types of hardware, or if it's only machines with the Samsung SSD 860 EVO 500GB?

    EDIT: Reading through the whole topic again I stumbled upon this:

    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec   11             sender
    [  5]   0.00-10.00  sec  1.10 GBytes   942 Mbits/sec                  receiver
    

    Eleven retries for a file transfer over a period of 10 seconds seems like a lot to me. So we might be looking at a combination of issues here.


  • Moderator

    @Sebastian-Roth Yes, going back to the 1.5.3 version of FOS (at least for the inits) would be a good before-and-after test of whether the upgrade is causing this slowness. All the OP needs to do is download the zip, extract the init.xz file from it, and move it to the FOG server to test.
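    A sketch of that swap, assuming the default web root /var/www/fog (many installs use /var/www/html/fog instead) and that the layout inside the zip may vary:

```shell
# Sketch: after downloading binaries1.5.3.zip from fogproject.org, extract it
# and swap its init.xz in for a quick before/after test (paths are assumptions).
ZIP=binaries1.5.3.zip
IPXE_DIR=/var/www/fog/service/ipxe    # some installs: /var/www/html/fog/service/ipxe

if [ -f "$ZIP" ]; then
  unzip -oq "$ZIP" -d fos-1.5.3
  init=$(find fos-1.5.3 -name init.xz | head -n 1)   # layout inside the zip may vary
  if [ -n "$init" ] && [ -d "$IPXE_DIR" ]; then
    cp "$IPXE_DIR/init.xz" "$IPXE_DIR/init.xz.bak"   # keep the current init around
    cp "$init" "$IPXE_DIR/init.xz"
    echo "swapped in $init"
  fi
else
  echo "download $ZIP first, then run this on the FOG server"
fi
```

    Restoring is just copying init.xz.bak back; no FOG reinstall is needed for this test.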


  • Developer

    @george1421 Good point on trying the older binaries. Though I’d expect you to run into issues going back to very early binary versions like 1.5.3… They are all available on the fogproject.org website:
    https://fogproject.org/binaries1.5.6.zip
    https://fogproject.org/binaries1.5.5.zip
    and so on all the way to 1.3.0…


  • Moderator

    @Quazz Would the complete FOG inventory for this system give us enough data, or do we need to dig deeper? Even if it doesn’t give us all of the data, at least it would be a start. What would be grand is if the OP had two systems on his campus where one worked correctly and the other is slow. Then we could compare and contrast those two systems.

    @Sebastian-Roth Is there a place where we can still download the 1.5.5 or 1.5.6 binaries zip file? I’m wondering whether replacing the current bzImage and init.xz with the ones from 1.5.5 or 1.5.6 changes the performance of these systems. While I highly doubt it’s the FOG server, this would at least isolate the issue to the FOS Linux install (unless that is our conclusion already).


  • Moderator

    @Sebastian-Roth I mean, the strange part is that the older versions gave him proper speed too. And it’s not like this is a universal problem since others have not experienced such a large performance difference between FOG versions.

    So I guess it’s time for more information to try and pin down the source of it all.

    @mp12 Can you post the full specs of a troubled machine? (or perhaps even two or three different ones if you have them)

    It almost assuredly has to be some kind of interaction between certain kind of hardware and the Linux kernel (and its config), so we have to try and narrow it down or at least get a clearer picture of what we’re dealing with.



  • @Sebastian-Roth

    Linux debian 5.3.0-1-amd64 #1 SMP Debian 5.3.7-1 (2019-10-19) x86_64 GNU/Linux


  • Developer

    @mp12 said in Slowdown Unicast and Multicast after upgrading FOG Server:

    Therefore I booted a Clonezilla (2.6.4-10-amd64) flash drive

    From what you wrote so far, I would expect the kernel to make the difference. Which kernel is in the Clonezilla you used? Boot to a command shell and run uname -a.



  • @Quazz said in Slowdown Unicast and Multicast after upgrading FOG Server:

    I vaguely recalled someone having this problem before with similar outcomes.
    Linking here for further reference: https://forums.fogproject.org/topic/13733/hp-elitebook-830-gen-6-issues-capturing-images-and-deploying-images/10

    Tried the bzImage529 and checked if RAMDISK size is correct.

    Also tried the following Kernels which all end up with a KERNEL PANIC:
    4.13.4, 4.11.1, 4.10.1, 4.9.11 and 4.8.11

    Other kernels, from 4.15.2 upward, seem to work, but not at sufficient speed.



  • @Quazz said in Slowdown Unicast and Multicast after upgrading FOG Server:

    Are you on the dev-branch

    Yes I am on the dev-branch 1.5.7.109 (bzImage Version: 4.19.101)


  • Moderator

    @mp12 I vaguely recalled someone having this problem before with similar outcomes.

    Linking here for further reference: https://forums.fogproject.org/topic/13733/hp-elitebook-830-gen-6-issues-capturing-images-and-deploying-images/10

    Our kernels and inits have since received a few upgrades, however.

    Are you on the dev-branch, by the way? I don’t believe 1.5.7 was launched with partclone 0.3.12 (that’s for the upcoming release).

    If not, then try the init and kernel files from https://dev.fogproject.org/blue/organizations/jenkins/fos/detail/master/122/artifacts



  • @Sebastian-Roth @Quazz @george1421

    I have some good and bad news.

    First the good ones:

    I created a manual deploy following the FOG wiki: https://wiki.fogproject.org/wiki/index.php/Debug_Mode#Win_7
    Therefore I booted a Clonezilla (2.6.4-10-amd64) flash drive and mounted the NFS share from the FOG server.

    I started the deploy with the following command:

    cat /images/IMAGEPATH/d1p2.img | zstd -d | sudo partclone.restore -O /dev/sda2 -N -f -i
    

    [photo attachment: DSC_0585.JPG]

    Tried a deploy with the FOG client and still got one-third of the expected speed.
    I think there is something wrong in the deploy process.
    The main difference I can see is that FOS uses partclone 0.3.12 and Clonezilla uses 0.3.13.



  • @Quazz

    Did my best. While trying, I noticed that the drive’s “frozen” state was active. Removing the SSD’s power cable while the PC was powered on cleared the frozen state, and I was then able to do a Secure Erase. A dd from ramdisk to /dev/sda ran at the same speed as before. So no luck at all.
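    For anyone else hitting this, the frozen-state check can be done with hdparm before attempting the erase. A minimal sketch, assuming /dev/sda is the target drive and running as root on that machine:

```shell
# Check whether the drive's ATA security state is "frozen"; Secure Erase is
# blocked while frozen ("not frozen" means it is allowed). /dev/sda is a
# placeholder for the target drive.
DEV=/dev/sda
if command -v hdparm >/dev/null 2>&1 && [ -b "$DEV" ]; then
  # A plain "frozen" result needs a suspend/resume cycle or, as described
  # above, re-plugging the drive's power while the system is running.
  hdparm -I "$DEV" 2>/dev/null | grep -i frozen || echo "could not read security state"
else
  echo "hdparm or $DEV not available on this system"
fi > /tmp/frozen_check.txt
cat /tmp/frozen_check.txt
```

    The erase itself then goes through hdparm’s --security-set-pass / --security-erase sequence; it is destructive, so double-check the device name first.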


  • Moderator

    @mp12 I found somewhat similar situations on Google where the conclusion was to use Secure Erase on the drive first. Worth a try just to see if it helps?



  • @george1421

    Server

    -----------------------------------------------------------
    Server listening on 5201
    -----------------------------------------------------------
    Accepted connection from x.x.x.x, port 50672
    [  5] local x.x.x.x port 5201 connected to x.x.x.x port 50674
    [ ID] Interval           Transfer     Bandwidth
    [  5]   0.00-1.00   sec   108 MBytes   903 Mbits/sec
    [  5]   1.00-2.00   sec   112 MBytes   942 Mbits/sec
    [  5]   2.00-3.00   sec   112 MBytes   942 Mbits/sec
    [  5]   3.00-4.00   sec   112 MBytes   942 Mbits/sec
    [  5]   4.00-5.00   sec   112 MBytes   942 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec
    [  5]   6.00-7.00   sec   112 MBytes   942 Mbits/sec
    [  5]   7.00-8.00   sec   112 MBytes   942 Mbits/sec
    [  5]   8.00-9.00   sec   112 MBytes   942 Mbits/sec
    [  5]   9.00-10.00  sec   112 MBytes   942 Mbits/sec
    [  5]  10.00-10.04  sec  4.35 MBytes   937 Mbits/sec
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bandwidth       Retr
    [  5]   0.00-10.04  sec  1.10 GBytes   939 Mbits/sec   11             sender
    [  5]   0.00-10.04  sec  1.10 GBytes   938 Mbits/sec                  receiver
    -----------------------------------------------------------
    Server listening on 5201
    -----------------------------------------------------------
    

    Client

    Connecting to host x.x.x.x, port 5201
    [  5] local x.x.x.x port 50674 connected to x.x.x.x port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec   113 MBytes   947 Mbits/sec    3    258 KBytes       
    [  5]   1.00-2.00   sec   112 MBytes   943 Mbits/sec    0    364 KBytes       
    [  5]   2.00-3.00   sec   112 MBytes   939 Mbits/sec    2    232 KBytes       
    [  5]   3.00-4.00   sec   112 MBytes   943 Mbits/sec    1    318 KBytes       
    [  5]   4.00-5.00   sec   112 MBytes   943 Mbits/sec    2    211 KBytes       
    [  5]   5.00-6.00   sec   112 MBytes   943 Mbits/sec    0    364 KBytes       
    [  5]   6.00-7.00   sec   112 MBytes   943 Mbits/sec    1    267 KBytes       
    [  5]   7.00-8.00   sec   112 MBytes   943 Mbits/sec    1    364 KBytes       
    [  5]   8.00-9.00   sec   112 MBytes   943 Mbits/sec    0    366 KBytes       
    [  5]   9.00-10.00  sec   112 MBytes   943 Mbits/sec    1    282 KBytes       
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec   11             sender
    [  5]   0.00-10.00  sec  1.10 GBytes   942 Mbits/sec                  receiver
    
    iperf Done.
    

  • Moderator

    @Quazz Interesting, regarding the previous dd test. I would like to see that dd test, and then the next step is an iperf test. That tests the local disk and then the network without involving the NFS stack or partclone. At least in my mind, that is how I would break it down. Something must have changed besides FOG.
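    A sketch of the dd half of that test, writing a scratch file instead of touching the real disk (on the target machine you would read the disk itself):

```shell
# Raw write throughput with no NFS or partclone in the path. /tmp is used here
# for safety; on the client you would instead read the drive, e.g.
#   dd if=/dev/sda of=/dev/null bs=1M count=1024
# conv=fdatasync makes dd flush to disk before reporting, so the MB/s figure
# reflects real storage speed rather than the page cache.
dd if=/dev/zero of=/tmp/ddtest.bin bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
```

    Comparing that figure between a 1.5.3 and a 1.5.7 FOS boot on the same machine would show whether the disk path or the network path regressed.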

