SOLVED Performance decrease using Hyper-V Win10 clients

  • Senior Developer

    @jkozee said:

    Something has changed in the kernel build in regards to a VM running on Hyper-V with SSD backed storage

    From my point of view the kernel in the VM has absolutely no knowledge of the underlying filesystem/disk outside the VM. I thought it could be a fragmentation problem on the backend storage device, but then you wouldn’t see a difference in speed just by booting different kernel versions. From what you said I think your test setup is pretty good (just changing the kernel parameter in the host setting and leaving everything else untouched).

    There is a great way to pin this kind of issue down to exactly one version/commit. It’s called git bisect. Please read through this article and see if you want to dive into this. I am more than happy to help you along the way! Have you ever compiled a (FOG) kernel? It’s actually not too complicated. Just give it a try following this article: https://wiki.fogproject.org/wiki/index.php/Build_TomElliott_Kernel (the second half covers the current FOG version). Instead of make menuconfig (after downloading Tom’s kernel config) you can just run make oldconfig, so you don’t need to bother with the menu stuff.

    You can build the kernels on your FOG server if you like. It just needs some disk space for the kernel git repo and some tools. As I don’t know which Linux distribution you are on I will leave this open. Search for the packages you need to install on CentOS/Debian/… to compile the kernel. There are lots of tutorials out there.
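
A minimal sketch of the idea behind git bisect, assuming a linear history in which one commit introduced the slowdown and every later commit stays slow. The commit labels, the position of the culprit and the `is_slow` check are all hypothetical stand-ins; in practice each check means building a kernel at that commit, booting the VM with it and timing the resize step.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first revision for which is_bad() is true.

    Assumes revisions[0] is known good, revisions[-1] is known bad, and
    every revision after the first bad one is also bad.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid   # the culprit is at mid or earlier
        else:
            good = mid  # the culprit is after mid
    return revisions[bad]


# Hypothetical history: 17 commits between a fast and a slow kernel.
revisions = [f"commit-{i:02d}" for i in range(17)]
culprit_index = 11  # made up for the demonstration
tested = []


def is_slow(rev: str) -> bool:
    # Stand-in for "build this kernel, boot the VM, time the resize step".
    tested.append(rev)
    return revisions.index(rev) >= culprit_index


result = first_bad(revisions, is_slow)
print(result, "found after", len(tested), "test builds")
```

Halving the range on every step is what keeps the number of kernel builds small even when many commits lie between the good and the bad version.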


  • I pulled version 4.1.4 of bzImage and bzImage32 from another server.

    Without replacing the inits, “Resizing Filesystem” now completes in about a minute.


  • I’ve noticed this problem with Hyper-V VMs on physical disks as well, though much, much worse: 45-60 minutes stuck on “Resizing Filesystem”. These are brand new VMs built from scratch. The last command I run before shutdown is:

    defrag c: /x /h /u /v
    

    I have an added problem in that I can’t download other kernels (covered in another thread), so I have been unable to test with any kernel older than 4.4.1.

  • Testers

    @Sebastian-Roth I tested the different kernels by downloading them to separate files using the web interface. The only difference between the last two setups I compared is the kernel parameter in the host setting using the web interface. I only see this issue on my VMs; my physical units behave normally with either kernel.

    I did some additional tests this morning and here’s what I found.

    Both 4.4.0 and 4.4.1 take around 3.5 minutes to complete “Formatting initialized partition” during deploy, while 4.3.0, 4.3.2, and 4.3.0CDCETHER take less than 1 second.

    The VM I’ve been using has a VHDX file that lives on an SSD, and is the only thing on it. I tested using a VHDX that is on a spinning disk (desktop drive, 5600 rpm) and 4.3.0 still takes <1 sec to complete, but 4.4.1 can complete the step in about 15.5 seconds.

    Something has changed in the kernel build in regards to a VM running on Hyper-V with SSD backed storage.

  • Senior Developer

    @jkozee Do you use the web interface to up-/downgrade kernels? I looked through the official kernel change logs but couldn’t find anything related to NTFS at all. I also checked the buildroot change logs (buildroot is what is used in the inits doing all the work you see when capturing/deploying a client) and couldn’t find an obvious hint on issues with the ntfs-3g progs. Hmmmm, still wondering if it is the kernel or the init??

    Does anyone else see capture/deploy taking literally minutes to resize/format NTFS when image type is set to resizable? Or is this an issue only happening within Hyper-V?


  • @jkozee Well, I for one really appreciate your efforts with testing performances of various revisions and kernels! Perhaps you can just turn one of the VMs off and leave it alone, and wait until FOG Trunk enters into RC (release candidate) so that you can test speeds then and compare to your findings here? It’d be very appreciated.

  • Testers

    Here are the metrics comparing 6307 using kernel 4.3.0 and 4.4.1. Looking at the numbers, it’s safe to say that 6307/4.3.0 is faster than 5315/4.3.0 and far faster than 6307/4.4.1 when using a VM client on Hyper-V.

    Let me know if there are any more measurements required. I’ll keep the VMs around for a day or two.

    6307-4.3.0-Capture
    #0:00
    #0:18

    • Resizing filesystem…Done
      #0:18
      #0:19
      <<PARTCLONE>>
      #14:19
      #14:20
    • Resizing ntfs volume (/dev/sda1)…Done
      #14:20
      #14:25

    6307-4.4.1-Capture
    #0:00
    #0:17

    • Resizing filesystem…Done
      #4:53
      #4:54
      <<PARTCLONE>>
      #22:42
      #22:43
    • Resizing ntfs volume (/dev/sda1)…Done
      #25:49
      #25:53

    6307-4.3.0-Deploy
    #0:00
    #0:20

    • Formatting initialized partition…Done
      #0:20
      #0:24
      <<PARTCLONE>>
      #4:11
      #4:11
    • Resizing ntfs volume (/dev/sda1)…Done
      #4:12
      #4:14

    6307-4.4.1-Deploy
    #0:00
    #0:20

    • Formatting initialized partition…Done
      #3:53
      #3:57
      <<PARTCLONE>>
      #5:42
      #5:42
    • Resizing ntfs volume (/dev/sda1)…Done
      #8:49
      #8:52
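
A rough sketch for turning the paired timestamps above into per-phase durations. It assumes (this is an interpretation, not something stated in the post) that the eight timestamps of each run read as: task start, begin and end of the step before Partclone, begin and end of Partclone, begin and end of the step after Partclone, and task end. The names `to_seconds`, `phase_durations` and the run labels are made up for illustration.

```python
def to_seconds(stamp: str) -> int:
    """Convert a 'm:ss' timestamp such as '4:53' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def phase_durations(stamps):
    """Durations in seconds for the three timed phases of one run.

    Assumed layout of the eight stamps: start, step-before begin/end,
    Partclone begin/end, step-after begin/end, finish.
    """
    t = [to_seconds(s) for s in stamps]
    return {
        "before": t[2] - t[1],     # resize filesystem / format partition
        "partclone": t[4] - t[3],
        "after": t[6] - t[5],      # resize ntfs volume
    }


# Timestamps copied from the four runs listed above (FOG 6307).
runs = {
    "capture-4.3.0": ["0:00", "0:18", "0:18", "0:19", "14:19", "14:20", "14:20", "14:25"],
    "capture-4.4.1": ["0:00", "0:17", "4:53", "4:54", "22:42", "22:43", "25:49", "25:53"],
    "deploy-4.3.0": ["0:00", "0:20", "0:20", "0:24", "4:11", "4:11", "4:12", "4:14"],
    "deploy-4.4.1": ["0:00", "0:20", "3:53", "3:57", "5:42", "5:42", "8:49", "8:52"],
}

durations = {name: phase_durations(stamps) for name, stamps in runs.items()}
for name, phases in durations.items():
    print(name, phases)
```

Read this way, the steps before and after Partclone take at most a second on 4.3.0 and roughly three to four and a half minutes each on 4.4.1.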
  • Testers

    @Sebastian-Roth said:

    Do you mean 6303 with kernel 4.3.0 is as fast as 5315?? Can you please verify if you see such drastic differences (where exactly? still resize ntfs…?) just by using older/newer kernel!

    Actually it may be faster. I ran a deploy test using the same VMs with 6307 using kernel 4.3.0, and it completed in 4:18. To be accurate, I would need to repeat all of the tests to compare 5315/4.3.0 and 6307/4.3.0 under the same server load. But it’s probably safe to say it’s as fast using the older kernel.

  • Senior Developer

    @jkozee Thanks a lot for the accurate timing! Good to know where exactly the time is going. I think @Tom-Elliott is the only one who can shed light on what changed in “Resizing filesystem”, “Resizing ntfs volume” and “Formatting initialized partition”. Between 5315 and 6303 there were heaps of changes in the whole process.

    New tests indicate the slowdown exists in kernel 4.4.0 (x86_64) and 4.4.1 (x86_64), but 4.3.0 (x86_64) appears to be fine.

    Do you mean 6303 with kernel 4.3.0 is as fast as 5315?? Can you please verify if you see such drastic differences (where exactly? still resize ntfs…?) just by using older/newer kernel!

  • Testers

    New tests indicate the slowdown exists in kernel 4.4.0 (x86_64) and 4.4.1 (x86_64), but 4.3.0 (x86_64) appears to be fine.

  • Testers

    I should also note that 5315 == kernel 4.3.0 and 6303 == kernel 4.4.1, as that’s probably relevant. I have not tried an older kernel on 6303, but can if needed.

  • Testers

    Sorry, long post as I have a limit on how often I can post

    5315-Capture
    #0:00

    • Verifying network interface configuration…Done

    • Checking Operating System…Windows 10

    • Checking CPU Cores…1

    • Send method…NFS

    • Checking In…Done

    • Mounting File System…Done

    • Preparing to send image file to server…Done

    • Checking Mounted File System…Done

    • Using Image: delme

    • Preparing backup location…Done

    • Looking for Hard Disks…Done

    • Re-reading Partition Tables…Done

    • Using Hard Disk: /dev/sda

    • Clearing part (/dev/sda1)…Done

    • Mounting partition (/dev/sda1)…Done

    • Removing page file…Done

    • Removing hibernate file…No hibernate found

    • Clearing ntfs flag…Done

    • Saving original partition table…Done

    • Saving Partition Tables (MBR)…Done

    • Possible resize partition size: 11263111 k

    • Running resize test /dev/sda1…Done

    • Resize test was successful

    • Resizing filesystem…Done

    • Clearing ntfs flag…Done

    • Resizing partition /dev/sda1…Done

    • Checking Hard Disks…Done

    • Clearing ntfs flag…Done

    • Now FOG will attempt to upload the image using Partclone.

    • Processing Partition: /dev/sda1 (1)

    • Using partclone.ntfs
      #0:23
      <<PARTCLONE>>
      #18:00

    • Image uploaded

    • Restoring MBR…Done

    • Resizing ntfs volume (/dev/sda1)…Done

    • Clearing ntfs flag…Done

    • Stopping FOG Status Reporter…Done
      #18:03

    6303-Capture
    #0:00

    • Verifying network interface configuration…Done
    • Checking Operating System…Windows 10
    • Checking CPU Cores…1
    • Send method…NFS
    • Attempting to check in…Done
    • Mounting File System…Done
    • Checking Mounted File System…Done
    • Checking img variable is set…Done
    • Preparing to send image file to server
    • Preparing backup location…Done
    • Setting permission on /images/00155d016673…Done
    • Removing any pre-existing files…Done
    • Using Image: delme
    • Looking for Hard Disk…Done
    • Reading Partition Tables…Done
    • Using Hard Disk: /dev/sda
    • Now FOG will attempt to upload the image using Partclone
    • Checking for fixed partitions…Done
    • Getting Windows/Linux Partition Count…Done
    • NTFS Partition count of: 1
    • EXTFS Partition count of: 0
    • Setting up any additional fixed parts
    • Saving original partition table…Done
    • Saving original disk/parts UUIDs…Done
    • Shrinking Partitions on disk
    • Clearing part (/dev/sda1)…Done
    • Mounting partition (/dev/sda1)…Done
    • Removing page file…Done
    • Possible resize partition size: 11263111 k
    • Running resize test /dev/sda1…Done
    • Resize test was successful
      #0:18
    • Resizing filesystem…Done
      #4:53
    • Resizing partition /dev/sda1…Done
    • Clearing ntfs flag…Done
    • Saving shrunken partition table
    • Saving Partition Tables (MBR)…Done
      #4:53
      <<PARTCLONE>>
      #22:44
    • Image Uploaded
    • Restoring Original Partition Layout…Done
      #22:44
    • Resizing ntfs volume (/dev/sda1)…Done
      #25:49
    • Clearing ntfs flag…Done
    • Stopping FOG Status Reporter…Done
    • Task Complete
    • Updating Database…Done
    • Rebooting system as task is complete
      reboot: Restarting system
      #25:52

    5315-Deploy
    #0:00

    • Verifying network interface configuration…Done

    • Checking Operating System…Windows 10

    • Checking CPU Cores…1

    • Send method…NFS

    • Attempting to send inventory…Done

    • Checking In…Done

    • Mounting File System…Done

    • Checking Mounted File System…Done

    • Starting Image Push

    • Using Image: delme

    • Looking for Hard Disks…Done

    • Checking write caching status on HDD…Enabled

    • Erasing current MBR/GPT Tables…Done

    • Restoring Partition Tables (MBR)…Done

    • Extended partitions…Done

    • Expanding partition table to fill disk…Done

    • Processing Partition: /dev/sda1 (1)
      #0:28
      <<PARTCLONE>>
      #6:00

    • Clearing ntfs flag…Done

    • Stopping FOG Status Reporter…Done

    • Resizing ntfs volume (/dev/sda1)…Done

    • Clearing ntfs flag…Done

    • Backing up and replacing BCD…Done

    • Changing hostname…Done

    • Updating Computer Database Status

    • Database Updated!

    • Task is completed, computer will now restart.

    reboot: Restarting system
    #6:03

    6303-Deploy
    #0:00

    • Verifying network interface configuration…Done
    • Checking Operating System…Windows 10
    • Checking CPU Cores…1
    • Send method…NFS
    • Attempting to check in…Done
    • Mounting File System…Done
    • Checking Mounted File System…Done
    • Checking img variable is set…Done
    • Attempting to send inventory…Done
    • Using Image: delme
    • Looking for Hard Disk…Done
    • Using Disk: /dev/sda
    • Write caching not supported
    • Preparing Partition layout
    • Wiping /dev/sda partition information
    • Erasing current MBR/GPT Tables…Done
    • Creating disk with new label…Done
    • Initializing /dev/sda with NTFS partition…Done
      #0:20
    • Formatting initialized partition…Done
      #3:53
    • Erasing current MBR/GPT Tables…Done
    • Restoring Partition Tables (MBR)…Done
    • Inserting Extended partitions…Done
    • Attempting to expand/fill partitions…Done
      #3:57
      <<PARTCLONE>>
      #5:53
    • Clearing ntfs flag…Done
      #5:53
    • Resizing ntfs volume (/dev/sda1)…Done
      #8:51
    • Clearing ntfs flag…Done
    • Resetting UUIDs for /dev/sda
    • Resetting swap systems
    • Stopping FOG Status Reporter…Done
    • Mounting directory…Done
    • Changing hostname…Done
    • Task Complete
    • Updating Database…Done
    • Rebooting system as task is complete
      reboot: Restarting system
      #8:54
  • Testers

    I ran some tests that will hopefully prove useful.

    Both the client and server are VMs on the same server. I used a single checkpoint on the client to run all of the tests. The server was tested from a checkpoint running 5315 and then upgraded to 6303 with a new checkpoint created, so that I can easily do additional tests if needed. The upgraded VM gives results similar to the new install on a VM where I originally observed the slow behavior. So there should be no appreciable differences between the test scenarios, except for the updated FOG version.

    Deployment went from 6:03 to 8:54, with the largest increases seen during “Formatting initialized partition” before Partclone and “Resizing ntfs volume” after.

    Capture went from 18:03 to 25:52, with the largest increases seen during “Resizing filesystem” before Partclone and “Resizing ntfs volume” after.

    I will include additional data and times in separate posts for each test for closer inspection.

    Please let me know if you have any ideas, or anything else you would like to see tested.

    Thanks!


  • One time I accidentally only had one core assigned to a VM in Hyper-V. I went through all the motions of installing FOG Trunk… performance sucked. I deleted the VM and started over, this time with 4 cores! ☝

  • Testers

    I don’t think anything was changed between versions, but I will verify. I will perform a detailed analysis and report my findings.

  • Senior Developer

    @jkozee said:

    Is this expected/explainable behavior?

    In general I’d say no! Did you change compression ratio? We usually don’t have this kind of setup around to test new versions with. So it would be great if you could find some more specific details on this? Where exactly does it take longer?
