Upgrade from 1.5.7 to 1.5.8 issues
-
I am seeing significant speed differences during imaging between 1.5.7 and 1.5.8. I noticed this when I tried some of the 1.5.7.X revisions, and so I went back to 1.5.7. I tried the update again and here was the speed difference:
1.5.7 - 10.54 GB/min
1.5.8 - 7.23 GB/min
Is there a reason why there is such a drastic change in speed?
Just curious.
Thanks,
Truth table:
| bzImage version | init version | partclone | buildroot | zstd | speed (tested by SR) |
| 4.19.101 | 1.5.8 | 0.3.13 | 2019.02.9 | 1.4.4 | slow |
| 4.19.65 | 1.5.7 | 0.2.89 | 2019.02.1 | 1.3.5 | fast |
| 4.19.101 | 1.5.8+pc0.2.89 | 0.2.89 | 2019.02.9 | 1.4.4 | slow |
| 4.19.101 | 1.5.7 | 0.2.89 | 2019.02.1 | 1.3.5 | fast |
| 4.19.101 | 1.5.7+pc0.3.13 | 0.3.13 | 2019.02.1 | 1.3.5 | fast |
| 4.19.101 | 1.5.7+zstd1.4.4 | 0.2.89 | 2019.02.1 | 1.4.4 | fast |
| 4.19.101 | 1.5.9 | 0.3.13 | 2020.02.6 | 1.4.5 | slow |
| … | … | … | … | … | … |
-
@Chris-Whiteley There was another thread about the same issue. I think there was a hot fix to partclone that resolved it. Let me find it.
-
https://forums.fogproject.org/topic/14223/slowness-after-upgrade-to-1-5-7-102-dev-branch/5
In the first post Sebastian posts a link to a new/updated init as well as the latest Linux kernel. At this point I don't know if these inits have been integrated into the dev branch, which was at 1.5.8.1 the last I knew. But if you upgrade to 1.5.8 again and then install the patched inits and updated kernel, you should see your speed return.
-
@Chris-Whiteley The speed change can have different causes, and while we have had reports about this, I am not sure I see a connection here with what you are seeing.
The slowness described in the topics mentioned by @george1421 was due to partclone 0.3.12, which we had added in dev-branch, but I updated it to partclone 0.3.13 just before the release of FOG 1.5.8 (as mentioned in those topics) because people reported speeds were back to normal with it. Please pay attention to the version number you should see in the blue partclone screens and let us know what you see.
The other thing that comes to my mind is that we have heard of certain NVMe drives being very slow with newer Linux kernels. Though on the other hand, this should have been the case in 1.5.7 already, and I don't see why it would only come up with an update to 1.5.8.
So if you see partclone 0.3.13 and don't have NVMe drives, then we might be looking at an entirely different issue here and will need to start debugging it with your help. Start off by deploying to different hardware (if you have any) and let us know if it's consistently slower on all machines.
-
@Sebastian-Roth I have done some testing and the version of partclone is v0.3.13. I have mostly NVMe machines, which makes it hard to want to update to 1.5.8; 1.5.7 did not have the same issues with NVMe drives. I also tested it on one of my machines that I put a normal SSD in, and that machine was also very slow. It is my imaging machine that I make my golden image on, and it is usually around 14 GB/min download and now it is 9.50 GB/min. So it looks like it is slow in both scenarios. The only other thing they have in common is that they are all Dell machines, OptiPlexes and Latitudes.
Thanks for reaching out!
-
@Chris-Whiteley Can we get you to reinstall 1.5.8?
Make sure the version of bzImage is at 4.19.100+
file /var/www/html/fog/service/ipxe/bzImage
If not, grab it from here: https://fogproject.org/kernels/Kernel.TomElliott.4.19.101.64
Download this file: https://fogproject.org/binaries1.5.7.zip, take the init.xz in that zip file, and move it to the /var/www/html/fog/service/ipxe directory, overwriting what 1.5.8 installed.
Now try to PXE boot. The configuration you currently (will) have is the FOG server at 1.5.8 and FOS Linux with the current kernel but the 1.5.7 virtual hard drive. We don't normally like to mix the version of FOS Linux with the version of the FOG server, but we need to see if this condition corrects the issue. Note this is not a fix, only a test condition. If need be, you can run in this (specific) state until the devs can sort this out.
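Put together as commands, the steps above look roughly like this (a sketch only; it assumes a default FOG install layout and the file names from the links above, so adjust paths as needed):
cd /var/www/html/fog/service/ipxe
file bzImage                        # the version string should show 4.19.100 or newer
# If the kernel is older, replace it (keep backups of the originals):
cp bzImage bzImage.orig
wget -O bzImage https://fogproject.org/kernels/Kernel.TomElliott.4.19.101.64
# Pull the 1.5.7 binaries and drop the 1.5.7 init.xz over the 1.5.8 one:
cp init.xz init.xz.orig
wget https://fogproject.org/binaries1.5.7.zip
unzip -o -j binaries1.5.7.zip '*init.xz' -d .   # assumes the zip contains an init.xz as described above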
-
@Chris-Whiteley Just be aware, the image format has changed between partclone 0.2.89 (FOG 1.5.7) and partclone 0.3.13 (FOG 1.5.8). While you can deploy all your old images using the newer partclone, you cannot deploy images captured with 0.3.13 using partclone 0.2.89!
-
@Sebastian-Roth Thanks for the heads up! I have not done any imaging since upgrading. I held off on doing 1.5.7.X since I had that issue with speed. I was hoping that 1.5.8 was going to be different. Luckily I have it as a VM and I just reverted my snapshot so I could do some testing with you guys.
-
@Chris-Whiteley said in Upgrade from 1.5.7 to 1.5.8 issues:
I held off on doing 1.5.7.X since I had that issue with speed. I was hoping that 1.5.8 was going to be different.
While I totally understand that not everyone can be pushing the edge (e.g. using the latest dev-branch), we can only fix the things we are aware of. There is no point in hoping something will be fixed if we don't know about it beforehand. I hope you don't get me wrong here; I don't want to sound harsh or anything, just pointing out that we need people to test things in their environments and report when issues come up.
Anyway, let's tackle this and try to figure out what's wrong. I'd suggest I build fresh inits with the only difference being partclone reverted to 0.2.89. If that turns out to speed things up again for you, we can be sure it's just that, and we can dig into finding the speed issue in the new partclone version. It will be just a few minutes till I post a link for you to download.
-
@Sebastian-Roth I am sorry that I didn't post anything or submit my feedback. As a SysAdmin it is sometimes hard to find the time to dig into issues when you are busy and you know that going back to the version you had fixes the problem so you can move on. You guys have always been incredible and you have a team of people here that truly wants to help. I so appreciate the time and energy you guys spend tirelessly making this into a product I recommend to anyone that will listen to me.
-
@Chris-Whiteley No need to say sorry. I know we are all pretty busy, and I kind of regret having thrown this at you. Thanks for not taking offense; it wasn't meant that way.
Here is the init proposed: https://fogproject.org/inits/init-1.5.8-pc0.2.89.xz
Download it and put it in the /var/www/html/fog/service/ipxe/ directory. Either rename it to init.xz, or leave the filename as is and just set that filename as the Host Init option within one of your test hosts' settings to use it.
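For example, on the FOG server that might look something like this (again just a sketch; whether you swap it in globally or per host is up to you):
cd /var/www/html/fog/service/ipxe/
wget https://fogproject.org/inits/init-1.5.8-pc0.2.89.xz
# Option A: swap it in for everything (back up the stock init first)
#   cp init.xz init.xz.orig && cp init-1.5.8-pc0.2.89.xz init.xz
# Option B: leave the name as is and enter init-1.5.8-pc0.2.89.xz in the Host Init
#   field of a test host's settings in the web UI
-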
@Sebastian-Roth I understand completely where you are coming from and how frustrating it can be when someone doesn't try to help out the community. I take no offense at all.
I will get working on this right away and let you know my findings.
-
Just wanted to chime in with another report on a speed change between 1.5.7 and 1.5.8
1.5.7 ~22 GiB/min
1.5.8 ~11 GiB/min
This is on NVMe drives, and we have gigabit port aggregation on the main deploying node (in case you're wondering how we got it going so fast).
However, on 1.5.7 there was always a slow but steady drop in speed. It would start at 20-25 GiB/min and slowly drop a GiB/min every couple of seconds. But I never cared much since the ~20 GB image was done deploying in 2-3 minutes each time. In 1.5.8 it isn't doing the speed drop and the overall time taken is about the same; it was just cycling between just below and just above 11 GiB/min (i.e. 10.58 - 11.03 or something along those lines). Looking at some of my recent imaging times just before and now after the upgrade to 1.5.8, they're all at about 2 minutes 30 seconds. The only real variation appears to be the hardware being imaged, which is to be expected.
Point being, perhaps there isn’t actually a speed change but rather a more accurate overall average speed for the whole process instead of attempting a realtime speed? Or maybe just a generally more steady speed? Or just a better way of calculating the displayed imaging speed?
@Chris-Whiteley Maybe take a look in the web GUI at Report Viewer -> Imaging Log and see if there's actually a difference in time for your images deploying before and after the upgrade? I'm finding mine are all still within 0-30 seconds of the same time.
-
@JJ-Fullmer Thanks for the update. It is taking the machine considerably longer. Now…longer is relative at about 3-5 extra minutes, but if you have a ton to image it can be painful.
-
@Chris-Whiteley 3-5 minutes is definitely a bigger deal than 0-30 seconds. I was hoping I was right, but I guess not. Have you tried the changes to the kernel suggested?
-
@Sebastian-Roth After a test with the new init I am still seeing the speed decrease. It is almost double what it used to take: my images used to push out in around 2:30 and now it is 4:17.
-
@Chris-Whiteley So just for clarity, the speed drop you are seeing is with which inits? The ones from Sebastian’s link or the 1.5.7 inits?
-
@JJ-Fullmer said in Upgrade from 1.5.7 to 1.5.8 issues:
It would start at 20-25 GiB/min and slowly drop GiB/min every couple seconds. But I never cared much since the ~20 GB image was done deploying in 2-3 minutes each time.
I just want to add a bit of color commentary here. A single 1 GbE link can only carry ~7.5 GB/min theoretical maximum throughput. The number of GB/min you see on the partclone screen is an aggregate value of network throughput and the speed at which the image can be rehydrated on the target computer and written to storage. I find 20-25 GB/min a bit hard to believe (but not impossible to reach); that would mean you have a saturated 1 GbE network link, your image compression ratio is almost 4:1, and you have a very fast storage disk. I might expect around 13 GB/min on a well managed 1 GbE network; that would mean a saturated 1 GbE link with about a 2:1 compression ratio on a fast storage disk. So why is it so fast in the beginning and then drops off? Something must be buffering the data, and it settles down as the buffer gets full and writes are forced to wait until the storage can take in the data.
My comments have nothing to do with the slower speeds with 1.5.8 but to explain why such speeds are possible on a 1 GbE network.
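As a back-of-the-envelope check of the numbers above (pure arithmetic, with the compression ratios just example values):
awk 'BEGIN {
  wire = (1000 / 8) * 60 / 1000                  # 1 Gb/s on the wire = 125 MB/s = 7.5 GB/min
  for (r = 1; r <= 4; r++)
    printf "compression %d:1 -> up to %.1f GB/min written at the target\n", r, wire * r
}'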
-
@george1421 On the node that was showing that speed I have a bonded/aggregated link, so the node has a 2 Gbps link. Then the NVMe storage has a theoretical write speed of 2.3 GB/s, which is a theoretical 138 GB/min (I don't expect to see that kind of speed of course, just cool to think about, and it shows that's certainly not the bottleneck). I think that the 11 GiB/min I see now on 1.5.8 is probably closer to the actual speed I've been experiencing the whole time.
-
@JJ-Fullmer I appreciate your feedback and clarification. I also want to add a clarification to your post so as not to confuse others that may read this in the future. Hold on, I feel this is going to be a wall of text…
You have a 2-link bonded connection. That doesn't imply that it gives you 2 Gb/s of bandwidth (i.e. twice as fast as a single link). LAG/bonded/teamed groups don't work that way (at least with today's technology). A 2-link bonded group gives you two 1 GbE links into that device. It works the same way as adding an additional lane to a highway: your road can carry more traffic, but the speed limit is still 70 mph. Also, assuming we are talking about LACP/802.3ad/MLT links, there is a hashing algorithm that is used to decide which traffic flows across which link. Once the link route has been determined (i.e. link 1 or link 2), that link route does not change during the lifetime of the communication between the two devices (assuming that port-based hashing is not used). So the guidance is: between any two devices you will never get communication faster than the speed of a single link. With only 2 actors (FOG server and target computer) you will only ever get the speed of a single 1 GbE link. So having a LAG/MLT/bonded link on the target computer will not help you one bit for imaging. Having a LAG/MLT/bonded link on the server end will not help you when there is only a FOG server and target computer involved. Having a LAG group on the FOG server when more than 2 actors are involved will help you spread the load across the links based on the link hashing protocol.
See I told you it was going to be a wall of text.
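If you want to confirm what the bond on the server is actually doing, one quick check (assuming the Linux bonding driver and an interface named bond0; adjust to your setup) is:
grep -E 'Bonding Mode|Transmit Hash Policy|Slave Interface|Speed' /proc/net/bonding/bond0
That shows the bonding mode, the hash policy used to pick a link per flow, and the member links with their individual speeds.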