Upgrade from 1.5.7 to 1.5.8 issues
-
@Chris-Whiteley No need to say sorry. I know we are all pretty busy and I kind of regret having thrown this at you. Thanks for not taking it as offense. Wasn’t meant to.
Here is the init proposed: https://fogproject.org/inits/init-1.5.8-pc0.2.89.xz
Download it and put it in /var/www/html/fog/service/ipxe/. Either rename it to init.xz or leave the filename as is and set the filename as the Host Init option in the settings of one of your test hosts to use it.
-
@Sebastian-Roth I understand completely where you are coming from and how frustrating it could be to have someone not try and help out the community. I take no offense at all.
I will get working on this right away and let you know my findings.
-
Just wanted to chime in with another report on a speed change between 1.5.7 and 1.5.8
1.5.7 ~22 GiB/min
1.5.8 ~11 GiB/min
This is on nvme drives, and we have a gigabit port aggregation on the main deploying node (in case you’re wondering how we got it going so fast).
However on 1.5.7 there was always a slow but steady drop in speed. It would start at 20-25 GiB/min and slowly drop GiB/min every couple seconds. But I never cared much since the ~20 GB image was done deploying in 2-3 minutes each time. In 1.5.8 it isn’t doing the speed drop and the overall time taken is about the same. It just cycles between slightly below and slightly above 11 GiB/min (e.g. 10.58 - 11.03 or something along those lines). Looking at some of my recent imaging times from just before and just after the upgrade to 1.5.8, they’re all at about 2 minutes 30 seconds. The only real variation appears to be the hardware being imaged, which is to be expected.
Point being, perhaps there isn’t actually a speed change but rather a more accurate overall average speed for the whole process instead of attempting a realtime speed? Or maybe just a generally more steady speed? Or just a better way of calculating the displayed imaging speed?
@Chris-Whiteley Maybe take a look in the web gui at the report viewer -> Imaging log and see if there’s actually a difference in time for your images deploying before and after the upgrade? I’m finding mine are all still within 0-30 seconds of the same time.
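As a rough cross-check of the idea above, the whole-task average can be worked out from the image size and the duration shown in the Imaging log. This is only a sketch: the function name is made up for illustration, the ~20 GB and 2 min 30 s figures are the approximate ones quoted in this post, and the logged duration may well include more than the data transfer itself.

```python
def average_rate_gb_per_min(image_size_gb: float, minutes: int, seconds: int) -> float:
    """Average rate over the whole task: image size divided by elapsed minutes."""
    total_minutes = minutes + seconds / 60
    return image_size_gb / total_minutes


# Approximate figures from the post: ~20 GB image, about 2 min 30 s per deploy.
rate = average_rate_gb_per_min(20, 2, 30)
print(f"whole-task average: {rate:.1f} GB/min")
```

If the logged times are the same before and after the upgrade, this average is the same too, regardless of what the live counter displayed along the way.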
-
@JJ-Fullmer Thanks for the update. It is taking the machine considerably longer. Now…longer is relative at about 3-5 extra minutes, but if you have a ton to image it can be painful.
-
@Chris-Whiteley 3-5 minutes is definitely a bigger deal than 0-30 seconds. I was hoping I was right, but I guess not. Have you tried the changes to the kernel suggested?
-
@Sebastian-Roth After a test with the new init I am still having the issues of speed decrease. It is almost double what it used to take. My images being pushed out was around 2:30 minutes and now it is 4:17.
-
@Chris-Whiteley So just for clarity, the speed drop you are seeing is with which inits? The ones from Sebastian’s link or the 1.5.7 inits?
-
@JJ-Fullmer said in Upgrade from 1.5.7 to 1.5.8 issues:
It would start at 20-25 GiB/min and slowly drop GiB/min every couple seconds. But I never cared much since the ~20 GB image was done deploying in 2-3 minutes each time.
I just want to add a bit of color commentary here. A single 1 GbE link can only carry ~7.5 GB/min theoretical maximum throughput. The number of GB/min you see on the partclone screen is an aggregate value of network throughput and the speed at which the image can be rehydrated on the target computer and written to storage. I find 20-25 GB/min a bit hard to believe (but not impossible to reach): that would mean you have a saturated 1 GbE network link and your image compression ratio was almost 4:1 with a very fast storage disk. I might expect around 13 GB/min on a well-managed 1 GbE network. That would mean a saturated 1 GbE link with about a 2:1 compression ratio on a fast storage disk. So why is it so fast in the beginning before it drops off? Something must be buffering the data, and the speed settles down as the buffer fills up and the transfer is forced to wait until the storage can take in the data.
My comments have nothing to do with the slower speeds with 1.5.8 but to explain why such speeds are possible on a 1 GbE network.
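For anyone who wants to redo the arithmetic above, here is a minimal sketch. The function names are made up for illustration, it uses decimal units (1 Gbit = 1e9 bits), and it ignores protocol overhead entirely, so real-world ratios would need to be somewhat higher than what it prints.

```python
def link_gb_per_min(link_gbit_per_s: float) -> float:
    """Raw line rate of a link in GB/min (decimal units, no protocol overhead)."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return bytes_per_second * 60 / 1e9


def required_compression_ratio(displayed_gb_per_min: float, link_gbit_per_s: float) -> float:
    """Compression ratio needed for a saturated link to show the given rate."""
    return displayed_gb_per_min / link_gb_per_min(link_gbit_per_s)


print(f"1 GbE line rate: {link_gb_per_min(1.0):.1f} GB/min")
for displayed in (13, 20, 25):
    ratio = required_compression_ratio(displayed, 1.0)
    print(f"{displayed} GB/min displayed needs about {ratio:.1f}:1 compression")
```

The first line it prints is the ~7.5 GB/min ceiling mentioned above; anything displayed beyond that on a single 1 GbE link has to come from compression or buffering.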
-
@george1421 On the node that was showing that speed I have a bonded/aggregated link. So the node has a 2 Gbps link. Then the nvme storage has a theoretical write speed of 2.3 GB/s which is a theoretical speed of 138 GB/min (I don’t expect to see that kinda speed of course, just cool to think about, and shows that’s certainly not a bottleneck). I think that the 11 GiB/min I see now on 1.5.8 is probably closer to the actual speed I’ve been experiencing the whole time.
-
@JJ-Fullmer I appreciate your feedback and clarification. I also want to add a clarification to your post so as not to confuse others who may read this in the future. Hold on, I feel this is going to be a wall of text…
You have a 2 link bonded connection. That doesn’t imply that it gives you 2Gb/s of bandwidth (i.e. twice as fast as a single link). LAG/bonded/teamed groups don’t work that way (at least with today’s technology). A 2 link bonded group would give you two 1GbE links into that device. It works the same way as adding an additional lane to a highway. Your road can carry more traffic, but the speed limit is still 70mph. Also, assuming we are talking about lacp/802.3ad/mlt links, there is a hashing algorithm that is used to decide which traffic flows across which link. Once the link route has been determined (i.e. link 1 or link 2), that link route does not change during the lifetime of the communication between the two devices (assuming that port based hashing is not used). So the guidance is that between any two devices you will never have any faster communication than the speed of a single link. With only 2 actors (FOG server and target computer) you will only get, at best, the speed of a single 1GbE link. So having a LAG/MLT/bonded link on the target computer will not help you one bit for imaging. Having a LAG/MLT/bonded link on the server end will not help you when there is only a FOG server and target computer involved. Having a LAG group on the FOG server when more than 2 actors are involved will help you to spread the load across the links based on the link hashing protocol.
See I told you it was going to be a wall of text.
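To make the hashing point a bit more concrete, here is a toy sketch. It is a simplified MAC-based hash of my own for illustration, not the exact algorithm of any particular switch or bonding driver, and the MAC addresses are invented.

```python
def pick_link(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Toy layer-2 style hash: XOR the last byte of each MAC, modulo the link count."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_links


# Hypothetical addresses: one FOG server with a 2-link bond, three target computers.
server = "00:11:22:33:44:10"
clients = ["00:11:22:33:44:a1", "00:11:22:33:44:a2", "00:11:22:33:44:a3"]

for client in clients:
    link = pick_link(server, client, n_links=2)
    print(f"{server} -> {client}: always uses link {link}")
```

The same server/client pair hashes to the same link every time, so a single deploy never sees more than one link's worth of bandwidth, while several clients can end up spread across both links.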
-
@george1421 I agree and understand how it all works, but I have found that we did get an increase in speed when we set up the aggregated adapter on the storage node, even with just one client going. But perhaps that’s really just agreeing with your statements. Like on a highway with only one lane, you often slow down because of the other slow drivers, and perhaps I just opened up a metaphorical passing lane for FOG images to go the full speed limit. You also have to consider all the switches it goes through, and so on. And of course it’s all more complicated. Point being, I didn’t get a 2x boost when we aggregated the server link, but we did get a boost, so I wouldn’t deter anyone with equipment capable of it from giving it a try.
-
@Chris-Whiteley said in Upgrade from 1.5.7 to 1.5.8 issues:
After a test with the new init I am still having the issues of speed decrease. It is almost double what it used to take. My images being pushed out was around 2:30 minutes and now it is 4:17.
Ok, then we need to start looking at other things I suppose. Did you try going back to use the inits from 1.5.7 (only inits, not the kernel as a first try) as suggested by @george1421 yet? https://fogproject.org/binaries1.5.7.zip
What if there is something in your network that changed?
-
@Sebastian-Roth Nothing on the network has changed and I did try @george1421’s suggestion as well. It was a 2 minute install/upgrade of 1.5.8 and the snapshot revert took about 2 minutes max. Sorry, I have been on vacation the last couple of days but I wanted to make sure to get back to you on those things.
-
@Chris-Whiteley So what’s the outcome of testing George’s suggestion? Sorry for nagging. Take your time…
-
@Sebastian-Roth it was the same as the new init
-
@Chris-Whiteley So that would leave us with an issue in the Linux kernel or FOG server. Kernel is easier to test. Leave the inits from 1.5.7 in place and grab the kernel binaries (bzImage*) from the archive to use. See if it’s back to speed like this.
-
@Sebastian-Roth said in Upgrade from 1.5.7 to 1.5.8 issues:
So that would leave us with an issue in the Linux kernel or FOG server.
-OR- something outside of FOG causing the delay.
-
@Sebastian-Roth I just tested with the old bzImage from 1.5.7 and the speed was much faster, back to what I am used to.
-
@Chris-Whiteley If you are testing the bzImage from 1.5.7 and 1.5.8 on different days, could you test the bzImage from 1.5.8 now? I’m just trying to rule out other variables, because what you’ve told us should not be happening. I’m not saying it can’t happen, only that it’s a bit strange within the same kernel series (4.19.x).
-
@george1421 I am testing it now. I found the 1.5.8 binaries.