Upgrade from 1.5.7 to 1.5.8 issues
-
@JJ-Fullmer said in Upgrade from 1.5.7 to 1.5.8 issues:
It would start at 20-25 GiB/min and slowly drop GiB/min every couple seconds. But I never cared much since the ~20 GB image was done deploying in 2-3 minutes each time.
I just want to add a bit of color commentary here. A single 1 GbE link can only carry ~7.5 GB/min theoretical maximum throughput. The GB/min figure you see on the partclone screen is an aggregate of network throughput and the speed at which the image can be rehydrated on the target computer and written to storage.
I find 20-25 GB/min a bit hard to believe (though not impossible): it would mean a saturated 1 GbE network link, an image compression ratio of almost 4:1, and a very fast storage disk. On a well managed 1 GbE network I might expect around 13 GB/min, which would mean a saturated 1 GbE link with about a 2:1 compression ratio writing to a fast storage disk.
So why is it so fast in the beginning and then drops off? Something must be buffering the data, and it settles down as the buffer fills and is forced to wait until the storage can take in the data.
My comments have nothing to do with the slower speeds with 1.5.8 but to explain why such speeds are possible on a 1 GbE network.
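For reference, here is the arithmetic behind those figures as a quick sketch (my numbers, ignoring protocol overhead):

```shell
# 1 GbE line rate: 1 gigabit/s = 0.125 GB/s = 7.5 GB/min
awk 'BEGIN { printf "line rate: %.1f GB/min\n", 1 / 8 * 60 }'

# Partclone reports uncompressed data written, so the apparent rate is
# roughly line rate x compression ratio (plus whatever buffering adds).
awk 'BEGIN { printf "2:1 compression: %.0f GB/min\n", 7.5 * 2 }'
awk 'BEGIN { printf "4:1 compression: %.0f GB/min\n", 7.5 * 4 }'
```

Subtract real-world overhead from the 2:1 figure and you land near the ~13 GB/min I'd expect on a well managed network.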
-
@george1421 On the node that was showing that speed I have a bonded/aggregated link. So the node has a 2 Gbps link. Then the nvme storage has a theoretical write speed of 2.3 GB/s which is a theoretical speed of 138 GB/min (I don’t expect to see that kinda speed of course, just cool to think about, and shows that’s certainly not a bottleneck). I think that the 11 GiB/min I see now on 1.5.8 is probably closer to the actual speed I’ve been experiencing the whole time.
-
@JJ-Fullmer I appreciate your feedback and clarification. I also want to add a clarification to your post so as not to confuse others who may read this in the future. Hold on, I feel this is going to be a wall of text…
You have a 2-link bonded connection. That doesn’t mean it gives you 2 Gb/s of bandwidth (i.e. twice as fast as a single link). LAG/bonded/teamed groups don’t work that way (at least with today’s technology). A 2-link bonded group gives you two 1 GbE links into that device. It works the same way as adding an additional lane to a highway: the road can carry more traffic, but the speed limit is still 70 mph.
Also, assuming we are talking about LACP/802.3ad/MLT links, a hashing algorithm is used to decide which traffic flows across which link. Once the link route has been determined (i.e. link 1 or link 2), that route does not change during the lifetime of the communication between the two devices (assuming port-based hashing is not used). So the guidance is: between any two devices you will never get faster communication than the speed of a single link.
With only 2 actors (FOG server and target computer) the best speed you can get is that of a single 1 GbE link. So a LAG/MLT/bonded link on the target computer will not help you one bit for imaging, and a bonded link on the server end will not help when only a FOG server and target computer are involved. A LAG group on the FOG server does help when more than 2 actors are involved, because the load is spread across the links according to the link hashing protocol.
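As a toy illustration of why one flow is pinned to one link: with a layer-2 hash policy (e.g. Linux bonding's `xmit_hash_policy=layer2`), the link is chosen from the two MAC addresses, so the same pair of machines always lands on the same link. The MAC octets below are made up:

```shell
# Simplified layer-2 bonding hash: XOR the last octets of the source and
# destination MACs, modulo the number of links in the group.
src=0x1b   # hypothetical last octet of the FOG server's MAC
dst=0x4e   # hypothetical last octet of the target computer's MAC
links=2
echo "flow pinned to link $(( (src ^ dst) % links ))"
# Same inputs give the same answer every time, so one server/client pair
# can never use more than one link's worth of bandwidth.
```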
See I told you it was going to be a wall of text.
-
@george1421 While I agree and understand how it all works, I have found that we did get an increase in speed when we set up the aggregated adapter on the storage node, even with just one client going. But perhaps that’s really just agreeing with your statements. Like on a one-lane highway, you often slow down because of the other slow drivers, and perhaps I just opened up a metaphorical passing lane for FOG images to go the full speed limit. You do also have to consider all the switches it goes through, and of course it’s all more complicated. Point being, I didn’t get a 2x boost when we aggregated the server link, but we did get a boost, so I wouldn’t deter anyone with the equipment capable of it from giving it a try.
-
@Chris-Whiteley said in Upgrade from 1.5.7 to 1.5.8 issues:
After a test with the new init I am still having the issues of speed decrease. It takes almost double what it used to. My images used to push out in around 2:30 and now it is 4:17.
Ok, then we need to start looking at other things, I suppose. Did you try going back to the inits from 1.5.7 (only the inits, not the kernel, as a first try) as suggested by @george1421 yet? https://fogproject.org/binaries1.5.7.zip
What if there is something in your network that changed?
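A minimal sketch of that init-only rollback, assuming the 1.5.7 archive is already extracted and the default web directory for the boot files (adjust both paths for your install):

```shell
# Swap in the 1.5.7 inits while keeping the 1.5.8 kernel.
rollback_inits() {
    ipxe_dir=$1   # FOG's boot-file directory, e.g. /var/www/html/fog/service/ipxe
    old_dir=$2    # wherever binaries1.5.7.zip was extracted
    for f in init.xz init_32.xz; do
        cp "$ipxe_dir/$f" "$ipxe_dir/$f.158.bak"   # keep the 1.5.8 copy
        cp "$old_dir/$f" "$ipxe_dir/$f"
    done
}
```

The same pattern works for `bzImage`/`bzImage32` later, if the kernel needs to be tested on its own.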
-
@Sebastian-Roth Nothing on the network has changed, and I did try @george1421’s suggestion as well. It was a 2 minute install/upgrade of 1.5.8, and the snapshot revert took about 2 minutes max. Sorry, I have been on vacation the last couple of days, but I wanted to make sure to get back to you on those things.
-
@Chris-Whiteley So what’s the outcome of testing George’s suggestion? Sorry for nagging. Take your time…
-
@Sebastian-Roth it was the same as the new init
-
@Chris-Whiteley So that would leave us with an issue in the Linux kernel or the FOG server. The kernel is easier to test: leave the inits from 1.5.7 in place and grab the kernel binaries (bzImage*) from the archive to use. See if it’s back to speed like this.
-
@Sebastian-Roth said in Upgrade from 1.5.7 to 1.5.8 issues:
So that would leave us with an issue in the Linux kernel or FOG server.
-OR- something outside of FOG causing the delay.
-
@Sebastian-Roth I just tested with the old bzImage from 1.5.7 and the speed was much faster, back to what I am used to.
-
@Chris-Whiteley If you tested the bzImage from 1.5.7 and the one from 1.5.8 on different days, could you test the bzImage from 1.5.8 now? I’m just trying to rule out other variables, because what you’ve told us should not happen. I’m not saying it can’t, only that it’s a bit strange within the same kernel series (4.19.x).
-
@george1421 I am testing it now. I found the 1.5.8 binaries.
-
@Chris-Whiteley That was the 4.19.101 version.
But maybe now I’m confused. I’m trying to build a truth table, and I thought we had zeroed in on something:
1.5.8:bzImage 4.19.101 + 1.5.8:init.xz == Slow (partclone 0.3.13)
1.5.8:bzImage 4.19.101 + 1.5.7:init.xz == Slow (partclone 0.3.13)
1.5.7:bzImage 4.19.65  + 1.5.7:init.xz == Fast (partclone 0.2.89)
1.5.8:bzImage 4.19.101 + 1.5.7:init.xz == ?? (partclone 0.2.89)
-
@george1421 This is correct so far. This last test proved to be faster this time… hmmm… It has also been the most stable, in that it doesn’t go from ridiculous speeds down to what it actually should be; now it starts where it normally settles.
-
@Chris-Whiteley said in Upgrade from 1.5.7 to 1.5.8 issues:
This last test proved to be faster this time
Just as fast, or faster than last time?
Also, this is why I wanted you to test on the same day with the same set of circumstances, just in case something in network land was different from last week.
-
@george1421 Just as fast as when I had 1.5.7 and it was working. Both tests, on a normal SSD and on NVMe, worked as I think they should. How do I now get an output of what versions of things I have, so that you guys can figure out what you need to?
-
@Chris-Whiteley Well, for bzImage that’s easy:
file bzImage
will tell you the current version of bzImage. For init.xz it’s not as easy, but not hard either. If we use the md5sum utility we can get the fingerprint of the init.xz file.
This is for 1.5.7:
md5sum init.xz
913326f3317b577be3cb65a7bf332afb init.xz
If you have the chance, since something is different, can you test with the init.xz from 1.5.8? It’s best to do it all in one day if possible.
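Both checks can be wrapped in one helper; the default directory is an assumption for a standard Apache-based install:

```shell
# Print the kernel version string and the init fingerprint together.
fog_boot_versions() {
    dir=${1:-/var/www/html/fog/service/ipxe}
    file "$dir/bzImage"      # the embedded string includes the kernel release
    md5sum "$dir/init.xz"    # 913326f3317b577be3cb65a7bf332afb == the 1.5.7 init
}
```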
-
bzImage - 4.19.101
init.xz - 913326f
-
@george1421 Just tested with the 1.5.8 init.xz and got the slowness again.