Upgrade from 1.5.7 to 1.5.8 issues
-
@Chris-Whiteley Just be aware, image format has changed between partclone 0.2.89 (FOG 1.5.7) and partclone 0.3.13 (FOG 1.5.8). While you can deploy all your old images using the newer partclone you cannot deploy images captured with 0.3.13 using partclone 0.2.89!
-
@Sebastian-Roth Thanks for the heads up! I have not done any imaging since upgrading. I held off on doing 1.5.7.X since I had that issue with speed. I was hoping that 1.5.8 was going to be different. Luckily I have it as a VM and I just reverted my snapshot so I could do some testing with you guys.
-
@Chris-Whiteley said in Upgrade from 1.5.7 to 1.5.8 issues:
I held off on doing 1.5.7.X since I had that issue with speed. I was hoping that 1.5.8 was going to be different.
While I totally understand that not everyone can be pushing the edge (e.g. using latest dev-branch) we can only fix the things we are aware of. There is no point in hoping something will be fixed if we don’t know about it beforehand. Hope you don’t get me wrong here. I don’t want to sound harsh or anything, just pointing out that we need people to test things in their environments and report when issues come up.
Anyway, let’s face it and try to figure out what’s wrong. I’d suggest I build fresh inits with the only difference of partclone being reverted to 0.2.89. If that turns out to speed things up again for you we are sure it’s just that and we can dig into finding the speed issue in the new partclone version. Will be just a few minutes till I post a link for you to download.
-
@Sebastian-Roth I am sorry that I didn’t post anything or submit my feedback. As a SysAdmin it is hard sometimes to find the time to start trying to dig into issues when you are busy and you know that going back to the version you had fixes the issue and you can move on. You guys have always been incredible and you have a team of people here that truly wants to help. I so appreciate the time and energy you guys spend tirelessly making this into a product I recommend to anyone that will listen to me.
-
@Chris-Whiteley No need to say sorry. I know we are all pretty busy and I kind of regret having thrown this at you. Thanks for not taking it as offense. Wasn’t meant to.
Here is the init proposed: https://fogproject.org/inits/init-1.5.8-pc0.2.89.xz
Download it and put it in /var/www/html/fog/service/ipxe/. Either rename it to init.xz or leave the filename as is and set that filename as the Host Init option in the settings of one of your test hosts to use it.
-
@Sebastian-Roth I understand completely where you are coming from and how frustrating it could be to have someone not try and help out the community. I take no offense at all.
I will get working on this right away and let you know my findings.
-
Just wanted to chime in with another report on a speed change between 1.5.7 and 1.5.8:
1.5.7 ~22 GiB/min
1.5.8 ~11 GiB/min
This is on NVMe drives, and we have a gigabit port aggregation on the main deploying node (in case you’re wondering how we got it going so fast).
However, on 1.5.7 there was always a slow but steady drop in speed. It would start at 20-25 GiB/min and slowly drop every couple of seconds. But I never cared much since the ~20 GB image was done deploying in 2-3 minutes each time. In 1.5.8 it isn’t doing the speed drop and the overall time taken is about the same. It was just cycling between just below and just above 11 GiB/min (i.e. 10.58 - 11.03 or something along those lines). Looking at some of my recent imaging times just before and now after the upgrade to 1.5.8, they’re all at about 2 minutes 30 seconds. The only real variation appears to be the hardware being imaged, which is to be expected.
Point being, perhaps there isn’t actually a speed change but rather a more accurate overall average speed for the whole process instead of attempting a realtime speed? Or maybe just a generally more steady speed? Or just a better way of calculating the displayed imaging speed?
@Chris-Whiteley Maybe take a look in the web gui at the report viewer -> Imaging log and see if there’s actually a difference in time for your images deploying before and after the upgrade? I’m finding mine are all still within 0-30 seconds of the same time.
-
@JJ-Fullmer Thanks for the update. It is taking the machine considerably longer. Now…longer is relative at about 3-5 extra minutes, but if you have a ton to image it can be painful.
-
@Chris-Whiteley 3-5 minutes is definitely a bigger deal than 0-30 seconds. I was hoping I was right, but I guess not. Have you tried the changes to the kernel suggested?
-
@Sebastian-Roth After a test with the new init I am still having the issues of speed decrease. It is almost double what it used to take. My images being pushed out was around 2:30 minutes and now it is 4:17.
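To put a number on "almost double", the two deploy times quoted above can be compared directly. This is only a small sketch; the helper name `to_seconds` is made up for illustration and the two times are the ones reported in this post.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '2:30' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


before = to_seconds("2:30")  # deploy time reported before the upgrade
after = to_seconds("4:17")   # deploy time reported after the upgrade

slowdown = after / before    # how many times longer the deploy takes now
extra = after - before       # additional seconds per deploy

print(f"{slowdown:.2f}x slower, {extra} s extra per deploy")
# prints "1.71x slower, 107 s extra per deploy"
```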
-
@Chris-Whiteley So just for clarity, the speed drop you are seeing is with which inits? The ones from Sebastian’s link or the 1.5.7 inits?
-
@JJ-Fullmer said in Upgrade from 1.5.7 to 1.5.8 issues:
It would start at 20-25 GiB/min and slowly drop GiB/min every couple seconds. But I never cared much since the ~20 GB image was done deploying in 2-3 minutes each time.
I just want to add a bit of color commentary here. A single 1 GbE link can only carry ~7.5 GB/min theoretical maximum throughput. The number of GB/min you see on the partclone screen is an aggregate value of network throughput and the speed at which the image can be rehydrated on the target computer and written to storage. I find 20-25 GB/min a bit hard to believe (but not impossible to reach): that would mean you have a saturated 1 GbE network link and your image compression ratio was almost 4:1 with a very fast storage disk. I might expect around 13 GB/min on a well managed 1 GbE network. That would mean a saturated 1 GbE link with about a 2:1 compression ratio on a fast storage disk. So why is it so fast in the beginning before it drops off? Something must be buffering the data, and the speed settles down as the buffer gets full and is forced to wait until the storage can take in the data.
My comments have nothing to do with the slower speeds with 1.5.8 but to explain why such speeds are possible on a 1 GbE network.
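The arithmetic behind those figures can be sketched roughly as follows. This is a back-of-the-envelope model, not how partclone computes its display: it assumes decimal units (1 GB = 10^9 bytes), ignores protocol overhead, and the function names are invented for illustration.

```python
def link_gb_per_min(link_gbps: float) -> float:
    """Raw line rate of a network link in GB/min (decimal units, no overhead)."""
    bytes_per_sec = link_gbps * 1e9 / 8
    return bytes_per_sec * 60 / 1e9


def displayed_gb_per_min(link_gbps: float,
                         compression_ratio: float,
                         disk_gb_per_min: float = float("inf")) -> float:
    """Upper bound on the rate of uncompressed data landing on disk.

    The compressed stream arrives at the link rate, expands by the
    compression ratio, and is capped by what the disk can write.
    """
    return min(link_gb_per_min(link_gbps) * compression_ratio, disk_gb_per_min)


if __name__ == "__main__":
    print(link_gb_per_min(1.0))            # saturated 1 GbE link, raw
    print(displayed_gb_per_min(1.0, 2.0))  # same link with 2:1 compression
    print(displayed_gb_per_min(1.0, 4.0))  # same link with 4:1 compression
```

Under these assumptions a saturated 1 GbE link gives 7.5 GB/min raw, 15 GB/min at 2:1 and 30 GB/min at 4:1, so real-world numbers such as ~13 GB/min or 20-25 GB/min sit below those ceilings once overhead is included.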
-
@george1421 On the node that was showing that speed I have a bonded/aggregated link. So the node has a 2 Gbps link. Then the nvme storage has a theoretical write speed of 2.3 GB/s which is a theoretical speed of 138 GB/min (I don’t expect to see that kinda speed of course, just cool to think about, and shows that’s certainly not a bottleneck). I think that the 11 GiB/min I see now on 1.5.8 is probably closer to the actual speed I’ve been experiencing the whole time.
-
@JJ-Fullmer I appreciate your feedback and clarification. I also want to add a clarification to your post to not confuse others that may read this in the future. Hold on, I feel this is going to be a wall of text…
You have a 2 link bonded connection. That doesn’t imply that it gives you 2 Gb/s of bandwidth (i.e. twice as fast as a single link). LAG/bonded/teamed groups don’t work that way (at least with today’s technology). A 2 link bonded group would give you two 1 GbE links into that device. It works the same way as adding an additional lane to a highway. Your road can carry more traffic, but the speed limit is still 70 mph. Also, assuming we are talking about lacp/802.3ad/mlt links, there is a hashing algorithm that is used to decide which traffic flows across which link. Once the link route has been determined (i.e. link 1 or link 2), that link route does not change during the lifetime of the communication between the two devices (assuming that port based hashing is not used). So the guidance is that between any two devices you will never have any faster communication than the speed of a single link. With only 2 actors (FOG server and target computer) you will only have the best speed of a 1 GbE link. So having a LAG/MLT/bonded link on the target computer will not help you one bit for imaging. Having a LAG/MLT/bonded link on the server end will not help you when there is only a FOG server and target computer involved. Having a LAG group on the FOG server when more than 2 actors are involved will help you to spread the load across the links based on the link hashing protocol.
See I told you it was going to be a wall of text.
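The link selection described above can be illustrated with a toy hash. This is a simplified sketch loosely modeled on a MAC-based (layer 2) hash policy; real switches and bonding drivers use their own hash functions, and the MAC addresses and function names here are made up.

```python
def mac_to_int(mac: str) -> int:
    """Turn a MAC address like '00:11:22:33:44:55' into an integer."""
    return int(mac.replace(":", ""), 16)


def pick_link(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Choose a member link by hashing the two endpoints (toy XOR hash)."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % n_links


server = "00:11:22:33:44:55"  # hypothetical FOG server
clients = [                   # hypothetical target computers
    "aa:bb:cc:00:00:01",
    "aa:bb:cc:00:00:02",
    "aa:bb:cc:00:00:03",
]

# The same server/client pair always hashes to the same link, so a single
# deploy never gets more than one link's worth of bandwidth.
for client in clients:
    print(client, "-> link", pick_link(server, client, n_links=2))
# prints link 0, link 1, link 0 for the three clients
```

With one client the result never changes, which is why a single imaging session stays on one 1 GbE link; with several clients the flows are spread across both links.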
-
@george1421 While I agree and understand how it all works, I have found that we did get an increase in speed when we set up the aggregated adapter on the storage node, even with just one client going. But perhaps that’s really just agreeing with your statements. Like on a highway, if it was 1 lane, you often slow down because of the other slow drivers, and perhaps I just opened up a metaphorical passing lane for fog images to go the full speed limit. You do also have to consider all the switches it goes through and yada yada. And of course it’s all more complicated. Point being, I didn’t get a 2x boost when we aggregated the server link, but we did get a boost, so I wouldn’t deter anyone with the equipment capable of it from giving it a try.
-
@Chris-Whiteley said in Upgrade from 1.5.7 to 1.5.8 issues:
After a test with the new init I am still having the issues of speed decrease. It is almost double what it used to take. My images being pushed out was around 2:30 minutes and now it is 4:17.
Ok, then we need to start looking at other things I suppose. Did you try going back to use the inits from 1.5.7 (only inits, not the kernel as a first try) as suggested by @george1421 yet? https://fogproject.org/binaries1.5.7.zip
What if there is something in your network that changed?
-
@Sebastian-Roth Nothing on the network has changed and I did try @george1421’s suggestion as well. It was a 2 minute install/upgrade of 1.5.8 and the snapshot revert took about 2 minutes max. Sorry, I have been on vacation the last couple of days but I wanted to make sure to get back to you on those things.
-
@Chris-Whiteley So what’s the outcome of testing George’s suggestion? Sorry for nagging. Take your time…
-
@Sebastian-Roth It was the same as with the new init.
-
@Chris-Whiteley So that would leave us with an issue in the Linux kernel or FOG server. Kernel is easier to test. Leave the inits from 1.5.7 in place and grab the kernel binaries (bzImage*) from the archive to use. See if it’s back to speed like this.