Deployment stuck at x percentage

sega

Hi,

We have noticed recently that we have issues with the deployment process. Sometimes the deployment gets stuck at a random percentage.
The problem happens with different PCs and different images. Is there a log file somewhere that could give me a hint about what the problem could be?

george1421

@sega It would be helpful to have a picture of the error message to know where it’s getting stuck at a percentage. But if I have to guess, it would be at the beginning of the imaging process, where it’s downloading bzImage or init.xz to the target computer. When it hangs there, it never continues.

If it’s getting stuck here, you might want to rebuild iPXE with the latest version.

Also, you didn’t mention the version of FOG you are using.

sega

@george1421 It’s always different: sometimes at 34%, then 56%, then another number. The next time it happens I will take a picture of it.
The FOG version is 1.5.10.

sega

We tried to reset 8 PCs yesterday. 3 of them got stuck at 22, 24 and 25%, as shown in the image. It is also always different PCs.
https://drive.google.com/file/d/1tca2-DMd16ZeNwSitgC0hb0-xhaw-Ma2/view?usp=drive_link

george1421

@sega OK, the picture tells me more about where it’s hanging. When it hangs here, it’s more indicative of a network infrastructure issue. At this point you are running under the FOS (Linux) engine. You could always try updating the kernel to the latest 6.x version using the web UI: FOG Configuration -> Kernel Update. But my intuition is telling me it’s something outside of the FOG ecosystem causing this outage.

The transfer rate of 948MB/min is also not good. On a well-behaved, well-designed 1GbE network you should see 5.7 to 6.5GB/min. To translate: 948MB/min == 15.8MB/sec.

1GbE is 125MB/s
100Mb/s is 12.5MB/s

With data compression, 15.8MB/sec is pretty close to 12.5MB/s. So, just using some wild guessing, I would have to say you have a 100Mb/s network link somewhere between the FOG server and the target computer.
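
The conversions above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1GB = 1000MB, 8 bits per byte, no protocol overhead), and the function names are made up for illustration. Partclone may count its units differently, so treat the results as approximate.

```python
# Unit conversions for the rates quoted in this thread.
# Assumptions: decimal units, no protocol overhead; helper names are illustrative.

def mb_per_min_to_mb_per_s(mb_per_min: float) -> float:
    """Convert a Partclone-style rate in MB/min to MB/s."""
    return mb_per_min / 60.0

def link_mbit_to_mbyte_per_s(mbit_per_s: float) -> float:
    """Convert a nominal link speed in Mb/s to MB/s (8 bits per byte)."""
    return mbit_per_s / 8.0

observed = mb_per_min_to_mb_per_s(948)        # rate seen on the stuck clients
healthy_low = mb_per_min_to_mb_per_s(5700)    # 5.7GB/min, lower end of a healthy 1GbE rate
gigabit = link_mbit_to_mbyte_per_s(1000)      # 1GbE
fast_ethernet = link_mbit_to_mbyte_per_s(100) # 100Mb/s

print(f"observed:        {observed:.1f} MB/s")
print(f"healthy (low):   {healthy_low:.1f} MB/s")
print(f"1GbE line rate:  {gigabit:.1f} MB/s")
print(f"100Mb/s rate:    {fast_ethernet:.1f} MB/s")
```

The observed figure lands only a little above the 100Mb/s line rate, which is what makes a slow link (or a heavily shared one) a plausible guess.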

sega

@george1421 At that location we have 10 PCs, but all of them are connected to one gigabit port. So when all are deploying at the same time, the rate decreases. If just one is deploying, the rate is around 5.7 GB/min.
But you are saying that the problem isn’t in FOG but rather outside it, like the network connection?

fogcloud

@sega Have you used FOG in the past, and if so, did you encounter this problem back then? It would be helpful to know whether this problem has only occurred recently or whether it has always been this way since you started using FOG. If something did change, we need to narrow down what it could be.

When a computer gets “stuck,” does the data block percentage change at all? Have you tried leaving a computer to see if it finishes the imaging process even if it’s slow?

Have you tried plugging the computers into the same network switch that the FOG server is connected to? That would eliminate any other bottlenecks in network speed on your network.

george1421

So when all are deploying at the same time, the rate decreases.

Ah, you are deploying more than one image in a push. Just be aware that you can saturate a single 1GbE link with 2 to 3 simultaneous unicast image deployments. If you need to deploy 10 at a time, then I would look at using a multicast image deployment. This sends out only 1 image stream, with all 10 clients receiving that same stream. It’s much easier on the network.

Also, if you want to do this with unicast, upgrade your FOG server network link to 10GbE. If that isn’t possible, add 2-3 network ports connected in a network LAG (trunked) configuration. With multiple unicast deployments a LAG group will surely help.
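
To put rough numbers on the unicast versus multicast difference, here is a minimal sketch. It is an idealized model, not a measurement: it assumes the shared 1GbE link is the only bottleneck, that concurrent unicast clients split it evenly, and that a multicast stream reaches every client at the full link rate. The 20GB image size and the function names are hypothetical.

```python
# Idealized model of a shared 1GbE link. All figures are assumptions for illustration.

LINK_MB_PER_S = 125.0   # nominal 1GbE line rate
IMAGE_MB = 20_000.0     # hypothetical compressed image size (20GB)

def unicast_share(link_mb_per_s: float, clients: int) -> float:
    """Per-client rate when `clients` unicast streams split the link evenly."""
    return link_mb_per_s / clients

def transfer_minutes(image_mb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to move `image_mb` at a constant `rate_mb_per_s`."""
    return image_mb / rate_mb_per_s / 60.0

for clients in (1, 2, 3, 10):
    rate = unicast_share(LINK_MB_PER_S, clients)
    minutes = transfer_minutes(IMAGE_MB, rate)
    print(f"unicast x{clients:>2}: {rate:6.1f} MB/s each, about {minutes:5.1f} min per client")

# Multicast sends one stream regardless of the number of receivers.
multicast_minutes = transfer_minutes(IMAGE_MB, LINK_MB_PER_S)
print(f"multicast    : {LINK_MB_PER_S:6.1f} MB/s shared stream, about {multicast_minutes:5.1f} min")
```

In this model, ten unicast clients on one gigabit link get 12.5MB/s each, which is the same figure as a single 100Mb/s link. Real deployments will be slower than the model because of disk speed, decompression, and protocol overhead.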

sega

@george1421 The 10 PCs all get the same image, but through the normal deployment. Would it still be better to use multicast deployment in that case?

Edit: I tried multicast for one group now. They start the Partclone part but don’t start the deployment. From what I have seen, the FOG server and the client PCs have to be in the same VLAN for multicast? That’s not our case; they are in different ones.

george1421

@sega said in Deployment stuck at x percentage:

From what I have seen, the FOG server and the client PCs have to be in the same VLAN for multicast? That’s not our case; they are in different ones.

Generally the FOG server and target computers should be on the same VLAN for multicasting to work. It can work if you have an igmp-proxy server set up on your router between the VLANs. Multicast traffic doesn’t normally pass through an ordinary router.

But now you’ve introduced another choke point in your network: your router must be able to handle the load of these 10 systems being imaged simultaneously. As I said, you can flood a 1GbE link with 2-3 simultaneous unicast images. Also, on your FOG server, what media is the image stored on, SSD or HDD? If HDD, how many spindles are being used? That will have an impact on performance too.

sega

@george1421
We have the following structure:

FOG Server on office VLAN
3 rooms with separate VLANs each (because they have different rights in the network, but all can access the FOG server)
The FOG server is running on a VM, but on SSD, if I’m correct
All are linked with Aruba managed switches

Somewhere I read that it’s possible to use different VLANs if the UDP stream is forwarded.
The weird thing is: we didn’t have these problems in the past… so yeah, something in the network could be the problem.

sega

@george1421

The current status: FOG and the client PCs are in different VLANs. We have managed switches where the streams should be forwarded (we don’t manage them ourselves). We started a multicast session for 6 PCs. They all get to the Partclone window, but don’t start the deployment process.

On the active tasks tab I have 6 tasks under “Active Tasks” (one for each PC) and one under “Active Multicast Tasks” for the whole group. The state is “In-Progress” and the status is “0”.

Edit: I somehow can’t post the logs because they are marked as spam.

Tom Elliott

@sega if you want more “speedy” unicast, I would suggest setting your storage node’s queue limit to 2 or 3.

I know this sounds low and all, but it’d be better to have 10 systems imaged in 30 - 45 minutes than to have 10 systems imaged in 2-3 hours.

Unless you can get multicast to work, I think this is your best approach. Another method, though unrealistic, would be to set up multiple storage nodes in the same group (each with only 2 or 3 per VLAN) and possibly use the location plugin to assign the systems in each VLAN to their respective VLAN IP groups.

Ultimately keeping things as they currently are is not going to magically work the next time you try again.

Tom Elliott

@sega if you can post the logs to Pastebin and send the link in the forums, that may work as well.

sega

@Tom-Elliott

Reducing the limit is a good idea. But our priority will be to get multicast to work. I’m pretty sure the reason it’s not working yet is somewhere in our network structure.
Since we have to do this multiple times a week, reducing the time would be great.

And here are the logs: https://pastebin.com/JQadkbZ0

george1421

@sega Multicasting should work correctly if the FOG server and target computers are on the same subnet. You said that may not be the case. For multicasting to work across subnets, either your router has to support forwarding multicast packets, or your router or some other device that has access to all of the VLANs needs to be running an igmp-proxy. This is akin to what a dhcp-helper/relay is to DHCP. You should see exactly what you are seeing if the network doesn’t support multicasting. The FOS engine is started using unicast messaging; the Partclone part is where the multicast stream really starts moving data.
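
One way to check whether multicast packets actually cross between the VLANs, independent of FOG’s own tooling, is a small probe like the sketch below. It uses only the Python standard library. The group address `239.255.42.99` and port `50000` are arbitrary test values chosen for illustration, not anything FOG uses, and the TTL of 4 is an assumption (it must be greater than 1 for a packet to pass a router).

```python
import socket
import sys

GROUP = "239.255.42.99"  # hypothetical test group in the administratively scoped range
PORT = 50000             # hypothetical test port

def membership_request(group: str, iface_ip: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq value: group address followed by local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)

def receive_once(group: str = GROUP, port: int = PORT, timeout: float = 10.0):
    """Join the group and wait for one datagram; return (data, sender) or None on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        # Joining the group makes the kernel send an IGMP membership report.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
        sock.settimeout(timeout)
        return sock.recvfrom(1024)
    except socket.timeout:
        return None
    finally:
        sock.close()

def send_probe(group: str = GROUP, port: int = PORT, ttl: int = 4) -> int:
    """Send one small datagram to the group; return the number of bytes sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        return sock.sendto(b"multicast-probe", (group, port))
    finally:
        sock.close()

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "send":
        print("sent bytes:", send_probe())
    elif len(sys.argv) > 1 and sys.argv[1] == "recv":
        print("received:", receive_once())
```

Run it with `recv` on a machine in a client VLAN, then with `send` on a machine in the server VLAN. If the receiver times out while the same test works inside a single VLAN, the routing or IGMP setup between the VLANs is the likely culprit.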

sega

@george1421

Yeah, it’s not possible to have the FOG server and the target PCs on the same subnet. Our switches have the IGMP option active. And I see a warning on the switches that they get a v3 query but the device is only configured for IGMP v2. But I have to look into that with our service provider.

sega

@george1421 @Tom-Elliott

We just saw that our switches can’t process v3… Is it possible to change a setting so that FOG uses IGMP v2?

Tom Elliott

@sega https://www.udpcast.linux.lu/hints.html

I think this may tell you more?

We don’t control which “version” your switches can or cannot allow.
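
For reference, the IGMP version a Linux host uses is a kernel setting rather than a FOG setting. The kernel exposes it as the `force_igmp_version` sysctl under `net.ipv4.conf.<interface>`. The sketch below only reads the current value; the helper names are made up for illustration. Whether forcing version 2 on the FOG server would clear the switch warning is an untested assumption, since that warning is about a v3 query, and queries normally come from a querier (such as a router) rather than from the sending host.

```python
from pathlib import Path

# Meaning of the values as documented for the Linux force_igmp_version sysctl.
IGMP_MEANING = {
    0: "no enforcement (kernel default)",
    1: "force IGMPv1",
    2: "force IGMPv2",
    3: "force IGMPv3",
}

def sysctl_path(iface: str = "all", root: str = "/proc/sys") -> Path:
    """Location of the force_igmp_version setting for one interface (or 'all')."""
    return Path(root) / "net" / "ipv4" / "conf" / iface / "force_igmp_version"

def read_forced_igmp_version(iface: str = "all", root: str = "/proc/sys"):
    """Return the configured value as an int, or None if it cannot be read."""
    try:
        return int(sysctl_path(iface, root).read_text().strip())
    except (OSError, ValueError):
        return None

def describe(value) -> str:
    """Human-readable description of a force_igmp_version value."""
    if value is None:
        return "unavailable (not Linux, or the setting could not be read)"
    return IGMP_MEANING.get(value, f"unknown value {value}")

if __name__ == "__main__":
    print("force_igmp_version:", describe(read_forced_igmp_version()))
```

Changing the value requires root and is best discussed with whoever manages the switches and routers first.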
