Deploy (Unicast): Wrong number of clients per node.

mp12

@Tom-Elliott maybe a delay between the WOL packets would be enough? I think 10 seconds should do, and it would also protect the electrical fuse against overload.
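A minimal sketch of what such a staggered wake-up could look like, assuming the hosts are woken with standard Wake-on-LAN magic packets sent as UDP broadcasts. This is not FOG's own code: the function names, the broadcast address and port 9 are illustrative assumptions, and the 10 second default is just the value suggested above.

```python
import socket
import time


def magic_packet(mac: str) -> bytes:
    """Build a standard WOL magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def wake_staggered(macs, delay_seconds=10.0, send=None, sleep=time.sleep):
    """Wake hosts one at a time, pausing between packets.

    `send` and `sleep` can be replaced (for example in a dry run); by default
    packets go out as UDP broadcasts on port 9 (an assumed, commonly used port).
    """
    sock = None
    if send is None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)

        def send(packet):
            sock.sendto(packet, ("255.255.255.255", 9))

    try:
        for index, mac in enumerate(macs):
            if index:  # no pause before the first host
                sleep(delay_seconds)
            send(magic_packet(mac))
    finally:
        if sock is not None:
            sock.close()
```

Passing a list's `append` as `send` and as `sleep` gives a dry run that records the packets and pauses without touching the network.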

Wayne Workman

@Tom-Elliott The only way a queue would accidentally overfill is if the clients were checking for themselves, seeing that a slot is open (when two check at once), and going ahead and starting themselves, instead of the server evaluating each request and saying yes or no to it.

Sounds like it takes roughly 7 seconds from a client thinking it’s good to go to the queue count being updated.
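To illustrate the difference, here is a rough sketch of the "server says yes or no" variant, where the check and the increment happen in one locked step so two simultaneous requests cannot both take the last slot. The class and function names are hypothetical and this is not how FOG is implemented; it only shows the idea.

```python
import threading


class NodeSlots:
    """Hypothetical server-side slot counter for one storage node."""

    def __init__(self, max_clients: int):
        self.max_clients = max_clients
        self.in_use = 0
        self._lock = threading.Lock()

    def try_claim(self) -> bool:
        """Check and increment under one lock; returns True if a slot was granted."""
        with self._lock:
            if self.in_use < self.max_clients:
                self.in_use += 1
                return True
            return False

    def release(self) -> None:
        """Give a slot back when a client finishes imaging."""
        with self._lock:
            if self.in_use > 0:
                self.in_use -= 1


def simulate(max_clients: int, clients: int):
    """Let `clients` threads ask for a slot at the same moment.

    Returns (started, queued).
    """
    node = NodeSlots(max_clients)
    results = []
    barrier = threading.Barrier(clients)  # release all threads together

    def boot():
        barrier.wait()
        results.append(node.try_claim())

    threads = [threading.Thread(target=boot) for _ in range(clients)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results.count(True), results.count(False)
```

With six clients and three slots, `simulate(3, 6)` always grants exactly three, however the threads interleave. A client-side "look, then start" check has no such guarantee.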

Tom Elliott

I believe this is now fixed, but confirmation would be good. I’m marking this as solved, but it can be unsolved again if it is still not working.

mp12

@Tom-Elliott

FOG r6439

Still having problems.

node1 max client: 3
node2 max client: 3
number of clients in the deployed group: 6

--> node1 deploys five clients.
--> node2 deploys zero clients.
--> one client is queued.

Tom Elliott

@mp12 What kind of tasks are you running? Are they ALL deploys?

mp12

@Tom-Elliott

I am running an instant group deploy (basic tasks). The group has six clients.

Tom Elliott

@mp12 They are not multicast?

mp12

@Tom-Elliott

No multicast!
Just a normal deploy “type=1”

Tom Elliott

@mp12 And you’re 100% sure you’re running 6439? I ask because I tested this same type of thing quite a lot yesterday.

Granted, that was with only two hosts, but I started both systems at the same time. One won the battle and the other was pushed to the back of the queue.

mp12

@Tom-Elliott

Yes 100% sure. Just did an upgrade when I read your post from yesterday.
I will test with two and four clients tomorrow.

Tom Elliott

@mp12 do all 5 systems actually start receiving the image, or do only three receive the image while the other 2 wait in line?

See, splitting between multiple nodes is not a straightforward thing. You can still queue many systems to receive an image.

At the time the systems are booting, they’re not magically going to switch between using different nodes. The reason is that the clients haven’t started doing anything yet. From the client’s perspective (when it’s booting up), it sees the same node as the optimal node it needs to use. The “load” isn’t even calculated until the first system checks in. If 5 systems boot up and decide to use the same optimal node, there’s nothing I can really do about it.

mp12

@Tom-Elliott said:

@mp12 do all 5 systems actually start receiving the image, or do only three receive the image while the other 2 wait in line?

They all received the image at the same time. Yesterday I ran 4 clients on node1. No split between the two nodes.

See, splitting between multiple nodes is not a straightforward thing. You can still queue many systems to receive an image.

I see, splitting is not as easy as I thought.
I just remembered that in FOG 0.32 I never had problems with the splitting.

Now I am running another test with the same node config as before.

I started four group deploys (unicast), each group with two clients.
There was a delay of a few minutes between the group deploys. I thought this would let the clients find a proper node.
In my “Active Tasks” I see eight tasks (six of them are running and two are queued). So far so good.

When I log into the shell of both nodes I see:

node1: six connected clients receiving images via NFS.
node2: zero clients.

Tom Elliott

@mp12 so both nodes are a part of the same storage group and contain the same images?

mp12

@Tom-Elliott

Yes they do.

When I start single deploys with one client and a delay of 10 seconds between them, everything works fine. Group deploys always run into the splitting error.

Wayne Workman

@mp12 Is this actually causing a problem for you, or are you just trying to help make fog better?

mp12

@Wayne-Workman first of all I am very thankful.

I have been using FOG for six or seven years now. It’s a wonderful piece of software. Sure, I want to help make FOG better. That’s why I am testing all these different configurations.

My only problem is that imaging now takes twice as long.
I can handle this, but I think correct splitting would be great for the whole community.

Tom Elliott

@mp12 what counts as correct splitting?

See, the way the split occurs is based on client load. Load is calculated from the number of queued and used tasks happening on a node. The problem is that when multiple clients check in at exactly the same time, they have not started to queue up yet. Because of this, when the system boots it finds the optimal node. That optimal node doesn’t know anything at boot time about who is using it, so splitting isn’t really viable at that point. I have a mechanism I could add to make it do this, but it seems a bit off kilter.
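The effect described above can be reproduced with a small simulation. This is only a sketch: the `Node` class, the tie-breaking rule (first node in the list wins) and dividing the task count by `max_clients` are assumptions made for illustration, not FOG's actual load formula.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    max_clients: int
    queued: int = 0
    used: int = 0

    @property
    def load(self) -> float:
        # Assumed formula: queued plus used tasks, relative to the node's capacity.
        return (self.queued + self.used) / self.max_clients


def optimal_node(nodes):
    """Lowest load wins; on a tie, min() keeps the first node in the list."""
    return min(nodes, key=lambda node: node.load)


def boot_all_at_once(nodes, clients):
    """Every client picks from the same snapshot, before any task is counted."""
    return [optimal_node(nodes).name for _ in range(clients)]


def boot_one_by_one(nodes, clients):
    """Each client is counted on its node before the next client picks."""
    picks = []
    for _ in range(clients):
        node = optimal_node(nodes)
        node.used += 1
        picks.append(node.name)
    return picks
```

With two nodes of three slots each and six clients, `boot_all_at_once` sends all six to node1, because the load is still zero everywhere when they choose. `boot_one_by_one` alternates between the nodes and ends up with three clients on each.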

mp12

@Tom-Elliott for me, correct splitting (using more than one node) would be: the client connects to a node and the node checks whether “max_clients” is exceeded. The node then only replies to its configured number of “max_clients”. The rest get queued.

Maybe I will give multicast another shot.

Wayne Workman

@mp12 The issue is that the client is what determines whether “max_clients” is met or not. If two or more check at the same time, there appears to be a 10 second window where too many might start, or perhaps too many get queued when another node is available.

However, once max_clients is met, it makes no sense that other clients wait in line when there’s another node with empty slots.

@Tom-Elliott I think maybe the inits should evaluate all possible nodes, within the constraints of which nodes have the image available and the location plugin constraints. I think the issue is that the inits are only checking one node instead of all of them.
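Roughly what I mean, as a sketch: filter the nodes down to those that hold the image and match the location, skip any that are already full, and only queue the client when nothing is left. All names here (`Node`, `pick_node`, `deploy`, the `location` field) are made up for the example and are not taken from the FOG inits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    max_clients: int
    location: str
    images: set = field(default_factory=set)
    used: int = 0


def pick_node(nodes, image, location=None) -> Optional[Node]:
    """Return the least-loaded node that has the image, matches the location
    (if one is given) and still has a free slot; None means 'stay queued'."""
    candidates = [
        node for node in nodes
        if image in node.images
        and (location is None or node.location == location)
        and node.used < node.max_clients
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda node: node.used / node.max_clients)


def deploy(nodes, image, clients, location=None):
    """Place clients one after another; returns ({node name: count}, queued)."""
    placed, queued = {}, 0
    for _ in range(clients):
        node = pick_node(nodes, image, location)
        if node is None:
            queued += 1
        else:
            node.used += 1
            placed[node.name] = placed.get(node.name, 0) + 1
    return placed, queued
```

With two nodes of three slots each that both hold the image, seven clients end up as three on node1, three on node2 and one queued.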

mp12

@Wayne-Workman said:
I am working with the location plugin now. The splitting works fine.
