Deploy (Unicast): Wrong number of clients per node.

Tom Elliott

I believe this is now fixed, but confirmation would be good. I’m marking this as solved, but it can be reopened if it is still (or again) not working.

mp12

@Tom-Elliott

FOG r6439

Still having problems.

node1 max client: 3
node2 max client: 3
number of clients in the deployed group: 6

–> node1 deploys five clients.
–> node2 deploys zero clients.
–> one client is queued.

Tom Elliott

@mp12 What kind of tasks are you running? Are they ALL deploys?

mp12

@Tom-Elliott

I am running an instant group deploy (basic tasks). The group has six clients.

Tom Elliott

@mp12 They are not multicast?

mp12

@Tom-Elliott

No multicast!
Just a normal deploy “type=1”

Tom Elliott

@mp12 And you’re 100% sure you’re running 6439? I ask because I tested this same type of thing quite a lot yesterday.

Granted with only two hosts, but I started both systems at the same time. One won in the battle and the other was pushed to the back.

mp12

@Tom-Elliott

Yes 100% sure. Just did an upgrade when I read your post from yesterday.
I will test with two and four clients tomorrow.

Tom Elliott

@mp12 do all 5 systems actually start receiving the image, or does only three receive the image, and the other 2 wait in line?

See, splitting between multiple nodes is not a straightforward thing. You can still queue many systems to receive an image.

At the time the systems are booting, they’re not magically going to switch between different nodes, because the clients haven’t started doing anything yet. From the client’s perspective (when it’s booting up) it sees the same node as the optimal node it needs to use. The “load” isn’t even calculated until the first system checks in. If 5 systems boot up and decide to use the same optimal node, there’s nothing I can really do about it.
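To illustrate the timing problem described above, here is a minimal Python sketch. It is a toy model, not FOG's actual code: the function names, the load dictionary, and the tie-breaking rule are all assumptions made for illustration. It only shows why clients that boot together, before anyone has checked in, can all land on the same node.

```python
# Toy model only; names and tie-breaking are hypothetical, not FOG source.

def pick_optimal_node(loads):
    """Return the node with the lowest known load (ties go to the first name)."""
    return min(sorted(loads), key=lambda name: loads[name])


def boot_all_at_once(clients, loads):
    """Every client decides from the same snapshot, before any check-in."""
    snapshot = dict(loads)  # nobody has checked in, so nothing changes
    return {client: pick_optimal_node(snapshot) for client in clients}


loads = {"node1": 0, "node2": 0}
clients = [f"client{i}" for i in range(1, 6)]
assignment = boot_all_at_once(clients, loads)

for client, node in assignment.items():
    print(client, "->", node)
```

Because the load seen by each client is identical, every client makes the same choice, which matches the "node1 gets everything, node2 gets nothing" pattern reported in this thread.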

mp12

@Tom-Elliott said:

@mp12 do all 5 systems actually start receiving the image, or does only three receive the image, and the other 2 wait in line?

They all received the image at the same time. Yesterday I ran 4 clients on node1. No split between both nodes.

See, splitting between multiple nodes is not a straightforward thing. You can still queue many systems to receive an image.

I see splitting is not as easy as I thought.
I just remembered that in FOG 0.32 I never had problems with the splitting.

Now I am running another test with the same node config as before.

I started four group deploys (unicast), where each group has two clients.
Between the group deploys there was a delay of a few minutes. I thought that the clients would now be able to find a proper node.
In my “Active Tasks” I see eight tasks (six of them are running and two are queued). So far so good.

When I log into the shell of both nodes I see:

node1: six connected clients receiving images via NFS.
node2: zero clients.

Tom Elliott

@mp12 so both nodes are a part of the same storage group and contain the same images?

mp12

@Tom-Elliott

Yes they do.

When I start single deploys with one client and a delay of 10 seconds between, everything works fine. Group deploys won’t work without the splitting error.

Wayne Workman

@mp12 Is this actually causing a problem for you, or are you just trying to help make fog better?

mp12

@Wayne-Workman first of all I am very thankful.

I have been using FOG for six or seven years now. It’s a wonderful piece of software. Sure, I want to help make FOG better. That’s why I am testing all these different configurations.

My only problem is that the imaging now takes twice the time.
I can handle this but I think a correct splitting would be great for the whole community.

Tom Elliott

@mp12 what equates as a correct splitting?

See, the way the split occurs is based on client load. Load is calculated from the number of queued and in-use tasks on a node. The problem is that when multiple clients check in at exactly the same time, they have not started to queue up yet. Because of this, when a system boots it finds the optimal node. That optimal node doesn’t know anything at boot time about who is using it, so splitting isn’t really viable at that point. I have a mechanism I could add to make it do this, but it seems a bit off-kilter.
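As a rough illustration of load-based splitting when check-ins do not collide, here is a small Python sketch. It assumes load is simply the sum of queued and in-use tasks on a node; the `Node` class and `check_in` function are hypothetical and are not taken from FOG. Each client checks in one at a time, so the load is updated before the next client chooses.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a storage node; not FOG's data structure.
    name: str
    max_clients: int
    in_use: int = 0
    queued: int = 0

    @property
    def load(self):
        # Assumed formula: queued plus in-use tasks on this node.
        return self.in_use + self.queued


def check_in(nodes):
    """One client checks in: take the least-loaded node, then record the task."""
    node = min(nodes, key=lambda n: n.load)
    if node.in_use < node.max_clients:
        node.in_use += 1
    else:
        node.queued += 1
    return node.name


nodes = [Node("node1", 3), Node("node2", 3)]
order = [check_in(nodes) for _ in range(6)]
print(order)
```

With staggered check-ins the six clients alternate between the two nodes and end up three on each, which is consistent with the observation earlier in the thread that single deploys started about 10 seconds apart split correctly.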

mp12

@Tom-Elliott for me, a correct splitting (using more than one node) would be: the client connects to a node and the node checks whether “max_clients” is exceeded. The node then only replies to its configured number of “max_clients”. The rest get queued.
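The node-side check described above could look roughly like the following Python sketch. This is only an illustration of the idea, not how FOG works: the `NodeGate` class and its methods are invented for this example.

```python
from collections import deque


class NodeGate:
    """Hypothetical node-side admission check; not part of FOG."""

    def __init__(self, max_clients):
        self.max_clients = max_clients
        self.active = set()
        self.waiting = deque()

    def request_slot(self, client):
        """Admit the client if a slot is free, otherwise queue it."""
        if len(self.active) < self.max_clients:
            self.active.add(client)
            return True
        self.waiting.append(client)
        return False

    def release_slot(self, client):
        """Free a slot and promote the longest-waiting client, if any."""
        self.active.discard(client)
        if self.waiting and len(self.active) < self.max_clients:
            self.active.add(self.waiting.popleft())


gate = NodeGate(max_clients=3)
results = [gate.request_slot(f"client{i}") for i in range(1, 6)]
print(results)
```

Here the decision is made in one place (the node), so simultaneous requests cannot all pass the check the way they can when each client decides for itself.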

Maybe I will give multicast another shot.

Wayne Workman

@mp12 The issue is that the client is what determines if the “max_clients” is met or not. If two or more check at the same time, there appears to be a 10 second window where too many might start, or perhaps too many queued when there was another node available.

However, once the max_clients is met, it makes no sense why other clients wait in line when there’s another node with empty slots.

@Tom-Elliott I think maybe the inits should evaluate all possible nodes - within the constraints of which have the image available and the location plugin constraints. I think the issue is the inits are only checking one instead of all nodes.
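A sketch of what "evaluate all possible nodes" might mean, in Python. Everything here is assumed for illustration: the dictionary fields, the `choose_node` function, and the image name `win10` are made up, and this is not how the inits or the FOG server actually behave.

```python
def choose_node(nodes, image, location=None):
    """Pick a node with a free slot among those that hold the image.

    Returns the node name, or None if every eligible node is full
    (in which case the caller would queue the client).
    """
    eligible = [
        n for n in nodes
        if image in n["images"]
        and (location is None or n["location"] == location)
    ]
    with_room = [n for n in eligible if n["in_use"] < n["max_clients"]]
    if not with_room:
        return None
    return min(with_room, key=lambda n: n["in_use"])["name"]


# Hypothetical state: node1 is full, node2 is idle.
nodes = [
    {"name": "node1", "images": {"win10"}, "location": "lab",
     "max_clients": 3, "in_use": 3},
    {"name": "node2", "images": {"win10"}, "location": "lab",
     "max_clients": 3, "in_use": 0},
]
choice = choose_node(nodes, "win10")
print(choice)
```

In this model a client would only wait in line when every node that has the image, and is allowed by its location, is already at its limit.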

mp12

@Wayne-Workman said:
I am working with the location plugin now. The splitting works fine.

Tom Elliott

@Wayne-Workman the inits determine nothing. All the node information is passed to the client when it first boots. The only other time things are checked is when the client checks in. It does not move between nodes at check-in; it only tries to operate within the node the init was given during its initial boot. Moving it to a different node is possible but not very simple, as you have to change variables around and verify many things during that process. Add to that the fact that we pull in the information sent when the client booted, to allow separate sessions to operate, and it makes things that much more difficult.

Wayne Workman

@Tom-Elliott I stand corrected, then. Thanks for explaining this.
