@AndrewG78 I’ll try to answer all the things you brought up. But first let me say that so far it hasn’t been clear (from my point of view) what happened on which FOG server. Replication involves at least two parties (servers), and it’s important for me to understand which one showed the issue. I will come back to that point later.
Although replication services are disabled, there is still some replication done between storage groups.
Disabled on which server? All FOG servers?
1 a) The question is, was this a proper behaviour?
I thought replication is done only within the storage group members(nodes).
I didn’t write the replication algorithm, so I don’t know it as well as Tom does. But from reading the docs I get the impression that this is expected behaviour: https://wiki.fogproject.org/wiki/index.php?title=Replication
6. If the node currently checking is the "primary master group" for the data it's working, it will attempt replicating its data to the master of each of the other groups the data is assigned under.
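To make that rule a bit more concrete, here is a toy model in Python. It is only my reading of the wiki sentence above, not FOG's actual code: the function `cross_group_targets`, the dictionary layout and the names `fog-a`, `fog-b`, `GroupA` and `GroupB` are all made up for illustration.

```python
def cross_group_targets(checking_node, image, groups):
    """Return the master nodes of the other groups that `checking_node`
    would try to replicate `image` to, following the wiki rule.

    Hypothetical data model (not FOG's real schema):
      groups: {group_name: {"master": node_name}}
      image:  {"primary_group": group_name, "groups": [group_name, ...]}
    """
    primary = image["primary_group"]
    # Only the master of the image's primary group pushes across groups.
    if groups[primary]["master"] != checking_node:
        return []
    # Targets are the masters of every other group the image is assigned to.
    return [groups[g]["master"] for g in image["groups"] if g != primary]


groups = {
    "GroupA": {"master": "fog-a"},
    "GroupB": {"master": "fog-b"},
}
image = {"primary_group": "GroupA", "groups": ["GroupA", "GroupB"]}

print(cross_group_targets("fog-a", image, groups))  # ['fog-b']
print(cross_group_targets("fog-b", image, groups))  # []
```

In this model, an image assigned to two storage groups is pushed from the master of its primary group to the master of the other group, which would match replication happening between storage groups.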
1 b) Are there any other services that could do this replication?
You have two nodes and both have replication services running on them!
The high cpu load(kworker and vsftpd) was related to replication and lack of disk space. Replication processes did not stop even if there was 0% of free space.
I think this is a bug.
High vsftpd load points to what I would call the receiving node in this setup. This might give you an idea which node was causing this. Disks can run out of space for many different reasons. I don’t see why our replication service should constantly check free space and stop replicating just because the disk is nearly full. Every server needs working disk space monitoring that warns the sysadmin in time. Look at it from this side: if we add a check and simply stop replicating when disk space runs low, people who don’t monitor their disk space might not notice for months and then blame us for replication not working. Hitting a full disk is not nice, but it will eventually cause trouble and make the sleeping sysadmin aware.
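As a rough idea of what such monitoring could look like, here is a minimal Python sketch that could be run from cron. It is not part of FOG; the function names and the 10% threshold are my own assumptions, and a real setup would more likely use an existing monitoring tool.

```python
import shutil


def free_space_percent(path):
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def is_low_on_space(path, min_free_percent=10.0):
    """True if free space on `path` is below the (assumed) threshold."""
    return free_space_percent(path) < min_free_percent


# "/" is used so the example runs anywhere; point it at your image store
# (for example /images, if that is where your images live) instead.
path = "/"
if is_low_on_space(path):
    print(f"WARNING: only {free_space_percent(path):.1f}% free on {path}")
else:
    print(f"OK: {free_space_percent(path):.1f}% free on {path}")
```

Hooking the warning up to mail or whatever alerting you already have is what gets the sysadmin's attention before the disk is actually full.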
3 a) Should there be some smarter log rotation ?
This is also something a sysadmin should be able to handle. Linux has logrotate and I don’t see why we should reinvent it.
3 b) "No new tasks found" is logged every 10s - Can we change this time somehow?
Yes, web UI -> FOG Configuration -> FOG Settings -> FOG Linux Service Sleep Times -> MULTICASTSLEEPTIME
Sorry if my answers sound a bit impolite. I don’t mean it that way! Just wanted to show you that things can be seen from the other side as well.