SOLVED Replication runaway on one storage node

  • Hi all

    apologies first of all - I’ve posted so many problems and questions lately I’m beginning to feel like a spammer 🙂

    Thanks to lots of help from Tom and co I have a fully functional FOG Server once more, running on latest trunk. That one is now the single master node for our head office. What used to be the head office storage node I recently rebuilt (due to issues with the storage node installer) as a master node serving our remote sites.

    I have set up the remote sites as storage nodes within this master server and limited the bandwidth on each storage node to 500kbps to avoid saturating the links.

    There is only a single image on this server of 6.9GB which has already been replicated successfully to all 8 remote storage nodes.

    I noticed this evening on our PRTG server that there was really high network utilisation from the remote master node to a single storage node which is eating up 50% of our head office link. Given that the image is already on the storage node should logic not kick in to stop replication from overwriting the image again? I’m assuming that’s what’s happening.

    I bounced the master server with no change, then did the same on the node, and that seems to have done the trick. Curious to know how the replication code works so I can more accurately troubleshoot should this happen again. And how robust is everyone finding it to be?

    cheers, Kiweegie.

  • Image is still copying this morning but is being accurately limited to the 500kbps setting set on the storage node. Not sure what the issue was but suspect it may even have been erroneous reporting within our monitoring tool PRTG. Marking as resolved.

    cheers, Kiweegie.
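
For a rough sense of how long one full copy should take at that limit, here is a back-of-envelope sketch. It assumes the 500kbps setting means kilobits per second and that 6.9GB means 6.9 x 10^9 bytes; neither is confirmed in this thread, and protocol overhead is ignored.

```python
def transfer_hours(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Hours needed to move size_bytes at a steady rate, ignoring overhead."""
    return size_bytes * 8 / rate_bits_per_sec / 3600


IMAGE_BYTES = 6.9e9  # assumption: GB = 10**9 bytes
LIMIT_BPS = 500e3    # assumption: kbps = 1000 bits per second

print(f"{transfer_hours(IMAGE_BYTES, LIMIT_BPS):.1f} hours")  # prints "30.7 hours"
```

Under those assumptions a single pass takes about 30 hours, so a copy that is still running the next morning would be expected for a fresh transfer. If the limit were actually in kilobytes per second, the same sum comes out at a little under 4 hours.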

  • FOG Storage node now rebuilt and iftop on the master node shows traffic to the storage node sitting at around 520Kb. I’ll check again in the morning to make sure once image has transferred in full the traffic dies off and doesn’t keep firing packets over.

    regards Kiweegie.

  • @Tom-Elliott @Wayne-Workman

    Well the server was still hogging a boat-load of bandwidth so I’m in the process of rebuilding it from scratch. Should know in about an hour or so if the bandwidth issue is sorted.

    regards Kiweegie.

  • I want to work on the replication stuff - I want it to be hash based.
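
As a minimal sketch of what a hash-based check could look like (this is not FOG's code, and the function names are made up for illustration), the idea is to copy only when the node's file is missing or its hash differs from the master's. Both files are read locally here to keep the example self-contained; in a real setup the node would presumably compute its own hash and report it back.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a large image is never loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_replication(master_file: Path, node_file: Path) -> bool:
    """Hypothetical check: copy only if the node's file is missing or differs."""
    if not node_file.exists():
        return True
    return file_sha256(master_file) != file_sha256(node_file)
```

Compared with checking only that the file exists, a hash also catches a copy that was damaged in transit or on disk, at the cost of reading the whole image on both sides each time it is checked.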

  • @Tom-Elliott Thanks Tom - checked this morning when I came into work and while the bandwidth wasn’t spiking above the 500Kbps limit it was still copying to the same storage node even though the image was there already. Server and disk appear ok so I’m running with the theory (courtesy of your suggestion) that the image on the node was corrupt. I’ve binned that and am letting replication do a clean upload, and will check how that goes.

  • Senior Developer

    I’m half tempted to find out first if there was another problem altogether. I say this because there is checking in place. So the only way I can think of for this to happen, assuming the actual replication service is fine based on the other nodes receiving the image, is that the image was transferred but was corrupted at the remote site. So every cycle would cause it to try to replace the file. Maybe the HDD on the other side was having an issue? Just thinking.
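
To illustrate the theory above, here is a toy simulation, not FOG's actual code. Each cycle compares the master's copy with the node's and re-sends on a mismatch. The `Node` class, `run_cycles` function and the plain byte comparison are all assumptions made for the example; what the real service compares is not stated in this thread.

```python
class Node:
    """Hypothetical storage node; a faulty disk damages whatever is written."""

    def __init__(self, faulty_disk: bool = False):
        self.faulty_disk = faulty_disk
        self.stored = None

    def write(self, data: bytes) -> None:
        # A faulty disk silently drops the last byte of what it stores.
        self.stored = data[:-1] if self.faulty_disk else data


def run_cycles(image: bytes, node: Node, cycles: int) -> int:
    """Return how many times the image was sent over the given number of cycles."""
    transfers = 0
    for _ in range(cycles):
        if node.stored != image:  # the check made before each replication pass
            node.write(image)
            transfers += 1
    return transfers


image = b"disk image contents"
print(run_cycles(image, Node(), cycles=5))                  # prints 1
print(run_cycles(image, Node(faulty_disk=True), cycles=5))  # prints 5
```

A healthy node receives the image once and then passes the check on every later cycle. A node that damages each write never passes, so the full image goes over the link every cycle, which would look like constant traffic to that one node.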