File size/hash mismatch - Only on one storage node replicating nonstop



  • Our deployment of FOG has gone fairly smoothly aside from one weird issue. Our 1st and 2nd remote nodes have no issues and only replicate when changes are made. On the 3rd node, the FOG replication service always reports every single file in every image as a file size mismatch, deletes it, and copies a new one. It does this endlessly, which is a bit problematic over a WAN. The weird thing is that the images are copying correctly. After the FTP job finishes, I checked the md5sum of every file and they all match the master. If I disable replication, the images work correctly, so I know there is nothing wrong with them.

    In the fogreplicator.log on the master I keep seeing something like this on all the files at that node:

    [01-23-20 10:33:19 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1.fixed_size_partitions: 9 !=
    [01-23-20 10:33:19 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1.fixed_size_partitions
    [01-23-20 10:33:20 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1.mbr: 1048576 !=
    [01-23-20 10:33:20 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1.mbr
    [01-23-20 10:33:20 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1.minimum.partitions: 793 !=
    [01-23-20 10:33:20 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1.minimum.partitions
    [01-23-20 10:33:20 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1.original.fstypes: 30 !=
    [01-23-20 10:33:20 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1.original.fstypes
    [01-23-20 10:33:21 am]   # Win10_1903_64bit_Nov2019: File hash mismatch - d1.original.swapuuids: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
    [01-23-20 10:33:21 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1.original.swapuuids
    [01-23-20 10:33:21 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1.partitions: 793 !=
    [01-23-20 10:33:21 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1.partitions
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1p1.img: 421826977 !=
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1p1.img
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1p2.img: 13556397 !=
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1p2.img
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1p3.img: 254129 !=
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1p3.img
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: File size mismatch - d1p4.img: 9857003499 !=
    [01-23-20 10:33:22 am]   # Win10_1903_64bit_Nov2019: Deleting remote file d1p4.img
    
    

    Going by some of the other examples of failed hashes I have seen on here, there should be another value after the !=. I assume the missing value is the problem and is why it’s failing.

    Does someone have an idea where I should look to correct this? Thanks.

    Fog version: 1.5.7
    OS on all hosts: CentOS 7


  • Senior Developer

    @Demache Nice that we found this and that you were able to fix it so quickly. When looking through the code I thought about HTTP/HTTPS possibly being an issue but dropped that idea. Now, looking at it again, I think you have found a bug in the code! I just pushed a fix.

    Though I still really wonder why the backup logic of checking size via FTP didn’t work in your case either.



    @Sebastian-Roth Aha! You were right, there was an issue with it communicating over HTTP. It turns out the 3rd node had the HTTP protocol set to HTTPS in .fogsettings for some reason, and that was causing the failure. It must have been because I was following my own documentation blindly: the master node does have HTTPS enforced, and I wrote that documentation after the 1st and 2nd nodes were already set up. Whoops. I’m guessing that function doesn’t work if the remote storage node enforces HTTPS?

    Anyway, I set that back to HTTP in .fogsettings and reran the install script. Turned the service back on again and now I get “no need to sync” as it should on the 3rd node. Perfect!

    Thanks for pointing me in the right direction.


  • Senior Developer

    @Demache What’s the difference between the 1st/2nd storage nodes and the 3rd one? I suppose the WAN tunnel between headquarters and the 3rd location doesn’t allow the communication needed to query the information from the 3rd storage node. For both size and hash, the replication service first uses HTTP or HTTPS (depending on how you installed), and if it doesn’t receive a proper answer it also tries to retrieve the size and hash (hash only for files smaller than 10 MB) via FTP. So if all those protocols are blocked, the replication cannot work.
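
    As a rough sketch of that check order (this is not FOG’s actual code, which is PHP; every function and parameter name below is hypothetical and the probes are plain callables standing in for the real HTTP(S) and FTP queries), the logic might look like this in Python:

```python
from typing import Callable, Optional

# Per the description above, hashes are only fetched over FTP for files under 10 MB.
SMALL_FILE_LIMIT = 10 * 1024 * 1024

Probe = Callable[[], Optional[str]]


def first_answer(*probes: Probe) -> str:
    """Return the first non-empty answer from the probes, or '' if none answers."""
    for probe in probes:
        try:
            answer = probe()
        except OSError:
            # Connection errors count as "no proper answer"; try the next probe.
            answer = None
        if answer:
            return answer
    return ""


def check_file(name: str, local_size: int, local_hash: str,
               http_size: Probe, ftp_size: Probe,
               http_hash: Probe, ftp_hash: Probe) -> str:
    """Compare size first, then hash, asking HTTP(S) before FTP each time."""
    remote_size = first_answer(http_size, ftp_size)
    if str(local_size) != remote_size:
        return f"File size mismatch - {name}: {local_size} != {remote_size}"

    hash_probes = [http_hash]
    if local_size < SMALL_FILE_LIMIT:
        hash_probes.append(ftp_hash)
    remote_hash = first_answer(*hash_probes)
    if local_hash != remote_hash:
        return f"File hash mismatch - {name}: {local_hash} != {remote_hash}"
    return "no need to sync"


def unreachable() -> Optional[str]:
    """Stand-in for a probe whose request never gets through."""
    raise ConnectionError("request blocked")


# Both size probes fail, so the right-hand side of the message stays empty.
print(check_file("d1.mbr", 1048576, "abc",
                 unreachable, lambda: None, unreachable, lambda: None))
```

    In this sketch, when neither probe returns anything the remote value is an empty string, so the message ends right after the `!=`, much like the lines in the log posted above.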

    Now that I re-think what I wrote, this doesn’t make sense, because you said that it does replicate the files (using FTP!) properly. So we probably need to take a closer look.

    As I said, it checks size first which seems to fail in most cases considering the log you posted. BUT there is one file where it seems to be ok with size and then goes ahead to match the checksums and fails on that. Seems to be a bit random.
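
    One detail about that single hash mismatch: the local hash shown in the log for d1.original.swapuuids is the SHA-256 digest of empty input. That would suggest the file is empty on the master, though this is an inference from the log and not something confirmed in this thread. A quick way to check the digest in Python:

```python
import hashlib

# Local hash copied from the fogreplicator.log line for d1.original.swapuuids.
LOGGED_HASH = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

# SHA-256 of zero bytes of input.
empty_digest = hashlib.sha256(b"").hexdigest()

print(empty_digest == LOGGED_HASH)
```

    If the file really is zero bytes, a missing remote size could plausibly be treated as equal to 0, which would explain why only this file got past the size check. That part is a guess.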

    Well, using the information given, you could check whether the WAN tunnel imposes some restrictions that might cause this.

    You should also check the Apache access and error logs on the 3rd storage node to see if those requests ever actually reach that node.
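
    A minimal sketch of that log check in Python, assuming the access log uses Apache’s common or combined format, where the client address is the first field. The IP addresses and request paths in the sample lines are made up for illustration; substitute the master’s real address and read the lines from the node’s actual access log.

```python
from typing import Iterable, List


def requests_from(lines: Iterable[str], client_ip: str) -> List[str]:
    """Keep access-log lines whose first field (the client address) equals client_ip."""
    hits = []
    for line in lines:
        fields = line.split(maxsplit=1)
        if fields and fields[0] == client_ip:
            hits.append(line.rstrip("\n"))
    return hits


# Hypothetical sample lines; 192.0.2.10 stands in for the master node.
sample = [
    '192.0.2.10 - - [23/Jan/2020:10:33:19 -0500] "POST /example/size-check HTTP/1.1" 200 7',
    '192.0.2.55 - - [23/Jan/2020:10:33:20 -0500] "GET / HTTP/1.1" 200 512',
]

for hit in requests_from(sample, "192.0.2.10"):
    print(hit)
```

    If nothing from the master shows up around the timestamps in fogreplicator.log, the requests are probably not reaching the node at all.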

