Replication deletes file before transfer is complete


  • Moderator

    r6575.

    What I’m noticing is that due to the slowness of my fog servers transfer rates, it can’t get done before the replicator checks again and determines the files aren’t the same, then the replicator deletes the remote file and starts over. It appears the time limit is 20 minutes.

    This particular file that is being re-replicated over and over happens to be the biggest one in the storage group.

    Here are some of the logs from the replicator:

    [03-05-16 9:58:55 pm] * Started sync for Image Win7
    [03-05-16 9:58:55 pm] | PID: 30391
    [03-05-16 9:58:55 pm] | Replication not complete
    [03-05-16 9:58:55 pm] * Deleting remote file: /images/Win7/d1p2.img
    [03-05-16 9:58:55 pm] | Files do not match
    [03-05-16 9:58:55 pm] | Remote File size: 7051149312
    [03-05-16 9:58:55 pm] | Local File size: 9291944405
    [03-05-16 9:58:55 pm] | Remote File: /images/Win7/d1p2.img
    [03-05-16 9:58:55 pm] | Local File: /images/Win7/d1p2.img
    
    [03-05-16 9:38:13 pm] * Started sync for Image Win7
    [03-05-16 9:38:13 pm] | PID: 15700
    [03-05-16 9:38:13 pm] | Replication not complete
    [03-05-16 9:38:13 pm] * Deleting remote file: /images/Win7/d1p2.img
    [03-05-16 9:38:13 pm] | Files do not match
    [03-05-16 9:38:13 pm] | Remote File size: 7149060096
    [03-05-16 9:38:13 pm] | Local File size: 9291944405
    [03-05-16 9:38:13 pm] | Remote File: /images/Win7/d1p2.img
    [03-05-16 9:38:13 pm] | Local File: /images/Win7/d1p2.img
    
    [03-05-16 9:07:15 pm] * Started sync for Image Win7
    [03-05-16 9:07:15 pm] | PID: 11812
    [03-05-16 9:07:15 pm] | Replication not complete
    [03-05-16 9:07:15 pm] * Deleting remote file: /images/Win7/d1p2.img
    [03-05-16 9:07:15 pm] | Files do not match
    [03-05-16 9:07:15 pm] | Remote File size: 4564320256
    [03-05-16 9:07:15 pm] | Local File size: 9291944405
    [03-05-16 9:07:15 pm] | Remote File: /images/Win7/d1p2.img
    [03-05-16 9:07:15 pm] | Local File: /images/Win7/d1p2.img
    
    [03-05-16 8:47:02 pm] * Started sync for Image Win7
    [03-05-16 8:47:02 pm] | PID: 11210
    [03-05-16 8:47:02 pm] | Replication not complete
    [03-05-16 8:47:02 pm] * Deleting remote file: /images/Win7/d1p2.img
    [03-05-16 8:47:02 pm] | Files do not match
    [03-05-16 8:47:02 pm] | Remote File size: 7483969536
    [03-05-16 8:47:02 pm] | Local File size: 9291944405
    [03-05-16 8:47:01 pm] | Remote File: /images/Win7/d1p2.img
    [03-05-16 8:47:01 pm] | Local File: /images/Win7/d1p2.img
    

    Here’s what the bandwidth looks like during the time of these logs:
    0_1457236867130_Replicator bandwidth.png

    Given, my servers are very old and slow, and given the bandwidth is capped at 100Mbps (I’ll look into that but it’s not a fog problem), this still is a potential issue for slow links, or for really large images. If the checksum is not getting done within the allowed time due to the image being 100 or 150GB in size (There is a site in my organization with an image that big), or if the image isn’t replicated in time due to a slow link, this same problem will crop up again.


  • Moderator

    @Tom-Elliott Confirmed fixed.


  • Senior Developer

    Still me being an idiot.

    I had an iterator being reused within an already existing iterator.

    This should now be fixed, (based on my testing).

    I’m solving again.


  • Moderator

    After rebooting both master and slave nodes, I watched the logs cycle through the replicator checks 3 times, watched it say it’s deleting the d1p2 file 3 times in a row.

    [03-06-16 1:20:08 pm] * Started sync for Image Win7
    [03-06-16 1:20:03 pm] * Deleting remote file: /images/Win7/d1p2.img
    [03-06-16 1:20:03 pm] | Files do not match
    [03-06-16 1:20:03 pm] | Remote File size: 2715025408
    [03-06-16 1:20:03 pm] | Local File size: 9291944405
    [03-06-16 1:20:03 pm] | Remote File: /images/Win7/d1p2.img
    [03-06-16 1:20:03 pm] | Local File: /images/Win7/d1p2.img
    

  • Moderator

    I updated both servers to r6597 at about 10:55, I just got back home and found this on the bandwidth graph:

    0_1457291080274_Replicator Bandwidth 4.png

    I think the graph stopped working after an hour because my laptop went to sleep. But logging back into the fog web interface, the bandwidth is still riding at 100Mbps, which is the max on this particular link.

    This is what just now showed up in the logs, which is the same thing as before:

    [03-06-16 1:06:48 pm] * Started sync for Image Win7
    [03-06-16 1:06:46 pm] * Deleting remote file: /images/Win7/d1p2.img
    [03-06-16 1:06:46 pm] | Files do not match
    [03-06-16 1:06:46 pm] | Remote File size: 3003860200
    [03-06-16 1:06:46 pm] | Local File size: 9291944405
    [03-06-16 1:06:45 pm] | Remote File: /images/Win7/d1p2.img
    [03-06-16 1:06:45 pm] | Local File: /images/Win7/d1p2.img
    

    I’m going to reboot both servers and see if that does anything.


  • Senior Developer

    found there was a bad call (cause I’m a dumba$$), in that the checker would not work the first time around. It would actually break on the wrong file which would have caused some issues. Fixed by moving the check to the start of the main calling loop and not even do anything further down if the original item is already running.


  • Moderator

    Ok more information on this.

    I was keeping an eye on the file itself, and right before the replicator re-runs, the file is actually still growing:

    [root@application Win7]# ls -la d1p2.img
    -rw-r--r--. 1 fog fog 6725181416 Mar  6 09:22 d1p2.img
    [root@application Win7]# ls -la d1p2.img
    -rw-r--r--. 1 fog fog 6803567240 Mar  6 09:22 d1p2.img
    [root@application Win7]# ls -la d1p2.img
    -rw-r--r--. 1 fog fog 6952289792 Mar  6 09:22 d1p2.img
    [root@application Win7]# ls -la d1p2.img
    -rw-r--r--. 1 fog fog 7065146216 Mar  6 09:22 d1p2.img
    [root@application Win7]# ls -la d1p2.img
    ls: cannot access d1p2.img: No such file or directory
    [root@application Win7]# ls -la d1p2.img
    ls: cannot access d1p2.img: No such file or directory
    [root@application Win7]# ls -la d1p2.img
    ls: cannot access d1p2.img: No such file or directory
    [root@application Win7]# ls -la d1p2.img
    ls: cannot access d1p2.img: No such file or directory
    [root@application Win7]# ls -la
    total 9732
    drwxr-xr-x.  2 fog  fog      174 Mar  6 09:23 .
    drwxrwxrwx. 10 root root     228 Mar  5 16:36 ..
    -rw-r--r--.  1 fog  fog        1 Mar  5 14:40 d1.fixed_size_partitions
    -rw-r--r--.  1 fog  fog  1048576 Mar  5 14:40 d1.mbr
    -rw-r--r--.  1 fog  fog      190 Mar  5 14:40 d1.minimum.partitions
    -rw-r--r--.  1 fog  fog       30 Mar  5 14:40 d1.original.fstypes
    -rw-r--r--.  1 fog  fog        0 Mar  5 14:40 d1.original.swapuuids
    -rw-r--r--.  1 fog  fog  8899842 Mar  5 14:40 d1p1.img
    -rw-r--r--.  1 fog  fog      190 Mar  5 14:40 d1.partitions
    

    Right where you see the file become unavailable, that’s when the replicator deleted it:

    [03-06-16 9:23:01 am] * Started sync for Image Win7
    [03-06-16 9:23:01 am] | PID: 31237
    [03-06-16 9:23:01 am] | Replication not complete
    [03-06-16 9:23:01 am] * Deleting remote file: /images/Win7/d1p2.img
    [03-06-16 9:23:01 am] | Files do not match
    [03-06-16 9:23:01 am] | Remote File size: 7170686976
    [03-06-16 9:23:01 am] | Local File size: 9291944405
    [03-06-16 9:23:01 am] | Remote File: /images/Win7/d1p2.img
    [03-06-16 9:23:01 am] | Local File: /images/Win7/d1p2.img
    

    Here is the bandwidth at this exact moment, there is no change in bandwidth. Later on, the bandwidth drops off to nothing:

    0_1457278501816_Replicator Bandwidth 3.png


  • Moderator

    [03-06-16 9:02:01 am] * Started sync for Image Win7
    [03-06-16 9:02:01 am] | PID: 17781
    [03-06-16 9:02:01 am] | Replication not complete
    [03-06-16 9:02:01 am] * Deleting remote file: /images/Win7/d1p2.img
    [03-06-16 9:02:01 am] | Files do not match
    [03-06-16 9:02:01 am] | Remote File size: 7127236608
    [03-06-16 9:02:01 am] | Local File size: 9291944405
    [03-06-16 9:02:01 am] | Remote File: /images/Win7/d1p2.img
    [03-06-16 9:02:01 am] | Local File: /images/Win7/d1p2.img
    

  • Moderator

    I’m still seeing the same issue. Last night I updated both the master and slave and then went to bed. This morning I find that replication didn’t get done still. So I re-updated both again and I’m seeing this entry in the logs every 10 minutes:

    [03-06-16 8:41:19 am] * Started sync for Image Win7
    [03-06-16 8:41:19 am] | PID: 4079
    [03-06-16 8:41:19 am] | Replication not complete
    [03-06-16 8:41:18 am] * Deleting remote file: /images/Win7/d1p2.img
    [03-06-16 8:41:18 am] | Files do not match
    [03-06-16 8:41:18 am] | Remote File size: 6607823640
    [03-06-16 8:41:18 am] | Local File size: 9291944405
    [03-06-16 8:41:17 am] | Remote File: /images/Win7/d1p2.img
    [03-06-16 8:41:17 am] | Local File: /images/Win7/d1p2.img
    

    Another thing I noticed is that the bandwidth maxes out when the d1p2.img file is checksummed. At this point in the logs below, it’s not even deleted yet but this is the moment where the bandwidth maxes out:

    [03-06-16 8:51:25 am] * Started sync for Image Win7
    [03-06-16 8:51:25 am] | Files match
    [03-06-16 8:51:25 am] | Remote File size: 8899842
    [03-06-16 8:51:25 am] | Local File size: 8899842
    [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1p1.img
    [03-06-16 8:51:25 am] | Local File: /images/Win7/d1p1.img
    [03-06-16 8:51:25 am] | Files match
    [03-06-16 8:51:25 am] | Remote File size: 190
    [03-06-16 8:51:25 am] | Local File size: 190
    [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.partitions
    [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.partitions
    [03-06-16 8:51:25 am] | Files match
    [03-06-16 8:51:25 am] | Remote File size: 0
    [03-06-16 8:51:25 am] | Local File size: 0
    [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.original.swapuuids
    [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.original.swapuuids
    [03-06-16 8:51:25 am] | Files match
    [03-06-16 8:51:25 am] | Remote File size: 30
    [03-06-16 8:51:25 am] | Local File size: 30
    [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.original.fstypes
    [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.original.fstypes
    [03-06-16 8:51:25 am] | Files match
    [03-06-16 8:51:25 am] | Remote File size: 190
    [03-06-16 8:51:25 am] | Local File size: 190
    [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.minimum.partitions
    [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.minimum.partitions
    [03-06-16 8:51:25 am] | Files match
    [03-06-16 8:51:25 am] | Remote File size: 1048576
    [03-06-16 8:51:25 am] | Local File size: 1048576
    [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.mbr
    [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.mbr
    [03-06-16 8:51:25 am] | Files match
    [03-06-16 8:51:25 am] | Remote File size: 1
    [03-06-16 8:51:25 am] | Local File size: 1
    [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.fixed_size_partitions
    [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.fixed_size_partitions
    [03-06-16 8:51:24 am] | Image name: Win7
    

    0_1457276104529_Replicator Bandwidth 2.png


  • Senior Developer

    Should be fixed, thanks for reporting.

    I had the delete happening before checking if the prior replication was still running. Added a check to break out if it is already running.


Log in to reply
 

518
Online

39.4k
Users

11.1k
Topics

105.5k
Posts

Looks like your connection to FOG Project was lost, please wait while we try to reconnect.