Replication deletes file before transfer is complete
-
r6575.
What I’m noticing is that due to the slowness of my fog servers transfer rates, it can’t get done before the replicator checks again and determines the files aren’t the same, then the replicator deletes the remote file and starts over. It appears the time limit is 20 minutes.
This particular file that is being re-replicated over and over happens to be the biggest one in the storage group.
Here are some of the logs from the replicator:
[03-05-16 9:58:55 pm] * Started sync for Image Win7 [03-05-16 9:58:55 pm] | PID: 30391 [03-05-16 9:58:55 pm] | Replication not complete [03-05-16 9:58:55 pm] * Deleting remote file: /images/Win7/d1p2.img [03-05-16 9:58:55 pm] | Files do not match [03-05-16 9:58:55 pm] | Remote File size: 7051149312 [03-05-16 9:58:55 pm] | Local File size: 9291944405 [03-05-16 9:58:55 pm] | Remote File: /images/Win7/d1p2.img [03-05-16 9:58:55 pm] | Local File: /images/Win7/d1p2.img
[03-05-16 9:38:13 pm] * Started sync for Image Win7 [03-05-16 9:38:13 pm] | PID: 15700 [03-05-16 9:38:13 pm] | Replication not complete [03-05-16 9:38:13 pm] * Deleting remote file: /images/Win7/d1p2.img [03-05-16 9:38:13 pm] | Files do not match [03-05-16 9:38:13 pm] | Remote File size: 7149060096 [03-05-16 9:38:13 pm] | Local File size: 9291944405 [03-05-16 9:38:13 pm] | Remote File: /images/Win7/d1p2.img [03-05-16 9:38:13 pm] | Local File: /images/Win7/d1p2.img
[03-05-16 9:07:15 pm] * Started sync for Image Win7 [03-05-16 9:07:15 pm] | PID: 11812 [03-05-16 9:07:15 pm] | Replication not complete [03-05-16 9:07:15 pm] * Deleting remote file: /images/Win7/d1p2.img [03-05-16 9:07:15 pm] | Files do not match [03-05-16 9:07:15 pm] | Remote File size: 4564320256 [03-05-16 9:07:15 pm] | Local File size: 9291944405 [03-05-16 9:07:15 pm] | Remote File: /images/Win7/d1p2.img [03-05-16 9:07:15 pm] | Local File: /images/Win7/d1p2.img
[03-05-16 8:47:02 pm] * Started sync for Image Win7 [03-05-16 8:47:02 pm] | PID: 11210 [03-05-16 8:47:02 pm] | Replication not complete [03-05-16 8:47:02 pm] * Deleting remote file: /images/Win7/d1p2.img [03-05-16 8:47:02 pm] | Files do not match [03-05-16 8:47:02 pm] | Remote File size: 7483969536 [03-05-16 8:47:02 pm] | Local File size: 9291944405 [03-05-16 8:47:01 pm] | Remote File: /images/Win7/d1p2.img [03-05-16 8:47:01 pm] | Local File: /images/Win7/d1p2.img
Here’s what the bandwidth looks like during the time of these logs:
Given, my servers are very old and slow, and given the bandwidth is capped at 100Mbps (I’ll look into that but it’s not a fog problem), this still is a potential issue for slow links, or for really large images. If the checksum is not getting done within the allowed time due to the image being 100 or 150GB in size (There is a site in my organization with an image that big), or if the image isn’t replicated in time due to a slow link, this same problem will crop up again.
-
Should be fixed, thanks for reporting.
I had the delete happening before checking if the prior replication was still running. Added a check to break out if it is already running.
-
I’m still seeing the same issue. Last night I updated both the master and slave and then went to bed. This morning I find that replication didn’t get done still. So I re-updated both again and I’m seeing this entry in the logs every 10 minutes:
[03-06-16 8:41:19 am] * Started sync for Image Win7 [03-06-16 8:41:19 am] | PID: 4079 [03-06-16 8:41:19 am] | Replication not complete [03-06-16 8:41:18 am] * Deleting remote file: /images/Win7/d1p2.img [03-06-16 8:41:18 am] | Files do not match [03-06-16 8:41:18 am] | Remote File size: 6607823640 [03-06-16 8:41:18 am] | Local File size: 9291944405 [03-06-16 8:41:17 am] | Remote File: /images/Win7/d1p2.img [03-06-16 8:41:17 am] | Local File: /images/Win7/d1p2.img
Another thing I noticed is that the bandwidth maxes out when the d1p2.img file is checksummed. At this point in the logs below, it’s not even deleted yet but this is the moment where the bandwidth maxes out:
[03-06-16 8:51:25 am] * Started sync for Image Win7 [03-06-16 8:51:25 am] | Files match [03-06-16 8:51:25 am] | Remote File size: 8899842 [03-06-16 8:51:25 am] | Local File size: 8899842 [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1p1.img [03-06-16 8:51:25 am] | Local File: /images/Win7/d1p1.img [03-06-16 8:51:25 am] | Files match [03-06-16 8:51:25 am] | Remote File size: 190 [03-06-16 8:51:25 am] | Local File size: 190 [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.partitions [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.partitions [03-06-16 8:51:25 am] | Files match [03-06-16 8:51:25 am] | Remote File size: 0 [03-06-16 8:51:25 am] | Local File size: 0 [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.original.swapuuids [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.original.swapuuids [03-06-16 8:51:25 am] | Files match [03-06-16 8:51:25 am] | Remote File size: 30 [03-06-16 8:51:25 am] | Local File size: 30 [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.original.fstypes [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.original.fstypes [03-06-16 8:51:25 am] | Files match [03-06-16 8:51:25 am] | Remote File size: 190 [03-06-16 8:51:25 am] | Local File size: 190 [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.minimum.partitions [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.minimum.partitions [03-06-16 8:51:25 am] | Files match [03-06-16 8:51:25 am] | Remote File size: 1048576 [03-06-16 8:51:25 am] | Local File size: 1048576 [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.mbr [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.mbr [03-06-16 8:51:25 am] | Files match [03-06-16 8:51:25 am] | Remote File size: 1 [03-06-16 8:51:25 am] | Local File size: 1 [03-06-16 8:51:25 am] | Remote File: /images/Win7/d1.fixed_size_partitions [03-06-16 8:51:25 am] | Local File: /images/Win7/d1.fixed_size_partitions [03-06-16 8:51:24 am] | Image name: Win7
-
[03-06-16 9:02:01 am] * Started sync for Image Win7 [03-06-16 9:02:01 am] | PID: 17781 [03-06-16 9:02:01 am] | Replication not complete [03-06-16 9:02:01 am] * Deleting remote file: /images/Win7/d1p2.img [03-06-16 9:02:01 am] | Files do not match [03-06-16 9:02:01 am] | Remote File size: 7127236608 [03-06-16 9:02:01 am] | Local File size: 9291944405 [03-06-16 9:02:01 am] | Remote File: /images/Win7/d1p2.img [03-06-16 9:02:01 am] | Local File: /images/Win7/d1p2.img
-
Ok more information on this.
I was keeping an eye on the file itself, and right before the replicator re-runs, the file is actually still growing:
[root@application Win7]# ls -la d1p2.img -rw-r--r--. 1 fog fog 6725181416 Mar 6 09:22 d1p2.img [root@application Win7]# ls -la d1p2.img -rw-r--r--. 1 fog fog 6803567240 Mar 6 09:22 d1p2.img [root@application Win7]# ls -la d1p2.img -rw-r--r--. 1 fog fog 6952289792 Mar 6 09:22 d1p2.img [root@application Win7]# ls -la d1p2.img -rw-r--r--. 1 fog fog 7065146216 Mar 6 09:22 d1p2.img [root@application Win7]# ls -la d1p2.img ls: cannot access d1p2.img: No such file or directory [root@application Win7]# ls -la d1p2.img ls: cannot access d1p2.img: No such file or directory [root@application Win7]# ls -la d1p2.img ls: cannot access d1p2.img: No such file or directory [root@application Win7]# ls -la d1p2.img ls: cannot access d1p2.img: No such file or directory [root@application Win7]# ls -la total 9732 drwxr-xr-x. 2 fog fog 174 Mar 6 09:23 . drwxrwxrwx. 10 root root 228 Mar 5 16:36 .. -rw-r--r--. 1 fog fog 1 Mar 5 14:40 d1.fixed_size_partitions -rw-r--r--. 1 fog fog 1048576 Mar 5 14:40 d1.mbr -rw-r--r--. 1 fog fog 190 Mar 5 14:40 d1.minimum.partitions -rw-r--r--. 1 fog fog 30 Mar 5 14:40 d1.original.fstypes -rw-r--r--. 1 fog fog 0 Mar 5 14:40 d1.original.swapuuids -rw-r--r--. 1 fog fog 8899842 Mar 5 14:40 d1p1.img -rw-r--r--. 1 fog fog 190 Mar 5 14:40 d1.partitions
Right where you see the file become unavailable, that’s when the replicator deleted it:
[03-06-16 9:23:01 am] * Started sync for Image Win7 [03-06-16 9:23:01 am] | PID: 31237 [03-06-16 9:23:01 am] | Replication not complete [03-06-16 9:23:01 am] * Deleting remote file: /images/Win7/d1p2.img [03-06-16 9:23:01 am] | Files do not match [03-06-16 9:23:01 am] | Remote File size: 7170686976 [03-06-16 9:23:01 am] | Local File size: 9291944405 [03-06-16 9:23:01 am] | Remote File: /images/Win7/d1p2.img [03-06-16 9:23:01 am] | Local File: /images/Win7/d1p2.img
Here is the bandwidth at this exact moment, there is no change in bandwidth. Later on, the bandwidth drops off to nothing:
-
found there was a bad call (cause I’m a dumba$$), in that the checker would not work the first time around. It would actually break on the wrong file which would have caused some issues. Fixed by moving the check to the start of the main calling loop and not even do anything further down if the original item is already running.
-
I updated both servers to r6597 at about 10:55, I just got back home and found this on the bandwidth graph:
I think the graph stopped working after an hour because my laptop went to sleep. But logging back into the fog web interface, the bandwidth is still riding at 100Mbps, which is the max on this particular link.
This is what just now showed up in the logs, which is the same thing as before:
[03-06-16 1:06:48 pm] * Started sync for Image Win7 [03-06-16 1:06:46 pm] * Deleting remote file: /images/Win7/d1p2.img [03-06-16 1:06:46 pm] | Files do not match [03-06-16 1:06:46 pm] | Remote File size: 3003860200 [03-06-16 1:06:46 pm] | Local File size: 9291944405 [03-06-16 1:06:45 pm] | Remote File: /images/Win7/d1p2.img [03-06-16 1:06:45 pm] | Local File: /images/Win7/d1p2.img
I’m going to reboot both servers and see if that does anything.
-
After rebooting both master and slave nodes, I watched the logs cycle through the replicator checks 3 times, watched it say it’s deleting the d1p2 file 3 times in a row.
[03-06-16 1:20:08 pm] * Started sync for Image Win7 [03-06-16 1:20:03 pm] * Deleting remote file: /images/Win7/d1p2.img [03-06-16 1:20:03 pm] | Files do not match [03-06-16 1:20:03 pm] | Remote File size: 2715025408 [03-06-16 1:20:03 pm] | Local File size: 9291944405 [03-06-16 1:20:03 pm] | Remote File: /images/Win7/d1p2.img [03-06-16 1:20:03 pm] | Local File: /images/Win7/d1p2.img
-
Still me being an idiot.
I had an iterator being reused within an already existing iterator.
This should now be fixed, (based on my testing).
I’m solving again.
-
@Tom-Elliott Confirmed fixed.