Image/Snapin Replication Failed: Group Ownership
-
@markbam said in Image/Snapin Replication Failed: Group Ownership:
only about 70% are successful.
This doesn’t make sense to me.
@Tom-Elliott Any idea?!
-
@Sebastian-Roth https://stackoverflow.com/questions/25308977/site-chmod-command-failed-through-ftp-cant-figure-out-why
Maybe we need to add the chmod_enabled=YES to the VSFTPD configuration?
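For reference, the vsftpd option is actually spelled chmod_enable (it defaults to YES, which allows clients to issue SITE CHMOD). If it were being set, it would go into the config roughly like this — path assumed, verify for your distro:

```
# /etc/vsftpd/vsftpd.conf  (may be /etc/vsftpd.conf on Debian/Ubuntu)
# Allow FTP clients to issue SITE CHMOD (this is already the default):
chmod_enable=YES
```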
-
@Tom-Elliott But why should it work in some case but not in others??
-
@Sebastian-Roth Well, looking further, I don’t understand why lftp is doing a chmod at all. Nowhere do I see it attempting to chmod the replicated elements.
This leads me to think that while the master-side permissions are working, maybe the files on the nodes receiving the replicated items are owned by root? Meaning maybe fogproject is not the owner on the remote nodes; rather, the files are owned by root or fog?
I can only surmise that the files that are failing already exist on the remote side and are owned by a different user, likely one who does not exist on the remote side.
Hopefully that makes sense.
-
This was my initial thinking and why I started over from scratch on both the Server and Storage Node.
The failing snapins are not present on the storage node. For troubleshooting, I’ve even deleted all items in the snapins folder to try and discover a pattern to the failures. It does not seem to be consistent.
-
@Tom-Elliott said in Image/Snapin Replication Failed: Group Ownership:
Hopefully that makes sense.
Oh yes it does!! That rang a bell for me.
@markbam Please manually adjust the file ownership to be the same on all nodes! See if that fixes the issue for you.
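If ownership does differ between nodes, it can be inspected and aligned roughly like this (paths and the fogproject account as used earlier in this thread; run the chown as root on the affected node):

```shell
# Inspect current owner:group of the replicated files on each node
stat -c '%U:%G %n' /opt/fog/snapins/*

# Align ownership to the FOG service account
chown -R fogproject:fogproject /opt/fog/snapins
```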
-
@Sebastian-Roth
I’m not sure I understand. The failing snapins do not exist on the storage node so I’d have nothing to adjust ownership on.
-
@markbam Quite obviously I headed down the wrong lane! If I had waited a few more minutes to read your last answer I wouldn’t have posted that.
I’ll probably need to dig into this further and test myself. Will need a bit of time though.
-
This may or may not be related:
To update Fog, the installer tells me I need to delete the user account fogproject. When I do so, it changes the user:group of the files in my snapins and images folder from fogproject:fogproject to fogproject:www-data.
So now I know where the www-data is coming from.
-
I think I’m on to something. Restating the problem: The Fog Snapin Replication log shows that the snapin transfers are successful but then fails to chmod and the snapins are deleted from the Storage Node.
But even though Fog records the transfer as successful, it looks like the snapins never actually finish copying. Each snapin transfers only ~100 MB, then something goes wrong. Fog logs the transfer as successful anyway and tries to chmod, which fails because the file isn’t completely there.
So I’m guessing I’m either dropping a connection or hitting an FTP timeout somewhere?
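One way to confirm the truncation theory would be to compare the file on both ends right after a replication attempt — if the copy really completed, size and checksum should match (paths as used in this thread):

```shell
# On the FOG server (source)
stat -c '%s bytes  %n' /opt/fog/snapins/Test1.zip
md5sum /opt/fog/snapins/Test1.zip

# On the storage node (destination) - run the same two commands
# and compare; a smaller size there means a truncated transfer
stat -c '%s bytes  %n' /opt/fog/snapins/Test1.zip
md5sum /opt/fog/snapins/Test1.zip
```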
-
@markbam Good catch!! Are you sure the disk on the storage node has enough free space?
-
Yup plenty of space. Only 34 GB of 2TB used.
-
I’ve been noticing some odd things happening:
“Test1.zip” starts an lftp transfer to /opt/fog/snapins on the Storage Node as expected. Then at ~100 MB transferred, the file “Test1.zip” disappears from /opt/fog/snapins on the Storage Node, BUT the ftp command is still active and transferring. The vsftpd processes still show CPU and network activity.
It seems the transfer keeps streaming data even though the file has ceased to physically exist on disk, so lftp can’t perform a clean termination (the final chmod).
I am able to reproduce this myself using just the command I pulled from Fog. This is run from the Fog Server:
lftp -e 'set ftp:list-options -a; set net:max-retries 10; set net:timeout 30; set net:limit-rate 0:128000; mirror -c --parallel=20 -R -i "Test1.zip" --ignore-time -vvv --exclude ".srvprivate" "/opt/fog/snapins" "/opt/fog/snapins"; exit' -u fogproject,'xxxxxx' xxx.xxx.xxx.xxx
With this I can see the progress bar continuing even after the file disappears from the Storage Node at ~100 MB.
-
@markbam Can you get logs from the remote side?
I don’t know exactly what logs to look for, likely FTP logs if you have them as well as output of
dmesg
I’m wondering if SELinux is stopping the connection for any files that are beyond 100 MB. The other thing to look at is /tmp on the remote box. If there’s not enough room there (maybe it only has around 100 MB available too?), I imagine that could cause an unexpected issue as well.
Just spit-balling of course.
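The checks suggested above could be run on the storage node roughly like this (log locations vary by distro; the audit log path in particular is an assumption):

```shell
# Free space in /tmp on the remote box
df -h /tmp

# SELinux mode, if the tools are installed (Enforcing / Permissive / Disabled)
getenforce 2>/dev/null || echo "SELinux tools not installed"

# Recent kernel messages around the time of a failed transfer
dmesg | tail -n 50

# SELinux denials, if auditd is running (path is distro-dependent)
grep -i denied /var/log/audit/audit.log 2>/dev/null | tail -n 20
```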
-
@markbam Ok, let’s try to enable VSFTP verbose logging: Edit /etc/vsftpd/vsftpd.conf, it should look like this:
max_per_ip=200
anonymous_enable=NO
local_enable=YES
write_enable=YES
local_umask=022
dirmessage_enable=YES
xferlog_enable=YES
connect_from_port_20=YES
xferlog_std_format=YES
listen=YES
pam_service_name=vsftpd
userlist_enable=NO
tcp_wrappers=YES
Add the following two lines:
log_ftp_protocol=YES
vsftpd_log_file=/var/log/vsftpd.log
and change this one line:
xferlog_std_format=NO
This change is important, as vsftpd won’t log properly if the latter is still set to YES.
Now restart the vsftpd service and watch the log /var/log/vsftpd.log …
-
@markbam Any news on this?
-
Nothing definitive yet. The ftp logs aren’t showing anything out of the ordinary.
I’m now thinking somehow my network backend between the two sites could be a culprit. I’ve put in a request for more bandwidth and am waiting for that to kick in.
-
@markbam Are you good with network analysis using tcpdump/Wireshark? If you are keen to give it a try and need help, just let us know.
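A capture along these lines on the FOG server would show whether the FTP data connection is being reset mid-transfer (interface name and the node IP placeholder are assumptions; adjust to your setup):

```
# Capture FTP control (port 21) and passive data traffic to the storage node
tcpdump -i eth0 -w /tmp/fog-ftp.pcap host <storage-node-ip> and tcp

# Then open /tmp/fog-ftp.pcap in Wireshark and look for RST packets
# or heavy retransmissions around the ~100 MB mark
```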