NFS problems after upgrade to trunk

John Sartoris

On Friday I spent some time upgrading our old but well-working FOG server in preparation for our Windows 10 rollout. I upgraded Ubuntu from 10.04 to 12.04, recovered from the Apache 2.2 to 2.4 upgrade, installed the new FOG package, and finally got “everything” online, only to log in and find that another update had been released. I know it’s trunk, so that’s going to happen; that’s what I want. I installed 6237 and started testing.

The problem is that I can’t deploy an image right now. I’m getting the error “Could not mount images folder (/bin/fog.download)”. As far as I can see, the NFS exports are still correct. I found something about creating .mntcheck files in /images and /images/dev, and I’ve done that.

Please help; I’m not seeing anything in the logs to point me in a direction.

Wayne Workman

@John-Sartoris

Let’s see the contents of your /etc/exports file.
Also, ensure that NFS and RPC are running:

sudo service nfs-kernel-server restart
sudo service rpcbind restart

You might also take a look at this: https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_NFS

John Sartoris

/etc/exports

/images *(ro,sync,no_wdelay,no_subtree_check,insecure_locks,no_root_squash,insecure,fsid=0)
/images/dev *(rw,async,no_wdelay,no_subtree_check,no_root_squash,insecure,fsid=1)

# sudo service nfs-kernel-server restart
 * Stopping NFS kernel daemon                        [ OK ]
 * Unexporting directories for NFS kernel daemon...  [ OK ]
 * Exporting directories for NFS kernel daemon...    [ OK ]
 * Starting NFS kernel daemon                        [ OK ]
# sudo service rpcbind restart
rpcbind: unrecognized service

Is rpcbind required? My second storage node, which has been on 12.04 since it was built, doesn’t have it either.

Actually, it is installed on both:

# apt-cache policy rpcbind
rpcbind:
  Installed: 0.2.0-7ubuntu1.3

Wayne Workman

@John-Sartoris Your exports file looks good. The next step is to mount the /images directory on a remote machine manually. The troubleshooting article has steps on how to do that.
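Beyond eyeballing it, an exports file in the one-export-per-line layout shown above can be sanity-checked for two simple mistakes: an exported directory that does not exist, and two exports sharing the same `fsid`. This is a minimal sketch, not a FOG tool: `check_exports` is a hypothetical helper, and it assumes each line is just a path followed by a client spec with options. It does not handle everything exports(5) allows (quoted paths, line continuations, several client specs per line).

```bash
# check_exports [FILE]
# Hypothetical helper: sanity-checks an /etc/exports-style file that uses one
# export per line ("path client(options)"), as in this thread.
# Prints one line per problem found; returns 1 if any problem was found.
check_exports() {
    file=${1:-/etc/exports}
    awk '
        # Skip blank lines and comments.
        /^[ \t]*(#|$)/ { next }
        {
            path = $1
            # Report exported directories that do not exist on this server.
            if (system("test -d \"" path "\"") != 0) {
                print "missing directory: " path
                bad = 1
            }
            # Pull the value of fsid=N out of the option list, if present.
            if (match($0, /fsid=[^,)]*/)) {
                fsid = substr($0, RSTART + 5, RLENGTH - 5)
                if (fsid in seen) {
                    print "duplicate fsid=" fsid ": " seen[fsid] " and " path
                    bad = 1
                } else {
                    seen[fsid] = path
                }
            }
        }
        END { exit bad }
    ' "$file"
}
```

Running `check_exports /etc/exports` on each server prints nothing and returns 0 when both checks pass; that only rules out these two mistakes, not every NFS problem.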

If the machine says RPC is installed, I trust it’s probably installed. I’m not an Ubuntu kind of person; my closet is full of Red Hats.

However, @ch3i is an Ubuntu kind of person, just FYI.

John Sartoris

@Wayne-Workman

OK, I’ve tested mounting the master node’s exports on the storage node in a temp folder and received a timeout. I repeated the test in reverse, mounting the storage node’s exports on the master node in a temp folder, without issue (using a different mount location, of course).

I tried mounting from the master node to itself and still got the timeout, so it’s not a firewall issue, and it shouldn’t be anyway, because I disabled ufw as per the troubleshooting wiki.

showmount lists the exports. Any suggestions? I don’t work with NFS much.

# showmount -e 10.2.yyy.xxx
Export list for 10.2.yyy.xxx:
/images/dev *
/images     *
# mount 10.2.xxx.yyy:/images masterimages
mount.nfs: Connection timed out

Wayne Workman

@John-Sartoris It does sound a lot like a firewall issue, though. Could you disable UFW again just to see? It will only take a moment.

Also, I suppose we should compare the master and the storage node now. What OS is the node using? What version of FOG? Are both updated to the same point (i.e., were updates run on both at the same time)?

Finally, to back up a bit, try some simple network troubleshooting. From the storage node, try pinging the master, FTP-ing into the master, and SSH-ing into the master. Also check for IP conflicts with the master’s address.
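Ping, FTP and SSH only prove that some services answer. A rough extension of the same idea is to attempt plain TCP connections to the ports NFS relies on as well: 111 (rpcbind) and 2049 (nfsd). This sketch assumes bash (for its `/dev/tcp` redirection) and coreutils `timeout` are installed; `probe_port` and `probe_nfs_ports` are hypothetical helpers. An open 111/2049 does not guarantee a mount will work, since an NFSv3 `mountd` usually listens on a port assigned at runtime and registered with rpcbind.

```bash
# probe_port HOST PORT [SECONDS]
# Hypothetical helper: tries a plain TCP connect and prints "open" or
# "closed/filtered". Uses bash's /dev/tcp pseudo-device in a child bash,
# wrapped in coreutils `timeout` so a silently dropped SYN cannot hang it.
probe_port() {
    local host=$1 port=$2 secs=${3:-3}
    if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "$host:$port open"
        return 0
    fi
    echo "$host:$port closed/filtered"
    return 1
}

# probe_nfs_ports HOST
# Checks ftp (21), ssh (22), rpcbind (111) and nfsd (2049); returns the
# number of ports that did not accept a connection (0 means all did).
probe_nfs_ports() {
    local failed=0 p
    for p in 21 22 111 2049; do
        probe_port "$1" "$p" || failed=$((failed + 1))
    done
    return "$failed"
}
```

For example, run `probe_nfs_ports <master-ip>` from the storage node, then the same against `127.0.0.1` on the master itself. A port closed even on loopback points at the service; a port open on loopback but not from the storage node points at the network path or the address.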

John Sartoris

@Wayne-Workman

I’ve re-disabled ufw and added allow rules for good measure. Ping, SSH and FTP all work. Both are running “Ubuntu 12.04.5 LTS”. The storage node was installed with 12.04 more than 6 months ago, after a hard drive failure; the master was upgraded from 10.04 on Friday. I ran “apt-get update” and “apt-get upgrade” on both after the 12.04 upgrade. However, I just tried it again on the master, and the one package it has an update for is “nfs-kernel-server”.

The master had “1.2.0-4ubuntu4.2” while the storage node had “1.2.5-3ubuntu3.2”; both now have “1.2.5-3ubuntu3.2”. The problem persisted at first, but after 5 minutes and a manual “sudo service nfs-kernel-server restart” following the upgrade, NFS seems better: I can now mount the master from the storage node.

However, FOG deployment still has the same issue when using the local master node. If I point it at the storage node across the WAN, deployment works.

Wayne Workman

@John-Sartoris OK. The next thing to check is probably the simplest, and I just didn’t think of it: check permissions.

ls -lahRt /images

Do that on both the master and storage node. Compare results.

To set permissions on the images directory recursively, it’s just:

chmod -R 777 /images

We recommend 777 for troubleshooting.
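Comparing two `ls -lahRt` listings by eye is awkward because `-t` sorts by modification time, which differs between servers. As a sketch (GNU `find` assumed for `-printf`; `perm_snapshot` and `not_777` are hypothetical helper names), a sorted, timestamp-free listing can be taken on each node and compared with `diff`, and entries that are not exactly mode 777 can be listed directly:

```bash
# perm_snapshot DIR
# Hypothetical helper: prints "mode owner group relative/path" for DIR and
# everything under it, sorted by path, with no timestamps or sizes, so that
# snapshots taken on two servers can be compared line by line with diff.
# Requires GNU find (-printf).
perm_snapshot() {
    find "$1" -printf '%m %u %g %P\n' | sort -k4
}

# not_777 DIR
# Hypothetical helper: lists entries whose permission bits are not exactly
# 777 (the value suggested above for troubleshooting), with their full path.
not_777() {
    find "$1" ! -perm 0777 -printf '%m %p\n'
}
```

For example, run `perm_snapshot /images > /tmp/perms.$(hostname)` on both servers, copy one file across, and `diff` the two. Any differing line is a real difference in mode, owner, group or file set.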

John Sartoris

@Wayne-Workman

OK, I didn’t expect that, considering both sites were working fine on Thursday. I’ll try making the master match the storage node and see what happens.

Storage - working

/images/Win10BaseR1:
total 18G
-rwxr-xr-x  1 fog  fog   18G Feb  4 15:06 sys.img.000
drwxr-xr-x  2 fog  fog  4.0K Feb  4 15:03 .
-rwxr-xr-x  1 fog  fog  299M Feb  4 15:03 rec.img.000
-rwxr-xr-x  1 fog  fog     0 Feb  4 14:33 d1.original.swapuuids
-rwxr-xr-x  1 fog  fog   259 Feb  4 14:33 d1.original.partitions
-rwxr-xr-x  1 fog  fog    15 Feb  4 14:33 d1.original.fstypes
-rwxr-xr-x  1 fog  fog     2 Feb  4 14:33 d1.fixed_size_partitions
drwxrwxrwx 18 root root 4.0K Feb  4 12:22 ..

Master - not working

/images/Win10BaseR1:
total 18G
drwxrwxrwx 18 fog  root 4.0K Feb  4 15:02 ..
-rwxrwxrwx  1 root root  18G Feb  4 15:02 sys.img.000
drwxrwxrwx  2 root root 4.0K Feb  4 14:33 .
-rwxrwxrwx  1 root root 299M Feb  4 14:33 rec.img.000
-rwxrwxrwx  1 root root   15 Feb  4 14:32 d1.original.fstypes
-rwxrwxrwx  1 root root    0 Feb  4 14:32 d1.original.swapuuids
-rwxrwxrwx  1 root root  259 Feb  4 14:32 d1.original.partitions
-rwxrwxrwx  1 root root    2 Feb  4 14:32 d1.fixed_size_partitions

Wayne Workman

@John-Sartoris Also, chown -R fog:root /images may help. You still need the 777 permissions from the earlier command, though.

John Sartoris

@Wayne-Workman

I was already at 777 on the master, but even after resetting owner:group and permissions for good measure and restarting nfs-kernel-server, the deploy still fails.

Sebastian Roth

You mentioned creating .mntcheck files. Are you sure they are properly in place?
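The thread treats `.mntcheck` marker files in both `/images` and `/images/dev` as required. A minimal sketch for confirming they are in place as regular files; `check_mntcheck` is a hypothetical helper, and the `/images` default simply matches the layout in this thread:

```bash
# check_mntcheck [ROOT]
# Hypothetical helper: verifies that ROOT/.mntcheck and ROOT/dev/.mntcheck
# exist as regular files (ROOT defaults to /images). Prints each missing
# marker and returns the number of missing markers (0 means both present).
check_mntcheck() {
    local root=${1:-/images} missing=0 d
    for d in "$root" "$root/dev"; do
        if [ ! -f "$d/.mntcheck" ]; then
            echo "missing: $d/.mntcheck"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

This only checks the local filesystem on the server; it does not prove a client can read the files over NFS.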

Wayne Workman

@John-Sartoris said:

I can now mount the master from storage.

Fog deploy still however still has the same issue when trying to use the local master node. If I point to the cross wan storage node deployment works.

Make sure you have the right IP for the main server, and that the image path and FTP path are correct in Storage Management. If everything looks good, click Save anyway just to push the settings. Sometimes browsers’ auto-fill feature really messes with this area.

It’s just not typical for NFS to break in this manner, that’s why I ask you to check these things.

John Sartoris

@Sebastian-Roth

Same places as the working node.

/images:
total 72
drwxrwxrwx 18 root root 4096 Feb  4 12:22 .
drwxr-xr-x 26 root root 4096 Feb  8 09:01 ..
-rwxrwxrwx  1 root root    0 Jul 29  2014 .mntcheck

/images/dev:
total 8
drwxrwxrwx  2 root root 4096 Jul 29  2014 .
drwxrwxrwx 18 root root 4096 Feb  4 12:22 ..
-rwxrwxrwx  1 root root    0 Jul 29  2014 .mntcheck

Wayne Workman

@John-Sartoris Are you using the location plugin?

John Sartoris

@Wayne-Workman said:

@John-Sartoris Are you using the location plugin?

Yes, I am using the location plugin. Is there a known issue?

I completely understand. When it doesn’t make sense, double-check the things that wouldn’t make sense…

The IP addresses and paths are correct, and I re-saved them. I also double-checked and re-saved the node choice in the location plugin.

Still no luck. Is there any way I can get more detail on the client machine to see exactly what is failing? It just says “An error has been detected!” without specifying what.

Wayne Workman

@John-Sartoris Do a debug download. On the task confirmation page, the debug option is a checkbox. This boots the target host into a shell, and the screen initially shows a variable dump; it can be quite valuable to see what it says.

Additionally, I would recommend removing the location plugin entirely (via Plugin Management), then reinstalling it and setting it up again for your various locations.

John Sartoris

@Wayne-Workman

From the client in debug…

mount: mounting 10.2.yyy.xxx:/images on images failed: Connection refused

That sounds to me like it’s still a permissions or ACL issue.

Wayne Workman

@John-Sartoris “Refused” is different from “denied”… Have you checked for IP conflicts?
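The difference Wayne points out can be summarized as a rough lookup from the mount error text to the layer worth investigating. This is a heuristic sketch, not logic taken from FOG or nfs-utils; `classify_mount_error` is hypothetical, and the matched fragments are the two errors quoted in this thread plus the common `access denied by server` wording from `mount.nfs`. A refusal means something at that address actively rejected the connection, which is why an IP conflict (another machine answering for the master’s address) is worth ruling out.

```bash
# classify_mount_error MESSAGE
# Hypothetical, heuristic helper: maps the text of a failed NFS mount to the
# most likely layer to investigate. Matching is case-insensitive.
classify_mount_error() {
    local msg
    msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $msg in
        *"connection refused"*)
            echo "refused: host answered but rejected the connection (service not listening on that IP, reject rule, or another machine holding the IP)" ;;
        *"timed out"*)
            echo "timeout: no answer at all (dropped packets, wrong IP, routing, or a hung service)" ;;
        *"access denied"*|*"permission denied"*)
            echo "denied: server reached; check /etc/exports rules and directory permissions" ;;
        *)
            echo "unknown: inspect the full message" ;;
    esac
}
```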

george1421

I didn’t read all of the posts here, but could you run showmount -e 127.0.0.1? This will show us what you have shared over NFS on your FOG server. You could then run the same command with your FOG server’s external IP address instead of the loopback address. It could be a firewall issue.
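George’s comparison can be scripted: query the exports via the loopback address and via the server’s external IP, drop the `Export list for ...:` header line (it contains the address queried, so it always differs), and diff the rest. A failure only on the external IP suggests the problem lies in the network path or the address rather than in the exports themselves. `compare_exports` is a hypothetical helper, and the `SHOWMOUNT` variable exists only so the command can be swapped out for testing:

```bash
# compare_exports EXTERNAL_IP
# Hypothetical helper: runs `showmount -e` against 127.0.0.1 and against
# EXTERNAL_IP, drops the "Export list for ...:" header (which differs by
# address), and diffs the two export lists.
# Returns 0 if both queries succeed and the lists match, 1 otherwise.
compare_exports() {
    local sm=${SHOWMOUNT:-showmount} tmpdir rc
    tmpdir=$(mktemp -d) || return 1
    if ! "$sm" -e 127.0.0.1 > "$tmpdir/lo"; then
        echo "showmount via 127.0.0.1 failed"; rm -rf "$tmpdir"; return 1
    fi
    if ! "$sm" -e "$1" > "$tmpdir/ext"; then
        echo "showmount via $1 failed"; rm -rf "$tmpdir"; return 1
    fi
    # Skip the header line and squeeze runs of spaces/tabs before comparing.
    tail -n +2 "$tmpdir/lo"  | tr -s ' \t' ' ' | sort > "$tmpdir/lo.s"
    tail -n +2 "$tmpdir/ext" | tr -s ' \t' ' ' | sort > "$tmpdir/ext.s"
    if diff "$tmpdir/lo.s" "$tmpdir/ext.s"; then rc=0; else rc=1; fi
    rm -rf "$tmpdir"
    return "$rc"
}
```

For example, run `compare_exports <server-ip>` on the FOG server itself.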
