RC6 - Snapins no longer working
-
Since upgrading to RC6, snapins no longer deploy.
When sending a single snapin to a host, the status of the task changes to checked in after a while, but stays on that status indefinitely.
fog.log on the client:
------------------------------------------------------------------------------
---------------------------------SnapinClient---------------------------------
------------------------------------------------------------------------------
 4-8-2016 10:03 Client-Info Client Version: 0.11.4
 4-8-2016 10:03 Client-Info Client OS: Windows
 4-8-2016 10:03 Client-Info Server Version: 1.3.0-RC-6
 4-8-2016 10:03 Middleware::Response Success
 4-8-2016 10:03 SnapinClient Snapin Found:
 4-8-2016 10:03 SnapinClient ID: -1
 4-8-2016 10:03 SnapinClient Name:
 4-8-2016 10:03 SnapinClient Created: -1
 4-8-2016 10:03 SnapinClient Action:
 4-8-2016 10:03 SnapinClient Pack: False
 4-8-2016 10:03 SnapinClient Hide: False
 4-8-2016 10:03 SnapinClient Server:
 4-8-2016 10:03 SnapinClient TimeOut: -1
 4-8-2016 10:03 SnapinClient RunWith:
 4-8-2016 10:03 SnapinClient RunWithArgs:
 4-8-2016 10:03 SnapinClient Args:
 4-8-2016 10:03 SnapinClient File:
 4-8-2016 10:03 SnapinClient ERROR: Snapin hash does not exist
------------------------------------------------------------------------------
-
Fairly sure we got this figured out tonight. I've added the changes to the head state of SVN and Git, so the fixes are now part of 1.3.0-RC-8; no need to wait.
Thanks @Wayne-Workman for the TeamViewer session, which helped us narrow down what the issue was.
For all following along, it basically boiled down to file hashing taking far too long on large files. The checker would fail if the script took longer than 30 seconds, and it would also fail if the connection timed out (the timeout defaulted to 15 seconds). So by default, if a large file was not hashed within 45 seconds, the snapin would fail completely. The fix, for simplicity's sake, is to let the hashing call run without a time limit and to increase the connection timeout to a day. This could have been avoided by storing the hash in the database, but I have a hard time trusting that: if somebody manually updated a file, the hash check would always fail.
For huge snapins (> 5 GB) I'd recommend installing the software in your image rather than relying on the snapin system to install it. I say this because, even without hashing, transferring such a large file (especially to many hosts) creates a lot of bandwidth usage (leaving the server less to perform imaging with if needed), and the file would be requested once per host. Add in the hashing (which is another way the client and server help prevent bad files) and you have one big mess of load and I/O issues.
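To see why hashing time grows with file size, here is a minimal sketch of chunked file hashing. This is illustrative Python, not FOG's actual code (FOG's server side is PHP), and the choice of SHA-512 is an assumption; the point is that the digest must read every byte, so a fixed 30-second limit inevitably fails on multi-gigabyte snapins.

```python
# Hypothetical sketch: stream a large file through a hash in fixed-size
# chunks so memory stays flat. Time taken is proportional to file size,
# which is why a hard 30-second script timeout breaks on big snapins.
import hashlib

def hash_file(path, chunk_size=4 * 1024 * 1024):
    """Return the SHA-512 hex digest of a file, read 4 MiB at a time."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Streaming in chunks keeps memory use constant regardless of snapin size, but the total wall-clock time still scales linearly with the file, hence the fix of removing the time limit rather than speeding up the hash.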
-
Confirmed issue.
Sorry, this was me working on commonizing and making a friendlier filesize checker/getter.
The file this references to get the hash was missing a required item that allows access to the rest of the FOG information. It is now fixed in RC-7.
-
@Tom-Elliott When is RC-7 expected to be released?
-
@Wayne-Workman I don’t know. It’s literally only 2 days old.
-
So, I know this thread has been marked as solved already,
but my building is on RC-6 and we have the same problem with snapins. None of our snapins work, and we’re hurting over it.
I’m highly anticipating RC-7, and I hope it is released soon.
-
Confirmed working again in RC-7.
I deployed 1,200 single snapins this morning to mixed groups; some hosts got one snapin, some got three. It's knocking them out pretty quickly; the FOG server has a 1 Gbps connection and it's pegged right now.
-
This exact problem still exists in the current working-RC-8 branch, for multi-node fog systems with locations enabled.
I have an 800 MB MSI that I cannot deploy from anywhere except the main server.
-
@Wayne-Workman Nothing changed for snapins.
-
@Tom-Elliott Then it was never fixed for locations.
-
@Wayne-Workman Yes it was. You just need to get all items on the same page. All nodes need the update, not just the main.
-
@Tom-Elliott All nodes are on RC-7; the main is on working-RC-8.
-
@Wayne-Workman Right but with the replication issue, it’s likely unable to use the location properly.
-