Middleware::Response ERROR: Object reference not set to an instance of an object
- Server version: 1.3-RC8
- Server OS: CentOS 7
- Client version: 0.11.5
- Client OS: Windows 7 Enterprise
On Friday when I left for the weekend I kicked off an imaging task for a group of laptops. When I came back on Monday I found a number of them had not completed their hostnamechanger or snapin tasks and were logging two middleware errors every time they checked in with the server.
Middleware::Response ERROR: Unable to get subsection
Middleware::Response ERROR: Object reference not set to an instance of an object.
Figuring it was just a fluke, I set them all to re-image, and now all of them produce the same result immediately after the reboot for joining the domain. I’ve rebooted the FOG server as well just to be sure, but the issue persists. I did notice while rebooting the FOG server that it hung shortly after being issued the reboot command, so it had to be hard powered off; may be unrelated but figured it was worth mentioning. I can provide any logs needed to troubleshoot.
@Tom-Elliott Thanks Tom. I’ll have a chance to do a confirmation test in my setup sometime next week and let you know if I encounter any issues related to this.
With the release of rc10 I’ve solved this thread.
@Wayne-Workman I don’t think size, alone, is the issue. It’s the number of hosts hashing, which I’m fairly confident we got working properly for rc9. Load on the server will still likely happen if all the files are relatively large in size. I’m working on a few kinks with the database currently, which should have limited impact on people, if you wanted to give it a test just to see if snapins are less impacting to your server. Just talk to Wayne or myself on how to implement, as I don’t want everybody jumping ship. The working RC branch is what will become the next RC release, but I’m still working on at least one semi-major issue and don’t want everybody just giving it a shot.
@Darrin-Enerson How large are your snapins, in size?
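To illustrate why the number of hosts hashing matters more than file size alone, here is a minimal Python sketch. It is not FOG code and does not describe how rc9 or rc10 actually addressed this; `HashCache` is a hypothetical helper that hashes a file once and reuses the digest for every later request, so 24 check-ins cost one hash instead of 24.

```python
import hashlib
import os
import tempfile


class HashCache:
    """Hypothetical helper: memoize digests by path, size, and mtime."""

    def __init__(self):
        self._digests = {}
        self.computations = 0  # number of times a file was actually hashed

    def digest(self, path):
        stat = os.stat(path)
        # A changed size or modification time yields a new key, so an
        # edited file is hashed again instead of returning a stale digest.
        key = (os.path.abspath(path), stat.st_size, stat.st_mtime_ns)
        if key not in self._digests:
            hasher = hashlib.sha512()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    hasher.update(chunk)
            self._digests[key] = hasher.hexdigest()
            self.computations += 1
        return self._digests[key]


if __name__ == "__main__":
    # Stand-in for a snapin file; the content is made up for the demo.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example snapin payload")
    try:
        cache = HashCache()
        # Simulate 24 hosts asking for the hash of the same file.
        digests = {cache.digest(tmp.name) for _ in range(24)}
        print(len(digests), cache.computations)
    finally:
        os.remove(tmp.name)
```

Without the cache, each simulated check-in would read and hash the whole file again, which is roughly the load pattern described in this thread.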
I was able to do a bit more testing on this today and think I may have discovered the root cause of the issue. The problem doesn’t actually appear to be in the imaging system but rather in the snapin system. I say this because I can image 25 computers simultaneously with no snapins and don’t encounter any issues, however, if I deploy to the same 25 with snapins it exhibits this behavior as soon as the snapins start applying. The issue appears to be that when a snapin is first pushed out it looks to be running a hash function on the server and client, presumably to make sure it received an unaltered file before executing. The problem is that if a number of these tasks start at roughly the same time it maxes out the CPU and RAM on the server and strange things start to happen. I haven’t found the lower limit of where this starts to occur but I can reproduce it with my process every time I try it so can do any further troubleshooting needed as well as provide an entire batch of logs.
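For anyone wondering what that hash check amounts to, here is a minimal Python sketch of the general technique. It assumes SHA-512 only because the server shows `sha512sum` processes; the function names are hypothetical, and this is not the FOG client's or server's actual code.

```python
import hashlib
import os
import tempfile


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unaltered(path, expected_hex):
    """True when the local file's digest matches the expected digest."""
    return sha512_of_file(path) == expected_hex.lower()


if __name__ == "__main__":
    # Stand-in for a downloaded snapin; the content is made up for the demo.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example snapin payload")
    try:
        server_side = sha512_of_file(tmp.name)  # what a server would report
        print(is_unaltered(tmp.name, server_side))
    finally:
        os.remove(tmp.name)
```

Each call reads the entire file, so the cost grows with both file size and the number of callers hashing at the same time.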
So, the fix: limit your maximum clients in the storage node settings. That’s in Storage Management.
You should also look into Multicast - it’ll be your best friend ever.
As an update to this I can image smaller batches of 10 with no issue. That would tend to indicate a load, memory leak, or process leak issue. Let me know if you need more troubleshooting on my end to narrow this down. I’ll proceed in smaller batches for now so that I can get everything prepped by my deadlines but I can set up larger batches for testing this as needed.
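If it helps with planning the smaller batches, here is a minimal Python sketch that splits a host list into groups of at most 10. The host names and the `split_into_batches` helper are made up for illustration; nothing here talks to FOG.

```python
def split_into_batches(hosts, batch_size):
    """Split a list of hosts into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


if __name__ == "__main__":
    # 24 hypothetical laptop names standing in for the real group.
    hosts = [f"laptop-{n:02d}" for n in range(1, 25)]
    for number, batch in enumerate(split_into_batches(hosts, 10), start=1):
        print(f"batch {number}: {len(batch)} hosts, {batch[0]} to {batch[-1]}")
```

With 24 hosts and a batch size of 10, this gives two batches of 10 and a final batch of 4.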
@Wayne-Workman On these the WiFi settings aren’t loaded until the computers join the domain so they’re entirely LAN based connection until then. As I mentioned if I image each of these one at a time it works without a hitch but in a batch of 24 it fails seemingly randomly and hammers the FOG web interface into instability in the process. I haven’t tried smaller batches yet because I didn’t want to destroy the client logs if they could be helpful. I will note however that in FOG 1.2.0 with the old client this setup doesn’t cause any issues so it’s definitely related to 1.3.0 and the new client.
@Darrin-Enerson I read in the OP that these are laptops. I would imagine that this has something to do with ping fogserver either not working on LAN or on Wifi.
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\IT>ping fogserver

Pinging fogserver.beaconacademy.com [172.16.12.6] with 32 bytes of data:
Reply from 172.16.12.6: bytes=32 time<1ms TTL=63
Reply from 172.16.12.6: bytes=32 time<1ms TTL=63
Reply from 172.16.12.6: bytes=32 time<1ms TTL=63
Reply from 172.16.12.6: bytes=32 time<1ms TTL=63

Ping statistics for 172.16.12.6:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\Users\IT>
I can pull up the FOG management page from that computer as well.
@Darrin-Enerson On that computer, open up cmd and ping fogserver. What happens?
@Wayne-Workman The system won’t let me upload it as a file and it’s too long for the posting requirements. Here’s the Google Drive link to one of the logs though: https://drive.google.com/file/d/0BzdlDm_GwIvwdllHakduWGdwUTQ/view?usp=sharing.
This comes from a host that was one of the last to finish imaging in this batch. As such the other hosts were already going through joining the domain and attempting to apply snapins. Since whatever is causing this issue also causes the web interface to become increasingly unstable this one failed to even join the domain. Let me know what logs, if any, you need from the server itself.
@Darrin-Enerson Please post a log from just one of the affected hosts. Grab a copy of it after imaging finishes and after the fog client does all its stuff. We need a full log, but it doesn’t need to be overly long; as long as all the happenings are in it, we can use it.
@Tom-Elliott Each host has the same set of 17 snapins associated to them.
@Darrin-Enerson Do you have a lot of snapins?
@Tom-Elliott If you want a whole slew of fog.log files from these hosts I can provide them or I can deploy some debugging scripts if you want since it seems like I can reproduce the issue on command currently. I can even send you our image for testing if you’d like.
For reference we’re using a gold master image with the auto driver install method from the wiki. The only software installed in the image is the FOG Client 0.11.5 with the FOGService set to “Disabled” before sysprep. We are using a KMS key so setupcomplete.cmd runs afterwards without an issue. As part of the auto driver install we push up a fresh copy of setupcomplete.cmd right after the image is finished deploying that contains the following:
net STOP "FOGService"
start "Reset C:\Drivers Perms" /wait "icacls" C:\drivers /reset /T >>"C:\FOGDrivers.log" 2>&1
sc config FOGService start= auto
shutdown -t 0 -r
I did notice on the run that I left over the weekend that the client did try to update and then started throwing the errors so I updated the client in the image before retrying. After that deployment all clients exhibited the issue immediately. I can provide the logs from that second run.
Oddly I don’t see this issue if I only image a single station but if I image the entire batch of 24 it seems to show up consistently. As I mentioned I noticed that the server was getting hammered with sha2sum processes as soon as the clients start checking in with a full batch of 24. The specs on the VM the server was running are 1 core 2.7 GHz, 4 GBs RAM, 10 Gb VMXNET3 Ethernet. I have since upped this to 2 core 2.7 GHz and 8 GBs RAM but have yet to retest just in case you need me to reproduce this issue.
The problem is your hosts likely have the fog client installed and running as a part of the image.
So what’s occurring is before the systems reboot they’re being assigned a security token. This by itself is fine, but as the system comes up, the client is relinking and causing the Security token to get updated (as they are cleared during the imaging process itself).
Then they finish their setup items and likely update the client to the latest version, which fails because the security token doesn’t match what was expected.
This is all speculation, but I suspect this is what you’re running into.
Joe and I have been trying to figure out exactly how to replicate the issue, and all we’ve been able to do is speculate. It seems, to us, to fix the issue, you need to reset the encryption data once the system is up and waiting for normal items to occur.
Or if your image is sysprepped, before uploading make sure the client is disabled and your host isn’t joined to the domain. As a part of the firstlogon or setupcomplete scripts (Firstlogon is needed for windows 8 and up OEM licensing as it no longer runs setupcomplete for you) have the script enable the fog client and reboot the machine. I force a reboot on my side without starting the service that way I know all is fresh and clean. You should be able to just have the service start after enabling, but I don’t know if the client will be forced to update and if so if the service will automatically restart after the update.
Should that be necessary immediately after being imaged though? It seems odd that I’d need to do that on a freshly imaged machine right after it changes the hostname and joins the domain.
Looking at the FOG server, it has maxed out its CPU usage from a whole slew of sha512sum processes now and is intermittently dumping me to the database schema installer/updater screen when navigating the web interface.
EDIT: I was able to get back to a stable web interface by shutting down the entire batch of clients that had just been imaged. This allowed me to reset the encryption on one of them to test this out. It does appear to fix the issue for the client. The web interface issue though seems to be a legitimate bug as 24 clients checking in while in this state were able to completely cripple it. I’m going to reset the encryption on all of them, start them back up, and see if the web interface stays stable.
EDIT #2: I went to the group that this batch of laptops is a part of, clicked “Reset Encryption Data”, and then clicked “Update”. After that I started them all back up and am receiving the same error messages as before. I tried “Reset Encryption Data” on one of the individual unit’s host entries as well while they were all still running. This didn’t correct the issue for that unit like it did when I tested with only the single unit. With all 24 clients powered on the web interface is getting increasingly sluggish but has yet to become unstable.
Not a bug.
Please go to the host giving this issue and choose reset encryption data.