SOLVED FOG service on 0.10.6 not restarting after reboot

  • I seem to be seeing a race condition with the FOG service on Windows 10, where the service fails to start after Windows boots. My server is at 7647 and the client FOG service runs okay after I start it up via the services panel. Just to test it out I added an automatic restart on the service after it fails and this starts it up after a reboot. Is it dependent on the Task Scheduler service?

  • @gwhitfield said in FOG service on 0.10.6 not restarting after reboot:

    demon ninja gremlins

    We are certified demon ninja gremlin dispatchers.

  • @Wayne-Workman @Joe-Schmitt I’ve had some initial success with using a “net start FOGService” batch file as a boot time task. 100% so far in 3-4 individual attempts, just haven’t proven it out as 100% successful on a large group of 790’s. If I find the crazy driver that’s doing this I will report back in this thread.

    The client works on everything I have newer than the 790 with both LTSB and CBB images (both 10240 and 1511), and has been perfect in Win7 on every machine I’ve tried, even MUCH older ones. In the long run (for me anyway), as the legacy machines go away, hopefully it becomes a distant memory. I don’t envy you trying to deal with demon ninja gremlins across various operating systems and hardware!

  • @gwhitfield I’ve typically had success using Vista drivers for win7, and win 7 drivers for 8.1. You have to manually specify the drivers in device manager. You could try and it may work fine.

  • @Joe-Schmitt Well, no good news to report at the moment. I remembered this morning that Dell doesn’t offer Win10 drivers for the 790 so it’s extremely difficult to tell whether the “correct” drivers were installed by Windows. There are no obvious problems in Device Manager. I have toyed with several ideas:

    1. Creating a scheduled task to “net start FOGService” that I can either set up in Task Scheduler prior to sysprepping the image, or push via GPO (which of course doesn’t work if the machines aren’t joined to the domain yet).
    2. Since I have been pushing drivers using Lee Rowlett’s scripts, maybe I’ll stop doing that for the 790 and let the Windows install figure them out, maybe it will come up with different ones than I was sending.
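
Idea 1 above could be scripted roughly as follows. This is only a sketch, not something tested in this thread: the task name `StartFOGService` is hypothetical, and it assumes the standard `schtasks /create` switches (`/tn`, `/tr`, `/sc onstart`, `/ru SYSTEM`, `/f`) and an elevated prompt.

```python
import subprocess

TASK_NAME = "StartFOGService"  # hypothetical task name, not from the thread


def build_boot_task_command(task_name=TASK_NAME, service="FOGService"):
    """Return the schtasks argument list for a boot-time 'net start' task."""
    return [
        "schtasks", "/create",
        "/tn", task_name,               # task name
        "/tr", f"net start {service}",  # command the task runs
        "/sc", "onstart",               # trigger: system startup
        "/ru", "SYSTEM",                # run as the SYSTEM account
        "/f",                           # overwrite a task with the same name
    ]


def create_boot_task(run=subprocess.run):
    """Create the task; needs an elevated prompt on Windows."""
    return run(build_boot_task_command(), check=True)
```

Running `create_boot_task()` in the reference image before sysprep would bake the task into the image, which sidesteps the GPO limitation for machines that have not joined the domain yet.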

    So many projects underway, it may be a while before I can get through these tests but I’ll let you know what happens as I do them.

  • @Joe-Schmitt

    Yes, FOG service is disabled in the image, setupcomplete.cmd re-enables.

  • @Mentaloid said in FOG service on 0.10.6 not restarting after reboot:

    Anyways, if it is a driver update/WU based reboot, then this reg key will have been created. Could the fog client detect that, and instead of exiting, accelerate the reboot schedule somehow?
    HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending

    Now there’s a great idea.

  • Senior Developer

    @Mentaloid quick question. You mentioned you used a universal image from a VM. Is it sysprepped? And if so, is the client service disabled on startup as described in our wiki?

  • The driver thing is interesting - and possibly could be part of things.

    So today a colleague deployed to ~60 computers.
    Half of them failed to start fog service after reboot from FOGClientupdate/renaming/joining domain.
    Half of them worked fine.
    The difference between them is a different model motherboard (Both ASUS, H61 and B75 based respectively)

    The image used was made on a VM, with no drivers pre-installed for either motherboard/chipset. In theory, both would need drivers to be installed during OOBE, so I’m not so sure what…
    The H61 based systems should be slower, but not by a whole lot, at least not for imaging tasks.

    Anyways, if it is a driver update/WU based reboot, then this reg key will have been created. Could the fog client detect that, and instead of exiting, accelerate the reboot schedule somehow?

    HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending
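
A check for that key might look like the sketch below. It only tests whether the key exists, using Python's standard `winreg` module. The `open_key` parameter is a hypothetical hook added so the logic can be exercised off Windows; this is not how the FOG client is known to work.

```python
try:
    import winreg  # only available on Windows
except ImportError:
    winreg = None

REBOOT_PENDING_KEY = (
    r"SOFTWARE\Microsoft\Windows\CurrentVersion"
    r"\Component Based Servicing\RebootPending"
)


def reboot_pending(open_key=None):
    """Return True if the RebootPending key exists under HKLM."""
    if open_key is None:
        if winreg is None:
            raise RuntimeError("winreg is only available on Windows")

        def open_key(path):
            return winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path)

    try:
        handle = open_key(REBOOT_PENDING_KEY)
    except FileNotFoundError:
        return False  # key absent: no servicing reboot is pending
    close = getattr(handle, "Close", None)
    if close is not None:
        close()  # release the registry handle
    return True
```

A client could call `reboot_pending()` before asking for its own restart and, if it returns `True`, stay running and retry instead of exiting.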

    Here is the relevant section of the log file - which looks ok, but looks like for some reason (scheduled reboot?) it did not reboot.

     6/27/2016 4:04 PM Client-Info Client Version: 0.11.2
     6/27/2016 4:04 PM Client-Info Client OS:      Windows
     6/27/2016 4:04 PM Client-Info Server Version: 8263
     6/27/2016 4:04 PM Middleware::Response Success
     6/27/2016 4:04 PM HostnameChanger Checking Hostname
     6/27/2016 4:04 PM HostnameChanger Removing host from active directory
     6/27/2016 4:04 PM HostnameChanger The machine is not currently joined to a domain, code =  2692
     6/27/2016 4:04 PM HostnameChanger Renaming host to HiLibLabBack4
     6/27/2016 4:04 PM Power Creating shutdown request
     6/27/2016 4:04 PM Power Parameters: /r /c "FOG needs to rename your computer" /t 0
     6/27/2016 4:04 PM Bus {
      "self": true,
      "channel": "Power",
      "data": "{\r\n  \"action\": \"shuttingdown\"\r\n}"
     6/27/2016 4:04 PM Bus Emmiting message on channel: Power

    At this point, no snapins have run, and the computer does not reboot…I logged in via RDP:

     6/27/2016 10:03 PM Main Overriding exception handling
     6/27/2016 10:03 PM Main Bootstrapping Zazzles
     6/27/2016 10:03 PM Controller Initialize
     6/27/2016 10:03 PM Entry Creating obj
     6/27/2016 10:03 PM Controller Start
     6/27/2016 10:03 PM Service Starting service
     6/27/2016 10:03 PM Bus Became bus server
     6/27/2016 10:03 PM Bus {
      "self": true,
      "channel": "Status",
      "data": "{\r\n  \"action\": \"load\"\r\n}"
     6/27/2016 10:03 PM Bus Emmiting message on channel: Status
     6/27/2016 10:03 PM Client-Info Version: 0.11.2
     6/27/2016 10:03 PM Client-Info OS:      Windows
     6/27/2016 10:03 PM Middleware::Authentication Waiting for authentication timeout to pass
     6/27/2016 10:03 PM Middleware::Communication Download: http://fog.XYZ.local/fog/management/other/ssl/srvpublic.crt
     6/27/2016 10:03 PM Data::RSA FOG Server CA cert found
     6/27/2016 10:03 PM Middleware::Authentication Cert OK
     6/27/2016 10:03 PM Middleware::Communication POST URL: http://fog.XYZ.local/fog/management/index.php?sub=requestClientInfo&authorize&newService
     6/27/2016 10:03 PM Middleware::Response Success
     6/27/2016 10:03 PM Middleware::Authentication Authenticated

    As you can see, for some reason (foguser?) the service restarted when a user logged in, and carried on - despite a pending reboot that never happened.

    UPDATE: As of ~10:30PM, the machines that hadn’t finished the rename/reboot, rebooted on their own (must have been that scheduled power thing!) and started carrying on.
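
To spot this pattern across many machines, the gap between a shutdown request and the next service start can be pulled out of fog.log. The sketch below assumes the timestamp layout and the two message strings seen in the excerpts above ("Power Creating shutdown request" and "Service Starting service"). The function names are made up for illustration.

```python
from datetime import datetime

STAMP_FORMAT = "%m/%d/%Y %I:%M %p"  # e.g. 6/27/2016 4:04 PM


def parse_line(line):
    """Split a fog.log line into (timestamp, message); None if it has no timestamp."""
    parts = line.strip().split(" ", 3)
    if len(parts) < 4:
        return None
    try:
        stamp = datetime.strptime(" ".join(parts[:3]), STAMP_FORMAT)
    except ValueError:
        return None  # continuation line, such as the JSON bus payload
    return stamp, parts[3]


def reboot_gaps(lines):
    """Yield (requested_at, resumed_at) for each shutdown request followed by a service start."""
    requested_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if message.startswith("Power Creating shutdown request"):
            requested_at = stamp
        elif message.startswith("Service Starting service") and requested_at is not None:
            yield requested_at, stamp
            requested_at = None
```

For the two excerpts above, the request is logged at 4:04 PM and the service starts again at 10:03 PM, a gap of 5 hours 59 minutes. A normal rename and reboot should show a gap of only a few minutes.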

  • Senior Developer

    @gwhitfield Those logs indicate that the client is performing perfectly. Here is our current theory on what is happening:

    You have the issue on Dell machines. Dell machines are notorious for having slight hardware configuration differences even on the same model (wifi, graphics, …). When deployed, Windows notices that driver XX doesn’t quite match the hardware. It applies the correct driver and schedules a restart. This scheduling of the restart locks Windows’ power operations.

    The client then starts up, joins the domain, and asks Windows to restart. Windows refuses since there is already a lock on the power operations. The client doesn’t know this, and assumes the power operation is under way. It then gracefully exits.

    This theory is consistent with the logs you provided. Here is a quick test for you:

    Select one of your problem machines. Make sure all drivers are installed correctly on this one machine. Re-upload the image using this machine and re-deploy to all of your problem machines. Are there fewer occurrences of your issue? If so, then this is your problem.
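
If the theory above holds, one way to avoid the silent exit is to check whether Windows actually accepted the restart request. The sketch below assumes the request goes through `shutdown.exe` with the parameters shown in the log, and that a non-zero exit code means the request was refused. Neither assumption is confirmed in this thread, and the function names are hypothetical.

```python
import subprocess

# Parameters copied from the "Power Parameters" line in the log.
SHUTDOWN_ARGS = ["/r", "/c", "FOG needs to rename your computer", "/t", "0"]


def request_restart(run=subprocess.run):
    """Ask Windows to restart and report whether the request was accepted."""
    result = run(["shutdown", *SHUTDOWN_ARGS])
    return result.returncode == 0  # assumption: non-zero means refused


def restart_or_keep_running(run=subprocess.run):
    """Return 'exit' only when Windows accepted the restart request."""
    if request_restart(run):
        return "exit"          # restart is under way, safe to stop the service
    return "keep-running"      # request refused: stay alive and retry later
```

Staying alive in the refused case gives the service a chance to retry the restart later, rather than leaving the machine renamed but never rebooted.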

  • @Joe-Schmitt Joe, I edited my post (a couple times now) of logs to include a machine that DID join the domain, but then the service didn’t restart after the second reboot. These have both been manually rebooted a couple times since and the service refuses to start.

    There IS one machine which seems to have experienced the same failures (non-joining domain and service start failure), but after a single manual reboot, the service did restart and the machine joined the domain and the service has continued to successfully restart during multiple manual reboots since. Unfortunately I will have to wait until tomorrow to get those logs since this machine is sleeping soundly and won’t wake-on-LAN.

    EDIT: Reading back early in the thread, I find it may help to note these machines are Win 10 x64 LTSB (10240).

    Thanks for your help and Wayne’s input as well!!

  • @Joe-Schmitt The post below this one, that’s the issue I was referring to here:

    I wasn’t able to find a log though, yet. Good thing @gwhitfield posted one.

  • @Joe-Schmitt - Trunk 8257 - Joe, I had client 11.2 deploy to 25 Dell 790 machines today as a test and only 2 of them joined the domain. They all renamed but the service doesn’t appear to restart after the first reboot. No logs on PC after first shutdown command. I am attaching a fog.log from one of the machines and the apache error log during the time of the task. Hopefully this will tell you something:

    1_1467056255448_apache error 11.2 client non-restart.txt

    0_1467072182110_apache err 11.2 next 60 minutes.txt



    Oddly, I don’t have this issue on all machines I’ve been working on, mainly Dell 7020 and later. Maybe it’s a problem with the older hardware keeping up??

    Edits: I found the zazzles log and user log and thought they may also help, so I just uploaded them. Seems like a clue here, but fog.log hasn’t changed since it stopped at 1:36.

  • @Joe-Schmitt Sweet. Will do!

  • Senior Developer

    Please update your server again. v0.11.2 was released and patches another power issue.

  • @Joe-Schmitt I am having a similar issue though I have already updated to client 11.1 after I read this post a couple days ago. Ubuntu 14.04, FOG trunk 8095. In a lab of 25 Dell 790’s I had 10 or so not join the domain. I found the FOG service not running on each of them. I had no problems with my lab of Dell 7020’s or Dell 7040’s, both of which have better processors and more memory. The fog.log has no errors; immediately after imaging it commands a shutdown to rename/rejoin and then the logging stops. Unfortunately I am off-site until Monday so I can’t reach a machine to test much, but I am trying to get one of my slower VMs tested to see if it happens there.

  • Senior Developer

    v0.11.0 is released and fixes this issue. Can you test when you have a chance?

  • Senior Developer

    v0.11 of the client should prevent this. The FOG Service will have a dependency on dnscache, which is Windows’ DNS Client. The DNS Client is one of the last network services to start and all versions of Windows within reason use it.
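
On a client older than v0.11, a similar dependency could in principle be set by hand with `sc config`. This is a sketch only, not the way the client installer is known to do it. Note that `depend=` replaces the service's whole dependency list, so any existing dependencies would have to be listed again, separated by `/`.

```python
import subprocess


def build_dependency_command(service="FOGService", dependency="Dnscache"):
    """Return the sc argument list that makes `service` depend on `dependency`."""
    # sc expects the value as a separate token after "depend=".
    return ["sc", "config", service, "depend=", dependency]


def add_dns_dependency(run=subprocess.run):
    """Apply the dependency; needs an elevated prompt on Windows."""
    return run(build_dependency_command(), check=True)
```

`sc qc FOGService` afterwards should list the dependency in its DEPENDENCIES field.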

  • I do - I had a lab of 28 computers imaged on a fresh image of win10 edu/1511. Setupcomplete re-enables FOGService after it was disabled for sysprep. Image has nothing else added except office 2016.

    Of the 28 computers, 2 of them failed to start fogservice during the setupcomplete, and 8 more of them failed to start fogservice after a snapin called for a reboot after install. All 28 computers are identical, bought at the same time. i5@2.3g,4gb ram.

    As with the other examples here, the fog.log has nothing in it except stating it’s gone to sleep - no exit info for shutting down with the system, and no startup. Zazzles.log doesn’t appear to exist…

    I’ve changed my setupcomplete.cmd to have this in it…

    sc config FOGService start= delayed-auto
    waitfor /t 5 null
    sc failure FOGService reset= 86400 actions= restart/60000/restart/60000/restart/60000
    waitfor /t 5 null
    sc failureflag FOGService 1
    waitfor /t 5 null
    sc start FOGService

    to see if this has any effect. Note that the failure flag of 1 tells the system to restart the service if it exits unexpectedly (non-0 exit code) using the failure sequence.
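
For anyone reading the `sc failure` line above, the `actions=` value is a list of action/delay pairs with delays in milliseconds, and `reset=` is in seconds. The small helper below (a hypothetical name, for illustration only) decodes the string:

```python
def parse_failure_actions(actions):
    """Turn 'restart/60000/...' into [(action, delay_ms), ...]."""
    fields = actions.split("/")
    if len(fields) % 2 != 0:
        raise ValueError("expected action/delay pairs")
    return [(fields[i], int(fields[i + 1])) for i in range(0, len(fields), 2)]


# The value used in setupcomplete.cmd above.
plan = parse_failure_actions("restart/60000/restart/60000/restart/60000")
```

Here `plan` holds three `("restart", 60000)` entries: the first, second, and subsequent failures each trigger a restart after 60 seconds, and `reset= 86400` clears the failure count after one day without failures.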

    This workaround won’t be tested until next deploy (next weekend maybe?)

    Also - I had periodically experienced this in a VM guest (VMware ESXi 6.0 host, 4virt cpu, 8gvem). Server was practically idle at the time, and is ridiculously fast (deploy, oobe, reboot, desktop takes about 6 minutes). This same guest after another reboot started the service fine, and had been used to test images repeatedly. I’d say it was only about a 5% rate of failure to start the service in the VM.

  • Senior Developer

    @tmerrick do you still have some computers experiencing this issue? I am attempting to track down exactly what’s failing but can’t replicate the issue.