Weird Host behavior (some disappearing/losing primary mac, some suddenly needing approval)


  • Testers

    TL;DR

    Changed the display name of my image.
    Deployed said image.
    Suddenly some hosts have to be approved.
    Hosts are typically added via pxe boot menu. There was a snapin hash error on one host and then it suddenly needed to be approved but didn’t show up in pending hosts list.
    Other hosts did show up in pending hosts lists, but all of them had been authenticated through the fog service and working previously.


    I am on fog 1.5.8.13 on centos 7.

    I have discovered some weird behavior and I’m not sure where it’s coming from.

    I’m still troubleshooting to find more exact parameters for when this is happening so I’ll do my best to explain.

    Scenario 1 - host losing primary mac

    I have a bundle of devices in my environment that don’t have integrated ethernet. So I have a solution for using usb/usb-c ethernet adapters that support pxe.
    I have a handful of these adapters I use for specific models and such. For example, I’m imaging a microsoft surface go 2, it requires a surface usb-c ethernet adapter to pxe boot.
    So I have 1 adapter (they don’t need ethernet once they’re deployed) and I have a function at the very end of my provisioning script that removes the mac address of this usb ethernet adapter, and then approves all the pending macs that the fog client would have found by that point. Thus making it so the host will be found over wifi by the fog client identified by its wifi mac(s). This has worked well for quite some time.

    However, right now I just imaged this surface go 2 and somehow before reaching the part of my provisioning process that would remove that, the host has lost its mac address. If I search for the host in the gui (or via the api) it doesn’t show up anywhere. If I try to make a new one, it says a host with that name already exists. So I searched the database and found that the host still exists there, but it has no matching record in the HostMac table. Which is where I’m getting the ‘lost its primary mac’ as the name of this problem.

    Some other notes: I create the hosts and queue them for imaging in the pxe boot menu (Add host full registration option). The fog client is embedded in my image but the service is disabled before capturing the image. The provisioning process starts it on demand as needed till it’s completely enabled for installing assigned snapins. I noticed that this happened when it was attempting to install snapins because it got stuck in a loop since the fog client log kept throwing an ‘invalid host’ error. I use the fogservice to rename the computer and join the domain, which did happen successfully

    This happened with fog client 0.11.9, I just saw that 0.12.0 is available and added it to my server so it will start updating my clients. I didn’t see anything in the release notes that leads me to believe it will fix anything though. My current hypothesis is that the client is the culprit of this problem, but I don’t know for sure.

    Edit

    I don’t want to remove my other notes but before I finished writing this out did a bit more log tracing.
    I found the time that these things changes via the fog.log

    ------------------------------------------------------------------------------
    ---------------------------------SnapinClient---------------------------------
    ------------------------------------------------------------------------------
     7/17/2020 1:44:51 PM Client-Info Client Version: 0.11.19
     7/17/2020 1:44:51 PM Client-Info Client OS:      Windows
     7/17/2020 1:44:51 PM Client-Info Server Version: 1.5.8.13
     7/17/2020 1:44:51 PM Middleware::Response Success
     7/17/2020 1:44:51 PM SnapinClient Running snapin 7zipZstd
     7/17/2020 1:44:51 PM Middleware::Communication Download: http://192.168.100.117//fog/service/snapins.file.php?mac=60:6D:3C:3D:41:97|F8:E4:E3:FA:6D:E8|FA:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:EB&taskid=5397
     7/17/2020 1:44:52 PM SnapinClient C:\Program Files (x86)\FOG\tmp\chocoPkgSnapin.ps1
     7/17/2020 1:44:52 PM Bus Emmiting message on channel: Notification
     7/17/2020 1:44:52 PM SnapinClient Starting snapin
     7/17/2020 1:46:27 PM SnapinClient Snapin finished
     7/17/2020 1:46:27 PM SnapinClient Return Code: 0
     7/17/2020 1:46:27 PM Bus Emmiting message on channel: Notification
     7/17/2020 1:46:27 PM Middleware::Communication URL: http://fogserver/fog/service/snapins.checkin.php?taskid=5397&exitcode=0&mac=60:6D:3C:3D:41:97|F8:E4:E3:FA:6D:E8|FA:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:EB&newService&json
     7/17/2020 1:46:28 PM SnapinClient Running snapin DeviceForm - Detachable
     7/17/2020 1:46:28 PM Middleware::Communication Download: http://192.168.100.117//fog/service/snapins.file.php?mac=60:6D:3C:3D:41:97|F8:E4:E3:FA:6D:E8|FA:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:EB&taskid=5393
     7/17/2020 1:46:28 PM SnapinClient C:\Program Files (x86)\FOG\tmp\chocoPkgSnapin.ps1
     7/17/2020 1:46:28 PM SnapinClient ERROR: Hash does not match
     7/17/2020 1:46:28 PM SnapinClient ERROR: --> Ideal: 817B881A4D07DCFF51F67A47D9595DDE59D010FE0C018BDF5B57C502CB83A14F3085AF48987B392E8BF38E2C6B29D094055BDFFB6BBCD5647697D64AA4BDD668
     7/17/2020 1:46:28 PM SnapinClient ERROR: --> Actual: 5271D1B9F35EA9FC2F1E3CE474C20641A15B37A02035E7E8210DFB513518CBC79FDFD25092A41491C2AB02AF8E0913C99819C6066B9883F398D90C5D045E4A1D
     7/17/2020 1:46:28 PM Middleware::Communication URL: http://fogserver/fog/service/snapins.checkin.php?taskid=5393&exitcode=-1&mac=60:6D:3C:3D:41:97|F8:E4:E3:FA:6D:E8|FA:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:EB&newService&json
    ------------------------------------------------------------------------------
    
    
    ------------------------------------------------------------------------------
    --------------------------------PrinterManager--------------------------------
    ------------------------------------------------------------------------------
     7/17/2020 1:46:28 PM Client-Info Client Version: 0.11.19
     7/17/2020 1:46:28 PM Client-Info Client OS:      Windows
     7/17/2020 1:46:28 PM Client-Info Server Version: 1.5.8.13
     7/17/2020 1:46:28 PM Middleware::Response Module is disabled globally on the FOG server
    ------------------------------------------------------------------------------
    
    
    ------------------------------------------------------------------------------
    --------------------------------PowerManagement-------------------------------
    ------------------------------------------------------------------------------
     7/17/2020 1:46:28 PM Client-Info Client Version: 0.11.19
     7/17/2020 1:46:28 PM Client-Info Client OS:      Windows
     7/17/2020 1:46:28 PM Client-Info Server Version: 1.5.8.13
     7/17/2020 1:46:28 PM Middleware::Response Success
     7/17/2020 1:46:28 PM PowerManagement Calculating tasks to unschedule
     7/17/2020 1:46:28 PM PowerManagement Calculating tasks to schedule
    ------------------------------------------------------------------------------
    
    
    ------------------------------------------------------------------------------
    ----------------------------------UserTracker---------------------------------
    ------------------------------------------------------------------------------
     7/17/2020 1:46:28 PM Client-Info Client Version: 0.11.19
     7/17/2020 1:46:28 PM Client-Info Client OS:      Windows
     7/17/2020 1:46:28 PM Client-Info Server Version: 1.5.8.13
     7/17/2020 1:46:28 PM Middleware::Response Success
     7/17/2020 1:46:28 PM Middleware::Communication URL: http://fogserver/fog/service/usertracking.report.php?action=login&user=QA-TAB-6105\AdlAdmin&mac=60:6D:3C:3D:41:97|F8:E4:E3:FA:6D:E8|FA:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:EB&newService&json
    ------------------------------------------------------------------------------
    
     7/17/2020 1:46:28 PM Service Sleeping for 98 seconds
     7/17/2020 1:48:07 PM Middleware::Communication URL: http://fogserver/fog/management/index.php?sub=requestClientInfo&configure&newService&json
     7/17/2020 1:48:07 PM Middleware::Response Success
     7/17/2020 1:48:07 PM Middleware::Communication URL: http://fogserver/fog/management/index.php?sub=requestClientInfo&mac=60:6D:3C:3D:41:97|F8:E4:E3:FA:6D:E8|FA:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:E7|F8:E4:E3:FA:6D:EB&newService&json
     7/17/2020 1:48:07 PM Middleware::Response Invalid host
     7/17/2020 1:48:07 PM Middleware::Communication URL: http://fogserver/fog/service/getversion.php?clientver&newService&json
     7/17/2020 1:48:07 PM Middleware::Communication URL: http://fogserver/fog/service/getversion.php?newService&json
    
     7/17/2020 1:48:07 PM Service Creating user agent cache
     7/17/2020 1:48:07 PM Middleware::Response Module is disabled on the host
     7/17/2020 1:48:07 PM Middleware::Response Module is disabled globally on the FOG server
     7/17/2020 1:48:07 PM Middleware::Response Module is disabled globally on the FOG server
    
    

    So right after a snapin hash didn’t match suddenly everything is disabled and invalid. (the snapin hash not matching relates to a separate issue that makes no sense when updating snapins via the api that use a snapin file that already exists, but I have a separate failsafe that gets that installed so don’t worry about that right now). This error isn’t unfamiliar to me but I can usually ignore it. I looked at my personal provisioning process logs and there wasn’t anything running at that time as an api call to the fog server or anything like that.

    This lead me to think, wait is this task still active?
    So I checked the gui and lo and behold it was. For some reason I could get to host through the link in the tasks screen but not anywhere else. It said it was pending and needed to be approved…
    It did not show up in the pending hosts and the hostPending field in the database was empty as far as I could tell. Here’s the database host record query for reference (with some private info ‘redacted’)

     select * from hosts where hostname = "qa-tab-6105";
    +--------+-------------+-----------------------------------------------+--------+-----------+--------------+---------------------+---------------------+--------------+-----------+---------------------+-----------------------------------------------------------------------------+------------+------------+------------------+----------------+------------------+----------------+------------+------------+----------+-------------+------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+---------------------+--------------+--------------+-------------+-------------+
    | hostID | hostName    | hostDesc                                      | hostIP | hostImage | hostBuilding | hostCreateDate      | hostLastDeploy      | hostCreateBy | hostUseAD | hostADDomain        | hostADOU                                                                    | hostADUser | hostADPass | hostADPassLegacy | hostProductKey | hostPrinterLevel | hostKernelArgs | hostKernel | hostDevice | hostInit | hostPending | hostPubKey                                                       | hostSecToken                                                                                                                     | hostSecTime         | hostPingCode | hostExitBios | hostExitEfi | hostEnforce |
    +--------+-------------+-----------------------------------------------+--------+-----------+--------------+---------------------+---------------------+--------------+-----------+---------------------+-----------------------------------------------------------------------------+------------+------------+------------------+----------------+------------------+----------------+------------+------------+----------+-------------+------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+---------------------+--------------+--------------+-------------+-------------+
    |   1737 | qa-tab-6105 | Created by FOG Reg on July 17, 2020, 12:35 pm |        |        27 |            0 | 2020-07-17 12:35:12 | 2020-07-17 12:41:47 | fog          | 1         | domain.site | OUstring | user  | password  |                  |                |                  |                |            |            |          |             | secretKey | secretKey | 2020-07-17 14:09:29 | 0            |              |             | 1           |
    +--------+-------------+-----------------------------------------------+--------+-----------+--------------+---------------------+---------------------+--------------+-----------+---------------------+-----------------------------------------------------------------------------+------------+------------+------------------+----------------+------------------+----------------+------------+------------+----------+-------------+------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+---------------------+--------------+--------------+-------------+-------------+
    

    So I guess that this all relates to hosts suddenly being unapproved

    Other Hosts suddenly requiring approval

    During imaging

    Another host that we imaged yesterday is a more standard desktop. It did not lose its mac address but instead suddenly required approval in the gui. It was also giving the ‘invalid host’ error.
    This one appeared in the pending hosts lists, got approved and then continued through its provisioning via the fogservice as normal.

    Randomly (seemingly at least)

    While troubleshooting the surface go scenario, checked the pending hosts in the web gui again and found 3 active/deployed hosts suddenly needing approval.
    Sadly their fog logs didn’t go as far back to be able to see if they perhaps had the same snapin hash error before becoming invalid.

    Other notes

    The only other thing I can think of that might be worth noting is that I did change the display name of the image all of these hosts had associated with them. When each of them had to be approved they also had lost any image associations. Could changing the display name of an image cause such problems. I also overwrote the image file on the server manually but that’s something I’ve also done a million times before. We have 3 images, a dev, stable, and previous. New images are uploaded as dev, then tested, if it’s stable we move the image files to the stable folder and keep a previous version of stable as a fail safe. Then we just add what version of windows 10 is in the image on the display name. i.e. the current one is Base-Stable-1909. Before renaming it it was Base-Stable-1809. The image id and all that stayed the same so this shouldn’t have caused any issue, but that’s the only thing I can think of that we changed before this happened.


  • Testers

    @Sebastian-Roth I updated and imaged a computer. It didn’t happen during imaging again, yay!

    However, I just saw another host that was imaged back in February come up as pending for no apparent reason. It still has its image history, inventory, membership, snapin history, login history, etc. It is missing its assigned image (it was assigned before the image’s display name was changed) and it is missing its active directory OU settings. I have not taken it out of pending in case there’s something more to look at for debugging purposes. I’m working remotely at the moment and it seems the computer is turned off, otherwise I’d grab the fog.log in case there’s something in there that hasn’t been truncated that might help.


  • Testers

    @Sebastian-Roth I updated to the RC2 from the master branch. Gonna be imaging a machine (well a virtual machine) shortly. I’ll let you know how it goes


  • Testers

    This post is deleted!

  • Testers

    @Sebastian-Roth I will give that a go and see if it that eradicates the issue. I haven’t seen it happen since I fixed all the snapin hashes and updated the client, but I’ve also only imaged one computer since then.


  • Senior Developer

    @JJ-Fullmer Would you mind updating to the very latest RC version. While I am not sure I do remember having looked into a similar sounding issue and fixed that only a few weeks ago.


  • Testers

    So in my second test to see if this happened again it didn’t happen again. The second surface go 2 I’m imaging got past the part the other one magically became pending.

    So there were 2 differences one or both of which fixed the problem

    • Manually Re-uploaded the file for the snapin that had a mismatch hash error
    • Fog Client 0.12.0 was used instead of 0.11.19

    I need to image another different device before I can be really sure.

    So now I’m wondering why the new 0.12.0 fog client isn’t being auto updated to on all my clients through the client updater? As I recall from past updates, simply updating the installer on the fog server side should get all the clients to update. Am I forgetting a step? @Sebastian-Roth maybe you have some insights.
    I can deploy it through my own software deployment tools if I must.


  • Testers

    @JJ-Fullmer I just relooked at the database output and noted that the hostPending field was set to ‘1’. So it did at least say in the database that the host became pending, but it still wasn’t showing in the gui outside of the existing tasks.
    I’m currently testing to see if I can recreate the issue or if having fixed the snapin with the hash error solves the issue.



323
Online

7.2k
Users

14.4k
Topics

135.6k
Posts