Failing to image after VPN drop between FOG Node & primary server



  • Hi All,

    Been using FOG in a multi-site setup for the past year quite successfully. Some 800+ jobs have been kicked off.

    The main FOG server is located in our data centre and we have a local storage node at each site.

    Recently during a unicast clone session at one site, the VPN link dropped out whilst imaging was taking place. When these machines finished imaging (from the local storage node) they then bombed out with an error message because the primary FOG server couldn’t be contacted. (Sorry I don’t recall what this message was!)

    Once the VPN was back up, the jobs were cancelled. That actual deployment still worked OK, but the couple of machines we have since attempted to reimage are now not working. The machines still PXE boot into the FOG OS, but instead of partclone executing, it appears to skip straight over the imaging step and reboot. FOG then thinks the job completed successfully.

    My guess is some temporary file/flag is set somewhere still? Any ideas of where I should start looking to clean this up?

    EDIT: Running FOG 1.2.0 too :)



  • The odd thing is that the problem lab was still booting Win7 fine, despite GParted showing a blank partition table! Back to reimaging now :)



  • Ah man… :eek:

    Just ran GParted over it and it said there was no partition table. I created one with GParted and rebooted, and the machine then imaged fine. That also explains why a brand new (replacement) drive didn’t work.

    I’d just assumed that FOG would restore the partition table from the image too (it’s a multi-partition image).
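
For anyone who wants to check for this without booting GParted, here is a minimal sketch of the same test in Python. It is not something FOG ships; the function name is made up for illustration, and it only looks at a classic MBR (the 0x55AA signature plus the four primary entries), so treat it as a rough first check rather than a full partition probe.

```python
def has_mbr_partition_table(sector: bytes) -> bool:
    """Return True if the first disk sector holds an MBR with at least one entry.

    `sector` is the first 512 bytes of the disk. This is an illustrative
    helper, not part of FOG.
    """
    if len(sector) < 512:
        raise ValueError("need at least the first 512 bytes of the disk")
    # An MBR ends with the boot signature 0x55 0xAA at offsets 510-511.
    if sector[510:512] != b"\x55\xaa":
        return False
    # Four 16-byte partition entries start at offset 446; byte 4 of each
    # entry is the partition type, and 0x00 means the slot is unused.
    for i in range(4):
        entry = sector[446 + 16 * i : 446 + 16 * (i + 1)]
        if entry[4] != 0:
            return True
    return False


# Demo on synthetic data: a zeroed sector versus one with a single NTFS-type entry.
blank = bytes(512)
with_entry = bytearray(512)
with_entry[510:512] = b"\x55\xaa"
with_entry[446 + 4] = 0x07
print(has_mbr_partition_table(blank), has_mbr_partition_table(bytes(with_entry)))
```

On a real machine you would read the first 512 bytes of the disk device as root (the device path, for example /dev/sda, is an assumption and varies per system) and pass them to the function.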


  • Developer

    Do those computers have partitions on the drives?



  • Also, if I tell it to do a Memtest task, that works fine. It just seems to skip the imaging step entirely.



  • Hi Tom,
    I’ve tried that and it’s no different.

    Looking at the Apache log on the primary server, I’m seeing this when I PXE boot a problematic machine and then select the ‘quick image’ task:

    10.1.1.110 - - [28/Oct/2014:12:59:59 +1000] "POST /fog/service/ipxe/boot.php HTTP/1.1" 200 609 "-" "iPXE/1.0.0+ (3a02)"
    10.1.1.110 - - [28/Oct/2014:12:59:59 +1000] "POST /fog/service/ipxe/boot.php HTTP/1.1" 200 948 "-" "iPXE/1.0.0+ (3a02)"
    10.1.1.110 - - [28/Oct/2014:13:00:08 +1000] "POST /fog/service/inventory.php HTTP/1.1" 200 299 "-" "Wget"
    10.1.1.110 - - [28/Oct/2014:13:00:08 +1000] "GET /fog/service/Pre_Stage1.php?mac=78:45:c4:2f:61:09&type=down HTTP/1.1" 200 300 "-" "Wget"
    10.1.1.110 - - [28/Oct/2014:13:00:08 +1000] "GET /fog/service/Post_Stage3.php?mac=78:45:c4:2f:61:09&type=down HTTP/1.1" 200 297 "-" "Wget"

    Update:
    The ImagingLog db table shows an entry for that machine ID with identical start/finish times, and the TaskLog table shows State 3 & 4 entries with identical times as well.
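
The telltale sign in the log above is Pre_Stage1.php and Post_Stage3.php being requested by the same client within the same second, meaning no imaging happened in between. A rough Python sketch for spotting that pattern in an access log follows; the function name and the "same timestamp" heuristic are my own assumptions, not anything FOG provides.

```python
import re
from collections import defaultdict

# Captures client IP, timestamp, and request path from a combined-format log line.
# Accepts straight or curly opening quotes, since pasted logs often get mangled.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] ["“](?:GET|POST) (\S+)')


def skipped_imaging(lines):
    """Return sorted (ip, timestamp) pairs where both stage scripts hit in one second."""
    seen = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ip, stamp, path = m.groups()
        # Drop the query string and directory, keeping only the script name.
        script = path.split("?", 1)[0].rsplit("/", 1)[-1]
        seen[(ip, stamp)].add(script)
    wanted = {"Pre_Stage1.php", "Post_Stage3.php"}
    return sorted(key for key, scripts in seen.items() if wanted <= scripts)


SAMPLE = [
    '10.1.1.110 - - [28/Oct/2014:13:00:08 +1000] "POST /fog/service/inventory.php HTTP/1.1" 200 299 "-" "Wget"',
    '10.1.1.110 - - [28/Oct/2014:13:00:08 +1000] "GET /fog/service/Pre_Stage1.php?mac=78:45:c4:2f:61:09&type=down HTTP/1.1" 200 300 "-" "Wget"',
    '10.1.1.110 - - [28/Oct/2014:13:00:08 +1000] "GET /fog/service/Post_Stage3.php?mac=78:45:c4:2f:61:09&type=down HTTP/1.1" 200 297 "-" "Wget"',
]
print(skipped_imaging(SAMPLE))
```

A real deployment normally takes minutes, so any client this reports is worth checking by hand.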


  • Senior Developer

    The only place I’m aware of to “clean up” tasks is the task page. Since the start of the 1.x.x series, we no longer build files to load the system for PXE booting; we just use database values directly. My only guess is that the database may have become corrupted when the VPN link dropped.

    You can try repairing your database tables. Make your life a little easier and install phpMyAdmin. Log in to your database and select the FOG database. It should show a list on the left containing all of the FOG tables. Scroll to the bottom and choose “check all”. From the drop-down, choose Repair and let it run.

    It’s only a guess. It doesn’t necessarily mean there is a problem with the DB, but it’s where I would start.
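
If you would rather not install phpMyAdmin, the Repair option there boils down (as far as I know) to running MySQL REPAIR TABLE statements. Below is a small Python sketch that just builds those statements so you can paste them into a mysql prompt. The helper name is hypothetical, and the two table names are simply the ones mentioned earlier in this thread, so check the exact names and case in your own database first. Note that REPAIR TABLE applies to MyISAM tables and is not supported for InnoDB.

```python
def repair_statements(tables):
    """Build one REPAIR TABLE statement per table name (illustrative helper)."""
    statements = []
    for name in tables:
        # Refuse names containing a backtick so the quoting below stays valid.
        if "`" in name:
            raise ValueError(f"unsafe table name: {name!r}")
        statements.append(f"REPAIR TABLE `{name}`;")
    return statements


# Table names as written in this thread; verify them against your own schema.
for stmt in repair_statements(["ImagingLog", "TaskLog"]):
    print(stmt)
```

Take a database dump before running any repair, in case it makes things worse.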

