    Wrong target device

    Unsolved | FOG Problems
    7 Posts 3 Posters 101 Views
    • Floppyrub
      last edited by

      I’ve currently run into a problem in FOG that I can’t resolve. I’m using version 1.5.10.1698. I have different systems with varying numbers of hard drives. All of the systems have an NVMe drive, which is where the installation is supposed to go. Sometimes the systems also have one or two additional data HDDs.

      As soon as even one HDD is connected, FOG always selects it as the target device for deployment. This has already led to a painful data loss. In debug mode, the NVMe drive is displayed correctly, and if I disconnect the HDD, the installation works fine on the NVMe.

      Could this issue be related to how the Windows image was created? I currently can’t rule out that when the Sysprep was performed, the HDD was also connected alongside the NVMe (the drive on which Windows is installed).

      I seem to remember that in earlier FOG versions, it always selected the NVMe drives automatically. I’m now trying to find out where exactly the error occurs. Could you tell me which file in FOG determines the target disk for deployment?

    • Tom Elliott @Floppyrub
        last edited by Tom Elliott

        @Floppyrub The getHardDisk function in /usr/share/fog/lib/funcs.sh is where the code is located.

        You’ll want to look at github.com/fogproject/fos for this, specifically the file at https://github.com/FOGProject/fos/tree/master/Buildroot/board/FOG/FOS/rootfs_overlay/usr/share/fog/lib/funcs.sh.

        Recent updates were made attempting to account for the “Host Primary Disk” field, allowing serial/WWN/disk-size lookups to help pinpoint which drive to use for imaging when that field is set.

        As a point of consistency, it now de-duplicates and sorts the drives, so it’s possible that:

        /dev/hdX is chosen as the primary drive before /dev/nvmeX because of the sorting feature.

        There’s no real way to consistently ensure NVMe loads before HDDs, though, so there was always that potential; it’s just that NVMe runs on the PCI bus directly rather than the ATA buses (which are generally much slower to power on).

        Now /dev/sdX (in the new layout) would most likely be safe, because lexicographically it would fall after the NVMe devices in name sorting, I’d imagine.
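        As a quick illustration of that ordering (hypothetical device names, not output from a real machine), the same sort -u used in getHardDisk places hdX before the NVMe devices and sdX after them, regardless of discovery order:

```shell
# Devices in the order a kernel might discover them, piped through
# the same `sort -u` used in getHardDisk.
printf '%s\n' /dev/sda /dev/nvme0n1 /dev/hda /dev/nvme0n1 | sort -u
# -> /dev/hda
#    /dev/nvme0n1
#    /dev/sda
```

        So a legacy hdX drive would be picked as the first disk even when the NVMe was discovered first, while sdX names stay behind the NVMe names.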

        Currently, I’m aware that the released version of the inits likely also pre-sorts by disk size (assuming the largest drive is the primary disk you’d want to send the image to when you’re not using the Host Primary Disk feature).

        From my viewpoint (limited as it may be), you may need to start using UUID/WWN/serial identifiers for these multi-disk setups where you don’t want to accidentally overwrite a disk.
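        To sketch why a serial (or WWN/UUID) pins down the disk where a device name cannot, here is a minimal, self-contained example; the helper function, device names, and serials are all hypothetical, not the actual FOS matching code:

```shell
# Pick a device by serial from a "device=serial" inventory, similar in
# spirit to what lsblk -pdno KNAME,SERIAL reports. The same physical
# disk is found no matter which /dev/nvmeXn1 name it got on this boot.
pick_by_serial() {
    want=$1; shift
    for entry in "$@"; do
        dev=${entry%%=*}
        serial=${entry#*=}
        if [ "$serial" = "$want" ]; then
            echo "$dev"
            return 0
        fi
    done
    return 1
}

# Boot A: the wanted disk enumerated second; boot B: it enumerated first.
pick_by_serial S1ABC123 /dev/nvme0n1=S9XYZ999 /dev/nvme1n1=S1ABC123   # -> /dev/nvme1n1
pick_by_serial S1ABC123 /dev/nvme0n1=S1ABC123 /dev/nvme1n1=S9XYZ999   # -> /dev/nvme0n1
```

        Either way, the serial resolves to the same physical disk, which is exactly what the name-based selection cannot promise.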

        Easier said than done, but my point is that getHardDisk is a best-guess algorithm at its core. It “seemed” better on older systems, but as new technologies and methods of reading data come about, there’s no reliable “this is definitely the drive this user wants the OS to sit on” method available to anyone.

        Please help us build the FOG community with everyone involved. It's not just about coding - way more we need people to test things, update documentation and most importantly work on uniting the community of people enjoying and working on FOG! Get in contact with me (chat bubble in the top right corner) if you want to join in.

        Web GUI issue? Please check apache error (debian/ubuntu: /var/log/apache2/error.log, centos/fedora/rhel: /var/log/httpd/error_log) and php-fpm log (/var/log/php*-fpm.log)

        Please support FOG if you like it: https://wiki.fogproject.org/wiki/index.php/Support_FOG

        • Floppyrub
          last edited by

          @Tom-Elliott Thanks for your detailed response. I can understand the problem well. On desktops, you can simply disconnect the HDDs if necessary. However, the day before yesterday, I had a laptop with two NVMe drives, where that would be very inconvenient. There, the installation also went to the second NVMe instead of the first.

          The hint about funcs.sh is extremely valuable. I had already come across the line /usr/share/fog/lib/funcs.sh in the file fog.custominstall. When I wanted to look at the file, I noticed that this directory does not exist in my installation. I had assumed it was old custom code I had written. Now I know that’s not the case. I will download the file from GitHub and place it on the system, and I’m quite confident that this will solve the problem, at least initially.

          The system was freshly installed at the end of August with version 1.5.10.1600, so the error must have occurred there.

          • Tom Elliott @Floppyrub
            last edited by

            @Floppyrub The code exists in the FOS system (the environment a machine boots into for a task), not on your server.


            • Floppyrub
              last edited by

              I wanted to just accept the situation as it is. However, I’ve already suffered a second painful data loss because I overlooked a SATA HDD and didn’t disconnect it. I don’t think it’s working the way it should. All my systems are Dell machines that were previously deployed using FOG, and the correct NVMe drive was always selected as the target before. Is there anything I can do to solve the problem?

              • Tom Elliott @Floppyrub
                last edited by Tom Elliott

                @Floppyrub Have you updated to the dev branch? You are free to run any task as a debug task, initially to validate things are working as expected before they get too far and cause any actual “data loss activities”.

                The latest FOS code is what the dev branch pulls in by default.

                If it helps you to see how it functions, the getHardDisk function starts at line 1501 of the code link I provided.

                If it helps to see the function as a whole:

                getHardDisk() {
                    hd=""
                    disks=""
                
                    # Get valid devices (filter out 0B disks) once, sort lexicographically for stable name order
                    local devs
                    devs=$(lsblk -dpno KNAME,SIZE -I 3,8,9,179,202,253,259 | awk '$2 != "0B" { print $1 }' | sort -u)
                
                    if [[ -n $fdrive ]]; then
                        local found_match=0
                        for spec in ${fdrive//,/ }; do
                            local spec_resolved spec_norm spec_normalized matched
                            spec_resolved=$(resolve_path "$spec")
                            spec_norm=$(normalize "$spec_resolved")
                            spec_normalized=$(normalize "$spec")
                            matched=0
                
                            for dev in $devs; do
                                local size uuid serial wwn
                                size=$(blockdev --getsize64 "$dev" | normalize)
                                uuid=$(blkid -s UUID -o value "$dev" 2>/dev/null | normalize)
                                read -r serial wwn <<< "$(lsblk -pdno SERIAL,WWN "$dev" 2>/dev/null | normalize)"
                
                                [[ -n $isdebug ]] && {
                                    echo "Comparing spec='$spec' (resolved: '$spec_resolved') with dev=$dev"
                                    echo "  size=$size serial=$serial wwn=$wwn uuid=$uuid"
                                }
                                if [[ "x$spec_resolved" == "x$dev" || \
                                      "x$spec_normalized" == "x$size" || \
                                      "x$spec_normalized" == "x$wwn" || \
                                      "x$spec_normalized" == "x$serial" || \
                                      "x$spec_normalized" == "x$uuid" ]]; then
                                    [[ -n $isdebug ]] && echo "Matched spec '$spec' to device '$dev' (size=$size, serial=$serial, wwn=$wwn, uuid=$uuid)"
                                    matched=1
                                    found_match=1
                                    disks="$disks $dev"
                                    # remove matched dev from the pool
                                    devs=${devs// $dev/}
                                    break
                                fi
                            done
                
                            [[ $matched -eq 0 ]] && echo "WARNING: Drive spec '$spec' does not match any available device." >&2
                        done
                
                        [[ $found_match -eq 0 ]] && handleError "Fatal: No valid drives found for 'Host Primary Disk'='$fdrive'."
                
                        disks=$(echo "$disks $devs" | xargs)   # add unmatched devices for completeness
                
                    elif [[ -r ${imagePath}/d1.size && -r ${imagePath}/d2.size ]]; then
                        # Multi-disk image: keep stable name order
                        disks="$devs"
                    else
                        if [[ -n $largesize ]]; then
                            # Auto-select largest available drive
                            hd=$(
                                for d in $devs; do
                                    echo "$(blockdev --getsize64 "$d") $d"
                                done | sort -k1,1nr -k2,2 | head -1 | cut -d' ' -f2
                            )
                        else
                            for d in $devs; do
                                hd="$d"
                                break
                            done
                        fi
                        [[ -z $hd ]] && handleError "Could not determine a suitable disk automatically."
                        disks="$hd"
                    fi
                
                    # Set primary hard disk
                    hd=$(awk '{print $1}' <<< "$disks")
                }
                

                Ultimately, the part I’m worried about is the sort -u, as that will lexicographically sort the drives regardless of the order lsblk returns them in (which is the part I was stating earlier: there’s no true OS load order, as PCI tends to come up faster than serial and parallel buses).

                I have adjusted the code slightly and am rebuilding with that adjustment at the beginning of the function, where we get all available drives:

                devs=$(lsblk -dpno KNAME,SIZE -I 3,8,9,179,202,253,259 | awk '$2 != "0B" { print $1 }' | sort -u)
                

                Instead of sort -u I’m going to try:

                devs=$(lsblk -dpno KNAME,SIZE -I 3,8,9,179,202,253,259 | awk '$2 != "0B" && !seen[$1]++ { print $1 }')
                

                Basically, that will get only unique drive entries but keep them in the order in which lsblk sees the drives.
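                The difference between the two pipelines shows up as soon as name order and discovery order disagree; with hypothetical device names (NVMe discovered first, an IDE HDD listed twice):

```shell
# sort -u: unique entries, but lexicographic order wins,
# so the HDD jumps ahead of the NVMe.
printf '%s\n' /dev/nvme0n1 /dev/hda /dev/hda | sort -u
# -> /dev/hda
#    /dev/nvme0n1

# awk '!seen[$1]++': unique entries, discovery order preserved.
printf '%s\n' /dev/nvme0n1 /dev/hda /dev/hda | awk '!seen[$1]++'
# -> /dev/nvme0n1
#    /dev/hda
```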

                I doubt this will “fix” the issue you’re seeing, but it’s worth noting.

                I still need to clarify, however, that this isn’t a fault in the code. There’s no guaranteed method to ensure we always get the right drive, because on newer systems the device that gets one label this cycle can easily get another label the next cycle.

                IDE drives will always load as hda, hdb, hdc, hdd; that is about the only “guarantee” we can give.

                Serial buses (USB, SATA, etc.): SATA generally loads in channel order, but USB might or might not load first, so something on USB might take /dev/sda on this boot, while on the next the channel 0 controller might take /dev/sda.

                NVMe: what’s nvme0n1 on this cycle might become nvme1n1 on the next.

                This is why the function you see is “best guess” at best.

                I do want to make this more stable on your side of things, for sure, but I want to be clear about what you’re seeing: there is no way to guarantee we always pick the “right” drive.


                • Fog_Newb @Floppyrub
                  last edited by Fog_Newb

                  @Floppyrub said in Wrong target device:

                  @Tom-Elliott Thanks for your detailed response. I can understand the problem well. On desktops, you can simply disconnect the HDDs if necessary. However, the day before yesterday, I had a laptop with two NVMe drives,

                  I’ve run into this problem on a PC that had two NVMe drives. My understanding of the reason is that sometimes one NVMe initializes first, so /dev/nvme0n1 is sometimes /dev/nvme1n1. If the drives are different sizes, you can specify the size of the target drive as the Host Primary Disk (or at least you could). Currently I have two NVMe drives of the same size, so I use the serial number of the drive as the Host Primary Disk. It works.
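                  For anyone wanting to do the same, the identifiers a disk can be matched on are easy to read on the target machine, for example from a FOG debug task. This assumes a reasonably current util-linux lsblk; exact column support can vary by version:

```shell
# Whole disks only (-d), full paths (-p), no header (-n):
# kernel name, size, serial number, and WWN. Any of these values
# can be tried in the host's "Host Primary Disk" field.
lsblk -dpno KNAME,SIZE,SERIAL,WWN
```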

                  1 Reply Last reply Reply Quote 0
                  Copyright © 2012-2025 FOG Project