Tom and I chatted about this issue a little this morning. I think I have an explanation of why it worked when you removed one of the disks from the array.
Understand that these are only educated guesses about what might be going on.
Tom said you shouldn’t be able to image any array that is degraded or rebuilding.
I said that the error indicated the hardware couldn't keep up with the processor, causing a thread to time out.
You created the array using the built-in Intel RAID firmware and then attempted to image the computer.
You couldn’t image because the inits were missing a critical utility.
Your RAID controller is one of those “fake-raid” controllers, otherwise known as hardware-assisted software RAID.
Hardware-assisted software RAID relies on the main CPU for array activities. Unlike a hardware RAID controller, which has its own processor to manage the array, a software RAID uses the main CPU’s spare processing capacity.
Once the inits were fixed, you booted into FOG and attempted to image the computer with the array in place.
When FOS loaded on the target computer, it saw the array was uninitialized, so Linux started to rebuild it.
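Just to make that mechanism concrete: when mdadm assembles an Intel fake-raid (IMSM) array, the kernel exposes it as an md device and reports any resync/recovery in /proc/mdstat. Below is a minimal sketch, not FOG code, that parses /proc/mdstat to see whether a rebuild is in flight; device names and the exact wording in mdstat will vary from system to system.

```python
# Minimal sketch: check whether the Linux md subsystem is currently
# resyncing/recovering any array. Assumes the Intel fake-raid array was
# assembled by mdadm (typical for IMSM), so it shows up in /proc/mdstat.
import re

def md_rebuild_status(mdstat_path="/proc/mdstat"):
    """Return {md device: percent complete} for arrays that are resyncing
    or recovering, or an empty dict if everything is idle."""
    status = {}
    current_dev = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current_dev = m.group(1)
                continue
            # Progress lines look like: "[=>....]  resync = 9.8% (96435584/976759808) ..."
            m = re.search(r"(resync|recovery)\s*=\s*([\d.]+)%", line)
            if m and current_dev:
                status[current_dev] = float(m.group(2))
    return status

if __name__ == "__main__":
    rebuilding = md_rebuild_status()
    if rebuilding:
        for dev, pct in rebuilding.items():
            print(f"{dev}: rebuild/resync {pct:.1f}% complete")
    else:
        print("No md arrays are resyncing right now.")
```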
This is where things went sideways.
Now consider what is going on here. The Linux OS is trying to rebuild the uninitialized array using the main CPU. At the same time, FOG is trying to push the image to the disk subsystem as fast as it can. So the two end up fighting over the same CPU and disks.
The OS is busy rebuilding the array at some rate; for the sake of argument, let's say 70 MB/s, which works out to about 4.2 GB/min. My FOG server and target computer can push images at 6.2 GB/min. At some point we are going to have a data collision between the array being rebuilt and FOG laying the image down on the disk. This may explain why it gets to about 5 GB deployed and then crashes.
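Purely back-of-envelope, here is the arithmetic behind that claim. The resync rate (70 MB/s) and the deploy rate (6.2 GB/min) come from above; the resync "head start" is a made-up input, since I don't have real timings. If both walk the disk from the front, a head start of roughly 20–30 seconds puts the crossover somewhere around the ~5 GB mark where the deploy dies.

```python
# Back-of-envelope only: estimate where the image writer would overtake the
# resync front, given the rates quoted above. The resync head start (time
# between Linux starting the rebuild and FOG starting to write) is a
# hypothetical input, not a measured value.
RESYNC_GB_PER_MIN = 70 * 60 / 1000      # 70 MB/s  -> ~4.2 GB/min
IMAGE_GB_PER_MIN = 6.2                  # observed FOG deploy rate

def gb_deployed_at_collision(resync_head_start_min):
    """GB of image written when the writer catches the resync front,
    assuming both move through the disk linearly from the start."""
    head_start_gb = RESYNC_GB_PER_MIN * resync_head_start_min
    catch_up_min = head_start_gb / (IMAGE_GB_PER_MIN - RESYNC_GB_PER_MIN)
    return IMAGE_GB_PER_MIN * catch_up_min

for head_start_s in (15, 30, 60):
    gb = gb_deployed_at_collision(head_start_s / 60)
    print(f"{head_start_s:>3}s resync head start -> collision at ~{gb:.1f} GB deployed")
```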
The other side of this is that the array is being rebuilt at the same time FOG is laying down the image, so the FOG thread has to wait far too long for the disk subsystem to respond and times out.
Understand that the above is just based on noodling over this issue all day and not on any hard evidence.
In my case, on my test system, I found the utility was missing, so I copied it from another server. Once the array came up, the OS started rebuilding it. Tom patched the inits and wanted me to confirm they worked before he pushed them to the production server. I was in debug mode just watching the array rebuild, so I updated the inits and rebooted the test box (mind you, I was doing this debugging remotely). The box rebooted, but I forgot it was in debug deploy mode, so I lost remote access to it. Deciding to call it a night, I logged out of the dev system. That dev PXE target computer then ran all night, and since the OS was up, it kept rebuilding the array. When I came in the next morning, I keyed in fog on the console and it started to deploy. Since I was only pushing out a 5GB image AND the array was already built, it deployed correctly.
So what to do for the next time?
I guess if you build an array and it's still initializing, PXE boot the target computer into hardware compatibility mode and let it sit until its disk activity is done. If we run across a lot of RAID systems, we might add a function to the hardware compatibility tests to report the percentage of array synchronization. I wouldn't want to ask the developers to spend time on this for just a one-off situation, but it's something to consider.
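For what it's worth, here is a rough idea of what such a check could look like. This is not FOG code and not a spec, just a sketch that reads the standard md sysfs attributes (sync_action and sync_completed) that exist once mdadm has assembled the array; the IMSM container itself may not expose them, only the member arrays.

```python
# Rough idea only: what a "report array sync percentage" check might look
# like if it were ever added to the hardware compatibility tests. Reads the
# standard md sysfs attributes; NOT part of FOG/FOS today.
import glob
import os

def report_md_sync():
    for md_dir in sorted(glob.glob("/sys/block/md*/md")):
        dev = os.path.basename(os.path.dirname(md_dir))   # e.g. "md126"
        try:
            with open(os.path.join(md_dir, "sync_action")) as f:
                action = f.read().strip()
        except OSError:
            continue                                       # containers may lack sync_action
        if action == "idle":
            print(f"{dev}: in sync")
            continue
        with open(os.path.join(md_dir, "sync_completed")) as f:
            completed = f.read().strip()                   # "done / total" or "none"
        if "/" in completed:
            done, total = (int(x) for x in completed.split("/"))
            pct = 100.0 * done / total if total else 0.0
            print(f"{dev}: {action} {pct:.1f}% complete")
        else:
            print(f"{dev}: {action} in progress")

if __name__ == "__main__":
    report_md_sync()
```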