unable to deploy RAID 1 disk
-
@george1421 Just a note to myself:
[Wed May 31 root@fogclient ~]# mdadm --create --verbose /dev/md/imsm /dev/sd[a-b] --raid-devices 2 --metadata=imsm
[Wed May 31 root@fogclient ~]# mdadm -C /dev/md124 /dev/md125 -n 2 -l 1
mdadm: array /dev/md124 started.
mdadm: failed to launch mdmon. Array remains readonly
Ok, after about 5 hours of working on this I have a solution. There is a missing array management utility (mdmon) that needs to be in FOS to get the array to switch from active (read-only) to active (auto-read-only) [a small but important difference]. Once I copied the utility over and recreated the array by hand, it started syncing (rebuilding) the RAID 1 array. I need to talk to the developers to see if we can get this utility built into FOS.
The document that led to the solution: https://www.spinics.net/lists/raid/msg35592.html
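For the record, this is roughly what the manual fix looked like (a sketch; the source host for mdmon is hypothetical, and the device names are from my notes above and will vary per system):
# copy the missing array manager into FOS from another Linux box
# (hypothetical source host; any box with mdadm/mdmon installed should work)
scp root@someserver:/sbin/mdmon /sbin/mdmon
chmod +x /sbin/mdmon
# recreate the imsm container and the RAID 1 volume inside it
mdadm --create --verbose /dev/md/imsm /dev/sd[a-b] --raid-devices 2 --metadata=imsm
mdadm -C /dev/md124 /dev/md125 -n 2 -l 1
# with mdmon available the volume should now show (auto-read-only) and start syncing
cat /proc/mdstat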
-
@george1421 said in unable to deploy RAID 1 disk:
Ok after about 5 hours of working on this I have a solution
5 hours! It's unbelievable!!
So, if you want the output of some test commands, tell us and we will gladly help you debug FOG. Many thanks for your investigation!
-
OK, we have a functional fix in place now. This fix will only work for FOG 1.4.0 and 1.4.1. You will need to go to where you installed FOG from: for git installs it may be /root/fogproject, for svn it may be /root/fog_trunk, or whatever you used. The idea is that there are
binariesXXXXXX.zip
files there, one for each version of FOG you installed. Remove all of those files; the FOG installer will download what it needs again. Once those files are removed, rerun the installer with its default values (you already configured it). This will download the updated kernels and inits from the FOG servers.
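Something like this, assuming a git install under /root/fogproject (adjust the path for svn or wherever you installed from):
cd /root/fogproject
find . -name 'binaries*.zip' -delete   # remove the cached kernel/init bundles
cd bin
./installfog.sh                        # rerun with your existing defaults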
Now PXE boot your target computer with the RAID using a debug deploy [or capture, depending on what you wanted to do] like was done before. Once you are at the FOS command prompt, key in
cat /proc/mdstat
If md126 now says (auto-read-only), then you win!! If it still says (read-only), then you might not have the most current inits; we will deal with that once we see the output of cat /proc/mdstat. I was able to deploy an image to a test system in the lab using the Intel RAID, so I know it does work with the new inits.
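For reference, a healthy result looks roughly like this (the device names, sizes, and layout are illustrative, not captured from this thread):
Personalities : [raid1]
md126 : active (auto-read-only) raid1 sda[1] sdb[0]
      312568832 blocks super external:/md127/0 [2/2] [UU]
md127 : inactive sda[1](S) sdb[0](S)
      5288 blocks super external:imsm
unused devices: <none>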
-
@george1421 Correction, it will only work for 1.4.1
-
@george1421 You are absolutely great!!
-
@Tom-Elliott said in unable to deploy RAID 1 disk:
@george1421 Correction, it will only work for 1.4.1
Then corrected, I do stand.
-
@george1421
I have downloaded fog.1.4.1.tgz. After extracting this file, I removed the binaries1.4.1.zip file, then started
./installfog.sh
After the installation finished, I checked md126 in debug mode. It was not read-only.
Then I started to deploy the image.
It looks good. I have to leave the office; we will see the result tomorrow.
Thanx for everything.
-
@eistek That first screenshot is perfect!!
What it tells us is that the newly created RAID array is resyncing (copying the master disk to the slave disk and thus building the array). It is rebuilding the array at 70MB/s (which is about the top speed of your SATA disks).
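If you want to watch the rebuild progress from the debug shell, something like this works (assuming the watch applet is in the init; if not, just re-run the cat by hand):
watch -n 5 cat /proc/mdstat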
-
@eistek said in unable to deploy RAID 1 disk:
i have changed undionly.kpxe to undionly.kkpxe
The iPXE kernel only manages the FOG iPXE menu and the launching of the FOS image. Once FOS Linux has started, the iPXE kernel (undionly.kpxe) is discarded, so changing this boot kernel will not have an impact on the error message you now have.
What these images show is that the FOS Linux kernel is crashing. This is typically related to a hardware error. This is only a wild guess, but I would say memory (a RAM chip) or a hard drive.
-
@eistek Do you have a smaller image you can deploy to this system? The image doesn't have to run on the target computer; I'm more interested in whether it deploys completely to the target computer. I'm looking for one in the 20-40GB range to test deployment. 135GB is a bit rare and I can't say for sure whether FOG can handle that size; I simply don't know.
According to the partclone info, the image failed at about 28GB of the file being transferred to the target computer.
-
Here is my solution:
1- I created a new, clean RAID.
2- I added mdraid=true and /dev/md126 to my host configuration (see the verification sketch after this list).
3- I removed one of the RAID disks from the PC.
4- I started to deploy. Deploying the image finished without any error.
5- I restarted the PC and Windows started without any problem.
6- Then I powered off the PC, plugged the second disk back in, and turned it on.
7- The RAID started to rebuild.
8- The rebuild finished and everything looks well now.
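For anyone following along, before deploying you can confirm the host settings took effect by booting the target in debug mode and checking the array (a quick sketch; md126 matches the device used above):
cat /proc/mdstat                  # the RAID volume should appear as md126
mdadm --detail /dev/md126         # state, level, and member disks
-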
@eistek Well done!!
-
I don’t know why, but if I try to deploy with both disks plugged in, it gives a kernel panic.
Special thanx to @george1421 @Jonathan-Cool @Tom-Elliott
-
@eistek Well, it's nice that you found a solution, but you had to work physically at the machine to do it. It would be really nice if the problem could be solved while deploying to a working RAID, instead of letting it rebuild after deploying to just a single drive.
Regards X23
-
@eistek It's hard to describe this, but the error means that the hardware can't keep up with the CPU. It's like (not technically correct) the disk subsystem can't keep up with the volume of data being written to the disks, and you get buffer overruns in the disks. From researching this, the error happens more often when writing a lot of data to the local console and the console can't keep up with the data stream.
-
Tom and I chatted about this issue a little this morning. I think I have an explanation of why it worked when you removed one of the disks from the array.
Understand these are only anecdotal understandings of what might be going on.
- Tom said you shouldn’t be able to image any array that is degraded or rebuilding.
- I said that the error was representing the hardware can’t keep up with the processor causing a thread to timeout.
- You created the array using the built-in Intel RAID firmware and then attempted to image the computer.
- You couldn't image because the inits were missing a critical utility.
- Your RAID controller is one of those “fake-raid” controllers, otherwise known as hardware-assisted software RAID.
- Hardware-assisted software RAID relies on the main CPU for array activities. Unlike a hardware RAID, which has its own processor to manage the array, a software RAID uses the main CPU's spare processing capacity.
- Once the inits were fixed, you booted into FOG and attempted to image the computer with an array.
- At the time FOS loaded on the target computer, it saw the array was uninitialized, so Linux started to rebuild the array.
This is where things went sideways.
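As an aside, a quick way to confirm from a FOS debug shell that you are on one of these Intel fake-raid (imsm) setups (a sketch; sda is just an example member disk):
mdadm --detail-platform          # shows the controller's Intel Matrix (imsm) capabilities
mdadm --examine /dev/sda         # member disks carry imsm container metadata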
Now consider what is going on here. The Linux OS is trying to rebuild the uninitialized array using the main CPU. At the same time, FOG is trying to push the image to the disk subsystem as fast as it can. So this is where you have a chicken-and-egg situation.
The OS is busy building the array at some rate; for the sake of argument, let's say 70MB/s, which works out to 4.2GB/min. My FOG server and target computer can push images at 6.2GB/min. At some point we are going to have a data collision between the array being built and FOG laying the image down on the disk. This may explain why it gets to about 5GB deployed and then crashes.
The other side of this is that the array is being built at the same time FOG is laying down the image, and the FOG thread has to wait too long for the disk subsystem to complete, so it times out.
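Purely as a thought experiment (this is not something the FOG inits do today), the kernel does let you throttle the md rebuild rate from a debug shell, which would keep the resync from fighting the deployment for disk bandwidth:
# cap the background rebuild while imaging (value is in KB/s per disk)
echo 1000 > /proc/sys/dev/raid/speed_limit_max
# restore the stock ceiling afterwards (200000 is the kernel default)
echo 200000 > /proc/sys/dev/raid/speed_limit_max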
Understand the above is just based on noodling about this issue all day and not really based on any hard evidence.
In my case, on my test system, I found the utility was missing, so I copied it from another server. Once the array came up, the OS started building it. Tom patched the inits and wanted me to confirm they worked before he pushed them to the production server. I was in debug mode just watching the array rebuild, so I updated the inits and rebooted the test box (mind you, I was doing this debugging remotely). The box rebooted, but I forgot it was in debug deploy mode, so I lost access to it remotely. Deciding to call it a night, I logged out of the dev system. That dev PXE target computer was then running all night, and since the OS was up, it was rebuilding the array. When I came in in the morning, I keyed in
fog
on the console and it started to deploy. Since I was only pushing out a 5GB image AND the array was already built, it deployed correctly. So what to do next time?
I guess if you are building an array, once it's initialized, PXE boot the target computer into hardware compatibility mode and let it sit until its disk activity is done. If we run across a lot of RAID systems, we might add a function to the hardware compatibility tests to report the percentage of array synchronization. I wouldn't want to ask the developers to spend time on this for just a one-off situation, but it's something to consider.
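Until something like that exists, a rough way to check the sync state by hand from a debug shell (md126 as used earlier in the thread):
cat /proc/mdstat                      # shows a progress bar and percentage while resyncing
cat /sys/block/md126/md/sync_action   # prints "idle" once the rebuild is done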