Could not open inode xxxxx through the library - HP EliteDesk 705 G3 Mini
-
Hello,
thank you for the prompt answer!
Well, I agree that the messages are not exactly the same and that the inode (in my case) has a different value from the image on the link. I added the exact number to be precise. Good that you noticed this, as we (developers) really need to pay attention to the details.
Well, thanks for the provided link. Sorry for not including it before, but I had already found it. Actually, it was from this message that I learned about the chkdsk and shutdown procedure. I had already done both before posting here, but forgot to mention it. No, it didn’t solve the issue. Sorry, but I have been working on this for a long time and was too tired when writing this. And yes, it was always complaining about the same inode number I wrote before. I also ran chkdsk and defrag, but nothing changed.
It is nice to hear that I have read and done the right things. Looks like I’m on the right track, which is encouraging.
Here is a picture of the problem.
And the second screen.
Tomorrow I will try the reboot test you mentioned. But first, there are new details I observed during my last tests.
First, it seems related to the way the disk is captured. This issue only happens when I select the first approach for disk capturing (whole disk, resizable image). I’m not at work now, so… can’t provide the exact option. But when the second method is selected, combined with “partClone (gzip)”, I was able to correctly capture a valid disk image. The folder was there, with all the files related to my seven partitions and with about 100GB of compressed files. Most probably there is a real issue here (somewhere), which was only circumvented by changing the cloning approach.
If you wish I can reproduce the problem, in order to help nail down the root cause. But first, I really need to prepare my labs.
Well, after getting the disk image, I took another machine with exactly the same hardware (brand, disk, RAM, AMD CPU, devices, etc.) and tried to deploy it. My test setup is currently very simple: two machines connected only by a crossover cable. This is to test if everything will work. As soon as the image is deployed successfully, I will perform a full deploy.
The deploy test failed near the end. It started well and wrote one or two small partitions. On the Windows partition, it failed. This partition has about 500 GB of space, with about 130 GB being used. The procedure was going just fine for about 25 minutes. After several GB had been transferred, the client machine crashed with only about 8 minutes left to finish the partition. It is important to point out that the network card and cable were working OK during this time.
The failure was the tg3 timeout bug on the client machine. And now it has become the (currently) most important bug for me to fix (or avoid, somehow). Unfortunately, today I could test it only once. I wonder if, by trying it a few more times, it would manage to get to the end. If the problem is related to a concurrency problem, maybe I will get lucky.
On the client, the last message was: “tg3_stop_block timed out, ofs=4c00 enable_bit=2”. This is from the tg3 kernel module, responsible for the Tigon3 wired network device.
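In case it helps to pull this message out of a longer kernel log (for example saved dmesg output), here is a minimal Python sketch. The function name `find_tg3_timeouts` is made up for illustration, and treating both `ofs` and `enable_bit` as hexadecimal is an assumption based only on how the quoted message looks.

```python
import re

# Pattern for the message quoted above. The field names "ofs" and
# "enable_bit" are copied from that message; parsing both as hexadecimal
# is an assumption.
TG3_TIMEOUT = re.compile(
    r"tg3_stop_block timed out, ofs=(?P<ofs>[0-9a-fA-F]+) "
    r"enable_bit=(?P<bit>[0-9a-fA-F]+)"
)


def find_tg3_timeouts(log_lines):
    """Return (ofs, enable_bit) integer pairs for each tg3_stop_block timeout line."""
    hits = []
    for line in log_lines:
        match = TG3_TIMEOUT.search(line)
        if match:
            hits.append((int(match.group("ofs"), 16), int(match.group("bit"), 16)))
    return hits


if __name__ == "__main__":
    # Sample input: the single message reported in this post.
    sample = ["tg3_stop_block timed out, ofs=4c00 enable_bit=2"]
    for ofs, bit in find_tg3_timeouts(sample):
        print(f"tg3_stop_block timeout: ofs=0x{ofs:x} enable_bit=0x{bit:x}")
```

Counting how many of these lines show up per deploy attempt would be one way to tell whether the timeout is a one-off or happens on every run.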
On the server, the kernel log shows an ERROR related to a FIFO underrun. Here are the pictures.
Tomorrow I will run more tests and try to provide more information.
Best regards,
Paulo
-
@Paulo-Guedes As you provide great information I think we should be able to figure this out. I already had a look at the ntfs-3g code but have not seen an obvious issue yet.
So as not to mix things up, may I ask you to split the two issues into different posts? It will be way clearer if we don’t cross-talk about different subjects in one post. Just open a new one and edit/copy the tg3-related part over. You can simply edit your last post…
Surely the second image type is non-resizable and therefore does not show the error. The message comes from the third-party tool we use to make resizable images. If all your disks have identical sector counts you could just stick to that image type.
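A quick way to confirm the "identical sector counts" condition across machines could look like the sketch below. It assumes a Linux environment where the disk is `sda`, so that `/sys/block/sda/size` reports the size in 512-byte sectors. The helper names and the host names and numbers in the usage part are hypothetical.

```python
from collections import Counter
from pathlib import Path


def read_sector_count(device="sda"):
    """Read the disk size in 512-byte sectors from Linux sysfs."""
    return int(Path(f"/sys/block/{device}/size").read_text().strip())


def mismatched_hosts(sector_counts):
    """Given {host: sector_count}, return the hosts that differ from the most common count."""
    if not sector_counts:
        return {}
    reference, _ = Counter(sector_counts.values()).most_common(1)[0]
    return {host: count for host, count in sector_counts.items() if count != reference}


if __name__ == "__main__":
    # Hypothetical values collected by running read_sector_count() on each machine.
    collected = {"pc01": 1000215216, "pc02": 1000215216, "pc03": 976773168}
    print(mismatched_hosts(collected))
```

If the returned dictionary is empty, all collected disks report the same sector count and the non-resizable image type should fit every one of them.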
-
Hello, I managed to avoid the issue, but it’s still not solved. However, now I can work by using another setting.
After a good night of sleep, I tested more things and nothing seemed to be wrong. Then I changed some configurations, then replaced the crossover cable and… it worked.
However, it took a lot of observation and discussion to discover the (probable) root cause for the timeout issue. The cable had a tiny broken piece: the plastic latch. It should lock the cable end inside the network connector. Without this, I am assuming that the cable was subtly unstable. It worked reasonably well, as long as no one touched the machines, cable or the table.
After half an hour and several dozen GB transferred, I am guessing that it was slightly displaced from the connector. Not enough to bring down the network link, but just enough to break the link stability and cause the timeout in the tg3 kernel module.
Again: this is an assumption, made after eliminating a lot of other possible issues. But I had no time to carefully reproduce the issue, and check if this was actually the cause.
Other than this problem, there was another one. When configuring the image to be captured with the first approach (entire disk, resizable), every attempt just failed. But when I changed to the second (partClone (gzip)), it succeeded.
This failure was the inode issue. If you wish, I can reproduce the problem to try to gather more information. If I remember correctly, this problem was already present about four years ago.
Or maybe I’m just doing something wrong in the setup. Who knows.
What do you think?
Regards,
Paulo
-
Sorry, but I’m still not used to this forum. Just answered your message using the wrong button. Please take a look when it’s possible. At least now I am ready to clone everything next Monday. So, if you think it’s helpful, I can take a look at the issues in more detail, as soon as this deploy is ready (about 100 machines).
Thank you,
Paulo
-
@sebastian-roth Hello Sebastian, you’re right, thanks for helping me. Here is my new post.
https://forums.fogproject.org/topic/10731/crash-due-to-timeout-in-tg3-kernel-module-tg3_stop_block-timed-out-ofs-4c00-enable_bit-2
I confirm that my machines have identical sector counts. They were bought in a batch and are all of exactly the same type and manufacturer, with the same devices inside. Or at least, they should be.
I’m glad that you think I provided great information. I can provide more if you wish, including from dmesg, the debug deploy shell or from the server. You name it.
The problem is still here and it’s preventing me and all my colleagues from working. My department is almost completely stuck due to this issue :(
So, thank you very much (in advance) for any help you can provide.
Paulo
-
@sebastian-roth Just learning how to reply in this forum. Yes, I tried to reboot into Windows. No, the numbers didn’t change, unless I perform a defrag operation. A full disk scan with chkdsk was performed, but nothing changed. The machines are brand new, they arrived two or three weeks ago. I don’t think the problem is related to a bad disk, especially because it happened on several machines without further obvious issues. I mean, when the cloning works, the underlying OS boots and works without a glitch (quick test, less than an hour).
About SMART data: it’s good to check, will try tomorrow.
Paulo
-
@Paulo-Guedes Now let’s see if we can figure out this one as well.
The machines are brand new, they arrived two or three weeks ago. I don’t think the problem is related to a bad disk
You should be right, let’s assume the disks are ok. Check the SMART values just to be safe.
especially because it happened on several machines without further obvious issues
Hmm, this makes me think… This sounds like the issue is not on all machines? Did all of those come pre-installed? Or did you install the OS yourself? If they came pre-installed, then take one of the machines, wipe the disk and install a plain Windows 10 build 1703 by hand to see if you’d still run into the same issue with that!
As well, please post the contents of the files d1.partitions and d1.minimal.partitions here. You should find those on your FOG server disk in the /images/c8d3ff001d64/ directory.
I just spotted something in the picture you posted that looks weird at first sight but most probably is not causing this. It says: Args Passed: /dev/sda3 /images/c8d3ff001d64/d1.original.fstypes :1:2:6:5. So partitions 6 and 5 are flipped in the fixed partitions list. Don’t think this is causing a problem but maybe this is a hint on why things go wrong…
As all disks seem to have the same sector count you could just go with the non-resizable image type for now. This should clearly work around the issue till we are able to figure out where it comes from.
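To make the flipped order in the ":1:2:6:5" list easy to spot in other captures too, here is a small Python sketch. It assumes only that the list is a colon-separated string of partition numbers, as it appears in the "Args Passed" line. The function names are hypothetical and this is not how FOG itself processes the list.

```python
def parse_partition_list(field):
    """Split a colon-separated list such as ':1:2:6:5' into integers."""
    return [int(part) for part in field.split(":") if part]


def out_of_order_pairs(numbers):
    """Return adjacent pairs where the later number is smaller than the earlier one."""
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b < a]


if __name__ == "__main__":
    # The list exactly as shown in the "Args Passed" line.
    partitions = parse_partition_list(":1:2:6:5")
    print("partitions:", partitions)
    print("out of order:", out_of_order_pairs(partitions))
```

An empty result from `out_of_order_pairs` would mean the list is in ascending order. For the value from the picture it reports the 6/5 pair.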
-
@Paulo-Guedes Could you please post the requested information so we can look into this and possibly help you?
-
@sebastian-roth
Hello,
working with the non-resizable image type really helped. Since that allowed us to capture an image, I am focusing now a bit more on the “tg3 timeout issue”. But I will post the information you asked for, as soon as things calm down in here. Probably in a few days. Sorry, but that’s all we can do now, until our labs are up and running again.
Regards,
Paulo