Posts made by Paulo.Guedes

Paulo.Guedes

Hello, I managed to work around the issue, but it’s still not solved. For now, I can keep working by using another setting.

After a good night of sleep, I tested more things and nothing seemed to be wrong. Then I changed some configurations, then replaced the crossover cable and… it worked.

However, it took a lot of observation and discussion to discover the (probable) root cause of the timeout issue. The cable had a tiny broken piece: the plastic latch, which should lock the cable end inside the network connector. Without it, I am assuming that the cable was subtly unstable. It worked reasonably well, as long as no one touched the machines, the cable or the table.

After half an hour and several dozen GB transferred, I am guessing that it was slightly displaced from the connector: not enough to bring down the network link, but just enough to break the link stability and cause the timeout in the tg3 kernel module.

Again: this is an assumption, made after eliminating a lot of other possible issues. But I had no time to carefully reproduce the issue, and check if this was actually the cause.
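If I get the time to reproduce it, one way to check the cable hypothesis would be to compare the interface error counters before and after a transfer. This is only a minimal sketch, assuming a Linux host that exposes the usual counters under /sys/class/net/<iface>/statistics. The interface name and the helper names (read_counters, counter_delta) are placeholders I made up for illustration.

```python
from pathlib import Path

# Counters that tend to climb when the physical link is marginal.
COUNTERS = ("rx_errors", "rx_crc_errors", "tx_errors")


def read_counters(iface, sysfs_root="/sys/class/net"):
    """Return the current value of each available counter for one interface."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    values = {}
    for name in COUNTERS:
        counter_file = stats_dir / name
        # Skip counters that are not exposed for this interface.
        if counter_file.exists():
            values[name] = int(counter_file.read_text().strip())
    return values


def counter_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}
```

The idea would be to take one snapshot before starting the deploy and another after it ends (or fails). A clear jump in the CRC or receive errors would point to the cable or connector rather than to the driver.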

Other than this problem, there was another one. When configuring the image to be captured with the first approach (entire disk, resizable), every attempt just failed. But when I changed to the second (partClone (gzip)), it succeeded.

This failure was the inode error. If you wish, I can reproduce the problem to try to gather more information. If I remember correctly, this problem was already present about four years ago.

Or maybe I’m just doing something wrong in the setup. Who knows.

What do you think?

Regards,
Paulo

Paulo.Guedes

Hello,
thank you for the prompt answer!

Well, I agree that the messages are not exactly the same and that the inode (in my case) has a different value from the image on the link. I added the exact number to be precise. Good that you noticed this, as we (developers) really need to pay attention to the details.

Well, thanks for the provided link. Sorry for not including it before, but I had already found it. Actually, it was from this message that I learned about the chkdsk and shutdown procedure. I had already done both before posting here, but forgot to mention it. No, it didn’t solve the issue. Sorry, but I have been working on this for a long time and was too tired when writing. And yes, it was always complaining about the same inode number I wrote before. I also ran chkdsk and defrag, but nothing changed.

It is nice to hear that I have read and done the right things. Looks like I’m on the right track, which is encouraging.

Here is a picture of the problem.

And the second screen.

Tomorrow I will try the reboot test you mentioned. But before that, there are new details I observed during my last tests.
First, it seems related to the way the disk is captured. This issue only happens when I select the first approach for disk capturing (whole disk, resizable image). I’m not at work now, so… can’t provide the exact option.

But when the second method is selected, combined with “partClone (gzip)”, I was able to correctly capture a valid disk image. The folder was there, with all the files related to my seven partitions and with about 100GB of compressed files. Most probably there is a real issue here (somewhere), which was only circumvented by changing the cloning approach.

If you wish, I can reproduce the problem in order to help nail down the root cause. But first, I really need to prepare my labs.

Well, after getting the disk image, I took another machine with exactly the same hardware (brand, disk, RAM, AMD CPU, devices, etc.) and tried to deploy the image to it. My test setup is currently very simple: two machines connected only by a crossover cable. This is to test whether everything works. As soon as the image is deployed successfully, I will perform a full deploy.

The deploy test failed near the end. It started well and wrote one or two small partitions. On the Windows partition, it failed. This partition has about 500 GB of space, with about 130 GB in use. The procedure ran just fine for about 25 minutes. After several GB had been transferred, the client machine crashed with only about 8 minutes left to finish the partition. It is important to point out that the network card and cable were working OK during this time.

The failure was the tg3 timeout bug on the client machine. It has now become the most important bug for me to fix (or avoid, somehow). Unfortunately, today I could test it only once. I wonder whether, by trying a few more times, it would manage to get to the end. If the problem is related to a concurrency problem, maybe I will get lucky.

On the client, the last message was: “tg3_stop_block timed out, ofs=4c00 enable_bit=2”. This is from the tg3 kernel module, the driver for the Broadcom Tigon3 wired network device.
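To compare several runs, the timeout lines could be pulled out of a saved kernel log with something like the sketch below. It assumes the log was saved as plain text and that both values are printed in hexadecimal, which is how I read the message. The function name find_tg3_timeouts is just a placeholder.

```python
import re

# Matches lines such as: tg3_stop_block timed out, ofs=4c00 enable_bit=2
TG3_TIMEOUT = re.compile(
    r"tg3_stop_block timed out, "
    r"ofs=(?P<ofs>[0-9a-fA-F]+) enable_bit=(?P<bit>[0-9a-fA-F]+)"
)


def find_tg3_timeouts(log_text):
    """Return (offset, enable_bit) pairs for every tg3 timeout in the log."""
    hits = []
    for match in TG3_TIMEOUT.finditer(log_text):
        # Assumption: both fields are hexadecimal in the driver message.
        offset = int(match.group("ofs"), 16)
        enable_bit = int(match.group("bit"), 16)
        hits.append((offset, enable_bit))
    return hits
```

If every failed run reports the same offset and bit, that would at least show the driver is always stuck on the same block, which is useful to mention in a bug report.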

On the server, the kernel log shows an ERROR related to a FIFO underrun. Here are the pictures.

Tomorrow I will run more tests and try to provide more information.

Best regards,
Paulo

Paulo.Guedes

Re: "Could not open inode XXXXXX through the library…" Windows 10 Sysprep Capture

Server

FOG Version: 1.4.4
OS: Ubuntu 16.04.3 LTS

Client

Service Version:
OS: Win 10

Description

Re: "Could not open inode XXXXXX through the library…" Windows 10 Sysprep Capture

Hello,

I have a very similar setup and a very similar issue in here. I am building an image for a few labs, in a school.

My Ubuntu version (from lsb_release -a):
Description: Ubuntu 16.04.3 LTS

About fog:
Running Version 1.4.4
SVN Revision: 6077

I am trying to capture an NTFS image from an HP EliteDesk 705 G3 mini. It has a GPT disk with a set of partitions: Windows, EFI, recovery, Linux, etc. I am not running a sysprep procedure, but my Windows had programs installed and was updated. I already tested the hardware with a memory stick. Everything seems OK.

Sometimes, the error related to the tg3 network module has appeared. However, the same module under Ubuntu seems to be working fine (in a quick test). But now, tg3 is not the main issue. The error I see is the same: “could not open inode 998968 through the library: input/output error”.
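To confirm that every attempt fails on the same inode, the error lines could be counted from saved capture output. A small sketch, assuming the client output was copied into a text file. The helper name inode_failures is hypothetical.

```python
import re
from collections import Counter

# Matches: could not open inode 998968 through the library
INODE_ERROR = re.compile(
    r"could not open inode (\d+) through the library", re.IGNORECASE
)


def inode_failures(log_text):
    """Count how many times each inode number appears in the error lines."""
    return Counter(int(number) for number in INODE_ERROR.findall(log_text))
```

A single inode number in the result would suggest one damaged file or metadata entry rather than a random read problem.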

I also turned off fast startup. The machine was shut down properly, as described here:
https://wiki.fogproject.org/wiki/index.php?title=Windows_Dirty_Bit
Used the “Way 1 (clean method)”.

Already tried to update the kernel to the latest available on the fog project.

Would you like more information? Config files, logs, anything?

Any ideas?

Thanks,
Paulo