    Posts made by Paulo.Guedes

    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      Hello, just updating.

      1. No answer so far from Broadcom. Tom, adding the patch would be good.

      2. Added a link to this discussion in another thread. I think it’s the same problem.
        Maybe they can also report on the problem.
        https://forums.fogproject.org/topic/9976/hp-elitedesk-705-g2-mini

      3. Mentioned the patch and test results in another forum. Hope this helps the patch to enter the main kernel faster.
        https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664

      posted in FOG Problems
      Paulo.Guedes
    • RE: HP EliteDesk 705 G2 MINI

      @andrew-asph
      Hello, this thread has a patch that solved the bug.
      https://www.mail-archive.com/netdev@vger.kernel.org/msg189347.html

      The patch is here:
      https://www.mail-archive.com/netdev@vger.kernel.org/msg189923/0001-tg3-Add-clock-override-support-for-5762.patch

      More details in here:
      https://forums.fogproject.org/topic/10731/crash-due-to-timeout-in-tg3-kernel-module-tg3_stop_block-timed-out-ofs-4c00-enable_bit-2

      posted in Hardware Compatibility
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @sebastian-roth
      Hello Sebastian, all,

      1. Stable kernels 4.13.3 and 4.15 crash without the patch. Patch is not merged yet in the main branch.

      2. Stable kernels 4.13.3 and 4.15 work great with the patch: no timeouts on tg3. Fast transfers on gigabit links and 10/100 links.

      3. Wrote to the patch author as Sebastian suggested, with my results and asking when it will be merged. Waiting for his answers. Patch has a slight offset for 4.15 (2 lines, probably new comments or code) but works anyway. Will keep you updated on this.

      4. Deploy for single machines (in parallel without multicast) is finally checked. Tested overnight with a bunch of machines and it’s ok.

      5. If you wish, I can upload the patched 4.15 kernel tomorrow, just in case someone wants to use it.

      6. Multicast deploy for groups of machines is working too, but much slower (about 10x) than my 10/100 network could transfer. Same network, same machines, no cable touched, nothing reset and… the deploy already starts at a slow speed (between 100 and 200 MB/min). Just reporting. I will start reading about it to try to understand the problem. If anyone can point me in the right direction, please answer this message.

      posted in FOG Problems
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @sebastian-roth
      As far as I can tell, the patch for tg3 was not inside the release candidates for the current kernel. I’ve tested 4.15-RC8 and it was not working. Then RC9 was released (no idea about it). Two days ago a brand new stable version was released. Will try it and see what happens.
      https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.15.tar.xz

      I just checked the changelog and it mentions nothing related to tg3, tigon, timeout or broadcom. I would bet this patch is not in there yet. Here is the changelog.
      https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.14.15
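
The keyword check described above can be repeated mechanically. Below is a minimal sketch, assuming the changelog has already been saved to a local file. The function name `scan_changelog` and the exact keyword pattern are illustrative choices, not part of any kernel tooling.

```shell
#!/bin/sh
# Print every changelog line (prefixed with its line number) that mentions
# one of the keywords searched for in this post. The match is
# case-insensitive. Note that "timeout" does not match "timed out".
# Returns non-zero when nothing matches.
scan_changelog() {
    grep -inE 'tg3|tigon|timeout|broadcom' "$1"
}

# Example (the file name is whatever the changelog was saved as):
#   scan_changelog ChangeLog-4.14.15 || echo "no tg3-related entries"
```

An empty result only shows that no entry mentions those words; it does not prove the fix is absent, since a commit subject may be worded differently.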

      I will try to run more tests today: one with 4.13.3 without the patch, to see if it breaks (and hence whether the patch is the real fix), and another with 4.15 (with and without the patch), to see if it is fixed and, in case it’s not, whether the patch applies cleanly and works. Meanwhile, yesterday I wrote in another thread (with the same bug), asking people there to double-check our findings. Maybe they can take a look too and see what happens.

      posted in FOG Problems
    • RE: HP EliteDesk 705 G2 MINI

      @andrew-asph
      Hello, today I just managed to make it work, based on directions provided by the fog team. I will double check it tomorrow, but the details (and the patch) are all described on the other thread (please see my last message in this thread).

      By the way, if any of you could repeat what I did in order to check that my findings also work for you, it would provide valuable evidence that the issue was nailed. Can you please try the fix and report the results? Any result will help, since a failure can show that there’s something else that I missed.
      Looking forward to your test results.
      Regards,
      Paulo

      posted in Hardware Compatibility
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @tom-elliott
      Hello Tom, I have added a few dmesg logs in the messages below. I think it’s not related to the firmware, since the kernel builds OK but the module crashes.

      Hello all, it’s a real pleasure to finally say that IT WORKED!!! Wow, it finally worked! I almost can’t believe it. Thank you so much for all your help.

      Ahem. The solution was found by Sebastian (thanks Sebastian!!!). Here I just describe the process.

      This message thread contains the solution and a patch. It describes precisely the failure scenario: the same NIC, boot over the network, a 10/100 switch, and the way the tg3 kernel module breaks with a timeout.
      https://www.mail-archive.com/netdev@vger.kernel.org/msg189347.html

      The patch:
      https://www.mail-archive.com/netdev@vger.kernel.org/msg189923/0001-tg3-Add-clock-override-support-for-5762.patch

      The kernel version: 4.13.3
      https://www.kernel.org/pub/linux/kernel/v4.x/
      https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.13.3.tar.xz

      Basically I followed the instructions to rebuild a static image.
      Download the kernel and the patch, extract the kernel, apply the patch, and build an image (mine was a 64-bit one).
      https://wiki.fogproject.org/wiki/index.php?title=Build_TomElliott_Kernel
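
The steps above can be sketched roughly as follows. This is only an outline under stated assumptions: the two URLs are the ones quoted in this post, `KERNEL_CONFIG` is a hypothetical path to the FOG kernel configuration obtained as the wiki page describes, and the helper names `run` and `build_patched_kernel` are made up for illustration. The wiki remains the authoritative procedure.

```shell
#!/bin/sh
# Sketch only. URLs are the ones quoted in this post; KERNEL_CONFIG is a
# hypothetical path to the FOG kernel .config (see the wiki page).
KERNEL_VERSION="${KERNEL_VERSION:-4.13.3}"
KERNEL_URL="https://www.kernel.org/pub/linux/kernel/v4.x/linux-${KERNEL_VERSION}.tar.xz"
PATCH_URL="https://www.mail-archive.com/netdev@vger.kernel.org/msg189923/0001-tg3-Add-clock-override-support-for-5762.patch"
KERNEL_CONFIG="${KERNEL_CONFIG:-./fog-kernel.config}"

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build_patched_kernel() {
    src="linux-${KERNEL_VERSION}"
    patch_file="$(pwd)/tg3-5762.patch"

    # Download the kernel source and the patch, then extract the source.
    run wget -O "${src}.tar.xz" "$KERNEL_URL" || return 1
    run wget -O "$patch_file" "$PATCH_URL" || return 1
    run tar -xf "${src}.tar.xz" || return 1

    # Check that the patch applies before modifying the tree, then apply it.
    run patch -d "$src" -p1 --dry-run -i "$patch_file" || return 1
    run patch -d "$src" -p1 -i "$patch_file" || return 1

    # Reuse the FOG configuration, accept defaults for new options, build.
    run cp "$KERNEL_CONFIG" "$src/.config" || return 1
    run make -C "$src" olddefconfig || return 1
    run make -C "$src" bzImage || return 1
}

# Dry run: list the commands without downloading or building anything.
#   DRY_RUN=1; build_patched_kernel
```

On x86 the resulting image ends up in `arch/x86/boot/bzImage` inside the source tree. The `--dry-run` pass is there because a patch may apply with an offset on a newer kernel, and it is cheaper to find that out before the tree is touched.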

      Install the build inside fog, then try to image something over ethernet with the regular procedure: using pxe to boot.

      Without the patch, the deploy will fail with a timeout crash inside tg3. With it, it should work flawlessly.

      If you wish, I’ve built a 64-bit image, ready to be used inside fog. Here it is.
      https://goo.gl/n1qBES

      Regards,
      Paulo
      p.s.: I really hope nothing has changed inside the firmware repository, and the fix is not due to a new firmware. Maybe it’s worth trying the same kernel with the same firmware repository, but without the patch (to see if it breaks). Anyway, it works, and this is what matters:)

      posted in FOG Problems
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      Short version:

      • Bug still happening.
      • Added more info on launchpad bug 1447664 (basically what you see in here), just to share the logs.
        https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664
      • Will test 2 things you suggested tomorrow.
      • Thank you both!

      Now the long version:

      @sebastian-roth
      Hello Sebastian,

      the bug is still happening. Let me explain.

      I have tried what you suggested first about modifying the patch for Dell machines. Tried it with a few printk comments, to prove it was being correctly built. The patch “as is” fails the if condition (as expected), most probably because my machines are not Dell. It prints my message that proves the module is being compiled, but not the one stating that the “if body” runs.

      Then I commented out the “if condition” as you suggested, in order to force the body to always execute. It runs (added a printk to prove that), but the problem still happens.

      I could not yet try your last suggestion (next link).
      https://www.mail-archive.com/netdev@vger.kernel.org/msg189347.html

      But it states that “Booting from a harddrive works fine”, which is very encouraging. It also describes precisely the scenario I see in here, with the bios code loading the kernel and ramdisk file correctly over ethernet, and with the network breaking after the boot process.

      Well, I will prepare a new kernel today and will run two new tests tomorrow.

      1. Try the patch you suggested from the kernel list.
      2. Try to boot from a USB boot drive as suggested by George.

      Will return to you as soon as I have more information.

      By the way, you’re right: “Nasty stuff and really hard to find and fix”. I also stumbled upon the NIC development datasheet and it’s quite large (600+ pages: ouch!), so I gave up on this route.

      Thanks!

      @george1421
      Hello George,

      1. Next Monday I will test with “lspci -nn”.

      2. Actually the issue only happens at slow speeds (100). With a gigabit link it never happens.

      3. Good to know about the USB boot FOS image. May I create it as described in the following link, or is it something else? Anyway, I just created a bootx64.efi as it describes. Will try it tomorrow morning, as soon as I get to work.
        https://wiki.fogproject.org/wiki/index.php?title=USB_Bootable_Media

      About the fog usb boot drive and the USB network adapter, if it works, it will be a great solution. Currently, we really don’t care much about the link speed, since we have more than 100 machines to clone. We can let them work overnight and that would be just fine, even if it takes two days.

      posted in FOG Problems
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @george1421
      Hello George,

      1. The HW id: this is a Broadcom card. This came from lspci. Is that what you’re asking?
        01:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10)

      2. I have tried that. Forcing the port to 100 Mb/s does not change anything when it’s connected to a 10/100 switch: the bug still happens. Actually, the bug ALWAYS happens when connected to a slower switch (10/100). When I connect to a gigabit port (crossover cable or gigabit switch), the communication flows perfectly. It’s clearly something time-related.

      3. We tried to do that, but I actually… don’t know how. I mean, I can use a tethered cell phone with an “ethernet over USB” inside Ubuntu, quite easily. I can write udev rules and the like. However, I never had to do that inside busybox. The kernel is clearly recognizing the device, but I don’t know how to proceed in order to set up a (virtual) network interface. Today I tried with “mdev -s”, but could not properly search for a tutorial to learn how to finish it. If you can point me in the right direction, it would be great.

      Sebastian, I’ve also tried the kernel patch without success. With and without the “if condition” (and with printk messages). Later I will elaborate on that a bit further.

      Unfortunately we’re still stuck.

      Thank you all for your ideas,
      Paulo

      posted in FOG Problems
    • RE: HP EliteDesk 705 G2 MINI

      @andrew-asph
      Hello, I’m having the very same issue. We have been trying to work around this since last semester, and the best we could do was to create a set of scripts in order to use fog files for MANUAL cloning, without network (ouch!). Basically we managed to:

      1. Grab an image using a single ethernet cable (crossover), with gigabit board on both ends and NO SWITCH in between. One of the machines was NOT the HP (hence, it most probably has another kind of NIC).

      2. Copy fog files to an external USB hard disk. Adjust scripts to fix paths.

      3. Boot a live Ubuntu from a memory stick (already prepared with the necessary tools inside).

      4. Run the scripts, in order to set up the GPT partitioning and the disk deploy.

      My machines are “HP EliteDesk 705 G3 MINI” (G3 instead of G2), but most probably the motherboard is the same. Or at least close enough to have the same issue. The process has to be repeated for every single machine (yeah, it’s a pain). Here it takes about 50 minutes each, when everything works fine (over USB 3).

      Maybe you would like to take a look at the issue I created. It has a lot of information, including kernel logs and such. Just updated it today. Here it is. Maybe when a fix (or workaround) is found in the future, you can use it too.

      https://forums.fogproject.org/topic/10731/crash-due-to-timeout-in-tg3-kernel-module-tg3_stop_block-timed-out-ofs-4c00-enable_bit-2

      Regards,
      Paulo

      posted in Hardware Compatibility
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      Hello, sorry for taking so long to answer this. Too many (time-sensitive) things at work. Well, I managed to find out a few new details on this bug. It seems that this AMD architecture is not yet well supported.

      This message mentions a few changes on the tigon3, including a workaround that is specific for my network card. I tested it, but it’s not working.
      https://lkml.org/lkml/2017/12/31/125
      <…>
      Siva Reddy Kallam (3):
      tg3: Update copyright
      tg3: Add workaround to restrict 5762 MRRS to 2048
      tg3: Enable PHY reset in MTU change path for 5720
      <…>

      According to this thread, the fix still does not solve the issue. Last post: 2018-01-16.
      It’s the patch for tg3, aimed at my specific ethernet card (5762).
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664

      Meanwhile, I have downloaded and rebuilt the latest linux release candidate, which has this patch for the tg3 module.

      The 4.15-rc8 is available here:
      https://git.kernel.org/torvalds/t/linux-4.15-rc8.tar.gz

      The bzImage file was created as a static TomElliott 64-bit image.
      https://wiki.fogproject.org/wiki/index.php?title=Build_TomElliott_Kernel

      Unfortunately, my tests with this kernel showed no improvement on the timeout issue. The problem still happens. I tried a few kernel parameters, without success. This is a vanilla (+TomElliott config) kernel. Not tainted, although it has the firmware repository inside.

      However, I finally got kernel logs. You can check them in the links below.

      log_01_acpi_off.txt
      https://pastebin.com/FGQNiLqk

      log_02_maxcpus_1.txt
      https://pastebin.com/2eEJnA3Z

      log_03_nmi_watchdog_off.txt
      https://pastebin.com/Su44AqiX

      log_04_nmi_watchdog_off.txt
      https://pastebin.com/4ja0UZ0c

      log_05_noapic_nolapic.txt
      https://pastebin.com/fZNJbME5

      The kernel parameters were used as follows. Some were inspired by the logs (tsc), some just to… see what happens.

      debug loglevel=7
      debug loglevel=7 acpi=off
      debug loglevel=7 acpi=off tsc=unstable
      debug loglevel=7 acpi=off tsc=unstable maxcpus=1
      debug loglevel=7 acpi=off tsc=unstable maxcpus=1 nmi_watchdog=0
      debug loglevel=7 acpi=off tsc=unstable maxcpus=1 nmi_watchdog=0 noapic nolapic
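
The six parameter sets above are cumulative: each run adds one option to the previous set. A small sketch of how such a list can be generated, so that further options can be appended without retyping (the function name `build_param_sets` is made up for illustration):

```shell
#!/bin/sh
# Print cumulative kernel command lines: each line adds one more option
# to the previous one, starting from the base "debug loglevel=7".
build_param_sets() {
    cmdline="debug loglevel=7"
    printf '%s\n' "$cmdline"
    for opt in "$@"; do
        cmdline="$cmdline $opt"
        printf '%s\n' "$cmdline"
    done
}

# Reproduces the six sets listed above (noapic and nolapic were added together).
build_param_sets acpi=off tsc=unstable maxcpus=1 nmi_watchdog=0 "noapic nolapic"
```

Adding options one at a time keeps each log attributable to a single change, at the cost of one boot per option.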

      Sometimes it’s difficult to get logs as the machine hangs right after the network stops working.

      Here is the mrrs patch for tg3, related to the 5762 hw version. My test has this applied, but still does not fix the problem.
      https://github.com/torvalds/linux/commit/4419bb1cedcda0272e1dc410345c5a1d1da0e367#diff-ee9b0abeec638cc316efd5b30e0e01e8

      Any ideas? Would you like logs with other parameters? Is there anything I can do to provide further information? lsusb? lspci? lscpu? anything?

      Regards,
      Paulo

      p.s.: by the way, I also spotted network issues on a live Ubuntu image (17.10.1), both on wired (tg3) and wireless (iwlwifi) network cards.

      posted in FOG Problems
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @sebastian-roth Hello Sebastian,
      I believe it’s a regression. I saw the old messages about the 2.6 kernels: this bug has happened before, maybe a few times already. Most probably the module worked for similar devices, but for this specific combination of new device plus old switch, it breaks. If I understood it correctly, there is a watchdog which keeps running, expecting something that never happens (such as a reply/control flow message/packet).

      Anyway, I agree that solid evidence is always better than the best guess. I will schedule a debug test as you mentioned. Hopefully, I will have a result in the next few days (things are a bit tricky in here, this week). By the way, I can add some printk messages inside the module, in case you want to see something specific in the sequence of function calls or about the variables (e.g. state of the device, etc.).

      Thank you!
      Paulo

      posted in FOG Problems
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @sebastian-roth Hello, you described it very well: it was a really difficult time for us to manually clone all our labs. That’s something I don’t wish for anyone.

      Anyway, I should have time to check again on this issue next week, I hope. Will try with the latest fog + latest kernel, to see if the issue is still happening. I already checked some posts and it seems that the problem was not yet fixed. If anyone has more information on the problem, please let me know. I will report back when I reinstall, run and test it.
      Regards, Paulo

      posted in FOG Problems
    • RE: "Could not open inode XXXXXX through the library..." Windows 10 Sysprep Capture

      @tom-elliott
      Hello Tom, I am using the latest stable version (1.4.4) and seeing something similar when trying to clone with the “shrink partition” method. Maybe it’s something inside partclone itself…?

      posted in FOG Problems
    • RE: Could not open inode xxxxx through the library - HP EliteDesk 705 G3 Mini

      @sebastian-roth
      Hello,
      working with the non-resizable image type really helped. Since that allowed us to capture an image, I am focusing now a bit more on the “tg3 timeout issue”. But I will post the information you asked for, as soon as things calm down in here. Probably in a few days. Sorry, but that’s all we can do now, until our labs are up and running again.

      Regards,
      Paulo

      posted in FOG Problems
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @sebastian-roth
      Hello Sebastian,

      well, yes and no. I have a lot of information on this, but I’m still working to bring up my labs. Please allow me a few days to work this out. I have a few hundred students and teachers that are eagerly expecting our labs to be ready, so it’s a big issue for us.

      Currently I am working with a very small team, cloning about 100+ machines, one by one. Yes, you’ve read it right: it’s currently impossible to multicast images with this bug and our infrastructure (gigabit ethernet mixed with 10/100 switches).

      We set up three fog servers and are using them with crossover cables. I also have two external hard drives (USB 3.0), and hacked my way through with a few shell scripts and a lot of tinkering.

      We are able to clone about five machines in parallel with this scheme. However, the cloning process is very very unstable. About 30% to 50% of the cloning operations are failing (roughly).

      Of these, only about half are due to the tg3 problem and related to fog. Yes, that’s right: with a pair of distinct machines and a single crossover cable across them, the “tg3 timeout” issue is still happening. Both machines (in each pair) have gigabit cards, but they are different. The bug is way less frequent, and we managed to finish many cloning operations successfully. But it’s still happening.

      This means the 10/100 switch makes the bug more reproducible, but it’s not the root cause. It still happens, even without any 10/100 network interface in the middle.

      The other half of the failures are due to crashes and freezes from a couple of live memory sticks running ubuntu and pumping about 200GB over USB3.0 (about 45 min to 1h to finish).

      I could not dig deeper into this since we need to finish the work. I hope to have it done by next Friday, maybe sooner.

      About iommu=soft, I also tried it a few times, without any success. Both with 64-bit and 32-bit kernels, and also with the latest “vanilla + firmware repo” kernel. I also tried many other things, such as noapic, nolapic, both of them, turning off autonegotiation, raising the log level to look for more messages and the like. Oh, and I also updated the HP BIOS firmware, turned on traffic shaping and tried other things (isolated and combined).

      Nothing solved the problem. It’s clearly a regression somewhere between the HW, the firmware and the kernel driver.

      With all due respect to Broadcom, this is something that they should have caught reasonably easily. Since they give explicit support for the kernel module, a good test bench should have exposed the problem. Most probably their test setup has only gigabit cards, otherwise the bug would have been exposed more easily.

      I really believe that a test bench with Fog, a set of machines (with many distinct cards) and a set of images (with many distinct sizes, partition layouts and the like) would be great to catch this kind of thing. Oh my, I would love to help them set up something like that…

      Ahem. Well, let me see how things are moving. Will get back to you in a few days.

      Thank you all for your support,
      Paulo

      posted in FOG Problems
    • RE: Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2

      @sebastian-roth said in Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2:

      PCI

      Hello, here is the output for lspci.

      00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1576
      00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Carrizo (rev e4)
      00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kabini HDMI/DP Audio
      00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 157b
      00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 157c
      00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 157c
      00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 157b
      00:08.0 Encryption controller: Advanced Micro Devices, Inc. [AMD] Device 1578
      00:09.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 157d
      00:09.2 Audio device: Advanced Micro Devices, Inc. [AMD] Device 157a
      00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 20)
      00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49)
      00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 49)
      00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 4a)
      00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
      00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1570
      00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1571
      00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1572
      00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1573
      00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1574
      00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 1575
      01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10)
      02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)

      The interesting line is the following:
      01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10)

      This is a Broadcom BCM5762 device.
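
Picking the relevant line out of a long listing like the one above can be done with a simple filter. A minimal sketch (the function name `ethernet_controllers` is illustrative):

```shell
#!/bin/sh
# Read lspci output on stdin and print only the Ethernet controller lines.
ethernet_controllers() {
    grep -i 'Ethernet controller'
}

# On the affected machine (-nn adds the numeric [vendor:device] IDs):
#   lspci -nn | ethernet_controllers
```

The numeric IDs from `lspci -nn` are usually more useful than the descriptive name when matching a device against driver source or bug reports.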

      Unfortunately, I still don’t know how to work around this issue. I have systematically tried all sorts of kernel parameters, ethtool parameters and other things with no luck. And yes, I tried to turn off autoneg, with no luck.

      In the meantime, I gathered a lot of information. It’s a bit messy, so I’ll have to organize it.
      I don’t have gigabit on both ends, other than a couple of machines. That is making my cloning process a real pain, and is also the main reason I’ve not answered before :(.

      I also downloaded and rebuilt the latest kernel (linux-4.12.10) based on your .config files and instructions in here.
      https://wiki.fogproject.org/wiki/index.php?title=Build_TomElliott_Kernel#Build_TomElliott_Kernel_for_FOG_0.33b_and_newer

      This still did not solve the issue. And yes, I’m sure my kernel is running because I added a few messages in Portuguese to make sure it was the one booting, instead of the previous one.

      Currently I am trying more things, including tinkering with this module myself. It seems that there is something wrong with ACPI. But tg3 is a quite complex module (at least for me). Looks more like a ton of modules merged together, with dozens of special cases, switches and paths to accommodate a large family of devices. Ouch!

      Will try to better organize my ideas in order to share the details with you.
      Talk to you soon. Thank you for helping me out with this crazy bug!

      Regards,
      Paulo

      By the way, there are others looking at it right now. Check out this:
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664

      posted in FOG Problems
    • RE: Could not open inode xxxxx through the library - HP EliteDesk 705 G3 Mini

      @sebastian-roth Just learning how to reply in this forum. Yes, I tried to reboot into Windows. No, the numbers didn’t change, unless I perform a defrag operation. A full disk scan with chkdsk was performed, but nothing changed. The machines are brand new, they arrived two or three weeks ago. I don’t think the problem is related to a bad disk, especially because it happened on several machines without further obvious issues. I mean, when the cloning works, the underlying OS boots and works without a glitch (quick test, less than an hour).
      About SMART data: it’s good to check, will try tomorrow.
      Paulo

      posted in FOG Problems
    • RE: Could not open inode xxxxx through the library - HP EliteDesk 705 G3 Mini

      @sebastian-roth Hello Sebastian, you’re right, thanks for helping me. Here is my new post.
      https://forums.fogproject.org/topic/10731/crash-due-to-timeout-in-tg3-kernel-module-tg3_stop_block-timed-out-ofs-4c00-enable_bit-2

      I confirm that my machines have identical sector counts. They were bought in a batch and are all from exactly the same type and manufacturer, with the same devices inside. Or at least, they should be.

      I’m glad that you think I provided great information. I can provide more if you wish, including from dmesg, the debug deploy shell or from the server. You name it.

      The problem is still here and it’s preventing me and all my colleagues from working. My department is almost completely stuck due to this issue :(.

      So, thank you very much (in advance) for any help you can provide.

      Paulo

      posted in FOG Problems
    • Crash due to timeout in tg3 kernel module: tg3_stop_block timed out, ofs=4c00, enable_bit=2
      Server
      • FOG Version: 1.4.4
      • OS: 16.04.3 LTS
      Client
      • Service Version:
      • OS: Win 10 + Ubuntu 16.04.3
      Description

      Hello,

      In order to better organize ideas and to separate unrelated issues, I am creating a new post as suggested by Sebastian Roth. You’re right, thanks. This one will focus on the tg3_stop_block timed out problem.

      My first post is here (sorry, was describing the two problems in one place).
      https://forums.fogproject.org/topic/10711/could-not-open-inode-xxxxx-through-the-library-hp-elitedesk-705-g3-mini/4

      The problem:

      I am seeing a timeout error during the cloning process. I believe it is related to the tg3 kernel module, which is responsible for handling the Tigon3 wired Ethernet device.
      (Screenshot: 0_1503972985241_tg3_stop_block_timed_out.png)

      The observed behavior is as follows. I start a deploy, the machine sometimes starts the deploy process and after a while, it gets stuck. Then after some time (a few minutes), the kernel crashes with a timeout error.

      This happens with both a crossover cable and over a wired Ethernet across a switch. It is an intermittent issue. Last Friday I managed to clone about five machines with the crossover cable, plus one that failed.

      Today, two failed using the crossover cable. The deploy starts, but at some point during the partition writing it crashes, after 8 minutes or so, with an NTFS partition partially deployed. I tested only one machine at a time, due to the limitation of the crossover cable.

      All tests I did through the network switch also failed, but in a somewhat different way: right after writing GPT data, but before starting to write data inside the partition. I tested with a small group of four, then two and finally with a single machine. All tests failed the same way, both with the UDPCAST method (multicast deploy) and the NFS method (unicast, if I remember correctly).

      Possible causes:

      1. My first guess was an issue with the crossover cable being too loose. Now I don’t think this is the root cause, since I replaced the cable with a new one. With the new cable I observed both successful image capture and image deploy. But failed captures and deploys happened too. So, I don’t think it’s the cable anymore.

      2. Failure on the tg3 kernel module.


      Current investigation:

      After some reading, I’ve found a few references.

      This (old) message suggests that the problem happens in one kernel version, but not in the previous one.

      https://askubuntu.com/questions/88319/server-getting-error-after-doing-distro-upgrade-tg3-stop-block-timed-out


      This (also old) message points out that:
      “When using TSO property of the TG3 driver to transmit a packet with a large header, such as over 80 bytes, an error message similar to the following appears in the Kernel log when using the TG3 3.66d version of the driver with GA3 firmware…”

      https://www.ibm.com/support/home/docdisplay?lndocid=migr-5071755

      It also suggests a workaround.

      "Turn off the TSO functionality of the driver using the following command from Linux:

      ethtool -K eth0 tso off
      "

      I started a “deploy (debug)” task and tried to do that once, manually. But the problem is still there in the very same way. If it had worked, I would have worked around it with a postinit script.
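
For reference, here is a sketch of what such a postinit step could look like, had the TSO workaround helped (in these tests it did not). It assumes the standard Linux sysfs layout, where `/sys/class/net/<iface>/device/driver` is a symlink to the bound driver. The function name and the `SYSFS_NET` and `ETHTOOL` override variables are hypothetical and exist only so the logic can be tried without touching a real interface.

```shell
#!/bin/sh
# Disable TSO on every network interface whose bound driver is tg3.
# SYSFS_NET and ETHTOOL are hypothetical overrides for dry runs.
disable_tso_for_tg3() {
    sysfs="${SYSFS_NET:-/sys/class/net}"
    tool="${ETHTOOL:-ethtool}"
    for dev in "$sysfs"/*; do
        # Virtual interfaces such as lo have no device/driver link.
        [ -L "$dev/device/driver" ] || continue
        driver=$(basename "$(readlink "$dev/device/driver")")
        [ "$driver" = "tg3" ] || continue
        "$tool" -K "$(basename "$dev")" tso off
    done
}

# In a postinit script this would be called once, before imaging starts:
#   disable_tso_for_tg3
```

Matching on the driver rather than hard-coding `eth0` avoids depending on how the interface happens to be named on a given machine.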

      I also tried to limit bandwidth with wondershaper a few times, but could not see much difference: same error. The idea was based on a possible concurrency issue. If the tg3 problem is due to some subtle race condition in the buffer handling for the network card, slowing it down could (possibly) reduce the likelihood of the issue.

      Finally, I started playing with different kernels.

      With Kernel.TomElliott.4.1.0.64, it was “too old” and refused to work.
      With the following kernel versions, the issue is still there.
      4.12.3.64,
      Kernel.TomElliott.4.10.1.64 and
      Kernel.TomElliott.4.9.0.64

      By the way, other than showing the same issue, the last version (4.9.0.64) also complained about an APIC issue. It reads: “Firmware bug”, and also “APIC ID mismatch”. Here is the screenshot.
      (Screenshot: 0_1503972858770_Problema_APIC.png)
      And that’s it: I’m running out of ideas, other than trying the other kernels.

      Any suggestions? Anything I can do under a debug deploy, even manually, to work around this? Is there a wireless option? Anything?

      Thank you very much,
      Paulo

      p.s.: tomorrow I will try this search:
      linux kernel tg3 tso

      And read this to see what happens.
      https://blog.sleeplessbeastie.eu/2017/04/17/how-to-install-missing-firmware-for-tg3-module/

      posted in FOG Problems
    • RE: Could not open inode xxxxx through the library - HP EliteDesk 705 G3 Mini

      Sorry, but I’m still not used to this forum. Just answered your message using the wrong button. Please take a look when it’s possible. At least now I am ready to clone everything next Monday. So, if you think it’s helpful, I can take a look at the issues in more detail, as soon as this deploy is ready (about 100 machines).
      Thank you,
      Paulo

      posted in FOG Problems