    Error "rcu_sched self detected stall on CPU" on legacy BIOS Capture job

    • fenix_team

      I’m having this same issue with another type of Legacy BIOS machine, whose BIOS and CPU details are:

      • System Type: Main Server Chassis
      • BIOS Vendor: American Megatrends Inc.
      • BIOS Version: 2.1
      • BIOS Date: 12/30/2011
      • Motherboard Manufacturer: Supermicro
      • Motherboard Product Name: X8DAL
      • CPU Manufacturer: Intel
      • CPU Version: Intel® Xeon® CPU E5620 @ 2.40GHz Intel® Xeon® CPU E5620 @ 2.40GHz
      • CPU Normal Speed: Current Speed: 2400 MHz
      • CPU Max Speed: Max Speed: 2400 MHz

      I’m running the tests as I post this, including on the machine described above.

      • george1421 Moderator @fenix_team

        @fenix_team Again, thank you for the details; all of it is helpful in trying to deduce the issue. The one question I forgot to ask is: what FOS kernel version were you originally on? Looking at the FOS kernel list, I see 4.19.6 (released in December) as current, 4.19.1 released in November, and 4.18.3 released in August.

        Please help us build the FOG community with everyone involved. It's not just about coding - way more we need people to test things, update documentation and most importantly work on uniting the community of people enjoying and working on FOG!

        • fenix_team @george1421

          @george1421 My original FOS image is the latest available on the Kernel Update GUI page, which currently is 4.19.6 (both bzImage and bzImage32).

          I just finished one of the 2 machines with systems as described in the OP. The capture job succeeded without so much as a single warning! The system with the American Megatrends BIOS is also smoothly past the point at which the issue was happening. I also noticed another change: all of these machines sometimes got stuck in iPXE boot while loading “/default.ipxe”, at 0%, and forced me to reboot many times until one randomly booted correctly. After changing kernel and init versions, that problem vanished (I don’t know if the two things are related, though).

          My last machine, with a legacy Phoenix Awards BIOS, did not complete the capture job, but looking at partclone.log I found a bad block issue. I’m now trying to capture it using the DD method, which takes longer but is worth the test. I’ll report on it later.

          • george1421 Moderator @fenix_team

            @fenix_team Since you seem to have a fleet (smile) of impacted systems, could you help with debugging? What I’d like to know is what happens if you use the 1.5.5 inits and then test kernel 4.18.3 from this site: https://fogproject.org/kernels/ I want to see whether it’s an issue with the 4.19.x branch. We know that 4.15.2 was a very stable kernel build, and we have had to back up to that release a few times because of dramatic changes in the Linux kernel after the 4.15.x versions.

            • fenix_team @george1421

              @george1421 Question: should I use the original init.xz packed with FOG 1.5.5, or should I downgrade it as well to the 1.5.2 one, as you suggested earlier?
              I ask because I have already extensively tested the 4.18.x down to 4.16.x branches (for both archs), and in all cases I got kernel panic “FATAL: Kernel too old” issues.

              • george1421 Moderator @fenix_team

                @fenix_team OK, I didn’t think the FOG project devs tied the kernel that tightly to the inits. Please use the 1.5.2 inits then. But I need to ask you to also try the 4.19.6 kernel with the 1.5.2 inits for completeness. It’s almost clear in my head that the inits are not the issue here, but to rule it out and to complete the truth table, please test the following if you have time.

                With the 1.5.2 inits, test kernels:
                • 4.19.6
                • 4.18.3

                We know 4.15.2 is good with 1.5.2 inits
                We know 4.19.6 is bad with the 1.5.5 inits.
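
The truth table described in this post can be written out explicitly. The following is only a minimal Python sketch: it records the two results stated above and marks every other kernel/init combination as untested. The names `KNOWN`, `truth_table` and `requested_tests` are hypothetical and not part of FOG.

```python
from itertools import product

# Kernel and init versions mentioned in this thread.
KERNELS = ["4.15.2", "4.18.3", "4.19.6"]
INITS = ["1.5.2", "1.5.5"]

# Only the two results stated in the post are recorded here.
# True = works, False = fails; anything missing is untested.
KNOWN = {
    ("4.15.2", "1.5.2"): True,
    ("4.19.6", "1.5.5"): False,
}


def truth_table(known):
    """Return (kernel, init, status) rows for every combination."""
    labels = {True: "good", False: "bad", None: "untested"}
    rows = []
    for kernel, init in product(KERNELS, INITS):
        rows.append((kernel, init, labels[known.get((kernel, init))]))
    return rows


def requested_tests(known, init="1.5.2", kernels=("4.19.6", "4.18.3")):
    """Combinations asked for in the post that have no recorded result yet."""
    return [(kernel, init) for kernel in kernels if (kernel, init) not in known]


if __name__ == "__main__":
    for kernel, init, status in truth_table(KNOWN):
        print(f"kernel {kernel:7} + inits {init}: {status}")
```

Filling in `KNOWN` as results come back makes it easy to see which cells of the table are still open.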

                • fenix_team @george1421

                  @george1421 very well, I’ll test as you specified and report the results later.

                  • Quazz Moderator @george1421

                    @george1421 The problem is that for new kernels we need to update the kernel headers. Programs built against those headers (such as the programs in the init files) require at least the minimum kernel version those headers support in order to run.

                    In practice, it will often work anyway, but sometimes the changes make it impossible. There is nothing that can really be done about that from our side, afaik.
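
The minimum-kernel requirement described above boils down to a version comparison: binaries built for a given minimum kernel refuse to run on anything older, which is what the “FATAL: Kernel too old” message reports. A minimal Python sketch of that comparison follows. The function names are hypothetical, and the minimum of 4.19.0 used in the usage note is an assumption inferred from the reports in this thread, not a documented value.

```python
import re


def parse_kernel_version(release):
    """Turn a release string such as '4.19.6' or '4.15.2-fog' into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '4.15') is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def kernel_is_new_enough(running, minimum):
    """True if the running kernel is at least the minimum the binaries were built for."""
    return parse_kernel_version(running) >= parse_kernel_version(minimum)
```

For example, `kernel_is_new_enough("4.15.2", "4.19.0")` is false, while `kernel_is_new_enough("4.19.6", "4.19.0")` is true. On a live system, `platform.release()` from the standard library returns the running kernel release string.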

                    • Sebastian Roth Moderator

                      @fenix_team said in Error "rcu_sched self detected stall on CPU" on legacy BIOS Capture job:

                      My original FOS image is the latest available on the Kernel Update GUI page, which currently is 4.19.6 (both bzImage and bzImage32).
                      I just finished one of the 2 machines with systems as described in the OP. The capture job succeeded without so much as a single warning! The system with the American Megatrends BIOS is also smoothly past the point at which the issue was happening.

                      Hey, thanks for reporting so many details about this! I started to look into this and read all the messages posted. That one really caught my attention. Are you saying that it does work “sometimes” without an issue? Is that on the same kernel version 4.19.6 that caused the error initially posted? That would make it even harder for us to nail this issue down.

                      And a quick comment on the kernel/init versions. There is no strict rule that kernels are compiled against exactly one init version or vice versa. But looking into this more closely, I just figured out something that I wasn’t aware of until now: there is an option within buildroot (the toolstack we use for the inits) that is used to optimize glibc compilation. The more recent the kernel version you choose, the less compatibility code needs to be built into glibc, and therefore the binaries are smaller in size. Sounds pretty straightforward, and if I had known this before (there are hundreds of buildroot options and I really don’t know exactly what they all do) I would have built with more compatibility!
                      I will compile a new set of inits with more compatibility now and see whether it affects the size much. I guess it won’t, as the inits are huge (just under 20 MB) anyway. We’ll see. I will let you all know.

                      Ok, back to the initial posted issue: Trying to figure out what might be causing this on your hardware I started by reading the kernel docs on this. Essentially it says that this can be caused by many different things (see a detailed list in the document linked) and we might need to turn on CONFIG_RCU_TRACE in the kernel to get an idea where things go wrong. But as a start we would need to have a clear picture of the exact error messages on screen.
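
Before rebuilding anything, it can help to check whether a given kernel config already has CONFIG_RCU_TRACE enabled. Kernel config files use lines of the form `CONFIG_FOO=y` for enabled options and `# CONFIG_FOO is not set` for disabled ones. The sketch below parses config text in that format; the function name `kconfig_option_state` is hypothetical, and where the config text comes from (for example the `.config` used to build the FOS kernel) is left to the reader.

```python
def kconfig_option_state(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...),
    or None if it is explicitly not set or absent from the text."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Enabled options look like: CONFIG_RCU_TRACE=y
        if line.startswith(f"{option}="):
            return line.split("=", 1)[1]
        # Disabled options look like: # CONFIG_RCU_TRACE is not set
        if line == f"# {option} is not set":
            return None
    return None


if __name__ == "__main__":
    sample = "CONFIG_RCU_TRACE=y\n# CONFIG_RCU_EXPERT is not set\n"
    print(kconfig_option_state(sample, "CONFIG_RCU_TRACE"))
```

If the running kernel was built with CONFIG_IKCONFIG_PROC, its config is available as /proc/config.gz and can be decompressed with the standard `gzip` module before being passed to this function.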

                      @fenix_team said:

                      I also noticed another change: all of these machines sometimes got stuck in iPXE boot while loading “/default.ipxe”, at 0%, and forced me to reboot many times until one randomly booted correctly. After changing kernel and init versions, that problem vanished (I don’t know if the two things are related, though).

                      From my point of view those two things can’t be related, as the Linux kernel is not yet running when default.ipxe is being loaded! It’s interesting that you seem to have fixed this by changing the Linux kernel and inits, though. I suspect it is just a coincidence. Usually when things don’t load properly at that stage, it’s a network driver problem within the iPXE code. That is another thing that is very hard to debug, as it is hardware specific and needs to be reproduced to find and fix. But for iPXE there might be a different solution for you. We provide a set of different binaries, all of which you can find in /tftpboot on your FOG server. The default for legacy BIOS machines is undionly.kkpxe. You can try undionly.*pxe (UNDI network stack only), ipxe.*pxe (native driver stack, all drivers included), intel.*pxe (native driver, but Intel NICs only) and realtek.*pxe (native driver, but Realtek NICs only).
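
To see which of those binaries are actually present, the file names in /tftpboot can be matched against the four patterns named above. This is only a rough Python sketch using shell-style matching from the standard library; the function name `candidate_binaries` and the sample file names in the usage note (other than undionly.kkpxe) are illustrative assumptions.

```python
from fnmatch import fnmatch

# Patterns and descriptions taken from the post above.
PATTERNS = {
    "undionly.*pxe": "UNDI network stack only",
    "ipxe.*pxe": "native driver stack, all drivers included",
    "intel.*pxe": "native driver, Intel NICs only",
    "realtek.*pxe": "native driver, Realtek NICs only",
}


def candidate_binaries(filenames):
    """Group file names by the first iPXE binary pattern they match."""
    groups = {pattern: [] for pattern in PATTERNS}
    for name in sorted(filenames):
        for pattern in PATTERNS:
            if fnmatch(name, pattern):
                groups[pattern].append(name)
                break
    return groups
```

Calling `candidate_binaries(os.listdir("/tftpboot"))` on the FOG server would list the alternatives to try in place of the default; with a hand-written list such as `["undionly.kkpxe", "ipxe.pxe", "default.ipxe"]`, only the first two names are grouped and default.ipxe is ignored.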

                      Web GUI issue? Please check apache error (debian/ubuntu: /var/log/apache2/error.log, centos/fedora/rhel: /var/log/httpd/error_log) and php-fpm log (/var/log/php*-fpm.log)

                      Please support FOG if you like it: https://wiki.fogproject.org/wiki/index.php/Support_FOG

                      • Tom Elliott @Sebastian Roth

                        @Sebastian-Roth you’re absolutely right about the init sizes. One thing I want to add, however, is that the old inits from 0.32 were 30 MB in size and the kernels were often around 10-15 MB. Our current inits are 18-20 MB and the kernels are around 7-9 MB. So we’re actually doing pretty well, I think.

                        • Sebastian Roth Moderator

                          @fenix_team @george1421 @Quazz Ok, I just compiled inits that should work with kernels all the way back to 4.15.x (64 bit and 32bit). Can you guys give those a try in your environments before I make those the default?

                          @Tom-Elliott Turns out size is not an issue. The inits only grew by about 3-4 KB.

                          • fenix_team

                            Hi guys, I didn’t forget about this topic. I’m just currently dealing with some iPXE booting challenges due to the big differences between the system archs I have here. I’m almost finished, so I can soon run the tests you asked for.

                            So far, some answers:

                            @Sebastian-Roth said in Error "rcu_sched self detected stall on CPU" on legacy BIOS Capture job:

                            Are you saying that it does work “sometimes” without an issue? Is that on the same kernel version 4.19.6 that caused the error initially posted? That would make it even harder for us to nail this issue down.

                            As much as I would like to answer this technically, the best I have is: yes, it’s kind of random. Our business model demands constant infrastructure changes as our clients point out their needs, so we have lots of machines that, although they are the same models, have slightly different CPUs and BIOS versions. That is a challenging scenario for setting up an application such as FOG as an automation tool. So at each node I have to test which FOS image is the best fit.

                            So far I have had 4.19.6 bzImage + FOG 1.5.5 init.xz working on about 90% of my systems with no bugs, hangs or issues of any other nature. For the ones where I did find issues, switching to 4.15.2 as suggested by @george1421 fixed the problems, but only when I used the init.xz packed with FOG 1.5.2 binaries.

                            Using bzImage 4.15.2 + FOG 1.5.5 init.xz gave me kernel panic “FATAL: Kernel too old” messages on every single system I’ve tried it on. The same happens with every bzImage version from that one up to 4.19.6 when paired with the same init.xz; 4.19.6 itself boots fine and starts the task, but then throws the errors reported in the title of this topic at some point during image deploy/capture tasks (it’s not always the same point, and I didn’t test other kinds of tasks).

                            @Sebastian-Roth said in Error "rcu_sched self detected stall on CPU" on legacy BIOS Capture job:

                            Trying to figure out what might be causing this on your hardware I started by reading the kernel docs on this. Essentially it says that this can be caused by many different things (see a detailed list in the document linked) and we might need to turn on CONFIG_RCU_TRACE in the kernel to get an idea where things go wrong. But as a start we would need to have a clear picture of the exact error messages on screen.

                            Ok, I’ll reproduce the error scenario and take a picture of the screen. I’m doing this right now.

                            @Sebastian-Roth said in Error "rcu_sched self detected stall on CPU" on legacy BIOS Capture job:

                            @fenix_team @george1421 @Quazz Ok, I just compiled inits that should work with kernels all the way back to 4.15.x (64 bit and 32bit). Can you guys give those a try in your environments before I make those the default?

                            Will test it right after the rcu_sched issue.

                            • george1421 Moderator @fenix_team

                              @fenix_team said in Error "rcu_sched self detected stall on CPU" on legacy BIOS Capture job:

                              Using bzImage 4.15.2 + FOG 1.5.5 init.xz gave me kernel panic “FATAL: Kernel too old” messages on every single system I’ve tried it on.

                              Ok, I just compiled inits that should work with kernels all the way back to 4.15.x (64 bit and 32bit). Can you guys give those a try in your environments before I make those the default?

                              What Sebastian is saying here is that he recompiled the inits to move the minimum kernel requirement back, so that they support the 4.15.x series of Linux kernels. So (for now) you only need to manage the bzImage and bzImage4152 kernels, using the same inits (virtual hard drive) for both.

                              • fenix_team @george1421

                                @george1421 yes, I understood that, I’m downloading the inits and will test it asap. What I stated was just to confirm why these issues were happening.

                                • fenix_team @george1421

                                  @george1421 @Sebastian-Roth Hello guys! I’m here to say that I have had no problems loading tasks, neither capture nor deploy, since I updated the init files.

                                  I’m using bzImage at the latest version and did extensive tests on the Legacy BIOS systems that were presenting the “rcu_sched” warnings, and so far I have not seen them again, nor any other hanging issues.

                                  If I can help with any other kind of tests, please let me know.
                                  Thanks everyone, awesome work!
