
    entr0py (@entr0py)

    Reputation: 1 | Profile views: 378 | Posts: 20 | Followers: 0 | Following: 0



    Latest posts made by entr0py

    • RE: Odd performance issue

      TL;DR,

      If you’re seeing performance issues scaling up a high-bandwidth FOG server that should have the hardware horsepower to make it work, check your NIC settings, especially when virtualized. The txqueuelen parameter in Linux can make a huge difference. If the server is virtualized, look into the settings on both the host and the guest.
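
      For what it’s worth, the kind of checks I mean look roughly like this on the Linux side (assuming the interface is eth0; adjust for your setup):

      # link speed, queue length, and drop/error counters
      ethtool eth0 | grep Speed
      ip -s link show eth0
      # ring buffer sizes and offload settings
      ethtool -g eth0
      ethtool -k eth0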

      You should be able to get >10Gbit from your FOG server if you have hardware that will support it. We hit ours with 20 devices this afternoon, all pulling 47 GB images, and the whole shebang was over in 7 minutes.

      Solved!

      posted in FOG Problems
    • RE: Odd performance issue

      @brakcounty

      root@fogserver:~# ethtool eth0 | grep Speed
      Speed: 40000Mb/s

      Raising txqueuelen on the interface to 40K (it seems like 1K was the default back when 1-gig cards became kind of standard) got rid of the falling-bandwidth issue that was present on the throughput graph yesterday. I’m wondering if it was hitting a buffer overrun or some other kind of nonsense at the kernel level, and that’s why it would run great for a while and then, once it threw enough errors or whatever, start falling on its face? IDK.
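
      For reference, the change itself is a one-liner, something like this assuming eth0 (it doesn’t persist across reboots, so it also needs to go in your distro’s network config):

      # check the current transmit queue length
      ip link show eth0 | grep qlen
      # raise it to 40000
      ip link set dev eth0 txqueuelen 40000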

      As for the Hyper-V settings, I was noticing odd CPU behavior on the Hyper-V host when I’d saturate the network from the FOG guest. Setting RSS to NUMA scaling instead of Closest Processor Static got rid of that behavior.

      txqueuelen made the biggest difference in getting it to a stable state, then the NIC settings increased the total throughput. A ton of other things were played with too, so I’m not exactly sure which “tweaks” helped, but those two were the largest factors I noticed.

      posted in FOG Problems
    • RE: Odd performance issue

      Update - found a couple issues lurking.

      First, I think if you’re going to run 40G, and maybe even 10G to an extent, you need to play with the txqueuelen parameter on your ethernet interface. Raising that seemed to help things a bunch. I’d like to go to jumbo frames too, but I’m not brave enough to make that leap yet.
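
      If I do work up the nerve, my understanding is the test would look something like this, assuming eth0 and that every hop in the path also allows a 9000-byte MTU (which I haven’t verified here):

      # bump the interface MTU
      ip link set dev eth0 mtu 9000
      # verify 9000-byte frames pass without fragmenting (8972 = 9000 minus IP/ICMP headers)
      ping -M do -s 8972 <client-ip>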

      Second, playing with the NIC settings on the host device made little difference, but playing with the ones on the actual Hyper-V switch did, mainly some of the offloading and VLAN filtering settings.

      Third, we were never going to hit more than 10G anyway: there was a VLAN misconfiguration, so some of the devices were actually routing back to the FOG server instead of being switched to it, and the router only has 10G, so yeah…

      Anyway, long story short, we were hitting 7-8 Gbit last night with 8-9 devices imaging. This morning I’m running a steady 3+ Gbit with 4 devices at a time. If my weekend goes right I’ll end up deploying some labs that have 10 devices per row, with each row on a 10G uplink back to the core switch, so we’ll see if it can scale beyond 10G. I should be able to hit it with a full 40G worth of requests at once.

      https://imgur.com/LWoeOQC

      posted in FOG Problems
    • RE: Odd performance issue

      @george1421

      It is still virtualized; it was moved from a Server 2012 R2 Hyper-V host to a Server 2022 Hyper-V host. The prior storage subsystem was iSCSI across 10G to a shared storage server. The images themselves ran from a single SATA SSD.

      I realize that it doesn’t take a workhorse to run FOG. The server before that was an old HP MicroServer with a couple SSDs and a 10G NIC. Even with 4 cores and 4 GB of RAM, that machine would saturate the 10G uplinks and push to clients at a full gig each.

      I think the problem may have been found, but I’ll keep testing and update. The server itself runs an Intel XL710 40GbE NIC, and there is a driver feature called Virtual Machine Queues (VMQ) that some people have apparently had issues with. I’ll know in a few hours if things are better.

      posted in FOG Problems
    • Odd performance issue

      Devs,

      This summer I transitioned our FOG server from being a guest on a much older and slower Hyper-V platform to more modern hardware. Along the way we went from Ubuntu 16.04 to 22.04, etc.

      Now we’re in the thick of things getting ready to deploy hundreds of devices per day, and I’m seeing very strange behavior when it comes to image throughput. The first device will image at a full gig, the second will take an additional gig, etc., but from there it gets worse and eventually everything basically freezes. None of the underlying network infrastructure has changed apart from the migration from 10 Gig to 40 Gig on the server itself: still into the same switch, still across the same 10 gig to my lab, etc.

      The server itself has plenty of horsepower, and I can get 40 gig between it and other devices on the 40G backbone. Even running multiple copies or transfers at the same time, I don’t see the performance issues. The images are coming off SSD, so bandwidth there shouldn’t be an issue either. Load averages are low, and top shows nfsd taking 6-8% of the CPU per thread (16 cores are allocated to the FOG guest from a 128-core host). I’m perplexed as to why this odd behavior is happening. Even if I just transfer a file from /images via NFS to my workstation, I can pull 10G to my workstation (all I’ve got for a NIC).
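
      For what it’s worth, that transfer test is roughly along these lines (paths are placeholders for my setup; dd is just one way to watch the rate):

      # mount the FOG /images export on a test workstation
      mkdir -p /mnt/images
      mount -t nfs fogserver:/images /mnt/images
      # stream a large file and watch the throughput
      dd if=/mnt/images/<some-large-file> of=/dev/null bs=1M status=progress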

      Please see the screenshot below and let me know what more you need for diagnostic info. Thanks!

      https://imgur.com/a/zZk5ppA

      posted in FOG Problems
    • RE: Error Restoring Images - Clients having identity crisis

      @wayne-workman

      I wish it was that simple. It looks like it isn’t just one specific motherboard from MSI, and I have a bunch of ASRock machines that might have similar behavior; I will investigate this week.

      When we were working in the database our focus was the MSI units that were delivering ffffffff-based UUIDs, but we also noticed some with 000000s, and those seem to be the ASRock units.

      Not sure what the ideal solution is here, but it appears the UUID is problematic across multiple vendors.
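
      For anyone else checking their fleet, the UUID the firmware hands out can be read straight from SMBIOS (assuming dmidecode is installed), and the bad boards show up immediately:

      dmidecode -s system-uuid
      # problem boards report something like
      # FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF or 00000000-0000-0000-0000-000000000000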

      posted in FOG Problems
    • RE: Error Restoring Images - Clients having identity crisis

      1.5.0-RC9 SVN 6080

      dev-branch

      I’m not using multicast. I select the group, or a few individual machines for that matter, and schedule an instant deploy. They already have fog-client on them, they reboot as expected, and they start their task, until they get to the point of actually doing something. At that point they either error out or, worst case, they all start, the first one to finish deletes the task for 001, and then all the rest of the units totally fail as soon as partclone finishes because they try to get their tasks and there are none. They reboot and the whole fiasco starts over again, because 002 still has an active task so PXE sends it to restore, but once it gets to that point it says there is no task for 001 and reboots.

      I can give access to the FOG server as well as anything else that might be helpful. I really think at this point it is some kind of database issue. I sent Tom a PM because I felt this was more of a localized problem than a project issue, but I never got a reply and I have to get this running today.

      I also took my screenshot on the reports page so you can see that each host shows up correctly with its individual MAC address, so it isn’t a MAC duplication issue.
      0_1508505215001_Screenshot 1.png

      Thanks

      posted in FOG Problems
    • Error Restoring Images - Clients having identity crisis

      Upgrading to the latest git version solved my image resizing problems, but now I’m having another issue. I did a complete backup of my latest master and tried to deploy it to a computer lab, and each client seems to be reliant on xxxx-001. For example, I backed up from a computer named IMG-A85XMA and am restoring to LAB103-001 through LAB103-027, but as soon as the task completes on 001, every other machine fails stating “There is no active task for 001”, even though the menu shows their names correctly, they report the correct MAC, and when they error out and go to the reboot screen they show the correct MAC and name.

      If I power off unit 001 and give it a deploy task, then unit 002 or any other unit will complete its task, but when it does, it removes 001’s task from the scheduled tasks, and the next unit in line and all the other units tasked with a deploy fail. See the screenshot below showing the hostname as 002 while it expects a task for 001. The odd part is that, if anything, I would expect it to be tied to the original hostname of the computer that was captured, but that is not what it is expecting.

      0_1508468752839_20171016_124811.png

      I have tried completely deleting the host registrations, recreating the registrations, and creating a new image. It appears to be some massive confusion in the task scheduling or the database?

      posted in FOG Problems
    • RE: Failed to set guid, doesn't resize.

      As an update, I created a new image, named the same thing with an extra character at the end, changed my image master and one client over to the new image, captured, and deployed, and the results are the same. So I can say it isn’t something being held over from the old 1.3 image configuration.

      d1.partitions:

      label: gpt
      label-id: 3B297ACC-51F5-4B3A-8A67-2227339DA914
      device: /dev/nvme0n1
      unit: sectors
      first-lba: 2048
      last-lba: 500118158
      
      /dev/nvme0n1p1 : start=        2048, size=      921600, type=DE94BBA4-06D1-4D40-A16A-BFD50179D6AC, uuid=A15FB31F-4441-488E-9456-89B9B7461E1F, name="Basic data partition"
      /dev/nvme0n1p2 : start=      923648, size=      202752, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=5DF33B32-90B4-4D42-8B08-365B8DA0A994, name="EFI system partition"
      /dev/nvme0n1p3 : start=     1126400, size=       32768, type=E3C9E316-0B5C-4DB8-817D-F92DF00215AE, uuid=D2C44AF3-54B4-4A39-9E78-EEFB1F771027, name="Microsoft reserved partition"
      /dev/nvme0n1p4 : start=     1159168, size=   498958336, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=28132F21-6D6D-4F0E-B09B-1FAFADFF097F, name="Basic data partition"
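
      (For what it’s worth, that file is a plain sfdisk dump, so the layout it describes is the sort of thing that could be written back to a disk with something like the command below. I’m not claiming that’s exactly what FOG’s scripts do.)

      # re-apply the saved layout to the target disk (destructive, example only)
      sfdisk /dev/nvme0n1 < d1.partitions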
      

      p1, p2, and p3 are properly transferred and they are listed in fixed_size. It appears FOG even thinks p4 resized properly, but it never does.

      0_1502571160195_81bfa108-95fb-4ee1-bf6e-c1f47f9a21fc-image.png
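
      For context, this is roughly how p4 can be checked after a deploy, run on the client (device names match the dump above; the growpart/ntfsresize line is only a hypothetical manual workaround, not necessarily how FOG handles the resize):

      # compare p4's end sector against the disk's last usable sector
      sfdisk -l /dev/nvme0n1
      # hypothetical manual fix if it never grows:
      # growpart /dev/nvme0n1 4 && ntfsresize -f /dev/nvme0n1p4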

      posted in FOG Problems
    • Failed to set guid, doesn't resize.
      Server
      • FOG Version: 1.4.4
      • OS: Ubuntu 16.04
      Client
      • Service Version: 0.11.15
      • OS: Windows 10 Enterprise
      Description

      Recently upgraded a FOG 1.3 installation to 1.4.4. Using the same images created with 1.3, workstations now fail with “failed to set GUID”. Originally it was an error and everything would stop at that point; the system would reboot and try to rerun the task because it failed to complete. Applying the init.xz files from thread 9948 (https://forums.fogproject.org/topic/9948/single-disk-resizable-failed-to-set-disk-guid/6) changed it from an error to a warning. The task now completes and the units are no longer stuck in the reboot-and-reimage loop, but the partitions never resize.

      I’ve searched the threads to no avail on how to correct this issue. Removing the d1.original.uuids file makes the errors go away but does not fix the resize issue. I’ve even re-pushed the image to the server with the .uuids file missing, after the init.xz change, and it recreated the same problems.
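
      For anyone digging into this, one way to compare what was captured against what the client’s disk actually ends up with (the image path here is just an example; adjust the device name for your client):

      # disk/partition GUIDs FOG saved with the image
      cat /images/<image-name>/d1.original.uuids
      # disk GUID the deployed client actually reports
      sgdisk -p /dev/nvme0n1 | grep -i 'disk identifier'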

      Thanks in advance!

      posted in FOG Problems