Posts made by entr0py

entr0py

TL;DR,

If you’re seeing performance issues scaling high-bandwidth FOG servers that should have the hardware horsepower to make it work, check your NIC settings, especially when virtualized. The txqueuelen parameter in Linux can make a huge difference. If it’s virtualized, look into the settings on both the host and the guest.

You should be able to get >10 Gbit/s from your FOG server if you have hardware that will support it. We hit ours with 20 devices this afternoon, all pulling 47 GB images, and the whole shebang was over in 7 minutes.
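
As a rough sanity check on those numbers, the aggregate rate can be worked out on the back of an envelope. This is only a sketch: `aggregate_gbps` is a made-up helper name, and it assumes 1 GB = 10^9 bytes, that every device pulled the full 47 GB, and that the transfers filled the whole 7 minutes.

```shell
# Aggregate throughput in Gbit/s for N devices each pulling SIZE_GB in MINUTES.
# Usage: aggregate_gbps DEVICES SIZE_GB MINUTES
aggregate_gbps() {
  awk -v n="$1" -v gb="$2" -v min="$3" \
    'BEGIN { printf "%.1f\n", n * gb * 8 / (min * 60) }'
}

# 20 devices x 47 GB in 7 minutes:
aggregate_gbps 20 47 7   # prints 17.9
```

So under those assumptions the server averaged somewhere around 18 Gbit/s, comfortably past the 10G mark.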

Solved!

entr0py

@brakcounty

root@fogserver:~# ethtool eth0 | grep Speed
Speed: 40000Mb/s

Raising txqueuelen on the interface to 40K (1K seems to have been the default since 1 gig cards became kind of standard) got rid of the falling-bandwidth issue that was present on the throughput graph yesterday. I’m wondering if it hit a buffer overrun or some other kind of nonsense at the kernel level, and that was why it would run great for a while, then once it threw enough errors or whatever it started falling on its face? IDK.
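
For anyone wanting to try the same thing, a minimal sketch of checking and raising the queue length follows. Assumptions on my part: the interface is `eth0` as in the ethtool output above, "40K" means 40000, and the helper names and the `SYSFS_NET` override are made up for illustration. The function only prints the `ip link` command instead of running it, and a value set this way does not survive a reboot.

```shell
# Location of the per-interface sysfs entries; overridable so the helpers
# can be tried without touching real hardware.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the current transmit queue length for an interface.
show_txqueuelen() {
  cat "$SYSFS_NET/$1/tx_queue_len"
}

# Print (not run) the command that would raise txqueuelen to the target,
# or a note if the interface is already at or above it.
suggest_txqueuelen() {
  iface="$1"
  target="$2"
  current="$(show_txqueuelen "$iface")" || return 1
  if [ "$current" -lt "$target" ]; then
    echo "ip link set dev $iface txqueuelen $target"
  else
    echo "# $iface txqueuelen is already $current"
  fi
}

# Example (run the printed command as root once you are happy with it):
#   suggest_txqueuelen eth0 40000
```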

As for the Hyper-V settings, I was noticing odd CPU behavior on the Hyper-V host when I’d saturate the network from the FOG guest. Setting RSS to NUMA scaling instead of closest processor static got rid of that behavior.

txqueuelen made the biggest difference in getting it to a stable state, then the NIC settings increased the total throughput. A ton of other things were played with too, so I’m not exactly sure which “tweaks” helped, but those two were the largest factors I noticed.

entr0py

Update - found a couple issues lurking.

First, I think if you’re going to run 40G, and maybe even 10G to an extent, you need to play with the txqueuelen parameter on your Ethernet interface. Raising that seemed to help things a bunch. I’d like to go to jumbo frames too but I’m not brave enough to make that leap yet.
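
On the jumbo frames question, one low-risk way to find out whether a path actually carries large frames is a ping with fragmentation prohibited. A sketch, assuming IPv4 with no IP options, a 9000-byte MTU, and the iputils `ping`; `jumbo_ping_cmd` is a made-up helper and the address is a placeholder. It only prints the command so nothing is sent until you choose to run it.

```shell
# Build a ping command whose packets exactly fill the given MTU.
# Payload = MTU - 20 bytes (IPv4 header) - 8 bytes (ICMP header).
# "-M do" forbids fragmentation, so an undersized hop makes the ping fail
# instead of silently fragmenting.
jumbo_ping_cmd() {
  mtu="$1"
  host="$2"
  payload=$((mtu - 28))
  echo "ping -M do -c 3 -s $payload $host"
}

jumbo_ping_cmd 9000 192.0.2.10   # prints: ping -M do -c 3 -s 8972 192.0.2.10
```

If that ping fails while the same command built for an MTU of 1500 succeeds, something between the two hosts is not passing jumbo frames.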

Second, playing with the settings in the NIC on the host device made little difference, but playing with the ones on the actual Hyper-V switch did. Mainly some of the offloading and VLAN filtering settings.

Third, we were never going to hit more than 10G anyway: there was a VLAN misconfiguration, so some of the devices were being routed back to the FOG server instead of being switched there, and the router only has 10G, so yeah…
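
A quick way to spot that kind of thing from the server side is to ask the kernel how it would reach a given client. The sketch below assumes the iproute2 `ip route get` command, whose output contains a `via <gateway>` hop when the destination is reached through a router; `path_kind` is a made-up helper and the client address is a placeholder. It only sees the server's outbound path, so it will not catch every asymmetric setup.

```shell
# Classify `ip route get` output read from stdin:
# a "via <gateway>" hop means the traffic leaves through a router,
# otherwise the destination is on a directly attached network.
path_kind() {
  if grep -q ' via '; then
    echo routed
  else
    echo on-link
  fi
}

# Example, run on the FOG server against a client address:
#   ip route get 192.0.2.50 | path_kind
```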

Anyway, long story short, we were hitting 7-8 Gbit last night with 8-9 devices imaging. This morning I’m running a steady 3+ with 4 devices at a time. If my weekend goes right I’ll end up deploying some labs that have 10 devices per row, and each row has a 10G uplink back to the core switch, so we’ll see if it can scale beyond 10G. I should be able to hit it with a full 40G worth of requests at once.

https://imgur.com/LWoeOQC

entr0py

@george1421

It is still virtualized; it was moved from a Server 2012 R2 Hyper-V host to a Server 2022 Hyper-V host. The prior storage subsystem was iSCSI across 10G to a shared storage server. The images themselves ran from a single SATA SSD.

I realize that it doesn’t take a workhorse to run FOG. The server before that was an old HP Microserver with a couple SSDs and a 10G nic. Even with 4 cores and 4 GB RAM that machine would saturate the 10G uplinks and push to clients at a full gig each.

I think the problem may have been found but I’ll keep testing and update. The server itself runs an Intel XL710 40GbE NIC and there is a driver function called Virtual Machine Queues and apparently some people have experienced issues with that. I’ll know in a few hours if things are better.

entr0py

Devs,

This summer I transitioned our FOG server from being a guest on a much older and slower Hyper-V platform to more modern hardware. Along the way we went from Ubuntu 16.04 to 22.04, etc.

Now we’re in the thick of things getting ready to deploy hundreds of devices per day and I’m seeing very very strange behavior when it comes to image throughput. The first device will image at a full gig, the second will take an additional gig, etc. From there, it gets worse and eventually everything will basically freeze. None of the underlying network infrastructure has changed apart from the migration from 10 Gig to 40 Gig on the server itself, still into the same switch, still across the same 10 gig to my lab, etc.

The server itself has plenty of horsepower and I can get 40 gig between it and other devices on the 40G backbone. Even running multiple copies or transfers at the same time I don’t see the performance issues. The images are coming off SSD so bandwidth there shouldn’t be an issue either. Load averages are low, and top shows nfsd taking 6-8% of the CPU per thread (16 cores are allocated to the FOG guest from a 128 core host). I’m perplexed as to why this odd behavior occurs. Even if I just transfer a file from /images via NFS to my workstation I can pull 10G (all I’ve got for a NIC).
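
In case it helps with diagnostics, here is a small sketch that collects the NIC-side numbers in one go from the standard Linux sysfs files (link speed, MTU, transmit queue length, and drop/error counters). The function name and the `SYSFS_NET` override are made up for illustration, and which counters matter here is a guess on my part.

```shell
# Location of the per-interface sysfs entries; overridable for dry runs.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print key=value lines for whichever of the listed sysfs files are readable.
nic_snapshot() {
  base="$SYSFS_NET/$1"
  for f in speed mtu tx_queue_len \
           statistics/tx_dropped statistics/tx_errors \
           statistics/rx_dropped statistics/rx_errors; do
    [ -r "$base/$f" ] || continue
    # Some files (e.g. speed on a down link) exist but fail to read.
    val="$(cat "$base/$f" 2>/dev/null)" || continue
    printf '%s=%s\n' "$f" "$val"
  done
}

# Example: take one snapshot before a deploy and one while it is stalling,
# then compare the counters.
#   nic_snapshot eth0
```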

Please see the screenshot below and let me know what more you need for diagnostic info. Thanks!

https://imgur.com/a/zZk5ppA

entr0py

@wayne-workman

I wish it was that simple. Looks like it isn’t just one specific motherboard from MSI, and I have a bunch of ASRock machines that might have similar behavior. I will investigate this week.

When we were working in the database our focus was the MSI units that were delivering ffffffff based UUIDs, but we also noticed some with 000000’s, and those seem to be the ASRock units.

Not sure what the ideal solution is here but it appears UUID is problematic across multiple vendors.
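
To find the affected machines ahead of time, something like the following could flag placeholder UUIDs. It is a sketch under my own assumption that a "bad" UUID is one made up entirely of f's or entirely of 0's; `uuid_is_placeholder` is a made-up name. On Linux the value can be read with `dmidecode -s system-uuid` or from `/sys/class/dmi/id/product_uuid` (both normally need root).

```shell
# Succeeds (exit 0) if the UUID consists only of f's or only of 0's.
uuid_is_placeholder() {
  # Drop the dashes and lowercase the hex digits.
  hex="$(printf '%s' "$1" | tr -d '-' | tr 'A-F' 'a-f')"
  [ -n "$hex" ] || return 1
  case "$hex" in
    *[!0]*) ;;          # something other than 0 present, keep checking
    *) return 0 ;;      # all zeros
  esac
  case "$hex" in
    *[!f]*) return 1 ;; # a mix of digits, looks like a real UUID
    *) return 0 ;;      # all f's
  esac
}

# Example:
#   if uuid_is_placeholder "$(cat /sys/class/dmi/id/product_uuid)"; then
#     echo "this board reports a placeholder UUID"
#   fi
```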

entr0py

1.5.0-RC9 SVN 6080

dev-branch

I’m not using multicast. I select the group, or a few individual machines for that matter, and schedule for instant deploy. They already have fog-client on them, and they reboot as expected and start their task, until they get to the point of actually doing something. At that point they either error out or, worst case, they all start, the first one done deletes the task for 001, and then all the rest of the units totally fail as soon as partclone finishes because they try to get their tasks and there are none. They reboot and the whole fiasco starts over again, because 002 still has an active task so PXE sends it to restore, but once it gets to that point it says no task for 001 and reboots.

I can give access to the fog server as well as anything else that might be helpful. I really think at this point it is some kind of database issue. I sent Tom a PM because I felt this was more of a localized problem not a project issue but I never got a reply and I have to get this running today.

I also took my screenshot on the reports page so you can see each host does show up correctly with their individual MAC address so it isn’t like a MAC duplication issue.
0_1508505215001_Screenshot 1.png

Thanks

entr0py

Upgrading to the latest git version solved my image resizing problems, but now I’m having another issue. I did a complete backup of my latest master and tried to deploy it to a computer lab, and each client seems to be reliant on xxxx-001. For example, I backed up from a computer named IMG-A85XMA and am restoring to LAB103-001 through LAB103-027, but as soon as the task completes on 001, every other machine fails stating “There is no active task for 001” even though the menu shows their names correctly, they are reporting the correct MAC, and when they error out and go to the reboot screen they show the correct MAC and name.

If I power off unit 001 and give it a deploy task, then unit 002 or any other unit will complete its task, but when it does it removes the task for 001 from the scheduled tasks, and thus the next unit in line and all other units tasked with a deploy fail. See the below screenshot showing the hostname as 002 and it expecting a task for 001. The odd part is that, if anything, I would expect it to be tied to the original host name of the computer that was captured, but that is not what it is expecting.

I have tried completely deleting the host registrations, recreating the registrations, a new image. It appears to be some massive confusion in the task scheduling or the database???

entr0py

As an update, I created a new image, named the same thing with an extra character at the end. Changed my image master and one client to be the new image, captured and deployed and the results are the same. I can say it isn’t something being held over from the old image configuration from 1.3.

d1.partitions:

label: gpt
label-id: 3B297ACC-51F5-4B3A-8A67-2227339DA914
device: /dev/nvme0n1
unit: sectors
first-lba: 2048
last-lba: 500118158

/dev/nvme0n1p1 : start=        2048, size=      921600, type=DE94BBA4-06D1-4D40-A16A-BFD50179D6AC, uuid=A15FB31F-4441-488E-9456-89B9B7461E1F, name="Basic data partition"
/dev/nvme0n1p2 : start=      923648, size=      202752, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=5DF33B32-90B4-4D42-8B08-365B8DA0A994, name="EFI system partition"
/dev/nvme0n1p3 : start=     1126400, size=       32768, type=E3C9E316-0B5C-4DB8-817D-F92DF00215AE, uuid=D2C44AF3-54B4-4A39-9E78-EEFB1F771027, name="Microsoft reserved partition"
/dev/nvme0n1p4 : start=     1159168, size=   498958336, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=28132F21-6D6D-4F0E-B09B-1FAFADFF097F, name="Basic data partition"

p1, 2, and 3 are properly transferred and they are listed in fixed_size. It appears fog even thinks p4 actually resized properly but it never does.
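
To compare what the image expects against what actually lands on a deployed disk, the sector counts in a dump like the one above can be turned into sizes. A sketch, assuming 512-byte sectors and the sfdisk dump layout shown above; `partition_mib` is a made-up helper name.

```shell
# Read an sfdisk-style dump on stdin and print "<device> <size in MiB>"
# for every partition line. Assumes 512-byte sectors.
partition_mib() {
  awk -F'[=,]' '/start=/ {
    split($1, a, " ")                      # a[1] is the device name
    printf "%s %.0f\n", a[1], $4 * 512 / 1048576
  }'
}

# Example: the image's view, then the deployed disk's view.
#   partition_mib < d1.partitions
#   sfdisk -d /dev/nvme0n1 | partition_mib
```

For the dump above this gives 450, 99, 16 and 243632 MiB for p1 through p4, so a deployed p4 that is smaller than the space available on its disk would confirm that the resize step never ran.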

0_1502571160195_81bfa108-95fb-4ee1-bf6e-c1f47f9a21fc-image.png

entr0py

Server

FOG Version: 1.4.4
OS: Ubuntu 16.04

Client

Service Version: 0.11.15
OS: Windows 10 Enterprise

Description

Recently upgraded a FOG 1.3 installation to 1.4.4. Using the same images created with 1.3, workstations now start with a failed to set GUID message. Originally it was an error and everything would stop at that point. The system would reboot and try to rerun the task because it failed to complete. Applying the init.xz files from thread 9948 (https://forums.fogproject.org/topic/9948/single-disk-resizable-failed-to-set-disk-guid/6) changed it from an error to a warning. The task now completes and the units are not stuck in the reboot and reimage loop, but the partitions never resize.

I’ve searched the threads to no avail on how to correct this issue. Removing the d1.original.uuids causes the errors to go away but does not fix the resize issue. I’ve even re-pushed the image to the server with the .uuids missing, after the init.xz change, and it recreated the same problems.

Thanks in advance!

entr0py

@Tom-Elliott

Upgraded to 6007 on both host and node, made the change you suggested and still nothing. Same routine.

I created another node with all the same settings, placed it in the same storage group, disabled the old node and enabled the new one, and still nothing. If I select that storage group on the dashboard it shows 0/0/0 and the node doesn’t show in the list either.

entr0py

@Tom-Elliott

Don’t be sorry! Thanks for all you do!

entr0py

@Tom-Elliott

Nevermind… how about /var/www/html/fog/lib/fog

My bad.

entr0py

That might be the problem considering I don’t even have that file…

root@FOG-Virtual-Machine:/var/www/fog/lib/fog# ls -lsa *quests*
ls: cannot access *quests*: No such file or directory

I only have one file in the whole lib/fog directory with a newer date than August 13 2014…
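
A quick way to list what actually changed after a given date is shown in the sketch below. It assumes GNU find (for `-newermt`); `files_newer_than` is a made-up wrapper.

```shell
# List regular files under DIR modified after DATE (GNU find syntax).
# Usage: files_newer_than DIR DATE
files_newer_than() {
  find "$1" -type f -newermt "$2"
}

# Example:
#   files_newer_than /var/www/fog/lib/fog '2014-08-13'
```

If an upgrade really replaced the web files, most of that directory should show up in the output.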

entr0py

Server

FOG Version: 1.3.0-RC20
OS: Ubuntu 12.04

Client

Service Version: N/A
OS: N/A

Description

Upgraded to RC20, when I go to Host Management>New Search and put anything in the search box the following error appears.

0_1478480932398_Unexpected token R.png

Searching by going to List All Hosts and using the filters works fine.

Thanks guys!

entr0py

@Tom-Elliott

Any thoughts on why the storage node still doesn’t work? If I try to create a task it says the node doesn’t exist. If I try to create a task I get the following:

If I try to show the node on the dashboard it isn’t even listed:

0_1478480581836_node list.png

If I try to image a computer using “Deploy Image” on GRUB menu:

entr0py

Also, either user should work given the grants in MySQL.

mysql> show grants for fog;
+----------------------------------------------------------------------------------------------------+
| Grants for fog@%                                                                                   |
+----------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'fog'@'%' IDENTIFIED BY PASSWORD '*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19' |
| GRANT ALL PRIVILEGES ON fog.* TO 'fog'@'%' WITH GRANT OPTION                                       |
+----------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> show grants for fogstorage;
+-----------------------------------------------------------------------------------------------------------+
| Grants for fogstorage@%                                                                                   |
+-----------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'fogstorage'@'%' IDENTIFIED BY PASSWORD '*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19' |
| GRANT ALL PRIVILEGES ON fog.* TO 'fogstorage'@'%' WITH GRANT OPTION                                       |
+-----------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql>

entr0py

@Tom-Elliott

Sorry that what I posted was so confusing. I was just playing around with a ton of options trying to find something that might work before I started to complain to the overworked and underpaid developers. snmysqluser='fogstorage' is already set on the node. Passwords all match. I can FTP between the boxes using the management user “fog” with the password “password”, and MySQL works with “fogstorage” and “password” as well. The user works as shown below:

root@node0-MicroServer-Gen8:/opt/fog# mysql -h 172.16.0.13 -u fogstorage -ppassword -A fog
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2984
Server version: 5.5.53-0ubuntu0.12.04.1 (Ubuntu)

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SELECT * FROM hostMAC WHERE hmHostID=0;
Empty set (0.00 sec)

It also matches the web interface settings.

Attached are both .fogsettings files.

0_1478478272991_dotfogsettings-host.txt
0_1478478279431_dotfogsettings-node.txt

entr0py

@Wayne-Workman

0_1478476447405_error.log.txt

Attached.

Nothing in the last 3ish hours…

root@FOG-Virtual-Machine:/var/log/apache2# date
Sun Nov 6 16:55:33 MST 2016

entr0py

@Tom-Elliott

This problem seems to exist in RC20 as well. I just did an upgrade of our fog system to RC20 from a much older SVN (RC8 I think?) and have the exact same error. I’ve dropped users, recreated them, I can mount nfs, ftp, ssh and mysql between node and the full Fog server with the passwords and users as specified in the configs with no issues but the existing storage nodes are dead. Trying to add another storage node results in the Storage ID 0 is not valid error as described.

No firewalls, nothing new. Did a straight svn up from the fogproject folder and an installfog.sh as usual. No hardware or software changes on either machine and this is (was) a running production system a few hours ago that had been deploying and capturing from the nodes just fine. I’ve done basically everything possible to make this as insecure as possible, granting all with %'s and everything to no avail.

0_1478464694876_Fog Issues.png

SERVER

FOG Version: 1.3.0 RC20
OS: Ubuntu 12.04
IP: 172.16.0.13

I’ve done the following on the server:

mysql> SET PASSWORD FOR 'fogstorage'@'%' = PASSWORD('password');
Query OK, 0 rows affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON fog.* TO 'fogstorage'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)

mysql> SET PASSWORD FOR 'fog'@'%' = PASSWORD('password');
Query OK, 0 rows affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON fog.* TO 'fog'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)

NODE0

FOG Version: 1.3.0 RC20
OS: Ubuntu 12.04
IP: 172.16.0.16

Results on the Node:

root@node0-MicroServer-Gen8:/images# mysql -u fogstorage -h 172.16.0.13 -ppassword -D fog
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1617
Server version: 5.5.53-0ubuntu0.12.04.1 (Ubuntu)

root@node0-MicroServer-Gen8:/images# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.5 LTS
Release: 12.04
Codename: precise

0_1478464835324_Storage Management Page.png