Posts made by tomierna

tomierna

@Tom-Elliott That was the ticket!

The snapins I imported many moons ago from a pre-1.x version of FOG turned out to have the full path in the snapins table, and had 0 in the sgaStorageGroupID field (the storage group) in the snapinGroupAssoc table.

I removed the full path in the snapins table and set the sgaStorageGroupID to 1, and now I can get to the snapins list page and the host page.
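For anyone checking their own data, here is a minimal sketch of the two checks described above, run against made-up rows rather than the real database. The key names "id", "file" and "snapin_id" and the sample paths are hypothetical; only sgaStorageGroupID comes from the actual tables mentioned in this post.

```python
import posixpath

# Hypothetical rows: the key names "id", "file" and "snapin_id" are made up
# for this sketch. Only sgaStorageGroupID is a field name from the post.
snapins = [
    {"id": 1, "file": "/opt/fog/snapins/old_tool.exe"},  # full path (old import)
    {"id": 2, "file": "new_tool.exe"},                    # bare file name
]
group_assoc = [
    {"snapin_id": 1, "sgaStorageGroupID": 0},  # the bad value found in the post
    {"snapin_id": 2, "sgaStorageGroupID": 1},
]


def find_problems(snapins, group_assoc):
    """Return (snapin IDs storing a full path, snapin IDs with group ID 0)."""
    full_path = [s["id"] for s in snapins
                 if posixpath.basename(s["file"]) != s["file"]]
    zero_group = [a["snapin_id"] for a in group_assoc
                  if a["sgaStorageGroupID"] == 0]
    return full_path, zero_group


def repair(snapins, group_assoc, group_id=1):
    """Strip directories from file names and point group ID 0 at group_id."""
    for s in snapins:
        s["file"] = posixpath.basename(s["file"])
    for a in group_assoc:
        if a["sgaStorageGroupID"] == 0:
            a["sgaStorageGroupID"] = group_id


print(find_problems(snapins, group_assoc))  # problems before the repair
repair(snapins, group_assoc)
print(find_problems(snapins, group_assoc))  # problems after the repair
```

On a real install the same two checks would be queries against the snapins and snapinGroupAssoc tables; back up the database before changing anything.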

Thanks for the pointer!

tomierna

@Tom-Elliott According to the Storage page, the DefaultMember has the correct path for Images and Snapins:

Screenshot 2024-08-30 at 10.55.11 AM.png

I can’t get to the Snapins list page, probably for similar reasons. According to the snapins table, some of my older snapins have the full path, and the newer ones don’t. Is there another table or set of fields where I can verify the correct storage group is associated with the snapins?

Screenshot 2024-08-30 at 10.56.48 AM.png

tomierna

@Tom-Elliott I was able to post it just now, so whatever you did seems to have worked!

Thanks!

tomierna

I’ve been nursing a pretty old FOG install on CentOS for a few years and decided to spin up a new server running Ubuntu and transfer everything instead of trying to update in place.

I successfully got the images and database transferred, and the installer runs without error, but then when I get into the new system, clicking on any host or a Deploy button results in a partially rendered page.

There is an underlying PHP error when this occurs:

[Thu Aug 29 19:43:25.376709 2024] [proxy_fcgi:error] [pid 65111] [client 192.168.10.85:53400] AH01071: Got error 'PHP message: PHP Fatal error:  Uncaught ValueError: min(): Argument #1 ($value) must contain at least one element in /var/www/html/fog/lib/fog/snapin.class.php:392\nStack trace:\n#0 /var/www/html/fog/lib/fog/snapin.class.php(392): min()\n#1 /var/www/html/fog/lib/fog/snapin.class.php(344): Snapin->getPrimaryGroup()\n#2 /var/www/html/fog/lib/router/route.class.php(1327): Snapin->getStorageGroup()\n#3 /var/www/html/fog/lib/router/route.class.php(487): Route::getter()\n#4 /var/www/html/fog/lib/pages/hostmanagementpage.class.php(1806): Route::listem()\n#5 /var/www/html/fog/lib/pages/hostmanagementpage.class.php(3264): HostManagementPage->hostSnapins()\n#6 /var/www/html/fog/lib/fog/fogpagemanager.class.php(220): HostManagementPage->edit()\n#7 /var/www/html/fog/management/index.php(58): FOGPageManager->render()\n#8 {main}\n  thrown in /var/www/html/fog/lib/fog/snapin.class.php on line 392', referer: http://fog.molecularmedia.com/fog/management/index.php?node=host&sub=search

The new system is running Ubuntu 24.04 which comes with PHP 8.3. Are there known problems with this version of PHP with FOG Project, or is there something else wrong with my install?

This error is happening with both the stable branch and the dev branch.
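For reference, the trace shows min() inside Snapin->getPrimaryGroup() being handed an empty list. As far as I know, PHP 8 throws a ValueError in that case where older PHP versions only raised a warning, which would explain why the same data worked on the old server. Python's min() behaves the same way, so here is a small analogy. The function names are hypothetical and this is not FOG's code.

```python
def primary_group(group_ids):
    """Pick the lowest storage group ID; fails if the list is empty."""
    return min(group_ids)


def primary_group_safe(group_ids, fallback=None):
    """Same lookup, but return a fallback instead of raising on an empty list."""
    return min(group_ids, default=fallback)


print(primary_group([3, 1, 2]))   # lowest ID of a non-empty list
print(primary_group_safe([]))     # fallback value, no exception

try:
    primary_group([])             # an item with no storage group association
except ValueError as err:
    print("ValueError:", err)
```

So the fatal error is more likely a data problem (something with no usable storage group) that PHP 8 no longer tolerates than a general PHP 8.3 incompatibility.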

Here’s what the partially rendered host page looks like:
Screenshot 2024-08-29 at 4.07.29 PM.png

tomierna

I can’t post a Linux technical question to this forum because Akismet keeps flagging it as spam.

I don’t know if this post will go through until I click the button.

@admin - any insight?

tomierna

@george1421 I might just try that, just for troubleshooting purposes.

Re: firmware - There is a BIOS update for the machines, and a firmware update for the Samsung NVMe drive. Sadly, trying these was my first troubleshooting step (not listed here, because it was before I suspected components of FOG). I sure was holding my breath that it was the drive firmware though!

tomierna

We bought 50 of these machines and one arrived with a cracked screen. I just received the replacement from the RMA of that broken machine, and of course it images at full speed.

The replacement machine came with a Samsung m.2 drive, part: MZVLW256HEHP-000L7

The other 49 machines have the Lenovo equivalent: LENSE20256GMSP34MEAT2TA

I’ve contacted my Lenovo rep with the hopes that I can work with an engineer to narrow down a fix.

tomierna

@tom-elliott I’m pretty stumped myself.

And why does it matter on the FOS Client that it is not NTFS? Fuse NTFS version differences between FOS and Ubuntu?

tomierna

So apparently on the Ubuntu machine, as long as the partition is mounted, a restore is fast.

On the FOS Client, the partition has to be formatted as a FS other than NTFS and mounted.

I’m too far down the rabbit hole to see how this makes any sense.

tomierna

@sebastian-roth Thank you, Sebastian.

This is getting weirder by the day.

I went back to the Ubuntu test machine today to try and look for differences, and partclone.restore from NFS to the m.2 SSD ran at expected speeds!

Going back through my shell history, I noticed I had never unmounted the partition I was cloning onto.

So, after the restore completed, I unmounted the partition and ran the partclone.restore again. Boom, slow.

Then remounted, re-ran command, boom, fast again.

I did this a few more times to make sure I wasn’t seeing things, but sure enough, on the Ubuntu machine, when the target partition is mounted, partclone.restore writes at GbE speeds. When the target partition is not mounted, the restore speed falls to about 450MB/min.

I tried this on the FOG Client machine, but partclone exits because it knows the partition is mounted.

Thinking this might be due to the partclone version 0.2.89 on FOS, I copied over the 0.3.11 binaries and libraries.

This allowed it to run the clone despite the partition being mounted, but it was still slow.

I looked back at the history on the Ubuntu machine, and the FS I had mounted the first time I had a fast restore was ext4. Subsequent times it was NTFS (from the image).

So, I did an mkfs.ext4 on the partition on the FOS machine, mounted it, and ran the partclone. IT RAN AT GbE SPEEDS!!!

However, subsequent unmount/remount did not allow another restore to run quickly. I’m just about to test formatting as ext2 and trying the restore with that mounted to see if it matters which FS.
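Since the result depends on whether the target is mounted and with which filesystem, it helps to record that before every run. A minimal sketch that reads the mount state from /proc/mounts-style text; the device name and mount point in the sample are hypothetical.

```python
def mount_info(proc_mounts_text, device):
    """Return (mount_point, fs_type) for device, or None if it is not mounted."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[0] == device:
            return fields[1], fields[2]
    return None


# Hypothetical sample in the /proc/mounts format.
SAMPLE = """\
/dev/nvme0n1p2 /mnt/target ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""

print(mount_info(SAMPLE, "/dev/nvme0n1p2"))
print(mount_info(SAMPLE, "/dev/sda1"))
```

On a live system the text would come from open("/proc/mounts").read(). Note that NTFS mounted through a FUSE driver typically shows up there with the type fuseblk rather than ntfs.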

tomierna

@george1421 I’ve tested partclone over NFS to m.2 under Ubuntu 18.04 now.

The exact same issue is happening there with partclone.

I ran partclone.restore to /dev/null, from the FOG NFS images share to get a non-writing baseline of network performance, and it showed 6.8GB/min.

Then I ran partclone.restore to the m.2 drive, and it started at 14GB/min, and by 4% it was down to 2GB/min. By 50% it was down to 450MB/min.
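To put those per-minute figures next to the usual MB/s numbers, here is a small conversion sketch. It assumes partclone's GB and MB are decimal units, which is my assumption and not something I have checked in the partclone source.

```python
def per_min_to_mb_per_s(value, unit="GB"):
    """Convert a rate in GB/min or MB/min (decimal units assumed) to MB/s."""
    factor = {"GB": 1000.0, "MB": 1.0}[unit]
    return value * factor / 60.0


# Figures quoted in this post.
for label, value, unit in [
    ("restore to /dev/null", 6.8, "GB"),
    ("m.2 at start", 14, "GB"),
    ("m.2 at 4%", 2, "GB"),
    ("m.2 at 50%", 450, "MB"),
]:
    print(f"{label}: {per_min_to_mb_per_s(value, unit):.1f} MB/s")
```

For comparison, gigabit Ethernet tops out at 125 MB/s before protocol overhead, so the /dev/null baseline is close to line rate and the 50% figure is more than an order of magnitude below it.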

The /var/log/partclone.log showed multiple writes per buffer, like I outlined in another post.

I guess it’s time for me to post in the partclone forums?

tomierna

@george1421 LOL, yeah. Super frustrating.

It really does seem like an interaction between partclone.restore and the m.2 ssd (or maybe the FOS kernel’s support of that device).

Right now I’m running a partclone.restore from the command line of a debug deploy from the NFS share to the external USB3 SSD I’ve got connected. Solid 7.3GB/min.

Tomorrow I will try booting from Ubuntu Live and installing Partclone to see if the same problem exists there. Maybe that will show what part of the NVMe subsystem needs tweaking in the FOS kernel.

tomierna

@george1421 I’ve already copied via NFS with rsync to the internal m.2 drive, at GbE speeds.

I also excluded pigz and cat by pre-decompressing the image and trying the partclone.restore command from the command line.

[edit]I just did another test while imaging six t410i machines via unicast at the same time: rsync from NFS to the internal m.2 drive was getting 60MB/s while each of the unicasts was doing 5.5GB/min (91MB/s). About halfway through the rsync, some of the unicasts finished, and the rsync took up the freed bandwidth, peaking at 110MB/s.[/edit]

tomierna

@george1421 - I’ve restored an image over the Ubuntu install, but I will try a live boot and see if I can do the lspci command from there.

Re: write speed to the m.2 SSD within the FOS debug session:

dd if=/dev/zero of=./test1.img bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.40934 s, 762 MB/s

dd if=/dev/zero of=./test1.img bs=2G count=16 oflag=direct iflag=fullblock
16+0 records in
16+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 159.576 s, 215 MB/s

A larger file is slower, but still way faster than GbE speeds.
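As a sanity check on those numbers, this sketch recomputes the rate from the byte count and elapsed time in each dd summary line and compares it with the gigabit line rate. The regular expression is written for the summary format shown above and is not a general dd parser.

```python
import re

# Matches summary lines like:
# "1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.40934 s, 762 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .* copied, ([\d.]+) s,")

GBE_LINE_RATE_MB_PER_S = 1e9 / 8 / 1e6  # 1 Gbit/s in decimal MB/s


def dd_rate_mb_per_s(summary_line):
    """Recompute throughput (decimal MB/s) from a dd summary line."""
    match = DD_SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a dd summary line")
    n_bytes, seconds = int(match.group(1)), float(match.group(2))
    return n_bytes / seconds / 1e6


SMALL = "1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.40934 s, 762 MB/s"
LARGE = "34359738368 bytes (34 GB, 32 GiB) copied, 159.576 s, 215 MB/s"

for line in (SMALL, LARGE):
    rate = dd_rate_mb_per_s(line)
    print(f"{rate:.0f} MB/s, {rate / GBE_LINE_RATE_MB_PER_S:.1f}x GbE line rate")
```

Even the slower 32 GiB run writes well above what a single gigabit link can deliver, so raw sequential writes with oflag=direct do not look like the limit.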

tomierna

A post on a partclone mailing list mentioned a test to determine whether write I/O is the bottleneck: restore to /dev/null.

I tried that, and got a solid 13GB/min from the SSD and 7.3GB/min from the NFS share.

This tells me write performance to the m.2 drive is probably the culprit.

Any kernel parameters I should be looking at? I will be doing a diff between sysctl -a on an Ubuntu 18.04 machine and the FOS client kernel.
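A plain text diff of two sysctl -a dumps is noisy, so here is a sketch that compares them key by key instead. It assumes the usual "key = value" line format; the sample values are made up for illustration and are not taken from either machine.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines that do not look like "key = value"
            settings[key.strip()] = value.strip()
    return settings


def sysctl_diff(a_text, b_text):
    """Return {key: (value_in_a, value_in_b)} for keys that differ or are missing."""
    a, b = parse_sysctl(a_text), parse_sysctl(b_text)
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}


# Made-up sample dumps; real ones would come from running sysctl -a on each box.
UBUNTU = "vm.dirty_ratio = 20\nvm.dirty_background_ratio = 10\n"
FOS = "vm.dirty_ratio = 40\nvm.dirty_background_ratio = 10\nvm.swappiness = 60\n"

for key, (ubuntu_value, fos_value) in sysctl_diff(UBUNTU, FOS).items():
    print(f"{key}: ubuntu={ubuntu_value} fos={fos_value}")
```

A value of None on one side means the key only exists on the other kernel, which is itself worth noting when the two kernels are built with different options.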

tomierna

Another day, another data point:

I’ve copied the decompressed image file to an external USB3 SSD (over NFS, at 100+MB/s with rsync), and while in the debug-deploy shell, I ran partclone using the SSD as the source.

The partclone session started out fast, but as with the NFS-based sessions, it started to slow down after about 5%. By 7%, transfer speed from the SSD was at around 800MB/min, and by 10% it was at 600MB/min.

/var/log/partclone.log showed similar write fragmentation patterns to what I posted last night.

I’m going to look next at kernel tunables to see if there are any io buffers I can set to be larger.

tomierna

Doing some more testing.

I finally compiled partclone 0.2.88, 0.2.91 and 0.3.11 and copied the binaries and enough of the libraries over to my debug-deploy machine so that they would run without error.

None of the versions made any difference in deploy speed. 0.2.91 and 0.3.11 took longer to write the GPT, but I think I remember reading about that somewhere.

Next I used gunzip to expand one of my partition files so I could run partclone from the shell to see if excluding pigz and cat from the FOS machine made a difference. It did not.

I added -d2 to the command to increase the error log verbosity.

The log doesn’t show any errors, but it does show all the reads and writes. The default buffer is 1MB, and so each of the reads is 256 4096-byte blocks.

Many of the writes say “write 1048576, 0 left”, but a large number of the writes appear to be fragmented. Here are three read/write cycles, with the first being non-fragmented, and the next two being fragmented:

read more: io_all: read 1049600, 0 left.
io_all: write 1048576, 0 left.
blocks_read = 256 and copied = 535552
read more: io_all: read 1049600, 0 left.
io_all: write 753664, 0 left.
io_all: write 57344, 0 left.
io_all: write 237568, 0 left.
blocks_read = 256 and copied = 535808
read more: io_all: read 1049600, 0 left.
io_all: write 204800, 0 left.
io_all: write 327680, 0 left.
io_all: write 360448, 0 left.
io_all: write 155648, 0 left.
blocks_read = 256 and copied = 536064

I don’t yet have a comparison log from a t410i to see if this type of write fragmentation pattern is normal, or if it’s potentially a reason for the slowdown.
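To compare logs between machines, the write pattern can be summarized per read cycle. This sketch groups the write sizes that follow each read in the excerpt above and checks that every cycle still adds up to one full 1 MiB buffer (256 blocks of 4096 bytes). The parsing is based only on the line formats visible in this excerpt.

```python
import re

LOG = """\
read more: io_all: read 1049600, 0 left.
io_all: write 1048576, 0 left.
blocks_read = 256 and copied = 535552
read more: io_all: read 1049600, 0 left.
io_all: write 753664, 0 left.
io_all: write 57344, 0 left.
io_all: write 237568, 0 left.
blocks_read = 256 and copied = 535808
read more: io_all: read 1049600, 0 left.
io_all: write 204800, 0 left.
io_all: write 327680, 0 left.
io_all: write 360448, 0 left.
io_all: write 155648, 0 left.
blocks_read = 256 and copied = 536064
"""


def writes_per_cycle(log_text):
    """Return one list of write sizes for each read seen in the log."""
    cycles = []
    for line in log_text.splitlines():
        if re.search(r"io_all: read \d+", line):
            cycles.append([])  # a read starts a new cycle
            continue
        match = re.search(r"io_all: write (\d+)", line)
        if match and cycles:
            cycles[-1].append(int(match.group(1)))
    return cycles


for writes in writes_per_cycle(LOG):
    print(f"{len(writes)} write(s), {sum(writes)} bytes total")
```

In all three cycles the writes sum to 1048576 bytes, so nothing is lost; the buffer is just being flushed in several pieces. Whether that splitting is a cause of the slowdown or only a symptom is still open.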

Again, when I format the partition as ext4 and do a straight copy or rsync from the NFS server or wget or curl from the http server, I get 100+MB/sec. Only when it passes through partclone does it slow down to 400-500MB/min.

tomierna

I just re-tested and am confirming GbE transfer speeds and no errors on the switch port or on the t480’s ethernet from a debug shell when rsync-ing the image files from the NFS mount to the m.2 SSD.

Does anyone have older or newer static binaries of Partclone which will run in the client fog boot image?

I’d like to test a couple of older versions, and possibly a newer test version. For old, maybe 0.2.88 and 0.2.80, and for newer, maybe 0.3.11?

I can’t figure out how to build them from source in a way that works on the debug-boot machine.

tomierna

@sebastian-roth - I think I’ve shown it’s not a FOG Client kernel issue, since I’ve been able to do http and NFS copies in a debug shell at GbE speeds. (See my post from a few days ago)

This week I’m going to try other versions of PartClone, as well as doing some other checking of port statistics when just doing the http and NFS copies.

tomierna

@george1421 Yes, there was a single GbE connection between the server and the previous switch, an unmanaged Netgear GS116.

The first set of deploys I tried were similarly slow on the previous switch. The project to change out the switch for a managed one and add the 10GbE link was long planned, but since I couldn’t get any info out of the GS116, I figured having a management console would help debug things.

Doing five unicast t410i’s each at gigabit speeds makes me think the 10GbE link and VM are not the problem.