Posts made by p4cm4n

p4cm4n

@brakcounty could be.
it’s interesting that it’s getting to iPXE already, so it’s getting SOMETHING from SOMEWHERE.
try it out and let’s see.

p4cm4n

@brakcounty to confirm, you mention that with UFW disabled, there is no delay. is the script running however?

p4cm4n

@brakcounty what is the DHCP server that you’re using for this environment? are there managed switches in between subnets?

p4cm4n

the below entry, opening max connections for sql - was indeed a fix.
default is 151, and i’ve now been at a solid 200 connections for a week with no issues.

in debian 11, i did the following :

sudo su -
mysql -D fog
SET GLOBAL max_connections = 512;

To make this a permanent solution, refer to the link in the previous post.
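For anyone wanting to see how close the server sits to that ceiling before and after the bump, here is a minimal sketch. It assumes the `mysql` client can log in as root over the local socket (as in the steps above); the function names are made up for illustration. Keep in mind that `SET GLOBAL` only lasts until the service restarts, which is why the permanent change has to go into the server config.

```shell
#!/usr/bin/env bash
# Sketch: report how close MariaDB/MySQL is to its connection ceiling.
# Assumption: `mysql` can authenticate as root without extra flags.

# Integer percentage of the ceiling currently in use.
conn_usage_pct() {
    local used="$1" max="$2"
    echo $(( used * 100 / max ))
}

# Query the live server and print one summary line.
report_conn_usage() {
    local max used
    max=$(mysql -N -B -e "SELECT @@GLOBAL.max_connections")
    used=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | awk '{print $2}')
    echo "connections: ${used}/${max} ($(conn_usage_pct "$used" "$max")%)"
}
```

Running `report_conn_usage` as root every few minutes (from cron, for example) would show whether the count is creeping toward the limit during the morning check-in rush.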

p4cm4n

unsure of this being a permanent fix, or if this is specific to my environment -

further troubleshooting of this environment showed that a specific storage node was behind NAT (and as such, mysql_error was showing a connection failing from a gateway IP, versus the storage node IP)

after shutting this storage node off toward the end of business, the error seemingly cleared until the next day, when the bulk of imaging/client check-ins began.

couldn’t tell if it’s because lots of client check-ins happened at start of business as a lot of machines turned on, OR something else. the error returned around 10AM.

i found a link (https://www.thegeekdiary.com/mysql-error-too-many-connections-and-how-to-resolve-it/)
which pointed me in the direction of temporarily opening up max connections for SQL.
this has so far resolved the issue, even with the errant storage node. i was at 151-152 just prior to this bump, now i’m hovering around 160-165. i might have been hitting this ceiling organically.
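To tell whether one source (such as the gateway IP fronting the NAT'd storage node) is eating the connection slots, the open connections can be grouped by client address. This is only a sketch: it relies on the `HOST` column of `information_schema.processlist`, and the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Count open connections per client address.
# Input on stdin: one "host:port" value per line (the processlist HOST column).
count_conns_by_host() {
    sed 's/:[0-9]*$//' | sort | uniq -c | sort -rn
}

# Live usage (needs a working mysql login):
#   mysql -N -B -e "SELECT host FROM information_schema.processlist" | count_conns_by_host
```

The busiest address ends up on the first line, so a single storage node or gateway holding dozens of connections stands out immediately.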

p4cm4n

@george1421 yeah man. used that tutorial (from you
already on innodb.
TBH, it seemed that when this issue happened earlier, it happened when we were using MyISAM. after migration, it went away, until recently.
(we’re adding 200-250 hosts a day however)

p4cm4n

@george1421 from the multicast not starting thread?
https://forums.fogproject.org/topic/15960/fog-multicast-not-starting-anymore

p4cm4n

@george1421
mariadb is between 30-50% roughly.
currently have given the box 32 cores, 32GB RAM. RAM util is only hovering around 1.25GB though.
so far, approximately 6000. around 7500 in the host DB. most of those are on during the business day.

p4cm4n

In reference to some previous posts I’ve made, this post is probably in relation to the size of the environment where my FOG install resides.

The exact errors as they are stated from this link
https://mariadb.com/kb/en/aborted-connections/

What I have seen is that after some time in production, between many simultaneous images being deployed from around 40 storage nodes, as well as thousands of clients connecting to the main FOGserver with db…
Pages and images start to fail with a ‘database connection failed!’ blank html page (with a url pointed to schema) or ‘valid database connection could not be made!’

The mariadb/MySQL logs start out fine at startup, followed by a few errors that say data inconsistent, then start to slowly get flooded with the aborted connections error until that error persists 5-10 times per second constantly until the service is restarted.

This has shown up occasionally in the past however was usually resolved by a reboot and wouldn’t come back after several weeks. This time however we’re seeing this 2-3x an hour.

A quick netstat -plant shows what looks to be a lot (if not all?) of the storage nodes connecting over 3306, and established - plus a LOT of time_wait to all sorts of hosts on 127.0.0.1:9000 and quite a few on 80 as well.
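Rather than eyeballing the raw netstat output, the connections can be tallied by local port and state. The sketch below assumes the usual Linux `netstat -ant` column layout (local address in column 4, state in column 6); the function name is just for illustration.

```shell
#!/usr/bin/env bash
# Summarise TCP connections by local port and state.
# Reads `netstat -ant` style lines on stdin; header lines are skipped.
summarize_tcp_states() {
    awk '$1 ~ /^tcp/ {
            n = split($4, a, ":")        # local address; the port is the last piece
            count[a[n] " " $6]++
         }
         END { for (k in count) print count[k], k }' | sort -rn
}

# Live usage:
#   netstat -ant | summarize_tcp_states | head
```

Each output line is a count, a local port and a state, so a pile of TIME_WAIT on 9000 (php-fpm) versus ESTABLISHED on 3306 is visible at a glance.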

This issue causes all tasks to fail, from imaging to client check in.

Because I see the mariadb KB pointing to .net, I was wondering if this is related to the FOG client?

I currently have agent check in time at 8 minutes and have migrated to innodb.

p4cm4n

@george1421 yeah the error was something from funcs.sh, no OS ID passed, call determineOS.

p4cm4n

@george1421 I called the fog.download script and got ‘no OS defined’ which must have been the kernel reload you mentioned.
how did you reload the kernel params like that? i don’t see anything in the code that issues a reboot command unless it’s in functions.sh, which seems to be the common dump

p4cm4n

@george1421
Sweet. This will be my project tomorrow.

Today I ended up automating the server node installation, was pretty fun actually. Learned a bunch of the inner workings. Never written a bash script before, or modified the fog.man.reg in the way that I have. The man.reg only asks
Hostname:
LocID:
ImgID:
GroupID:
Then deploys.
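For illustration only, a cut-down prompt loop of that shape could look like the sketch below. This is not FOG's actual fog.man.reg code; the validators and the 15-character hostname limit are assumptions on my part.

```shell
#!/usr/bin/env bash
# Hypothetical validators for the four prompts; not FOG's own functions.

# 1-15 characters, letters/digits/hyphen, no leading or trailing hyphen.
is_valid_hostname() {
    printf '%s' "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9-]{0,13}[A-Za-z0-9])?$'
}

# Location, image and group IDs are plain positive integers here.
is_numeric_id() {
    printf '%s' "$1" | grep -Eq '^[0-9]+$'
}

# Keep asking until the answer passes the given validator.
# Prompts go to stderr so the accepted value can be captured from stdout.
prompt_until_valid() {
    local label="$1" validator="$2" answer
    while :; do
        printf '%s: ' "$label" >&2
        read -r answer || return 1
        if "$validator" "$answer"; then
            printf '%s\n' "$answer"
            return 0
        fi
        echo "Invalid ${label}, try again." >&2
    done
}

# Example wiring:
#   host=$(prompt_until_valid "Hostname" is_valid_hostname)
#   loc=$(prompt_until_valid "LocID" is_numeric_id)
#   img=$(prompt_until_valid "ImgID" is_numeric_id)
#   grp=$(prompt_until_valid "GroupID" is_numeric_id)
```

Re-prompting on bad input matters more than it looks when the hostname comes from a barcode scanner, since a misread would otherwise go straight into the database.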

This poses the question though, I still don’t understand the workflow enough…how DID you end up having the machine image without rebooting?

p4cm4n

@sebastian-roth @george1421

Client time has been adjusted, but it’s being adjusted on the fly. The imaging workflow is mostly complete for new laptops; NOW it will be pre-existing computers. Pre-existing machines will expand that range from the 4000 as it stands now by an additional 10k. I’m glad you guys mention that though, as I will definitely need to change that over ASAP before I roll out the FogClient deployment (via PDQ actually, george…however in this engagement we’ve noticed PDQ only scales sooooo much :))

It turns out that setting the images to replicate with the functionality you guys mention (Group > Group Image Replication) worked. I might have been a little impatient when testing it before, but it is what it is. I’ll have to monitor the Master Nodes in their respective groups - but I set one to replicate overnight and it worked as expected…prior to that it did not (seemingly? perhaps firewall issues though…I watched the log as this one happened)

In most situations I wont be making storage group slaves, luckily. However the infrastructure is there to “make” the images at the HQ site. Massive ESXi farm, lots of existing snapshots, and already open firewall to everywhere.

I now understand what you’re mentioning about the database design. Any guides you have to work on this and migrate it over?

I’m getting to a point in this project where I have to offload nearly everything as I’m leaving the country soon. But all the preparation I can do in advance, the better. Going to go build 70 storage nodes today and tomorrow

p4cm4n

I think so far the adjustments you guys have mentioned will probably be made - but some may not be necessary. I don’t think there will be much of a requirement for the web portion to be ‘fast’, as we wrote a powershell GUI for the specific functionality that we needed with the fogAPI.

I’m still not really understanding some things about the master/storage/storage groups (with the added feature of ‘locations’)

Something I’m running into is that at specific places, I will need a new variation of an image perhaps. So lets say this is the case

HQ > FOGServer. Database host, Image host, PXE host.

HQ has 3 additional storage nodes.
HQ Has the ‘Image Master VM’ as I will call it.
Site 1> FOG Storage Node (Master Node of this Storage Group)
Site 2> FOG Storage Node (Master Node of this Storage Group)

Question 1 relates to only the HQ setup - If I use “DEPLOY IMAGE” without registering a host, is it supposed to only talk to the FOGServer, and not load balance in any way? In this functionality, I’ve got about 90 laptops imaging from the main FOGserver. It’s working incredibly fast, but just making sure this is by design. I guess the FOGServer doesn’t ‘know’ the clients, so it can’t load balance on something it doesn’t know? I’d expect the slots are taken regardless, but ?

Question 2 relates to the entire setup. If I create an image at HQ, but lets say I want to replicate that image to Site 1…how do I do that? As it is now, I’m just changing the storage group of Site 1’s storage node (disabling master) and putting it in the group of HQ so it replicates with the ones in HQ. Then moving it back to its own…but in this case, I will need to do this with all of everywhere…and as more sites come about, that may be a lottttt of images.

What I’d like to do is have that “FOGServer” master node. Capture an Image (Lets Say ImageAdobe) at HQ on this Master Node, then selectively replicate it to Site 2, for example. No need for Site 1 to get it. Likewise, Site 1 gets ImageChrome. No need for Site2 to get it. Is this something that can be done? If there is no ability for this to be done - if I just go into the storage node via SSH and delete the folder for the image I don’t need at a location, is this best practice? Will it cause any issues, if Site2 thought it had Image34, but the folder no longer existed?

So far, I’ve killed the crap out of the master fogserver based on 256MB PHP-FCGI limits (I run debian) but found a few threads @george1421 was a part of that explained how to resolve those issues. A little learning curve due to you guys using CentOS but all good. I’ve got 4000 Clients in there, which does take a while to load up the webpage, but since most of our work now is
image > approve pending hosts, add pending to group > rinse/repeat
we’ve done that whole workflow in Powershell GUI. Only I’ve had to go into the webserver

p4cm4n

@george1421
So here is my dilemma in building this out going forward - and it’s more the naming schemes behind the fog nodes.

I will need the FOG database of clients probably on (1) machine. It is the HQ Master Node as of yet. This is designated by site (HQ), with storage group (HQ). There are (3) additional storage nodes here.

I have a remote site that will potentially be of a ‘DR’ functionality. I’m unsure if it’s possible to have a second copy of the entire database on that server as well - but the host of ‘fogserver’ is at HQ.

of the 70 or so other sites, at least 50 of them will have FOG storage nodes that connect back to HQ’s fogserver.
-?? I’m going to be playing with subnet groups today, but the idea is going to be to try to automate which site everything talks to, for location purposes.
I have at least 50Mbit to every site, but in some places 2g-10g. I was tempted to use those 10g places as ‘failover’ if their local spot becomes too saturated. Any ideas if this will work, or how to do this?

-?? Any ideas if something like this will work, and i’m not sure if you’ve used some of these concepts or more for one-offs - but i’ll be testing some things soon. if you have ideas of testing things as well, i’m all ears. feel free to reach out via PM for more ‘details’ if you need them. some things i cannot discuss.

p4cm4n

@george1421 said in Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?:

@p4cm4n said in Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?:

OK since we have had some interactions before (sorry I participate in too many threads to remember anymore), I’ll jump right into the more advanced stuff.

–You’re good man. We’ve actually spoken a few times over the years, but that’s just because you’re as involved with support as you are. Kudos to you for that

Do you need to manage these target computers with FOG once they are deployed? Like deploying snapins or such?
A: Yes. Specifically AD joining and maintenance. This will also probably replace/begin use as an imaging solution.
C1: This was actually a leading question to see if the “Load and Go” (using ipxe deploy image menu) approach would be best here. If all you need is imaging LaG is the simplest method of deployment if everything can be calculated and you don’t need to manage with fog after deployment. In this case you don’t register the computer with FOG and the FOG server will forget about the computer after deployment. A system builder would use this approach.
C2: Since you mentioned that you already use the deploy image menu. If you do have all of the clients registered in FOG you can restrict the deploy image menu to only display the defined image for that target computer instead of all defined images on the fog server.

–This actually doesn’t end up being feasible because they aren’t registered until after the second bootup in Windows (and a tech hits “register pending” in a powershell GUI we built for mass imaging via FOGAPI.)

Do you need to be able to remotely deploy an image to the computers or will this imaging need to be done under the control of an IT Tech?
A: Mostly under the control of a tech. But the workflow so far has been to deploy an image. This may change if I’m able to en-masse register the hosts.
C1: This question was focused on if you needed to boot through PXE or can just have the IT Tech press the F12 key during booting to select PXE boot. If you needed to reimage an entire class room of computers without visiting each computer I would say leave the default boot through the iPXE menu configured. On my campus we only allow reimaging when the IT Tech is sitting in front of the computer to manually select PXE boot. This approach avoids accidentally and automatically reimaging the Director’s computer by picking the wrong computer at the wrong time if everything was automatic.

–Helpful, as funny enough I deployed the DHCP scope for fogserver//undionly|ipxe to the whole DHCP range for a server that handles a few sites, and quite a few machines popped up to the deployimage menu. But there is currently no salvageable data on those machines, so it’s useless to be ‘safe’…if you catch my drift. The ones we actually care about don’t do this because of SecureBoot and because Win10 takes over the boot priority for quick booting.

What target OS will you deploy?
A: Win10, 21H2 Ent/Pro
C1: You will surely want to have fog 1.5.9.110 or later installed. You will need to switch to the dev branch and reinstall the FOG software so that you can avoid an issue with 20H2 and later where M$ changed the disk structure a bit which causes FOG 1.5.9 to not be able to expand the disk properly. That has been fixed in 1.5.9.110 or later. FOG 1.5.10 will be out later this spring to have all of the fixes included.

–All good there. Dev branch as of 1/28/2022 was installed and that client deployed on all machines thus far. It’s on the golden image I created as well.

At the sites with 10 or less computers, assuming you have enough bandwidth, will you send an IT tech to the site to image the computers or will you deputize an existing site person to image the computers?
A: Probably an existing person, which I think I have mastered the image workflow for this purpose. It can’t be dumbed down any further. However, I have remote control of DHCP at least.
C1: This question was geared towards seeing if a mobile deployment server would work for the small sites? This is basically a FOG server installed on a laptop, NUC, or Raspberry Pi? That could be shuttled between the sites. Thinking a laptop with a normal version of FOG installed, running dnsmasq and a FOG community dynamic dhcp script could do the trick. Just have the office admin connect it into your network and power it on. The only bit you would need is to be able to find it on your network, but you could do it with a linux startup script just have it send you an email saying hello I’m here and here is my IP address, or something like that.

–Good point, but I don’t know this will be feasible. Most of these places will probably have reasonable bandwidth but there are parts to them that I can’t mention on this forum that would give a bit too much away as to the details.

What is your computer naming convention? Can it be calculated?
A: Naming convention cannot be automatically populated. I wish - however these machines are asset tagged, non-sequentially.
C1: I might have mentioned that the use of bar coded tags here with a reader will help in this situation. It’s easier to “zap” barcode fragments than to key in a host name and get it right every time.
C2: Through the use of the linux dmidecode command a FOG post install script can read values in SMBIOS. On my campus the windows host name is calculable based on the site LUN, chassis type, and dell serial number. These all can be pulled from smbios. We can (but don’t currently) pull the dell asset field (blank field in the firmware where companies can put their asset number) with dmidecode.

–Yeah, they are asset tagged by the shipping/internal department of the organization that handles procurement. There is no field in ‘software’ that contains this value. I was hoping there would be, -OR- that we’d get a lucky batch of sequential numbers. But it starts with ‘ORG-12345678’ with the numerical portion being somewhat sequential but spread out over a large geographical area. Barcode/One-Off manual entry works well enough for now.
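For reference, the dmidecode approach described in C2 could be sketched roughly as below. The naming scheme here (`<site>-<L|D>-<serial>`) is purely hypothetical and the site code is passed in by hand; only the `dmidecode -s` keywords in the comments are real. It obviously only helps where the needed values are actually present in firmware, which is not the case for the asset tags discussed here.

```shell
#!/usr/bin/env bash
# Hypothetical naming scheme: <site>-<L|D>-<serial>, upper-cased.
build_hostname() {
    local site="$1" chassis="$2" serial="$3" kind
    case "$chassis" in
        Laptop|Notebook|Portable) kind="L" ;;   # mobile chassis types
        *)                        kind="D" ;;   # everything else treated as desktop
    esac
    printf '%s-%s-%s\n' "$site" "$kind" "$serial" | tr '[:lower:]' '[:upper:]'
}

# On a live machine (root required):
#   build_hostname "HQ" "$(dmidecode -s chassis-type)" "$(dmidecode -s system-serial-number)"
#   dmidecode -s chassis-asset-tag    # the asset field, if anyone filled it in
```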

Since you created a custom fog manual registration script. There is a hack I did one time at the end of that script. In the script it asks if you want to deploy the image to the target computer. If you answer yes it will reboot and then deploy the image. I found this extra reboot (since I require the IT Tech to sit in front of the computer and press the F12 button to get it to pxe boot) unnecessary. I’d have to look at the script but I was able to have it go right into imaging without a reboot if the IT Tech answer yes to deploy the image. That would save about 60 seconds on a registration and then image deployment.

–This is a great idea and it’s been requested by my techs a bunch, before I ended up moving the workflow from fullreg to deploy image. The problem was more about following the manual data entry portion of full reg. While I could have limited it to only request the hostname, for example, I would still have had to use something like persistent groups for snapin deployment and such, since they were new clients…but ultimately I just scripted the auto-install of the ‘base’ applications. After that, another PoSH GUI we created lets an ‘inventory’ or ‘deployment’ person who hands a new laptop to an end user just scan the device asset tag barcode, select a checkbox, and hit start, which auto-deploys the snapins for that laptop. It’s relatively scalable for 100 a day (not quite keeping up with the 500 new laptops we’re imaging a day as of yet, but…)

p4cm4n

@george1421
and actually, big thanks. i ended up using deploy image per the last response in another thread you had suggested.
i ended up doing this with win10 for my laptops (the ‘proof of concept’ for this whole project)

-boot to fog>deploy image (default)
-edited the deploy image menu option to include the username and password, so the screen defaults to image selection screen, so we choose which image
-the image we use has drivers included for new intel NICs. it then prompts you to enter the hostname in a CMD shell window. this is input with a barcode scanner. after this, software installs as the image finishes deploying, then the FOG client is set to delayed-auto and the machine reboots.
-the client is now sitting in pending hosts, which then we clear the QC portion of the hostname, add it to a group, join to AD appropriately, customize snapins and shutdown the machine.
–because of your suggestion, this significantly reduced technician error with snapins, locations, hostname, etc.

p4cm4n

@george1421

@george1421 said in Master, Storage nodes, and clients for gigantic network. Trying to design - need sanity check?:

@p4cm4n So some questions.

Do you need to manage these target computers with FOG once they are deployed? Like deploying snapins or such?

– Yes. Specifically AD joining and maintenance. This will also probably replace/begin use as an imaging solution.

Do you need to be able to remotely deploy an image to the computers or will this imaging need to be done under the control of an IT Tech?

–Mostly under the control of a tech. But the workflow so far has been to deploy an image. This may change if I’m able to en-masse register the hosts.

What target OS will you deploy?

–Win10, 21H2 Ent/Pro

At the sites with 10 or less computers, assuming you have enough bandwidth, will you send an IT tech to the site to image the computers or will you deputize an existing site person to image the computers?

–Probably an existing person, which I think I have mastered the image workflow for this purpose. It can’t be dumbed down any further. However, I have remote control of DHCP at least.

What is your computer naming convention? Can it be calculated?

–Naming convention cannot be automatically populated. I wish - however these machines are asset tagged, non-sequentially.

p4cm4n

Okay…so I have a huge deployment of FOG I’m about to roll out. 14,000 endpoints.

I’m trying to design based on bandwidth, and the tools available within FOG, to be able to differentiate and make this as smooth as possible
(IE, currently no machines are actually in FOG, save for the newest ones…all of these will probably be freshly added, need identified, grouped, and talking to the most efficient storage node possible)
I’ve never dabbled with subnet groups before. I HAVE dabbled with locations before. Ideally I’d love to try to hash out some stuff in designing this.
For reasons I can’t give many details. But suffice it to say that a good number of machines all have no working OS, and booting to FOG to register and deploy an OS is easy enough - but grouping and managing all of them after the fact may prove difficult and time consuming.

Also, as there are 70+ sites, sometimes bandwidth is a constraint. Physical hardware to that location however is not necessarily an issue. The analysis would end up being whether it makes sense (if 5Mbit connection, but only 2-5 machines need imaged, no sense in deploying a storage node)
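To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The 20 GB image size is an assumption for illustration, and the estimate ignores protocol overhead, compression, and anything else sharing the link.

```shell
#!/usr/bin/env bash
# Minutes to pull one image of SIZE_GB across a link of LINK_MBIT
# (integer arithmetic, decimal gigabytes, no overhead accounted for).
transfer_minutes() {
    local size_gb="$1" link_mbit="$2"
    echo $(( size_gb * 8 * 1000 / link_mbit / 60 ))
}

# Examples with an assumed 20 GB image:
#   transfer_minutes 20 5     # 5 Mbit site
#   transfer_minutes 20 50    # 50 Mbit site
```

With those assumptions a single machine at a 5 Mbit site takes about 533 minutes (close to 9 hours) over the WAN, versus roughly 53 minutes at 50 Mbit, so the machine count per site and how patient the site can be both feed into whether a local storage node is worth it.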

Does anyone have experience in designing such a system, and have any pointers/caveats/tips for this?
For reference, I have already built a “Master” node in HQ, and a few storage nodes there as well. The hardware there is actually insane - I’ve hit 8GB/min to 90 machines on unicast from one server.
I have also tested that the firewall has allowed imaging to happen over WAN/VPN at a remote site, using storage and the master node at a remote location. It too, got 8GB/min.

p4cm4n

HiHi.

I’m trying to design a workflow for a massive amount of laptops and their associated imaging.

At the moment, SOME laptops come bundled with a sticker that has a barcode with a MAC Address. Part of my workflow integrates assigning a hostname to that MAC address, and importing the CSV into fog.
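The scan-to-CSV step can be sketched as below. It only produces plain `mac,hostname` rows; the column layout FOG's host import actually expects is not reproduced here and should be checked against a host export from your own server. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Normalise a scanned MAC (any of : . - as separators, any case)
# to aa:bb:cc:dd:ee:ff. Returns non-zero if it is not 12 hex digits.
normalize_mac() {
    local hex
    hex=$(printf '%s' "$1" | tr -d ':.-' | tr '[:upper:]' '[:lower:]')
    printf '%s' "$hex" | grep -Eq '^[0-9a-f]{12}$' || return 1
    printf '%s\n' "$hex" | sed 's/../&:/g; s/:$//'
}

# Emit one "mac,hostname" row, or complain on stderr about a bad scan.
csv_row() {
    local mac
    mac=$(normalize_mac "$1") || { echo "bad MAC: $1" >&2; return 1; }
    printf '%s,%s\n' "$mac" "$2"
}

# Example: csv_row "AA-BB-CC-DD-EE-FF" "ORG-12345678" >> hosts.csv
```

Rejecting malformed scans up front is cheaper than chasing a host that never matches its registration later.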

What I’d LIKE to do is basically a Semi Host Inventory of sorts, NOT full.

I’d like to boot to fog automatically, with simply a prompt of a hostname. The hostname is entered, RETURN is pressed, the machine restarts or shuts down (I’m not technically concerned with this but shutdown would be excellent)

Does anyone know how to go about doing this?