One FOG server with multiple storage nodes?
-
@theterminator93 i just realized that you’re running FOG 1.2.0. If you’re willing to run an experimental build, you can upgrade to trunk for an increase in speed.
-
That’s also an option; if I upgrade and see other weird things and can’t get around the current problem, I can always just restore the entire VM to the way it was this morning.
As it turns out, VMware is emulating VMXNET adapters. Once I finish the imaging tasks currently queued, I’ll switch it over to E1000 and try again.
-
Okay - this is what I get if I create a couple of new storage nodes as before. This is after I switched the VMware adapter type to E1000.
If I disable eth0 and eth2 (the other two adapters currently enabled, which have the 1.20 and 1.18 addresses), it will mount and start imaging. If I start another imaging task, it will try to connect to 1.18; if I re-enable that adapter, it won’t connect, but the one started on 1.17 still runs. If I re-enable eth0 (tied to 1.20, the master), that one still runs, but nothing new can connect.
Then I tried disabling the storage node tied to 1.18 and restarted the second imaging task, which connected and started imaging, though I’m not sure which node it connected to; VMware says just about all traffic to/from the server is coming off vmnic2 and not the others.
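(For the record, the enabling/disabling above is just being done at the interface level on the server - roughly like the following, with interface names matching my setup:)

```
# take the adapter tied to the 1.18 storage node address down, then bring it back later
ip link set eth2 down
ip link set eth2 up
```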
-
@theterminator93 instead of trying to set up multiple storage nodes on the same server, have you tried nic teaming on the esxi box and using the VMXNET adapters? also, how is the speed improvement after upgrading to trunk?
-
@Junkhacker said in One FOG server with multiple storage nodes?:
@theterminator93 instead of trying to set up multiple storage nodes on the same server, have you tried nic teaming on the esxi box and using the VMXNET adapters? also, how is the speed improvement after upgrading to trunk?
ESXi is already configured to use all four NICs in an LACP port-channel to the switch; is what you’re referring to an additional level of NIC teaming between ESXi and the guest OS? The upgrade to trunk might be a project for tomorrow (I have techs onsite imaging machines today); where is the speed improvement with that coming from - better image compression?
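For reference, the teaming on the host side looks roughly like this (vSwitch and uplink names are from memory, so treat them as placeholders):

```
# show the current teaming/failover policy on the vSwitch
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0

# all four uplinks active, balancing by IP hash to match the port-channel on the switch
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 \
    --active-uplinks=vmnic0,vmnic1,vmnic2,vmnic3 --load-balancing=iphash
```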
Is there a way to determine which storage node a particular imaging task is pulling data from? I’ve left the three adapters active in the OS and just disabled storage node 3, and imaging seems to be progressing as it should (although I’m only doing one task at a time for the moment, so this observation isn’t 100% confirmed yet).
-
@theterminator93 using a single VMXNET adapter on the server and the 4 nics configured that way on the esxi box should allow you to send out 4 gigabit connections at once. it won’t let 1 imaging task go 4 times as fast, but it should let 4 tasks go at once without slowing down because of network connections (though your other hardware may not be able to keep up). this is all theory to me, since i haven’t tried it though. the speed improvement on imaging with trunk mainly comes from a change in the decompression method. the image must be decompressed before it is piped to partclone. fog 1.2.0 used a single processing thread, trunk uses multiple. see the video i posted for an example of the current speed we get.
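to give a rough idea of what that change looks like under the hood (the paths, target device, and flags here are only illustrative, not the exact commands fog runs):

```
# 1.2.0-era: single-threaded gunzip feeding partclone
gzip -d -c /images/labimage/d1p1.img | partclone.restore -O /dev/sda1

# trunk: pigz handles the decompression with multiple threads, same pipe into partclone
pigz -d -c /images/labimage/d1p1.img | partclone.restore -O /dev/sda1
```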
-
@theterminator93 There was also an issue with vmxnet, depending on what version of ESXi you are running, where it would slow down under heavy load. The recommended fix was to use the e1000 network adapter.
I guess I have to ask: at what rate do you need to image your machines? With 4-6 minutes per image download you can chain-load images and start a new system deploying every 6 minutes, giving you about 10 per hour. Maybe unicast imaging is not the right solution for you. If you need to image tens of systems per hour you may want to look into multicasting (which brings its own problems, too).
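(For reference, FOG’s multicast sessions are basically udpcast under the hood; conceptually it works something like the following, where the paths and options are illustrative rather than FOG’s exact invocation:)

```
# on the server: wait for 8 receivers (or a timeout), then send the compressed image once
udp-sender --interface eth0 --portbase 9000 --min-receivers 8 --max-wait 60 \
    --file /images/labimage/d1p1.img

# on each client: receive the one stream and pipe it through decompression into partclone
udp-receiver --portbase 9000 | pigz -d -c | partclone.restore -O /dev/sda1
```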
So to ask the question that should have been asked earlier, how many machines do you need to image per hour?
-
It’s an ESXi 6.0 host. I did notice that with the VMXNET adapter as the emulated NIC, the OS was seeing a 10Gb link speed… however, that was the config under which I was seeing Ethernet throughput choke with more than 3 clients unicasting.
Funnily enough, I had tried multicasting a lab first, which… didn’t work (the infamous problem where all hosts check in but don’t start deploying). I really should look into that as well since, truth be told, it’s definitely the way to go when pushing the same image out to identical hardware. My motivation for pushing the limit of how many unicast tasks I can run simultaneously is that this client has an extensive (not in the good way) assortment of hardware and applications out there, varying from single-threaded Northwood P4s up to Broadwell i5s (guess which they have more of…).
I’m definitely going to try out trunk to see how it runs with multithreaded decompression. I’ve also read of a number of other items that have been addressed which I’ve not really run into issues with so far, but it’d be reassuring to know there’s less of a chance of them dropping in on me unexpectedly.
Regarding the number of machines that “need” to be imaged per hour… first and foremost, the setup I had before trying to push the envelope with this endeavor was already a tenfold improvement over the old setup (a 100 Mb server, 100 Mb switches, etc.). Just with the new infrastructure we’ve gone from 30 minutes per machine to less than 10. My vision is one where a single person can go into a classroom, boot up a host, register it (if needed) and start the imaging task, then walk away and go do the next machine in the next room - with none of the machines needing to wait in the queue to start pulling an image, while all of them image at peak throughput relative to the capability of their hardware. In short, it’s the “new network… push it to its limits to see what it’s capable of” mindset.
Upon leaving for the afternoon I had disabled storage node 3, but storage node 2 as well as the default remained enabled, and imaging tasks seemed to proceed normally. I still haven’t confirmed whether it’s using two NICs to handle the throughput, but hopefully I’ll have an opportunity to examine that tomorrow.
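One idea for checking it tomorrow (assuming the clients pull the images over NFS on the usual port 2049, and interface names matching my setup):

```
# on the FOG server: which local address (i.e. which storage node IP) each client attached to
ss -tn sport = :2049

# per-interface byte counters, to see which NICs are actually carrying the traffic
ip -s link show eth0
ip -s link show eth1
```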
Thanks everyone for the input and suggestions. The fresh perspective helps keep me from digging too deep in the wrong places.
-
Minor update.
I haven’t changed anything today yet, but I did image a lab this morning. While the lab was going I didn’t see a single occurrence of the error I had before (unable to mount) with two storage nodes active. I had no discernible slowdowns while imaging one machine at gig and simultaneously imaging 7 others at 100 Mb (old 2950 switches still in this building). I was also seeing traffic coming off more than one vNIC according to VMware.
Until I try 4-6 or more simultaneous imaging tasks at gig on the new network, I won’t know for sure whether the >1000 Mb/s throughput “barrier” is broken.
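One thing I may try first is a quick synthetic test so I’m not waiting on imaging tasks to find out - for example, several iperf3 clients hitting the server at once and adding up the totals (the server address below is a placeholder, and iperf3 needs one listener per concurrent client):

```
# on the FOG server: one listener per test stream
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &
iperf3 -s -p 5203 &
iperf3 -s -p 5204 &

# on each client machine, at the same time, each pointed at its own port
iperf3 -c <fog-server-ip> -p 5201 -t 60
```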
I’ve got one last model’s image I’m uploading today, then I’ll do the trunk upgrade and do some more tests to see how things look.