Thanks everyone for chiming in. Let me try to address everything below:
@Wayne-Workman said in One FOG server with multiple storage nodes?:
I think what you’ve tried is very close.
So, each IP address for each NIC should have a storage node defined. It’s fine to point them all to the same local directory. In such a scenario, all the nodes need to belong to the same group. For Multicasting to work (don’t know if you’re using that), one node in the group must be set as Master.
Could it be a firewall issue?
Are there any Apache error logs?
Another thing to do - and this will really help you out - is to do a debug deployment with the multiple NICs and corresponding storage nodes setup, and then to manually do the NFS mounting yourself and see how that goes, see what errors you get, and so on. We have a walk-through on this right here:
https://wiki.fogproject.org/wiki/index.php?title=Troubleshoot_NFS
Please report back with your findings, be they failure or success. Screen shots of any errors would immensely help out.
Yep, for each new storage node I added, I assigned the same static IP that I gave the corresponding interface in the OS. Only the default node was set as master, and they were all part of the default group. The original IP of the server is x.x.1.20; the two storage nodes I added were 1.17 and 1.18, respectively. When I tried imaging with this configuration, it booted fine and got as far as prepping the disk to pull the image from the storage node, then complained that access to the 1.18 node kept timing out.
The first thing I did was disable the firewall (sudo ufw disable), but I saw no change. When I disabled the other two interfaces (1.20 and 1.17), making sure the server could still at least reach the internet, it started imaging. As soon as I re-enabled one of the other interfaces, it stopped immediately.
I didn’t check any logs, though I will have a look at this again and see what I can come up with based on your suggestions.
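When I do, the manual NFS check I plan to try (per the walkthrough above) looks roughly like this; the /images export path and the 1.18 address are assumptions for the sketch, not confirmed values:

# from a client booted into debug mode, see what the 1.18 node claims to export
showmount -e x.x.1.18

# try mounting the images share by hand and watch for the same timeout
mkdir -p /mnt/nfstest
mount -t nfs -o nolock x.x.1.18:/images /mnt/nfstest

# back on the server, check the Apache logs Wayne asked about
tail -n 50 /var/log/apache2/error.log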
@ITSolutions said in One FOG server with multiple storage nodes?:
What mode of bonding did you set up when you did the test? Also, did you make sure each virtual NIC was assigned to a different physical NIC?
I followed the instructions here more or less to the letter: https://wiki.fogproject.org/wiki/index.php/Bonding_Multiple_NICs
I used bond mode 2 and gave the bond interface the same MAC as the old eth0 NIC, as reported by ifconfig. My /etc/network/interfaces file would have looked similar to this:
auto lo
iface lo inet loopback

auto bond0
iface bond0 inet static
    bond-slaves none
    bond-mode 2
    bond-miimon 100
    address x.x.1.20
    netmask 255.255.255.0
    network x.x.1.0
    broadcast x.x.1.255
    gateway x.x.1.1
    hwaddress ether MAC:OF:ETH0

auto eth0
iface eth0 inet manual
    bond-master bond0
    bond-primary eth0 eth1 eth2 eth3

auto eth1
iface eth1 inet manual
    bond-master bond0
    bond-primary eth0 eth1 eth2 eth3

auto eth2
iface eth2 inet manual
    bond-master bond0
    bond-primary eth0 eth1 eth2 eth3

auto eth3
iface eth3 inet manual
    bond-master bond0
    bond-primary eth0 eth1 eth2 eth3
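For what it's worth, this is roughly how I'd verify the bond actually came up in mode 2 with all four slaves attached (standard tools, nothing FOG-specific):

# shows the bonding mode, MII status, and which slaves are active
cat /proc/net/bonding/bond0

# confirm bond0 carries the x.x.1.20 address and the eth0 MAC
ifconfig bond0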
@george1421 said in One FOG server with multiple storage nodes?:
Let me see if I understand this.
You have an ESXi server with 4 NICs that are trunked to your core switch. From the core switch to the remote IDF closets you have (2) 4Gb links (!!suspect!!). Your FOG server has one virtual NIC (eth0).
Some questions I have
- On your EtherChannel trunk, did you use LACP on that link, and is your load-balancing mode IP hash or MAC hash (the ESXi server and the switch need to agree on this)?
- You stated that you have 4Gb links to the IDF closets? Are you using token ring or Fibre Channel here?
- When you set up your FOG VM, which (ESXi) network adapter type did you choose (VMXNET or E1000)? For that OS you should be using E1000.
- What is your single-client transfer rate according to partclone?
I can tell you I have a similar setup: an ESXi server with 4 NICs teamed, with a 2-port LAG group to each IDF closet. In my office I have an 8-port managed switch that I use for testing target host deployment; it is connected to the IDF closet, and the IDF closet to the core switch.
For a single-host deployment from the FOG server to a Dell 990 I get about 6.1 GB/min, which translates to about 101 MB/s (near the theoretical GbE limit of 125 MB/s).
The physical setup is the ESX(i) server with 4x 1Gb NICs in a port channel (not trunked; all access ports, and all IPs on the VMs are on the same VLAN) to a Cisco 2960X stack (which feeds the labs I was working in, so no “practical” throughput limits between those switches). That stack has 2x 1Gb LACP links to the core switch (an old 4006). From there, there are 2x 1Gb MM fiber links to 2960X stacks in the IDFs (again, LACP), and a 3x MM LH fiber link to a building under the parking lot (PAgP port channel due to ancient code running on the 4006 at that building). There is a remote site with a 100 Mb link through the ISP (it kills me), but FOG is not used there currently.
ESXi is using IP hash load balancing - basically identical to the config here: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004048
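For anyone following along, the teaming policy can be confirmed from the ESXi shell roughly like this; vSwitch0 is a placeholder for whichever vSwitch the FOG VM's port group sits on, and the load balancing field should come back as iphash:

# show the teaming/failover policy for a standard vSwitch
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0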
I will double-check which NIC type VMware is emulating to the OS. When I gave the VM four NICs, they just showed up as VMNIC1, VMNIC2, etc.
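A quick way to see what the guest actually got, assuming a standard Linux guest with ethtool installed (the reported driver distinguishes e1000 from vmxnet3):

# the driver name tells you the emulated adapter type
ethtool -i eth0 | grep driver

# or check what the virtual NIC shows up as on the PCI bus
lspci | grep -i ethernet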
I usually see a transfer rate (as another poster noted, after decompression) of a little over 4 GB/min on the older Conroe-based hardware (which is the bulk of what is at this site). I can do that on up to 4 hosts simultaneously before it drops into the high 3s, at which point I can see the NIC saturating in the FOG console home page’s transmit graph (125 MB/s). On the newest machines there (Ivy Bridge i5s) I saw a transfer rate as high as 6.65 GB/min, but I have yet to try sending more than one of those at a time. All the image compression settings are default.