Network throughput crippled.
-
I strongly believe this is not a FOG problem, but rather a Linux and/or VMware problem.
At work we have a FOG system with about 15 servers in total. All of the storage nodes and the main server are members of the same storage group, but I think that is irrelevant here.
We recently had a power outage that lasted about three hours at our Administration Center, and the VM platform went down. It’s configured to use a SAN; I believe there are two mirrored SANs, two mirrored VM platforms, and a switch configured in whatever way is standard for that setup. I didn’t build the platform, so I’m unsure of the specific details.
All of the storage nodes are operating just fine at about 1 gigabit per second. The main FOG server, which is hosted in VMware, is averaging about 30 megabits per second to anywhere besides itself.
I used iperf to test throughput from the FOG server to several other places, and they all averaged 30 megabits per second. When I tested throughput with iperf to 127.0.0.1, I got 34 Gbps.
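For reference, the tests were essentially just the stock iperf client/server pair; the remote address below is a placeholder:

# On the target machine (or on the FOG server itself for the loopback test):
iperf -s

# From the FOG server to a remote host (placeholder address):
iperf -c 192.168.1.50 -t 10

# Loopback test on the FOG server:
iperf -c 127.0.0.1 -t 10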
Other VMs in the platform are operating normally; I tested those as well.
CPU load remains under 0.3, disk utilization stays at about 3%, and memory usage is negligible.
I used ethtool to verify that the adapter is configured at 1 gigabit per second. Other than throughput, the main FOG server is operating perfectly normally.
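For reference, the check was just the basic ethtool report on the VM’s adapter (ens32 is this VM’s interface name):

# Shows negotiated speed, duplex, and link state as seen inside the guest
ethtool ens32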
I have several ideas about how to recover from this if I can’t fix the throughput problem, but I’m reaching out to pick the brains of all you gurus here: what could I possibly do to test or fix the throughput issue?
The FOG server is CentOS 7.
-
After thinking about this for a bit, I have a few questions and some comments.
You have to remember there is a virtualization layer between the FOG host system and the physical world. Checking with ethtool on the vm client only gives you a false sense of what is going on, because it tells you what is happening between the guest and the virtualization layer’s vSwitch. What you really need to find out is what is going on between the physical vm host server and the physical core switch. On your virtualization host, do you have a network LAG set up between the vm host server and your core switch? If you do, ensure that all LAG elements (ports) are running at GbE speed (checked from the core switch side).
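If you have shell access to the ESXi host, a rough way to check this from the hypervisor side as well (the vmnic names are examples and will differ on your host):

# List the physical uplinks with their link state and negotiated speed
esxcli network nic list

# Older equivalent that reports the same information
esxcfg-nics -l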
Are all 15 FOG storage nodes connected to the same core switch? If not, what kind of throughput do you get to a Linux server connected to the same core switch? What about a Linux host on the same virtualization host? You need to start ruling out where the problem isn’t. Is it vm host -> same vm host, vm host -> other vm host on the same core switch, or vm host -> some vm server on the other side of your network?
I assume your vm client disks (vmdk files) reside on the SAN. If so, you also have to take that into account for overall system speed. 30 MB/s is something I might expect from an old SATA disk. Have you tested with hdparm to see what your disk transfer rates are? You may have an issue on the SAN or SAN LAN side that is causing slow disk access, unrelated to client network throughput. If your SAN LAN uses MPIO for load balancing and redundancy, you may have one of the MPIO branches offline. But since you are using iperf to measure and it is reporting slow, it’s probably not the disk subsystem at fault here. Still, it would be good to check, since a slow disk would create overall low throughput to the target computer.
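A minimal sketch of that hdparm check, assuming the FOG server’s virtual disk shows up as /dev/sda:

# Cached (-T) and buffered (-t) read timings for the disk
hdparm -tT /dev/sda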
-
@Wayne-Workman said in Network throughput crippled.:
When I tested throughput using iPerf to 127.0.0.1 I get 34Gbps.
This statement is interesting since you are getting 34 Gb/s, but that traffic stays entirely on the host and never leaves the vm client (FOG server). If you have the resources, I would spin up another CentOS 7 (test) box on this same vm host server and see what your throughput is to that clean CentOS 7 server (you can destroy it after testing is done).
-
@george1421 Thanks. I’ll ask about all those things. I was going to do an iperf test from the FOG server to a Windows server on the same platform; would that accomplish the same thing as spinning up another Linux VM and testing between it and the FOG server? Or are you looking to see how a fresh Linux VM performs?
-
@Wayne-Workman I would use an existing vm client if I had one. We have a CentOS 7 vm template, so spinning up a new CentOS 7 vm takes about 3 minutes, not counting boot time. The goal here is to see whether it is the vm host itself or the vm host-to-network path where you are taking the hit. Test 2 would be the FOG server to another server on the same core switch, then keep testing devices near to far from the FOG server to see if you can establish a pattern. Right now it’s not clear in my mind where the problem isn’t.
-
@george1421 I confirmed the unit of measurement is Mbits/sec. Averages are now around 15 Mbits/sec.
-
@Wayne-Workman OK, wow, that sucks. Now start testing your way from near systems to far systems to see if you can pinpoint where things go bad.
-
@Wayne-Workman
Just as a reference, my FOG server on ESXi 6 to another (OLD Debian jessie, including old unoptimised net devices) VM on the same host:
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.1.110.100, TCP port 5001
TCP window size: 1.83 MByte (default)
------------------------------------------------------------
[  5] local 10.1.100.50 port 50312 connected with 10.1.110.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  7.35 GBytes  6.31 Gbits/sec
[  4] local 10.1.100.50 port 5001 connected with 10.1.110.100 port 40811
[  4]  0.0-10.0 sec  12.4 GBytes  10.6 Gbits/sec
The only thing I can suggest is possibly removing the virtual network device from the guest, booting, re-adding the most current type, and reconfiguring it.
I honestly can’t see what a power outage would do other than possibly mess something up in the guest OS’s VMX file. Do you have a backup to compare against?
These are the relevant lines from my vmx
ethernet0.virtualDev = "vmxnet3"
ethernet0.networkName = "VM Network VLAN100"
ethernet0.addressType = "generated"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
ethernet0.pciSlotNumber = "192"
ethernet0.generatedAddress = "00:0c:29:XX:YY:ZZ"
ethernet0.generatedAddressOffset = "0"
-
@Mentaloid On Monday, after taking a new snapshot (just to have it), we’re going to apply an older snapshot and see if the problem is resolved or not.
The VMX stuff you posted, what file is that in on Debian, or is that in ESXi?
Earlier today, we removed the E1000 adapter, added an E1000E adapter, and rebooted. The OS didn’t detect the adapter correctly, and there was no network connectivity. I could have just missed a step; I did generate a new UUID for the old interface name (it’s a Red Hat thing), but I am not entirely sure the new interface’s name would be the same as the old one. The old interface name was ens32. I’ll have to do more testing on this Monday.
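For what it’s worth, this is the rough sequence I plan to walk through when we retry it, assuming the new adapter comes up under a different name (ens33 below is just a placeholder):

# See what name the kernel gave the new adapter
ip link show

# Check which driver bound to it (e1000e vs vmxnet3 vs e1000)
ethtool -i ens33

# Re-point the old config at the new name, regenerate the UUID, restart networking
cp /etc/sysconfig/network-scripts/ifcfg-ens32 /etc/sysconfig/network-scripts/ifcfg-ens33
sed -i 's/ens32/ens33/g' /etc/sysconfig/network-scripts/ifcfg-ens33
sed -i "s/^UUID=.*/UUID=$(uuidgen)/" /etc/sysconfig/network-scripts/ifcfg-ens33
systemctl restart network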
Of course - any and all advice or questions are welcome. I need all the help I can get.
-
@Wayne-Workman
.VMX is the ESXi config file for the guest OS; it is normally stored with your virtual hard disk on your SAN. If you can’t edit/view the file (plain text) directly on your SAN, you can view it via the vSphere client/web interface: edit the powered-off VM, go to the Options tab, then Advanced/General, and hit the button for Configuration Parameters. Be careful in there! CentOS, I would imagine, supports vmxnet adapters; if your guest OS supports them, they are more efficient/faster than e1000/e1000e emulation. I think this guide should help you get that running…
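If you do move to vmxnet3, a quick sanity check from inside the guest afterwards (the interface name is a placeholder; it may come up as ens192 or similar):

# Confirm the vmxnet3 module is loaded
lsmod | grep vmxnet3

# Confirm the interface is bound to the vmxnet3 driver
ethtool -i ens192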
Back to an earlier point, though: have you confirmed that GuestOS/VM0 to another GuestOS/VM on the same VMHost is not working at the correct speed?
If VM-to-VM is good (meaning the VMHost virtual switch and virtual adapters are working internally), I’d look at your LAG/bond for your uplink. I’ve had ESXi puke and start dropping packets on a LAG before, and I’ve also had switches with good “links” and no frame errors that were not passing data in one direction after a power loss. This can be frustrating to troubleshoot in LAG/bonded links. Pull out all but one of the LAG wires (admin down from the switch isn’t good enough; you physically have to break the link for ESXi to figure out that it shouldn’t use the port for data), and verify. If it’s good, pull it and try another; keep going until you have verified each port is functioning on its own. Of course, you need to ensure you’re using a known good port on the switch for your test machine!
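While walking the ports, it can also help to watch the per-uplink counters on the ESXi host between test runs, to spot a link that carries no traffic or racks up errors (vmnic0 is just an example):

# Packet, byte, and error counters for one physical uplink
esxcli network nic stats get -n vmnic0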
-
Something just struck me: did you have any pending vm host updates waiting on a reboot? With ESXi, the hypervisor executes out of memory, so it’s possible to update the ESXi system files and have those updates only applied upon reboot. I’m not saying this is the case, but it could explain why things are acting a bit strange after an entire system restart.
-
So, we found out what it was.
A security camera contractor had damaged the fiber line that the VMware platform used.
He fixed it Friday morning. I’m guessing he told our network team what happened sometime Friday. After that, everything was fine.