
Tuesday, July 30, 2013

Troubleshooting a virtual network

In this entry I'm going to discuss my methodology for troubleshooting a virtual network. As a Network Escalation Engineer for VMware, I had many occasions both to troubleshoot network issues and to teach others how to do so. If you have ever been through one of my whiteboard lessons, you may recognize some of the diagrams below. This entry will probably evolve over time, and for clarity, I will be posting updates in the text rather than at the end.

For this discussion, I will be using the following diagram, representing the basic idea of a virtual network connected to a physical network. I am mostly going to focus on standard switches and standard VLANs; I won't cover the Nexus 1000v, vCNI, VXLAN, or any advanced features, although the basic ideas won't change much. Click on any of the images below for a larger version.

The diagram has two ESXi hosts, each with two VMs. All the VMs are connected to the same portgroup. Each host is connected to two switches for redundancy. The two physical switches are then connected to a root switch, which is connected to a router (or may in fact be a router).

Troubleshooting networking, sometimes you have to think outside the network.

Josh Townsend's recent post on vmtoday, "PCoIP Packet Loss? Don't blame the network", is a fantastic example of troubleshooting. In addition, it illustrates something I ran into frequently as an Escalation Engineer: just because a problem is exhibiting the signs of a network issue does not mean the network is the cause, and sometimes you have to expand your search. Once you have eliminated the possible network problems, you have to be willing to look at other areas that might cause similar symptoms.
A very good example of that is packet loss. In addition to network problems (and the very occasional insidious vmkernel problem), anything that causes a VM to pause, even momentarily, can cause it to drop packets.
Storage is one possibility, as Josh pointed out. Another possibility is too many vCPUs. While not exactly a pause, if the VM can't get all of its processors scheduled, it is not going to be able to pull all of the packets off the ring buffer, and they get dropped. Relaxed co-scheduling helps with this, but does not eliminate it. I can't count the number of cases that were resolved by reducing the number of vCPUs from 4 to 2. (Be aware of HAL/kernel compatibility when changing between 1 and 2 vCPUs.)
Another issue that can cause packet loss and VM pausing is the CD-ROM drive. If the mounted ISO is not available (there can be multiple reasons for this) but the operating system keeps trying to access it, the result is small, frequent pauses, often showing up as every other ping being dropped. A common cause of this on ESX(i) 5.0 and earlier is mounting an ISO from a VMFS datastore and then vMotioning the VM to a ninth host that needs to read the image: VMFS only supports eight hosts accessing the same read-only file at one time.
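
If you suspect the vCPU count is to blame, a quick first check is to look at the co-stop and receive-drop counters in esxtop on the host running the VM. This is a rough sketch, not a recipe; the column names below are what esxtop shows on 5.x and may differ slightly between versions.

    # In the ESXi Shell on the host running the VM:
    esxtop
    # Press 'c' for the CPU view. A sustained %CSTP value for the VM means its
    # vCPUs are spending time waiting to be co-scheduled, a sign it may have
    # more vCPUs than it can use effectively.
    # Press 'n' for the network view. %DRPRX on the VM's port shows received
    # packets the guest could not pull off the ring buffer in time.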

Related information:
KB 1015797
KB 1005362
KB 1010184

Monday, July 29, 2013

vsish for networking

vsish, or the vmkernel system information shell, provides behind-the-curtain information on the running vmkernel, similar to the way /proc provides information on a running Linux kernel. For more information, see "What is VMware vsish?" from William Lam over at VirtuallyGhetto.

First, a word of caution: vsish is not supported unless you are directed to use it by VMware Support. Do not make uneducated changes to the vmkernel, or you can significantly reduce performance or cause a purple screen.

I am going to focus specifically on network nodes that can be helpful for retrieving information; a short example session follows the list below.


  • /vmkModules/cdp: CDP information for vmnics
  • /net/pNics/vmnicX/stats: vmnic statistics
  • /net/pNics/vmnicX/properties: driver/firmware information and other properties
  • /net/tcpip/v4/neighbors/: ARP cache information
  • /net/portsets/vSwitchX/ports/#####/: X is the vSwitch number, ##### is the port number from esxtop
    • status: information on the port, including what device it is connected to
    • stats: standard switch counters
    • clientStats: counters from the vNIC perspective
    • teamuplink: what uplink this port is bound to
    • vmxnet3/rxSummary: additional receive counters for the vmxnet3
    • vmxnet3/txSummary: additional transmit counters for the vmxnet3
  • /net/portsets/DvsPortset-#/ports/######/: same as above, for the vDS
  • /system/heaps/NetPktHeap/######: current NetPktHeap status; if less than 30% of the max size is free, problems may occur. Check both high and low, but low is more important. (This is usually only a problem on 4.0 with more than two 10Gb NICs.)
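
To tie a few of these together, here is a rough example session. The vSwitch name (vSwitch0), the vmnic (vmnic0), and the port number are placeholders; substitute the names and the port number from esxtop on your own host, and keep in mind that the exact node layout varies between ESXi versions.

    # List the ports on a standard vSwitch
    vsish -e ls /net/portsets/vSwitch0/ports/

    # What is connected to a given port, and its counters
    # (use the port number from esxtop's network view)
    vsish -e get /net/portsets/vSwitch0/ports/12345678/status
    vsish -e get /net/portsets/vSwitch0/ports/12345678/stats
    vsish -e get /net/portsets/vSwitch0/ports/12345678/clientStats

    # Physical NIC statistics and driver/firmware properties
    vsish -e get /net/pNics/vmnic0/stats
    vsish -e get /net/pNics/vmnic0/properties

    # ARP cache entries known to the vmkernel
    vsish -e ls /net/tcpip/v4/neighbors/

    # Locate the NetPktHeap instance, then check how much of it is free
    # (the stats leaf is where max size and free space usually live, but the
    # exact layout differs by build)
    vsish -e ls /system/heaps/ | grep -i netpkt
    vsish -e get /system/heaps/<name from the previous command>/stats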

Sunday, July 28, 2013

First Post

Like many of my colleagues, I have decided to start writing a blog, mostly about VMware, although I may cover some other subjects on occasion. I am mostly going to try to stay away from how-to information, as that is posted on other sites that do a better job than I could, and instead focus on what is going on under the hood and on troubleshooting. I often find myself thinking “I wonder how this works.” This blog will record those explorations. Of course, initial intentions and how things evolve are often different, so don't hold me to this.