For this discussion, I will be using the following diagram, which represents the basic idea of a virtual network connected to a physical network. I am mostly going to focus on standard switches and standard VLANs. I won't cover the Nexus 1000v, vCNI, VXLAN, or any advanced features, although the basic ideas won't change much.
The diagram has two ESXi hosts, each with two VMs. All the VMs are connected to the same portgroup. Each host is connected to two switches for redundancy. The two physical switches are then connected to a root switch, which is connected to a router (or may in fact be a router).
First pick your troubleshooting tools. I usually use iperf/netperf, nc, and ping. I will cover these tools in detail later on and link back to them. For intermittent problems you may need to write a script to monitor for longer periods of time.
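For the intermittent case, a monitoring script can be as simple as the sketch below. It is only an illustration: the function names, the log file name, and the default interval are made up for this example, and it assumes a Linux or macOS style `ping` that accepts `-c` and prints a "% packet loss" summary line.

```python
import re
import subprocess
import time
from datetime import datetime

# Matches the summary line printed by Linux/macOS ping, e.g.
# "5 packets transmitted, 4 received, 20% packet loss, time 4005ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet loss percentage found in ping output, or None."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def ping_once(target, count=5):
    """Run one batch of pings and return its loss percentage (None if unparsable)."""
    result = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True, text=True,
    )
    return parse_packet_loss(result.stdout)


def monitor(target, interval=60, rounds=None, logfile="ping_monitor.log"):
    """Ping target every `interval` seconds, appending one timestamped line per batch.

    rounds=None keeps going until the script is interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        loss = ping_once(target)
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(logfile, "a") as log:
            log.write(f"{stamp} {target} loss={loss}\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling `monitor("192.0.2.10", interval=30)` (the address is a placeholder) from the sending VM and leaving it running overnight gives you timestamps to line up against the logs on the host and the physical switches.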
The first step in troubleshooting a network problem related to a VM should always be to test VM to VM on the same host and the same portgroup. At this stage of the testing there is no "network" involved. It's simply a memory transfer, so any problems that you see at this level are related to the VM itself (firewall, IPsec, misconfiguration, too many vCPUs) or the host (too many VMs, storage issues, memory/CPU issues).
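If nc isn't installed on one of the guests, a rough stand-in for a basic TCP reachability test might look like the following. This is a sketch, not a replacement for iperf or netperf: `tcp_check` is a name invented here, and it only tells you whether a TCP connection to a listening port succeeds and how long the connect took.

```python
import socket
import time


def tcp_check(host, port, timeout=3.0):
    """Attempt a TCP connection; return (reachable, seconds_taken)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False, time.monotonic() - start
```

Run it from one VM against a port you know is listening on the other VM, for example `tcp_check("192.0.2.11", 22)` (placeholder address). A failure or a slow connect between two VMs on the same host and portgroup points at the guests or the host, not the physical network.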
Next, it is helpful to determine which vmnic the VM is bound to, and to look at the dropped packet counters. To do this we use esxtop or resxtop. Press "n" for networking. The first column tells us the port number, which will be helpful later on if we need to use vsish. Under USED-BY we can locate the VM in question, and TEAM-PNIC will then tell us which vmnic it is bound to (if it says all-#, we are using IP Hash). Also look at the dropped packet counters. Dropped packets at the vnic indicate a problem with the VM, while dropped packets at the vmnic generally indicate a driver or host issue.
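Watching the counters interactively works for a steady problem, but for anything intermittent it can help to capture esxtop in batch mode (`esxtop -b`, which writes CSV) and scan the capture afterwards. The sketch below assumes the drop counters appear as columns whose header contains the text "Packets Dropped"; check that against the header row of your own capture, because the exact column naming is an assumption here, as is the function name.

```python
import csv
import io


def dropped_packet_columns(csv_text, needle="Packets Dropped"):
    """Return {column_name: highest_value} for drop counters that were ever nonzero."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    # Indexes of every column whose header mentions the drop counter text
    wanted = [i for i, name in enumerate(header) if needle in name]
    worst = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip short rows and non-numeric cells
            if value > 0:
                worst[header[i]] = max(worst.get(header[i], 0.0), value)
    return worst
```

The column names that come back include the port they belong to, so you can tell whether the drops were on a VM's vnic or on a vmnic, which is the distinction that matters above.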
Now that we know which vmnic we are bound to, we can use that, together with either CDP or a network diagram, to determine which switch we are connected to (if you don't know, try looking for the MAC of the VM on the physical switch).
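One small snag when hunting for the VM's MAC on the physical switch is notation: the VM shows it colon-separated, while Cisco IOS prints MAC addresses in dotted groups of four hex digits. Assuming a Cisco switch (which the mention of CDP suggests, but your environment may differ), a throwaway converter like this saves some squinting. The function names are just illustrative.

```python
import re


def normalize_mac(mac):
    """Drop everything that isn't a hex digit and lowercase the rest."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def to_dotted(mac):
    """Format as xxxx.xxxx.xxxx, the style Cisco IOS prints."""
    d = normalize_mac(mac)
    return ".".join(d[i:i + 4] for i in range(0, 12, 4))


def to_colon(mac):
    """Format as xx:xx:xx:xx:xx:xx, the style the VM and vSphere show."""
    d = normalize_mac(mac)
    return ":".join(d[i:i + 2] for i in range(0, 12, 2))
```

For example, `to_dotted("00:50:56:AB:CD:EF")` gives `0050.56ab.cdef`, which is the string to search for in the switch's MAC address table.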
Using this information, test from VM to VM on different hosts, but going through only one physical switch. Problems in this area generally indicate vSwitch or physical switch misconfiguration, driver problems (you are running the latest version of the driver, right? And you updated the firmware at the same time?), physical issues (cabling, ports), or sometimes host issues.
Everything good so far? Excellent, we have mostly eliminated the host as a candidate, although IP Hash misconfigurations can be particularly insidious to troubleshoot.
Many, but not all, network problems can be isolated, if not resolved, by using this method. Sometimes you'll have to do packet captures at multiple points along the route to determine what is going on. While there is no method for doing packet captures on a vmnic, you can do a capture on a vmk using tcpdump-uw. I'll leave packet captures to another article.
Don't forget to examine log files. The vmkernel.log file can be helpful, particularly for vmnic issues. Search it for the vmnic number in question. For problems related to vMotion and virtual machine hardware, check the vmware.log file that is in the same directory as the VM files.
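If you have pulled a copy of the log off the host, the vmnic search can be done with a few lines of Python. The one thing worth getting right is matching the whole name, so that a search for vmnic1 doesn't also return every line about vmnic10. This is a minimal sketch with an invented function name, and the log path in the comment is an assumption about where your copy lives.

```python
import re


def lines_mentioning(lines, nic):
    """Yield lines that mention nic as a whole word (vmnic1 does not match vmnic10)."""
    pattern = re.compile(rf"\b{re.escape(nic)}\b")
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")


# Example use against a copied log file (path is an assumption):
#
#   with open("vmkernel.log") as log:
#       for hit in lines_mentioning(log, "vmnic1"):
#           print(hit)
```

Because it takes any iterable of lines, the same function works on vmware.log or on a list of lines you have already filtered by time.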
A couple of final thoughts. Remember that network communication consists of three parts: the sender, the receiver, and the stuff in the middle. Check all three. Make sure that you are only dealing with one problem at a time. If you have varied symptoms, you may have more than one problem, so troubleshoot them one at a time. And finally, it is NEVER random, you just haven't found the pattern yet.



