Travelling Techie: Troubleshooting networking, sometimes you have to think outside the network.

Josh Townsend’s recent post on VMtoday, “PCoIP Packet Loss? Don’t blame the network”, is a fantastic example of troubleshooting. It also illustrates something I ran into frequently as an Escalation Engineer: a problem that shows the signs of a network issue is not always a network problem. Once you have eliminated the likely network causes, you have to be willing to look at other areas that can produce similar symptoms.
Packet loss is a very good example. In addition to network problems (and the very occasional insidious VMkernel problem), anything that causes a VM to pause, even momentarily, can cause it to drop packets.
Storage is one possible cause, as Josh pointed out. Another is too many vCPUs. While not exactly a pause, if the VM can’t get all of its processors scheduled, it won’t be able to pull all of the packets off the ring buffer, and they get dropped. Relaxed co-scheduling helps with this, but does not eliminate it. I can’t count the number of cases that were resolved by reducing the number of vCPUs from 4 to 2. (Be aware of HAL/kernel compatibility when changing between 1 and 2 vCPUs.)
Another cause of packet loss and VM pausing is the CD-ROM drive. If the mounted ISO is not available (there can be multiple reasons for this) but the operating system keeps trying to access it, the result is small, frequent pauses, often seen as every other ping being dropped. A common cause of this on ESX(i) 5.0 and below is mounting an ISO that lives on a VMFS datastore and then vMotioning the VM to a host that becomes the 9th host accessing that image. VMFS only supports 8 hosts accessing the same read-only file at one time.
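The “every other ping” symptom is worth telling apart from random or bursty loss, because a regular pattern points toward something periodic, such as a VM pause, rather than a congested link. Here is a minimal sketch in Python, assuming you have already collected ping results as a list of booleans (True for a reply, False for a loss). The function name and the thresholds are my own illustrative choices, not part of any VMware tooling.

```python
from typing import Sequence


def summarize_ping_loss(replies: Sequence[bool]) -> dict:
    """Summarize a run of ping results (True = reply received, False = lost)."""
    total = len(replies)
    if total == 0:
        return {"loss_rate": 0.0, "alternation": 0.0, "looks_periodic": False}

    lost = sum(1 for r in replies if not r)
    loss_rate = lost / total

    # Fraction of adjacent ping pairs whose outcome differs
    # (reply followed by loss, or loss followed by reply).
    flips = sum(1 for a, b in zip(replies, replies[1:]) if a != b)
    alternation = flips / (total - 1) if total > 1 else 0.0

    # Hypothetical thresholds: about half the pings lost, with nearly every
    # adjacent pair differing, suggests a regular "every other ping" pattern.
    looks_periodic = 0.4 <= loss_rate <= 0.6 and alternation >= 0.9

    return {
        "loss_rate": loss_rate,
        "alternation": alternation,
        "looks_periodic": looks_periodic,
    }
```

If `looks_periodic` comes back true, that is a hint (not proof) to check things like a stale ISO mount or vCPU scheduling before spending more time on the physical network.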

Related information:
KB 1015797
KB 1005362
KB 1010184

Tuesday, July 30, 2013
