Beacon probing resulting in excessive broadcasts

Not so long ago I posted a link to a VMware blog post in which beacon probing was demystified. That article stated that you should only use beacon probing when there is no link state tracking on the physical switches, and that you can consider beacon probing a nice software replacement for it.

Well, we had our ESX environment set up by a local supplier, and they advised us to use beacon probing instead of link state tracking. For some reason, from that moment on I got major events from my ProCurve switches reporting excessive broadcasts. It didn’t happen often, but especially during peak hours I kept getting this notification.

When I started sniffing the traffic on the uplinks of the switches I saw what kind of packets were involved: an almost continuous flow of RARP packets coming from the ESX servers. ESX uses RARP packets to update the MAC address tables of the physical switches. This way, when a MAC address suddenly becomes reachable through a different switch port, the switch already knows the new path. This is also what happens when a virtual switch detects that a link is not functional: it fails over to another uplink and notifies the switches. When beacon probing isn’t working as expected, ESX constantly thinks the uplink isn’t functional, so it keeps switching uplinks and, as a result, keeps sending out RARP packets.
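
If you want to pick these packets out of a capture yourself, something like the following rough sketch could help. It is not what I used at the time, just an illustration: it assumes you already have the raw Ethernet frames as bytes (for example exported from your sniffer), and the function names `rarp_source_mac` and `count_rarp_by_source` are made up for this example. RARP frames are recognised by their EtherType, 0x8035, optionally behind a single 802.1Q VLAN tag.

```python
import struct
from collections import Counter

ETHERTYPE_RARP = 0x8035  # EtherType for RARP (RFC 903)
ETHERTYPE_VLAN = 0x8100  # EtherType that marks an 802.1Q VLAN tag


def rarp_source_mac(frame: bytes):
    """Return the source MAC of a raw Ethernet frame if it is RARP, else None.

    Hypothetical helper for illustration; handles at most one VLAN tag.
    """
    if len(frame) < 14:
        return None  # too short to hold an Ethernet header
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype == ETHERTYPE_VLAN and len(frame) >= 18:
        # Skip the 4-byte 802.1Q tag to reach the real EtherType.
        ethertype = struct.unpack("!H", frame[16:18])[0]
    if ethertype != ETHERTYPE_RARP:
        return None
    # Bytes 6..11 of the Ethernet header are the source MAC address.
    return ":".join(f"{b:02x}" for b in frame[6:12])


def count_rarp_by_source(frames):
    """Count RARP frames per source MAC over an iterable of raw frames."""
    counts = Counter()
    for frame in frames:
        mac = rarp_source_mac(frame)
        if mac is not None:
            counts[mac] += 1
    return counts
```

A handful of source MAC addresses with very high counts would point at specific hosts or VMs whose uplinks keep flapping.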

You can configure both the ‘Notify Switches’ setting and the network failure detection, but just setting ‘Notify Switches’ to ‘No’ isn’t a good fix. The failover would still be constantly changing uplinks, and this can result in errors (in my case timeouts with TFTP). The real problem was the failure detection. Beacon probing for some reason just didn’t work in our environment. When I changed the failure detection to ‘Link Status only’, all the RARP packets disappeared and my excessive broadcasts were gone.

In my case link status only is sufficient, but I can imagine there are cases where you would want to use beacon probing. If you enable beacon probing and this results in excessive broadcasts (or just more broadcasts), I do advise you to check whether you can find those RARP packets. They can indicate that beacon probing is just not working correctly in your environment.
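
To judge whether the RARP traffic you find is really excessive, you could look at the rate per source. The sketch below is only an illustration under a few assumptions: the RARP packets have already been identified, each one is given as a (timestamp in seconds, source MAC) pair, and the threshold of 10 packets per second is an arbitrary value I picked for the example, not a number from VMware or ProCurve. The function names are hypothetical as well.

```python
from collections import defaultdict


def rarp_peak_per_second(events):
    """Return {source_mac: highest RARP count seen in any one-second bucket}.

    `events` is an iterable of (timestamp_seconds, source_mac) pairs for
    packets that were already identified as RARP.
    """
    buckets = defaultdict(int)
    for ts, mac in events:
        # Group packets by source and by whole second of the capture.
        buckets[(mac, int(ts))] += 1
    peak = {}
    for (mac, _second), n in buckets.items():
        peak[mac] = max(peak.get(mac, 0), n)
    return peak


def suspicious_sources(events, threshold=10):
    """List sources whose peak rate reaches the (assumed) threshold."""
    peaks = rarp_peak_per_second(events)
    return sorted(mac for mac, n in peaks.items() if n >= threshold)
```

A source that shows up here over and over during peak hours is a good candidate for a host where failure detection keeps triggering failovers.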

I want to thank Scott Lowe for giving me a push in the right direction.

5 responses to “Beacon probing resulting in excessive broadcasts”

  1. Beacon probing works well with switches configured with SLAG (Static LAG)

  2. I also had a bad experience with beacon probing corrupting a switch’s CAM table and causing excessive flapping. We are back to link status.

  3. Actually turns out our issue was due to secured interface ports (so the interfaces couldn’t speak to one another). In the end this was not a beacon probing issue.

  4. “It was an almost continuous flow of RARP packets coming from the ESX servers.”

    What packet rate did you observe? E.g., 1000 pkts/sec, 10,000 pkts/sec, 100,000 pkts/sec, etc. Was it the same RARP packet repeated over and over, or a stream of RARPs with different VM MAC addresses (corresponding to VMs being moved back and forth between uplinks)?

    Why do I care about these details? Because we have a similar issue involving a broadcast storm wherein the same RARP packet is repeated over and over at about 500,000 pkts/sec. We are trying to discern whether it is ESX sending this traffic, or something else repeating a RARP packet originating from one of our ESX hosts.

    Thanks!

  5. Pingback: Possible reasons for RARP storms from an ESX host | Virtualaholics
