How have folks set up automated monitoring of HA status and session sync?
We see HA instability on a 5060 A/A cluster during periods of high load. The boxes get too busy to respond to HA messages, lose heartbeat, and start to think the links have failed. Normally, they recover automatically when the load decreases. But this weekend, we had a case where HA1 heartbeat did not return, and session sync failed as a result.
I'd like to monitor for this, but I can't figure out what to poll. There's nothing about HA in SNMP, so I'm looking at the CLI/API. "show high-availability state" seems to display the configuration, not the status, and it doesn't give specific status info for each HA link, the heartbeats, or sync. "show high-availability state-synchronization" looks promising, but I can't tell if it's reporting configuration or status.
Hi... Did you set up HA heartbeat backup, where heartbeats are sent over the mgmt interface in addition to HA1? That will help when the dataplane is under high load. Thanks.
Here is a CLI command that you can use to see the hello timeouts and failures. Another method is to configure the system log to forward to SNMP and/or syslog and monitor for HA heartbeat events from your SNMP/syslog console. Thanks.
admin@PA-5060(active)> show high-availability control-link statistics
Control Link Statistics:
Messages-TX : 23004
Messages-RX : 22973
Capability-Msg-TX : 17
Capability-Msg-RX : 17
Error-Msg-TX : 5
Error-Msg-RX : 1
Preempt-Msg-TX : 0
Preempt-Msg-RX : 0
Preempt-Ack-Msg-TX : 0
Preempt-Ack-Msg-RX : 0
Primary-Msg-TX : 7
Primary-Msg-RX : 7
Primary-Ack-Msg-TX : 7
Primary-Ack-Msg-RX : 7
Hello-Msg-TX : 22954
Hello-Msg-RX : 22927
Hello-Timeouts : 0
Hello-Failures : 0
I set up email alerts for Critical events. That way, when one occurs, you get notified of what happened. Since I have more than 5 HA pairs, I set it up on Panorama.
Here is what the emails look like:
Subject: SYSTEM ALERT : critical : HA Group 1: Moved from state Active to state Non-Functional
receive_time: 2015/01/07 15:22:52
time_generated: 2015/01/07 15:22:50
opaque: Chassis Master Alarm: HA-event
That looks potentially promising - we could pull those counters into our monitoring system and alert on their incrementing.
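For anyone wanting to try the counter approach, here is a minimal sketch in Python of parsing that CLI output and flagging counters that have gone up since the last poll. It assumes you already have some way to capture the command's text output (for example over SSH; that part is not shown), and the choice of which counters to watch is my own guess, not anything documented. The function names are hypothetical.

```python
import re

# Counters treated as signs of trouble. This selection is an assumption;
# adjust it to whatever matters in your environment.
WATCHED = ("Hello-Timeouts", "Hello-Failures", "Error-Msg-TX", "Error-Msg-RX")

# Matches lines such as "Hello-Timeouts : 0". The CLI prompt line and the
# "Control Link Statistics:" header do not match and are skipped.
_COUNTER_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z-]*)\s*:\s*(\d+)\s*$")


def parse_control_link_stats(text):
    """Return {counter_name: int_value} from the CLI text output."""
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def incremented(previous, current, watched=WATCHED):
    """Return {counter_name: delta} for watched counters that increased."""
    return {
        name: current[name] - previous[name]
        for name in watched
        if name in previous and name in current
        and current[name] > previous[name]
    }
```

A poller would keep the previous parsed result and alert whenever incremented() returns a non-empty dict. A counter that goes down (for example after a reboot) is not flagged in this sketch.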
Unfortunately I can't get the data from the API. If I call "<show><high-availability><control-link><statistics></statistics></control-link></high-availability></show>", the response looks like this:
So there is some data there, but we can't get at it programmatically.
I was hoping to get something we could feed into our monitoring system, so that our NOC will be alerted automatically.
We have the emails set up, and they are helpful when you're looking at email, but that doesn't really scale for automated monitoring.
I do this as well: I export all logs from Panorama to our log management system and then set up a custom alert in that system. Any log manager or SIEM should be able to accomplish this.
I guess there's no other option. It seems like there's no other way to get at this data. Shame it's so hard to collect some simple counters!
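As a rough illustration of the kind of custom alert rule I mean, here is a small Python sketch that picks HA state changes out of a log line. It assumes the forwarded log line contains the same message text as the email subject above ("HA Group 1: Moved from state Active to state Non-Functional"); check what your own forwarded logs actually look like. The function name is hypothetical.

```python
import re

# Pattern based on the message text seen in the alert email subject.
# Whether forwarded syslog lines carry the identical wording is an assumption.
HA_STATE_CHANGE = re.compile(
    r"HA Group (\d+): Moved from state ([\w-]+) to state ([\w-]+)"
)


def parse_ha_state_change(line):
    """Return a dict describing the HA state change, or None if no match."""
    match = HA_STATE_CHANGE.search(line)
    if match is None:
        return None
    return {
        "group": int(match.group(1)),
        "from": match.group(2),
        "to": match.group(3),
    }
```

The alert rule can then fire on any match, or only when the new state is one you care about, such as Non-Functional.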