Monitoring A/A HA status and session sync


Not applicable

How have folks set up automated monitoring of HA status and session sync?

We see HA instability on a 5060 A/A cluster during periods of high load.  The boxes get too busy to respond to HA messages, lose heartbeat, and start to think the links have failed. Normally, they recover automatically when the load decreases.  But this weekend, we had a case where HA1 heartbeat did not return, and session sync failed as a result.

But I can't figure out what to poll.  There's nothing about HA in SNMP, so I'm looking at the CLI/API.  "show high-availability state" seems to display the configuration, not the status, and it doesn't give specific status info for each HA link, the heartbeats, or sync.  "show high-availability state-synchronization" looks promising, but I can't tell if it's reporting configuration or status.

Ross

8 REPLIES

L6 Presenter

Hi, did you set up HA heartbeat backup, where heartbeats are sent over the mgmt interface in addition to HA1?  That will help when the dataplane is under high load.  Thanks.

Not applicable

Yes - we have HA heartbeat backup configured.  It doesn't really help - or at least, it doesn't help enough.

Here is a CLI command that you can use to see the hello timeouts and failures.  Another method is to configure the system log to forward to SNMP and/or syslog, and then monitor for HA heartbeat events from your SNMP/syslog console.  Thanks.

admin@PA-5060(active)> show high-availability control-link statistics

Group 1:
  Mode: Active-Passive
  Control Link Statistics:
    HA1:
      Messages-TX               : 23004
      Messages-RX               : 22973
      Capability-Msg-TX         : 17
      Capability-Msg-RX         : 17
      Error-Msg-TX              : 5
      Error-Msg-RX              : 1
      Preempt-Msg-TX            : 0
      Preempt-Msg-RX            : 0
      Preempt-Ack-Msg-TX        : 0
      Preempt-Ack-Msg-RX        : 0
      Primary-Msg-TX            : 7
      Primary-Msg-RX            : 7
      Primary-Ack-Msg-TX        : 7
      Primary-Ack-Msg-RX        : 7
      Hello-Msg-TX              : 22954
      Hello-Msg-RX              : 22927
      Hello-Timeouts            : 0
      Hello-Failures            : 0
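
If you end up scraping this from the CLI (over SSH, for example), the output is regular enough to parse line by line. Below is a minimal Python sketch, not an official tool: the function name parse_control_link_stats is made up for illustration, and it assumes every counter appears as "Name : number" exactly as in the output above. It handles only a single link section; if your output lists more than one link, later values would overwrite earlier ones.

```python
import re

# Matches counter lines such as "Hello-Timeouts            : 0".
# Header lines ("Group 1:", "Mode: Active-Passive", "HA1:") do not match
# because they have no integer after the colon.
COUNTER_RE = re.compile(r"^\s*([A-Za-z0-9-]+)\s*:\s*(\d+)\s*$")


def parse_control_link_stats(cli_text):
    """Return {counter_name: int} for every 'Name : number' line."""
    counters = {}
    for line in cli_text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


# Abbreviated copy of the output posted in this thread.
SAMPLE = """\
Group 1:
  Mode: Active-Passive
  Control Link Statistics:
    HA1:
      Messages-TX               : 23004
      Hello-Msg-TX              : 22954
      Hello-Msg-RX              : 22927
      Hello-Timeouts            : 0
      Hello-Failures            : 0
"""

stats = parse_control_link_stats(SAMPLE)
print(stats["Hello-Timeouts"], stats["Hello-Failures"])
```

The resulting dictionary can be handed to whatever your monitoring system accepts as a custom check.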

L3 Networker

I set up email alerts for Critical events. That way, when one occurs, you get notified of what happened. Since I have more than 5 HA pairs, I set it up on Panorama.

Here is what the emails look like:

Subject: SYSTEM ALERT : critical : HA Group 1: Moved from state Active to state Non-Functional

Body:

domain: 1
receive_time: 2015/01/07 15:22:52
serial:
seqno: 155559
actionflags: 0x8000000000000000
type: SYSTEM
subtype: general
config_ver: 0
time_generated: 2015/01/07 15:22:50
vsys:
eventid: general
object:
fmt: 0
id: 0
module: general
severity: critical
opaque: Chassis Master Alarm: HA-event
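
If the alert text ends up in a script rather than an inbox, the subject and body above are easy to pick apart. This is a rough Python sketch based only on the one sample shown here: the regular expression assumes the subject wording "HA Group N: Moved from state X to state Y", and the helper names parse_subject and parse_body are hypothetical.

```python
import re

# Assumed subject wording, taken from the single sample email in this thread.
STATE_CHANGE_RE = re.compile(
    r"HA Group (\d+): Moved from state (\S+) to state (\S+)"
)


def parse_subject(subject):
    """Return the HA group and state transition, or None if no match."""
    match = STATE_CHANGE_RE.search(subject)
    if match is None:
        return None
    return {
        "group": int(match.group(1)),
        "old_state": match.group(2),
        "new_state": match.group(3),
    }


def parse_body(body):
    """Split 'key: value' lines into a dict; only the first colon separates."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


subject = ("SYSTEM ALERT : critical : HA Group 1: "
           "Moved from state Active to state Non-Functional")
body = """domain: 1
receive_time: 2015/01/07 15:22:52
serial:
severity: critical
opaque: Chassis Master Alarm: HA-event"""

event = parse_subject(subject)
fields = parse_body(body)
print(event, fields["severity"])
```

Splitting on the first colon only keeps timestamps and the opaque field intact, since both contain further colons.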

That looks potentially promising - we could pull those counters into our monitoring system and alert on their incrementing.
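
Alerting on increments could look roughly like the Python sketch below. It assumes you already have two polls of the counters as dictionaries keyed by the names in the CLI output; the choice of watched counters and the function name increments are illustrative, not anything the firewall defines.

```python
# Counters whose growth suggests HA1 trouble. This selection is an assumption;
# adjust it to whatever matters in your environment.
WATCHED = ("Hello-Timeouts", "Hello-Failures", "Error-Msg-TX", "Error-Msg-RX")


def increments(previous, current, watched=WATCHED):
    """Return {name: delta} for watched counters that grew between polls.

    Negative deltas (for example, counters reset by a reboot) are ignored.
    """
    deltas = {}
    for name in watched:
        if name in previous and name in current:
            delta = current[name] - previous[name]
            if delta > 0:
                deltas[name] = delta
    return deltas


# Two hypothetical polls: the first uses values from the output in this
# thread, the second is invented to show a change.
poll_1 = {"Hello-Timeouts": 0, "Hello-Failures": 0, "Error-Msg-TX": 5}
poll_2 = {"Hello-Timeouts": 3, "Hello-Failures": 0, "Error-Msg-TX": 5}

changed = increments(poll_1, poll_2)
if changed:
    print("HA control-link counters increased:", changed)
```

Comparing deltas rather than absolute values means counts left over from past incidents do not keep the alert firing.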

Unfortunately I can't get the data from the API.  If I call "<show><high-availability><control-link><statistics></statistics></control-link></high-availability></show>", the response looks like this:

<response status="success">

  <result>

    <enabled>yes</enabled>

    <group>

      <mode>Active-Active</mode>

      <control-stats/>

    </group>

  </result>

</response>

So there is some data there, but we can't get at it programmatically.
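
For reference, this is roughly how a poller could detect the empty element and fall back to another source. It is a Python sketch using the standard library XML parser. The thread only shows control-stats empty, so the layout of a populated element is unknown; the code simply assumes any descendant with a numeric text value is a counter, and the function name control_stats_from_response is made up.

```python
import xml.etree.ElementTree as ET


def control_stats_from_response(xml_text):
    """Return one {'mode', 'counters'} dict per <group> in the response."""
    root = ET.fromstring(xml_text)
    if root.get("status") != "success":
        raise ValueError("API call did not report success")
    groups = []
    for group in root.iter("group"):
        counters = {}
        stats_el = group.find("control-stats")
        if stats_el is not None:
            # Assumption: numeric descendants are counters. The real
            # structure of a populated element is not shown in this thread.
            for el in stats_el.iter():
                text = (el.text or "").strip()
                if el is not stats_el and text.isdigit():
                    counters[el.tag] = int(text)
        groups.append({"mode": group.findtext("mode"), "counters": counters})
    return groups


# The response posted above, verbatim apart from whitespace.
RESPONSE = """<response status="success">
  <result>
    <enabled>yes</enabled>
    <group>
      <mode>Active-Active</mode>
      <control-stats/>
    </group>
  </result>
</response>"""

for group in control_stats_from_response(RESPONSE):
    if not group["counters"]:
        print("No control-link counters returned for mode", group["mode"])
```

With the response above, the counters dictionary comes back empty, which is the condition the poller would need to handle.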

I was hoping to get something we could feed into our monitoring system, so that our NOC will be alerted automatically.

We have the emails set up, and they are helpful when you're looking at email, but that doesn't really scale for automated monitoring.

I do this as well: I export all logs from Panorama to the log management system and then set up a custom alert from that system. Any log manager or SIEM should be able to accomplish this.

I guess there's no other option; there seems to be no other way to get at this data.  Shame it's so hard to collect some simple counters!
