12-02-2016 03:05 AM - edited 12-02-2016 03:55 AM
Hi Guys,
Just to clarify: heartbeat ping messages are sent in both directions (Active>Passive and Passive>Active), and these messages are processed by the management plane. So if my MP CPU utilisation is always high (98%, it is a 2050), is it possible to lose ICMP (heartbeat) messages?
Cheers,
Myky
12-02-2016 04:48 AM
HA1 has its own intelligent heartbeat to check whether both sides are 'aware' they are alive. This is controlled on the dataplane and flows through dataplane or dedicated interfaces.
The HA1 backup on the mgmt interface is an additional ping between the management planes. It ensures that if the dataplane is running so high that the HA1 messages time out, the passive unit doesn't take over and create a split-brain situation.
I wouldn't recommend setting your primary HA1 to management unless the DP is running high to begin with.
12-02-2016 04:53 AM - edited 12-02-2016 04:55 AM
Hi Reaper,
Thanks for your reply. So it is not processed by the MP then. We do have a case where the MP CPU on our PA2050 is always running high (98%), and we see a lot of these HA alerts:
Apr 01 03:17:06 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:17:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:17:29 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)
Apr 01 03:25:14 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:25:17 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:28:04 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:28:26 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:28:27 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)
Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 3 ping timeouts out of 3 (ha1)
Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:758): We have missed 4 pings from the peer for group 1 (ha1), restarting connection
Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1 connection down
Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: All HA1 connections down
Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1 peer link unknown, waiting hold
Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1-Backup peer link unknown, waiting hold
Apr 01 03:28:28 HA2 peer link unknown
Apr 01 03:28:28 HA2-Backup peer link unknown
Apr 01 03:28:28 HA3 peer link unknown
Apr 01 03:28:28 ha_peer_send_error(src/ha_peer.c:1452): Group 1 (HA1-MAIN): Sending errro message
Error Msg
---------
flags : 0x2 (close:)
err code : Heartbeat ping failure (16)
num tlvs : 1
Printing out 1 tlvs
TLV[1]: type 5 (ERR_STRING); len 23; value:
48656172 74626561 74207069 6e672066 61696c75 726500
Apr 01 03:28:28 Error: ha_peer_disconnect(src/ha_peer.c:1593): Group 1 (HA1-MAIN): peer connection error msg set: Heartbeat ping failure
Apr 01 03:28:28 Group 1 (HA1-MGMT): new primary (error), going away from NONE
Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA heartbeat backup is being used to avoid split-brain; the HA functionality is in a degraded state pending the recovery of HA1
Apr 01 03:28:28 ha_peer_send_primary(src/ha_peer.c:4950): Group 1 (HA1-MGMT): Sending primary message
Primary Msg
-----------
flags : 0x0
reason : 2 (error)
num tlvs : 0
Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3829): Attempting 1 modify for sw.sysd.peers
Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3874): Clearing out peer sysd setting because stop for link reconfig
Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3893): Setting sysd node to: { 'peer.': { }, }
Apr 01 03:28:28 ha_ping_stop(src/ha_ping.c:404): Group 1: Stopping pings for ha1
Apr 01 03:28:28 ha_ping_stop(src/ha_ping.c:404): Group 1: Stopping pings for ha1
Apr 01 03:28:28 ha_ping_start(src/ha_ping.c:210): Group 1: Starting pings for ha1
Apr 01 03:28:28 ha_peer_start(src/ha_peer.c:246): Group 1 (HA1-MAIN): waiting for ping response before starting connection
Apr 01 03:28:28 ha_peer_recv_primary(src/ha_peer.c:5020): Group 1 (HA1-MGMT): Receiving primary ack message
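As a side note, the hex dump in the ERR_STRING TLV above is just the error text in ASCII. A minimal Python sketch to decode it (the variable names are only for illustration):

```python
# Hex payload of the ERR_STRING TLV, copied from the log above.
tlv_hex = "48656172 74626561 74207069 6e672066 61696c75 726500"

# Drop the spaces between the 4-byte groups and convert to raw bytes.
raw = bytes.fromhex(tlv_hex.replace(" ", ""))

# The logged length (23) includes the trailing NUL terminator.
message = raw.rstrip(b"\x00").decode("ascii")

print(len(raw), message)
```

It decodes to the same "Heartbeat ping failure" string that the ha_peer_disconnect line reports.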
So heartbeat pings are constantly missed, but failover is not actually happening because of the mgmt backup link path, if I understood this correctly from the logs.
Cheers,
Myky
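To get a quick feel for how often the pings are missed, the ha_ping_peer_miss lines from the log above can be tallied with a short script. This is only a sketch: the regular expression is written against the line format seen in this excerpt and is an assumption about the format in general, and the function names are made up for illustration.

```python
import re
from collections import Counter

# The ha_ping_peer_miss lines copied from the log excerpt above.
LOG = """\
Apr 01 03:17:06 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:17:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:17:29 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)
Apr 01 03:25:14 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:25:17 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:28:04 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:28:26 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)
Apr 01 03:28:27 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)
Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 3 ping timeouts out of 3 (ha1)
"""

# Captures: timestamp, missed count, limit, link name.
MISS = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) Error: ha_ping_peer_miss\(.*?\): "
    r"Missed (\d+) ping timeouts out of (\d+) \((\w+)\)",
    re.MULTILINE,
)

def parse_misses(text):
    """Return a list of (timestamp, missed, limit, link) tuples."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)), m.group(4))
        for m in MISS.finditer(text)
    ]

events = parse_misses(LOG)

# How many times each "Missed N" value was logged.
counts = Counter(missed for _, missed, _, _ in events)

# Timestamps at which the missed count reached the limit.
at_limit = [ts for ts, missed, limit, _ in events if missed == limit]

print(dict(counts))
print(at_limit)
```

On this excerpt most misses are single ones that recover on their own, and the counter reaches the limit only once, at 03:28:28, which is when the HA1 connection is restarted.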
12-02-2016 05:03 AM
Hmm, I might be wrong 🙂 It's possible that pan_dha (the dataplane HA agent) forwards the heartbeats on to the management plane, which would actually make sense.
Your HA1 backup does not seem to be configured:
Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1-Backup peer link unknown, waiting hold
So yes, it is indeed possible that HA is flapping due to CPU load.
You could try enabling HA1-Backup to get the simplified pings. This could help because they require less intelligence, and so fewer CPU cycles.
12-02-2016 05:08 AM
Hi Reaper,
Unfortunately, I don't have control of this firewall, so I cannot confirm the details right away, but this is what I have noticed. Apart from configuring HA1-Backup, would increasing the heartbeat interval help in this case?
Thx,
Myky
12-02-2016 05:10 AM
Absolutely.
If your firewall is working within 'expected' parameters, you'll want to relax the heartbeat/hello interval and increase the hold time.
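As a rough illustration of why a longer interval helps, here is a back-of-the-envelope model in Python. It is not PAN-OS's actual algorithm: it simply assumes the link is declared down after a fixed number of consecutive missed pings (the log earlier in the thread shows "out of 3"), so the time a busy peer can stay silent scales with the interval. The function name and the interval values are hypothetical; check the allowed ranges for your platform and release.

```python
def silence_tolerated_ms(interval_ms: int, max_missed: int = 3) -> int:
    """Rough time a peer can go unanswered before max_missed
    consecutive pings are missed (simplified model, see assumptions)."""
    return interval_ms * max_missed

# Hypothetical example intervals, in milliseconds.
for interval_ms in (1000, 2000, 4000):
    print(interval_ms, silence_tolerated_ms(interval_ms))
```

The trade-off is that a real failure also takes correspondingly longer to detect, so the interval should only be relaxed as far as the overloaded management plane needs.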
12-02-2016 05:13 AM - edited 12-02-2016 05:17 AM
Will do a test and let you know. Thanks man!
P.S. I use different community forums for different vendors (Extreme, Aruba, Infoblox), but PA is the best 🙂
12-02-2016 06:15 AM
You might want to try and get whoever owns this box to upgrade, @TranceforLife. That constant high utilization is bound to be causing management issues across the board; I can't imagine the commit time or log query time on this device.
Here are a few Live documents on potential steps to lower the CPU utilization, as long as they fit your needs.
12-02-2016 06:17 AM
Hi All,
Oh, it is a nightmare, and you are correct about the upgrade. It is already scheduled for next week.
Thx,
Myky