PA HA concept quick question

Hi Guys,

 

Just to clarify: heartbeat ping messages are sent in both directions (Active > Passive and Passive > Active), and these messages are processed by the management plane. So if my MP CPU utilisation is always high (98%; it is a PA-2050), is it possible to lose ICMP (heartbeat) messages?

 

Cheers,

Myky

 

@reaper @kiwi @BPry

Hi @TranceforLife

 

HA1 has its own intelligent heartbeat to check that both sides are 'aware' they are alive. This is controlled on the dataplane and flows through dataplane or dedicated interfaces.

The HA1 backup on the mgmt interface is an additional ping between the management planes. It ensures that if the dataplane is running so high that the HA1 messages time out, the passive unit doesn't take over and create a split-brain situation.

 

I wouldn't recommend setting your primary HA1 to management unless the DP is running high to begin with.

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization

Hi Reaper,

 

Thanks for your reply. So it is not processed by the management plane then. We do have a case where the MP CPU on our PA-2050 is always running high (98%), and we see a lot of these HA alerts:

 

Apr 01 03:17:06 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:17:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:17:29 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)

Apr 01 03:25:14 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:25:17 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:28:04 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:28:26 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:28:27 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)

Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 3 ping timeouts out of 3 (ha1)

Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:758): We have missed 4 pings from the peer for group 1 (ha1), restarting connection

Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1 connection down

Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: All HA1 connections down

Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1 peer link unknown, waiting hold

Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1-Backup peer link unknown, waiting hold

Apr 01 03:28:28 HA2 peer link unknown

Apr 01 03:28:28 HA2-Backup peer link unknown

Apr 01 03:28:28 HA3 peer link unknown

Apr 01 03:28:28 ha_peer_send_error(src/ha_peer.c:1452): Group 1 (HA1-MAIN): Sending errro message

 

Error Msg

---------

flags    : 0x2 (close:)

err code : Heartbeat ping failure (16)

num tlvs : 1

  Printing out 1 tlvs

  TLV[1]: type 5 (ERR_STRING); len 23; value:

    48656172 74626561 74207069 6e672066 61696c75 726500

Apr 01 03:28:28 Error: ha_peer_disconnect(src/ha_peer.c:1593): Group 1 (HA1-MAIN): peer connection error msg set: Heartbeat ping failure

Apr 01 03:28:28 Group 1 (HA1-MGMT): new primary (error), going away from NONE

Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA heartbeat backup is being used to avoid split-brain; the HA functionality is in a degraded state pending the recovery of HA1

Apr 01 03:28:28 ha_peer_send_primary(src/ha_peer.c:4950): Group 1 (HA1-MGMT): Sending primary message

 

Primary Msg

-----------

flags    : 0x0

reason   : 2 (error)

num tlvs : 0

Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3829): Attempting 1 modify for sw.sysd.peers

Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3874): Clearing out peer sysd setting because stop for link reconfig

Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3893): Setting sysd node to: { 'peer.': { }, }

Apr 01 03:28:28 ha_ping_stop(src/ha_ping.c:404): Group 1: Stopping pings for ha1

Apr 01 03:28:28 ha_ping_stop(src/ha_ping.c:404): Group 1: Stopping pings for ha1

Apr 01 03:28:28 ha_ping_start(src/ha_ping.c:210): Group 1: Starting pings for ha1

Apr 01 03:28:28 ha_peer_start(src/ha_peer.c:246): Group 1 (HA1-MAIN): waiting for ping response before starting connection

Apr 01 03:28:28 ha_peer_recv_primary(src/ha_peer.c:5020): Group 1 (HA1-MGMT): Receiving primary ack message
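As a side note on the "Error Msg" dump in the log above: the TLV value is just the error string in hex. A minimal Python sketch that decodes it, assuming the value is plain ASCII with a trailing NUL byte (the variable names here are only for illustration):

```python
# Hex words copied from the TLV[1] value in the log above.
hex_dump = "48656172 74626561 74207069 6e672066 61696c75 726500"

# Turn the hex text into raw bytes, then drop the trailing NUL terminator.
raw = bytes.fromhex(hex_dump.replace(" ", ""))
text = raw.rstrip(b"\x00").decode("ascii")

print(len(raw))  # 23, matching "len 23" in the log
print(text)      # Heartbeat ping failure
```

The decoded string matches the "err code : Heartbeat ping failure (16)" line, so the TLV carries no extra information beyond the error code.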

 

So heartbeat pings are constantly being missed, but failover is not actually happening because of the mgmt backup link path, if I understood this correctly from the logs.
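To put numbers on how often the pings are missed, the log lines can be tallied with a short script. This is only a sketch: the regular expression is written against the lines quoted in this thread, not against any documented log format, and `summarize_misses` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   "... Missed 2 ping timeouts out of 3 (ha1)"
# The pattern is an assumption based on the log lines quoted in this thread.
MISS_RE = re.compile(r"Missed (\d+) ping timeouts out of (\d+) \((\w[\w-]*)\)")

def summarize_misses(lines):
    """Per link: number of miss events, worst miss count, times the limit was hit."""
    summary = {}
    for line in lines:
        m = MISS_RE.search(line)
        if not m:
            continue  # not a ping-miss line
        missed, limit, link = int(m.group(1)), int(m.group(2)), m.group(3)
        entry = summary.setdefault(link, {"events": 0, "worst": 0, "limit_reached": 0})
        entry["events"] += 1
        entry["worst"] = max(entry["worst"], missed)
        if missed >= limit:
            entry["limit_reached"] += 1
    return summary

# A few lines copied from the log above.
sample = [
    "Apr 01 03:17:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)",
    "Apr 01 03:17:29 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)",
    "Apr 01 03:28:26 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)",
    "Apr 01 03:28:27 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)",
    "Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 3 ping timeouts out of 3 (ha1)",
    "Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:758): We have missed 4 pings from the peer for group 1 (ha1), restarting connection",
]

print(summarize_misses(sample))
```

On these sample lines the limit is reached once, at 03:28:28, which is the moment the log reports the HA1 connection going down. The earlier entries are isolated misses that recovered before reaching the limit.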

 

Cheers,

Myky

Hmm, I might be wrong 🙂 It's possible that pan_dha (the dataplane HA agent) forwards the heartbeats on to the management plane, which would actually make sense.

 

Your HA1 backup does not seem to be configured:

Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1-Backup peer link unknown, waiting hold

 

So yes, it is indeed possible that HA is flapping due to CPU load.

You could try enabling the HA1 backup to enable the simplified pings. This could help, as they require less intelligence and therefore fewer CPU cycles.

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization

Hi Reaper,

 

Unfortunately, I do not have control of this firewall, so I cannot confirm the details right away, but this is what I have noticed. Apart from configuring HA1-Backup, would increasing the heartbeat interval help in this case?

 

Thx,

Myky

absolutely

 

If your firewall is working within 'expected' parameters, you'll want to relax the heartbeat/hello interval and increase the hold time.
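As a rough back-of-the-envelope for why a longer interval helps: if the link is declared down after a fixed number of consecutive missed heartbeats, the time a busy peer has to respond grows with the interval. The sketch below assumes that simple model. The limit of 3 comes from the "out of 3" in the log earlier in the thread, while the interval values are purely illustrative and not taken from this device.

```python
def detection_window_ms(heartbeat_interval_ms: int, allowed_misses: int) -> int:
    """Approximate span of consecutive missed heartbeats before the link is declared down.

    Simplified model: one heartbeat per interval, link down after `allowed_misses`
    misses in a row. Real timers (hello interval, hold times) are not modelled here.
    """
    return heartbeat_interval_ms * allowed_misses

# Illustrative intervals only; check the actual HA timer settings on the firewall.
for interval_ms in (1000, 2000, 4000):
    print(interval_ms, detection_window_ms(interval_ms, 3))
```

The trade-off is the usual one: a longer window tolerates a slow management plane better, but a genuine peer failure also takes longer to detect.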

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization

Will do a test and let you know. Thanks man!

 

P.S. I use different community forums for different vendors (Extreme, Aruba, Infoblox), but PA is the best 🙂

You might want to try to get whoever owns this box to upgrade, @TranceforLife. That constant high utilization is bound to be causing management issues across the board; I can't imagine the commit time or log query time on this device.

 

Here are a few Live documents on potential steps to lower the CPU utilization, as long as they fit your needs.

Part 1

Part 2

Hi All,

 

Oh, it is a nightmare, and you are correct about the upgrade. It is already scheduled for next week.

 

Thx,

Myky
