PA HA concept quick question

TranceforLife · ‎12-02-2016

Hi Guys,

Just to clarify: are heartbeat ping messages sent in both directions (Active>Passive and Passive>Active), and are they processed by the management plane? If so, and my MP CPU utilisation is always high (98%, it is a PA-2050), is it possible to lose ICMP (heartbeat) messages?

Cheers,

Myky

gurus members

@reaper @kiwi @BPry

reaper · ‎12-02-2016

Hi @TranceforLife

HA1 has its own intelligent heartbeat to check whether both sides are 'aware' they are alive. This is controlled on the dataplane and flows through dataplane or dedicated interfaces.

The HA1 backup on the mgmt interface is an additional ping between the management planes. It ensures that if the dataplane is running so high that the HA1 messages time out, the passive unit doesn't take over and create a split-brain situation.
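The role of the backup heartbeat described above can be pictured as a small decision table. This is only a toy model of the reasoning in this post, not PAN-OS's actual state machine; the names `PassiveAction` and `passive_decision` are made up for illustration.

```python
from enum import Enum


class PassiveAction(Enum):
    STAY_PASSIVE = "stay passive"
    STAY_PASSIVE_DEGRADED = "stay passive, HA degraded"
    TAKE_OVER = "take over as active"


def passive_decision(ha1_alive: bool, backup_heartbeat_alive: bool) -> PassiveAction:
    """Hypothetical model: what the passive unit does with the signals it has."""
    if ha1_alive:
        # Normal case: the primary HA1 heartbeat is healthy.
        return PassiveAction.STAY_PASSIVE
    if backup_heartbeat_alive:
        # HA1 timed out, but the peer still answers on the backup path,
        # so taking over would risk a split brain.
        return PassiveAction.STAY_PASSIVE_DEGRADED
    # Neither path sees the peer: assume it is really gone.
    return PassiveAction.TAKE_OVER


if __name__ == "__main__":
    for ha1 in (True, False):
        for backup in (True, False):
            print(ha1, backup, "->", passive_decision(ha1, backup).value)
```

The middle branch is the case the backup heartbeat exists for: HA1 is lost, yet the peer is demonstrably still alive.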

I wouldn't recommend setting your primary HA1 to management unless the dp is running high to begin with.

Tom Piens
PANgurus - Strata & Prisma Access specialist

TranceforLife · ‎12-02-2016

Hi Reaper,

Thanks for your reply. So it is not processed by the management plane then. We do have a case where the MP CPU on our PA-2050 is always running high (98%), and we see a lot of these HA alerts:

Apr 01 03:17:06 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:17:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:17:29 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)

Apr 01 03:25:14 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:25:17 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:28:04 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:28:26 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)

Apr 01 03:28:27 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)

Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 3 ping timeouts out of 3 (ha1)

Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:758): We have missed 4 pings from the peer for group 1 (ha1), restarting connection

Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1 connection down

Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: All HA1 connections down

Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1 peer link unknown, waiting hold

Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1-Backup peer link unknown, waiting hold

Apr 01 03:28:28 HA2 peer link unknown

Apr 01 03:28:28 HA2-Backup peer link unknown

Apr 01 03:28:28 HA3 peer link unknown

Apr 01 03:28:28 ha_peer_send_error(src/ha_peer.c:1452): Group 1 (HA1-MAIN): Sending errro message

Error Msg

---------

flags : 0x2 (close:)

err code : Heartbeat ping failure (16)

num tlvs : 1

Printing out 1 tlvs

TLV[1]: type 5 (ERR_STRING); len 23; value:

48656172 74626561 74207069 6e672066 61696c75 726500

Apr 01 03:28:28 Error: ha_peer_disconnect(src/ha_peer.c:1593): Group 1 (HA1-MAIN): peer connection error msg set: Heartbeat ping failure

Apr 01 03:28:28 Group 1 (HA1-MGMT): new primary (error), going away from NONE

Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA heartbeat backup is being used to avoid split-brain; the HA functionality is in a degraded state pending the recovery of HA1

Apr 01 03:28:28 ha_peer_send_primary(src/ha_peer.c:4950): Group 1 (HA1-MGMT): Sending primary message

Primary Msg

-----------

flags : 0x0

reason : 2 (error)

num tlvs : 0

Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3829): Attempting 1 modify for sw.sysd.peers

Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3874): Clearing out peer sysd setting because stop for link reconfig

Apr 01 03:28:28 ha_sysd_peerip_modify(src/ha_sysd.c:3893): Setting sysd node to: { 'peer.': { }, }

Apr 01 03:28:28 ha_ping_stop(src/ha_ping.c:404): Group 1: Stopping pings for ha1

Apr 01 03:28:28 ha_ping_start(src/ha_ping.c:210): Group 1: Starting pings for ha1

Apr 01 03:28:28 ha_peer_start(src/ha_peer.c:246): Group 1 (HA1-MAIN): waiting for ping response before starting connection

Apr 01 03:28:28 ha_peer_recv_primary(src/ha_peer.c:5020): Group 1 (HA1-MGMT): Receiving primary ack message

So heartbeat pings are constantly being missed, but a failover is not actually happening because of the mgmt backup link path, if I understood the logs correctly.
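For anyone wanting to check a similar log dump, here is a minimal Python sketch that pulls the `ha_ping_peer_miss` counters out of lines like the ones above and decodes the hex payload of the `ERR_STRING` TLV. The regular expression is fitted only to the lines quoted in this post, so treat the log format as an assumption; `miss_events` and `decode_tlv_hex` are illustrative helper names, not part of any PAN-OS tooling.

```python
import re

# Pattern fitted to the "Missed N ping timeouts out of M (link)" lines in this thread.
MISS_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) Error: ha_ping_peer_miss\(.*?\): "
    r"Missed (?P<missed>\d+) ping timeouts out of (?P<limit>\d+) \((?P<link>[\w-]+)\)"
)


def miss_events(lines):
    """Return (timestamp, missed, limit, link) for every matching log line."""
    events = []
    for line in lines:
        m = MISS_RE.match(line.strip())
        if m:
            events.append((m["ts"], int(m["missed"]), int(m["limit"]), m["link"]))
    return events


def decode_tlv_hex(hex_dump):
    """Decode a space-separated hex dump into ASCII, dropping the trailing NUL."""
    raw = bytes.fromhex(hex_dump.replace(" ", ""))
    return raw.rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    sample = [
        "Apr 01 03:28:26 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 1 ping timeouts out of 3 (ha1)",
        "Apr 01 03:28:27 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 2 ping timeouts out of 3 (ha1)",
        "Apr 01 03:28:28 Error: ha_ping_peer_miss(src/ha_ping.c:751): Missed 3 ping timeouts out of 3 (ha1)",
        "Apr 01 03:28:28 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1 connection down",
    ]
    for event in miss_events(sample):
        print(event)
    print(decode_tlv_hex("48656172 74626561 74207069 6e672066 61696c75 726500"))
```

The hex payload decodes to 22 characters plus a terminating NUL byte, which matches the `len 23` shown in the TLV line.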

Cheers,

Myky

reaper · ‎12-02-2016

hmmm i might be wrong 🙂 possible that the pan_dha (dataplane HA agent) forwards the heartbeats on to the management plane... which would actually make sense ..

your backup does not seem to be configured

Apr 01 03:28:28 ha_sysd_haX_link_change(src/ha_sysd.c:2223): Seeing HA1-Backup peer link unknown, waiting hold

so, indeed, yes it would be possible the HA is flapping due to CPU load

you could try enabling the ha1-backup to enable the simplified pings, this could help as it requires less intelligence, so less cpu cycles

Tom Piens
PANgurus - Strata & Prisma Access specialist

TranceforLife · ‎12-02-2016

Hi Reaper,

Unfortunately, I don't have control of this firewall, so I can't confirm the details right away, but this is what I have noticed. Apart from configuring HA1-Backup, would increasing the heartbeat interval help in this case?

Thx,

Myky

reaper · ‎12-02-2016

absolutely

if your firewall is working within 'expected' parameters you'll want to relax the heartbeat/hello interval and increase the hold time
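To see why relaxing the timers helps, here is a rough Python model of a consecutive-miss counter. It assumes the behaviour suggested by the logs earlier in the thread (the counter resets on a successful ping, and the connection restarts on the miss after "3 out of 3"); that is an inference from the log lines, not documented PAN-OS behaviour. The function names and all timer values are hypothetical.

```python
def first_down_index(responses, max_missed):
    """Return the index of the ping at which the link is declared down, or None.

    responses: iterable of booleans, True meaning the peer answered in time.
    max_missed: consecutive misses tolerated before the link is declared down.
    """
    missed = 0
    for i, ok in enumerate(responses):
        if ok:
            missed = 0  # assumed: any answered ping resets the counter
        else:
            missed += 1
            if missed > max_missed:
                return i
    return None


def survives_stall(stall_ms, interval_ms, max_missed):
    """Rough estimate: does a management-plane stall of stall_ms stay under the limit?"""
    missed = stall_ms // interval_ms
    return missed <= max_missed


if __name__ == "__main__":
    # Isolated misses never trip the limit; four in a row do.
    print(first_down_index([True, False, False, True, False, False, False, False], 3))
    # Illustrative values only: the same 4.5 s stall with a short and a relaxed interval.
    print(survives_stall(4500, 1000, 3))
    print(survives_stall(4500, 2000, 3))
```

In this model a longer interval means fewer pings fall inside the same stall, so a busy management plane gets more time before the counter runs out. The trade-off is that a real peer failure also takes longer to detect.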

Tom Piens
PANgurus - Strata & Prisma Access specialist

TranceforLife · ‎12-02-2016

Will do a test and let you know. Thanks man!

P.S. I use different community forums for different vendors (Extreme, Aruba, Infoblox), but PA is the best 🙂

BPry · ‎12-02-2016

You might want to try to get whoever owns this box to upgrade, @TranceforLife. That constant high utilization is bound to be causing management issues across the board; I can't imagine the commit time or log query on this device.

Here are a few live documents on potential steps to lower the CPU utilization, as long as they fit your needs.

Part 1

Part 2

TranceforLife · ‎12-02-2016

Hi All,

Oh, it is a nightmare, and you are correct about the upgrade. It is already scheduled for next week.

Thx,

Myky
