HA1 Backup link went down root cause analysis

Hi Reaper,

I have shared the error message that was logged at the time of the event.

It is in my first post, but I couldn't understand the error message.

L7 Applicator

Unfortunately, this error message only indicates that a heartbeat was missed.

To find the cause, more troubleshooting is needed, which will likely require a tech support file from each peer and some debug logs to be enabled.
Tom Piens - PANgurus.com
New to PAN-OS or getting ready to take the PCNSE? check out amazon.com/dp/1789956374

I have tech support files collected from both peers at the time of the issue.

Which debugging needs to be done?

L7 Applicator

You would need to lay both TSFs side by side and compare their logs to see if anything interesting shows up in the hours to minutes before the event.
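If it helps with the side-by-side comparison, here is a rough Python sketch that interleaves the entries of two log files by timestamp. It assumes every entry starts with a timestamp of the form `2019-01-29 11:51:33.447 +0400` and that both firewalls have their clocks in sync; the function names and the file names in the usage comment are placeholders of mine, not anything PAN-OS ships.

```python
import re
from datetime import datetime

# Assumed timestamp prefix, e.g. "2019-01-29 11:51:33.447 +0400".
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4}")
TS_FMT = "%Y-%m-%d %H:%M:%S.%f %z"


def parse_entries(text, label):
    """Split log text into (timestamp, label, entry) tuples.

    Lines without a leading timestamp (for example message dumps) are
    attached to the entry that precedes them.
    """
    entries = []
    for line in text.splitlines():
        match = TS_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(0), TS_FMT)
            entries.append([stamp, label, line])
        elif entries and line.strip():
            entries[-1][2] += "\n" + line
    return [tuple(entry) for entry in entries]


def merge_logs(text_a, text_b, label_a="peer-A", label_b="peer-B"):
    """Interleave the entries of two logs in chronological order."""
    merged = parse_entries(text_a, label_a) + parse_entries(text_b, label_b)
    merged.sort(key=lambda entry: entry[0])  # stable: ties keep peer-A first
    return merged


# Hypothetical usage (file names are placeholders):
# with open("peer_a.log") as fa, open("peer_b.log") as fb:
#     for stamp, label, entry in merge_logs(fa.read(), fb.read()):
#         print(f"[{label}] {entry}")
```

Reading the merged output around the time of the event makes it easier to see which peer noticed a problem first.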

This is what I can see at the time of the issue.

I can't make sense of the error codes and the rest of the message.

2019-01-29 11:51:33.447 +0400 debug: ha_sysd_haX_link_change(src/ha_sysd.c:2221): Seeing HA1-Backup peer link down, waiting hold

2019-01-29 11:51:33.447 +0400 Warning: ha_event_log(src/ha_event.c:47): HA1-Backup peer link down

2019-01-29 11:51:34.229 +0400 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (ha1-backup)

2019-01-29 11:51:35.229 +0400 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 3 ping timeouts out of 3 (ha1-backup)

2019-01-29 11:51:35.229 +0400 Error: ha_ping_peer_miss(src/ha_ping.c:763): We have missed 4 pings from the peer for group 1 (ha1-backup), restarting connection

2019-01-29 11:51:35.230 +0400 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1-Backup connection down

2019-01-29 11:51:35.230 +0400 debug: ha_peer_send_error(src/ha_peer.c:1517): Group 1 (HA1-BKUP): Sending errro message

Error Msg
---------
flags : 0x2 (close:)
err code : Heartbeat ping failure (16)
num tlvs : 1
Printing out 1 tlvs
TLV[1]: type 5 (ERR_STRING); len 23; value:
48656172 74626561 74207069 6e672066 61696c75 726500

2019-01-29 11:51:35.230 +0400 Error: ha_peer_disconnect(src/ha_peer.c:1652): Group 1 (HA1-BKUP): peer connection error msg set: Heartbeat ping failure

2019-01-29 11:51:35.230 +0400 debug: ha_ping_stop(src/ha_ping.c:407): Group 1: Stopping pings for ha1-backup

2019-01-29 11:51:35.230 +0400 debug: ha_ping_stop(src/ha_ping.c:407): Group 1: Stopping pings for ha1-backup

2019-01-29 11:51:35.230 +0400 debug: ha_ping_start(src/ha_ping.c:210): Group 1: Starting pings for ha1-backup

2019-01-29 11:51:35.230 +0400 debug: ha_peer_start(src/ha_peer.c:246): Group 1 (HA1-BKUP): waiting for ping response before starting connection

2019-01-29 11:51:39.195 +0400 debug: cfgagent_flags_callback(pan_cfgagent.c:226): ha_agent: cfg agent received flags from server

2019-01-29 11:51:39.195 +0400 debug: cfgagent_flags_callback(pan_cfgagent.c:230): new flags=0x4

2019-01-29 11:51:39.195 +0400 debug: cfgagent_config_callback(pan_cfgagent.c:253): ha_agent: cfg agent received configuration from server

2019-01-29 11:51:39.195 +0400 debug: cfgagent_config_callback(pan_cfgagent.c:275): config length=45594

L7 Applicator

Your issue starts at the first line:

2019-01-29 11:51:33.447 +0400 debug: ha_sysd_haX_link_change(src/ha_sysd.c:2221): Seeing HA1-Backup peer link down, waiting hold

'This' peer reports that the remote end is down, so you now need to check the corresponding timeframe on the remote end.
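As for the numbers in the error message: the TLV value in the ha_peer_send_error entry looks like plain hex-encoded ASCII with a trailing null byte, so it can be decoded by hand. A minimal Python sketch, assuming that encoding (the helper name is mine):

```python
def decode_tlv_value(hex_dump):
    """Decode a space-separated hex dump into text, dropping trailing NULs."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    return raw.rstrip(b"\x00").decode("ascii")


# Value of TLV[1] (ERR_STRING) from the log entry above.
err_string = "48656172 74626561 74207069 6e672066 61696c75 726500"
print(decode_tlv_value(err_string))  # prints "Heartbeat ping failure"
```

In other words, the hex dump carries the same "Heartbeat ping failure" text that is already shown as the err code, so it adds no extra detail about the cause.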

 

 


This is from the passive firewall:

 

2019-01-29 11:51:33.229 +0400 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1-backup)
2019-01-29 11:51:33.257 +0400 debug: ha_peer_recv_hello(src/ha_peer.c:5119): Group 1 (HA1-MAIN): Receiving hello message

Msg Hdr
-------
version : 1
groupID : 1
type : Hello (2)
token : 0xb32d
flags : 0x1 (req:)
length : 122

Hello Msg
---------
flags : 0x0 ()
state : Active (5)
priority : 100
cookie : 55493
num tlvs : 3
Printing out 3 tlvs
TLV[1]: type 62 (CONFIG_MD5_PRE); len 33; value:
65313361 38313135 34623561 32633139 64353536 33313363
32383039 37616236 00
TLV[2]: type 2 (CONFIG_MD5SUM); len 33; value:
35373338 35623065 36663138 38313537 39616161 66326530
65396232 33376561 00
TLV[3]: type 11 (SYSD_PEER_DOWN); len 4; value:
00000000

2019-01-29 11:51:33.257 +0400 debug: ha_state_cfg_md5_set(src/ha_state_cfg.c:465): We were in sync and now we are out of sync; autocommit no; ha-sync no; panorama no; cfg-sync-off no; pre-old-insync yes; pre-new-insync no
2019-01-29 11:51:33.257 +0400 debug: ha_sysd_dev_cfgsync_update(src/ha_sysd.c:1415): Set dev cfgsync to Committing
2019-01-29 11:51:33.257 +0400 debug: ha_state_cfg_from_insync_to_outsync(src/ha_state_cfg.c:673): peer group 1 has changed the md5, waiting for an update
2019-01-29 11:51:33.447 +0400 debug: ha_peer_recv_hello(src/ha_peer.c:5119): Group 1 (HA1-MAIN): Receiving hello message

 

The passive firewall is the one that reported the missed ping timeout; I don't see any ping attempts logged on the primary firewall. Once the passive firewall had missed 4 timeouts, the link went down and stayed down.

I then selected the same interface for HA1 backup again and committed, after which the link came up and has been stable since.

 

 

L7 Applicator

OK, so while one peer sees the interface go down, the other sees a missed ping but is ALSO in the process of committing a config:

2019-01-29 11:51:33.257 +0400 debug: ha_sysd_dev_cfgsync_update(src/ha_sysd.c:1415): Set dev cfgsync to Committing

Is it possible that a config change related to the HA1 backup interface was being pushed? There is always a small config sync gap between the active member committing and the passive unit receiving and committing.
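To put the ordering in perspective, here is a small Python sketch that sorts the key events from both excerpts and prints their offset from the first one. The timestamps and messages are copied from the logs in this thread; the labels "first excerpt" and "passive" are mine, and the comparison only holds if both units keep the same time.

```python
from datetime import datetime, timedelta

TS_FMT = "%Y-%m-%d %H:%M:%S.%f %z"

# (timestamp, source, event) copied from the excerpts in this thread.
EVENTS = [
    ("2019-01-29 11:51:33.447 +0400", "first excerpt", "HA1-Backup peer link down"),
    ("2019-01-29 11:51:35.230 +0400", "first excerpt", "HA1-Backup connection down"),
    ("2019-01-29 11:51:33.229 +0400", "passive", "Missed 1 ping timeouts out of 3 (ha1-backup)"),
    ("2019-01-29 11:51:33.257 +0400", "passive", "Set dev cfgsync to Committing"),
]


def timeline(events):
    """Return (offset_ms, source, event) sorted by time, relative to the first event."""
    parsed = sorted(
        (datetime.strptime(ts, TS_FMT), src, msg) for ts, src, msg in events
    )
    start = parsed[0][0]
    return [
        ((stamp - start) // timedelta(milliseconds=1), src, msg)
        for stamp, src, msg in parsed
    ]


for offset_ms, src, msg in timeline(EVENTS):
    print(f"+{offset_ms:>5} ms  {src:<13}  {msg}")
```

With these four entries the offsets come out as 0, 28, 218 and 2001 ms, so the first missed ping and the switch to Committing on the passive unit are logged within about 30 ms of each other, roughly 200 ms before the other peer logs the link as down.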


No, I have looked at the config changes and there was nothing like that.

The only change was made after the port issue had occurred.

L7 Applicator

Something was being committed, so that could have caused the interface to bounce, or resources may have been exhausted somehow.

 

 
