HA1 Backup link went down root cause analysis

Venkatesan_radhakrishnan · ‎02-03-2019

Hi Team,

I have an issue on a Palo Alto Networks PA-3260 firewall where the HA1 backup link went down. Even though there is no production impact, this happened without any cable change or other activity.

The link went down because of a heartbeat ping failure, but I want to know what caused the ping failure.

I am already running PAN-OS 8.1.4-h2, whose release notes say that an issue with unexpected behaviour on the HA1 backup port was fixed.

Below is the error message:

Error Msg
---------
flags : 0x2 (close:)
err code : Heartbeat ping failure (16)
num tlvs : 1
Printing out 1 tlvs
TLV[1]: type 5 (ERR_STRING); len 23; value:
48656172 74626561 74207069 6e672066 61696c75 726500
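
The hex words in the ERR_STRING TLV are plain ASCII, so they can be decoded to check what text the peer actually sent. A minimal Python sketch, assuming the value is a NUL-terminated ASCII string (the helper name decode_tlv_value is made up for illustration and is not a PAN-OS tool):

```python
def decode_tlv_value(hex_dump):
    """Decode a whitespace-separated hex dump into ASCII, dropping trailing NUL bytes."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    return raw.rstrip(b"\x00").decode("ascii")


# Value copied from the TLV[1] dump above (len 23, including the trailing NUL).
value = "48656172 74626561 74207069 6e672066 61696c75 726500"
print(decode_tlv_value(value))  # Heartbeat ping failure
```

The decoded string only repeats the error code text, so the TLV carries no extra detail about why the ping failed.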

Regards

Venky

reaper · ‎02-05-2019

Your issue starts at the first line:

2019-01-29 11:51:33.447 +0400 debug: ha_sysd_haX_link_change(src/ha_sysd.c:2221): Seeing HA1-Backup peer link down, waiting hold

'This' peer reports that the remote end is down.

So you now need to check the corresponding timeframe on the remote end.

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization

Venkatesan_radhakrishnan · ‎02-05-2019

This is from the passive firewall:

2019-01-29 11:51:33.229 +0400 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1-backup)
2019-01-29 11:51:33.257 +0400 debug: ha_peer_recv_hello(src/ha_peer.c:5119): Group 1 (HA1-MAIN): Receiving hello message

Msg Hdr
-------
version : 1
groupID : 1
type : Hello (2)
token : 0xb32d
flags : 0x1 (req:)
length : 122

Hello Msg
---------
flags : 0x0 ()
state : Active (5)
priority : 100
cookie : 55493
num tlvs : 3
Printing out 3 tlvs
TLV[1]: type 62 (CONFIG_MD5_PRE); len 33; value:
65313361 38313135 34623561 32633139 64353536 33313363
32383039 37616236 00
TLV[2]: type 2 (CONFIG_MD5SUM); len 33; value:
35373338 35623065 36663138 38313537 39616161 66326530
65396232 33376561 00
TLV[3]: type 11 (SYSD_PEER_DOWN); len 4; value:
00000000

2019-01-29 11:51:33.257 +0400 debug: ha_state_cfg_md5_set(src/ha_state_cfg.c:465): We were in sync and now we are out of sync; autocommit no; ha-sync no; panorama no; cfg-sync-off no; pre-old-insync yes; pre-new-insync no
2019-01-29 11:51:33.257 +0400 debug: ha_sysd_dev_cfgsync_update(src/ha_sysd.c:1415): Set dev cfgsync to Committing
2019-01-29 11:51:33.257 +0400 debug: ha_state_cfg_from_insync_to_outsync(src/ha_state_cfg.c:673): peer group 1 has changed the md5, waiting for an update
2019-01-29 11:51:33.447 +0400 debug: ha_peer_recv_hello(src/ha_peer.c:5119): Group 1 (HA1-MAIN): Receiving hello message

The passive firewall is the one that reported the missed ping timeout. I'm not seeing any ping attempt on the primary firewall. Once the passive firewall had missed 4 timeouts, the link went down and stayed down.

I then re-selected the same interface (HA1-B) and committed, after which the link came up and has been stable since.

reaper · ‎02-05-2019

OK, so while one peer sees the interface down, the other sees a missed ping but is ALSO in the process of committing a config:

2019-01-29 11:51:33.257 +0400 debug: ha_sysd_dev_cfgsync_update(src/ha_sysd.c:1415): Set dev cfgsync to Committing

Is it possible a config change related to the HA1-B interface was being pushed? There is always a small config sync gap between the active member committing and the passive unit receiving and committing.
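
To line up what each peer saw, the log lines quoted in this thread can be merged into one timeline. This is a rough Python sketch, not a PAN-OS tool: the function names and the peer labels ("peer-A", "passive") are assumptions for illustration, and the comparison only means something if both firewalls' clocks are in sync (for example via NTP).

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm +ZZZZ' timestamp of a log line."""
    date, time, offset = line.split()[:3]
    return datetime.strptime(f"{date} {time} {offset}", "%Y-%m-%d %H:%M:%S.%f %z")


def merge_timeline(logs):
    """logs: {peer_label: [log lines]} -> list of (timestamp, peer_label, line), oldest first."""
    events = [(parse_ts(line), peer, line) for peer, lines in logs.items() for line in lines]
    return sorted(events, key=lambda event: event[0])


# Lines copied from the posts above; which unit is "peer-A" is an assumption.
logs = {
    "peer-A": [
        "2019-01-29 11:51:33.447 +0400 debug: ha_sysd_haX_link_change(src/ha_sysd.c:2221): Seeing HA1-Backup peer link down, waiting hold",
    ],
    "passive": [
        "2019-01-29 11:51:33.229 +0400 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1-backup)",
        "2019-01-29 11:51:33.257 +0400 debug: ha_sysd_dev_cfgsync_update(src/ha_sysd.c:1415): Set dev cfgsync to Committing",
    ],
}

timeline = merge_timeline(logs)
first = timeline[0][0]
for ts, peer, line in timeline:
    offset_ms = (ts - first).total_seconds() * 1000
    message = line.split(": ", 1)[1]
    print(f"+{offset_ms:4.0f} ms  {peer:8s} {message}")
```

Sorted this way, the missed ping and the cfgsync change on the passive unit both come before the link-down message on the other peer, all within the same second.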

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization

Venkatesan_radhakrishnan · ‎02-05-2019

No, I checked the config changes and there were none like that.

The HA port configuration was only changed after the port issue occurred.

reaper · ‎02-05-2019

Something was being committed, so that could have caused the interface to bounce, or possibly resources were drained somehow.

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization

Venkatesan_radhakrishnan · ‎02-09-2019

Hi @reaper,

Yes, you are correct: a commit was done at 11:53 AM on the 29th, but it was related to mapping an address object to an address group and then referencing that group as the source address of a policy.

I'm curious how this could affect the HA port, which is configured elsewhere on my firewall.

Regards

Venky

Venkatesan_radhakrishnan · ‎02-10-2019

Hi @reaper,

Awaiting your reply.

Venkatesan_radhakrishnan · ‎02-10-2019

Hi @reaper,

I have noticed one more interesting thing: the HA1-B port was dropping packets on the primary firewall.

So my issue is on the active firewall, which dropped the packets, causing HA1-B to go down. Since the tech support file was generated after the issue was cleared, I'm not able to see the memory usage at the time of the issue.

Interface: ha1-b

-------------------------------------------------------------------------------
Logical interface counters:
-------------------------------------------------------------------------------
bytes received                    207647488
bytes transmitted                 214917298
packets received                  4254056
packets transmitted               4261401
receive errors                    0
transmit errors                   0
receive packets dropped           10769
transmit packets dropped          0
multicast packets received        0
-------------------------------------------------------------------------------

reaper · ‎02-11-2019

Don't focus too much on these numbers until you can directly correlate them to the actual event. Some packets may get dropped naturally, or the drops could be from a previous issue (possibly during initial config).

Since the connection was impacted during the commit, you'll need to look at both tech support files side by side, starting a few seconds before the commit starts, and see if there are unusual spikes in MP or DP CPU. Those drop counters should be checked for their delta during the commit: does the number increase gradually over time, or does it spike during the commit?

The content of the commit may not necessarily be related to the interface itself. It's possible that something during the commit chokes the interfaces for some reason.

Do you have a support case open already? If not, this may be a good time to open one.
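
The delta check on the drop counter can be sketched in a few lines of Python. This is only an illustration: the sample values and labels below are entirely hypothetical (they are not taken from this firewall), and the function names and the "5x the median" threshold are arbitrary assumptions rather than anything PAN-OS defines.

```python
from statistics import median


def counter_deltas(samples):
    """samples: [(label, cumulative_count), ...] in time order -> [(label, increase since previous sample)]."""
    return [(label, count - prev_count)
            for (_, prev_count), (label, count) in zip(samples, samples[1:])]


def find_spikes(deltas, factor=5.0):
    """Return the intervals whose increase exceeds `factor` times the median increase."""
    if not deltas:
        return []
    baseline = max(median(d for _, d in deltas), 1)  # avoid a zero baseline
    return [(label, d) for label, d in deltas if d > factor * baseline]


# HYPOTHETICAL readings of 'receive packets dropped', one per sampling interval.
samples = [("t0", 100), ("t1", 102), ("t2", 105), ("t3", 905), ("t4", 907)]

deltas = counter_deltas(samples)
print(deltas)               # [('t1', 2), ('t2', 3), ('t3', 800), ('t4', 2)]
print(find_spikes(deltas))  # [('t3', 800)]
```

A steady trickle of small deltas points to background drops, while one large jump that lines up with the commit window would support the idea that the commit choked the interface.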

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization

Venkatesan_radhakrishnan · ‎02-13-2019

Hi @reaper,

I have a case open with TAC and they are researching the root cause. I will keep you posted once I get an update.

Thank you so much for all your analysis; it has helped the investigation.

Regards

Venky

AnalysisMan · ‎06-06-2019

Any updates on this case? I've got the same issue on PA-3220 with PAN-OS 8.1.8.

I see precisely the same symptom as you: 'receive packets dropped' increased on the active firewall. I'm going to open a case with TAC.

--
"The Simplicity is the ultimate sophistication." - Leonardo da Vinci.

Venkatesan_radhakrishnan · ‎06-06-2019

Hi

This is a known bug that is going to be fixed in 8.1.9 or 9.0. You can wait for 8.1.9 or upgrade to 9.0.

AnalysisMan · ‎06-06-2019

Thanks for your reply!

Do you have the bug/issue ID? Or is this a non-public one?

--
"The Simplicity is the ultimate sophistication." - Leonardo da Vinci.

AnalysisMan · ‎06-06-2019

FYI -

Here is a workaround for anyone who wants to bring up the HA1 Backup link before upgrading PAN-OS.

Step 1. On the active firewall, change the Port type from ha1-b to management and commit (Device > High Availability > General > Control Link (HA1 Backup)).
Step 2. Revert to the previous configuration, with the Port type set to ha1-b and the original IP address, and commit.

This workaround should bring up the HA1 Backup link.

Hope this helps!

--
"The Simplicity is the ultimate sophistication." - Leonardo da Vinci.
