High Availability weirdness

L4 Transporter

Hi.

I have a pair of PA-2020s (I know, don't comment - they're getting upgraded soon, I hope!) which run as an active/passive HA pair.

This morning, I noticed the following entries in the logs on my primary (normally active) device:

2014/06/25 08:58:37 info ras        rasmgr- 0  RASMGR daemon sync all user info to HA peer exit.
2014/06/25 08:58:37 info vpn        keymgr- 0  KEYMGR sync all IPSec SA to HA peer exit.
2014/06/25 08:58:37 info ras        rasmgr- 0  RASMGR daemon sync all user info to HA peer started.
2014/06/25 08:58:37 info vpn        keymgr- 0  KEYMGR sync all IPSec SA to HA peer started.
2014/06/25 08:58:37 info satd       satd-ha 0  SATD daemon sync all gateway infos to HA peer started.
2014/06/25 08:58:37 info routing    routed- 0  FIB HA sync started when peer device becomes passive.
2014/06/25 08:53:37 info ha         connect 0  HA Group 1: HA1 connection up
2014/06/25 08:53:37 high ha         ha3-lin 0  HA3 peer link down
2014/06/25 08:53:37 high ha         ha2-lin 0  HA2-Backup peer link down
2014/06/25 08:53:37 info ha         ha2-lin 0  HA2 peer link up
2014/06/25 08:53:37 high ha         ha1-lin 0  HA1-Backup peer link down
2014/06/25 08:53:37 info ha         ha1-lin 0  HA1 peer link up
2014/06/25 08:53:37 info ha         connect 0  HA Group 1: Control link running on HA1 connection
2014/06/25 08:53:36 info satd       satd-ha 0  SATD daemon sync all gateway infos to HA peer no longer needed.
2014/06/25 08:53:36 info vpn        keymgr- 0  KEYMGR sync all IPSec SA to HA peer no longer needed.
2014/06/25 08:53:35 info ras        rasmgr- 0  RASMGR daemon sync all user info to HA peer no longer needed.
2014/06/25 08:53:34 critical ha         connect 0  HA Group 1: All HA1 connections down
2014/06/25 08:53:34 critical ha         connect 0  HA Group 1: HA1 connection down

(that's a backwards view - the oldest event is at the bottom). The standby (normally passive) device has similar entries.

Trouble is, I 100% *know* the links didn't go down - there is nobody available to physically unplug the cables (the devices are in a locked cabinet in another building which only two people are authorised to access, and both of them are sitting in the office right now), and both devices show uptimes exceeding 70 days.

Can anyone shed some light on what may have caused this? From memory, this isn't the first time I've noticed this, but on the last occasion I wrote it off as some environmental thing - maybe I should have chased it earlier.

Any input appreciated.

Thanks.

6 REPLIES

L4 Transporter

We had a similar issue a while ago with an older PAN-OS version - I think it was 5.0.7. Which PAN-OS version are you running?

The devices in question are on 5.0.12. I don't want to update them to the version 6 software because they're only PA-2020s, and I have concerns about their load if I do.

L7 Applicator

While your cables may not have been pulled, the communication link between the nodes clearly failed. That probably means there is an interface or system issue on one of the nodes in the cluster - I think you might have a hardware problem.

I would pull a full tech support file from both and open a case for investigation of the root cause.
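
For reference, exporting the tech support file from the CLI would be something along these lines (syntax from memory, and the destination below is only an example - the web UI equivalent is Device > Support > Generate Tech Support File):

  > scp export tech-support to admin@203.0.113.10:/var/tmp/

Do that on both the active and the passive unit so TAC can compare the two.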

Steve Puluka BSEET - IP Architect - DQE Communications (Metro Ethernet/ISP)
ACE PanOS 6; ACE PanOS 7; ASE 3.0; PSE 7.0 Foundations & Associate in Platform; Cyber Security; Data Center

L5 Sessionator

#1

How is the dataplane load?

You can check it with 'show running resource-monitor'.

If the average looks high, or the maximum hits 100%, it might stop traffic through the HA1 port.

To avoid this situation, you can enable heartbeat backup.
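
For example, something like this (commands from memory - double-check the heartbeat-backup config path on your version; in the web UI it is under Device > High Availability > General > Election Settings):

  > show running resource-monitor minute
  > configure
  # set deviceconfig high-availability group election-option heartbeat-backup yes
  # commit

Heartbeat backup sends hello/heartbeat messages over the management interface as a backup path, which helps avoid a split-brain if HA1 itself flaps.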

#2

You can check whether some process was restarted.

'show system files' will show you any core dumps that have been generated.

If you find any, you can export them and open a case, and TAC will investigate.
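
In practice that would look something like this (export syntax from memory, so check it with '?' on the CLI first; the filename and destination are placeholders):

  > show system files
  > scp export core-file management-plane from <core-filename> to admin@203.0.113.10:/var/tmp/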

#1 The dataplane load rarely, if ever, gets above about 17-20%. The *management* plane routinely runs at 60-70% due to a known issue on the PA-2020s (one reason I'm looking to upgrade them).

#2 No recent core dumps or crashinfo files exist.

Thanks for the suggestions, though. I'll log a case and see if PA TAC can find an answer.

I will log a case and see what happens.

Thanks for your input
