High Availability weirdness

L4 Transporter

Hi.

I have a pair of PA-2020s (I know, don't comment - they're getting upgraded soon, I hope!) which run as an active/passive HA pair.

This morning, I noticed the following entries in the logs on my primary (normally active) device:

2014/06/25 08:58:37 info ras        rasmgr- 0  RASMGR daemon sync all user info to HA peer exit.
2014/06/25 08:58:37 info vpn        keymgr- 0  KEYMGR sync all IPSec SA to HA peer exit.
2014/06/25 08:58:37 info ras        rasmgr- 0  RASMGR daemon sync all user info to HA peer started.
2014/06/25 08:58:37 info vpn        keymgr- 0  KEYMGR sync all IPSec SA to HA peer started.
2014/06/25 08:58:37 info satd       satd-ha 0  SATD daemon sync all gateway infos to HA peer started.
2014/06/25 08:58:37 info routing    routed- 0  FIB HA sync started when peer device becomes passive.
2014/06/25 08:53:37 info ha         connect 0  HA Group 1: HA1 connection up
2014/06/25 08:53:37 high ha         ha3-lin 0  HA3 peer link down
2014/06/25 08:53:37 high ha         ha2-lin 0  HA2-Backup peer link down
2014/06/25 08:53:37 info ha         ha2-lin 0  HA2 peer link up
2014/06/25 08:53:37 high ha         ha1-lin 0  HA1-Backup peer link down
2014/06/25 08:53:37 info ha         ha1-lin 0  HA1 peer link up
2014/06/25 08:53:37 info ha         connect 0  HA Group 1: Control link running on HA1 connection
2014/06/25 08:53:36 info satd       satd-ha 0  SATD daemon sync all gateway infos to HA peer no longer needed.
2014/06/25 08:53:36 info vpn        keymgr- 0  KEYMGR sync all IPSec SA to HA peer no longer needed.
2014/06/25 08:53:35 info ras        rasmgr- 0  RASMGR daemon sync all user info to HA peer no longer needed.
2014/06/25 08:53:34 critical ha         connect 0  HA Group 1: All HA1 connections down
2014/06/25 08:53:34 critical ha         connect 0  HA Group 1: HA1 connection down

(that's a backwards view - the oldest event is at the bottom). The standby (normally passive) device has similar entries.

Trouble is, I 100% *know* the links didn't go down - there is nobody available to physically unplug the cables (the devices are in a locked cabinet in another building which only two people are authorised to access, and both of them are sitting in the office right now), and both devices show uptimes exceeding 70 days.

Can anyone shed some light on what may have caused this? From memory, this isn't the first time I've noticed this, but on the last occasion I wrote it off as some environmental thing - maybe I should have chased it earlier.

Any input appreciated.

Thanks.

6 REPLIES

L4 Transporter

We had a similar issue a while ago with an older PAN-OS version - I think it was 5.0.7. Which PAN-OS version are you running?

The devices in question are on 5.0.12. I don't want to update them to the version 6 software because they're only PA-2020s, and I have concerns about their load if I do.

L7 Applicator

While your cables may not have been pulled, the communication link between the nodes clearly failed. That probably means there is an interface or system issue on one of the nodes in the cluster - I think you might have a hardware problem.

I would pull a full tech support file from both and open a case for investigation of the root cause.
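
For reference, exporting the tech support file from the CLI would be something along these lines (syntax from memory, and the destination below is only an example - the web UI equivalent is Device > Support > Generate Tech Support File):

  > scp export tech-support to admin@203.0.113.10:/var/tmp/

Do that on both the active and the passive unit so TAC can compare the two.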

Steve Puluka BSEET - IP Architect - DQE Communications (Metro Ethernet/ISP)
ACE PanOS 6; ACE PanOS 7; ASE 3.0; PSE 7.0 Foundations & Associate in Platform; Cyber Security; Data Center

L5 Sessionator

#1

How is the dataplane load?

You can check it with 'show running resource-monitor'.

If the average looks high, or the maximum hits 100%, it might stop traffic through the HA1 port.

To avoid this situation, you can enable heartbeat backup.
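
For example, something like this (commands from memory - double-check the heartbeat-backup config path on your version; in the web UI it is under Device > High Availability > General > Election Settings):

  > show running resource-monitor minute
  > configure
  # set deviceconfig high-availability group election-option heartbeat-backup yes
  # commit

Heartbeat backup sends hello/heartbeat messages over the management interface as a backup path, which helps avoid a split-brain if HA1 itself flaps.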

#2

You can check whether some process was restarted.

'show system files' will show you any core dumps that have been generated.

If you find any, you can export them and open a case, and TAC will investigate.
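
In practice that would look something like this (export syntax from memory, so check it with '?' on the CLI first; the filename and destination are placeholders):

  > show system files
  > scp export core-file management-plane from <core-filename> to admin@203.0.113.10:/var/tmp/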

#1 The dataplane load rarely, if ever, gets above about 17-20%. The *management* plane routinely runs at 60-70% due to a known issue on the PA-2020s (one reason I'm looking to upgrade them).

#2 No recent core dumps or crashinfo files exist.

Thanks for the suggestions, though. I'll log a case and see if PA TAC can find an answer.

I will log a case and see what happens.

Thanks for your input
