12-11-2014 08:13 AM - last edited on 04-14-2021 12:58 PM by jdelio
Hey all,
I want to start off by saying I love Palo Altos, they are AMAZING!
With that out of the way, I recently had a device RMA'd and the process went amazingly smoothly; I actually completed an HA peer PA-500 replacement in less time than it took my provider to get me a digital key for some software!
This was the guide I followed: https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClHFCA0
Everything seemed to go very well... however, I have recently been receiving "Split-Brain; HA1 Group down" alerts. They don't seem to be affecting production flow, thank goodness, but it is alarming nonetheless.
12/11 08:18:43 | vpn | informational | keymgr-ha-full-sync-done | KEYMGR sync all IPSec SA to HA peer exit.
12/11 08:18:43 | ras | informational | rasmgr-ha-full-sync-done | RASMGR daemon sync all user info to HA peer exit.
12/11 08:18:42 | ha | informational | session-synch | HA Group 1: Starting session synchronization with peer
12/11 08:18:42 | routing | informational | routed-fib-sync-peer-backup | FIB HA sync started when peer device becomes passive.
12/11 08:18:42 | ras | informational | rasmgr-ha-full-sync-start | RASMGR daemon sync all user info to HA peer started.
12/11 08:18:42 | vpn | informational | keymgr-ha-full-sync-start | KEYMGR sync all IPSec SA to HA peer started.
12/11 08:18:42 | satd | informational | satd-ha-full-sync-start | SATD daemon sync all gateway infos to HA peer started.
12/11 08:18:39 | ha | informational | session-synch | HA Group 1: Completed session synchronization with peer
12/11 08:18:31 | ha | informational | session-synch | HA Group 1: Starting session synchronization with peer
12/11 08:18:29 | ha | informational | peer-version-match | HA Group 1: Threat Content version now matches
12/11 08:18:29 | ha | informational | peer-version-match | HA Group 1: Global Protect Client Software version now matches
12/11 08:18:29 | ha | informational | peer-version-match | HA Group 1: Application Content version now matches
12/11 08:18:29 | ha | informational | peer-version-match | HA Group 1: Anti-Virus version now matches
12/11 08:18:29 | ha | high | peer-version-match | HA Group 1: Threat Content version does not match
12/11 08:18:29 | ha | high | peer-version-match | HA Group 1: Application Content version does not match
12/11 08:18:28 | ha | high | peer-version-match | HA Group 1: Global Protect Client Software version does not match
12/11 08:18:28 | ha | high | peer-version-match | HA Group 1: Anti-Virus version does not match
12/11 08:18:28 | ha | critical | peer-split-brain | HA Group 1: Staying in Active state after split-brain recovery (split-brain duration: 1s)
12/11 08:18:28 | ha | informational | connect-change | HA Group 1: HA1 connection up
12/11 08:18:28 | ha | informational | ha2-link-change | HA2 peer link up
12/11 08:18:28 | ha | informational | ha1-link-change | HA1 peer link up
12/11 08:18:28 | ha | informational | connect-change | HA Group 1: Control link running on HA1 connection
12/11 08:18:27 | ras | informational | rasmgr-ha-full-sync-abort | RASMGR daemon sync all user info to HA peer no longer needed.
12/11 08:18:27 | vpn | informational | keymgr-ha-full-sync-abort | KEYMGR sync all IPSec SA to HA peer no longer needed.
12/11 08:18:24 | satd | informational | satd-ha-full-sync-abort | SATD daemon sync all gateway infos to HA peer no longer needed.
12/11 08:18:23 | ha | critical | connect-change | HA Group 1: All HA1 connections down
12/11 08:18:23 | ha | critical | connect-change | HA Group 1: HA1 connection down
I did a bit of research on it, and it said this happens when one PA sees a failure that the other doesn't? But both are set up exactly the same. Thoughts?
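For anyone who wants to spot these sequences without scrolling the GUI, a small filter over an exported system log can pull out just the high/critical HA events. This is a rough sketch that assumes (my assumption, not an actual PAN-OS export format) entries exported one per line as `time | subsystem | severity | event | message`:

```python
# Toy filter for HA events exported from a system log.
# Assumed input format (one entry per line, pipe-separated):
#   "time | subsystem | severity | event | message"

def ha_alerts(lines, severities=("high", "critical")):
    """Return (time, event, message) for HA entries at the given severities."""
    alerts = []
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # skip anything that isn't a well-formed entry
        time, subsystem, severity, event, message = fields
        if subsystem == "ha" and severity in severities:
            alerts.append((time, event, message))
    return alerts

log = [
    "12/11 08:18:28 | ha | critical | peer-split-brain | HA Group 1: Staying in Active state after split-brain recovery (split-brain duration: 1s)",
    "12/11 08:18:28 | ha | informational | connect-change | HA Group 1: HA1 connection up",
    "12/11 08:18:23 | ha | critical | connect-change | HA Group 1: All HA1 connections down",
]
for time, event, message in ha_alerts(log):
    print(time, event)  # only the critical entries survive the filter
```

Filtering on the `ha` subsystem keeps the routine vpn/ras/satd sync chatter out of the way so the split-brain and connect-change events stand out.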
12-11-2014 08:23 AM
If you see Split-Brain messages it means that heartbeats between the peers are being missed. In the worst case both devices stay active and your network will go down.
Please check that the cables and any network devices between the HA1 ports are still working without problems, and configure a backup heartbeat over another interface.
12-11-2014 08:50 AM
Thanks for the suggestion, but I'm not sure what the issue could be there. Both peers sit one atop the other, with a 6-inch patch cable connecting them for the HA traffic, so there is nothing in terms of devices between the HA1 ports.
12-11-2014 11:33 AM - last edited on 04-14-2021 01:01 PM by jdelio
If the HA1 Link fails and there is no HA1 Backup nor Heartbeat Backup configured, configuration synchronization will fail and a split brain condition will be created. Split brain conditions occur when HA members can no longer communicate with each other to exchange HA monitoring information. Each HA member will assume the other member is in a non-functional state and take over as the Active (A/P) or Active-Primary (A/A). Split brain conditions can be prevented by configuring an HA1 Backup link and/or enabling Heartbeat Backup.
Source:
High Availability Synchronization
The reason for HA1 link failure is not limited to physical problems, it can also happen if the ha_agent process is busy and can't process HA1 functions. In that case it's useful to have the backup be Heartbeat Backup through the MGMT port, since the Heartbeat function sends out ICMP probes and these are processed by the system kernel, and not the ha_agent process. In that case, use of Heartbeat Backup would result in a more split-brain resilient configuration than using an HA1 Backup link through the MGMT port (which would depend on the health of the ha_agent process).
See How to Configure High Availability on PAN-OS for details on configuring HA1 Backup link and enabling Heartbeat Backup.
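To make the role of a backup heartbeat concrete, here is a toy model (my own sketch, not PAN-OS internals) of why a second monitoring channel prevents split brain: a passive member only promotes itself once every channel to the peer has gone quiet.

```python
# Toy model of split brain: each HA member declares the peer dead once
# every monitoring channel is down, and a passive member then takes over.

def peer_state(channels_up):
    """An HA member's view of its peer given which links still carry heartbeats."""
    return "reachable" if any(channels_up.values()) else "assumed-dead"

def role(my_role, channels_up):
    """A passive member promotes itself only when the peer looks dead."""
    if my_role == "passive" and peer_state(channels_up) == "assumed-dead":
        return "active"
    return my_role

# HA1 only, and it fails: the passive member also goes active -> split brain.
print(role("active", {"ha1": False}), role("passive", {"ha1": False}))

# With a heartbeat backup over mgmt still answering, the passive member
# stays passive even though HA1 is down.
print(role("active", {"ha1": False, "mgmt-heartbeat": True}),
      role("passive", {"ha1": False, "mgmt-heartbeat": True}))
```

The point of the explanation above is that the two channels should fail independently: a heartbeat answered by the kernel can survive a busy ha_agent process, which a second ha_agent-driven link cannot.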
12-11-2014 02:05 PM
Thanks Mivaldi!
I was talking to our security consultant about the issue, and he suggested exactly the same thing. I really appreciate the great explanation you provided; it really helped.
To help alleviate the issue I will set up backup heartbeats on the MGMT plane, as both of you have suggested. It's just odd because I never got these alerts before the PA RMA, and luckily, as I stated, the network didn't go down each time and recovered itself within seconds. If it doesn't take much to configure a backup heartbeat, that sounds like a great solution.
Again, thanks so much for the information!
08-25-2016 09:47 AM - edited 08-25-2016 09:58 AM
Hello Mivaldi,
Good explanation. We don't have an HA1 backup link configured, and our Heartbeat Backup (mgmt port) was not reachable between the two PAs; this was all due to a switch issue between the devices. The passive node went Active, so a peer split-brain occurred. Just to confirm: when the HA1 heartbeat recovers (say, the network connection is restored between the devices), will my "Active" node go back to "Passive"? If you can clear this up that would be great. Logs below:
2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1)
2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (ha1)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 3 ping timeouts out of 3 (ha1)
2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:763): We have missed 4 pings from the peer for group 1 (ha1), restarting connection
2016-08-25 00:59:41.709 +0100 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1 connection down
2016-08-25 00:59:41.709 +0100 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: All HA1 connections down
Thx,
Myky
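As an aside, the ha_ping lines above suggest a simple consecutive-miss counter with a threshold of 3. A minimal sketch of that bookkeeping (the threshold is read off these logs, not taken from documentation):

```python
# Sketch of the missed-ping counting implied by lines like
# "Missed 3 ping timeouts out of 3 (ha1)": consecutive misses accumulate,
# any answered ping resets the counter, and the link is declared down
# once the threshold is reached.

class HeartbeatMonitor:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.missed = 0
        self.link_up = True

    def tick(self, ping_answered):
        """Record one ping interval; return whether the link is considered up."""
        if ping_answered:
            self.missed = 0
            self.link_up = True
        else:
            self.missed += 1
            if self.missed >= self.threshold:
                self.link_up = False  # "All HA1 connections down"
        return self.link_up

mon = HeartbeatMonitor()
states = [mon.tick(ok) for ok in (True, False, False, False)]
print(states)  # the link only drops on the third consecutive miss
```

This is why a single dropped probe (the "Missed 1 ping timeouts out of 3" lines) is harmless on its own; it takes a sustained outage on every monitored path to trigger the failover.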
08-25-2016 02:02 PM - edited 08-25-2016 02:12 PM
For the record, @TranceforLife started this new topic for his question above:
If you have configured the Preemptive settings, then these will control which unit is active after communication is restored.
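For illustration, the preemption decision can be sketched as follows, assuming the PAN-OS convention that a lower numeric device priority is preferred; the member names and function here are hypothetical, so check your version's documentation before relying on the details:

```python
# Toy preemption decision, assuming lower priority value = preferred member.
# priorities: {member_name: device_priority}

def active_after_recovery(priorities, current_active, preemptive):
    """Which member ends up active once the peers can see each other again."""
    if not preemptive:
        return current_active  # nobody is forced out; whoever is active stays
    # With preemption, the preferred (lowest-priority-value) member takes over.
    return min(priorities, key=priorities.get)

# Matches the 100/50 setup discussed below: with preemption on, the
# priority-50 member becomes (or stays) active after the links recover.
prios = {"fw-A": 100, "fw-B": 50}
print(active_after_recovery(prios, current_active="fw-A", preemptive=True))
print(active_after_recovery(prios, current_active="fw-A", preemptive=False))
```

Without preemption, whichever unit happened to win the split-brain race simply keeps the active role after recovery.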
08-25-2016 02:36 PM - edited 08-25-2016 02:38 PM
Hello Jdelio,
Nice one. I could see it in the logs but just wanted to confirm. The logs are from the passive unit, but it became Active due to the missed heartbeats. Preemption settings are 100/50.