What is Peer-Split-Brain?


L3 Networker

Hey all,

 

I want to start off by saying I love Palo Altos; they are AMAZING!

 

With that out of the way, I recently had a device RMA'd and the process went amazingly smoothly. I was actually able to complete the replacement of an HA peer PA-500 in less time than it took my provider to get me a digital key for some software!

This was the guide I followed: https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClHFCA0

Everything seemed to go very well. However, I have recently been receiving Split-Brain and HA1 Group down alerts. They don't seem to be affecting production traffic, thank goodness, but they are alarming nonetheless.

 

12/11 08:18:43  vpn      informational  keymgr-ha-full-sync-done     KEYMGR sync all IPSec SA to HA peer exit.
12/11 08:18:43  ras      informational  rasmgr-ha-full-sync-done     RASMGR daemon sync all user info to HA peer exit.
12/11 08:18:42  ha       informational  session-synch                HA Group 1: Starting session synchronization with peer
12/11 08:18:42  routing  informational  routed-fib-sync-peer-backup  FIB HA sync started when peer device becomes passive.
12/11 08:18:42  ras      informational  rasmgr-ha-full-sync-start    RASMGR daemon sync all user info to HA peer started.
12/11 08:18:42  vpn      informational  keymgr-ha-full-sync-start    KEYMGR sync all IPSec SA to HA peer started.
12/11 08:18:42  satd     informational  satd-ha-full-sync-start      SATD daemon sync all gateway infos to HA peer started.
12/11 08:18:39  ha       informational  session-synch                HA Group 1: Completed session synchronization with peer
12/11 08:18:31  ha       informational  session-synch                HA Group 1: Starting session synchronization with peer
12/11 08:18:29  ha       informational  peer-version-match           HA Group 1: Threat Content version now matches
12/11 08:18:29  ha       informational  peer-version-match           HA Group 1: Global Protect Client Software version now matches
12/11 08:18:29  ha       informational  peer-version-match           HA Group 1: Application Content version now matches
12/11 08:18:29  ha       informational  peer-version-match           HA Group 1: Anti-Virus version now matches
12/11 08:18:29  ha       high           peer-version-match           HA Group 1: Threat Content version does not match
12/11 08:18:29  ha       high           peer-version-match           HA Group 1: Application Content version does not match
12/11 08:18:28  ha       high           peer-version-match           HA Group 1: Global Protect Client Software version does not match
12/11 08:18:28  ha       high           peer-version-match           HA Group 1: Anti-Virus version does not match
12/11 08:18:28  ha       critical       peer-split-brain             HA Group 1: Staying in Active state after split-brain recovery (split-brain duration: 1s)
12/11 08:18:28  ha       informational  connect-change               HA Group 1: HA1 connection up
12/11 08:18:28  ha       informational  ha2-link-change              HA2 peer link up
12/11 08:18:28  ha       informational  ha1-link-change              HA1 peer link up
12/11 08:18:28  ha       informational  connect-change               HA Group 1: Control link running on HA1 connection
12/11 08:18:27  ras      informational  rasmgr-ha-full-sync-abort    RASMGR daemon sync all user info to HA peer no longer needed.
12/11 08:18:27  vpn      informational  keymgr-ha-full-sync-abort    KEYMGR sync all IPSec SA to HA peer no longer needed.
12/11 0:18:24   satd     informational  satd-ha-full-sync-abort      SATD daemon sync all gateway infos to HA peer no longer needed.
12/11 08:18:23  ha       critical       connect-change               HA Group 1: All HA1 connections down
12/11 08:18:23  ha       critical       connect-change               HA Group 1: HA1 connection down

 

I did a bit of research on it, and it said this happens when one PA sees a firewall that the other doesn't? But both are set up exactly the same. Thoughts?

Accepted Solutions

L7 Applicator

If the HA1 link fails and neither an HA1 Backup nor a Heartbeat Backup is configured, configuration synchronization will fail and a split-brain condition will be created. Split-brain conditions occur when HA members can no longer communicate with each other to exchange HA monitoring information. Each HA member will assume the other member is in a non-functional state and take over as the Active (A/P) or Active-Primary (A/A). Split-brain conditions can be prevented by configuring an HA1 Backup link and/or enabling Heartbeat Backup.

 

Source:

High Availability Synchronization

 

The reason for an HA1 link failure is not limited to physical problems; it can also happen if the ha_agent process is busy and can't process HA1 functions. In that case it's useful to have the backup be Heartbeat Backup through the MGMT port, since the Heartbeat function sends out ICMP probes, and these are processed by the system kernel rather than by the ha_agent process. Heartbeat Backup therefore gives a more split-brain-resilient configuration than an HA1 Backup link through the MGMT port (which would depend on the health of the ha_agent process).

 

See How to Configure High Availability on PAN-OS for details on configuring HA1 Backup link and enabling Heartbeat Backup.
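
The reasoning above can be sketched as a tiny model. This is only an illustration of the logic described in the answer, not how PAN-OS implements it: the `PeerLinks` class and both function names are made up for this example, and the model reduces every path to a simple up/down flag.

```python
from dataclasses import dataclass


@dataclass
class PeerLinks:
    """Reachability of the HA peer as seen from one firewall (hypothetical model)."""
    ha1_up: bool
    ha1_backup_up: bool = False        # stays False when no HA1 Backup link is configured
    heartbeat_backup_up: bool = False  # stays False when Heartbeat Backup is not enabled


def peer_considered_down(links: PeerLinks) -> bool:
    """A member assumes its peer has failed only when every control path is silent."""
    return not (links.ha1_up or links.ha1_backup_up or links.heartbeat_backup_up)


def split_brain(member_a: PeerLinks, member_b: PeerLinks) -> bool:
    """Both members are running, yet each believes the other is gone, so both go active."""
    return peer_considered_down(member_a) and peer_considered_down(member_b)


# HA1 fails with no backup path configured: both members take over.
no_backup = PeerLinks(ha1_up=False)
print(split_brain(no_backup, no_backup))  # True

# Same HA1 failure, but the heartbeat backup over MGMT still answers.
with_heartbeat = PeerLinks(ha1_up=False, heartbeat_backup_up=True)
print(split_brain(with_heartbeat, with_heartbeat))  # False
```

The second case is the one the answer recommends: a single surviving path is enough for each member to know that its peer is still alive.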


7 REPLIES

L3 Networker

If you see Split-Brain messages, it means that heartbeats between the two peers are being missed. In the worst case both devices stay active and your network will go down.

Please check that the cables and any network devices between the HA1 ports are still working without problems, and configure a backup heartbeat over another interface.

Thanks for the suggestion! I'm not sure what the issue could be there, though: both peers sit one atop the other, with a 6-inch patch cable connecting them for the HA traffic, so there are no devices between the HA1 ports.

Thanks Mivaldi!

I was talking to our security consultant about the issue, and he suggested the exact same thing. I really appreciate the great explanation you provided; it really helped.

To help alleviate the issue, I will set up backup heartbeats on the MGMT port, as both of you have suggested. It's just odd, because I never got these alerts before the RMA of the one PA. Luckily, as I stated, each time the network didn't go down and it recovered within seconds. But if it doesn't take much to configure a backup heartbeat, that sounds like a great solution.

Again, thanks so much for that information!

Hello Mivaldi,

 

Good explanation. We don't have an HA1 backup link configured, and our Heartbeat Backup (MGMT port) was not reachable between the two PAs. This was all due to a switch issue between the devices. The passive node went to Active, so a peer split-brain occurred. Just to confirm: when the HA1 heartbeat recovers (say, the network connection between the devices is restored), will my "Active" node go back to "Passive"? If you could clarify this, that would be great. Logs below:

 

2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1)
2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (ha1)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 3 ping timeouts out of 3 (ha1)
2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:763): We have missed 4 pings from the peer for group 1 (ha1), restarting connection
2016-08-25 00:59:41.709 +0100 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1 connection down
2016-08-25 00:59:41.709 +0100 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: All HA1 connections down
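
For anyone wanting to summarize a longer capture of these messages, here is a minimal Python sketch. It assumes the log lines look exactly like the excerpt above; the function name `max_misses_per_link` and the regular expression are my own and not part of any Palo Alto tooling.

```python
import re

# Matches the "Missed N ping timeouts out of M (link)" part of a line.
MISS_RE = re.compile(r"Missed (\d+) ping timeouts out of (\d+) \((\w+)\)")


def max_misses_per_link(lines):
    """Return {link: (highest missed count, limit)} for each link seen."""
    worst = {}
    for line in lines:
        m = MISS_RE.search(line)
        if not m:
            continue  # not a ping-timeout line
        missed, limit, link = int(m.group(1)), int(m.group(2)), m.group(3)
        if link not in worst or missed > worst[link][0]:
            worst[link] = (missed, limit)
    return worst


# Sample lines copied from the excerpt above.
sample = [
    "2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1)",
    "2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (mgmt)",
    "2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (ha1)",
    "2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (mgmt)",
    "2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 3 ping timeouts out of 3 (ha1)",
]

summary = max_misses_per_link(sample)
for link, (missed, limit) in sorted(summary.items()):
    status = "threshold reached" if missed >= limit else "below threshold"
    print(f"{link}: {missed}/{limit} ({status})")
```

On the excerpt as pasted, this reports ha1 at 3/3 and mgmt at 2/3.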

 

Thx,

Myky

For the record, @TranceforLife started this new topic for his question above:

HA peer split-brain recovery?

 

If you have configured the Preemptive settings, they will control which unit is active after communication is restored.

LIVEcommunity team member
Stay Secure,
Joe
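
As a rough illustration of the preemption rule: in PAN-OS the lower numerical device priority value is the higher priority, and preemption has to be enabled on both firewalls to take effect. The sketch below only encodes that rule. The `Member` class, the names `fw-a`/`fw-b`, and the mapping of the values 100 and 50 to particular units are assumptions made for the example, and ties or other HA timers are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    name: str         # hypothetical label for the firewall
    priority: int     # device priority; a lower value means a higher priority
    preemptive: bool  # whether preemption is enabled on this unit


def preferred_active(a: Member, b: Member) -> Optional[str]:
    """Name of the unit that priority-based preemption favours, or None."""
    if not (a.preemptive and b.preemptive):
        return None  # preemption off on either unit: priority does not force a role change
    if a.priority == b.priority:
        return None  # tie-breaking is not modelled in this sketch
    return a.name if a.priority < b.priority else b.name


print(preferred_active(Member("fw-a", 50, True), Member("fw-b", 100, True)))   # fw-a
print(preferred_active(Member("fw-a", 50, True), Member("fw-b", 100, False)))  # None
```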

Hello Jdelio,

 

Nice one. I could see it in the logs but just wanted to confirm. The logs below are from the passive unit, which became Active because of the missed heartbeats. Preemption settings 100/50:

 

2016/08/25 00:59:41 critical ha             connect 0  HA Group 1: HA1 connection down
2016/08/25 00:59:41 critical ha             connect 0  HA Group 1: All HA1 connections down
2016/08/25 00:59:41 critical ha             connect 0  HA Group 1: HA heartbeat backup is being used to avoid split-brain; the HA functionality is in a degraded state pending the recovery of HA1
2016/08/25 00:59:41 critical ha             connect 0  HA Group 1: HA heartbeat backup connection down
2016/08/25 00:59:42 high     ha             peer-ve 0  HA Group 1: Anti-Virus version does not match
2016/08/25 00:59:42 high     ha             peer-ve 0  HA Group 1: Application Content version does not match
2016/08/25 00:59:42 high     ha             peer-ve 0  HA Group 1: Threat Content version does not match
2016/08/25 00:59:42 info     ha             peer-ve 0  HA Group 1: Anti-Virus version now matches
2016/08/25 00:59:42 info     ha             peer-ve 0  HA Group 1: Application Content version now matches
2016/08/25 00:59:42 info     ha             peer-ve 0  HA Group 1: Threat Content version now matches
2016/08/25 00:59:42 info     ha             ha1-lin 0  HA1 peer link up
2016/08/25 00:59:42 info     ha             ha2-lin 0  HA2-Backup peer link up
2016/08/25 00:59:42 info     ha             ha2-lin 0  HA2 peer link up
2016/08/25 00:59:43 high     ha             state-c 0  HA Group 1: Moved from state Passive to state Active
2016/08/25 00:59:44 info     ha             connect 0  HA Group 1: Control link running on HA1 connection
2016/08/25 00:59:44 info     ha             connect 0  HA Group 1: HA1 connection up
2016/08/25 00:59:44 critical ha             split-b 0  HA Group 1: Going to Passive state due to split-brain recovery (split-brain duration: 1s)
2016/08/25 00:59:44 info     ha             connect 0  HA Group 1: HA heartbeat backup connection up
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/1: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/2: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/3: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/5: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/6: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/7: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/9: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/10: Down 1Gb/s-full duplex
2016/08/25 00:59:44 info     routing        routed- 0  FIB HA sync started when local device becomes master.
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/13: Up   auto duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/14: Up   auto duplex
2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/19: Up   1Gb/s-full duplex
2016/08/25 00:59:45 info     ha             session 0  HA Group 1: Starting session synchronization with peer
2016/08/25 00:59:45 info     routing        routed- 0  FIB HA sync started when local device becomes master.
2016/08/25 00:59:47 info     port    ethern link-ch 0  ethernet1/1: Up   1Gb/s-full duplex
2016/08/25 00:59:48 info     port    ethern link-ch 0  ethernet1/3: Up   1Gb/s-full duplex
2016/08/25 00:59:48 info     port    ethern link-ch 0  ethernet1/9: Up   1Gb/s-full duplex
2016/08/25 00:59:48 info     port    ethern link-ch 0  ethernet1/2: Up   1Gb/s-full duplex
2016/08/25 00:59:48 info     ha             session 0  HA Group 1: Completed session synchronization with peer
2016/08/25 00:59:48 info     port    ethern link-ch 0  ethernet1/6: Up   1Gb/s-full duplex
2016/08/25 00:59:49 info     port    ethern link-ch 0  ethernet1/10: Up   1Gb/s-full duplex
2016/08/25 00:59:49 info     port    ethern link-ch 0  ethernet1/7: Up   1Gb/s-full duplex
2016/08/25 00:59:50 info     port    ethern link-ch 0  ethernet1/5: Up   1Gb/s-full duplex
2016/08/25 00:59:51 info     ha             state-c 0  HA Group 1: Moved from state Initial to state Passive
2016/08/25 00:59:51 info     ha             session 0  HA Group 1: Starting session synchronization with peer
2016/08/25 00:59:54 info     ha             session 0  HA Group 1: Completed session synchronization with peer
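
To pull just the role changes and split-brain messages out of a log like the one above, a small Python sketch follows. It assumes the column layout shown in this post (date, time, severity, type, a truncated event column, a literal `0`, then the description); the regular expression and the name `ha_timeline` are mine, so adjust them if your output differs.

```python
import re

# date time, severity, type, (truncated columns), literal "0", description
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+)\s+.*?\s0\s+(.*)$"
)


def ha_timeline(lines):
    """Return (timestamp, severity, message) for HA state changes and split-brain events."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # line does not follow the assumed layout
        timestamp, severity, log_type, message = m.groups()
        if log_type != "ha":
            continue  # skip port, routing, and other log types
        if "Moved from state" in message or "split-brain" in message:
            events.append((timestamp, severity, message))
    return events


# Sample lines copied from the log above.
sample = [
    "2016/08/25 00:59:41 critical ha             connect 0  HA Group 1: HA1 connection down",
    "2016/08/25 00:59:43 high     ha             state-c 0  HA Group 1: Moved from state Passive to state Active",
    "2016/08/25 00:59:44 critical ha             split-b 0  HA Group 1: Going to Passive state due to split-brain recovery (split-brain duration: 1s)",
    "2016/08/25 00:59:44 info     port    ethern link-ch 0  ethernet1/1: Down 1Gb/s-full duplex",
    "2016/08/25 00:59:51 info     ha             state-c 0  HA Group 1: Moved from state Initial to state Passive",
]

events = ha_timeline(sample)
for timestamp, severity, message in events:
    print(timestamp, severity, message)
```

For the sample lines this leaves three entries: Passive to Active at 00:59:43, the split-brain recovery at 00:59:44, and Initial to Passive at 00:59:51.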
 
Thx,
Myky