What is Peer-Split-Brain?

Zewwy · ‎12-11-2014

Hey all,

I want to start off saying I love Palo Alto's, they are AMAZING!

With that out of the way, I recently had a device RMA'd and the process went amazingly smoothly. I was actually able to complete the replacement of an HA peer PA-500 in less time than it took my provider to get me a digital key for some software!

This was the guide I followed: https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClHFCA0

Everything seemed to go very well. However, I have recently been receiving Split-Brain / HA1 Group down alerts. They don't seem to be affecting production traffic, thank goodness, but they are alarming nonetheless.

12/11 08:18:43
vpn
informational
keymgr-ha-full-sync-done
KEYMGR sync all IPSec SA to HA peer exit.

12/11 08:18:43
ras
informational
rasmgr-ha-full-sync-done
RASMGR daemon sync all user info to HA peer exit.

12/11 08:18:42
ha
informational
session-synch
HA Group 1: Starting session synchronization with peer

12/11 08:18:42
routing
informational
routed-fib-sync-peer-backup
FIB HA sync started when peer device becomes passive.

12/11 08:18:42
ras
informational
rasmgr-ha-full-sync-start
RASMGR daemon sync all user info to HA peer started.

12/11 08:18:42
vpn
informational
keymgr-ha-full-sync-start
KEYMGR sync all IPSec SA to HA peer started.

12/11 08:18:42
satd
informational
satd-ha-full-sync-start
SATD daemon sync all gateway infos to HA peer started.

12/11 08:18:39
ha
informational
session-synch
HA Group 1: Completed session synchronization with peer

12/11 08:18:31
ha
informational
session-synch
HA Group 1: Starting session synchronization with peer

12/11 08:18:29
ha
informational
peer-version-match
HA Group 1: Threat Content version now matches

12/11 08:18:29
ha
informational
peer-version-match
HA Group 1: Global Protect Client Software version now matches

12/11 08:18:29
ha
informational
peer-version-match
HA Group 1: Application Content version now matches

12/11 08:18:29
ha
informational
peer-version-match
HA Group 1: Anti-Virus version now matches

12/11 08:18:29
ha
high
peer-version-match
HA Group 1: Threat Content version does not match

12/11 08:18:29
ha
high
peer-version-match
HA Group 1: Application Content version does not match

12/11 08:18:28
ha
high
peer-version-match
HA Group 1: Global Protect Client Software version does not match

12/11 08:18:28
ha
high
peer-version-match
HA Group 1: Anti-Virus version does not match

12/11 08:18:28
ha
critical
peer-split-brain
HA Group 1: Staying in Active state after split-brain recovery (split-brain duration: 1s)

12/11 08:18:28
ha
informational
connect-change
HA Group 1: HA1 connection up

12/11 08:18:28
ha
informational
ha2-link-change
HA2 peer link up

12/11 08:18:28
ha
informational
ha1-link-change
HA1 peer link up

12/11 08:18:28
ha
informational
connect-change
HA Group 1: Control link running on HA1 connection

12/11 08:18:27
ras
informational
rasmgr-ha-full-sync-abort
RASMGR daemon sync all user info to HA peer no longer needed.

12/11 08:18:27
vpn
informational
keymgr-ha-full-sync-abort
KEYMGR sync all IPSec SA to HA peer no longer needed.

12/11 08:18:24
satd
informational
satd-ha-full-sync-abort
SATD daemon sync all gateway infos to HA peer no longer needed.

12/11 08:18:23
ha
critical
connect-change
HA Group 1: All HA1 connections down

12/11 08:18:23
ha
critical
connect-change
HA Group 1: HA1 connection down
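
When a log is pasted in this five-line form (timestamp, subsystem, severity, event ID, description), a small script can pull out just the critical events in the order they happened. The following is only a sketch: the `Event` type and both function names are made up for illustration, and it assumes every record is exactly five non-blank lines, as in the paste above.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One record from the pasted log (field names are illustrative)."""
    timestamp: str
    subsystem: str
    severity: str
    event_id: str
    description: str


def parse_events(text: str) -> list[Event]:
    # Blank lines are ignored, so uneven spacing between records is harmless.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 5 != 0:
        raise ValueError("expected five non-blank lines per record")
    return [Event(*lines[i:i + 5]) for i in range(0, len(lines), 5)]


def critical_events_oldest_first(text: str) -> list[Event]:
    # The paste is newest-first, so reverse it to read the failure in order.
    return [e for e in reversed(parse_events(text)) if e.severity == "critical"]


SAMPLE = """
12/11 08:18:28
ha
critical
peer-split-brain
HA Group 1: Staying in Active state after split-brain recovery (split-brain duration: 1s)

12/11 08:18:28
ha
informational
connect-change
HA Group 1: HA1 connection up

12/11 08:18:23
ha
critical
connect-change
HA Group 1: All HA1 connections down
"""

for event in critical_events_oldest_first(SAMPLE):
    print(event.timestamp, event.event_id, "-", event.description)
```

Read in that order, the sample shows the HA1 connection dropping first and the split-brain recovery message following a few seconds later.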

I did a bit of research on it, and what I found said it happens when one PA sees a firewall that the other doesn't? But both are set up exactly the same. Thoughts?

mivaldi · ‎12-11-2014

If the HA1 link fails and neither an HA1 Backup nor a Heartbeat Backup is configured, configuration synchronization will fail and a split-brain condition will be created. Split-brain conditions occur when HA members can no longer communicate with each other to exchange HA monitoring information. Each HA member will assume the other member is in a non-functional state and take over as the Active (A/P) or Active-Primary (A/A). Split-brain conditions can be prevented by configuring an HA1 Backup link and/or enabling Heartbeat Backup.

Source: High Availability Synchronization

The reason for HA1 link failure is not limited to physical problems; it can also happen if the ha_agent process is busy and can't process HA1 functions. In that case it's useful to have the backup be Heartbeat Backup through the MGMT port, since the Heartbeat function sends out ICMP probes that are processed by the system kernel rather than by the ha_agent process. Heartbeat Backup would therefore result in a more split-brain-resilient configuration than an HA1 Backup link through the MGMT port (which would depend on the health of the ha_agent process).

See How to Configure High Availability on PAN-OS for details on configuring HA1 Backup link and enabling Heartbeat Backup.
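
To make the reasoning above concrete, here is a toy model of the decision each peer makes. It is only an illustration of the logic described in this post, not how PAN-OS is implemented; `ControlPaths`, `peer_visible`, `next_state` and `is_split_brain` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ControlPaths:
    """Whether the peer is reachable over each control path (toy model)."""
    ha1_up: bool
    ha1_backup_up: bool = False        # False also covers "not configured"
    heartbeat_backup_up: bool = False  # False also covers "not configured"


def peer_visible(paths: ControlPaths) -> bool:
    # The peer counts as alive if at least one path still reaches it.
    return paths.ha1_up or paths.ha1_backup_up or paths.heartbeat_backup_up


def next_state(current: str, paths: ControlPaths) -> str:
    # A passive unit that loses sight of its peer assumes the peer has
    # failed and takes over; an active unit simply stays active.
    if current == "passive" and not peer_visible(paths):
        return "active"
    return current


def is_split_brain(state_a: str, state_b: str) -> bool:
    return state_a == "active" and state_b == "active"


# HA1 fails with no backup path: both units end up active.
no_backup = ControlPaths(ha1_up=False)
print(is_split_brain(next_state("active", no_backup),
                     next_state("passive", no_backup)))  # True

# HA1 fails but Heartbeat Backup still answers: the passive unit stays put.
with_heartbeat = ControlPaths(ha1_up=False, heartbeat_backup_up=True)
print(is_split_brain(next_state("active", with_heartbeat),
                     next_state("passive", with_heartbeat)))  # False
```

The model deliberately ignores timers, priorities and preemption; it only shows why a single extra control path is enough to keep the passive unit from taking over.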

Wenar · ‎12-11-2014

If you see Split-Brain messages, it means that heartbeats between the two peers were missed. In the worst case both devices stay active and your network will go down.

Please check that the cables and any network devices between the HA1 ports are still working without problems, and configure a backup heartbeat over another interface.

Zewwy · ‎12-11-2014

Thanks for the suggestion! But I'm not sure what the issue could be there: both peers sit one atop the other, with a 6-inch patch cable connecting them to each other for the HA traffic, so there are no devices between the HA1 ports.

Zewwy · ‎12-11-2014

Thanks Mivaldi!

I was talking to our security consultant about the issue, and he suggested the exact same thing. I really appreciate the great explanation you provided; it really helped.

To help alleviate the issue I will set up backup heartbeats on the MGMT plane, as both of you have suggested. It's just odd, because I never got these alerts before I did the one PA RMA. Luckily, as I stated, each time the network didn't go down and it recovered by itself within seconds. But if it doesn't take much to configure a backup heartbeat, that sounds like a great solution.

Again, thanks so much for that information!

TranceforLife · ‎08-25-2016

Hello Mivaldi,

Good explanation. We don't have an HA1 backup link configured, and our Heartbeat Backup (mgmt port) was not reachable between the two PAs. This was all due to a switch issue between the devices. The passive node went to Active, so a peer split-brain occurred. Just to confirm: when the HA1 heartbeat recovers (say, the network connection is restored between the devices), will my "Active" node go back to "Passive"? If you could clarify this, that would be great. Logs below:

2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1)
2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (ha1)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 3 ping timeouts out of 3 (ha1)
2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:763): We have missed 4 pings from the peer for group 1 (ha1), restarting connection
2016-08-25 00:59:41.709 +0100 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: HA1 connection down
2016-08-25 00:59:41.709 +0100 Warning: ha_event_log(src/ha_event.c:47): HA Group 1: All HA1 connections down
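
For anyone who wants to summarize an excerpt like this, the missed-ping counters can be tallied per link with a short script. This is a rough sketch that assumes the message format stays exactly as pasted above; `MISS_RE`, `max_missed_per_link` and `links_at_threshold` are illustrative names, not part of any Palo Alto tooling.

```python
import re

# Matches messages such as "Missed 2 ping timeouts out of 3 (ha1)".
MISS_RE = re.compile(r"Missed (\d+) ping timeouts out of (\d+) \((\w+)\)")


def max_missed_per_link(log_text: str) -> dict[str, tuple[int, int]]:
    """Return {link: (highest missed count seen, threshold)}."""
    worst: dict[str, tuple[int, int]] = {}
    for match in MISS_RE.finditer(log_text):
        missed, limit, link = int(match.group(1)), int(match.group(2)), match.group(3)
        if link not in worst or missed > worst[link][0]:
            worst[link] = (missed, limit)
    return worst


def links_at_threshold(log_text: str) -> list[str]:
    """Links whose missed count reached the threshold in this excerpt."""
    return sorted(
        link
        for link, (missed, limit) in max_missed_per_link(log_text).items()
        if missed >= limit
    )


SAMPLE = """\
2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (ha1)
2016-08-25 00:59:39.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 1 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (ha1)
2016-08-25 00:59:40.706 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 2 ping timeouts out of 3 (mgmt)
2016-08-25 00:59:41.707 +0100 Error: ha_ping_peer_miss(src/ha_ping.c:756): Missed 3 ping timeouts out of 3 (ha1)
"""

print(max_missed_per_link(SAMPLE))  # {'ha1': (3, 3), 'mgmt': (2, 3)}
print(links_at_threshold(SAMPLE))   # ['ha1']
```

In the excerpt as pasted, only ha1 is shown reaching 3 of 3; the mgmt counter is only visible up to 2 of 3, so the script reports just what the excerpt contains.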

Thx,

Myky

jdelio · ‎08-25-2016

For the record, @TranceforLife started this new topic for his question above:

HA peer split-brain recovery?

If you have configured the Preemptive settings, then these will control which unit is active after communication is restored.

LIVEcommunity team member
Stay Secure,
Joe

TranceforLife · ‎08-25-2016

Hello Jdelio,

Nice one. I could see it in the logs but just wanted to confirm. Logs are from the passive unit, which became Active due to missing heartbeats. Preemption settings are 100/50:

2016/08/25 00:59:41 critical ha connect 0 HA Group 1: HA1 connection down

2016/08/25 00:59:41 critical ha connect 0 HA Group 1: All HA1 connections down

2016/08/25 00:59:41 critical ha connect 0 HA Group 1: HA heartbeat backup is being used to avoid split-brain; the HA functionality is in a degraded state pending the recovery of HA1

2016/08/25 00:59:41 critical ha connect 0 HA Group 1: HA heartbeat backup connection down

2016/08/25 00:59:42 high ha peer-ve 0 HA Group 1: Anti-Virus version does not match

2016/08/25 00:59:42 high ha peer-ve 0 HA Group 1: Application Content version does not match

2016/08/25 00:59:42 high ha peer-ve 0 HA Group 1: Threat Content version does not match

2016/08/25 00:59:42 info ha peer-ve 0 HA Group 1: Anti-Virus version now matches

2016/08/25 00:59:42 info ha peer-ve 0 HA Group 1: Application Content version now matches

2016/08/25 00:59:42 info ha peer-ve 0 HA Group 1: Threat Content version now matches

2016/08/25 00:59:42 info ha ha1-lin 0 HA1 peer link up

2016/08/25 00:59:42 info ha ha2-lin 0 HA2-Backup peer link up

2016/08/25 00:59:42 info ha ha2-lin 0 HA2 peer link up

2016/08/25 00:59:43 high ha state-c 0 HA Group 1: Moved from state Passive to state Active

2016/08/25 00:59:44 info ha connect 0 HA Group 1: Control link running on HA1 connection

2016/08/25 00:59:44 info ha connect 0 HA Group 1: HA1 connection up

2016/08/25 00:59:44 critical ha split-b 0 HA Group 1: Going to Passive state due to split-brain recovery (split-brain duration: 1s)

2016/08/25 00:59:44 info ha connect 0 HA Group 1: HA heartbeat backup connection up

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/1: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/2: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/3: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/5: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/6: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/7: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/9: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/10: Down 1Gb/s-full duplex

2016/08/25 00:59:44 info routing routed- 0 FIB HA sync started when local device becomes master.

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/13: Up auto duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/14: Up auto duplex

2016/08/25 00:59:44 info port ethern link-ch 0 ethernet1/19: Up 1Gb/s-full duplex

2016/08/25 00:59:45 info ha session 0 HA Group 1: Starting session synchronization with peer

2016/08/25 00:59:45 info routing routed- 0 FIB HA sync started when local device becomes master.

2016/08/25 00:59:47 info port ethern link-ch 0 ethernet1/1: Up 1Gb/s-full duplex

2016/08/25 00:59:48 info port ethern link-ch 0 ethernet1/3: Up 1Gb/s-full duplex

2016/08/25 00:59:48 info port ethern link-ch 0 ethernet1/9: Up 1Gb/s-full duplex

2016/08/25 00:59:48 info port ethern link-ch 0 ethernet1/2: Up 1Gb/s-full duplex

2016/08/25 00:59:48 info ha session 0 HA Group 1: Completed session synchronization with peer

2016/08/25 00:59:48 info port ethern link-ch 0 ethernet1/6: Up 1Gb/s-full duplex

2016/08/25 00:59:49 info port ethern link-ch 0 ethernet1/10: Up 1Gb/s-full duplex

2016/08/25 00:59:49 info port ethern link-ch 0 ethernet1/7: Up 1Gb/s-full duplex

2016/08/25 00:59:50 info port ethern link-ch 0 ethernet1/5: Up 1Gb/s-full duplex

2016/08/25 00:59:51 info ha state-c 0 HA Group 1: Moved from state Initial to state Passive

2016/08/25 00:59:51 info ha session 0 HA Group 1: Starting session synchronization with peer

2016/08/25 00:59:54 info ha session 0 HA Group 1: Completed session synchronization with peer
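
The split-brain recovery messages in this thread all share the same wording, so the resulting state and duration can be extracted with a regular expression. This is a sketch built only from the two message variants that appear in this thread; `RECOVERY_RE` and `split_brain_recoveries` are made-up names, and other PAN-OS versions may word the message differently.

```python
import re

# Covers both variants seen in this thread:
#   "Staying in Active state after split-brain recovery (split-brain duration: 1s)"
#   "Going to Passive state due to split-brain recovery (split-brain duration: 1s)"
RECOVERY_RE = re.compile(
    r"(?:Staying in|Going to) (\w+) state .*?split-brain recovery "
    r"\(split-brain duration: (\d+)s\)"
)


def split_brain_recoveries(log_text: str) -> list[tuple[str, int]]:
    """Return (resulting state, duration in seconds) for each recovery message."""
    return [(m.group(1), int(m.group(2))) for m in RECOVERY_RE.finditer(log_text)]


SAMPLE = """\
2016/08/25 00:59:43 high ha state-c 0 HA Group 1: Moved from state Passive to state Active
2016/08/25 00:59:44 critical ha split-b 0 HA Group 1: Going to Passive state due to split-brain recovery (split-brain duration: 1s)
HA Group 1: Staying in Active state after split-brain recovery (split-brain duration: 1s)
"""

print(split_brain_recoveries(SAMPLE))  # [('Passive', 1), ('Active', 1)]
```

Run against logs from both peers, this gives a quick view of which unit gave up the Active role and how long each split-brain lasted.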

Thx,

Myky
