HA broken after upgrading to 6.0.3

L3 Networker

HA broken after upgrading to 6.0.3

Hi,

We have just upgraded our 5020s and 3020s to 6.0.3 and hit an issue, on the 5020s only, where the secondary device became the Active one and the primary displays this error:

[Screenshot: HA peer.JPG]

Has anybody else had this issue, or does anyone know how to solve it?

Thank you.

L5 Sessionator

Re: HA broken after upgrading to 6.0.3

Hello MMCiobanu,

This could be due to the dataplane hitting a kernel NMI, causing the control plane to exit with an internal path monitoring failure. We need to determine the reason for the initial NMI failure, which could be either software-caused memory corruption or a hardware problem.

That calls for a support case to be opened so that the assigned TAC engineer can investigate the tech support files. Remember, the tech support files need to be gathered, along with crash files (if any), before rebooting the firewall.

Hope that helps!

Thanks and regards,

Kunal Adak

L1 Bithead

Re: HA broken after upgrading to 6.0.3

Hi,

Got the same error after upgrading a pair of PA-5020s in A/P HA, and the primary went into maintenance mode.

Do you have the following error by any chance? "control_plane: Depend script failed max times".


I'm opening a case as well, but if you have any update, I'd appreciate any feedback.

L3 Networker

Re: HA broken after upgrading to 6.0.3

Hi hopcio,

We did not have the error you had; our device went into "non-functional" mode. We had a case open and sent the tech-support file, but we did not hear back by the end of the day, so we decided to schedule a reboot of the broken unit while the second one was still Active and passing traffic. The reboot has fixed the issue so far, but we still don't know what caused it in the first place.

I will post an update once I hear back from support.

L3 Networker

Re: HA broken after upgrading to 6.0.3

Hi!

Any updates for this?

Thanks,

Alex

L3 Networker

Re: HA broken after upgrading to 6.0.3

No update from the Palo Alto support team, but, as I said, we solved the issue by rebooting the broken unit. We still don't know what caused the issue in the first place.

L7 Applicator

Re: HA broken after upgrading to 6.0.3

Thanks for sharing and keep us posted on the progress here.

We are looking at when to move from 5 to 6 and are running four A/A HA clusters in the DC. I'm glad the fix is a reboot, but I wonder what may bring the issue back, since the root cause has not been identified yet.

Steve Puluka BSEET - IP Architect - DQE Communications (Metro Ethernet/ISP)
ACE PanOS 6; ACE PanOS 7; ASE 3.0; PSE 7.0 Foundations & Associate in Platform; Cyber Security; Data Center
L3 Networker

Re: HA broken after upgrading to 6.0.3

Palo Alto is still investigating. This is their response:

"From the analysis of the collected files, the dataplane restart happened initially at 17:02 on 06/15, when an internal path monitoring failure was detected. The internal path monitoring failure occurs when internal monitor packets between the daemons in the firewall are not ACKed in time. When the packets are not ACKed, the firewall assumes some daemon is down and triggers a dataplane restart.


2014/06/15 16:56:45 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:02:39 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:02:39 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:02:39 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:02:53 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 17:03:17 critical general general 0 Internal packet path monitoring failure, restarting dataplane

The device became active after the dataplane restart, again detected the internal path monitoring failure, and triggered another dataplane restart. This time the dataplane didn't restart and stayed down, leaving the device in a non-functional state until you rebooted it the next day.

2014/06/15 17:07:30 info ha state-c 0 HA Group 30: Moved from state Initial to state Passive
2014/06/15 17:09:42 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:33:46 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:33:46 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:33:46 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:34:25 critical general general 0 Internal packet path monitoring failure, restarting dataplane
2014/06/15 22:24:45 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 22:25:05 critical general general 0 Internal packet path monitoring failure, restarting dataplane

The DP1 CPU was under high load right before the dataplane restart, and we did see some software buffer depletion, which could have triggered the internal path failure."


They opened a bug with the development team to validate the findings and analyze the crash files and logs further, and they want us to call them if this happens again so they can do a live debug.
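
To make the mechanism in the support response a bit more concrete, here is a minimal Python sketch of a heartbeat-style monitor: it counts consecutive heartbeats that were not ACKed and triggers a restart once a threshold is reached. This is only a toy model, not PAN-OS code. The class name, the function name, and the threshold value are all hypothetical; the thread does not state the real threshold or timing.

```python
from dataclasses import dataclass


@dataclass
class PathMonitor:
    """Toy model of a heartbeat-based internal path monitor (hypothetical)."""

    max_missed: int = 3  # assumed threshold; the real value is not given in the thread
    missed: int = 0      # consecutive heartbeats that were not ACKed in time
    restarts: int = 0    # number of restarts triggered so far

    def record(self, acked: bool) -> bool:
        """Record one heartbeat result; return True if a restart was triggered."""
        if acked:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            # Too many consecutive misses: assume the peer daemon is down.
            self.restarts += 1
            self.missed = 0
            return True
        return False


def simulate(acks, max_missed=3):
    """Return the indexes of the heartbeats at which a restart was triggered."""
    monitor = PathMonitor(max_missed=max_missed)
    return [i for i, acked in enumerate(acks) if monitor.record(acked)]


if __name__ == "__main__":
    # True = ACKed in time, False = missed or late.
    history = [True, False, False, False, True, False, False, True]
    print(simulate(history))
```

The point of the sketch is that the monitor cannot tell a dead daemon from a live one that is too busy to ACK in time, which would be consistent with support's note about high DP1 CPU load and buffer depletion right before the restart.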

L7 Applicator

Re: HA broken after upgrading to 6.0.3

Thanks for the details.

I would be concerned about finding the root cause on this one. A series of events that brings down a cluster member is a pretty serious bug. But if it only occurs immediately after the upgrade and clears with a reboot, it could possibly be covered in the upgrade outage window.

Did you upgrade and reboot both members together or one at a time? If separately, was the primary or the secondary first?

Steve Puluka BSEET - IP Architect - DQE Communications (Metro Ethernet/ISP)
ACE PanOS 6; ACE PanOS 7; ASE 3.0; PSE 7.0 Foundations & Associate in Platform; Cyber Security; Data Center
L3 Networker

Re: HA broken after upgrading to 6.0.3

This happened again, a few months after the upgrade.

It happened after a series of reboots: the secondary device was booted first, then the primary, then the secondary again. After every reboot, I waited a few minutes and made sure HA was fine; however, in the morning the secondary device had taken over as the primary and was displaying "Dataplane down: path monitor failure", and the primary had become non-functional.

After restarting the dataplane on the non functional device, it came back as passive.

The logs show that it kept restarting the dataplane for a few hours.

[Screenshot: Capture.JPG]
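
For anyone who wants to measure this from a text export of the system log rather than from a screenshot, here is a rough Python sketch that pulls out the HA state changes and the dataplane restart events and reports how long the restarts went on. The column layout (date, time, severity, type, subtype, object, message) is an assumption inferred from the excerpt support sent earlier in this thread, and the function names are made up for illustration. The sample lines are copied from that excerpt, with the truncated year on one line written in full.

```python
import re
from datetime import datetime

# Assumed layout: "YYYY/MM/DD HH:MM:SS severity type subtype object message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<severity>\S+) (?P<type>\S+) (?P<subtype>\S+) (?P<obj>\S+) "
    r"(?P<message>.*)$"
)
STATE_RE = re.compile(r"Moved from state (\S+) to state (\S+)")


def parse_line(line):
    """Return (timestamp, severity, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S")
    return ts, match.group("severity"), match.group("message")


def summarize(lines):
    """Collect HA state transitions and dataplane restart timestamps."""
    transitions, restarts = [], []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or malformed lines
        ts, _severity, message = parsed
        state = STATE_RE.search(message)
        if state:
            transitions.append((ts, state.group(1), state.group(2)))
        elif "restarting dataplane" in message:
            restarts.append(ts)
    return transitions, restarts


SAMPLE = """\
2014/06/15 17:07:30 info ha state-c 0 HA Group 30: Moved from state Initial to state Passive
2014/06/15 17:09:42 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:33:46 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:33:46 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:33:46 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:34:25 critical general general 0 Internal packet path monitoring failure, restarting dataplane
2014/06/15 22:24:45 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 22:25:05 critical general general 0 Internal packet path monitoring failure, restarting dataplane
"""

if __name__ == "__main__":
    transitions, restarts = summarize(SAMPLE.splitlines())
    for ts, old, new in transitions:
        print(f"{ts}  {old} -> {new}")
    if restarts:
        print(f"{len(restarts)} dataplane restarts over {restarts[-1] - restarts[0]}")
```

Run against a full export, the same summary would show how many restart attempts there were and over what period, which is the kind of detail support may ask for.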

We have contacted Palo Alto support again, and they are trying to figure out what causes these issues after a restart.
