This could be due to the dataplane hitting a kernel NMI, causing the control plane to exit with an internal path monitoring failure. We need to determine the reason for the initial NMI, which could be either software-caused memory corruption or a hardware problem.
That calls for a support case to be opened so that the assigned TAC engineer can investigate the tech support files. Remember, the tech support files need to be gathered, along with crash files (if any), before rebooting the firewall.
Hope that helps!
Thanks and regards,
We did not have the error you had; our device went into "non-functional" mode. We had a case open and sent the tech-support file, but we did not hear back by the end of the day and decided to schedule a reboot of the broken unit while the second one was still active and passing traffic. The reboot has fixed the issue so far, but we still don't know what caused it in the first place.
I will post an update once hearing back from support.
Thanks for sharing and keep us posted on the progress here.
We are looking at when to move from 5 to 6 and are running four A/A HA clusters in the DC. I'm glad the fix is a reboot, but wondering what may bring the issue back since the root cause is not identified yet.
Palo Alto is still investigating; this is their response:
"From the analysis of the collected files, the initial dataplane restart happened at 17:02 on 06/15, when an internal path monitoring failure was detected. The internal path monitoring failure occurs when internal monitor packets between the daemons in the firewall are not ACKed in a timely manner. When the packets are not ACKed, the firewall assumes some daemon is down and triggers a dataplane restart.
2014/06/15 16:56:45 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:02:39 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:02:39 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:02:39 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:02:53 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 17:03:17 critical general general 0 Internal packet path monitoring failure, restarting dataplane
The device became active after the restart of the dataplane, then again detected the internal path monitoring failure and triggered a dataplane restart. This time the dataplane didn't restart and stayed down, leaving the device in a non-functional state until you rebooted the device the next day.
2014/06/15 17:07:30 info ha state-c 0 HA Group 30: Moved from state Initial to state Passive
2014/06/15 17:09:42 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:33:46 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:33:46 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:33:46 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:34:25 critical general general 0 Internal packet path monitoring failure, restarting dataplane
2014/06/15 22:24:45 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 22:25:05 critical general general 0 Internal packet path monitoring failure, restarting dataplane
The DP1 CPU was under high load at the moment right before the dataplane restart, and we did see some software buffer depletion, which could have triggered the internal path failure."
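The analysis above boils down to counting how many "restarting dataplane" events appear in the system log and over what window. As a rough illustration (not a Palo Alto tool; the sample lines are copied from the excerpt quoted above), a few lines of Python can summarize those events from an exported log:

```python
from datetime import datetime

# Sample system-log lines copied from the excerpt quoted in this thread.
LOG_LINES = [
    "2014/06/15 16:56:45 high ha state-c 0 HA Group 30: Moved from state Passive to state Active",
    "2014/06/15 17:02:39 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure",
    "2014/06/15 17:03:17 critical general general 0 Internal packet path monitoring failure, restarting dataplane",
    "2014/06/15 17:33:46 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure",
    "2014/06/15 22:25:05 critical general general 0 Internal packet path monitoring failure, restarting dataplane",
]

TS_FMT = "%Y/%m/%d %H:%M:%S"  # timestamp layout used by these log lines

def summarize_restarts(lines):
    """Return (count, first_ts, last_ts) for dataplane-restart events."""
    stamps = [
        datetime.strptime(line[:19], TS_FMT)  # first 19 chars are the timestamp
        for line in lines
        if "restarting dataplane" in line
    ]
    if not stamps:
        return 0, None, None
    return len(stamps), min(stamps), max(stamps)

count, first, last = summarize_restarts(LOG_LINES)
print(count, last - first)  # → 2 5:21:48 (two restarts over ~5.4 hours)
```

Against a full export of the day's system log, the same filter would show the dataplane restarting repeatedly for several hours, matching what the thread describes below.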
They opened a bug with the development team to validate the findings and further analyze the crash files and logs. They also want us to call them if this happens again so they can do a live debug.
Thanks for the details.
I would be concerned about finding the root cause on this one. A series of events that brings down a cluster member is a pretty serious bug. But if it only occurs immediately after the upgrade and clears with a reboot, it could possibly be covered by the upgrade outage window.
Did you upgrade and reboot both members together or one at a time? If separately primary or secondary first?
This happened again, a few months after we upgraded.
It happened after a reboot: the secondary device was rebooted first, then the primary, then the secondary again. After every reboot I waited a few minutes and made sure HA was fine; however, in the morning the secondary device had become the primary, displaying "Dataplane down: path monitor failure", and the primary had become non-functional.
After restarting the dataplane on the non-functional device, it came back as passive.
From the logs it shows that it kept restarting the dataplane for a few hours.
We have contacted Palo Alto support again, and they are trying to figure out what causes these issues after the restart.
Thanks for keeping us posted.
Your reboot procedure is the same one we use when upgrading a cluster: one member at a time, secondary first.
This is a serious bug. Do I understand you correctly that traffic is still passing with the secondary node as primary? Or does this cause a full network outage?
That's right, traffic is still passing through the secondary node since it became primary; in fact, it became the only functional node. There was no network outage.
We got the failed node back into the HA cluster by restarting the dataplane on it with "request restart dataplane". This made it rejoin the cluster as the passive node.
Now we are waiting for Palo Alto to come back with some kind of resolution.