This could be due to the dataplane hitting a kernel NMI, causing the control plane to exit with an internal path monitoring failure. We need to determine the reason for the initial NMI, which could be either software-caused memory corruption or a hardware problem.
That calls for a support case to be opened so that the assigned TAC engineer can investigate the tech support files. Remember, the tech support files need to be gathered, along with crash files (if any), before rebooting the firewall.
Hope that helps!
Thanks and regards,
We did not have the error you had; our device went into "non-functional" mode. We had a case open and sent the tech-support file, but we did not hear back by the end of the day and decided to schedule a reboot of the broken unit while the second one was still active and passing traffic. The reboot has fixed the issue so far, but we still don't know what caused it in the first place.
I will post an update once hearing back from support.
Thanks for sharing and keep us posted on the progress here.
We are looking at when to move from 5 to 6 and are running four A/A HA clusters in the DC. I'm glad the fix is a reboot, but wondering what may bring the issue back since the root cause is not identified yet.
Palo Alto is still investigating; this is their response:
"From the analysis of the collected files, the initial dataplane restart happened at 17:02 on 06/15, when an internal path monitoring failure was detected. The internal path monitoring failure occurs when internal monitor packets between the daemons in the firewall are not ACKed in a timely manner. When the packets are not ACKed, the firewall assumes some daemon is down and triggers a dataplane restart.
2014/06/15 16:56:45 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:02:39 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:02:39 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:02:39 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:02:53 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 17:03:17 critical general general 0 Internal packet path monitoring failure, restarting dataplane
The device became active after the restart of the dataplane, then again detected the internal path monitoring failure and triggered a dataplane restart. This time the dataplane didn't restart and stayed down, leaving the device in a non-functional state until you rebooted the device the next day.
2014/06/15 17:07:30 info ha state-c 0 HA Group 30: Moved from state Initial to state Passive
2014/06/15 17:09:42 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:33:46 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:33:46 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:33:46 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:34:25 critical general general 0 Internal packet path monitoring failure, restarting dataplane
2014/06/15 22:24:45 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 22:25:05 critical general general 0 Internal packet path monitoring failure, restarting dataplane
The DP1 CPU was under high load at the moment right before the dataplane restart, and we did see some software buffer depletion, which could have triggered the internal path failure."
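The analysis above boils down to counting how many "restarting dataplane" events appear in the system log and over what window. As a rough illustration (not a Palo Alto tool; the sample lines are copied from the excerpt quoted above), a few lines of Python can summarize those events from an exported log:

```python
from datetime import datetime

# Sample system-log lines copied from the excerpt quoted in this thread.
LOG_LINES = [
    "2014/06/15 16:56:45 high ha state-c 0 HA Group 30: Moved from state Passive to state Active",
    "2014/06/15 17:02:39 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure",
    "2014/06/15 17:03:17 critical general general 0 Internal packet path monitoring failure, restarting dataplane",
    "2014/06/15 17:33:46 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure",
    "2014/06/15 22:25:05 critical general general 0 Internal packet path monitoring failure, restarting dataplane",
]

TS_FMT = "%Y/%m/%d %H:%M:%S"  # timestamp layout used by these log lines

def summarize_restarts(lines):
    """Return (count, first_ts, last_ts) for dataplane-restart events."""
    stamps = [
        datetime.strptime(line[:19], TS_FMT)  # first 19 chars are the timestamp
        for line in lines
        if "restarting dataplane" in line
    ]
    if not stamps:
        return 0, None, None
    return len(stamps), min(stamps), max(stamps)

count, first, last = summarize_restarts(LOG_LINES)
print(count, last - first)  # → 2 5:21:48 (two restarts over ~5.4 hours)
```

Against a full export of the day's system log, the same filter would show the dataplane restarting repeatedly for several hours, matching what the thread describes below.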
They opened a bug with the development team to validate the findings and further analyze the crash files and logs. They also want us to call them if this happens again so they can do a live debug.
Thanks for the details.
I would be concerned about finding the root cause on this one. A series of events that brings down a cluster member is a pretty serious bug. But if it only occurs immediately after the upgrade and clears with a reboot, it could possibly be covered by the upgrade outage window.
Did you upgrade and reboot both members together or one at a time? If separately primary or secondary first?
This happened again, a few months after we upgraded.
It happened after a reboot: the secondary device was rebooted first, then the primary, then the secondary again. After every reboot I waited a few minutes and made sure HA was fine; however, in the morning the secondary device had become the primary, displaying "Dataplane down: path monitor failure", and the primary had become non-functional.
After restarting the dataplane on the non-functional device, it came back as passive.
From the logs it shows that it kept restarting the dataplane for a few hours.
We have contacted Palo Alto support again, and they are trying to figure out what causes these issues after the restart.
Thanks for keeping us posted.
Your reboot procedure is the same one we use when upgrading a cluster: one member at a time, secondary first.
This is a serious bug. Do I understand you correctly that traffic is still passing with the secondary node as primary? Or does this cause a full network outage?
That's right, traffic is still passing through the secondary node since it became primary; in fact, it became the only functional node. There was no network outage.
We got the failed node back into the HA cluster by restarting the dataplane on it with "request restart dataplane". This made it rejoin the cluster as the passive node.
Now we are waiting for Palo Alto to come back with some kind of resolution.