We have just upgraded our PA-5020s and PA-3020s to PAN-OS 6.0.3 and encountered an issue where the secondary device became the Active one and the primary displays this error (only on the 5020):
Has anybody else had this issue, or does anyone know how to solve it?
This could be due to the dataplane hitting a kernel NMI, causing the control plane to exit with an internal path monitoring failure. We need to determine the reason for the initial NMI, which could be either software-caused memory corruption or a hardware problem.
That calls for a support case to be opened so that the assigned TAC engineer can investigate the tech support files. Remember, the tech support files need to be gathered, along with crash files (if any), before rebooting the firewall.
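For reference, that evidence can usually be collected from the firewall CLI before the reboot. A rough sketch (the `user@host:path` destination is a placeholder, and exact command availability can vary by PAN-OS version):

```
> show system files
> scp export tech-support to user@host:path
```

The first command lists any core/crash files present on the device; the second off-loads the tech-support bundle so it can be attached to the case before the reboot wipes anything.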
Hope that helps!
Thanks and regards,
Got the same error after upgrading a pair of PA-5020s in HA A/P, and the primary went into maintenance mode.
Do you have the following error by any chance? "control_plane: Depend script failed max times".
I'm opening a case as well, but if you have any update I'd appreciate the feedback.
We did not have the error you had; our device went into "non-functional" mode. We had a case open and sent the tech-support file, but we did not hear back by the end of the day and decided to schedule a reboot of the broken unit while the second one was still Active and passing traffic. The reboot fixed the issue, so far, but we still don't know what caused it in the first place.
I will post an update once I hear back from support.
No update from the Palo Alto support team yet; but, as I said, we solved the issue by rebooting the broken unit. We still don't know what caused the issue in the first place.
Thanks for sharing and keep us posted on the progress here.
We are looking at when to move from 5.x to 6.x and are running four A/A HA clusters in the DC. I'm glad the fix is a reboot, but I'm wondering what may bring the issue back, since the root cause has not been identified yet.
Palo Alto is still investigating; this is their response:
"From the analysis of the collected [logs], the dataplane restart happened initially at 17:02 on 06/15, where an internal path monitoring failure was detected. The internal path monitoring failure occurs when internal monitor packets between the daemons in the firewall are not ACKed in time. When the packets are not ACKed, the firewall assumes some daemon is down and triggers a dataplane restart.
2014/06/15 16:56:45 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:02:39 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:02:39 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:02:39 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:02:53 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 17:03:17 critical general general 0 Internal packet path monitoring failure, restarting dataplane
The device became active after the restart of the dataplane, then again detected the internal path monitoring failure and triggered another dataplane restart. This time the dataplane didn't come back up and stayed down, leaving the device in a non-functional state until you rebooted it the next day.
2014/06/15 17:07:30 info ha state-c 0 HA Group 30: Moved from state Initial to state Passive
2014/06/15 17:09:42 high ha state-c 0 HA Group 30: Moved from state Passive to state Active
2014/06/15 17:33:46 high general general 0 9: path_monitor HB failures seen, triggering HA DP down
2014/06/15 17:33:46 critical ha datapla 0 HA Group 30: Dataplane is down: path monitor failure
2014/06/15 17:33:46 critical ha state-c 0 HA Group 30: Moved from state Active to state Non-Functional
2014/06/15 17:34:25 critical general general 0 Internal packet path monitoring failure, restarting dataplane
2014/06/15 22:24:45 high general general 0 flow_mgmt: exiting because missed too many heartbeats
2014/06/15 22:25:05 critical general general 0 Internal packet path monitoring failure, restarting dataplane
The DP1 CPU was under high load right before the dataplane restart, and we did see some software buffer depletion, which could have triggered the internal path failure."
They opened a bug with the development team to validate the findings and analyze the crash files and logs further. And they want us to call them if this happens again, so they can do a live debug.
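For what it's worth, the dataplane CPU load and buffer depletion that support mentioned can be watched from the CLI. A sketch, assuming a recent PAN-OS release (output format varies by platform):

```
> show running resource-monitor minute
> show system resources
```

The first shows per-core dataplane utilization and packet-buffer/descriptor usage over recent intervals; the second is a top-style view of the management plane. Spotting sustained high DP load before the failure recurs might help with the live debug they asked for.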
Thanks for the details.
I would be concerned about finding the root cause on this one. A series of events that brings down a cluster member is a pretty serious bug, but if it only occurs immediately after the upgrade and clears with a reboot, it could possibly be covered within the upgrade outage window.
Did you upgrade and reboot both members together or one at a time? If separately, primary or secondary first?
This has happened again, a few months after the upgrade.
It happened after a reboot: the secondary device was rebooted first, then the primary, then the secondary again. After every reboot I waited a few minutes and made sure HA was fine; however, in the morning the secondary device had become the primary, displaying "Dataplane down: path monitor failure", and the primary had become non-functional.
After restarting the dataplane on the non-functional device, it came back as passive.
From the logs, it appears the dataplane kept restarting for a few hours.
We have contacted Palo Alto support again and they are trying to figure out what causes these issues after a restart.
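In case it's useful to others hitting this, the recovery steps we used correspond roughly to these CLI commands (a sketch; behavior may differ by PAN-OS version, and restarting the dataplane is disruptive, so only do it on the non-functional member while the peer is passing traffic):

```
> show high-availability state
> request restart dataplane
> show high-availability all
```

Check the HA state first, restart the dataplane on the broken unit, then confirm it rejoins the pair as passive.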