I've got a couple of support tickets open on my issues; just seeing if anyone has any suggestions or ideas while I wait for Support to help me out.
Configuration: 2x PA-5220's in HA (Active-Passive)
Code: 8.0.11-h1 (Moving to 8.0.12 per PA recommendation)
Initial Issue: Started having Commit failures on our Primary PAN.
After some troubleshooting with TAC they suggested we fail over to the Passive PAN and try commits and then do a firmware upgrade. This morning we were able to fail over, and tested commits on the Passive (now active) firewall. The commits worked fine. Great news.
Time to do the firmware update... Started with the Primary firewall. Suspended it before doing the install. Everything looked great: downloaded the new firmware and went through the install process. After the reboot I logged back in and noticed that the firewall seemed to be in a bit of a funky state. For example, my AD credentials were not working. I logged in with local admin credentials and took a look at the jobs. Per the Upgrade KB, the "Auto Commit" job should complete, but this job is failing.
Because it's in a funky state I don't feel comfortable doing the upgrade on the Secondary firewall (now active). HA is down and everyone is now a little concerned because we're also only running on one PAN. What's interesting is that before our HA broke, we could make a change on the Passive, run a commit and it would synchronize.
Anyone run into similar issues? Ideas? Fixes?
On a side note, less than happy with Support. I opened a second Critical ticket because of the upgrade failure, as we are now in a worse state than before (no HA) because of their recommendation to do the firmware upgrade. We told them our reservations about doing a firmware upgrade because during troubleshooting there seemed to be database errors related to commits. And despite it being critical (to us) they want to wait to "confer" with the original engineer. Only thing is he doesn't start work for another ~4 hours. smh
High availability won't be operational if the AutoCommit job fails. That means you won't have a passive firewall to fail over to, so I think you should definitely keep the upgrade of the second firewall postponed for the time being.
If you click on the commit details, does it give you any reason as to why the commit failed?
When a commit starts, there are multiple phases:
phase0 checks that all processes are willing to accept the new configuration
phase1 is a validation stage, a commit might fail here due to an invalid configuration
phase2 is where the configuration is actually pushed
The command "show management-clients" tells us which processes failed with the commit, and at what phase.
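To make the phase idea concrete, here is a toy model in Python of a commit that walks the three phases and reports which processes failed and where. It is only a sketch of the concept described above, not how PAN-OS is implemented: the function name run_commit, the process names and the rule that the commit stops at the first failing phase are all assumptions made for illustration.

```python
# Toy model only: names and stop-at-first-failure behaviour are assumptions,
# not PAN-OS internals.
PHASES = ("phase0", "phase1", "phase2")


def run_commit(clients):
    """Walk the commit phases in order.

    clients maps a (hypothetical) process name to the phase it fails at,
    or None if it accepts the configuration in every phase.
    Returns (succeeded, failed_phase, failed_clients).
    """
    for phase in PHASES:
        failed = sorted(name for name, fails_at in clients.items()
                        if fails_at == phase)
        if failed:
            # Report the first phase in which any process failed.
            return False, phase, failed
    return True, None, []


if __name__ == "__main__":
    # "proc-a" and "proc-b" are made-up process names.
    print(run_commit({"proc-a": None, "proc-b": "phase1"}))
    print(run_commit({"proc-a": None, "proc-b": None}))
```

Knowing both the process and the phase narrows things down: in this model a phase1 failure points at the configuration itself, while a phase0 failure points at a process that is not ready to take a new configuration.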
Edit: For advanced troubleshooting purposes only - you can even exclude a process from the commit operation if we believe the issue is specific to that process with the command "debug management-server client disable"
Further to the commit details, you can get some very verbose logs in the management-server log that can help us troubleshoot why a commit has failed. You can view it with the command "less mp-log ms.log"
As you probably know, the AutoCommit job is very important to PAN-OS as it initialises a lot of the processes, components and features such as high-availability, and it doesn't usually fail unless something has gone pretty wrong.
You may be lucky in that the AutoCommit failing was a one-time occurrence. If so, you can try restarting the management-server (not service affecting, but you will lose GUI and CLI access for ~5 mins) and see how HA status is afterwards.
"debug software restart process management-server"
Thanks for the insight @LukeBullimore! When working with TAC they checked many of the things you pointed out, and your info helps me fill in what TAC was looking at and doing.
Ultimately this mostly got fixed on Friday and I literally just finished the final pieces of getting these 5220's in a good state (synced and matching on firmware and app versions). Turns out an EDL that was being used by MineMeld had some invalid characters in a URL, which really hosed up one of the databases on our Active firewall (related to this: https://live.paloaltonetworks.com/t5/MineMeld-Discussions/MineMeld-Invalid-character-for-PaloAlto-Im...). This was causing commits to fail and later, when trying to upgrade the firmware, caused the auto commits to fail. The curious thing was that when we failed over to our passive and tried doing commits, they worked fine, so everyone's guess is that the invalid characters were not present at the time we tried this on the passive. In any case, after about a day of troubleshooting with TAC, the solution was to do a factory reset on the box.
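For anyone who wants to sanity-check a list before the firewall pulls it, a rough Python sketch along these lines can flag suspect entries. The thread does not say exactly which characters caused the problem, so the rule used here (anything other than ASCII letters, digits and punctuation is suspect) is an assumption, and find_suspect_entries is a made-up helper name, not part of MineMeld or PAN-OS.

```python
import string

# Assumption: anything outside ASCII letters, digits and punctuation
# (including spaces and non-ASCII characters) is treated as suspect.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation)


def find_suspect_entries(lines):
    """Return (line_number, entry, bad_chars) for each entry that contains
    at least one character outside ALLOWED. Blank lines are skipped."""
    suspects = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.strip()
        if not entry:
            continue
        bad = sorted({ch for ch in entry if ch not in ALLOWED})
        if bad:
            suspects.append((number, entry, bad))
    return suspects


if __name__ == "__main__":
    # Made-up sample entries for illustration.
    sample = ["example.com/path", "ex\u00e4mple.com/login", "", "bad host.example/x"]
    for number, entry, bad in find_suspect_entries(sample):
        print(number, repr(entry), bad)
```

Running something like this against the list output before it is published would at least show which line to look at, instead of finding out through a failed commit.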
Not a fun day because getting the two boxes in a stable state caused a couple of outages to our production environment during the day.