I recently ran into an issue and wanted to get some feedback from other users to see if anyone else has had something similar happen.
Our active firewall tanked and transferred responsibility to its peer. The failover worked as it should, so I contacted support to check what the issue could have been. After looking at the tech support files, they discovered that it's a memory leak issue in the 4.1.5 release and that we should upgrade to 4.1.7 because apparently it fixes "hundreds of memory leak issues". So we upgraded, and everything was working fine... for about 2 hours. I tried accessing the CLI and GUI of the active firewall but was unable to. However, the passive was working fine AND the data plane on the active was still working as well. After doing a tac-login with a challenge/response so the tech could have root access to my box, he was able to restart the authd service. It turns out there's yet another issue in 4.1.7, a race condition where lots of log queries happening at the same time cause the authd service to fail. This is where h2, or hotfix 2, comes in and fixes the issue.
Is it just me, or does every new code version Palo Alto releases break something that worked in the previous release? I've been dealing with this exact scenario since the 4.0.x days, and frankly, it's getting annoying having to upgrade our firewalls every 6 weeks whenever they release new code.
I also upgraded one customer (a cluster of PA-500s) from 4.0.x to 4.1.x.
The first release I tried was 4.1.6. It ran... for 3 days. A reboot was required every week (at least twice).
Then I moved to 4.1.7, 10 days ago.
Three days ago, I was unable to log in to the active firewall (the backup one was working fine).
The management plane didn't respond correctly.
Because some rules are based on User-ID, some policies didn't work...
4.0.x is VERY stable now.
4.1.x still needs some fixes, in my view.
I will probably wait for 4.1.8 or 4.1.9 before upgrading other customers...
How do you restart the authd service from the CLI?
We too had the issues you described with the management GUI/CLI not responding. I'm having the same thoughts about 4.1.X, still waiting for a "stable" release, without having to upgrade to a new release that solves certain bugs that affect us, but introduces new ones (which will be solved in the next and so on...)
I also had unresponsive management (GUI, CLI and Serial) on the active box of a 2050-cluster just a few days after upgrading to 4.1.7.
A reboot was the only solution.
The issue has not yet occurred on any of our 4000 or 5000 clusters that were upgraded to 4.1.7.
Can I conclude that the issue is only on the 2000 series?
Is it possible to restart the authd or mgmt service on the active unit via the CLI of the passive unit?
It would be nice to have a techdoc from the TAC on this issue, which seems to impact lots of people...
The loss of management CLI/GUI access is a software issue across all hardware platforms, not just the PA 2000.
A possible workaround is to restart the masterd daemon by logging into the shell.
Please contact TAC by opening a ticket since it will require a challenge/response.
I somewhat agree with you. I forgot to add that it is a possible workaround but NOT guaranteed.
In some instances, we have succeeded in getting the prompt by pressing Ctrl + C when we get errors similar to 'Cannot connect to management server'. Once we get the prompt, we can log in as root.
However, as I said, it is not guaranteed that we will get to enter the shell.
Same issue on 4.1.7. One of our 2050s became completely unresponsive on the management side. The data plane side worked great and continued to pass traffic, although I think I was having some issues with the MP dynamic URL cache because the management side was completely eaten up. I wasn't able to log in with the serial cable either. I eventually had to force a failover by disconnecting the HA1, HA2, and management interfaces and restarting the locked-up 2050.
Frustrating for the administrators, but the users never knew we were having a problem.
Just had the issue again on a 4050. We also had it a few weeks ago on a 2050.
After pressing CTRL-C several times at the login prompt, I got the message 'Cannot connect to management server'.
I then tried 'debug software restart management-server' -> no help.
Then I issued the command 'request restart system' -> OK (but that means a reboot, of course, while the HA unit took over).
Opened a TAC case to obtain the 4.1.7 h2 hotfix.
Wondering why they don't post this one on the website....