Issue: New Palo Altos crashing domain controller with migrated config


L0 Member

Morning all!
Have an odd one for y'all that has defeated us so far. We are currently attempting to side-grade from PA-5250s to new PA-3440s but have hit a showstopper twice, and both times it forced us to revert to the PA-5250s.

Issue: The new Palo Alto firewalls (seemingly) crash our domain controllers once load hits the network (KB5035849 is uninstalled from the DCs). We do not see the issue when there is no load on the network, i.e. after hours when we make the changes. We even let the new firewalls run in service for an entire weekend with no real load and had no problems.


Quick details:
(Downsizing to reduce subscription costs)
*Config from PA-5250s was migrated to PA-3440s using Expedition to swap ports

*New PA-3440s are on 10.2.7-h3, updated after config import

*Config changes on new PA-3440s mostly consisted of cosmetic changes, plus correcting a broken certificate chain. No NAT, Security, or Authentication policies were modified
*All normal day-to-day services tested fine after hours (content filtering, VPN, IPsec tunnels, Duo auth); all WAN sites were reachable and had Internet
*Using Kerberos for agentless User-ID

Attempt One: Our first morning on the new boxes, we seemingly lose Internet when load hits the network as 20,000 users arrive. Suspecting the new firewalls, since they were the last network change, we begin the failback process. A few minutes into the process, I remote into the domain controllers to verify they have Internet access for DNS and discover they are running at 100% memory. I had just read about the memory leak caused by the March update KB5035849 and had checked our DCs before the event. At that time the DCs were running at their usual 45-50% memory usage, so I assumed we had dodged that issue, but we had scheduled an after-hours reboot of them just to be safe. Once we discovered the issue with the DCs, we rebooted them and started uninstalling the faulty patch one DC at a time. I had completed one DC when the failback to the PA-5250s finished, and strangely enough Internet service was restored even though I hadn't finished uninstalling the patch from all DCs. We were stable, so we scheduled the remaining uninstalls for after hours and, still not suspecting the PA-3440s, rescheduled the go-live for the following Monday. Updates were paused on the DCs and KB5035849 has not reinstalled since.

Attempt Two: Exact same scenario: load hits the network and the domain controllers seize up from memory usage. While the PA-3440s are still live, we reboot the domain controllers to no effect, and there is no KB to uninstall this time. We begin the process to revert to the PA-5250s and reboot the domain controllers while the PA-3440s are suspended. Internet service is restored, albeit on the PA-5250s, and memory usage returns to normal on the DCs.

Attempt Three: This has not been scheduled, as everyone is mad at me now for losing an hour on two different mornings lol.


--------------------------------------------------------------------------------------------------------------------------

We did a Config Audit on the PA-3440s comparing the running config to the one we imported and can't find any unexpected change that could possibly cause this issue. The only things standing out are the cosmetic changes we made and, the big one, the certificate corrections. Nothing stands out in the DC event logs that we can glean. We are now walking around the office hanging our heads in shame from the defeat.

If anyone has any advice or direction, we would greatly appreciate it! We're prepping to submit a formal ticket as well once we research/investigate for a few more days.



1 REPLY

Cyber Elite

Are you running clientless User-ID on these firewalls?

For this number of users I'd consider using the User-ID agent and running it on the DCs directly (or setting up an RODC and running it from there).

I know this answer sucks as it doesn't address your issue directly, but what you describe sounds like the 3k does more, or more intense, reads that cause your AD to overheat. You could try playing with the read frequency, but given the number of users this could cause User-ID to miss some user logins, which is also not great.
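
If you do experiment with the read frequency, it may help to put a number on how fast DC memory climbs under each setting rather than eyeballing Task Manager. Below is a minimal sketch in Python, assuming you already collect timestamped memory-utilisation samples from a DC by whatever means you prefer. The `Sample` type, the function names, the 90% threshold, and the sample values are all hypothetical illustrations, not part of any Palo Alto or Microsoft tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    seconds: float     # seconds since the start of the observation window
    memory_pct: float  # DC memory utilisation, 0-100


def growth_rate(samples: List[Sample]) -> float:
    """Least-squares slope of memory_pct over time, in percentage points per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(s.seconds for s in samples) / n
    mean_m = sum(s.memory_pct for s in samples) / n
    num = sum((s.seconds - mean_t) * (s.memory_pct - mean_m) for s in samples)
    den = sum((s.seconds - mean_t) ** 2 for s in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den * 60.0  # convert per-second slope to per-minute


def first_breach(samples: List[Sample], threshold_pct: float = 90.0) -> Optional[float]:
    """Return the time (seconds) of the first sample at or above the threshold, or None."""
    for s in samples:
        if s.memory_pct >= threshold_pct:
            return s.seconds
    return None


if __name__ == "__main__":
    # Made-up values for illustration only; substitute your own measurements.
    window = [Sample(0, 48.0), Sample(60, 55.0), Sample(120, 71.0), Sample(180, 92.0)]
    print(f"growth: {growth_rate(window):.1f} percentage points/min")
    print(f"first sample at or above 90%: {first_breach(window)} s")
```

Comparing the slope from a window on the old firewalls against one on the new firewalls (or against different read-frequency settings) would at least show whether the change moves DC memory in the direction you expect before the next go-live.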

Tom Piens
PANgurus - Strata specialist; config reviews, policy optimization