PA-850 Cluster Went Non Functional


L2 Linker

My PA-850 active/passive cluster went non-functional last night, causing an outage at our main corporate headquarters. I have a ticket open with PA, but they're being a bit slow on the root cause, so I figured I'd post and see if anyone has run into this before. Just before the cluster went non-functional, the logs on the active node looked like this:

850-2.PNG

And the passive node looked like this:

850-1.PNG

 

The only post I could find was about a 3000 series unit that did something similar (but didn't appear to be in an HA cluster).

Accepted Solutions

It appears we finally have a resolution to this issue, and we have been running on the 850s again for a few days without any trouble. Support eventually identified it as a problem with how the 850s handle fragmented traffic. They referenced PAN-79084 as the issue (even though its wording specifically mentions GlobalProtect). That was fixed in 8.0.4 for the 5200 series but for no other platforms. Our old 3020 was running 7.1.7, which is why it did not exhibit any problems. Support states that the particular issue affecting me was fixed for all platforms in 8.1.0 and up.
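For anyone auditing a fleet against the fix levels above, here is a rough Python sketch of that check. It only encodes what support told me (8.0.4 for the 5200 series, 8.1.0 and up for everything else). The function names are my own, the "PA-52" model-prefix match is an assumption about how 5200 series units are named, and nothing here talks to a firewall or uses any PAN-OS API. It also does not model older releases such as 7.1.7, which simply never showed the problem for us.

```python
import re
from typing import Tuple


def parse_panos_version(version: str) -> Tuple[int, int, int]:
    """Turn '8.0.9' or '8.0.9-h1' into (8, 0, 9); any hotfix suffix is ignored."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized PAN-OS version: {version!r}")
    major, minor, maintenance = (int(part) for part in match.groups())
    return major, minor, maintenance


def includes_fragment_fix(model: str, version: str) -> bool:
    """Hypothetical helper: True if this release is at a fix level named in the thread."""
    parsed = parse_panos_version(version)
    # Per support: fixed for all platforms in 8.1.0 and up.
    if parsed >= (8, 1, 0):
        return True
    # Per support: fixed in 8.0.4 for the 5200 series only.
    # Assumption: 5200 series model names start with "PA-52".
    if model.strip().upper().startswith("PA-52"):
        return parsed >= (8, 0, 4)
    return False


if __name__ == "__main__":
    for model, version in [("PA-850", "8.0.7"), ("PA-850", "8.0.9"), ("PA-850", "8.1.0")]:
        print(model, version, includes_fragment_fix(model, version))
```

Both releases we ran on the 850s (8.0.7 and 8.0.9) come back False here, which matches what we saw in production.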

6 REPLIES

L2 Linker

A bit of a non-update update: we suffered another outage on 4/16 during business hours and another on 4/17 in the early morning, and at that point we just moved back onto our old 3020.

 

Basically, after multiple calls to support and reaching out to our regional sales guy, I finally got a response Tuesday afternoon. The tech said he stopped his root cause analysis because the serial numbers in the tech support files I uploaded were the same. First, they absolutely weren't, and second, even if that were the case, why not continue analyzing the one file instead of just putting the case on hold?

 

We uploaded fresh tech support logs yesterday morning, only to be told we had again sent in logs with the same serial number, and then to have support email us 20 minutes later stating they did receive the new logs and would begin analysis. We didn't have the core files available from the last crash, so we decided to put the 850s back in production last night to see if they would crash again, and they did. So we got a good set of logs sent in this morning.

 

We're now basically one week into an outage, and the only reason we're up is that we had old hardware available. I had to stomp my feet a bit with support to get temporary Threat/AV licensing for the 3020, but at least they came through on that, I suppose.

 

This is the second major issue we've had with the 850s; the first one was here:  https://live.paloaltonetworks.com/t5/General-Topics/PA-850-Default-MTU/m-p/197124

 

This also comes on top of the PA-200 fiasco, which I posted about here:  https://live.paloaltonetworks.com/t5/General-Topics/PA200-Failures/m-p/75159. That post was nearly a year before PA said yes, we have some bad hardware, and here's how to check. By the time PA announced an official recall we had RMA'd 15 PA-200s, and after the recall we RMA'd another 12 units before the bad ones crashed on us.

 

All of this is incredibly frustrating because despite support, when the PA firewalls are working they really are great appliances. 

I feel your pain with the 850s. What PAN-OS version is now installed on yours?

We were on 8.0.7, and then before we tried putting them in production last night moved them to 8.0.9.

PA sent us RMA notices yesterday afternoon without any word directly from support. After we pressed our regional guys, the engineers said both of our units had high "U17 values", which apparently means something is wrong with the amp usage. We received the replacements this morning, but we likely won't try production until tomorrow evening.

We tried putting in the replacement 850s on Friday night, and they both went non-functional within a few seconds of having production traffic passed through them. The symptoms were identical to the previous units: the first unit passes traffic for a few seconds and goes non-functional, then the passive node becomes active for a few seconds and also goes non-functional.

 

I don't think these units are undersized. We were running on 3020s before this, and even right now we have 5% session table utilization, a new connections-per-second rate generally around 200-300, and dataplane CPU usage averaging about 25-30%. In fact, looking at our performance stats in SolarWinds, our CPU usage was actually lower on the 850s than it is on the 3020s.

 

We're about to hit a 10-day change blackout window with the beginning of the month, so unfortunately I won't be able to test anything anytime soon. For now we're entirely at the mercy of support.

