01-21-2022 01:00 PM
Recently deployed several PA-5250s running 10.1.3, and there is an issue that randomly comes and goes.
Latency for traffic going through the firewalls spikes to 100-500 ms. One thing I captured looked peculiar: the flow_fpga_ingress_exception_err count was high (8169388322) and the rate was high (12468). But I can't seem to find a good definition of what this counter would indicate.
I also caught the packet descriptor (on-chip) (average) at 100 across the first two rows.
I failed to capture the CPU Cores at the same time though.
Any ideas?
01-21-2022 01:41 PM
Known issue PAN-141630 seems to match this scenario. Interestingly, 10.1.3 is the preferred 10.1 version as of today, so upgrading to 10.1.4 may not be the best solution (yet).
You may be able to get more context by looking at the global counters: show counter global | match fpga
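A few related read-only commands may add context; this is a sketch from memory, and exact output varies by platform and PAN-OS version (the comment lines are annotations, not part of the CLI):

```
# Only counters that changed since the last read, drop severity only
show counter global filter delta yes severity drop

# The fpga counters mentioned above
show counter global | match fpga

# Dataplane core utilization and packet-descriptor usage, last 60 seconds
show running resource-monitor second last 60
```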
01-22-2022 09:08 PM
If this is mission-critical hardware and you don't require features in 10.1, I would highly recommend staying off 10.1 for the time being. There are still bugs being worked out in 10.1, and 10.0 is a fairly stable release at this point.
01-24-2022 08:59 AM
Thanks for the help, I asked TAC if we should downgrade and they replied with "why?".
01-24-2022 11:37 AM
I would ask the same question as you currently are running the preferred/stable version of 10.1.
However, instead of questioning you, I hope they are providing a solution as you seem to already have a case with them.
I am curious to hear what their solution was, if they provide one.
01-24-2022 12:35 PM
Well, right now they have told us that high flow_fpga_ingress_exception_err counts are expected behavior and not to worry about them. As for the latency, we are just shotgunning a few changes to see if anything helps: reducing the port channel down to one link, possibly disabling offloading, and a couple of others. The last resort is a downgrade to the preferred 9.x code. I will let you know if I find anything.
The reason we are considering the downgrade is that we have one 5220 running 9.x code and it doesn't experience this issue. That's all we've got, though.
01-24-2022 12:43 PM
Any idea if there's any asymmetric routing going on?
A packet capture combined with global counters may shed some light on this. If you manage to narrow this down to a sample source and destination, that would be perfect. Then see if you get a drop pcap, and use the pcap filters against the global counters.
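As a sketch of that workflow (the IPs are placeholders, and syntax should be checked against your PAN-OS version):

```
# Scope diagnostics to one flow (example addresses are hypothetical)
debug dataplane packet-diag set filter match source 10.1.1.10 destination 192.0.2.20
debug dataplane packet-diag set filter on

# Show only counters hit by the filtered traffic, delta since last read
show counter global filter packet-filter yes delta yes

# Turn the filter off when done
debug dataplane packet-diag set filter off
```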
01-24-2022 12:52 PM
There is no asymmetric routing, but we looked into that as well. We did set a packet filter and look at the drops for a single session that was experiencing latency. There were no drops. It's just slow! Very frustrating.
Thanks for the suggestion!
01-24-2022 02:00 PM
I am sorry you are going through this; I am sure you'll find the solution.
At this point, since you are already at the pcap stage, I would run packet diagnostics with flow basic and look at a low level at what the firewall is doing with each packet/session in the flow logic.
I bet you are familiar with this or have already tried it, but if not, the link below is a good read:
https://live.paloaltonetworks.com/t5/general-topics/debugging-packet-flow/td-p/67514
My approach to reading these is a bit different: I get a TSF, find the txt file there, and open it in Notepad++.
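For reference, a minimal flow basic sequence looks roughly like this; it is expensive, so run it in a short window and ideally with a tight packet filter already set (the annotation lines are comments, not CLI):

```
debug dataplane packet-diag set log feature flow basic
debug dataplane packet-diag set log on
# ... reproduce the latency for a short window ...
debug dataplane packet-diag set log off
debug dataplane packet-diag aggregate-logs
less dp-log pan_packet_diag.log
```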
02-15-2022 06:04 AM
Hey Guys!
We're having a very similar issue on our 5220 (PAN-OS 10.1.4). The latency comes and goes. CPU/memory usage is close to nothing, and the same goes for session utilization. However, every few seconds the flow_fpga_ingress_exception_err counter rises. The delta shows 50 more drops in one second, then 3000 the next.
There's a strange thing I noticed. We gather metrics with Prometheus (never mind the software), monitoring ifHCOutOctets and ifHCInOctets via SNMP. We monitor both the firewall interfaces and the (Cisco) switch ports they're connected to. We're using the same formula for bandwidth calculation and get massive differences. On the switch port our graphs show the nightly backups consuming the whole 1 Gbit of bandwidth, while in the same time period the matching firewall interface shows only ~700 Mbit/sec. It's the two ends of the same wire!
I'm not saying it's related to flow_fpga_ingress_exception_err, but packet drops (a few thousand every few seconds) could explain the difference between the two measured values.
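For what it's worth, the delta math both monitoring stacks are presumably doing looks like this (a minimal sketch; the function name and sample numbers are mine, not from either setup):

```python
def bandwidth_mbps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average bandwidth in Mbit/s between two ifHCInOctets/ifHCOutOctets
    samples, handling 64-bit counter wraparound."""
    delta = curr_octets - prev_octets
    if delta < 0:  # counter wrapped between samples
        delta += 2 ** counter_bits
    return delta * 8 / interval_s / 1_000_000

# Example: 7.5 GB in 60 s is a fully saturated 1 Gbit link
print(bandwidth_mbps(0, 7_500_000_000, 60))  # → 1000.0
```

If both ends use this formula over the same interval, a persistent gap really does point at packets vanishing between the two counters rather than at the math.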
02-15-2022 06:40 AM - edited 02-15-2022 06:43 AM
So just a quick update. The issue seemed to be related to the number of sessions we were pushing through the firewall. We handle a large number of sessions of a particular protocol, and it is our number one application by session count each day. When we put an App-ID override on the protocol, the latency appears to have cleared up.
I am now skeptical of Palo Alto's sessions-per-second capability, but if you look at the datasheets, they always show the max sessions/sec count using 1-byte HTTP traffic with an App-ID override. So it is what it is. Good luck out there.
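For anyone wanting to try the same workaround: an application override is configured under Policies > Application Override in the GUI, and the CLI equivalent is roughly the following (rule name, zones, port, and application are placeholders, and the exact syntax should be verified against your PAN-OS version):

```
configure
set rulebase application-override rules bypass-appid-rule from trust to untrust source any destination any protocol tcp port 8080 application my-custom-app
commit
```

Note that overriding App-ID skips layer-7 inspection for the matched traffic, so it trades visibility and security for throughput.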
02-16-2022 04:02 AM
Thanks for the update; I'll keep that in mind. My issue turned out to be ISP-related, so PAN-OS isn't guilty! 🙂
What a strange coincidence that was! Very, very similar issue and we came to the same (wrong) conclusion...
Regards
Attila