flow_fpga_ingress_exception_err and high latency

Alex_Huthmacher · ‎01-21-2022

Recently deployed several PA-5250s running 10.1.3, and there is an issue that randomly comes and goes.

Latency for traffic going through the firewalls spikes to 100-500 ms. I was able to capture one thing that looked peculiar: the flow_fpga_ingress_exception_err count was high (8169388322) and so was its rate (12468). But I can't seem to find a good definition of what this counter indicates.

I also caught the packet descriptor (on-chip) (average) reading, which showed 100 across the first two rows.
I failed to capture the CPU cores at the same time, though.
Any ideas?

Gustavo_Aristi · ‎01-21-2022

Known issue PAN-141630 seems to match this scenario. Interestingly, 10.1.3 is the preferred 10.1 version as of today, so upgrading to 10.1.4 may not be the best solution (yet).

You may be able to get more context by looking at the global counters: show counter global | match fpga

Cyberforce Hero.
Don't forget to hit that Like button if a post is helpful to you!

BPry · ‎01-22-2022

@Alex_Huthmacher,

If this is mission-critical hardware and you don't require features in 10.1, I would highly recommend staying off 10.1 for the time being. There are still bugs getting worked out in 10.1, and 10.0 is a fairly stable release at this point.

Alex_Huthmacher · ‎01-24-2022

Thanks for the help, I asked TAC if we should downgrade and they replied with "why?".

Gustavo_Aristi · ‎01-24-2022

I would ask the same question, as you are currently running the preferred/stable version of 10.1.

However, instead of questioning you, I hope they are providing a solution as you seem to already have a case with them.

I am curious to hear what their solution was, if they provide one.

Alex_Huthmacher · ‎01-24-2022

Well, right now they have told us that high flow_fpga_ingress_exception_err counts are expected behavior and not to worry about them. As for the latency, we are just shot-gunning a few changes to see if anything helps, like reducing the port channel down to one link, possibly disabling offloading, and a couple of others. The last resort is a downgrade to the preferred 9 code. I will let you know if I find anything.
The reason we suggested the downgrade is that we have one 5220 running 9 code and it doesn't experience this issue. That's all we've got, though.

Gustavo_Aristi · ‎01-24-2022

Any idea if there's any asymmetric routing going on?

A packet capture combined with global counters may shed some light on this. If you manage to narrow this down to a sample source and destination, that would be perfect. Then see if you get a drop pcap, and use the pcap filters against the global counters.

Alex_Huthmacher · ‎01-24-2022

There is no asymmetric routing, but we did look into that as well. We did set up a packet filter and look at the drops for a single session that was experiencing latency. There were no drops. It's just slow! Very frustrating.

Thanks for the suggestion!

Gustavo_Aristi · ‎01-24-2022

I am sorry you are going through this, I am sure you'll find the solution.

At this point, since you are already at the pcap stage, I would run packet diagnostics with flow basic and look, at a low level, at what the firewall is doing with each packet/session in the flow logic.

I bet you are familiar with that or already tried it, but if not, below is a good read:

https://live.paloaltonetworks.com/t5/general-topics/debugging-packet-flow/td-p/67514

My approach for reading these is different: I generate a TSF (tech support file), find the txt file in it, and open it in Notepad++.

PozsonyiAttila · ‎02-15-2022

Hey Guys!

We're having a very similar issue on our 5220 (PAN-OS 10.1.4). The latency comes and goes. CPU and memory usage are close to nothing, and the same goes for session utilization. However, every few seconds the flow_fpga_ingress_exception_err counter rises. The delta shows 50 more drops in one second, then 3000 in the next.
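For anyone who wants to track these jumps outside the CLI, here is a minimal Python sketch of turning periodic readings of a cumulative counter into per-second deltas. The function name and the sample values are made up for illustration; only the 50 and 3000 deltas come from what I saw, and how you collect the readings (CLI scrape, API, and so on) is left out.

```python
def counter_deltas(samples):
    """Return per-second rates between consecutive counter readings.

    samples: list of (timestamp_seconds, cumulative_counter_value) tuples,
    ordered by time. Both the name and the input shape are hypothetical.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            # Skip duplicate or out-of-order readings instead of dividing by zero.
            continue
        rates.append((v1 - v0) / elapsed)
    return rates


# Illustrative readings one second apart (the absolute values are invented).
readings = [(0, 1000), (1, 1050), (2, 4050)]
print(counter_deltas(readings))  # [50.0, 3000.0]
```

Looking at the deltas rather than the raw total makes the bursts stand out, since the cumulative value alone is always large on a busy box.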

There's a strange thing I noticed. We gather metrics with Prometheus (never mind the software) and monitor ifHCOutOctets and ifHCInOctets via SNMP. We monitor both the firewall interfaces and the (Cisco) switch ports they're connected to. We use the same formula for the bandwidth calculation on both and get massive differences. On the switch port, our graphs show the nightly backups consuming the whole 1 Gbit/s of bandwidth, while in the same time period the matching firewall interface shows only ~700 Mbit/s. It's the two ends of the same wire!
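To make the comparison concrete, this is roughly the kind of calculation involved, written as a small Python sketch rather than our actual Prometheus query. It assumes two readings of a 64-bit octet counter taken a known number of seconds apart; the function name and the sample numbers are hypothetical and only chosen to reproduce the 1 Gbit/s versus ~700 Mbit/s picture.

```python
COUNTER64_MODULUS = 2 ** 64  # the HC (high-capacity) octet counters are 64-bit


def mbit_per_sec(octets_start, octets_end, elapsed_seconds):
    """Average throughput in Mbit/s between two octet-counter readings.

    The modulo handles a single counter wrap between the two readings.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta_octets = (octets_end - octets_start) % COUNTER64_MODULUS
    return delta_octets * 8 / elapsed_seconds / 1_000_000


# Invented readings over the same 60-second window on both ends of the link.
switch_port = mbit_per_sec(0, 7_500_000_000, 60)
firewall_if = mbit_per_sec(0, 5_250_000_000, 60)
print(switch_port, firewall_if)  # 1000.0 700.0
```

If both ends are fed through the same function over the same window, any remaining gap has to come from the counters themselves, not from the formula.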

I'm not saying it's related to flow_fpga_ingress_exception_err but packet drops (a few thousand per few secs) could explain the difference between the two measured values.

Alex_Huthmacher · ‎02-15-2022

So just a quick update. The issue seemed to be related to the number of sessions we were getting through the firewall. We handle a large number of sessions for one particular protocol, and it is our number one application by session count each day. When we put an App-ID override on that protocol, the latency appears to have cleared up.
I am now skeptical of Palo's sessions-per-second capability, but if you look at the datasheets, they always show the max sessions/s count using 1-byte HTTP traffic with App-ID override. So it is what it is. Good luck out there.

PozsonyiAttila · ‎02-16-2022

Thanks for the update, I'll keep that in mind. My issue turned out to be ISP related, so PAN-OS isn't guilty! 🙂

What a strange coincidence that was! Very, very similar issue and we came to the same (wrong) conclusion...

Regards

Attila
