flow_fpga_ingress_exception_err and high latency


L1 Bithead

We recently deployed several PA-5250s running 10.1.3, and there is an issue that randomly comes and goes.

Latency for traffic going through the firewalls spikes to 100-500 ms. I was able to capture one thing that looked peculiar: the flow_fpga_ingress_exception_err count was high (8169388322) and so was its rate (12468). But I can't seem to find a good definition of what this counter indicates.

I also caught "packet descriptor (on-chip) (average)" showing 100 across the first two rows.
I failed to capture the CPU core utilization at the same time, though.
Any ideas?

11 REPLIES

L2 Linker

Known issue PAN-141630 seems to match this scenario. Interestingly, 10.1.3 is the preferred 10.1 version as of today, so upgrading to 10.1.4 may not be the best solution (yet).

 

You may be able to get more context looking at global counters: show counter global | match fpga
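
If you save that output to a text file, the counter rows can be filtered and compared offline. Below is a minimal Python sketch. It assumes each data row starts with the counter name followed by integer value and rate columns; that column layout is an assumption on my part, not confirmed output from a PA-5250. The function name and the sample text are hypothetical, and only the first row's value and rate are the numbers quoted in this thread.

```python
def parse_fpga_counters(text, needle="fpga"):
    """Return {name: (value, rate)} for counter rows whose name contains needle.

    Assumes each data row looks like: <name> <value> <rate> <rest...>.
    Header, separator, and malformed rows are skipped.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or needle not in fields[0]:
            continue
        try:
            value, rate = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # not a data row
        counters[fields[0]] = (value, rate)
    return counters


# Hypothetical saved output. Only the first row's value and rate come from
# the thread; the layout and every other field are made up for illustration.
SAMPLE = """\
name                                   value     rate severity  category  aspect    description
--------------------------------------------------------------------------------
flow_fpga_ingress_exception_err   8169388322    12468 drop      flow      offload   (description omitted)
some_other_counter                        12        0 info      flow      forward   (description omitted)
"""

if __name__ == "__main__":
    for name, (value, rate) in parse_fpga_counters(SAMPLE).items():
        print(f"{name}: value={value} rate={rate}")
```

Running the same parser on captures taken during and outside a latency spike makes it easier to see which fpga counters actually move with the problem.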

 

 

 

Cyber Elite

@Alex_Huthmacher,

If this is mission-critical hardware and you don't require features in 10.1, I would highly recommend staying off 10.1 for the time being. There are still bugs being worked out in 10.1, and 10.0 is a fairly stable release at this point.

Thanks for the help. I asked TAC if we should downgrade and they replied with "Why?"

L2 Linker

I would ask the same question, as you are currently running the preferred/stable version of 10.1.

However, instead of questioning you, I hope they are providing a solution, since you seem to already have a case open with them.

I am curious to hear what their solution is, if they provide one.

Well, right now they have told us that high flow_fpga_ingress_exception_err counts are expected behavior and not to worry about them. As for the latency, we are just shotgunning a few changes to see if anything helps, such as reducing the port channel down to one link, possibly disabling offloading, and a couple of others. The last resort is a downgrade to the preferred 9 code. I will let you know if I find anything.
The reason we suggested the downgrade is that we have one 5220 running 9 code and it doesn't experience this issue. That's all we've got, though.

L2 Linker

Any idea if there's any asymmetric routing going on?

 

A packet capture combined with global counters may shed some light on this. If you manage to narrow this down to a sample source and destination, that would be perfect. Then see whether you get a drop pcap, and use the same pcap filters against the global counters.

There is no asymmetric routing, but we did look into that as well. We set up a packet filter and looked at the drops for a single session that was experiencing latency. There were no drops. It's just slow! Very frustrating.

Thanks for the suggestion!

L2 Linker

I am sorry you are going through this; I am sure you'll find the solution.

At this point, since you are already at the pcap stage, I would run packet diagnostics with flow basic and look, at a low level, at what the firewall is doing with each packet/session in the flow logic.

 

I bet you are familiar with that or already tried it, but if not, below is a good read:

https://live.paloaltonetworks.com/t5/general-topics/debugging-packet-flow/td-p/67514

 

My approach to reading these is different: I generate a TSF (tech support file), find the text file in it, and open it in Notepad++.

L1 Bithead

Hey Guys!

 

We're having a very similar issue on our 5220 (PAN-OS 10.1.4). The latency comes and goes. CPU and memory usage are close to nothing, and the same goes for session utilization. However, every few seconds the flow_fpga_ingress_exception_err counter rises. The delta shows 50 more drops in one second, then 3000 in the next.

 

There's a strange thing I noticed. We gather metrics with Prometheus (never mind the software) and monitor ifHCOutOctets and ifHCInOctets via SNMP. We monitor both the firewall interfaces and the (Cisco) switch ports they're connected to. We use the same formula for the bandwidth calculation on both and get massive differences. On the switch port, our graphs show the nightly backups consuming the whole 1 Gbit/s of bandwidth; in the same time period, the matching firewall interface shows only ~700 Mbit/s. It's the two ends of the same wire!
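
For reference, the usual way to turn two samples of a 64-bit octet counter into a bandwidth figure is delta octets times 8, divided by the sampling interval. A minimal Python sketch of that calculation follows. The function name and the sample numbers are hypothetical, and this is not necessarily the exact formula used in the graphs described above.

```python
COUNTER64_MODULUS = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


def octets_to_bps(prev_octets, curr_octets, interval_seconds):
    """Average bits per second between two samples of an ifHC*Octets counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Modulo arithmetic tolerates a single counter wrap between samples.
    delta = (curr_octets - prev_octets) % COUNTER64_MODULUS
    return delta * 8 / interval_seconds


if __name__ == "__main__":
    # Hypothetical samples taken 60 s apart: 7.5e9 octets in 60 s is 1 Gbit/s.
    print(octets_to_bps(1_000_000, 1_000_000 + 7_500_000_000, 60))
```

Because the result is an average over the interval, both ends of the link need to be sampled over the same window before the two figures can be compared directly.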

I'm not saying it's related to flow_fpga_ingress_exception_err, but packet drops (a few thousand every few seconds) could explain the difference between the two measured values.
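
A rough way to sanity-check that idea is to convert the bandwidth gap into the packet rate that would have to go missing. The Python sketch below assumes full-size 1500-byte packets, which is only an assumption (backup traffic may use other sizes), and the function name is hypothetical.

```python
def packets_per_second_for_gap(gap_bps, packet_bytes=1500):
    """Packets per second that would have to go missing to account for gap_bps."""
    if packet_bytes <= 0:
        raise ValueError("packet_bytes must be positive")
    return gap_bps / (packet_bytes * 8)


if __name__ == "__main__":
    # Gap between the two graphs: about 1 Gbit/s on the switch port
    # versus about 700 Mbit/s on the firewall interface.
    gap_bps = 1_000_000_000 - 700_000_000
    print(packets_per_second_for_gap(gap_bps))
```

Under that packet-size assumption, a gap of about 300 Mbit/s works out to roughly 25,000 packets per second, which is the figure to compare the observed counter deltas against.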

 

 

 
