Networking or Dataplane apparent reload (5200 w/10.1)

michelealbrigo · ‎12-13-2022

I'm experiencing issues of what seems to be a "sudden restart" of all OSPF/OSPFv3 neighborship on a PA-5220/PanOS 10.1.6 firewall. I'm trying to pinpoint the root problem on my own, since our support contact (a partner, not PA's support) is taking the route that the problem is on OSPF neighbors' side, which is unlikely to be true, since it would mean that one device (PA) is working properly while 12 devices are having simultaneous problems of the exact same type on just the links connecting them to the firewall.

This happened twice this year, with different PanOS releases.

What I see in the Monitor > System is all the OSPF and OSPFv3 neighborships going down all at the same time, with no previous event leading into that. Opening the tech support bundle and looking at the logs, I can see routed starting to complain:

2022-11-30 21:52:23.071 +0100 MON: status update md(235: 10.252.1.1 => 10.252.1.2 => 10.252.1.2) Failed

[...other neighborships going down...]

**** AUDIT 0x3e01 - 91 (0000) **** I:1799db08 F:00000002

qodmnmi.c 215 :at 21:52:34, 30 November 2022 (347569777 ms)

OSPF 1 An adjacency with a neighbor has gone down.

Resources associated with database exchange for this neighbor will be

freed.

Neighbor router ID 10.252.2.2

Neighbor IP address 10.252.2.2

Interface category network interface

Interface neighbor IP addr 10.252.2.2 i/f idx 0X00000000

**** AUDIT 0x3e01 - 210 (0000) **** I:1799db08 F:00000002

qoamnfsa.c 768 :at 21:52:34, 30 November 2022 (347569780 ms)

OSPF 1 i/f idx 0X000000EC rtr ID 10.252.2.2 IP addr 10.252.2.2 neighbor FSM state has deteriorated.

Interface address = IP addr 10.252.2.1

OSPF link category = 1

Is neighbor virtual? = 0

FSM input = QOAM_NBR_INACTIVITY_TMR (13)

Old FSM state = AMB_OSPF_NBR_FULL (8)

New FSM state = AMB_OSPF_NBR_DOWN (1)

FSM action = I (9)

Neighbor friend status = 1

Number of neighbor events = 11

Number of database exchange timeouts = 0

...and some more PanOS-specific errors on dp0/mprelay for roughly the same timeframe:

2022-11-30 21:52:45.241 +0100 Error: pan_mprelay_process_net_msg_queue(src_octeon/pan_mprelay_msg.c:676): Failed modify network message(type=bulk) sysd obj (QUEUE_DROP) slot(1)

2022-11-30 21:52:49.558 +0100 Fib entry update done

2022-11-30 21:52:50.176 +0100 Error: pan_mprelay_process_net_msg_queue(src_octeon/pan_mprelay_msg.c:676): Failed modify network message(type=bulk) sysd obj (QUEUE_DROP) slot(1) (repeats 3987 times)

Hence the question: is there any other tech-support log or timeframe I should look into, to have a clue about what's causing these issues? e.g. I have events roughly 2 hours before in cp/mprelay, which look like a restart or config refresh, but I'm unsure they are related.

kiwi · ‎12-14-2022

Hi @michelealbrigo ,

Hard to say from here on.

From the top of my head I can think of the following reasons that can potentially cause high DP CPU usage ... but it can have a lot more root causes:

Spike in SSL decryption sessions.

Lot of ZIP processing (global counter: zip_process_total)

High rate of Logging (global counter: log_loss_throttle)

High rate of policy deny (global counter: flow_policy_deny)

High rate of Fragments received (global counter: flow_ipfrag_recv)

A sudden spike in CTD slowpath packets received will cause the device to backlog processing and cause high CPU (global counter: ctd_pkt_slowpath).

Packet rate is spiking (pkt_rcv and/or pkt_sent counters) caused by possible attack (DoS or zone protection could mitigate this) ?

Spike in traffic that's not being offloaded.

Spike in SMB traffic

I'd also check interface bandwidth utilization to identify any source of high rate traffic - ACC and/or traffic logs could possibly help here.

Another useful command for determining the cause of high CPU usage is:

> debug dataplane pow performance

This command returns the average processing time in micro seconds and the count of how many times this group/function was run, allowing you to identify what group or function is taking longer in cores to process.

Hope this helps,

-Kiwi.

LIVEcommunity team member, CISSP
Cheers,
Kiwi
Please help out other users and “Accept as Solution” if a post helps solve your problem !

Read more about how and why to accept solutions.

View solution in original post

kiwi · ‎12-14-2022

Hi @michelealbrigo ,

Could you check the DP CPU and packet buffers around the time of the interruption ? Spikes on DP CPU and buffers could impact OSPF and cause traffic disruption.

Also, the earliest 10.1.6 version dates from May this year (10.1.6-h6 from August) ... to rule out any bug in 10.1.6 or earlier I would strongly recommend upgrading to the current preferred release (10.1.8).

Kind regards,

-Kiwi.

LIVEcommunity team member, CISSP
Cheers,
Kiwi
Please help out other users and “Accept as Solution” if a post helps solve your problem !

Read more about how and why to accept solutions.

michelealbrigo · ‎12-14-2022

Hi and thank you for your reply. Upgrade to 10.1.8 is scheduled for this evening. I think you nailed the issue (i.e. who killed our OSPF):

2022-11-30 21:52:23.920 +0100 --- panio
:
:Resource monitoring sampling data (per second):
:
:CPU load sampling by group:
:flow_lookup : 100%
:flow_fastpath : 100%
:flow_slowpath : 100%
:flow_forwarding : 100%
:flow_mgmt : 100%
:flow_ctrl : 100%
:nac_result : 100%
:flow_np : 100%
:dfa_result : 100%
:module_internal : 100%
:aho_result : 100%
:zip_result : 100%
:pktlog_forwarding : 100%
:lwm : 0%
:flow_host : 100%
:fpga_result : 0%
:
:CPU load (%) during last 15 seconds:
:core 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 0 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
:core 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
: 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
:core 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
: 100 100 100 100 100 100 100 100 0 0 0 0 0 0 0 0
:
:Resource utilization (%) during last 15 seconds:
:session:
: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
:
:packet buffer:
: 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96

The previous sampling (10 minutes before), which I won't post, sees all CPU core loads under 7%, and all zeros on packet buffers, which is consistent with the expected load for that timeframe (i.e. very low).
Any clue on how to further investigate the issue?

kiwi · ‎12-14-2022

Hi @michelealbrigo ,

Hard to say from here on.

From the top of my head I can think of the following reasons that can potentially cause high DP CPU usage ... but it can have a lot more root causes:

Spike in SSL decryption sessions.

Lot of ZIP processing (global counter: zip_process_total)

High rate of Logging (global counter: log_loss_throttle)

High rate of policy deny (global counter: flow_policy_deny)

High rate of Fragments received (global counter: flow_ipfrag_recv)

A sudden spike in CTD slowpath packets received will cause the device to backlog processing and cause high CPU (global counter: ctd_pkt_slowpath).

Packet rate is spiking (pkt_rcv and/or pkt_sent counters) caused by possible attack (DoS or zone protection could mitigate this) ?

Spike in traffic that's not being offloaded.

Spike in SMB traffic

I'd also check interface bandwidth utilization to identify any source of high rate traffic - ACC and/or traffic logs could possibly help here.

Another useful command for determining the cause of high CPU usage is:

> debug dataplane pow performance

This command returns the average processing time in micro seconds and the count of how many times this group/function was run, allowing you to identify what group or function is taking longer in cores to process.

Hope this helps,

-Kiwi.

LIVEcommunity team member, CISSP
Cheers,
Kiwi
Please help out other users and “Accept as Solution” if a post helps solve your problem !

Read more about how and why to accept solutions.

michelealbrigo · ‎12-14-2022

Thanks again, I can try a quick evaluation of the factors you pointed at:

Spike in SSL decryption sessions: we don't decrypt very much, so I'd exclude this
Lot of ZIP processing (global counter: zip_process_total): see above, rate is consistently under 50 on the dp log
High rate of Logging (global counter: log_loss_throttle): we do some comprehensive logging, but within best practice recommendations (i.e. session end, and correct configuration of log destinations), and that counter rate is generally 0 (I have a spike of 700, but it's in another time frame and I think it might be related to a logserver restart)
High rate of policy deny (global counter: flow_policy_deny): THIS. See below.
High rate of Fragments received (global counter: flow_ipfrag_recv): rate here is consistently low, peaks at around 200, but it's mostly in the 10-20 range
A sudden spike in CTD slowpath packets received will cause the device to backlog processing and cause high CPU (global counter: ctd_pkt_slowpath): I can't tell how many is too much, but definitely no spikes here around the issue (rate in the 5-8k zone)
Packet rate is spiking (pkt_rcv and/or pkt_sent counters) caused by possible attack (DoS or zone protection could mitigate this): this one, too.
Spike in traffic that's not being offloaded: can't tell very much about it...
Spike in SMB traffic: unlikely in that timeframe, offices are already closed, backups aren't running yet. Same applies for bandwidth utilization, it wasn't a busy time.

Nonetheless, I have a spike in policy deny and a spike in packets received. By "spike" I mean an increase of at least 10x the baseline of the surrounding samples:

:flow_policy_deny 1314 2795
:flow_policy_deny 1293 2727
:flow_policy_deny 1376 2813
:flow_policy_deny 145239 416157
:flow_policy_deny 1614 3260
:flow_policy_deny 1251 2673
:flow_policy_deny 1181 2556

:flow_np_pkt_rcv 10064 21412
:flow_np_pkt_rcv 10184 21485
:flow_np_pkt_rcv 12655 25879
:flow_np_pkt_rcv 158115 453051
:flow_np_pkt_rcv 24546 49587
:flow_np_pkt_rcv 22002 47012
:flow_np_pkt_rcv 22615 48950

Our zone protection profiles are quite lax, by choice and lack of ability to handle them properly, so, unless the drops are due to the firewall "losing" all the internal networks, thus being unable to associate a destination zone, and ending up dropping the sessions (i.e. a closed loop cause/effect reference), a short-burst style DoS might be the case. Unluckily, the event already rolled out of my main log server, and I'm unable to investigate the traffic logs properly. I can see a lot of drops coming from an IP in my auto-populated blacklist, which includes PA's high-risk IP list. If I take another route within logs, I can see drops for a single IP reaching the export log limit of 65535 rows in just 2 minutes.

I think I will have a long chat about those zone protection profiles with the rest of my team. Thank you.

Networking or Dataplane apparent reload (5200 w/10.1)