Hello and welcome back to PANCast. Today is all about the dataplane, the heart of Palo Alto Networks' Next-Gen Firewall, and why dataplane CPU can go high. So first, what exactly is the dataplane CPU, commonly shortened to DP CPU? Palo Alto Networks firewalls separate the management plane from the dataplane. While the management plane takes care of all the management functions like configuration, logging and routing, the dataplane is what handles the actual traffic passing through the firewall. It handles all the security processing on the device, so it is obviously quite important. Things like URL filtering, threat prevention and App-ID are all handled by the dataplane. All Palo Alto Networks firewalls, whether they are physical or virtual, have one or more dataplanes, and these dataplanes are further broken down into a number of actual processing cores. Each firewall model therefore has different processing power.
What is considered high DP CPU? That is a good question and one that is sometimes difficult to answer. There are probably two scenarios that could cause concern. One is if you are seeing impact on your traffic, which can range from some minimal latency when traversing the firewall all the way to latency combined with packet loss.
The second is the overall DP CPU in terms of average and maximum during your business or production hours. You may not be seeing any impact, but if your firewall is generally running at 80% or higher, then it is something you will want to investigate. At this level there is not a lot of room to grow. Let's say a new application is rolled out to your users, or you are adding a new office with an additional 50 users in the near future. In these situations, the increased traffic load may be the difference between the firewall handling the traffic comfortably and users suddenly seeing some impact.
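To make that headroom check a bit more concrete, here is a minimal Python sketch. It assumes you have already exported DP CPU percentages sampled during business hours, for example from your monitoring graphs. The function name, the sample values and the choice to apply the 80% rule of thumb to the average are all illustrative assumptions, not a Palo Alto Networks tool or formula.

```python
from statistics import mean

# Hypothetical helper: summarize business-hours DP CPU samples (percent).
# The 80% default reflects the rule of thumb mentioned in this episode.
def dp_cpu_headroom(samples, threshold=80.0):
    if not samples:
        raise ValueError("need at least one sample")
    avg = mean(samples)
    peak = max(samples)
    return {
        "average": avg,
        "maximum": peak,
        "headroom": 100.0 - peak,        # room left at the busiest moment
        "investigate": avg >= threshold,  # generally running at/above 80%
    }

# Made-up samples for illustration only.
report = dp_cpu_headroom([78, 82, 85, 81, 84])
print(report)
```

With numbers like these the firewall is still passing traffic, but there is very little room left for a new application or a new office.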
Before we look at how to troubleshoot high DP CPU, let's dig a bit deeper into what the dataplane is actually doing. Anything to do with processing traffic through the firewall is handled by the dataplane. Obviously things like threat inspection are done by the dataplane, but it is also doing things like security policy lookup and NAT policy lookup. It is doing the lookups for URLs and custom URLs to see if the traffic should be allowed or not. It is sending traffic logs to the management plane. It is performing SSL decryption if enabled. It is keeping track of all the sessions through the firewall and their current states. It is actually doing a lot of different functions.
So, how do we troubleshoot high DP CPU? The first thing is to look at whether anything has changed: is the high DP CPU something that has slowly built up over time, or is it out of character for the firewall and possibly an unexpected event? If you have SNMP monitoring on the device and have historical graphs, then this is the best place to start. If you don't, there are a couple of other things we can do. If you are using AIOps for NGFW, then this will also have historical CPU data. If you aren’t and would like further information, then check it out on the Palo Alto Networks website. We may cover AIOps in detail in a future episode, but for now, if you are not using it, have a look, as it can offer some great monitoring tools for your devices. By using either SNMP or AIOps you should be able to see if the increase in CPU is related to another change, for example the packet rate today is significantly higher than on previous days, or the number of sessions on the firewall is much higher today.
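As a rough illustration of that "is today out of character?" comparison, here is a small Python sketch. It assumes you can export a daily figure such as peak session count or packet rate from your monitoring system. The function name, the made-up numbers and the three-standard-deviation cutoff are illustrative assumptions only; pick whatever baseline method suits your environment.

```python
from statistics import mean, stdev

# Hypothetical helper: flag today's value if it sits well above the
# historical baseline. The 3-sigma cutoff is an arbitrary illustration.
def is_out_of_character(history, today, sigmas=3.0):
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    baseline = mean(history)
    spread = stdev(history)
    return today > baseline + sigmas * spread

# Made-up daily peak session counts for the previous week.
previous_days = [41000, 39500, 40200, 42100, 40800, 39900, 41500]
print(is_out_of_character(previous_days, 41800))  # close to the usual range
print(is_out_of_character(previous_days, 95000))  # far above the baseline
```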
If you don't have either SNMP or AIOps then we need to look at some data on the firewall itself. Let’s start with looking at what the current load is and what it has been recently.
From the CLI, we start with the command "show running resource-monitor", as this gives us an idea of the DP CPU usage for different periods, for example minutes, hours and days. This is where we can check if the CPU is high now or was high earlier, and also what it is normally expected to be. These stats show the average and maximum for each period, but over longer periods the values are averaged, so small spikes may not show up. SNMP and AIOps are better for getting granular stats, but we can still get an idea from "show running resource-monitor".
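To see why an average over a longer period can hide a short spike, here is a tiny Python sketch using made-up per-second samples. It does not parse any firewall output; the numbers are purely illustrative.

```python
from statistics import mean

# Made-up per-second DP CPU samples for one minute: mostly 30%,
# with a 3-second burst at 100%.
per_second = [30] * 57 + [100] * 3

minute_average = mean(per_second)
minute_maximum = max(per_second)

print(f"average over the minute: {minute_average:.1f}%")
print(f"maximum over the minute: {minute_maximum}%")
```

The average barely moves even though the dataplane was fully busy for three seconds, which is why it is worth looking at the maximum as well and at more granular data where you have it.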
We can also check in the ACC on the web UI. The ACC which is short for Application Command Center gives details of traffic through the firewall. You can check here to see if there has been a change in the normal load the firewall is seeing.
Going back to my earlier comment, if this is a gradual increase and you have been seeing the same level for some time then it is probably more a case that this is the normal load for the firewall and you may need to do some capacity planning for the future. For the case where the CPU has been normal but you see a spike now, let's discuss what you can look for.
Back on the CLI you can use a similar command to before, which is "show running resource-monitor ingress-backlogs". This gives a point-in-time snapshot of sessions consuming a lot of resources on the firewall. You can run this command a few times over a short period and see if it is one session that always shows up or multiple sessions. You can then check further on the session or sessions listed.
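If you note down the session IDs from each run, a few lines of Python can tell you which ones keep coming back. This is only a sketch: the helper name and the session IDs are made up, and it assumes you have collected the IDs yourself rather than parsing the command output.

```python
from collections import Counter

# Hypothetical helper: count in how many snapshots each session ID appears.
def recurring_sessions(snapshots):
    counts = Counter()
    for snapshot in snapshots:
        counts.update(set(snapshot))  # count each ID once per snapshot
    return counts.most_common()

# Made-up session IDs noted from three runs of the command.
runs = [
    [1001, 2002],
    [1001, 3003],
    [1001, 2002, 4004],
]
for session_id, seen in recurring_sessions(runs):
    print(f"session {session_id}: seen in {seen} of {len(runs)} snapshots")
```

A session that appears in every snapshot is the first one to look at more closely.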
And back on the ACC you can also look at things like top talkers for either source or destination as well as top applications to see what could be the cause of the sudden increase. Again if this is an abnormal event the key is to try and find what is different that could be causing the high CPU.
So these are some high level checks you can do to try and find the cause but in some cases it may not be that simple to find and some additional data needs to be collected. Troubleshooting high DP CPU can be complex at times as there are a lot of different causes.
Let's talk about some common causes. Not all traffic is equal. Some traffic requires more processing and a common example is SMB due to the nature of the protocol. This means the amount of SMB traffic in your network could impact the amount of processing required. Things like IPSec tunnels or GRE tunnels on the firewall also add load as there is now also a decryption/encryption or encapsulation/decapsulation component required. Along the same lines SSL decryption adds more overhead to the dataplane.
Now, what can be done once you find the cause? It really depends on the root cause. If it is traffic that perhaps should not be reaching the firewall, you’ll need to try and work out how to stop this. If it is legitimate traffic, there are some workaround options, like an app override. Please be aware that by doing this, you are disabling inspection on that traffic — so, this needs to be acceptable for your environment and should only really be done for trusted traffic and only as a temporary workaround. If the firewall isn’t a high enough spec to deal with your normal traffic load then it gets back to capacity planning.
One last thing I want to talk about is expected performance. I have often been asked how much throughput a particular model should be able to handle. While we have performance stats listed in our datasheets, this is not a simple question to answer. Hopefully from some of the things we have discussed you now know the types of processing the dataplane does. The amount of processing required will vary depending on quite a lot of factors. As an example, it is possible that, say, a PA-3220 in two different networks with similar basic throughput can be running at very different CPU loads.
Let's take a look at why. Let's say Firewall A has a pretty basic config. It has 10 security policies, no NAT, and not all traffic is being inspected. For Firewall B we have over 1000 security policies, a lot of NAT being applied, and all traffic is being inspected. Not only this, but Firewall B is also doing SSL decryption and has some large SMB file transfers being done hourly. Looking at the throughput in Mbps, both firewalls are pretty much the same, yet the DP CPU for Firewall B is significantly higher. I hope you can now appreciate why this would be the case and why a lot of factors determine how much processing is required for each firewall.
Troubleshooting DP CPU can sometimes be quite challenging but in a lot of cases there is a reasonable explanation. I hope the information in this episode has given you some further insights into why you would see high DP CPU and where to start troubleshooting.
If and when you get that alert for high DP CPU or you have complaints of slow performance, remember the following:
And that’s a wrap on another PANCast. As always I hope you enjoyed listening and more importantly you got some useful information out of today’s episode!
Check out the full YouTube playlist: PANCast: Insights for Your Cybersecurity Journey.