I am reaching out for help and advice with a bizarre issue I have encountered.
I am running a Palo Alto VM-300 on version 9.0.14 (the issue was observed on 9.0.11 as well) in the Azure cloud.
The VM interconnects several subnets: Prod, Dev, Untrust, Outside, etc.
I noticed that we were receiving alarms about some devices in Prod not being reachable, so I carried out a ping test originating from the Untrust interface to Prod (rules are in place to allow this).
Ping was indicating around 91% packet loss, with response times at times over 3000 ms where they would usually be around 17 ms.
I then carried out a traceroute, and this is where I noticed something strange. Apart from intermediate hops not responding to ping, it completed, but it listed the end host I was checking several times in the output:
1 <1 ms <1 ms <1 ms 10.10.32.1
2 <1 ms <1 ms <1 ms 10.20.2.150
3 * * * Request timed out.
4 * * * Request timed out.
5 * * * Request timed out.
6 * * * Request timed out.
7 * * * Request timed out.
8 * * * Request timed out.
9 * * * Request timed out.
10 * * * Request timed out.
11 * * * Request timed out.
12 * * * Request timed out.
13 * 531 ms * 10.17.0.29
14 * * * Request timed out.
15 * * * Request timed out.
16 * * * Request timed out.
17 * * * Request timed out.
18 * 1042 ms * 10.17.0.29
19 * * * Request timed out.
20 * * * Request timed out.
21 * * * Request timed out.
22 * * * Request timed out.
23 * * 971 ms 10.17.0.29
10.17.0.29 is the VM the alerts were raised for and whose reachability I was checking; other servers in that same subnet responded to ping with usual response times and without drops.
I managed to resolve the issue (or rather mitigate it, as I am not sure of the root cause) by setting the Prod interface to a down state, committing the change, and then bringing the interface back online.
As mentioned, this is running in Azure, and the Prod interface is configured to use User Defined Routing (UDR) in order to allow multiple subnets to be connected to the same interface (at the moment Prod has only one VM subnet attached).
PA-VM - Eth1/1 - Prod: 10.17.4.254, with its default route pointing to 10.17.4.1 (this is the User Defined Routing). The 10.17.0.0/24 network then has a default route pointing at 10.17.4.254.
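For anyone unfamiliar with the setup, a UDR like the one described above can be sketched with the Azure CLI roughly as follows. The resource group, route table, VNet, and subnet names here are made up for illustration, not our actual ones:

```shell
# Hypothetical names: myRG / prod-udr / prod-vnet / vm-subnet.
# Create a route table and a default route that sends the
# 10.17.0.0/24 subnet's traffic to the firewall's Prod interface
# (10.17.4.254) as a virtual appliance next hop.
az network route-table create \
    --resource-group myRG \
    --name prod-udr

az network route-table route create \
    --resource-group myRG \
    --route-table-name prod-udr \
    --name default \
    --address-prefix 0.0.0.0/0 \
    --next-hop-type VirtualAppliance \
    --next-hop-ip-address 10.17.4.254

# Associate the route table with the VM subnet:
az network vnet subnet update \
    --resource-group myRG \
    --vnet-name prod-vnet \
    --name vm-subnet \
    --route-table prod-udr
```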
I inherited this network, and it is more complex than I have managed to convey here, but the facts remain: I was seeing a loop, and an odd one at that (I suspect the drops were down to packets expiring with TTL exceeded in transit and being discarded by the firewall), and the fix was simply restarting a network interface.
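To illustrate the TTL-expiry suspicion, here is a hypothetical model of the traceroute above: suppose hops 3-12 are silent (no ICMP replies) and the packet then enters a 5-hop forwarding loop in which only 10.17.0.29 answers. The VM would then show up exactly at TTLs 13, 18, 23 with timeouts everywhere else, and most echo probes would die with TTL exceeded, which matches the symptoms. The hop layout is an assumption, not a confirmed topology:

```python
# Assumed topology: hops 1-2 reply, hops 3-12 are silent, then a
# 5-hop loop in which only the destination VM (10.17.0.29) answers.
PRE = ["10.10.32.1", "10.20.2.150"] + ["*"] * 10   # hops 1-12; "*" = no reply
LOOP = ["10.17.0.29", "*", "*", "*", "*"]          # hypothetical 5-hop loop

def hop_at(ttl):
    """Address that answers a probe with this TTL, or "*" for a timeout."""
    if ttl <= len(PRE):
        return PRE[ttl - 1]
    return LOOP[(ttl - len(PRE) - 1) % len(LOOP)]

# TTLs at which traceroute gets any reply under this model:
responders = [t for t in range(1, 24) if hop_at(t) != "*"]
print(responders)   # [1, 2, 13, 18, 23]
```

The model reproduces the observed pattern (replies at hops 1, 2, 13, 18, 23, timeouts elsewhere), which is consistent with a loop of roughly five hops somewhere past the firewall.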
We had seen a similar issue on the Dev interface some time ago, when the firewall was running 9.0.11, and decided to upgrade to 9.0.14 to rule out a bad update of some sort.
We raised a case with Palo Alto support, but it came to nothing, as we are not able to reproduce the issue on request.
I hope someone is able to shed some light on what the cause of this could be.