Posting here as a final act of desperation. I am still pursuing the case with TAC, but I hope that perhaps someone here will have seen something like this before.
Our network resides in a hosted ESXi environment provided by iLand. We have an HA pair of VM-Series firewalls with a public IP on the outside interface, a single iLand gateway as our 0/0 default route, and a number of VLANs all terminated on the FW in question, with the Palos acting as GWs for the corresponding subnets. We have three such hosted data centers (all hosted by iLand) spread across the US: two running VM-300s and one running a VM-100.
One of these DCs (one of the VM-300 sites) is having a series of interesting issues.
1. We are seeing traffic from within a subnet hit our FW and get dropped. In other words, traffic from Host A (10.0.0.1/24) going to Host B (10.0.0.2/24) somehow hits the GW and is dropped, which is expected behavior for traffic that should never have arrived there. Based on debugs, we have determined the drop reason to be: flow_rcv_dot1q_tag_err=XXXXXXX. Packet info: len 64 portX tag X interface X. IP: 10.0.0.1->10.0.0.2, protocol 6. Apologies for the redactions (the IPs are just an example). Bottom line: these two hosts are in the same VLAN, and the packet is unicast with the appropriate VLAN ID (we took a pcap to verify). It should have gone host to host through the virtualized network, not through the GW, and the destination MACs are not the MAC of the GW.
2. An important note on the above problem: it only happens occasionally. We never see full sessions dropped this way, usually just bits and pieces of sessions, a few packets here and there. Which brings me to my next problem: I am seeing a consistent 1-2% packet loss throughout this data center (and this data center only), primarily as traffic traverses our Palos, but occasionally also between hosts in the same VLAN. I suspect the two issues may be intertwined, but I'm not sure what would cause so many VMs to have their traffic sent to their GW instead of to its intended destination.
a. As part of troubleshooting this issue, I took a pcap at both the GW and one of the sending hosts, and initiated an ICMP stream from the host to the GW. Result: 2% packet loss, but most concerning, the lost packets never arrived at the FW, and none of our drop counters incremented. The packets leave the host but do not arrive at the GW, so all that's left between the two is the ESXi environment and/or any physical switches provided by iLand if the two VMs are not on the same host. This sounds like two separate issues, and yet I can't help but feel it's all at least slightly related.
b. Note that we recently added a large number of VLANs, so I'm not sure whether that could be related. We are still under our limits for sessions, zones, etc., so I doubt this is a resource limitation, but it's still very strange.
3. Lastly, we have one additional issue that is happening at all three DCs. I don't think this one is related to the other two, but other community members may see it differently. The three Palo clusters were recently set to not use the hypervisor-assigned MAC address, and all interfaces were set to promiscuous mode with MAC spoofing protection disabled at the ESXi level. This was done so that during HA failover, traffic destined for the internet would fail over more seamlessly: our Palos' upstream GW (iLand-provided) could not update its ARP table quickly enough to fail over seamlessly without the same MAC address being in place on the outside interfaces of both HA members. What we are seeing as a result is duplicate packets in some parts of our network, usually as they traverse the Palos, and I'm not sure why. One suggestion I've heard is to enable promiscuous mode only on the HA NICs. Sometimes the duplicate packets go from the Palo inside interface to a host, sometimes host to host, etc.; it's fairly inconsistent where and when it happens.
a. At one point, we thought it was an issue with having both members of the HA pair on the same vHost, so we put in an anti-affinity rule; at first this worked, but then the issue came back. Since then, we've determined that moving the two VMs onto the same host and then separating them again clears the condition, but only temporarily; it always comes back.
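For anyone who wants to reproduce the capture analysis from items 1 and 2a at scale, here's a rough sketch of how we've been scripting it. The helper names and toy data are mine (not anything official), and it assumes you've already exported fields from the pcaps (e.g. with tshark) into plain lists:

```python
# Sketch: two checks against exported pcap fields.
# 1) Flag packets seen AT the firewall whose src and dst are in the same
#    subnet -- traffic that should have stayed host-to-host.
# 2) Cross-reference ICMP echo sequence numbers between a host-side and a
#    firewall-side capture to find packets that vanished in transit.
import ipaddress

def intra_subnet_hits(records, prefix_len=24):
    """records: iterable of (src_ip, dst_ip) strings from the FW capture.
    Returns pairs where dst falls inside src's subnet (assumed /24)."""
    hits = []
    for src, dst in records:
        src_net = ipaddress.ip_network(f"{src}/{prefix_len}", strict=False)
        if ipaddress.ip_address(dst) in src_net:
            hits.append((src, dst))
    return hits

def missing_in_transit(host_seqs, fw_seqs):
    """ICMP seq numbers captured host-side that never appeared FW-side."""
    return sorted(set(host_seqs) - set(fw_seqs))

if __name__ == "__main__":
    # Toy data standing in for real capture exports.
    pkts = [("10.0.0.1", "10.0.0.2"),   # same /24 -- shouldn't hit the GW
            ("10.0.0.1", "10.1.0.5")]   # routed traffic, expected at the GW
    print(intra_subnet_hits(pkts))       # [('10.0.0.1', '10.0.0.2')]

    host = list(range(1, 101))                       # host sent seqs 1-100
    fw = [s for s in host if s not in (17, 52)]      # FW capture missing two
    print(missing_in_transit(host, fw))  # [17, 52] -> ~2% lost pre-firewall
```

This at least lets us say whether the loss happens before the firewall (seq present host-side, absent FW-side, drop counters flat) versus on it, which is what points us at the ESXi/underlay layer.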
Apologies for the flood of info, but I genuinely am not sure which of these issues are related and which aren't. I joined the company after troubleshooting had begun, so I'm not sure how far back some of these stretch. If anyone has any questions, feel free to ping me.