08-03-2018 07:55 AM
We have migrated our production web infrastructure to run through Palo Alto (previously running through Checkpoint). We have no issues with production traffic, but we are seeing intermittent failures on the health checks between our Child and Parent Bluecoat proxy devices. The health check is purely a TCP connection on port 8000 with no actual data in it, so it shows up as incomplete in PA App-ID. We only seem to get a failure every few hours, with no consistent pattern, so it is very difficult to get a packet capture at the time of failure.
In other words, only this "incomplete" health check traffic is affected. All other client browsing traffic with valid App-IDs passes OK.
Also, the health check failures only occur from the child proxy that is actively sending client traffic to the parent. Health checks from the other child proxy, which is not sending client traffic, do not fail. So it could be related to some type of load, but nothing consistent.
We have turned off all zone protection and threat profiles, and enabled the Content-ID features to forward segments exceeding the TCP App-ID inspection queue and the TCP content inspection queue. We have also created an App-ID override for this traffic, and there is still no change in behaviour. No other obvious drops are seen in the Global Counters or in the logs.
Wondering if anyone has experienced something similar or can think of anything we haven't looked at?
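Since the failures are rare and hard to capture, one option is to run an independent probe that mimics the check and timestamps every failure so it can be lined up with firewall logs or a rolling capture. The following is only a minimal sketch, assuming a plain TCP connect with no payload is a fair stand-in for the proxy health check; the function names `tcp_probe` and `run_probes` are made up for illustration and are not part of any Bluecoat or Palo Alto tooling.

```python
import socket
import time
from datetime import datetime, timezone


def tcp_probe(host, port, timeout=5.0):
    """Open and immediately close a TCP connection.

    Returns (ok, elapsed_seconds, error_text). No payload is sent,
    which is the assumption made here about the proxy health check.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # handshake completed; close straight away
        return True, time.monotonic() - start, None
    except OSError as exc:
        return False, time.monotonic() - start, repr(exc)


def run_probes(host, port, count, interval=1.0):
    """Probe `count` times and return one timestamped line per failure."""
    failures = []
    for _ in range(count):
        ok, elapsed, err = tcp_probe(host, port)
        if not ok:
            stamp = datetime.now(timezone.utc).isoformat()
            failures.append(f"{stamp} FAIL after {elapsed:.3f}s: {err}")
        time.sleep(interval)
    return failures
```

Running `run_probes` from a host on the child proxy side against the parent on port 8000 would give UTC timestamps to search for in the traffic logs. The host, count and interval are whatever suits the environment.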
08-13-2018 02:13 AM
Thanks for all your comments on this. It looks like we may have finally got to the bottom of it.
Just to note, based on previous comments: we definitely were getting a full handshake even though the sessions were showing up as incomplete, so that was just down to the fact that there was no data and the session immediately closed out.
It looks like our issues were caused by the Bluecoat proxies sometimes taking a long time to close out sessions, while the Palo Alto is more aggressive about closing sessions down, particularly in the time-wait and half-closed states. For example, we observed proxies taking longer than 120 seconds to respond in a half-closed state, i.e. after the first FIN-ACK. We ended up increasing the half-closed timer on the Palo Alto, which seems to have stabilized things. We will also either increase the Time-Wait setting on the Palo Alto or reduce the 2MSL setting on the Bluecoat so that the firewalls are not closing sessions prematurely.
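To make the timer interaction concrete, here is a rough model of it. This is only a sketch under stated assumptions: the firewall is assumed to discard a session once its half-closed timer expires without a second FIN, or once its time-wait timer expires after the close. The 120-second half-closed value comes from the observation above; the 15-second time-wait value, the 300-second example and the names `FirewallTimers` and `close_outcome` are illustrative, not actual device configuration.

```python
from dataclasses import dataclass


@dataclass
class FirewallTimers:
    half_closed_s: int  # session lifetime after the first FIN is seen
    time_wait_s: int    # session lifetime after the second FIN or a RST


def close_outcome(timers, peer_fin_delay_s, late_packet_delay_s=0):
    """Classify a session close as the firewall would see it (simplified).

    peer_fin_delay_s:    seconds the slow peer waits before sending its FIN
    late_packet_delay_s: seconds after the close that a trailing packet arrives
    """
    if peer_fin_delay_s > timers.half_closed_s:
        return "dropped: session aged out while half-closed"
    if late_packet_delay_s > timers.time_wait_s:
        return "dropped: session aged out in time-wait"
    return "ok: close completed within firewall timers"


# Illustrative values only: a proxy that takes 130 s to answer the first FIN.
before = close_outcome(FirewallTimers(half_closed_s=120, time_wait_s=15), 130)
after = close_outcome(FirewallTimers(half_closed_s=300, time_wait_s=15), 130)
print(before)
print(after)
```

With these example numbers the first call reports the session as aged out while half-closed, and raising the half-closed timer lets the same slow close complete, which matches what we saw after the change.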
More info on timers:
08-03-2018 08:28 AM
An incomplete for TCP traffic means the TCP 3-way handshake did not complete or there wasn't enough traffic for the PAN to identify the application. In my experience it has been the 3-way handshake almost all of the time. I would check whether anything is getting blocked or dropped. Also, I would not perform SSL decryption on it if you are.
Routing has been an issue for me in the past with incomplete as well.
08-03-2018 08:39 AM
Thanks @OtakarKlier. I can confirm the 3-way handshake is good, as we can see it in a normal capture, but there is no other data in the health check, so we expect the incomplete App-ID. The question is why it fails randomly every now and again. We are not currently running any decryption.
08-03-2018 08:48 AM
Are you seeing a session generated for each health check, or is it running along one session? Depending on how the health check is seen by the firewall, you could be running into an issue where the health check fails once the firewall closes the connection. The firewall has a default session timeout of 3600 seconds for TCP traffic, so you could be seeing a failure when the session closes just as a health check is in progress.
If it's hitting that timeout, you should be able to see that by looking at the session ID through the CLI.
08-03-2018 10:18 AM
If the 3-way handshake completed, you shouldn't see "incomplete" as the application. It should be "unknown-tcp", even if there is no data. At a minimum, you should be seeing around 6 or 7 packets depending on how the TCP session is closed. Are you seeing that many frames in your sessions for these health checks?
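For reference, the 6 or 7 figure can be worked out by listing the packets of a TCP session that carries no data. This is just a small sketch of the standard handshake and teardown sequences; the names `HANDSHAKE`, `CLOSES` and `packet_count` are mine.

```python
# Packets in a data-less TCP session: handshake plus teardown.
HANDSHAKE = ["SYN", "SYN-ACK", "ACK"]

CLOSES = {
    # Each side's FIN is acknowledged separately.
    "four-way": ["FIN-ACK", "ACK", "FIN-ACK", "ACK"],
    # The second side piggybacks its FIN on the ACK of the first FIN.
    "three-way": ["FIN-ACK", "FIN-ACK", "ACK"],
}


def packet_count(close_style):
    """Total packets for a handshake followed by the given close style."""
    return len(HANDSHAKE) + len(CLOSES[close_style])


for style in CLOSES:
    print(style, packet_count(style))
```

A session that shows noticeably fewer packets than this was probably reset or never fully closed as far as the firewall could see.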
If I remember correctly from my years with BlueCoat, there are a couple options for health checks: simple TCP handshake, or full request/response parsing. If you have an option for the more robust checks that may help.
Since you know it's just a health check and you never have to worry about inspecting the traffic, you could also just set up an application override policy for the traffic to ensure you don't even enter layer 7 checks.