Status Red for log-collector-es-cluster health in M600


L0 Member

After upgrading from 9.1.8 to 9.1.12-h3, the ElasticSearch cluster status changed to Red on one of the M600 log collectors and to no status shown on the other M600 collector, and logs stopped coming into Panorama.

 

4 REPLIES

Cyber Elite

Thank you for the post @Mostafavi_DWR

 

In PAN-OS 9.1.12 there is no known issue for ElasticSearch. When the status is red, there is not much you can do. I would give the log collector a reboot. If the issue continues after the reboot, I would generate a tech-support file and open a ticket.

 

For the second issue, do you mean that the log collectors are not showing a status under Panorama > Managed Collectors? As a next step, I would check the logs on the log collector with "tail lines 200 mp-log ms.log" to see whether they give more information. After the upgrade the log collector should try to connect to Panorama on TCP/3978. If the status is not showing properly on Panorama, either something is preventing the connection or the log collector is not initiating it. A minimal sketch of those checks is below.
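If it helps, a rough sequence of those checks run from the log collector CLI (the prompt shown here is just a placeholder) would be:

(log-collector)> tail lines 200 mp-log ms.log
(log-collector)> show log-collector-es-cluster health

If ms.log shows repeated connection attempts, verify that TCP/3978 is open between the log collector and Panorama.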

 

Kind Regards

Pavel 

Help the community: Like helpful comments and mark solutions.

L0 Member

If the OP's situation is like mine, the Panoramas (primary-active and secondary-passive) are in Panorama mode, meaning they function as both Management and Log Collectors in what might be the only Collector Group. We have a pair of M-600s, and whenever we upgrade the Panoramas the ES cluster goes red for 6-24 hours.

I think (and still need to test this) the best way to avoid this is to stop the LCs on the Panoramas from taking in logs from the managed firewalls just before you begin the upgrade on the secondary Panorama. The ES cluster appears to go red after the upgrade because the ES database on each Panorama's LC ends up out of sync (for lack of a better word). If there is a way to disconnect the managed firewalls' connections to the LCs, that is likely best overall; the logs will queue locally on the firewalls until they can reconnect to the LCs and forward them on.

We've had 3-4 TAC cases about this, and about the best advice they could provide was: don't wait long to begin the upgrade on the primary Panorama once the secondary is done. The shorter the period in which the ES cluster's nodes are out of sync, the fewer shards they have to process to get back in sync (green). There seems to be very little information out there about the ES cluster and issues like this. Here's my current ES cluster health:

(primary-active)> show log-collector-es-cluster health

{
"cluster_name" : "__pan_cluster__",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 4,
"active_primary_shards" : 2506,
"active_shards" : 4886,
"relocating_shards" : 0,
"initializing_shards" : 47,
"unassigned_shards" : 83,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 97.40829346092504

L0 Member

Good to know. I had to upgrade the passive Panorama during working hours to get management's green light for upgrading the active Panorama and the LCs during the night, so the active Panorama upgrade started about 8 hours after the passive Panorama upgrade completed. However, in my case the ES cluster was struggling to come up on one of the log collectors (the output of "show log-collector-es-cluster health" was empty, later the CLI was sometimes hanging, and once it even crashed). In the end it took about 2 hours for the ES cluster to come up on that LC ("show log-collector-es-cluster health" finally worked and showed the expected output), and at that point the firewall logs started coming into Panorama. In my case all FWs got connections to the LCs immediately after the upgrades, according to "show logging-status device xxx". TAC says the ES cluster status should change to green once "active_shards_percent_as_number" reaches 100%. I guess I have to wait another 10 hours for that.
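For anyone else watching a slow recovery, something along these lines (the serial number is just a placeholder for your own firewall's serial) shows the progress from the Panorama CLI:

(primary-active)> show log-collector-es-cluster health
(primary-active)> show logging-status device 001234567890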

L1 Bithead

Hi @Mostafavi_DWR,

 

We have seen this issue in our environment multiple times. Each time, TAC would say to upgrade the LC or reboot the hardware, but we eventually found that the ElasticSearch process kept restarting. You can check it using "show system software status | match elasticsearch". To fix it you can use "debug elasticsearch es-restart option all". Once you restart it, it may take 5 to 10 minutes for logs to show up and 10 to 15 minutes for the log collector status to turn green. As a final option, simply restart the log collectors, or, if Panorama is used as an LC, restart the Panorama. I am assuming that all the necessary ports are already open, so we'll not go into that. I hope this helps.
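In other words, a rough sequence based on the steps above (the prompt is a placeholder and the timings will vary) would be:

(log-collector)> show system software status | match elasticsearch
(log-collector)> debug elasticsearch es-restart option all
(log-collector)> show log-collector-es-cluster health

If the elasticsearch process still keeps restarting after that, rebooting the log collector (or the Panorama, when it is also acting as an LC) is the fallback.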

 

Thanks,

Kamal Modi
