Logging on Panorama has evolved significantly over the years, with consistent upgrades to the underlying frameworks and utilities that support logging and reporting. This evolution has also brought significant changes in the default behavior of Logging clusters. This article explains the most significant of these changes, the quorum requirements of the Logging cluster, highlights design use cases, and provides best practices to follow when designing Log Collector Groups.
Panorama started using Elasticsearch as far back as PAN-OS 8.0 to improve logging and reporting performance. Since then, Elasticsearch has been upgraded three times, each time not only to improve and optimize performance but also to add features such as secure logging channels and optimized indexing.
| PAN-OS Version | Elasticsearch Version |
| --- | --- |
| 8.0, 8.1 | 2.2 |
| 9.0, 9.1, 10.0 | 5.6 |
| 10.1, 11.0 | 6.8 |
| 11.1, 11.2 | 7.17 |
One thing, however, has remained constant through all these versions: quorum-based decision making. Before understanding quorum, we must first understand the main reason it was introduced: the split-brain problem.
The split-brain problem is a critical issue in distributed systems, including Elasticsearch, where a network partition or a system outage causes segments of the cluster to become isolated. Each isolated segment may independently elect a master node, leading to multiple masters within the same cluster. This can result in data inconsistencies, conflicts, and potential data loss once the partition is resolved.
Quorum-based decision making was introduced to avoid exactly this issue.
A “quorum” is the minimum number of members in a group that can form a majority and make a decision. In the case of Elasticsearch, this means the minimum majority of members in a cluster that can, for example, elect a new master.
If the cluster is split into two partitions, neither partition may have enough master-eligible nodes to achieve quorum. Each partition will be unable to elect a master, resulting in two isolated segments that cannot coordinate.
If several master-eligible nodes fail simultaneously, the remaining nodes may not be able to form a quorum, leading to the same issues described above.
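Both failure scenarios reduce to the same majority check. The following is an illustrative sketch (the function name is hypothetical, not part of Elasticsearch or PAN-OS):

```python
def can_elect_master(partition_size: int, total_masters: int) -> bool:
    """True only if a partition holds a strict majority (a quorum)
    of the master-eligible nodes in the cluster."""
    return partition_size >= total_masters // 2 + 1

# A 4-node cluster split into two 2-node partitions: the quorum is 3,
# so neither side can elect a master and the cluster stalls.
print(can_elect_master(2, 4))  # False (true for both partitions)

# A 3/1 split of the same cluster: the 3-node side keeps the quorum.
print(can_elect_master(3, 4))  # True
```

Because the majority is strict, at most one partition can ever satisfy the check, which is precisely what rules out two simultaneous masters.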
To prevent split-brain scenarios, Elasticsearch introduced the setting “discovery.zen.minimum_master_nodes” in version 1.0.0, through which the quorum requirement could be defined. This had to be set manually during initial cluster formation and updated every time the set of master-eligible nodes changed.
The best practice for configuring this setting was n/2 + 1, where n is the total number of master-eligible nodes in the cluster.
In version 7.x, Elasticsearch introduced a major change: the quorum is now automatically calculated and maintained by the master node in the cluster.
With this change, the quorum for the Elasticsearch cluster was automatically calculated as n/2 + 1, where n is the total number of master-eligible nodes in the cluster.
However, Elasticsearch also mandated that the number of voting members always be odd. This means that if a cluster has an even number of members, the quorum is calculated on one less than the total number of members. The formula in that case is (n-1)/2 + 1, where n is the total number of master-eligible nodes in the cluster.
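The two calculation rules can be compared side by side with a short sketch (illustrative only; Elasticsearch performs these calculations internally, and the function names are hypothetical):

```python
def quorum_pre_7x(n: int) -> int:
    # Elasticsearch < 7.x best practice: strict majority of all
    # master-eligible nodes, n/2 + 1 with integer division.
    return n // 2 + 1

def quorum_7x(n: int) -> int:
    # Elasticsearch 7.x keeps the set of voting nodes odd-sized:
    # with an even n, one node is excluded from voting, so the
    # majority is computed over n - 1 nodes: (n-1)/2 + 1.
    return n // 2 + 1 if n % 2 == 1 else (n - 1) // 2 + 1

for n in (3, 4, 5, 6):
    print(n, quorum_pre_7x(n), quorum_7x(n))
```

For odd cluster sizes the two rules agree; for even sizes the 7.x rule needs one node fewer (for example, 2 instead of 3 for a 4-node cluster), which matters for the two-site designs discussed later in this article.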
In the PAN-OS and Panorama context, the Log Collector Group is the Elasticsearch cluster, where every Log Collector in the group is a master-eligible member of the cluster and hence, participates in the quorum every time the cluster state changes.
In simple terms, if a change in the cluster state leads to one of the failure scenarios described in the previous section, the Log Collector Group becomes non-operational until enough members of the cluster are back online to form the quorum once again.
During the time that the quorum is not met, the Collector Group can neither index newly forwarded logs nor process search queries.
It was only in PAN-OS version 10.0 that we announced a change in the default behavior of the Log Collector Group: the minimum number of Log Collectors required for a Collector Group to be operational is based on the formula n/2 + 1, where n equals the total number of Log Collectors in the Collector Group.
For example, if you configure a Collector Group with three Log Collectors, a minimum of two Log Collectors are required for the Collector Group to be operational.
So, up until PAN-OS version 11.0, the Collector Group itself calculates and configures the “discovery.zen.minimum_master_nodes” setting every time a Log Collector is added to or removed from the Group.
The behavior of a Collector Group with only one Log Collector is straightforward. If the Log Collector goes down, there is no remaining node to take on the master role and the Collector Group becomes non-operational. This applies to all PAN-OS versions.
Generally, in a 2-node cluster, there is a master node and a data node.
If the master node goes down, the whole cluster becomes non-operational: neither will new logs be indexed, nor will search queries be processed.
If, instead, the master is up and the data node goes down, the master can still index the logs forwarded to it and respond to search requests. Search queries, however, may not yield consistent results unless Log Redundancy is enabled on the Collector Group.
Of course, there is no way to predict whether it will be the master or the data node that fails, so it is best to expect unpredictable behavior until the failed node is brought back up.
Two-node clusters are supported and handled in slightly different ways in different PAN-OS versions.
The quorum requirements and failure scenarios apply according to the underlying Elasticsearch version.
More specifically, in versions 10.2.10+ and 11.0.5+, we introduced a workaround to overcome the unpredictability of the failure scenario discussed earlier.
In this case, when one node goes down, the Collector Group updates the “discovery.zen.minimum_master_nodes” setting to 1 and assigns the master role to the node that is still functional. When the failed node is back online, the Collector Group resets the setting to 2, as it was earlier.
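The workaround can be pictured as a small control step that runs whenever node health changes. This is an illustrative sketch only, not actual PAN-OS code; the function and node names are hypothetical:

```python
def minimum_master_nodes(healthy_nodes: list[str], total_nodes: int = 2) -> int:
    """Sketch of the 10.2.10+/11.0.5+ two-node workaround: the value the
    Collector Group would write to discovery.zen.minimum_master_nodes."""
    if total_nodes == 2 and len(healthy_nodes) == 1:
        # One node down: lower the quorum to 1 so the surviving node
        # can hold the master role and keep the Group operational.
        return 1
    # All nodes healthy: restore the default n/2 + 1 quorum.
    return total_nodes // 2 + 1

print(minimum_master_nodes(["lc1", "lc2"]))  # 2 (normal operation)
print(minimum_master_nodes(["lc1"]))         # 1 (one node down)
```

The key point is that the setting is temporarily lowered and then restored, which is only possible because Elasticsearch 6.8 still exposes it for manual control.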
In PAN-OS 11.1, Elasticsearch has been upgraded to 7.17, which removed the setting that allowed us to manually manage the quorum. The workaround applied in versions 10.2 and 11.0 therefore does not apply to PAN-OS versions 11.1 and higher, and the behavior falls back to the original behavior explained in the Ensuring Quorum section.
Standard behavior applies in this case as per quorum requirements.
While standard behavior as per quorum requirements applies in this case too, there is a slight deviation for Collector Groups with an even number of Log Collectors. As explained in the Ensuring Quorum section, with Elasticsearch 7.17 (which applies here) the quorum for a Collector Group with an even number of Log Collectors is calculated as (n-1)/2 + 1, where n is the total number of master-eligible nodes in the cluster and the division is an integer division. For example, if there are 4 Log Collectors in the Group, the required quorum is now 2:

(4-1)/2 + 1 = 2
This differs from previous versions, where the quorum required for a 4-node cluster would be n/2 + 1 = 3. The lower requirement is advantageous in use cases where Log Collectors need to be deployed equally across two locations and the network-partition failure scenario needs to be handled.
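The difference for the two-site, four-collector design can be verified with quick integer arithmetic (a sketch; variable names are illustrative):

```python
n = 4                          # Log Collectors, e.g. two per site

# PAN-OS 10.1/11.0 (Elasticsearch 6.8): majority over all n nodes.
old_quorum = n // 2 + 1

# PAN-OS 11.1+ (Elasticsearch 7.17): even-sized cluster, so the
# majority is taken over n - 1 voting nodes.
new_quorum = (n - 1) // 2 + 1

print(old_quorum, new_quorum)  # 3 2
```

With a quorum of 2 instead of 3, losing one entire two-collector site no longer drops the surviving site below the quorum, whereas under the old formula it would.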
Understanding the evolution of quorum-based decision making in Elasticsearch is crucial for designing stable and resilient clusters. By following best practices and leveraging the features introduced in newer versions, users can ensure their Panorama Logging deployments are robust and capable of handling various operational challenges.