Decoding the Dreaded Quorum for Logging with Panorama


Introduction

Logging on Panorama has evolved significantly over the years, with consistent upgrades to the underlying frameworks and utilities that support logging and reporting. With this evolution, we have also seen significant changes in the default behavior of logging clusters. This article explains the most significant of these changes, the quorum requirements of the logging cluster, highlights design use cases, and provides best practices to follow when designing Log Collector Groups.

 

Tryst with Elasticsearch

Panorama started using Elasticsearch as far back as PAN-OS 8.0 to improve logging and reporting performance. Since then, Elasticsearch has been upgraded three times, each time not only to improve and optimize performance but also to add features such as secure logging channels and optimized indexing.

 

PAN-OS Version      Elasticsearch Version
8.0, 8.1            2.2
9.0, 9.1, 10.0      5.6
10.1, 11.0          6.8
11.1, 11.2          7.17

 

However, one thing has remained constant through all these versions: quorum-based decision making. But before understanding quorum, we must first understand the main reason it was introduced: the split-brain problem.

 

What is the Split-Brain Problem?

The split-brain problem is a critical issue in distributed systems, including Elasticsearch, where a network partition or a system outage causes segments of the cluster to become isolated. Each isolated segment may independently elect a master node, leading to multiple masters within the same cluster. This can result in data inconsistencies, conflicts, and potential data loss once the partition is resolved.

 

It was to avoid this issue that quorum-based decision making was introduced.

 

Quorum-based decision making

 

A “quorum” is the minimum number of members in a group that can form a majority and make a decision. In the case of Elasticsearch, this means the minimum majority of members in a cluster that can, for example, elect a new master.

 

Cluster Behavior without Quorum

 

  • No Master Node Election: Without quorum, a new master node cannot be elected. Since the master node is responsible for cluster-wide operations, such as managing the cluster state and coordinating shard allocation, the absence of a master node renders the cluster unable to perform these critical tasks.
  • Cluster Operations Halt:
    • Indexing and Searching: The cluster will not accept new indexing operations. Existing indices remain readable, but the cluster's ability to process search requests might be impaired, especially if the queries require coordination across multiple shards. 
    • Cluster State Updates: Changes to the cluster state, such as adding or removing nodes, cannot be committed, leading to a static and potentially outdated cluster state.
    • Shard Allocation Stalls: Shard allocation and rebalancing operations, which are crucial for maintaining data availability and redundancy, are halted. This means that if nodes go down or new nodes are added, the cluster cannot redistribute shards to maintain optimal data distribution and redundancy.
  • Cluster Instability: The cluster becomes unstable and might enter a degraded state. Nodes may repeatedly attempt to elect a master, increasing resource consumption and degrading performance (one way to observe this state is sketched below).
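
For illustration, the degraded state described above can be observed through Elasticsearch's standard cluster health API. The sketch below assumes direct HTTP access to the REST endpoint on a hypothetical localhost:9200; Panorama manages Elasticsearch internally and does not normally expose this interface, so this is for illustration only.

```python
# Minimal health probe against the Elasticsearch cluster health API.
import json
import urllib.request

def cluster_health(url: str = "http://localhost:9200/_cluster/health") -> dict:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

health = cluster_health()
# "status" is green/yellow/red; a cluster without an elected master will
# typically fail this call entirely or report a red status.
print(health["status"], health["number_of_nodes"])
```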

 

Failure Scenarios Without Quorum

 

Network Partition

If the cluster is split into two partitions, neither partition may have enough master-eligible nodes to achieve quorum. Each partition will be unable to elect a master, resulting in two isolated segments that cannot coordinate.

 

Node Failures

If several master-eligible nodes fail simultaneously, the remaining nodes may not form a quorum, leading to the same issues described above.

 

Ensuring Quorum

 

Before Elasticsearch 7.x

To prevent split-brain scenarios, Elasticsearch introduced the “discovery.zen.minimum_master_nodes” setting in version 1.0.0, which let administrators define the quorum requirement. It had to be set manually during initial cluster formation and updated every time the cluster's membership changed.

 

The best practice for configuring this setting was n/2 + 1 (using integer division), where n is the total number of master-eligible nodes in the cluster.
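
As a minimal sketch of that best practice (the helper function below is illustrative, not part of Elasticsearch), the value can be computed with integer division and then placed into elasticsearch.yml:

```python
# Illustrative helper: derive the pre-7.x quorum setting for elasticsearch.yml.
def minimum_master_nodes(n: int) -> int:
    """Best-practice quorum for Elasticsearch < 7.x: n/2 + 1 (integer division)."""
    return n // 2 + 1

# The line an operator would then set in elasticsearch.yml for a
# three-node cluster:
print(f"discovery.zen.minimum_master_nodes: {minimum_master_nodes(3)}")  # -> 2
```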

 

After Elasticsearch 7.x

In version 7.x, Elasticsearch introduced a major change: the quorum is automatically calculated and configured by the master node in the cluster.

 

With this change, the quorum for the Elasticsearch cluster is automatically calculated as n/2 + 1 (integer division), where n is the total number of master-eligible nodes in the cluster.

 

However, Elasticsearch also mandated that the number of voting members should always be odd. This means that if there is an even number of members in a cluster, the quorum calculation is based on one less than the total number of members. The formula in that case is (n-1)/2 + 1, where n is the total number of master-eligible nodes in the cluster.
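
A short sketch of this calculation, combining both cases (the function name is illustrative, not an Elasticsearch API):

```python
# Illustrative sketch of the Elasticsearch 7.x quorum rule described above:
# even-sized clusters exclude one vote, so the majority is taken over n - 1.
def quorum_7x(n: int) -> int:
    if n % 2 == 0:
        return (n - 1) // 2 + 1   # even cluster: quorum over n - 1 voters
    return n // 2 + 1             # odd cluster: plain majority

for n in (3, 4, 5, 6):
    print(n, "->", quorum_7x(n))  # 3 -> 2, 4 -> 2, 5 -> 3, 6 -> 3
```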

 

PAN-OS Implications

In the PAN-OS and Panorama context, the Log Collector Group is the Elasticsearch cluster: every Log Collector in the group is a master-eligible member of the cluster and hence participates in the quorum every time the cluster state changes.

In simple words, if a change in the cluster state leads to one of the failure scenarios described in the previous section, the Log Collector Group becomes non-operational until enough members of the cluster are back online to form a quorum again.

 

While the quorum is not met:

 

  • New logs may not be indexed, depending on the PAN-OS version and the role of the nodes that are down.
  • Logs that can still be retrieved may not be consistent.

 

It was only in PAN-OS 10.0 that a change in the default behavior of the Log Collector Group was announced: the minimum number of Log Collectors required for a Collector Group to be operational is based on the formula n/2 + 1 (integer division), where n equals the total number of Log Collectors in the Collector Group.

 

For example, if you configure a Collector Group with three Log Collectors, a minimum of two Log Collectors are required for the Collector Group to be operational.

 

So, up until PAN-OS 11.0, the Collector Group itself calculates and configures the “discovery.zen.minimum_master_nodes” setting every time a Log Collector is added to or removed from the Group.

 

Log Collector Design Implications

 

Single-Node Clusters

The behavior of a Collector Group with only one Log Collector is straightforward: if the Log Collector goes down, there is no remaining node to take the master role, and the Collector Group becomes non-operational. This applies to all PAN-OS versions.

 

Two-Node Clusters

Generally, in a 2-node cluster, there is a master node and a data node. 

 

If the master node goes down, the whole cluster becomes non-operational: no new logs are indexed and no search queries are processed.

 

If, however, the master is up and the data node goes down, the master can still index the logs forwarded to it and respond to search requests. Search queries may not yield consistent results, though, unless Log Redundancy is enabled on the Collector Group.

 

Of course, when a node fails there is no way to predict whether it will be the master or the data node, so it is best to expect unpredictable behavior until the failed node is brought back up.

 

Two-node clusters are supported, but they are handled in slightly different ways across PAN-OS versions.

 

PAN-OS versions 10.1 and lower

The quorum requirements and failure scenarios apply as per the underlying Elasticsearch version.

 

PAN-OS versions 10.2 and 11.0

More specifically, in versions 10.2.10 and later and 11.0.5 and later, a workaround was introduced to overcome the unpredictability of the failure scenario discussed earlier.

 

In this case, when one node goes down, the Collector Group updates the “discovery.zen.minimum_master_nodes” setting to 1 and assigns the master role to the node that is still functional. When the failed node is back online, the Collector Group resets the setting to 2.
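
A purely illustrative sketch of the workaround's logic (not Panorama's actual implementation):

```python
# Hypothetical sketch: in a two-node Collector Group, lower the quorum
# setting to 1 while the peer is down so the survivor can hold the master
# role, and restore it to 2 once the peer rejoins.
def desired_minimum_master_nodes(nodes_online: int) -> int:
    return 1 if nodes_online == 1 else 2

assert desired_minimum_master_nodes(1) == 1  # peer down: survivor is master
assert desired_minimum_master_nodes(2) == 2  # peer restored: normal quorum
```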

 

PAN-OS versions 11.1 and higher

In PAN-OS 11.1, Elasticsearch has been upgraded to 7.17, which removes the setting that allowed the quorum requirement to be managed manually. The workaround applied in versions 10.2 and 11.0 therefore does not apply to PAN-OS 11.1 and higher, and the behavior falls back to the standard quorum rules explained in the Ensuring Quorum section.

 

Multi-Node (3+) Clusters

 

PAN-OS versions 11.0 and lower

Standard behavior applies in this case as per quorum requirements. 

 

PAN-OS versions 11.1 and higher

While standard quorum behavior applies in this case too, there is a slight deviation for Collector Groups with an even number of Log Collectors. As explained in the Ensuring Quorum section, Elasticsearch 7.17 (which applies here) forms the quorum for even-sized clusters using the formula (n-1)/2 + 1, where n is the total number of master-eligible nodes in the cluster. For example, if there are 4 Log Collectors in the Group, the required quorum is now 2, as calculated below:

 

(4 - 1)/2 + 1 = 1 + 1 = 2 (integer division)

 

This is different from previous versions, where the quorum required for a 4-node cluster would be n/2 + 1 = 3. The new behavior is advantageous in use cases where Log Collectors need to be deployed equally across two locations and the network-partition failure scenario must be handled.
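
A quick worked comparison of the two rules for a 4-node Collector Group (same sketch formulas as above):

```python
n = 4
pre_11_1 = n // 2 + 1          # PAN-OS 11.0 and lower: n/2 + 1
post_11_1 = (n - 1) // 2 + 1   # PAN-OS 11.1 and higher, even n: (n-1)/2 + 1
print(pre_11_1, post_11_1)     # -> 3 2
```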

 

Conclusion

Understanding the evolution of quorum-based decision making in Elasticsearch is crucial for designing stable and resilient clusters. By following best practices and leveraging the features introduced in newer versions, users can ensure their Panorama Logging deployments are robust and capable of handling various operational challenges.
