High Availability Deployments with Axon Server
So you have decided to use Axon Server Enterprise because you want a highly available cluster? Then your next question may be how to configure your cluster setup to take full advantage of this. In this blog post, you’ll learn some of the trade-offs you should consider, as well as some starting points you can use for your Axon Server Enterprise cluster.
Weak points of some classic deployment patterns
Traditionally, high availability is approached using a model with multiple instances of a server, either configured in an active-passive setup or in an active-active setup. In the case of the former, all client applications connect to the active server while an alternative passive server stands by, ready to take over operations should the active server fail. For the latter, client applications connect to multiple active servers, each accepting incoming transactions simultaneously.
Despite being the traditional model, both approaches have serious downsides. An active-passive setup presents a couple of onerous hurdles: first, data synchronization is still a requirement, since the passive node needs all available data in case of a failover – which, as you may imagine, drains resources. Second, if there’s no consensus on which node is the active one, you may find yourself in a nasty, inconsistent situation. The active-active setup does not suffer from these issues, but it does require complex transaction management across the servers to prevent data inconsistency.
Luckily, nowadays there’s a more resilient alternative: consensus algorithms such as RAFT. We always have a group of active servers that forms a replication group, of which one leader is chosen by an automatic internal election. In order to become leader, a node needs votes from a majority. For example, in a replication group of three nodes, two votes will get a leader elected. From there, the leader takes on the responsibility of initiating the appending of new events, which requires acknowledgement from a majority before the transaction is considered durable. This consensus protects against the failure of individual nodes, since as long as a majority of the nodes in your cluster is up, your cluster will be available.
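To make the majority rule concrete, here is a minimal sketch of the quorum arithmetic – not Axon Server’s actual code, just an illustration of how durability under RAFT depends on acknowledgements from a majority of the replication group:

```java
// Minimal sketch of RAFT-style majority acknowledgement.
// This is an illustration of the quorum math, not Axon Server's implementation.
public class QuorumCheck {

    // The majority (quorum) for a replication group of `clusterSize` nodes.
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // A transaction is durable once acknowledgements (including the leader's
    // own) reach a majority of the replication group.
    static boolean isDurable(int acks, int clusterSize) {
        return acks >= majority(clusterSize);
    }

    public static void main(String[] args) {
        System.out.println(majority(3));      // a three-node group needs 2 votes
        System.out.println(isDurable(2, 3));  // two acks out of three: durable
        System.out.println(isDurable(1, 3));  // a single ack is not enough
    }
}
```

Note that with this rule, a three-node group keeps functioning when one node fails: the remaining two nodes still form a majority.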
In Axon Server Enterprise, we use the RAFT algorithm for the exact purpose of guaranteeing consistency when storing new events in the event store. For all other message routing needs, a single node of the cluster is sufficient for routing a message from one service to another, since we won’t need any acknowledgements for this.
When building our cluster, we can never optimize for everything at once. Instead, we’ll have to consciously make a trade-off between consistency, high availability (HA) and low latency. Axon Server (AS) prioritizes consistency above all else, leaving us to weigh HA against low latency. Below are some examples of how we can balance these two factors against each other.
Ideal solutions when not limited by hardware
AS provides several node types. However, for the purposes of this blog, we’ll focus only on primary nodes. Consider this node type the jack-of-all-trades: it stores data, participates in transactions, elects leaders, and more. This blog focuses on a generic cloud environment, but the ideas are not limited to that; they translate easily to a similar setup in, for example, your company’s own datacenters.
Let’s start with a cluster of three primary nodes. Such a replication group will always be resistant to the failure of a single node, since you’ll still have a majority – two out of three nodes – to acknowledge a transaction. However, where we choose to deploy these nodes is crucial. We can place them all inside the same datacenter (DC) or availability zone (AZ). This presents the lowest latency of communication between nodes, since they deploy together. However, should anything happen to that AZ, then the entire cluster becomes unavailable.
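As a concrete starting point, a three-node setup could look like the following `axonserver.properties` sketch, repeated per node. The hostnames and context names are placeholders, and the `autocluster` properties assume a recent Axon Server EE version – check the reference guide for the properties available in yours:

```properties
# Sketch for node 1 (repeat per node, adjusting name and hostname).
# Hostnames below are placeholders, not real infrastructure.
axoniq.axonserver.name=axonserver-1
axoniq.axonserver.hostname=axonserver-1.internal

# All nodes point at the same "first" node, so the cluster forms automatically
# and each node registers itself for the listed contexts.
axoniq.axonserver.autocluster.first=axonserver-1.internal
axoniq.axonserver.autocluster.contexts=_admin,default
```

Whether the three hosts live in one AZ or are spread across three is purely a deployment decision; the Axon Server configuration itself stays the same.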
Image 1: Three nodes in a single AZ, lowest latency, but not AZ fault tolerant
If we want our AZ to be resilient to failure, then we can deploy our three nodes across three different AZs. Here, even if an entire zone goes down or a network split between two zones occurs, we will still have two nodes alive in the remaining zones, so our cluster will remain healthy. However, this comes at a cost of increased latency.
Image 2: Three nodes in three AZs, medium latency, AZ fault tolerant
Extrapolating this idea, suppose we deploy our servers across three different regions. This provides even better HA, since we can now handle the failure of an entire region, but at the cost of even higher latency.
Image 3: Three nodes in three regions, high latency, but region fault tolerant
We can do even better. Increasing the number of nodes in our replication group makes us more resilient to the failure of individual nodes. For example, with a group of five nodes, we can deal with the loss of up to two servers and still have a fully functioning cluster. The same deployment patterns as described above apply; we’d simply place two more nodes in the first two availability zones, which acts as insurance against a complete AZ failure.
Image 4: Five nodes in three AZs
We can scale this same example even further to seven nodes or more. However, do note that our replication group should always contain an uneven number of nodes. If we were to add only a single node – going from three to four, for example – we would actually lose some resilience. The newly added node introduces an additional potential point of failure, yet it does not raise the number of failures the group can survive: losing two nodes would leave only two out of four, which is not a majority. However, we could build a cluster with these four nodes and have multiple replication groups, each using a subset of three of these nodes.
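The even-versus-odd argument above follows directly from the quorum math. The sketch below (an illustration, not Axon Server code) computes how many node failures a replication group of a given size can survive, and shows that four nodes tolerate no more failures than three:

```java
// Sketch: fault tolerance of a RAFT replication group is the number of nodes
// that may fail while a majority remains. Illustration only.
public class FaultTolerance {

    // majority = n / 2 + 1, so the group survives n - majority = (n - 1) / 2 failures.
    static int tolerated(int clusterSize) {
        return (clusterSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.println(n + " nodes tolerate " + tolerated(n) + " failure(s)");
        }
        // 3 and 4 nodes both tolerate 1 failure; 5 and 6 both tolerate 2.
        // Only odd sizes buy extra resilience per node added.
    }
}
```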
Image 5: An even number of nodes (six), but with replication groups of three nodes each. Each replication group manages one or more contexts
Two datacenters instead of three
The examples above focus on setups that leverage three datacenters, but what should you do when you have only two? In that case, one of the two will always contain the majority of your cluster, so if that DC fails, your cluster fails with it. Here it’s probably best to focus on a fast recovery time by being able to extend the minority node into a full cluster in the second datacenter. This process can be made smoother by leveraging one or more backup nodes that you change into another role when the need arises. How to do that goes beyond the scope of this article, though.
In this article we’ve seen that the RAFT protocol, as used in Axon Server Enterprise, ensures a fully functional cluster as long as a majority of the nodes is still available. With three availability zones at our disposal, we can keep our AS cluster working correctly despite the failure of an entire AZ. If we separate the nodes in our cluster geographically even further by choosing regions over AZs, we get an even more resilient cluster at the expense of increased latency.

The examples presented in this article don’t cover all of the options for achieving HA in your AS cluster; instead, they serve as starting points for understanding the various trade-offs faced when building a cluster and for determining the best deployment strategy for your needs. Be aware that your requirements may be different. Perhaps you want to focus not on high availability, but on disaster recovery, scalability, or something else. In that case, you should consider tiered storage or adding other node types, such as active or passive backup nodes, or secondary nodes.
Once you’ve found the ideal setup for your AS cluster and are ready to take it to the next level, you can find additional information in the following blog on the various configuration details.
Christian is a Senior Solutions Architect at AxonIQ.
Having worked primarily as a consultant in his career, he has seen many complex domains and helped businesses tackle complexity both in their domain and software, as well as within their organisations.
After coming into contact with the Axon Framework in early 2016, he was quickly convinced by the power of concepts such as Domain-Driven Design, CQRS and Event Sourcing, enabling people to spend their time and energy on building the right things in the right way and to bring about change in an organisation or market.