High Availability with Axon Server and Axon Framework

By Bert Laverman September 25 2020

Blog by AxonIQ

Key takeaways

High Availability concerns the “art” of making a system deal with local failures in a way that enables it to limit the impact of the losses and prevent businesses and users from being impacted.
AxonServer Enterprise Edition clusters use the Raft consensus algorithm to continue normal operations when individual nodes fail.
Replication-group node roles can further be used to optimize the usage of the underlying storage infrastructure and provide off-site backups for disaster recovery scenarios.
Extending the cluster’s fault tolerance to include public cloud regional unavailability is possible but will require more complex deployment patterns and substantial performance costs.

Building applications based on the Axon Framework is a choice primarily influenced by the Command and Query Responsibility Segregation (CQRS) software architecture or by the framework’s support for Event Sourcing. At the same time, we have often argued that using the Axon Framework provides an excellent way to support a path towards a microservices architecture, as it promotes Event-Driven architectures, which dramatically lower coupling. As a bonus, we can more easily divide large applications across multiple deployments, which allows us to quickly scale those components that benefit most from the increased responsiveness of parallel computation. However, with the increase in “moving parts,” we also increase the number of possible failure scenarios that we need to consider.

Application availability is not just a matter of ensuring that all parts of the system are “up”, just as cutting an extensive monolithic application into a large collection of microservices doesn’t necessarily increase its likelihood of failure. Instead, it all has to do with the impact of individual losses on the overall system. For example, where a failure in a monolith can more easily bring the complete system crashing down, a well-designed system composed of multiple independent components is more likely to be able to provide alternative flows compensating for those failures. Also, actions needed to restore the system to regular operation are more likely to be localized while the rest of the system keeps working around it. That said, a team transitioning from the first to the second architectural approach is suddenly confronted with a lot more up-front decisions that a monolith can more easily bundle or hide.

In this blog, we’ll look at some of the choices that Axon Framework and its infrastructural companion, Axon Server, provide.

Basic Concepts and Definitions

In an AxonServer EE cluster, we talk about primary and secondary nodes. The terms “active-backup” and “passive-backup” are also used, and they tend to cause a bit of confusion due to overlap with common terminology for High-Availability clustering. So let’s start this discussion with some terms and how we use them.

Errors, Faults, and Failures

I’ll acknowledge that most of this will be common knowledge, but these three are pretty basic:

Errors are the things that are wrong, such as typing mistakes. I’m not talking about the ones the compiler catches, but mistakes in (configured or coded) values, forgotten validations or validations that test the wrong things, or even design mistakes. They can also be more physical things like disk or memory errors or those concerning networks. (We all know the “Fallacies of distributed computing,” right?)
Faults are what happens because of errors, possibly the most famous one happening in the “AE-35 unit” of the HAL-9000. In the context of a Java application, they often take the form of an exception, if you’re lucky, but it might just as well be nothing more than a log message and the use of some default value.
Failures are the result of faults and can be generalized as “abnormal behavior.” For example, the (application-) component returns an incorrect or no value, or maybe the component hangs and never returns a response. The most noticeable response would probably be a crash, but it is entirely dependent on the situation if, and how, you’ll notice that.

In the system, we also need to be prepared for cascading failures, as the failures we notice may be symptoms of the original fault. Consequently, we should be careful not to go off hunting for the error we appear to have and instead start with root-cause analysis.

To continue with our narrative, we need to look at how we can deal with failures to make our system fault-tolerant.

Scaling and Failover

To enable our system to deal with the load we put on it, we need to provide it with the resources required to pick up the work. Traditionally, this meant adding more “iron”, in the sense that we pick a beefier server: more CPUs, more memory, more disks, or more capable versions of those components. This doesn’t necessarily change with virtual hardware, and may even involve migrating the application component to another physical system. This is called “vertical scaling” and has a big problem: we need to pre-scale for the expected maximum load or choose hardware that can be dynamically resized, which is precisely what makes this approach very expensive. The bigger problem is that we still have a single point of failure. We can keep throwing money at it and buy more expensive hardware that contains redundant components, but the cost-benefit graph is a nice exponential line, where we end up purchasing minimal improvements for ridiculous amounts of money.

With disks, the turnaround came with Redundant Arrays of Inexpensive Disks (RAID), and the same can be done for computing power: we can rework our system to allow multiple copies of a component to provide the same service, so when one is unresponsive, another can be queried instead. This is called “horizontal scaling”, and we can apply it to both hardware and software. Taking the architectural helicopter view, you can sometimes notice that one is effectively used to build the other: if we create a cluster for a central database, have we now used horizontal scaling to ensure the database is always on, or vertical scaling because we still have a single, albeit more capable, database? To prevent us from falling into this trap, we need to implement services as a collaborative effort as much as possible.

Applications built using AxonFramework fully support this collaborative approach: we can provide multiple handlers for Commands, Queries, and Events, and they use different strategies to exploit parallelism. With Commands, a single handler is needed for any particular command instance. The framework’s runtime (optionally supported by AxonServer) ensures that one is chosen from the available handlers to process it. If that handler is no longer available or currently busy, another handler will be selected instead. QueryHandlers can operate in the same way, but you can also use a divide-and-conquer strategy, or more to the point a scatter-gather approach, to let several handlers collaborate in producing the wanted result. Finally, with Events, we do not care how many handlers pick them up, as long as the runtime ensures that anyone interested in that particular event type gets it.
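To make the scatter-gather idea concrete, here is a minimal plain-Java sketch. It deliberately does not use the Axon Framework API: the class name, the method name, and the use of simple functions as “handlers” are all assumptions made for illustration only.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration; none of these names come from Axon Framework.
class ScatterGatherSketch {

    // Sends the same query to every handler in parallel and gathers all answers.
    static <Q, R> List<R> scatterGather(Q query, List<Function<Q, R>> handlers) {
        // Scatter: start each handler asynchronously.
        List<CompletableFuture<R>> pending = handlers.stream()
                .map(handler -> CompletableFuture.supplyAsync(() -> handler.apply(query)))
                .collect(Collectors.toList());
        // Gather: wait for every handler and collect the results in handler order.
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two stand-in handlers that each answer the same query differently.
        List<Function<String, Integer>> handlers = List.of(
                String::length,
                q -> q.indexOf('-'));
        System.out.println(scatterGather("order-42", handlers)); // prints [8, 5]
    }
}
```

A real setup would also need timeouts and a policy for handlers that fail or never answer, which this sketch leaves out.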

High Availability of AxonServer

AxonServer is an infrastructural component that sits like a spider in the web of our communications, ensuring that every client application gets the messages it needs and that all events are stored for future replays. This central role means we need to look at its status as a single point of failure.

AxonServer Standard Edition

AxonServer Standard Edition has no intrinsic capabilities to support automated failover because it has no concept of clustered operation. We can improve the situation a lot by using platform features that support automatic restarts based on a health probe. AxonServer has such a probe on its REST endpoint. We can even use some properties to expose it on its own port, making the health probe independent of the “standard” endpoint:

management.port=8080
management.ssl.enabled=false

This has the additional benefit that the health probe won’t be exposed publicly even if port 8024 is, so you can keep it unsecured.
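As a rough sketch of what a platform health probe does with this port, the following Java snippet performs a single check. The path “/actuator/health” is an assumption (it is the usual Spring Boot Actuator location), and the class and method names are hypothetical; in practice the platform’s own probe mechanism would normally replace this code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical probe; the endpoint path is an assumption, not taken from the article.
class HealthProbeSketch {

    // Builds the probe URL for the separately exposed management port.
    static URI healthUri(String host, int managementPort) {
        return URI.create("http://" + host + ":" + managementPort + "/actuator/health");
    }

    // Only an HTTP 200 counts as healthy; anything else should trigger a restart.
    static boolean isHealthy(int httpStatus) {
        return httpStatus == 200;
    }

    // Performs one probe; a connection error or timeout counts as unhealthy.
    static boolean probe(String host, int managementPort) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(healthUri(host, managementPort))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return isHealthy(response.statusCode());
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling `HealthProbeSketch.probe("localhost", 8080)` would match the port configured above.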

Given this check, we now need to choose a platform with fast startup times, so any downtime is minimized. However, we’re left with vertical scaling for performance and security under pressure: select a beefy VM with a fast disk. That is not optimal.

AxonServer Enterprise Edition

As far as security and performance are concerned, our best options are the horizontal scaling options we get from a cluster of AxonServer EE instances. To start with the most prominent aspect: a cluster of AxonServer EE nodes is no longer a single point of failure, as the nodes use the Raft consensus algorithm to coordinate any changes. For every replication group active in the cluster, an election is held to decide a leader, and all nodes acknowledge transactions to it. The leader then acknowledges the transaction back to the client as soon as a majority of the nodes have committed the change. If a node becomes unresponsive, this will keep working as long as a majority is available. If the unresponsive node happens to be the leader, a new election is started, and the client either loses its connection (if it was connected to the leader) or is informed that the transaction failed. It is then up to the client to re-establish the connection to the cluster and retry, behavior that is partly implemented in the Axon Framework. The rest is supported through the configuration of retry policies.
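The kind of retry policy meant here can be sketched in plain Java as follows. This is not the framework’s retry API; the class name, method name, and the doubling delay are assumptions chosen to show the general shape of “retry a failed transaction a limited number of times”.

```java
import java.util.function.Supplier;

// Hypothetical retry helper; not part of Axon Framework.
class RetrySketch {

    // Runs the action up to maxAttempts times, sleeping between attempts
    // with a delay that doubles after every failure.
    static <T> T withRetry(Supplier<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException interrupted) {
                        // Stop retrying when the calling thread is interrupted.
                        Thread.currentThread().interrupt();
                        throw last;
                    }
                    delay *= 2;
                }
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }
}
```

A caller would wrap the operation that may fail during a leader election, for example `RetrySketch.withRetry(() -> send(command), 5, 100)`, where `send` stands for whatever dispatch call the application uses.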

To bootstrap clients, the choice was made not to introduce a service registry but instead give each client a complete list of all gRPC endpoints of the AxonServer nodes in the cluster. Using extra tooling, this list can easily be supplied externally to the client during startup if required. The client iterates through the list of endpoints until it finds one that responds. The initial response from that node may be a request to connect to yet another node, depending on load distribution and a possible better match in server tags. A further improvement since release 4.4 is that clients are no longer dependent on the leader as the single source of truth and can use “followers” for reading events. This reduces the dependency on the current leader, at the cost of a slight delay to ensure any last events have been transferred. Naturally, this behavior can be reverted to always read from the leader using the “force-read-from-leader” setting. In a distributed setup, depending on where your main concerns lie, you can focus on performance and keep all nodes close to the clients, or go for availability and distribute the nodes physically across geographical regions to reduce the chance of a catastrophic loss of the whole cluster.
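The bootstrap procedure described above (try each known endpoint, then follow a possible redirect) can be sketched like this. The `connect` function is a hypothetical stand-in for the real connection attempt, and the host names in the usage note are made up.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical bootstrap logic; the real client behavior is more involved.
class BootstrapSketch {

    // connect returns empty when an endpoint does not respond; otherwise it
    // returns the endpoint the client should use, which may be a different node.
    static Optional<String> bootstrap(List<String> endpoints,
                                      Function<String, Optional<String>> connect) {
        for (String endpoint : endpoints) {
            Optional<String> target = connect.apply(endpoint);
            if (target.isPresent()) {
                return target;
            }
        }
        // No node in the list responded.
        return Optional.empty();
    }
}
```

With a list such as `List.of("axonserver-1:8124", "axonserver-2:8124", "axonserver-3:8124")`, the first node that responds decides where the client ends up, so the order of the list only matters while nodes are down.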

Raft and Cluster Availability

The Raft algorithm aims to provide consensus on values, and AxonServer uses this to consider any data change supported by this consensus as committed. Effectively, a majority (more than half) of the nodes must confirm a change to the event store, and this process can continue correctly as long as a majority of the nodes are available. For example, a cluster of three AxonServer EE nodes will continue to work and accept events as long as two of those nodes are still available, meaning running and communicating with each other. This requirement is imposed per replication group in the cluster. So, for example, if you have a five-node cluster with three nodes down, a replication group served by three nodes, of which two are the ones remaining available, is still fully functional. The Raft algorithm is also used for the cluster’s configuration data, using a special context named “_admin” in a replication group with the same name. A node servicing this replication group is called an “admin node” and has complete visibility on the cluster’s structure.
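The majority arithmetic is simple enough to write down. The following sketch (hypothetical class and method names) computes the majority for a replication group and how many nodes may be lost before it stops accepting changes.

```java
// Hypothetical helper illustrating majority arithmetic for a replication group.
class QuorumSketch {

    // Smallest number of nodes that is strictly more than half of the group.
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    // Number of nodes that may be unavailable while a majority remains.
    static int tolerated(int groupSize) {
        return groupSize - majority(groupSize);
    }

    // True when enough nodes are available to commit changes.
    static boolean canCommit(int groupSize, int available) {
        return available >= majority(groupSize);
    }
}
```

A group of three needs two nodes and tolerates one failure; a group of five needs three and tolerates two. A group of four needs three and still tolerates only one failure, which is why odd group sizes are the usual choice.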

Contexts and Replication Groups

CQRS and Event Sourcing are strongly connected to Domain-Driven Design. In the case of the Axon Framework in combination with AxonServer, it has led to the concept of an AxonServer context to provide the boundary around the domain. An SE instance supports a single context named “default”, whereas EE clusters can support multiple contexts. When a client application connects to AxonServer, it specifies the context it belongs to (with “default” being the default). All its messages and aggregates are strictly local to this context. EE cluster nodes share information on the client applications, registered message handlers, and the (optional) event store, and one or more contexts are combined into replication groups. Replication groups allow you to reserve nodes based on their hardware capabilities or geographical location and apply these to contexts with comparable requirements.

Messaging-only Nodes

The most straightforward node role is that of a messaging-only node. This is the only role that disables the event store functionality, and it is mainly used in situations where you have an alternative event store. Please see this blog about why we think you really should use AxonServer for that. It can also be used to increase the availability of the cluster for messaging, with the limitation of not helping for events.

Primary and Secondary Nodes

Nodes that do have an event store are split into primary and non-primary nodes. The difference is that only primary nodes are candidates for leadership, and the right to vote is restricted to them and active backup nodes (more on that later). Initially, all nodes with an event store were what we now call primary nodes, and AxonServer SE nodes can be viewed as such also. Naturally, calling this role “primary” implies there are also “secondary” nodes, and with the introduction of version 4.4, this is so. If a secondary node is present in the replication group and a retention time is defined on a context in it, primary nodes will remove event store segments that lie entirely beyond this limit, as the removed events and snapshots are guaranteed to remain available from another source. From an availability perspective, this means that we can now differentiate in storage options, giving the primary nodes smaller and faster devices, while the secondary nodes can use slower alternatives, which also tend to be cheaper. However, we want to ensure that we have multiple copies, so preferably we want at least two secondary nodes.

From a Raft perspective, the secondary nodes can be ignored for leadership elections, as they are not the preferential nodes for replays, for which we want to use primaries. This also means that adding secondary nodes will not affect the number needed for a majority. However, event transactions involve the secondary nodes, requiring at least one secondary node to have confirmed storage.

Backup Nodes

The final group of roles concerns the “backup” roles, which are not aimed at failover, but rather at disaster recovery scenarios. Active backup nodes are (groups of) nodes that join in the event store transactions. At least one active backup needs to acknowledge storage (in addition to a majority of the primary nodes) for the transaction to be committed. Passive backup nodes receive the data asynchronously, so they are not on the critical path for the transaction but lose the guarantee of up-to-dateness. From a Raft perspective, the active backup nodes add to the size of the voter population, as they contribute to the data availability, even though they are not available for leadership.
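Read together, the role descriptions suggest a commit condition along the following lines. This is a simplified reading of the text and not AxonServer’s actual implementation: the class, the method, and the way acknowledgements are counted are all assumptions, and passive backups are left out because they are not on the critical path.

```java
// Hypothetical, simplified model of when an event transaction counts as committed.
class CommitRuleSketch {

    static boolean isCommitted(int primaries, int primaryAcks,
                               int secondaries, int secondaryAcks,
                               int activeBackups, int activeBackupAcks) {
        // A majority of the primary nodes must have acknowledged storage.
        boolean primaryMajority = primaryAcks >= primaries / 2 + 1;
        // If secondary nodes are configured, at least one must have confirmed storage.
        boolean secondaryOk = secondaries == 0 || secondaryAcks >= 1;
        // If active backups are configured, at least one must have acknowledged.
        boolean backupOk = activeBackups == 0 || activeBackupAcks >= 1;
        return primaryMajority && secondaryOk && backupOk;
    }
}
```

For example, with three primaries and one active backup, two primary acknowledgements are not enough while the backup is unreachable, which is the availability price paid for a guaranteed up-to-date backup.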

Backup nodes tend to be most valuable for disaster recovery when located in a different geographical region, providing a similar function as offline backups of the disk storage. Choosing between the two alternatives is a matter of the requirements for up-to-dateness. If you need the guarantee that all committed events and snapshots are available in the backup, use active-backup nodes. With more than one active-backup node, maintenance, or a different level of backups using offline disk copies, can be supported, as only one of them needs to stay online for the cluster to function.

If the requirements for backup up-to-dateness are less strict, you can use a passive backup node (or a set of them) because they can receive events and snapshots after the transaction has been confirmed to the clients. This effectively means the stream of data follows at “best-effort” speed, which in practice won’t lag too far behind. Bringing a passive backup node down will not affect the availability of the cluster, nor will temporary connectivity problems. The streaming uses the same strategy used at the startup of all node roles or after other forms of connectivity loss, where the node checks the event store to determine whether to ask for the remaining data or a data file snapshot from the leader. As soon as the backlog has been transferred, it goes back to regular follower mode.

The Wider Picture: Going Multi-Region

Naturally, the availability of AxonServer is greatly influenced by the platform on which it runs, not only in the performance-related choices available for hardware but also in the support for monitoring, automatic restarts, and redundancy through regional distribution. As discussed in the “Running AxonServer” installment on Virtual Machines, Kubernetes is nearly unbeatable in its ease of use and options for deployment automation, at the cost of a single-region deployment limitation. If your requirements do not include a multi-region deployment, and all AxonServer nodes are the same in their configuration of the supported replication groups, a single StatefulSet will do admirably. Still within a single region, having a StatefulSet per AxonServer node solves any configuration variations. Combined with custom storage classes, you can differentiate between disk types and profit from multi-tier storage and backup nodes. Having the cluster and clients in a single availability zone allows for optimal network performance, while a multi-zone deployment increases availability at a relatively small cost in performance.

Multi-region Deployments and Kubernetes

Increasing the scope, we can use a single outlier deployment in a different region for a passive backup node, but this will require some networking configuration to ensure the communication between all nodes works predictably. The choice between a second Kubernetes cluster in a different region or VMs as the outlier depends on the support available for this networking magic. This is where the choice between Kubernetes and VMs comes up more strongly. Kubernetes itself is not designed as a transparent layer between the applications and the data center it runs in, as it only standardizes LoadBalancers and Ingresses for HTTP, and we have a gRPC port that needs to be shared. The default Nginx Ingress controller that most providers use can be configured for gRPC traffic. Still, the picture quickly becomes complicated if we want to use that for a private communication channel while another Ingress controller needs to regulate public traffic towards the HTTP-based UI and REST endpoints. Going for VMs as the only platform solves many of these problems, as we now have a more comprehensive range of products available for setting up our traffic lanes, at the cost of losing the ease of deployment that Kubernetes offers. A multi-region multi-Kubernetes-cluster setup is possible with all cloud providers, but the networking products available to connect the clusters differ.

AxonServer Tagging

Multi-region and multi-zone deployments can cause performance issues if the load-balancing of clients causes them to be sent to less favorable instances. AxonServer node tagging has been around since the first release of version 4 and can direct client applications to the nearest part of the cluster. The tags are set by adding entries to the property file under “axoniq.axonserver.tags”, for example:

axoniq.axonserver.tags.region=europe-west4
axoniq.axonserver.tags.zone=europe-west4-a

The name of the tag needs to be matched with the tags on the client, for example:

axon.tags.region=europe-west4

If more than one match is available, the load-balancing strategy is used to choose between the available matches. If no match can be made, the standard (load-balancing) strategy is applied over the whole set of available nodes. If a node goes down, any clients connected to it will reconnect using the same strategy. If you have multiple contexts and the clients for a specific context are in a particular region, it makes sense to locate enough nodes in that region to set up a replication group for it there, so that all traffic will stay local. If this is too costly, a single tagged node with the default “read from followers” strategy will allow Event Processors to read from the closest node even if it is not the context leader, whereas published events, commands, and queries may have to take a longer route.

Please note that tag names, because they can be sourced from the names of environment variables, are matched case-insensitively. This allows “AXON_TAGS_REGION” to be used as well as the above-shown property. If you’re using YAML files for properties, the syntax is probably a bit easier to understand:

axoniq:
  axonserver:
    tags:
      region: "europe-west4"
      zone: "europe-west4-a"
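The tag-matching behavior described in this section can be sketched as follows. The scoring rule (prefer the nodes with the most matching tags) is an assumption, as are all class and method names; the sketch only returns the candidate set, leaving the final choice to whatever load-balancing strategy is in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of tag-based node selection; not AxonServer's actual code.
class TagMatchSketch {

    // Counts the client tags that a node matches; tag names are compared case-insensitively.
    static int matchCount(Map<String, String> clientTags, Map<String, String> nodeTags) {
        Map<String, String> normalized = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        normalized.putAll(nodeTags);
        int count = 0;
        for (Map.Entry<String, String> tag : clientTags.entrySet()) {
            if (tag.getValue().equals(normalized.get(tag.getKey()))) {
                count++;
            }
        }
        return count;
    }

    // Returns the nodes with the highest number of matching tags,
    // or all nodes when no node matches any tag.
    static List<String> candidates(Map<String, String> clientTags,
                                   Map<String, Map<String, String>> nodes) {
        int best = 0;
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> node : nodes.entrySet()) {
            int score = matchCount(clientTags, node.getValue());
            if (score > best) {
                best = score;
                result.clear();
            }
            if (score == best && score > 0) {
                result.add(node.getKey());
            }
        }
        return result.isEmpty() ? new ArrayList<>(nodes.keySet()) : result;
    }
}
```

A client tagged with region “europe-west4” would get only the nodes in that region as candidates, while a client whose region matches no node would fall back to the full set.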

Final Words

As always, with this subject, there is no single solution that fits all needs, but I hope you have seen that AxonServer provides many features that can help you improve the availability of your application. For example, depending on how high you set the bar for recovering from increasingly less likely events, you can build clusters, automate real-time backups, and support multi-regional deployments. If you have any questions concerning specific scenarios or would like to discuss deployment plans, do not hesitate to contact the AxonIQ team.

Bert Laverman

Senior Software Architect and Developer at AxonIQ. From hands-on software development to academic research on software reusability. Bert is a strong proponent of good software design, Agile and Lean software development, and DevOps practices.

Axon Server, Axon Server Enterprise, Cluster, Axon Framework, High Availability