For many systems, high availability is crucial to survival and to honoring the SLAs agreed with (internal or external) customers.
To achieve a high SLA of three, four, or even five nines, you need a redundant infrastructure; there is no other way to get there.
- Five nines or 99.999% availability allows 5 minutes, 15 seconds of downtime per year
- Four nines or 99.99% availability allows 52 minutes, 35 seconds of downtime per year
- Three nines or 99.9% availability allows 8 hours, 45 minutes of downtime per year
You can check it here: https://uptime.is/
Can you imagine? If your SLA is 99.99% availability, you have only about 8.6 seconds per day to restore an unavailable system. If your architecture doesn’t have redundant mechanisms that let it recover by itself, I would say it’s nearly impossible to achieve this SLA.
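The downtime figures above are simple arithmetic, and a minimal sketch makes them easy to reproduce. The function name `downtime_budget` is mine, and the year length of 365.25 days is an assumption that lines up with the numbers quoted above.

```python
from datetime import timedelta

SECONDS_PER_DAY = 24 * 60 * 60
# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def downtime_budget(availability_percent: float, period_seconds: float) -> timedelta:
    """Return the maximum downtime allowed in a period for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period_seconds * unavailable_fraction)


for target in (99.9, 99.99, 99.999):
    per_year = downtime_budget(target, SECONDS_PER_YEAR)
    per_day = downtime_budget(target, SECONDS_PER_DAY)
    print(f"{target}%: {per_year} per year, {per_day.total_seconds():.2f} s per day")
```

The same function works for any period, so you can also check the weekly or monthly budget of your own SLA.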
In terms of redundancy, the picture below shows a weak, simple architecture without any redundancy at all.
If for any reason we lose the database (one machine), the whole application goes down. If we lose the web server (one machine), the whole application goes down as well.
In this scenario, recovering from a server outage within a short time frame is very hard.

In the case below, we find the same application with a bit more redundancy.
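To get a feeling for how much redundancy buys you, here is a rough sketch that compares the two setups. It assumes that each machine fails independently, that a single machine is 99% available (a made-up number), and that the redundant version has two nodes per layer; your real architecture and numbers will differ.

```python
def serial(*availabilities: float) -> float:
    """Every component is required, so the availabilities multiply."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def parallel(availability: float, replicas: int) -> float:
    """The layer is up as long as at least one independent replica is up."""
    return 1 - (1 - availability) ** replicas


NODE = 0.99  # hypothetical availability of one machine

# One web server in front of one database: both must be up.
single = serial(NODE, NODE)
# Assumed redundant setup: two web servers and two database nodes.
redundant = serial(parallel(NODE, 2), parallel(NODE, 2))

print(f"no redundancy: {single:.4%}")
print(f"two per layer: {redundant:.4%}")
```

Under these assumptions the non-redundant chain is less available than any single machine in it, while duplicating each layer lifts the whole application well above the availability of one node.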

So, what do we need to consider to make our environment more robust?
I would say there are at least four topics that you have to pay attention to:
Monitoring and alerting
Monitoring and alerting are the soul of everything. To achieve a high SLA of three, four, or five nines, everything should happen automatically, and for that you need a good and efficient monitoring system.
When one node inside your cluster crashes, your monitoring system notices it, automatically excludes the server from the health pool, and raises an alert so that you can either intervene manually to fix it or add another node to the pool.
If your system is not able to execute this task automatically, a human intervention will probably not happen in time.
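The behaviour described above can be sketched in a few lines. This is only an illustration, not how any particular monitoring tool works: the class `HealthPool`, the threshold of three failed probes, and the node names are all hypothetical, and the probe and alert functions are passed in so the example runs without a network.

```python
from typing import Callable, Dict, Iterable, List, Set

FAILURE_THRESHOLD = 3  # hypothetical: consecutive failed probes before eviction


class HealthPool:
    """Tracks which nodes may receive traffic and evicts the failing ones."""

    def __init__(self, nodes: Iterable[str],
                 probe: Callable[[str], bool],
                 alert: Callable[[str], None]) -> None:
        self.healthy: Set[str] = set(nodes)
        self.failures: Dict[str, int] = {node: 0 for node in self.healthy}
        self.probe = probe
        self.alert = alert

    def check_once(self) -> None:
        # sorted() returns a copy, so removing nodes while looping is safe.
        for node in sorted(self.healthy):
            if self.probe(node):
                self.failures[node] = 0
                continue
            self.failures[node] += 1
            if self.failures[node] >= FAILURE_THRESHOLD:
                self.healthy.discard(node)
                self.alert(f"{node} removed from pool after "
                           f"{FAILURE_THRESHOLD} failed probes")


# Usage: "web-2" never answers its probe, so it is evicted and an alert is raised.
alerts: List[str] = []
pool = HealthPool(["web-1", "web-2"],
                  probe=lambda node: node != "web-2",
                  alert=alerts.append)
for _ in range(FAILURE_THRESHOLD):
    pool.check_once()
print(pool.healthy, alerts)
```

Requiring several consecutive failures before eviction is a design choice: it avoids removing a node because of one slow response, at the cost of reacting a little later.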
SPOF – Single point of failure
Redundancy is also crucial. You need it everywhere, and when I say everywhere, I mean everywhere. We usually think only about our architecture layer, but sometimes our system runs on on-premises machines, and we need to be sure that the physical machines have redundancy too, e.g. two power supplies (each plugged into a different power strip), nodes of the same pool installed in different racks, and so on.
This applies not only to the physical layer: every single point of failure has to be eliminated.
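A simple way to start hunting for single points of failure is to list every component with the number of independent instances you have and flag anything with fewer than two. The inventory below is entirely made up for illustration.

```python
from typing import Dict, List

# Hypothetical inventory: component name -> number of independent instances.
inventory: Dict[str, int] = {
    "web server": 2,
    "database": 1,
    "load balancer": 1,
    "power supply per machine": 2,
    "rack": 2,
}


def single_points_of_failure(components: Dict[str, int]) -> List[str]:
    """Return the components that have no redundant counterpart."""
    return sorted(name for name, count in components.items() if count < 2)


print(single_points_of_failure(inventory))
```

Counting instances is only a first pass. Two nodes that share the same rack or the same power strip still fail together, so the inventory has to include those shared dependencies as well.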
Deployment process
I’m pretty sure you have heard this already: the simplest way to keep a system running smoothly is not to update it, not to touch it. That’s why many companies “forbid” deploying the application on Fridays, or even during the day (the high-traffic period).
You need a clear and efficient deployment process to avoid breaking the system while still being able to update it as soon as possible.
Canary deployment, code review, and code promotion approval are techniques that can help with this process.
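As an illustration of the canary idea, the sketch below sends a small share of requests to the new version and decides whether to roll back by comparing error rates. The 5% weight, the 1% tolerance, and the function names are assumptions for the example; real setups usually do this in the load balancer or the deployment tool.

```python
import random


def pick_version(canary_weight: float, rng: random.Random) -> str:
    """Route one request: a small fraction goes to the canary."""
    return "canary" if rng.random() < canary_weight else "stable"


def should_roll_back(canary_errors: int, canary_requests: int,
                     stable_errors: int, stable_requests: int,
                     tolerance: float = 0.01) -> bool:
    """Roll back when the canary error rate exceeds the stable rate plus a tolerance."""
    if canary_requests == 0:
        return False  # no canary traffic yet, nothing to judge
    canary_rate = canary_errors / canary_requests
    stable_rate = stable_errors / stable_requests if stable_requests else 0.0
    return canary_rate > stable_rate + tolerance


# Usage: route 10,000 simulated requests with a 5% canary weight.
rng = random.Random(42)  # seeded so the run is repeatable
routed = [pick_version(0.05, rng) for _ in range(10_000)]
canary_share = routed.count("canary") / len(routed)
print(f"canary share: {canary_share:.1%}")
print(should_roll_back(canary_errors=30, canary_requests=500,
                       stable_errors=10, stable_requests=9_500))
```

If the new version misbehaves, only the small canary share of users is affected, which protects most of your downtime budget while you roll back.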
Budget
And last but not least, money! Every time we talk about high availability, redundancy, and resiliency, we are talking about spending more money. It’s sometimes a complicated point, and we always need to evaluate and balance everything. The SLA is a “product” that you offer to your customer, so you need to honor the contract; otherwise you have to pay fines. But that is not all: a high SLA is also about how much the customer can really rely on your application, and how long you can keep your system up and running without interruption while still delivering new features.
To make our lives easier, we have the cloud now (thank God), which means we don’t need to care about physical stuff, and configuring redundancy, scalability, etc. is so much easier; the sky is the limit. We can create redundancy at the application, network, server, and even region layer. And if your goal is to achieve five nines, region redundancy is something you really need.
Of course, there are many more options and approaches. Here I just wanted to illustrate one scenario; if it helps you somehow, my goal is achieved.
Both diagrams are available on GitHub:
https://github.com/carlospcastro/architecture/tree/master/availability/general
If you have any questions, please let me know in the comment section below.