Field Notes - HA versus Fault Tolerance
This is a topic I often see customers and folks in general get confused about. High Availability (HA) is one of those buzzwords that vendors and the industry as a whole throw around a lot. Fault tolerance gets thrown around as well, but HA is the term I see most often out in the field, both in documentation and in meetings.
First, let's take a look at why both of these terms even exist. No matter the size of the business, the stability and availability of IT infrastructure are paramount. Everyone wants their network to be stable and running 100% of the time. That's where HA and fault tolerance come into play.
The way I view High Availability (HA) is as a setup where you need access to a resource (or resources) but can handle a small outage. Service needs to be restored in a reasonable time frame. Depending on the technology being used, a reasonable time frame in this scenario is typically minutes. If you cannot restore service within this time frame, it isn't highly available.
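To put a number on "a small outage", here is a minimal sketch that converts between total outage minutes per year and an availability percentage. The function names are my own and purely illustrative, and the yearly window (a 365-day year) is an assumption; use whatever window your SLA actually defines.

```python
# Illustrative helpers only; names and the 365-day window are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_percent(outage_minutes_per_year: float) -> float:
    """Return yearly availability (%) for a given total outage time."""
    if not 0 <= outage_minutes_per_year <= MINUTES_PER_YEAR:
        raise ValueError("outage must be between 0 and one year of minutes")
    return 100.0 * (1 - outage_minutes_per_year / MINUTES_PER_YEAR)


def downtime_budget_minutes(target_percent: float) -> float:
    """Return how many minutes of outage per year a target (%) allows."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # A handful of multi-minute outages per year versus a "three nines" target.
    print(availability_percent(30))
    print(downtime_budget_minutes(99.9))
```

For scale, a 99.9% target works out to roughly 525.6 minutes (a bit under nine hours) of allowed downtime per year, which is why a few outages measured in minutes can still count as highly available.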
A good example of this, which I discuss with my clients, is VMware HA. In VMware, when you have the appropriate license and at least two hosts, you can enable the HA feature. I cannot tell you how many people look at that and assume it means that if one host goes down, everything will seamlessly fail over to the other host with no downtime. This is incorrect. In this scenario, if host 1 goes down, the VMs that were running on it are restarted on host 2, using host 2's compute resources. This causes a brief disruption, but the VMs come back in a reasonable time frame. Because the disruption is relatively brief, the resources are considered highly available.
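The restart behavior described above can be sketched as a toy model. This is not the VMware API; the `Host` class, the `fail_host_with_ha` function, and the `restart_minutes` figure are all hypothetical and exist only to show that every VM on the failed host sees some downtime before it comes back on the survivor.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a hypervisor host (not a VMware object)."""
    name: str
    up: bool = True
    vms: list[str] = field(default_factory=list)


def fail_host_with_ha(failed: Host, survivor: Host,
                      restart_minutes: float) -> dict[str, float]:
    """Mark `failed` as down and restart its VMs on `survivor`.

    Returns the downtime in minutes experienced by each restarted VM.
    `restart_minutes` is an assumed detection-plus-boot time.
    """
    if not survivor.up:
        raise RuntimeError("no surviving host available to restart VMs")
    failed.up = False
    downtime: dict[str, float] = {}
    for vm in failed.vms:
        survivor.vms.append(vm)        # VM is restarted on the other host
        downtime[vm] = restart_minutes  # it was offline while rebooting
    failed.vms = []
    return downtime


if __name__ == "__main__":
    host1 = Host("host1", vms=["web", "db"])
    host2 = Host("host2", vms=["app"])
    print(fail_host_with_ha(host1, host2, restart_minutes=3.0))
    print(host2.vms)
```

The point of the model is the non-zero downtime per VM: the workloads recover on their own, but they are not seamless.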
I look at fault tolerance as a more robust enhancement of HA. What I mean by this is building out hardware and software so the system can handle a fault, whether in hardware or in software, and recover without human intervention. In general, I feel that this is what most people truly want when they discuss infrastructure in meetings or on conference calls.
Many examples of this can be seen in the move to cloud computing. AWS has many services that can be used to build fault tolerant systems. For example, let's take a look at a scenario using Amazon RDS with its Multi-AZ (multiple availability zone) functionality. With this technology you have a primary DB in one availability zone (one or more AWS data centers) and a standby replica in another. If the entire availability zone goes down due to an infrastructure failure, a failover is triggered automatically and service resumes on the standby DB.
If you work in an organization that's reviewing infrastructure, or if you are consulting, be as clear as possible with these terms. I've seen proposals use both terms when the intent was to provide HA. I've heard folks use both terms in the same meeting while discussing the same thing.