I’ve been trying to understand what Byzantine Fault Tolerance means for almost a year now and slowly peeling off the layers of the onion to understand better but I’m probably still only a quarter of the way there. In this article I will try to explain BFT in a different way.
From the dictionary
One of the main challenges for me in understanding BFT is understanding the word itself. What does Byzantine actually mean? The Oxford dictionary defines it as:
- Relating to Byzantium (now Istanbul), the Byzantine Empire, or the Eastern Orthodox Church
- (of a system or situation) Excessively complicated
- Characterized by deviousness or underhand procedure.
Straight of the bat, these definitions don’t make understanding BFT any easier. Let’s go to Wikipedia
Wikipedia
The standard go to for any BFT explanation is some team of dudes (a general) wanting to attack some castle with a handful of other teams and the rule is they need to either attack all at once or not attack at all. Any half baked attempt will result in catastrophic failure and with everyone retreating with their tails between their legs way back to ancient Mesopotamia in early antiquity.
I’ve read it about a million times, watched YouTube explanations a million more so here goes my attempt.
Key Concepts
The key concept is trying to agree on something within a group of people or machines. In more technical terms, it’s trying to achieve consensus within a distributed network.
Let’s imagine a group of us wanting to watch a movie. How do we agree on what movie to watch? In other words, how do we come to consensus? First of all, we can all take turns at suggesting a movie, we can draw straws, play Jan-Ken-Pon, have a running race, have a vote or all sorts of other things.
However, in this situation the constraint is that we are located in different countries and can ONLY communicate via homing pigeons. This is to reflect a world wide computer network that communicates via network messages.
We then have to imagine that there are some people in this group that are trouble makers and want to prevent consensus being achieved. They can lie, cheat, collude, or even just make an honest mistake by misunderstand something. In technical jargon, we would call these people or computers ‘faults in the network’.
The goal is to still have the group come to an agreement even with a small number of bad actors so how do we achieve this?
Acknowledgments
How about we agree that it’s Alice’s turn to pick a movie this week. She chooses Saving Private Ryan, and everyone has to vote. Alice tells everyone to send a yes or no vote back via a homing pigeon. But what if the original message is intercepted and replaced with the opposite decision? What if the homing pigeon lost its bearings and never reached the destination? What if another person in the group was like Donald Trump and tried to confuse everyone by saying stuff that just doesn’t make sense?
Alice can wait for a confirmation but what if the same thing happens? We can wait for a confirmation of the confirmation but this is an infinite regress.
First concept
Therefore, we have to worry that some of our people or ‘computers’ are byzantine meaning they are devious or use underhand tactics. This in essence is the Byzantine Generals problem. How to achieve consensus when we have a whole bunch of Donald Trumps in the system trying to confuse everyone?
Second concept
We have to now introduce another rule. Consensus has to match at least one of our initial vote. This means that if no one selected yes at the beginning, then the final consensus cannot be a yes. If no one selected no, then the final consensus cannot be a no. This is added to avoid the trivial problem of programmatically forcing all the computers to vote yes.
The catch
Byzantine fault tolerance refers to properties of a consensus algorithm. The best algorithms are called Asynchronous Byzantine Fault Tolerant. This means that a group of computers can come to an agreement even though there are lots of Donald Trumps as part of the network.
The second best algorithms are just called Byzantine Fault Tolerant or in some cases Partially Asynchronous Byzantine Fault Tolerant. This means that the network of computers can come to an agreement BUT honest computers can always communicate and there are no “Donald Trumps” in the system. Technically speaking the assumption is that no Botnets or DDoS attacks exist. All messages need to arrive with x seconds also.
Therefore, when presented with an algorithm, some clever person will be able to say yes it is BFT, no it is not BFT or yes it is aBFT or no it’s not aBFT.
A network that is BFT is saying that it can solve the Byzantine’s Generals Problem unless there are botnets and aBFT is saying that the network can solve the Byzantine’s Generals Problem regardless of botnets. This means that aBFT is better than BFT.
What algorithms CANNOT achieve consensus with bad actors in the system?
RAFT is a consensus algorithm but it CANNOT tolerate a fault with a byzantine computer. ie It is not Byzantine Fault Tolerant.
What algorithms CAN still achieve consensus with bad actors in the system?
- Some early solutions to Byzantine faults include the family of consensus protocols known as Paxos.
- Practical Byzantine Fault Tolerance (PBFT) is another consensus algorithm that is Byzantine fault tolerant
- Bitcoin uses proof of work as its algorithm to achieve consensus of the order of transaction. Proof of work allows the network to overcome Byzantine failures.
- Proof of Stake, Delegated Proof of Stake, Proof of Authority are all BFT.
When is consensus impossible?
It has been proven in the original Byzantine Generals Problem paper by Lamport, Shostak, and Pease, that no solution with fewer than 3m + 1 generals can cope with m traitors. In other words, consensus is impossible to achieve if a third or more of the generals are traitors.
Surprise
Nakamoto said in an email that proof of work satisfies byzantine fault tolerance (https://satoshi.nakamotoinstitute.org/emails/cryptography/11/). The shocking part is that it does not and there has been no formal proof.
Proof of work or proof of stake are probabilistic solutions to byzantine fault tolerance (meaning that it’s not guaranteed 100%). Part of the problem is that these consensus algorithms do not map “very cleanly” to all this maths in the area of distributed algorithms (where all this terminology comes from).
Leemon Baird explaining BFT
Leemon does a great job explaining BFT in the video below.
References:
https://medium.com/@alexandratran/a-cursory-introduction-to-byzantine-fault-tolerance-and-alternative-consensus-1155a5594f18