Split-Brain
Split-brain occurs when a cluster is split (most commonly by a network partition) into two or more sub-clusters that are unaware of each other. Different requests can then be routed to different sub-clusters, causing consistency problems. A system using single-leader replication must ensure that two (or more) leaders can’t be elected during a network partition; all nodes must agree on a single leader (or refuse to serve traffic).
At a high level, this can be mitigated in one of two ways:
- Require a (sub-)cluster to hold a majority of votes (known as a quorum) to continue functioning; if it doesn’t, reject incoming requests (roughly CP). See the sketch after this list.
- Do nothing, and attempt to reconcile divergent data once the network partition has been resolved (roughly AP).
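A minimal sketch of the quorum check behind the first option, in Python. The function names and structure are illustrative, not taken from any particular system: a node counts the voting members it can reach and only keeps serving if that count is a strict majority of the cluster.

```python
def quorum_size(cluster_size: int) -> int:
    """Strict majority: more than half of all voting members."""
    return cluster_size // 2 + 1


def can_serve_writes(reachable_members: int, cluster_size: int) -> bool:
    """A sub-cluster may only keep serving if it can see a majority."""
    return reachable_members >= quorum_size(cluster_size)


# A 5-node cluster partitioned into a 3-node side and a 2-node side:
assert can_serve_writes(3, 5)      # majority side keeps serving (roughly CP)
assert not can_serve_writes(2, 5)  # minority side rejects requests
```

Because at most one side of any partition can hold a majority, this rule guarantees that two sub-clusters never accept writes at the same time.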
Two-node clusters (and, to a lesser degree, larger even-numbered clusters) can’t form a quorum when a partition splits them evenly, so a “witness node” can be used: it stores no data and exists only to cast a vote.
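A short worked example of how the witness changes the arithmetic for a two-node cluster (same hypothetical quorum rule as the sketch above): with the witness counted as a third voter, the quorum size becomes two, so whichever data node can still reach the witness keeps a majority.

```python
def quorum_size(cluster_size: int) -> int:
    return cluster_size // 2 + 1   # strict majority

total_voters = 2 + 1               # two data nodes + one witness
assert quorum_size(total_voters) == 2

# After a partition, the data node that can still reach the witness
# sees 2 of 3 votes and keeps serving; the isolated node steps down.
assert 2 >= quorum_size(total_voters)
assert not (1 >= quorum_size(total_voters))
```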
Note that a split-brain scenario doesn’t require a network partition. In a single-leader system, any bug or failure mode that causes two nodes to both believe they are the leader can trigger split-brain.