Split-Brain

This happens when a cluster is split (most commonly by a network partition) into two or more sub-clusters that are unaware of each other. Different requests can then be routed to different sub-clusters, causing potential consistency issues. A system using single-leader replication must ensure that two (or more) leaders can’t be elected during a network partition: all nodes must agree on a single leader, or refuse to serve traffic.
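
To illustrate the hazard, here is a toy sketch (the keys and values are made up): two sub-clusters, each unaware of the other, accept conflicting writes to the same key, and the replicas diverge.

```python
# Toy illustration of split-brain divergence. Each dict stands in for
# one sub-cluster's copy of the data; all names here are hypothetical.

sub_cluster_a = {"balance": 100}
sub_cluster_b = {"balance": 100}

# During the partition, different clients are routed to different sides.
sub_cluster_a["balance"] -= 30  # client 1's write lands on side A
sub_cluster_b["balance"] -= 80  # client 2's write lands on side B

# When the partition heals, the replicas disagree, and the system must
# decide which history wins (or how to merge them).
assert sub_cluster_a["balance"] != sub_cluster_b["balance"]
```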

At a high level, this can be mitigated in one of two ways:

  1. Require a (sub-)cluster to hold a majority of votes (otherwise known as a quorum) to continue functioning; if that condition is not met, reject incoming requests (roughly CP). A minimal sketch of this approach follows the list.
  2. Do nothing during the partition, and attempt to reconcile the divergent data once it has healed (roughly AP).
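
As referenced in item 1, here is a minimal, hypothetical sketch of quorum-gated request handling. The `Node` class, its failure-detector-fed `reachable_peers` set, and `handle` are illustrative assumptions, not any particular system’s API.

```python
class Node:
    def __init__(self, node_id: int, cluster_size: int):
        self.node_id = node_id
        self.cluster_size = cluster_size
        # Hypothetical: kept up to date by some failure detector.
        self.reachable_peers: set[int] = set()

    def has_quorum(self) -> bool:
        # Count this node plus the peers it can currently reach;
        # a strict majority of the full cluster is required.
        votes = 1 + len(self.reachable_peers)
        return votes > self.cluster_size // 2

    def handle(self, request: str) -> str:
        if not self.has_quorum():
            # Roughly CP: refuse to serve rather than risk divergence.
            raise RuntimeError("no quorum; rejecting request")
        return f"processed: {request}"


node = Node(node_id=1, cluster_size=3)
node.reachable_peers = {2}    # can reach one peer: 2 of 3 votes
print(node.has_quorum())      # True
node.reachable_peers = set()  # isolated by the partition: 1 of 3 votes
print(node.has_quorum())      # False -> handle() would now reject
```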

Two-node clusters (or even-numbered clusters generally, although larger ones are less risky) can’t form a quorum if a partition splits them evenly, so a “witness node” can be used - no storage, just a vote.
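
The arithmetic behind this is just strict-majority counting; a short worked sketch:

```python
def quorum_size(n: int) -> int:
    # Smallest strict majority of n voters.
    return n // 2 + 1

# Two data nodes alone: quorum_size(2) == 2, so a partition that
# separates them leaves neither side with a quorum.
# Adding a witness (a voting member with no data) makes n = 3:
# quorum_size(3) == 2, so whichever side retains the witness
# keeps a quorum and can continue serving.
assert quorum_size(2) == 2
assert quorum_size(3) == 2
```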

Note that a split-brain scenario doesn’t require a network partition. Any bug or failure mode that causes two nodes to each believe they are the leader (assuming a single-leader system) can trigger a split-brain situation.
