Raft

  • Raft stands for Replicated and Fault Tolerant
  • Raft is something you can build out of a collection of logs
  • Raft is something you can use to get away from the island of Paxos

Goal: Replicated Log

  • Replicated log => replicated state machine
    • All servers execute same commands in same order
  • Consensus module ensures proper log replication
  • System makes progress as long as any majority of servers are up
  • Failure model: fail-stop(not Byzantine), delayed/lost messages

Approaches to Consensus

Two possible approaches to consensus:

  • Symmetric, leader-less:
    • All servers have equal roles
    • Clients can contact any server
  • Asymmetric, leader-based:
    • At any given time, one server is in charge, others accept its decision
    • Clients communicate with the leader
  • Raft uses a leader:
    • Decomposes the problem (normal operation, leader changes)
    • Simplifies normal operation (no conflicts)
    • More efficient than leader-less approaches

Leader Election

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
                  +------------------+
| Become Candidate |
+------------------+
|
v
+---------------+ timeout
| currentTerm++ |<--------------+
| vote for self | |
+---------------+ |
| |
v |
+-----------------------+ |
| Send RequestVote RPCs |-----------+
| to other servers |
+-----------------------+
votes from majority | | RPC from Leader
v v
+----------------+ +---------+
| Become Leader, | | Become |
| send heartbeats| | Follower|
+----------------+ +---------+
  • Safety: allow at most one winner per term
    • Each server gives only one vote per term(persist on disk)
    • Majority required to win election
1
2
3
4
5
6
7
8
   B can't also                  Voted for
get majority candidate A
+-----------------+ +-------------------------+
| +-----+ +-----+ | | +-----+ +-----+ +-----+ |
| | | | | | | | | | | | | |
| +-----+ +-----+ | | +-----+ +-----+ +-----+ |
+-----------------+ +-------------------------+
Servers
  • Livebess: some candidate must eventually win
    • Choose election timeout randomly in [T, 2T] (e.g. 150-300ms)
    • One server usually times out and wins election before others time out
    • Works well if T >> broadcast time

Randomized approach simpler than ranking

Normal Operation

  • Client sends command to leader
  • Leader appends command to its log
  • Leader send AppendEntries RPCs to all followers
  • Once new entry committed
    • Leader executes command in its state machine, returns result to client
    • Leader notifies followers of committed entries in subsequent AppendEntries RPCs
    • Followers execute committed commands in their state machines
  • Crashed/slow followers
    • Leader retries AppendEntries RPCs until they succeed
  • Optimal performance in common case:
    • One successful RPC to any majority of servers

Log Structure

1
2
3
4
5
6
7
8
type Command string

type Log struct {
Term int
Command Command
}

type Logs []Log
  • Must survive crashes (store on disk)
  • Entry committed if safe to execute in state machines
    • Replicated on majority of servers by leader of its term

Log Inconsistencies

Crashes can result in log inconsistencies

  • Raft minimizes special code for repairing inconsistencies
    • Leader assumes its log is correct
    • Normal operation will repair all inconsistencies
Pieces of Valuable Programming Knowledges