Notes on a Raft Implementation (2): Log Commit

Posted: November 29, 2018; July 21, 2022
Category: Algorithms
Tags: append, commit, consensus, follower, goraft, goraft/raftd, leader, raft

Copyright notice: original article on this site, published by 萌叔
Please credit 萌叔 | https://vearne.cc when reposting

1. Introduction

In my previous post, Notes on a Raft Implementation (1), I briefly introduced goraft/raftd. In this post I use goraft's implementation to walk through some Raft scenarios.

2. Scenario 1: executing a single WriteCommand under normal conditions

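As mentioned in the previous post, WriteCommand, like NOPCommand and JoinCommand, is simply a LogEntry as far as goraft is concerned. When it is executed, the command is distributed to the whole cluster. Let's look at the process in detail.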

We currently have 3 nodes:

node	state	name	connectionString	term	lastLogIndex	commitIndex
node1	leader	2832bfa	localhost:4001	17	26	26
node2	follower	3320b68	localhost:4002	17	26	26
node3	follower	7bd5bdc	localhost:4003	17	26	26

As the table shows, the whole cluster is in a fully consistent state. Now we execute a WriteCommand.

step1 client: submits the WriteCommand through the API

curl -XPOST http://localhost:4001/db/aaa -d 'bbb'

step2 node1: on receiving the command, generates a LogEntry
1) writes it to the logfile (on disk)
2) adds it to Log.entries (in memory)

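step3 node1: waits for the next Heartbeat (sent periodically by the leader to each of the other nodes, node2 and node3), which carries the LogEntry to the other nodes. (node2 and node3 are in the same state here, so they receive identical AppendEntriesRequests.)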
AppendEntriesRequest

{
    "Term": 17,
    "PrevLogIndex": 26, 
    "PrevLogTerm": 17,
    "CommitIndex": 26,
    "LeaderName": "2832bfa",
    "Entries": [{
        "Index": 27,
        "Term": 17,
        "CommandName": "write",
        "Command": "eyJrZXkiOiJhYWEiLCJ2YWx1ZSI6ImJiYiJ9Cg=="
    }]
}

A brief note on PrevLogIndex: it is the index of the last log entry that the leader believes the follower shares with it. PrevLogTerm is the term of the entry at PrevLogIndex.
Command is base64-encoded. Decoded, it is:

{
    "key": "aaa",
    "value": "bbb"
}
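
As a quick check of the decoding step, here is a minimal Go sketch that base64-decodes the Command field from the request above and parses the JSON inside. The writeCommand type and the decodeCommand helper are names I made up for illustration; they are not goraft's own types.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// writeCommand mirrors the decoded payload shown above.
// The type name is illustrative and not taken from goraft.
type writeCommand struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// decodeCommand base64-decodes the Command field and parses the JSON inside.
func decodeCommand(encoded string) (writeCommand, error) {
	var cmd writeCommand
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return cmd, err
	}
	err = json.Unmarshal(raw, &cmd)
	return cmd, err
}

func main() {
	cmd, err := decodeCommand("eyJrZXkiOiJhYWEiLCJ2YWx1ZSI6ImJiYiJ9Cg==")
	if err != nil {
		panic(err)
	}
	fmt.Printf("key=%s value=%s\n", cmd.Key, cmd.Value)
}
```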

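Now let me explain the AppendEntriesRequest above. node1 (leader) is telling node2 (follower):

If we agree up to LogIndex 26, then append LogIndex 27.

Note: if node2's (follower) log matches what node1 (leader) believes, node2 performs the Append. Otherwise it rejects the request and reports its actual state.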
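
The check described above can be sketched roughly as follows. This is a simplified, in-memory illustration under my own assumptions, not goraft's code: the type and method names are hypothetical, terms and persistence are reduced to the bare minimum, and the follower simply drops everything after PrevLogIndex before appending.

```go
package main

import "fmt"

type entry struct {
	Index uint64
	Term  uint64
}

// appendRequest keeps only the fields needed for the consistency check.
type appendRequest struct {
	PrevLogIndex uint64
	PrevLogTerm  uint64
	Entries      []entry
}

// followerLog is a simplified in-memory log; entries[i] holds index i+1.
type followerLog struct {
	entries []entry
}

// termAt returns the term stored at index, or false if the log is too short.
func (l *followerLog) termAt(index uint64) (uint64, bool) {
	if index == 0 {
		return 0, true // the empty prefix always matches
	}
	if index > uint64(len(l.entries)) {
		return 0, false
	}
	return l.entries[index-1].Term, true
}

// handleAppend accepts the request only if the follower's log contains an
// entry at PrevLogIndex whose term equals PrevLogTerm.
func (l *followerLog) handleAppend(req appendRequest) bool {
	term, ok := l.termAt(req.PrevLogIndex)
	if !ok || term != req.PrevLogTerm {
		return false // reject; the leader must retry from an earlier index
	}
	// Simplification: drop everything after PrevLogIndex, then append.
	l.entries = append(l.entries[:req.PrevLogIndex], req.Entries...)
	return true
}

func main() {
	l := &followerLog{}
	for i := uint64(1); i <= 26; i++ {
		l.entries = append(l.entries, entry{Index: i, Term: 17})
	}
	req := appendRequest{PrevLogIndex: 26, PrevLogTerm: 17, Entries: []entry{{Index: 27, Term: 17}}}
	fmt.Println(l.handleAppend(req), len(l.entries))

	stale := appendRequest{PrevLogIndex: 30, PrevLogTerm: 17}
	fmt.Println(l.handleAppend(stale))
}
```

With the log from the table (26 entries in term 17), the first request is accepted and the log grows to 27 entries; the second is rejected because the follower has no entry at index 30.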

step4 node2: on receiving the AppendEntriesRequest, extracts the LogEntry
1) writes it to the logfile (on disk)
2) adds it to Log.entries (in memory)
then sends an AppendEntriesResponse

{
    "Term": 17,
    "Index": 27, 
    "CommitIndex": 26,
    "Success": true
}

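step5 node1: once it receives the AppendEntriesResponse from node2, 2 of the cluster's 3 nodes (node1 itself and node2) have completed the Append. That is a majority, so the Commit can proceed.
node1 inserts {"key":"aaa","value":"bbb"} into its in-memory database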
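
One common way to turn "a majority has appended" into a commit index is to sort the last log index known for every node and take the value at the quorum position. The sketch below shows that idea in isolation. quorumCommitIndex is a hypothetical helper of mine, not a goraft function, and it leaves out Raft's additional rule that a leader only commits entries from its current term by counting replicas.

```go
package main

import (
	"fmt"
	"sort"
)

// quorumCommitIndex returns the highest log index that is stored on a
// majority of nodes. lastIndexes holds one value per cluster member,
// including the leader itself.
func quorumCommitIndex(lastIndexes []uint64) uint64 {
	if len(lastIndexes) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), lastIndexes...)
	// Sort in descending order so that position quorum-1 is the largest
	// index reached by at least `quorum` nodes.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// node1 (leader) and node2 hold index 27; node3 has not answered yet.
	fmt.Println(quorumCommitIndex([]uint64{27, 27, 26})) // 27
	// Only the leader holds index 27, so nothing new can be committed.
	fmt.Println(quorumCommitIndex([]uint64{27, 26, 26})) // 26
}
```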

step6 node1: waits for the next Heartbeat to notify the other nodes of the Commit.
AppendEntriesRequest

{
    "Term": 17,
    "PrevLogIndex": 27,
    "PrevLogTerm": 17,
    "CommitIndex": 27, 
    "LeaderName": "2832bfa",
    "Entries": null
}

step7 node2: on receiving the AppendEntriesRequest,
inserts {"key":"aaa","value":"bbb"} into its own in-memory database
and returns an AppendEntriesResponse

{
    "Term": 17,
    "Index": 27,
    "CommitIndex": 27,
    "Success": true
}
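
What node2 does in step7 can be sketched as follows: when the leader's CommitIndex is ahead of the follower's, the follower applies every not-yet-committed entry up to that index to its in-memory database. This is only an illustration under my own assumptions; the node type and advanceCommit method are hypothetical names, not goraft's.

```go
package main

import "fmt"

type logEntry struct {
	Index uint64
	Key   string
	Value string
}

// node is a simplified follower: log[i] holds the entry with index i+1.
type node struct {
	log         []logEntry
	commitIndex uint64
	db          map[string]string // the in-memory database
}

// advanceCommit applies every entry in (commitIndex, leaderCommit] to db.
// It never goes past the end of the local log.
func (n *node) advanceCommit(leaderCommit uint64) {
	last := uint64(len(n.log))
	if leaderCommit > last {
		leaderCommit = last
	}
	for n.commitIndex < leaderCommit {
		e := n.log[n.commitIndex] // entry with index commitIndex+1
		n.db[e.Key] = e.Value
		n.commitIndex++
	}
}

func main() {
	n := &node{db: map[string]string{}}
	// 26 placeholder entries that are already committed, then the new write.
	for i := uint64(1); i <= 26; i++ {
		n.log = append(n.log, logEntry{Index: i, Key: "old", Value: "x"})
	}
	n.commitIndex = 26
	n.log = append(n.log, logEntry{Index: 27, Key: "aaa", Value: "bbb"})

	n.advanceCommit(27) // CommitIndex carried by the heartbeat in step6
	fmt.Println(n.commitIndex, n.db["aaa"])
}
```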

Only after all of these steps finish has the WriteCommand been fully executed across the cluster. You can then use

curl http://localhost:4002/db/aaa

to check the value stored under aaa.
node3 behaves exactly like node2, so it is not covered separately.

Full log

3. Analysis

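It is clear that Append and Commit are two entirely separate actions.
As long as the client waits until the leader has completed the Commit, the Raft protocol guarantees that the client's change remains in effect even if the leader later changes or some nodes crash.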

4. Afterword

Raft is quite complex, so Notes on a Raft Implementation may turn into a fairly long series.

April 2, 2021
The in-memory database mentioned above is called the state machine in some articles.

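Readers should also note that the leader first writes the data to its state machine and only then notifies the followers that the log entry has been committed.
So there is an ordering here, and at some moment in between,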

curl -XGET http://localhost:4001/db/aaa

curl -XGET http://localhost:4002/db/aaa

may return different results.
To get strongly consistent reads from a Raft cluster, all reads must be served by the leader.

Consul's Consistency Modes

default - If not specified, the default is strongly consistent in almost all cases. However, there is a small window in which a new leader may be elected during which the old leader may service stale values. The trade-off is fast reads but potentially stale values. The condition resulting in stale reads is hard to trigger, and most clients should not need to worry about this case. Also, note that this race condition only applies to reads, not writes.

consistent - This mode is strongly consistent without caveats. It requires that a leader verify with a quorum of peers that it is still leader. This introduces an additional round-trip to all server nodes. The trade-off is increased latency due to an extra round trip. Most clients should not use this unless they cannot tolerate a stale read.

References

Consistency Modes
