Summary:
The test team reported that a failover did not complete as they expected. This post records the analysis of the problem and its resolution.
Problem description:
Cluster topology:
Redis Cluster: 4 masters, 4 slaves
Steps performed by the test team:
- The cluster was created correctly
- 172.16.0.12 is a master, hereafter "0.12"
- 172.16.0.7 is the slave of 0.12, hereafter "0.7"
- The tester killed the pod of 0.12, so that 0.12 would be judged to be down
- The tester then restarted the pod of 0.12, expecting:
- 0.7 to complete a failover and become the new master
- 0.12 to become a slave of 0.7
Actual result:
- 0.12 was still the master
- 0.7 was still a slave of 0.12
The test team considered this behavior incorrect and asked for a detailed analysis.
Analysis:
Approach:
- Before explaining why the result differs from the expectation, we need to understand how failover works in Redis Cluster:
- which parameters have which effects (see the sketch after this list)
- how the cluster state machine transitions during a failover
- whether a failover, once started, can still be abandoned
- Then read the logs carefully, reconstruct the sequence of events, and understand how the cluster as a whole handled this failover
- Starting from the cluster's handling of the failover, explain the phenomenon and find the essence behind it
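For orientation, the failure-detection side of these parameters can be summarized in a few lines of C. This is a minimal standalone sketch modeled on the subjective-down (PFAIL) check in Redis 5.0's clusterCron(); the variable names and timestamps are illustrative, not the actual server structures:

#include <stdio.h>

typedef long long mstime_t;

int main(void) {
    mstime_t cluster_node_timeout = 15000; /* ms; the redis.conf default */
    mstime_t now = 100000;                 /* hypothetical current time */
    mstime_t ping_sent = 80000;            /* when we last pinged the peer */

    mstime_t delay = now - ping_sent;
    if (delay > cluster_node_timeout) {
        /* No PONG within cluster-node-timeout: mark the peer PFAIL.
         * PFAIL is only this node's local opinion; promoting it to FAIL
         * requires a quorum of masters, as shown later in this post. */
        printf("peer marked PFAIL after %lld ms of silence\n", delay);
    } else {
        printf("peer still considered healthy\n");
    }
    return 0;
}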
Node-to-IP mapping (output of cluster nodes):
a27e9975e23182d986d0e607124608e60cf7e34c 172.16.0.9:6379@16379 master - 0 1625036845000 3 connected 8192-12287
11e6c61fb9c8f21a6173ff7b28db33fbf4076935 172.16.0.14:6379@16379 slave a27e9975e23182d986d0e607124608e60cf7e34c 0 1625036845000 8 connected
2b61ab2d905bdf849053800ef1221c0d6908d421 172.16.0.8:6379@16379 master - 0 1625036847996 2 connected 4096-8191
11fd46e959051892a1c850ffa5e5360eec89b8ac 172.16.0.13:6379@16379 slave 2b61ab2d905bdf849053800ef1221c0d6908d421 0 1625036846993 7 connected
2a97e5e42e5fe226b535d309630fd9b8dde86fb3 172.16.0.10:6379@16379 master - 0 1625036845000 4 connected 12288-16383
8d8e158ce96fe0527edff9681c52ca5859becfe9 172.16.0.7:6379@16379 myself,slave d6f53105af7ef908f67357b33b6fc16fdda3ff5d 0 1625036842000 1 connected
d6321b72076237d0d8f26f42e9e05692f0f8bbd2 172.16.0.11:6379@16379 slave 2a97e5e42e5fe226b535d309630fd9b8dde86fb3 0 1625036846000 5 connected
d6f53105af7ef908f67357b33b6fc16fdda3ff5d 172.16.0.12:6379@16379 master - 0 1625036847000 9 connected 0-4095
Analysis (the two rows relevant to this incident):
- 8d8e158ce96fe0527edff9681c52ca5859becfe9 172.16.0.7:6379@16379 myself,slave d6f53105af7ef908f67357b33b6fc16fdda3ff5d 0 1625036842000 1 connected
- d6f53105af7ef908f67357b33b6fc16fdda3ff5d 172.16.0.12:6379@16379 master - 0 1625036847000 9 connected 0-4095
- That is: this output was taken on 0.7 ("myself"), which is still a slave of 0.12, and 0.12 is still the master serving slots 0-4095: exactly the unexpected state reported by the test team
Log of 172.16.0.12 (the master):
29:C 29 Jun 2021 11:31:59.536 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
29:C 29 Jun 2021 11:31:59.536 # Redis version=5.0.12, bits=64, commit=00000000, modified=0, pid=29, just started
29:C 29 Jun 2021 11:31:59.536 # Configuration loaded
29:M 29 Jun 2021 11:31:59.536 * No cluster configuration found, I'm d6f53105af7ef908f67357b33b6fc16fdda3ff5d
29:M 29 Jun 2021 11:31:59.537 * Running mode=cluster, port=6379.
29:M 29 Jun 2021 11:31:59.537 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
29:M 29 Jun 2021 11:31:59.537 # Server initialized
29:M 29 Jun 2021 11:31:59.537 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
29:M 29 Jun 2021 11:31:59.538 * Ready to accept connections
29:M 29 Jun 2021 11:31:59.538 * The server is now ready to accept connections at redis.sock
29:M 29 Jun 2021 11:33:57.217 # configEpoch set to 6 via CLUSTER SET-CONFIG-EPOCH
29:M 29 Jun 2021 11:33:57.316 # IP address for this node updated to 172.16.0.12
29:S 29 Jun 2021 11:33:59.224 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
29:S 29 Jun 2021 11:33:59.224 # Cluster state changed: ok
29:S 29 Jun 2021 11:33:59.826 * Connecting to MASTER 172.16.0.7:6379
29:S 29 Jun 2021 11:33:59.826 * MASTER <-> REPLICA sync started
29:S 29 Jun 2021 11:33:59.826 * Non blocking connect for SYNC fired the event.
29:S 29 Jun 2021 11:33:59.826 * Master replied to PING, replication can continue...
29:S 29 Jun 2021 11:33:59.826 * Trying a partial resynchronization (request 4f8a9aac12f7c6543a1be464fe5f8064460fdbfc:1).
29:S 29 Jun 2021 11:34:05.737 * Full resync from master: d32c2238eecf5bed700410eb6ffa292046ea786b:0
29:S 29 Jun 2021 11:34:05.737 * Discarding previously cached master state.
29:S 29 Jun 2021 11:34:05.738 * MASTER <-> REPLICA sync: receiving streamed RDB from master
29:S 29 Jun 2021 11:34:05.738 * MASTER <-> REPLICA sync: Flushing old data
29:S 29 Jun 2021 11:34:05.738 * MASTER <-> REPLICA sync: Loading DB in memory
29:S 29 Jun 2021 11:34:05.738 * MASTER <-> REPLICA sync: Finished with success
29:S 29 Jun 2021 11:34:05.739 * Background append only file rewriting started by pid 57
29:S 29 Jun 2021 11:34:05.761 * AOF rewrite child asks to stop sending diffs.
57:C 29 Jun 2021 11:34:05.762 * Parent agreed to stop sending diffs. Finalizing AOF...
57:C 29 Jun 2021 11:34:05.762 * Concatenating 0.00 MB of AOF diff received from parent.
57:C 29 Jun 2021 11:34:05.762 * SYNC append only file rewrite performed
57:C 29 Jun 2021 11:34:05.762 * AOF rewrite: 10 MB of memory used by copy-on-write
29:S 29 Jun 2021 11:34:05.842 * Background AOF rewrite terminated with success
29:S 29 Jun 2021 11:34:05.842 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
29:S 29 Jun 2021 11:34:05.842 * Background AOF rewrite finished successfully
29:S 30 Jun 2021 03:42:44.846 # Connection with master lost.
29:S 30 Jun 2021 03:42:44.846 * Caching the disconnected master state.
29:S 30 Jun 2021 03:42:45.836 * Connecting to MASTER 172.16.0.7:6379
29:S 30 Jun 2021 03:42:45.837 * MASTER <-> REPLICA sync started
29:S 30 Jun 2021 03:43:01.911 * FAIL message received from a27e9975e23182d986d0e607124608e60cf7e34c about 8d8e158ce96fe0527edff9681c52ca5859becfe9
29:S 30 Jun 2021 03:43:01.971 # Start of election delayed for 699 milliseconds (rank #0, offset 81172).
29:S 30 Jun 2021 03:43:02.673 # Starting a failover election for epoch 9.
29:S 30 Jun 2021 03:43:02.675 # Failover election won: I'm the new master.
29:S 30 Jun 2021 03:43:02.675 # configEpoch set to 9 after successful failover
29:M 30 Jun 2021 03:43:02.675 # Setting secondary replication ID to d32c2238eecf5bed700410eb6ffa292046ea786b, valid up to offset: 81173. New replication ID is fdfbea1397b6d1ca8585367ff20f5376fa65a6ca
29:M 30 Jun 2021 03:43:02.675 * Discarding previously cached master state.
29:M 30 Jun 2021 04:43:03.768 * Replication backlog freed after 3600 seconds without connected replicas.
29:M 30 Jun 2021 05:54:27.941 * Clear FAIL state for node 8d8e158ce96fe0527edff9681c52ca5859becfe9: master without slots is reachable again.
29:M 30 Jun 2021 05:54:28.879 * Replica 172.16.0.7:6379 asks for synchronization
29:M 30 Jun 2021 05:54:28.879 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '627e69dd4ff1a96ad775511f8befe17a3cf67918', my replication IDs are '7dda72b4574fb135e9921ad4c1dda9855e2fd4ee' and '0000000000000000000000000000000000000000')
29:M 30 Jun 2021 05:54:28.879 * Delay next BGSAVE for diskless SYNC
29:M 30 Jun 2021 05:54:34.361 * Starting BGSAVE for SYNC with target: replicas sockets
29:M 30 Jun 2021 05:54:34.363 * Background RDB transfer started by pid 69
69:C 30 Jun 2021 05:54:34.364 * RDB: 12 MB of memory used by copy-on-write
29:M 30 Jun 2021 05:54:34.463 * Background RDB transfer terminated with success
29:M 30 Jun 2021 05:54:34.463 # Slave 172.16.0.7:6379 correctly received the streamed RDB file.
29:M 30 Jun 2021 05:54:34.463 * Streamed RDB transfer with replica 172.16.0.7:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
29:M 30 Jun 2021 05:54:34.890 * Synchronization with replica 172.16.0.7:6379 succeeded
34:C 30 Jun 2021 06:38:11.580 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
34:C 30 Jun 2021 06:38:11.580 # Redis version=5.0.12, bits=64, commit=00000000, modified=0, pid=34, just started
34:C 30 Jun 2021 06:38:11.580 # Configuration loaded
34:M 30 Jun 2021 06:38:11.581 * Node configuration loaded, I'm d6f53105af7ef908f67357b33b6fc16fdda3ff5d
34:M 30 Jun 2021 06:38:11.581 * Running mode=cluster, port=6379.
34:M 30 Jun 2021 06:38:11.581 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
34:M 30 Jun 2021 06:38:11.581 # Server initialized
34:M 30 Jun 2021 06:38:11.581 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
34:M 30 Jun 2021 06:38:11.582 * Reading RDB preamble from AOF file...
34:M 30 Jun 2021 06:38:11.582 * Reading the remaining AOF tail...
34:M 30 Jun 2021 06:38:11.582 * DB loaded from append only file: 0.000 seconds
34:M 30 Jun 2021 06:38:11.582 * Ready to accept connections
34:M 30 Jun 2021 06:38:11.582 * The server is now ready to accept connections at redis.sock
34:M 30 Jun 2021 06:38:13.588 # Cluster state changed: ok
34:M 30 Jun 2021 06:38:14.300 # Failover auth denied to 8d8e158ce96fe0527edff9681c52ca5859becfe9: its master is up
34:M 30 Jun 2021 06:38:24.348 * Replica 172.16.0.7:6379 asks for synchronization
34:M 30 Jun 2021 06:38:24.349 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'f692b1befe939778f81fc31ae6fe9dac461e91b7', my replication IDs are '4aaf5ab94b22f1cdb5678467c15f56f9c188cdcb' and '0000000000000000000000000000000000000000')
34:M 30 Jun 2021 06:38:24.349 * Delay next BGSAVE for diskless SYNC
34:M 30 Jun 2021 06:38:30.632 * Starting BGSAVE for SYNC with target: replicas sockets
34:M 30 Jun 2021 06:38:30.633 * Background RDB transfer started by pid 52
52:C 30 Jun 2021 06:38:30.635 * RDB: 10 MB of memory used by copy-on-write
34:M 30 Jun 2021 06:38:30.733 * Background RDB transfer terminated with success
34:M 30 Jun 2021 06:38:30.733 # Slave 172.16.0.7:6379 correctly received the streamed RDB file.
34:M 30 Jun 2021 06:38:30.733 * Streamed RDB transfer with replica 172.16.0.7:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
34:M 30 Jun 2021 06:38:31.629 * Synchronization with replica 172.16.0.7:6379 succeeded
Analysis:
- The 0.12 node was started twice in this log: pid 29 on 29 Jun at 11:31:59, and pid 34 on 30 Jun at 06:38:11; the second start came only about 20 seconds after the pod was killed (0.7 lost the connection at 06:37:51)
- In between, on 30 Jun at 03:43, 0.12 won the failover election for epoch 9 and became the master; it then accepted 0.7 as its replica
- After both of 0.7's reconnections (05:54 and 06:38), 0.7's PSYNC request was rejected with "Partial resynchronization not accepted: Replication ID mismatch", so a full resynchronization was performed both times (see the PSYNC sketch after this list); the original claim that the second sync was an incremental psync does not match the log
- At 06:38:14.300, 0.12 denied the failover auth request of the slave with id 8d8e158ce96fe0527edff9681c52ca5859becfe9, on the grounds that this slave's master (0.12 itself) was up
- As a master node, 0.12 takes part in the voting; this is analyzed in detail below
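Why was the partial resynchronization rejected both times? The sketch below is a simplified, hypothetical model of the decision a master makes when answering PSYNC (modeled on masterTryPartialResynchronization() in Redis's replication.c; the struct and values are illustrative, not the real server structures). A master that has restarted or failed over advertises a replication ID the replica does not know, so the request falls through to a full resync:

#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified model of the PSYNC decision: a master grants a
 * partial resync only if the replica asks for one of the master's two
 * replication IDs and the requested offset is still inside the backlog. */
typedef struct {
    const char *replid;             /* current replication ID */
    const char *replid2;            /* previous ID, honored only up to... */
    long long second_replid_offset; /* ...this offset */
    long long backlog_off;          /* first offset still in the backlog */
    long long backlog_histlen;      /* bytes of history kept */
} master_state;

static int accept_partial(const master_state *m,
                          const char *req_id, long long req_off) {
    int id_ok = strcmp(req_id, m->replid) == 0 ||
                (strcmp(req_id, m->replid2) == 0 &&
                 req_off <= m->second_replid_offset);
    int off_ok = req_off >= m->backlog_off &&
                 req_off <= m->backlog_off + m->backlog_histlen;
    return id_ok && off_ok;
}

int main(void) {
    /* In this incident the restarted 0.12 answered with a brand new
     * replication ID, so 0.7's request with the old ID got a full resync. */
    master_state m = { "id-after-restart", "no-previous-id", 0, 1, 0 };
    printf("partial accepted: %d\n",
           accept_partial(&m, "id-before-restart", 84799));
    return 0;
}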
Log of 172.16.0.7 (the slave):
29:C 30 Jun 2021 05:54:27.872 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
29:C 30 Jun 2021 05:54:27.872 # Redis version=5.0.12, bits=64, commit=00000000, modified=0, pid=29, just started
29:C 30 Jun 2021 05:54:27.872 # Configuration loaded
29:M 30 Jun 2021 05:54:27.873 * Node configuration loaded, I'm 8d8e158ce96fe0527edff9681c52ca5859becfe9
29:M 30 Jun 2021 05:54:27.873 * Running mode=cluster, port=6379.
29:M 30 Jun 2021 05:54:27.873 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
29:M 30 Jun 2021 05:54:27.873 # Server initialized
29:M 30 Jun 2021 05:54:27.873 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
29:M 30 Jun 2021 05:54:27.873 * Ready to accept connections
29:M 30 Jun 2021 05:54:27.873 * The server is now ready to accept connections at redis.sock
29:M 30 Jun 2021 05:54:27.876 # Configuration change detected. Reconfiguring myself as a replica of d6f53105af7ef908f67357b33b6fc16fdda3ff5d
29:S 30 Jun 2021 05:54:27.876 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
29:S 30 Jun 2021 05:54:27.876 # Cluster state changed: ok
29:S 30 Jun 2021 05:54:28.878 * Connecting to MASTER 172.16.0.12:6379
29:S 30 Jun 2021 05:54:28.878 * MASTER <-> REPLICA sync started
29:S 30 Jun 2021 05:54:28.878 * Non blocking connect for SYNC fired the event.
29:S 30 Jun 2021 05:54:28.878 * Master replied to PING, replication can continue...
29:S 30 Jun 2021 05:54:28.879 * Trying a partial resynchronization (request 627e69dd4ff1a96ad775511f8befe17a3cf67918:1).
29:S 30 Jun 2021 05:54:34.362 * Full resync from master: f692b1befe939778f81fc31ae6fe9dac461e91b7:81172
29:S 30 Jun 2021 05:54:34.362 * Discarding previously cached master state.
29:S 30 Jun 2021 05:54:34.364 * MASTER <-> REPLICA sync: receiving streamed RDB from master
29:S 30 Jun 2021 05:54:34.364 * MASTER <-> REPLICA sync: Flushing old data
29:S 30 Jun 2021 05:54:34.364 * MASTER <-> REPLICA sync: Loading DB in memory
29:S 30 Jun 2021 05:54:34.364 * MASTER <-> REPLICA sync: Finished with success
29:S 30 Jun 2021 05:54:34.365 * Background append only file rewriting started by pid 52
29:S 30 Jun 2021 05:54:34.387 * AOF rewrite child asks to stop sending diffs.
52:C 30 Jun 2021 05:54:34.387 * Parent agreed to stop sending diffs. Finalizing AOF...
52:C 30 Jun 2021 05:54:34.387 * Concatenating 0.00 MB of AOF diff received from parent.
52:C 30 Jun 2021 05:54:34.387 * SYNC append only file rewrite performed
52:C 30 Jun 2021 05:54:34.387 * AOF rewrite: 4 MB of memory used by copy-on-write
29:S 30 Jun 2021 05:54:34.389 * Background AOF rewrite terminated with success
29:S 30 Jun 2021 05:54:34.389 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
29:S 30 Jun 2021 05:54:34.389 * Background AOF rewrite finished successfully
29:S 30 Jun 2021 06:37:51.817 # Connection with master lost.
29:S 30 Jun 2021 06:37:51.818 * Caching the disconnected master state.
29:S 30 Jun 2021 06:37:52.531 * Connecting to MASTER 172.16.0.12:6379
29:S 30 Jun 2021 06:37:52.531 * MASTER <-> REPLICA sync started
29:S 30 Jun 2021 06:38:09.074 * Marking node d6f53105af7ef908f67357b33b6fc16fdda3ff5d as failing (quorum reached).
29:S 30 Jun 2021 06:38:09.169 # Start of election delayed for 574 milliseconds (rank #0, offset 84798).
29:S 30 Jun 2021 06:38:09.769 # Starting a failover election for epoch 10.
29:S 30 Jun 2021 06:38:24.348 * Non blocking connect for SYNC fired the event.
29:S 30 Jun 2021 06:38:24.348 * Master replied to PING, replication can continue...
29:S 30 Jun 2021 06:38:24.348 * Trying a partial resynchronization (request f692b1befe939778f81fc31ae6fe9dac461e91b7:84799).
29:S 30 Jun 2021 06:38:29.121 # Currently unable to failover: Waiting for votes, but majority still not reached.
29:S 30 Jun 2021 06:38:30.632 * Full resync from master: 9e530414a7ef697d78daec46d5586640ef8e39c1:0
29:S 30 Jun 2021 06:38:30.632 * Discarding previously cached master state.
29:S 30 Jun 2021 06:38:30.634 * MASTER <-> REPLICA sync: receiving streamed RDB from master
29:S 30 Jun 2021 06:38:30.635 * MASTER <-> REPLICA sync: Flushing old data
29:S 30 Jun 2021 06:38:30.635 * MASTER <-> REPLICA sync: Loading DB in memory
29:S 30 Jun 2021 06:38:30.635 * MASTER <-> REPLICA sync: Finished with success
29:S 30 Jun 2021 06:38:30.636 * Background append only file rewriting started by pid 1644
29:S 30 Jun 2021 06:38:30.658 * AOF rewrite child asks to stop sending diffs.
1644:C 30 Jun 2021 06:38:30.658 * Parent agreed to stop sending diffs. Finalizing AOF...
1644:C 30 Jun 2021 06:38:30.658 * Concatenating 0.00 MB of AOF diff received from parent.
1644:C 30 Jun 2021 06:38:30.658 * SYNC append only file rewrite performed
1644:C 30 Jun 2021 06:38:30.659 * AOF rewrite: 6 MB of memory used by copy-on-write
29:S 30 Jun 2021 06:38:30.726 * Background AOF rewrite terminated with success
29:S 30 Jun 2021 06:38:30.726 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
29:S 30 Jun 2021 06:38:30.727 * Background AOF rewrite finished successfully
29:S 30 Jun 2021 06:38:39.746 # Currently unable to failover: Failover attempt expired.
29:S 30 Jun 2021 06:38:42.053 * Clear FAIL state for node d6f53105af7ef908f67357b33b6fc16fdda3ff5d: is reachable again and nobody is serving its slots after some time.
Analysis:
- 0.7 is a slave; it initiates the replication connection to master 0.12
- there is no error about master 0.12 in 0.7's log, only a lost connection
- 0.7 actively connected to master 0.12 twice (05:54:28 and 06:37:52)
- this shows that the socket connection between slave 0.7 and master 0.12 was interrupted once in between, when the pod was killed
- 0.7 did start a failover election for epoch 10 at 06:38:09.769, but the majority was never reached and the attempt expired at 06:38:39.746, roughly 30 seconds later; the timing sketch below shows where that window comes from
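Where does the roughly 30-second voting window come from? Below is a minimal sketch of the arithmetic, following the timing computed in Redis 5.0's clusterHandleSlaveFailover(); the timeout value is an assumption (this deployment appears to use the 15000 ms default, since 2 * 15000 ms matches the log above):

#include <stdio.h>

typedef long long mstime_t;

int main(void) {
    mstime_t cluster_node_timeout = 15000; /* ms; assumed default */

    /* Window during which the replica waits for votes; after it elapses
     * the attempt is logged as "Failover attempt expired". */
    mstime_t auth_timeout = cluster_node_timeout * 2;
    if (auth_timeout < 2000) auth_timeout = 2000;

    /* Minimum wait before the replica may start another election. */
    mstime_t auth_retry_time = auth_timeout * 2;

    printf("vote window: %lld ms, retry after: %lld ms\n",
           auth_timeout, auth_retry_time);
    return 0;
}

With the assumed default of 15000 ms this yields a 30-second window, which matches the gap between "Starting a failover election for epoch 10." (06:38:09.769) and "Currently unable to failover: Failover attempt expired." (06:38:39.746) in 0.7's log.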
Log of one of the other masters, voting in the Acceptor role:
28:M 30 Jun 2021 05:54:31.796 * Clear FAIL state for node 8d8e158ce96fe0527edff9681c52ca5859becfe9: replica is reachable again.
28:M 30 Jun 2021 06:38:09.770 # Failover auth denied to 8d8e158ce96fe0527edff9681c52ca5859becfe9: its master is up
28:M 30 Jun 2021 06:38:11.080 * FAIL message received from 2b61ab2d905bdf849053800ef1221c0d6908d421 about d6f53105af7ef908f67357b33b6fc16fdda3ff5d
28:M 30 Jun 2021 06:38:43.938 * Clear FAIL state for node d6f53105af7ef908f67357b33b6fc16fdda3ff5d: is reachable again and nobody is serving its slots after some time.
Analysis:
- The failover auth request from node 8d8e158ce96fe0527edff9681c52ca5859becfe9 was denied because that slave's master was considered up
- Note the ordering: the vote was denied at 06:38:09.770, while this acceptor only received the FAIL message about d6f53105af7ef908f67357b33b6fc16fdda3ff5d at 06:38:11.080; at voting time, from this acceptor's point of view, the requesting slave's master had not (yet) failed
Log summary:
- 0.12 was indeed stopped once, but the election votes requested by its slave 0.7 were denied by the Acceptors
- the reason is that 0.12 came back online so quickly that, by the time the votes were requested, the Acceptors either had not yet seen it as failed or already saw it as reachable again
Code analysis:
To understand the cause thoroughly, we need to look at how the code handles the vote.
cluster.c:
/* Vote for the node asking for our vote if there are the conditions. */
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
    clusterNode *master = node->slaveof;
    uint64_t requestCurrentEpoch = ntohu64(request->currentEpoch);
    uint64_t requestConfigEpoch = ntohu64(request->configEpoch);
    unsigned char *claimed_slots = request->myslots;
    int force_ack = request->mflags[0] & CLUSTERMSG_FLAG0_FORCEACK;
    int j;

    /* IF we are not a master serving at least 1 slot, we don't have the
     * right to vote, as the cluster size in Redis Cluster is the number
     * of masters serving at least one slot, and quorum is the cluster
     * size + 1 */
    if (nodeIsSlave(myself) || myself->numslots == 0) return;

    /* Request epoch must be >= our currentEpoch.
     * Note that it is impossible for it to actually be greater since
     * our currentEpoch was updated as a side effect of receiving this
     * request, if the request epoch was greater. */
    if (requestCurrentEpoch < server.cluster->currentEpoch) {
        serverLog(LL_WARNING,
            "Failover auth denied to %.40s: reqEpoch (%llu) < curEpoch(%llu)",
            node->name,
            (unsigned long long) requestCurrentEpoch,
            (unsigned long long) server.cluster->currentEpoch);
        return;
    }

    /* I already voted for this epoch? Return ASAP. */
    if (server.cluster->lastVoteEpoch == server.cluster->currentEpoch) {
        serverLog(LL_WARNING,
                "Failover auth denied to %.40s: already voted for epoch %llu",
                node->name,
                (unsigned long long) server.cluster->currentEpoch);
        return;
    }

    /* Node must be a slave and its master down.
     * The master can be non failing if the request is flagged
     * with CLUSTERMSG_FLAG0_FORCEACK (manual failover). */
    if (nodeIsMaster(node) || master == NULL ||
        (!nodeFailed(master) && !force_ack))
    {
        if (nodeIsMaster(node)) {
            serverLog(LL_WARNING,
                    "Failover auth denied to %.40s: it is a master node",
                    node->name);
        } else if (master == NULL) {
            serverLog(LL_WARNING,
                    "Failover auth denied to %.40s: I don't know its master",
                    node->name);
        } else if (!nodeFailed(master)) {
            serverLog(LL_WARNING,
                    "Failover auth denied to %.40s: its master is up",
                    node->name);
        }
        return;
    }

    /* ... rest of the checks omitted in the original excerpt ... */
Analysis:
- As the code shows, when an Acceptor votes, it checks whether the master of the slave requesting the failover is currently considered failed (the nodeFailed(master) check); if that master is up, the vote is denied with exactly the "its master is up" message seen in the logs
- So how does an Acceptor decide that a master has failed? The flow is as follows (see the sketch after this list):
- if a node does not respond for longer than cluster-node-timeout, it is marked as subjectively down (PFAIL)
- once more than half of the masters consider the node subjectively down, a node receiving these reports via the gossip protocol marks it as objectively down (FAIL)
- the FAIL state is then broadcast through the gossip protocol
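A minimal sketch of the PFAIL-to-FAIL promotion just described, modeled on markNodeAsFailingIfNeeded() in cluster.c (the counts are illustrative; in this 4-master cluster the quorum is 3):

#include <stdio.h>

int main(void) {
    int cluster_size = 4;     /* masters serving at least one slot */
    int failure_reports = 2;  /* other masters that gossiped a PFAIL */
    int i_see_pfail = 1;      /* this master's own PFAIL flag for the node */

    /* A master counts its own opinion on top of the gossiped reports. */
    int failures = failure_reports + (i_see_pfail ? 1 : 0);
    int needed_quorum = cluster_size / 2 + 1;

    if (failures >= needed_quorum)
        printf("promote PFAIL to FAIL and broadcast it via gossip\n");
    else
        printf("only %d of %d needed reports\n", failures, needed_quorum);
    return 0;
}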
Conclusion:
- As the analysis shows, Redis Cluster has built-in safeguards against spurious and frequent failovers: a failover only completes if the old master is still considered failed at voting time
- the cluster-node-timeout setting must therefore be tuned carefully (see the estimate below)
- cluster state converges through the gossip protocol, and that propagation takes time
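To make the tuning trade-off concrete, here is a back-of-envelope estimate (the gossip figure is a rough assumption) of the minimum time a failover needs once a master dies: PFAIL detection, gossip propagation to a quorum, and then the election delay of 500 ms + random()%500 + rank*1000 that clusterHandleSlaveFailover() applies before requesting votes. If the failed master restarts and rejoins in around this time or less, the votes are denied and the failover is abandoned, which is exactly what happened in this incident:

#include <stdio.h>

typedef long long mstime_t;

int main(void) {
    mstime_t cluster_node_timeout = 15000; /* ms; assumed default */
    mstime_t gossip_propagation   = 1000;  /* rough guess, cluster-dependent */
    mstime_t fixed_delay          = 500;   /* fixed pre-election delay */
    mstime_t max_random_delay     = 500;   /* plus random() % 500 */
    int rank = 0;                          /* most up-to-date replica */

    mstime_t estimate = cluster_node_timeout + gossip_propagation +
                        fixed_delay + max_random_delay + rank * 1000;
    printf("a failover needs at least ~%lld ms; a master rejoining\n"
           "within this window can cause the election to be abandoned\n",
           estimate);
    return 0;
}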