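Analysis of MySQL Two-Phase Commit and Group Commit

The analysis below is based on the MySQL 5.7.21 source code.

1. Complete two-phase commit and group commit sequence diagram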

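sync_binlog=0: the server never calls fsync() on the binary log itself. When the binlog in the file system cache reaches disk is decided entirely by the operating system.

sync_binlog=1: fsync() is called at every transaction commit, which gives the strongest data safety guarantee.

sync_binlog=N: the binlog is fsynced only after a certain number of binlog commits has accumulated. If the database crashes, up to N-1 transactions may be lost.

innodb_flush_log_at_trx_commit=0: the contents of the redo log buffer are written to the redo log in the file system cache about once per second and fsynced to the redo log file on disk at the same time. Nothing is written at commit time.

innodb_flush_log_at_trx_commit=1: the contents of the redo log buffer are written to the redo log in the file system cache at transaction commit and fsynced to the redo log file on disk at the same time.

innodb_flush_log_at_trx_commit=2: the contents of the redo log buffer are written to the redo log in the file system cache at transaction commit, while the cached redo log is fsynced to the redo log file on disk about once per second.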
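
The sync_binlog settings above can be pictured as a small counter. The sketch below is a toy model, not MySQL code: the class name BinlogSyncPolicy and its method are hypothetical, and it assumes the counter advances once per binlog commit (or commit group) that has been written.

```python
class BinlogSyncPolicy:
    """Toy model of how sync_binlog decides when to fsync (not MySQL code)."""

    def __init__(self, sync_binlog: int):
        self.sync_binlog = sync_binlog
        self.pending = 0  # binlog commits written but not yet fsynced

    def on_commit_written(self) -> bool:
        """Return True if this commit should trigger an fsync."""
        if self.sync_binlog == 0:
            return False  # leave flushing to the operating system
        self.pending += 1
        if self.pending >= self.sync_binlog:
            self.pending = 0
            return True
        return False


if __name__ == "__main__":
    policy = BinlogSyncPolicy(3)
    print([policy.on_commit_written() for _ in range(6)])
```

With sync_binlog=3 the model fsyncs on every third commit, so the two commits written just before a crash are the ones at risk.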

2. Two-phase commit

2.1 Background of two-phase commit

Since MySQL supports PSEA (pluggable storage engine architecture), more than one transactional engine can be active at a time. Hence transactions, from the server point of view, are always distributed. In particular, transactional state is maintained independently for each engine. In order to commit a transaction the two phase commit protocol is employed.

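In addition, two-phase commit keeps the binlog write order consistent with the redo log write order. The binlog is the bridge between master and slave, so if the two orders diverged, master and slave data could become inconsistent after crash recovery.

2.2 Basic flow of the two-phase commit implementation

The binlog is both a participant in the two-phase commit and its coordinator, which is why, in the source code, the entry points of both the prepare phase and the commit phase are in MYSQL_BIN_LOG.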

ha_commit_trans()                   # transaction commit entry point
|
|- MYSQL_BIN_LOG::prepare()         # enter the prepare phase
|  |- ha_prepare_low()
|  |  |- binlog_prepare()           # record last_committed
|  |  |- innobase_xa_prepare()      # InnoDB prepare: redo and undo log records are generated in memory (log buffer)
|
|- MYSQL_BIN_LOG::commit()          # enter the commit phase
|  |- Trans_delegate::before_commit()   # run hook
|  |- ordered_commit()              # enter group commit
|  |  |- FLUSH_STAGE                # flush stage: write+fsync the redo log, write the binlog file
|  |  |- SYNC_STAGE                 # sync stage: fsync the binlog file to disk
|  |  |- COMMIT_STAGE               # commit stage: transactions commit in queue order
|  |  | ...
|  |  |- finish_commit()
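
The prepare-then-commit shape of the call tree above can be sketched as a minimal two-phase commit coordinator. This is an illustration only: the Engine class and commit_transaction() are hypothetical names, and the real server additionally uses the binlog for crash recovery, which this sketch leaves out.

```python
class Engine:
    """Hypothetical participant; stands in for the binlog or InnoDB."""

    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare
        self.state = "active"

    def prepare(self):
        if self.fail_prepare:
            return False
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def commit_transaction(participants):
    """Phase 1: prepare everyone. Phase 2: commit only if all prepared."""
    for p in participants:
        if not p.prepare():
            for q in participants:
                q.rollback()  # any failed prepare aborts the whole transaction
            return False
    for p in participants:
        p.commit()
    return True


if __name__ == "__main__":
    engines = [Engine("binlog"), Engine("innodb")]
    print(commit_transaction(engines), [e.state for e in engines])
```

The point of the split is that no participant reaches the committed state until every participant has successfully prepared.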

The InnoDB prepare phase does not write the redo log to the file unless the redo log buffer runs out of space, as the call path below shows:

innobase_xa_prepare()                             # InnoDB prepare
| ...
|- trx_prepare_for_mysql()
|  |- trx_prepare()
|  |  |- trx_prepare_low()
|  |  |  |- mtr_t::start()                       # start a mini-transaction
|  |  |  |- mtr_t::commit()                      # write redo to the redo log buffer through the mtr
|  |  |  |  |- Command::execute()
|  |  |  |  |  |- prepare_write()                # prepare to write the mtr log to the redo log buffer
|  |  |  |  |  |- finish_write()
|  |  |- trx->state = TRX_STATE_PREPARED

finish_write()
| ...
|- log_reserve_and_open()
|  |- # not enough space, do a write of buffer
|  |- log_buffer_sync_in_background(false)       # not enough space, do a write of buffer
|  |  |- log_write_up_to(lsn, false)             # write to the redo log file, no fsync
|  |  |  |- log_group_write_buf()                # Writes a buffer to a log file group
|...
|- mtr_write_log_t::operator()                   # append blocks to redo log buffer
|  |- log_write_low()                            # write redo log blocks to the redo log buffer

As the call path above shows, the prepare phase does not write the redo log to the file. The only exception is when the redo log buffer runs out of space, and even then the log is only written to the file system cache; no fsync is performed.
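
The "write only when the buffer is full, and never fsync on this path" behaviour can be sketched as follows. This is a toy model under stated assumptions: RedoLogBuffer, os_cache and fsync_calls are hypothetical names, and records larger than the buffer are not handled.

```python
class RedoLogBuffer:
    """Toy log buffer: append in memory, write out only when space runs out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray()
        self.os_cache = bytearray()  # stands in for the file system cache
        self.fsync_calls = 0         # never incremented on this path

    def append(self, record: bytes):
        if len(self.buf) + len(record) > self.capacity:
            # Not enough space: write the buffer out, but do not fsync.
            self.os_cache += self.buf
            self.buf.clear()
        self.buf += record


if __name__ == "__main__":
    log = RedoLogBuffer(capacity=8)
    for rec in (b"aaaa", b"bbbb", b"cc"):
        log.append(rec)
    print(bytes(log.os_cache), bytes(log.buf), log.fsync_calls)
```

Only the third append overflows the 8-byte buffer, so only then does anything move to the simulated file system cache.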

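3. Group commit

3.1 Purpose and significance of group commit

Group commit addresses the cost of flushing to disk too frequently when writing logs. It covers both binlog group commit and redo log group commit.

3.2 Binlog group commit implementation

The basic idea of binlog group commit is to introduce queues so that the InnoDB commit order matches the order in which the binlog reaches disk, and to batch transactions into groups so that a single transaction flushes the binlog on behalf of the whole group.

Parameters involved in ordered_commit():
- innodb_flush_log_at_trx_commit          # (important) controls redo log flushing
- sync_binlog                             # (important) controls binlog flushing
- binlog_group_commit_sync_delay          # how long to wait before starting the binlog sync
- binlog_group_commit_sync_no_delay_count # maximum number of transactions to wait for before aborting the current delay
- binlog_order_commits                    # whether transactions commit in binlog order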

ordered_commit()
|...
|  |- change_stage(FLUSH_STAGE)        # enter FLUSH_STAGE: add the current transaction to m_queue[FLUSH_STAGE]. If the queue was empty, this thread becomes the leader, acquires LOCK_log and gets false back. Otherwise it is a follower: it blocks until its transaction has been committed and it is woken up, gets true back, and goes straight to finish_commit().
|  |...
|  |- process_flush_stage_queue()
|  |  |- ha_flush_logs(true)          # write+fsync the redo log to disk, depending on innodb_flush_log_at_trx_commit
|  |  |- assign_automatic_gtids_to_flush_group() # assign GTIDs to the transactions in the flush queue
|  |  |- flush_thread_caches()         # write the binlog to the I/O cache, GTID first, then the other events
|  |...
|  |- flush_cache_to_file()            # actually write to the binary log file, no fsync
|  |- RUN_HOOK(binlog_storage, after_flush)
|  |...
|  |- change_stage(SYNC_STAGE)         # enter the sync stage: add the current transaction to m_queue[SYNC_STAGE]. If the queue was empty, this thread becomes the leader, acquires LOCK_sync, releases LOCK_log and gets false back. Otherwise it is a follower, gets true back, and goes straight to finish_commit().
|  |...
|  |- sync_binlog_file(false)          # fsync or not, depending on sync_binlog
|  |...
|  |- if binlog_order_commits is enabled, continue with COMMIT_STAGE below; otherwise go straight to finish_commit()
|  |...
|  |- change_stage(COMMIT_STAGE)       # enter the commit stage: add the current transaction to m_queue[COMMIT_STAGE]. If the queue was empty, this thread becomes the leader, acquires LOCK_commit, releases LOCK_sync and gets false back. Otherwise it is a follower, gets true back, and goes straight to finish_commit().
|  |- RUN_HOOK(binlog_storage, after_sync)
|  |- process_commit_stage_queue()     # run ha_commit_low in commit_queue order
|  |  |- ha_commit_low()
|  |  |  |- binlog_commit()            # binlog commit
|  |  |  |- innobase_commit()          # InnoDB commit
|  |- process_after_commit_stage_queue()
|  |  |- RUN_HOOK(transaction, after_commit)  # RUN_HOOK in commit_queue order
|  |...
|  |- stage_manager.signal_done(final_queue)  # commit finished; notify all waiting follower transactions
|  |- finish_commit()
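
The leader/follower queueing in the call tree above can be sketched in a few lines. This is a single-threaded toy model, not the server's implementation: StageQueue, enroll(), fetch_all() and run_group() are hypothetical names, locks and follower wake-ups are omitted, and enroll() returns True for the leader, whereas the change_stage() described above returns false for the leader.

```python
class StageQueue:
    """Toy stage queue: the first session to enroll becomes the leader."""

    def __init__(self):
        self.sessions = []

    def enroll(self, session) -> bool:
        """Append the session; return True if it is the leader."""
        is_leader = not self.sessions
        self.sessions.append(session)
        return is_leader

    def fetch_all(self):
        """The leader takes the whole queue, leaving it empty for the next group."""
        group, self.sessions = self.sessions, []
        return group


def run_group(names):
    """Push one group through flush, sync and commit; return the event log."""
    events = []
    flush_queue = StageQueue()
    leader = None
    for name in names:
        if flush_queue.enroll(name):
            leader = name
    group = flush_queue.fetch_all()
    events.append(f"{leader}: flush binlog for {group}")
    events.append(f"{leader}: fsync binlog once")
    for name in group:  # commit in queue order
        events.append(f"commit {name}")
    return events


if __name__ == "__main__":
    for line in run_group(["t1", "t2", "t3"]):
        print(line)
```

Three transactions share a single fsync here, and they commit in the same order in which they entered the flush queue.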

ha_flush_logs(true)
|- flush_handlerton(true)
|  |- innobase_flush_logs(true)
|  |  |- if innodb_flush_log_at_trx_commit=0, return
|  |  |- log_buffer_flush_to_disk(flush_log_at_trx_commit == 1)
|  |  |  |- log_write_up_to(flush_log_at_trx_commit == 1)
|  |  |  |  |- log_group_write_buf()         # write to the redo log file
|  |  |  |  |- log_write_flush_to_disk_low() # fsync the redo log file to disk
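3.3 Redo log group commit implementation

The redo log flushes of multiple transactions are merged, which reduces the number of sequential disk writes (the redo log is written sequentially).
Every redo log record has an LSN (Log Sequence Number), and LSNs increase monotonically.
Each update a transaction performs produces one or more redo log records. When a transaction copies its log into log_sys_buffer (which is protected by log_mutex), it obtains the current maximum LSN, so the LSNs of different transactions never overlap.
Suppose the maximum LSNs of the logs of three transactions Trx1, Trx2 and Trx3 are LSN1, LSN2 and LSN3 (LSN1<LSN2<LSN3), and the three commit at the same time. If Trx3 acquires log_mutex first and flushes, it can flush the whole range [LSN1---LSN3] along the way, so Trx1 and Trx2 do not need to issue disk I/O again.
From ordered_commit() above we know that the redo log is flushed in the flush stage, before the binlog is flushed. This optimization was introduced in MySQL 5.7.6 (before that, the redo log was flushed in the prepare phase). It achieves two things:
- On the one hand, the redo log is still guaranteed to be flushed before the binlog, which preserves the original crash recovery logic;
- On the other hand, not every transaction has to flush; the leader transaction flushes a batch of redo on behalf of the others.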
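Redo log group commit flow (log_write_up_to()):
- 1. Acquire log_sys->write_mutex
- 2. If flushed_to_disk_lsn>=lsn, the log has already been flushed to disk: go to step 8, then return
- 3. If current_flush_lsn>=lsn, the log is currently being flushed: go to step 8, then wait
- 4. Acquire log_sys->mutex
- 5. Compute the start and end of the write, record write_lsn, and so on
- 6. Release log_sys->mutex
- 7. Write the log below lsn to the file: log_group_write_buf()
- 8. Release log_sys->write_mutex
- 9. fsync the log file: log_write_flush_to_disk_low()
- Note: lsn is the transaction's LSN (log_sys->lsn always holds the current maximum LSN); flushed_to_disk_lsn and current_flush_lsn are the LSN already flushed to disk and the LSN currently being flushed, respectively.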
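
The early exit in step 2 is what makes the group effect work, and it can be sketched with a toy model. This is not InnoDB code: the RedoLog class and its methods are hypothetical, the model is single-threaded, and the mutexes and the waiting in step 3 are omitted.

```python
class RedoLog:
    """Toy model of LSN-based group flushing (single-threaded)."""

    def __init__(self):
        self.lsn = 0                  # highest LSN in the log buffer
        self.flushed_to_disk_lsn = 0  # highest LSN known to be durable
        self.fsync_calls = 0

    def append(self, nbytes: int) -> int:
        """Reserve nbytes in the log buffer and return the end LSN."""
        self.lsn += nbytes
        return self.lsn

    def write_up_to(self, lsn: int) -> bool:
        """Make the log durable up to lsn; return True if an fsync ran."""
        if self.flushed_to_disk_lsn >= lsn:
            return False  # an earlier flush already covered this LSN
        # Flush everything buffered so far, not just the caller's records.
        self.flushed_to_disk_lsn = self.lsn
        self.fsync_calls += 1
        return True


if __name__ == "__main__":
    log = RedoLog()
    lsn1, lsn2, lsn3 = log.append(10), log.append(20), log.append(30)
    print(log.write_up_to(lsn3), log.write_up_to(lsn1), log.write_up_to(lsn2))
    print(log.fsync_calls)
```

As in the Trx1/Trx2/Trx3 example, once the transaction with the largest LSN has flushed, the other two find their LSNs already covered and skip the disk I/O.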

