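Cause 1: forced full page writes are switched on when a backup starts. Cause 2: pg_basebackup blocks WAL recycling through a replication slot.
Public-cloud users running PostgreSQL often notice that WAL disk usage grows markedly while a backup is in progress.
One obvious cause is that forced full page writes are enabled when the backup starts. Complete data pages are recorded in WAL, which guards against the case where a dirty page is flushed while the data files are being copied and the backup ends up with an only partially valid (torn) page.
Taking the PostgreSQL 12 source as an example: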
do_pg_start_backup
{
……
XLogCtl->Insert.forcePageWrites = true;
……
}
A second cause is easy to overlook. Many backup programs rely on PostgreSQL's native pg_basebackup tool, and pg_basebackup offers two methods for collecting WAL:
fetch
The write-ahead log files are collected at the end of the backup. Therefore, it is necessary for the wal_keep_segments parameter to be set high enough that the log is not removed before the end of the backup. If the log has been rotated when it's time to transfer it, the backup will fail and be unusable.
stream
Stream the write-ahead log while the backup is created. This will open a second connection to the server and start streaming the write-ahead log in parallel while running the backup. Therefore, it will use up two connections configured by the max_wal_senders parameter. As long as the client can keep up with write-ahead log received, using this mode requires no extra write-ahead logs to be saved on the master.
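The first method works like many third-party backup tools: a WAL segment may be recycled while the copy is still running, so the backup picks up an invalid WAL file and becomes unusable.
The second method is the default. pg_basebackup behaves much like a standby: it opens a streaming replication connection to receive WAL and creates a replication slot to delay WAL recycling, which guarantees that the WAL in the backup is complete and usable. The cost is that any WAL the client has not yet received must stay on the server.
Taking the PostgreSQL 12 source as an example: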
StartLogStreamer(xlogstart, starttli, sysidentifier);
{
……
if (!CreateReplicationSlot(param->bgconn, replication_slot, NULL, temp_replication_slot, true, true, false))
……
}
During a checkpoint, the minimum LSN across all replication slots holds back WAL recycling:
KeepLogSeg
{
    keep = XLogGetReplicationSlotMinimumLSN();
    /* compute limit for wal_keep_segments first */
    if (wal_keep_segments > 0)
    {
        /* avoid underflow, don't go below 1 */
        if (segno <= wal_keep_segments)
            segno = 1;
        else
            segno = segno - wal_keep_segments;
    }
    /* then check whether slots limit removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo slotSegNo;
        XLByteToSeg(keep, slotSegNo, wal_segment_size);
        if (slotSegNo <= 0)
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }
}
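To make the two limits concrete, here is a standalone sketch that mirrors the logic of the excerpt above. It is a model for experimentation rather than the server code: the function name `oldest_segment_to_keep` and its parameter list are assumptions made for this example, and the segment number is computed by plain integer division in place of the XLByteToSeg macro.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: returns the oldest WAL segment that must be kept,
 * given the current position (recptr), the minimum LSN required by any
 * replication slot (slot_min_lsn, or InvalidXLogRecPtr if there is none),
 * and the two settings that appear in the excerpt.
 */
static XLogSegNo
oldest_segment_to_keep(XLogRecPtr recptr,
                       XLogRecPtr slot_min_lsn,
                       int wal_keep_segments,
                       int max_replication_slots,
                       uint64_t wal_segment_size)
{
    assert(wal_segment_size > 0);

    XLogSegNo segno = recptr / wal_segment_size;

    /* first limit: keep wal_keep_segments segments behind the current one */
    if (wal_keep_segments > 0)
    {
        /* avoid underflow, never go below segment 1 */
        if (segno <= (XLogSegNo) wal_keep_segments)
            segno = 1;
        else
            segno -= (XLogSegNo) wal_keep_segments;
    }

    /* second limit: a slot may need WAL that is older still */
    if (max_replication_slots > 0 && slot_min_lsn != InvalidXLogRecPtr)
    {
        XLogSegNo slotSegNo = slot_min_lsn / wal_segment_size;

        if (slotSegNo == 0)          /* unsigned, so "<= 0" means "== 0" */
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }

    return segno;
}
```

For example, with 16 MB segments, a current position in segment 100, and wal_keep_segments set to 10, the function returns 90 when no slot is active. If a slot still needs WAL from segment 40, it returns 40, so about 60 segments (roughly 960 MB) stay on disk until the slot advances.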
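As the pg_basebackup documentation indicates, wal_keep_segments is the first line of defense against WAL recycling, and the minimum LSN of the replication slots is the second.
Together, these two causes explain why WAL space grows during a backup.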