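One of the core features of TeleDB for MySQL is a backup and recovery facility with flexible policies that covers a wide range of scenarios and requirements. In production, however, operations staff often find it hard to troubleshoot when the backup tool misbehaves or a backup fails. Understanding how the backup tool is implemented therefore makes front-line troubleshooting more efficient.
A. Automatic backup task submission flow
Automatic backup tasks are triggered mainly by the scheduled script check_backup.py. The trigger and check flow is as follows:
1. Clusters that have already been backed up and clusters that have never been backed up go through different processing logic. The corresponding query SQL is as follows:
For clusters that have already been backed up: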
select a.prod_inst_id,a.prod_inst_flag, a.prod_inst_set_name,a.prod_inst_name, a.prod_type, a.prod_db_engine, c.store_ip, c.store_path, c.store_user, c.store_password, c.ssh_port, d.nextbackuptime, d.backupinterval, d.backupcount, d.status, d.backupunittime, d.firstbackuptime from paas_product a left join backup_info c using(prod_inst_id) left join backup_conf d using(prod_inst_id) where a.prod_order_status = 0 and d.havebackup=0 and a.prod_running_status = 0 group by a.prod_inst_id;
For clusters that have not yet been backed up:
select a.prod_inst_flag, a.prod_inst_set_name, a.prod_inst_name, a.prod_type, a.prod_db_engine, b.prod_inst_id, b.incremental_lsn, b.firstbackuptime, b.nextbackuptime, b.backupunittime, b.backupinterval, b.backupcount,b.status, c.store_ip, c.store_user, c.store_password, c.store_path, c.ssh_port from paas_product a left join backup_conf b on a.prod_inst_id = b.prod_inst_id left join backup_info c on b.prod_inst_id = c.prod_inst_id where a.prod_order_status = 0 and a.prod_running_status = 0 and b.backup_operation = 1 and b.havebackup = 0 " + condition + " group by a.prod_inst_id
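2. For a cluster that has already been backed up, query the backup_recovery table for the record with the latest binlog_end_time whose backup_type is neither table nor instance, with cType='auto' and is_valid=0. The SQL is as follows: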
select to_lsn,host ,port from backup_recovery where prod_inst_id=prod_inst_id and backup_type not in('table','instance') and cType='auto' and is_valid = 0 order by binlog_end_time desc limit 1;
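If such a record exists, take its to_lsn, host and port as incremental_lsn, db_host and db_port (the IP and port of the replica chosen for the previous backup); otherwise set them to empty.
For a cluster that has never been backed up, check whether the current time has passed the next backup time and whether the status field in backup_conf is 0 (no backup or recovery operation is in progress on the cluster). If so, set host and port to empty and enter the backup check logic.
3. If the cluster is a single instance, set isAgain_f to True (meaning a full backup). If db_host and db_port are empty, also set isAgain_f to True. If they are not empty, check whether the instance at that db_host and db_port still qualifies as a backup source (alive=0, repl=0 and role=s). If it does not, select a suitable replica again with the SQL below and set isAgain_f=True: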
select a.host, a.port, b.sys_user, b.sys_password, b.ssh_port, c.db_ag_user, c.db_ag_password, concat(a.db_path,'/etc/', d.parameter_group_used, '.cnf') mysqlcnf from db_resource a left join machine_resource b on a.machine_id = b.id left join db_user_info c on a.res_id = c.res_id left join paas_product d on a.prod_inst_id = d.prod_inst_id left join monitor_cluster_db_states e on a.prod_inst_id = e.prod_inst_id where a.prod_inst_id = " + str(prod_inst_id) + " and a.alive = 0 and e.repl = 0 order by a.role desc limit 1
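4. Check whether isAgain_f is True, that is, whether a full backup is required. If so, obtain a new replica with the corresponding statement and read its database instance information. If no such instance (replica) can be found with the SQL, return an error directly.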
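Steps 3 and 4 can be sketched in Python as follows. This is a minimal illustration, not the code of check_backup.py: the `ReplicaState` class, the function names and the in-memory candidate list are assumptions made for the example, and only the conditions (alive=0, repl=0, role=s, ordering by role descending) come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReplicaState:
    """Hypothetical in-memory view of one instance (not a real TeleDB type)."""
    host: str
    port: int
    alive: int  # the text requires alive=0 for a backup source
    repl: int   # the text requires repl=0 for a backup source
    role: str   # the text requires role='s' for the previous replica


def is_backup_source(state: ReplicaState) -> bool:
    """Condition from step 3: alive=0, repl=0 and role=s."""
    return state.alive == 0 and state.repl == 0 and state.role == "s"


def choose_backup_source(
    is_single: bool,
    previous: Optional[ReplicaState],
    candidates: List[ReplicaState],
) -> Tuple[ReplicaState, bool]:
    """Return (source, is_again_f) following steps 3 and 4."""
    is_again_f = is_single or previous is None or not is_backup_source(previous)
    if not is_again_f:
        return previous, False
    # Mirrors "a.alive = 0 and e.repl = 0 order by a.role desc limit 1".
    eligible = [c for c in candidates if c.alive == 0 and c.repl == 0]
    if not eligible:
        raise LookupError("no instance is eligible as a backup source")
    return max(eligible, key=lambda c: c.role), True
```

Because the candidates are ordered by role descending, a replica (`s`) is preferred over a primary when both are eligible, which also lets a single-instance cluster fall back to its only node.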
5. Obtain the storage path and the storage directory sequence number of the previous backup with the following SQL:
select store_path,substring_index(substring_index(store_path,'/',-2),'/',1) childPath, cType from backup_recovery_history where prod_inst_id=" + str(prod_inst_id) +" and backup_type='full' and is_valid = 0 order by childPath desc limit 1;
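If no record is found, set order=0 (order is the sequence number of the directory that stores a full-backup cycle). If the previous backup found was a manual backup (cType=manual), set cType=auto and isAgain_f to True.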
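To make the nested substring_index call in step 5 easier to read, the sketch below reproduces it in Python. The helper names are hypothetical, and the assumption that the directory is named `backup_<zero-padded number>` is taken only from the sample store_dir shown later in this document.

```python
from typing import Optional


def child_path(store_path: str) -> str:
    """Mimic substring_index(substring_index(store_path, '/', -2), '/', 1)."""
    last_two = store_path.split("/")[-2:]  # the last two path components
    return last_two[0]                     # keep the first of them


def previous_order(store_path: Optional[str]) -> int:
    """Directory sequence number of the last full backup, or 0 if none exists."""
    if store_path is None:
        return 0  # step 5: no record found, so order=0
    # Assumption: the directory looks like backup_0000000008.
    return int(child_path(store_path).rsplit("_", 1)[1])
```

For the sample path ending in `backup_0000000008/2021-01-19_14-06-47`, `child_path` yields `backup_0000000008` and `previous_order` yields 8.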
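6. After the steps above, check isAgain_f and whether the cluster is a single instance. If either condition holds, set the next backup type backupType to full, the storage directory sequence number to order+1, and incremental_lsn to empty. Otherwise decide from the backupType passed in from step 1, where 0 means a full backup and 1 means an incremental backup. For an incremental backup, read the last_recovery_time field of the backup_conf table: a non-zero value means the instance has been restored since its last backup, so the next backup must be a full backup; otherwise the next backup is still handled as an incremental backup.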
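The decision in step 6 can be summarised with a small function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the text only says explicitly that the first branch uses order+1 with an empty incremental_lsn, so applying the same treatment to every full backup is an assumption of this example.

```python
def decide_backup(is_again_f, is_single, backup_type_flag,
                  last_recovery_time, order, incremental_lsn):
    """Return the next backup as a dict with type, order and incremental_lsn."""

    def full():
        # Assumption: every full backup opens a new directory (order + 1)
        # and clears incremental_lsn; the text states this for the first
        # branch only.
        return {"type": "full", "order": order + 1, "incremental_lsn": ""}

    if is_again_f or is_single:
        return full()
    if backup_type_flag == 0:      # 0 means full backup
        return full()
    if last_recovery_time != 0:    # restored since last backup: force full
        return full()
    # 1 means incremental backup: stay in the current directory.
    return {"type": "incremental", "order": order,
            "incremental_lsn": incremental_lsn}
```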
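7. Write the backup task to the zk node.
8. If the current backup is a full backup and the cluster is not a single instance, that is, isAgain_f is True and prod_type is greater than 0, set backupcount to 1; otherwise increment backupcount by 1. Set the next backup time, write these values to the backup_conf table, and set status=1 at the same time.
9. Set the prod_running_status field of the paas_product record for this prod_inst_id to 2.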
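The bookkeeping in step 8 can be sketched as a function that builds a parameterised UPDATE. The column names come from the queries above; the function name, the `%s` placeholder style and the way the next backup time is passed in are assumptions of this example, since the text does not say how the next backup time is computed.

```python
def backup_conf_update(is_again_f, prod_type, backupcount,
                       next_backup_time, prod_inst_id):
    """Return (sql, params) for the backup_conf update described in step 8."""
    if is_again_f and prod_type > 0:
        new_count = 1                # a new full-backup cycle starts
    else:
        new_count = backupcount + 1  # one more backup in the current cycle
    sql = ("update backup_conf set backupcount=%s, nextbackuptime=%s, status=1 "
           "where prod_inst_id=%s")
    return sql, (new_count, next_backup_time, prod_inst_id)
```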
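B. Backup task collection flow
This logic is analyzed mainly from the get_backup.py script. Automatic backups find the corresponding set by joining backup_conf with the paas_product table; manual backups find it by joining manual_backup_info with the paas_product table.
1. Fetch the backup ans node of the corresponding set from ZooKeeper. If the errcode on the node is greater than or equal to 3 (finished), parse it. The task node and its content look like this: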
{
"flag":"test",
"set_name":"set_100993015",
"host":"10.142.90.99",
"port":3015,
"bkhost":"10.142.90.100",
"store_host_ssh_port":22,
"store_user":"teledb",
"store_passwd":"Py+oNVxQcHY=",
"type":"full",
"store_dir":"/mysql/zmy-test/back/3015_1t_100993015/10.142.90.99_3015/backup_0000000008/2021-01-19_14-06-47",
"binlog_end_time":"2021-01-19 14:07:27",
"gtid":"a0cee2b0-54aa-11eb-b153-688f84f0021f:1-118712,b101b422-50cb-11eb-af77-688f84f0022d:1-48141,dbf32cc0-5303-11eb-83c8-688f84f0022d:1-35977,fc804848-4a52-11eb-9825-688f84f02c5e:1-140014",
"from_lsn":"0",
"to_lsn":"135976796",
"errcode":3,
"history":"[set_100993015]connect to StoreHost|[set_100993015]connect to DbHost|[set_100993015]run innobackup|[set_100993015]backup OK",
"start_time":"2021-01-19 14:06:47",
"end_time":"2021-01-19 14:07:34",
"data_size":57570882,
"c_type":"auto",
"description":"auto backup",
"backup_name":"test_set_100993015_202101191406"
}
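A minimal sketch of how such a node could be parsed is shown below. It assumes the node payload is plain JSON as displayed above; the function name is hypothetical, and reading the node from ZooKeeper is left out.

```python
import json


def parse_ans_node(raw):
    """Return the parsed node if the task has finished, otherwise None."""
    data = json.loads(raw)
    errcode = int(data["errcode"])  # tolerate both 3 and "3"
    if errcode < 3:
        return None                 # errcode below 3: task not finished yet
    data["errcode"] = errcode
    # history is a '|'-separated trail of the stages the tool went through.
    data["history"] = data.get("history", "").split("|")
    return data
```

Splitting history on `|` makes it easy to see the last stage reached, which is usually the first thing to check when a backup fails.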
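2. If errcode is 3 (backup succeeded), proceed as follows:
A. For an automatic ("auto") backup, update havebackup to 1 and incremental_lsn to the given value in the backup_conf record for the prod_inst_id;
B. Delete the backup_recovery records of the corresponding type (full in this example) for the prod_inst_id;
C. Insert the information from the latest backup into the backup_recovery table, and also insert it into the backup_recovery_history table;
D. Set the last_recovery_time field in backup_conf to 0 (this field is updated when a recovery task is created for the cluster; its main purpose is to force the first backup after a recovery to be a full backup).
3. If errcode is not 3 (backup failed), proceed as follows:
A. Decrement the backupcount field in backup_conf by 1;
B. Record the information obtained from the backup in the backup_recovery_history table;
C. Build the backup-failure alarm information and insert it into the alarm table;
4. Update the status field of the backup_conf record for the prod_inst_id to 0;
5. Update the prod_running_status field of the paas_product record for the prod_inst_id to 0;
6. Delete the backup node and its corresponding ans node on zk;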
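The collection flow above can be condensed into an ordered plan of table operations, which is handy as a checklist when a cluster stays stuck in a "backing up" state. The sketch below only describes the steps as (table, action) pairs; it is not the code of get_backup.py, and the function name and action wording are assumptions of this example.

```python
def plan_backup_collection(node):
    """Return the ordered (table, action) steps for one finished backup node."""
    steps = []
    if int(node["errcode"]) == 3:  # backup succeeded
        if node.get("c_type") == "auto":
            steps.append(("backup_conf", "set havebackup=1 and incremental_lsn"))
        steps.append(("backup_recovery", "delete rows of type " + node["type"]))
        steps.append(("backup_recovery", "insert new backup"))
        steps.append(("backup_recovery_history", "insert new backup"))
        steps.append(("backup_conf", "set last_recovery_time=0"))
    else:                          # backup failed
        steps.append(("backup_conf", "decrement backupcount"))
        steps.append(("backup_recovery_history", "insert failed backup"))
        steps.append(("alarm", "insert backup-failure alarm"))
    # Steps 4 to 6 run in both cases.
    steps.append(("backup_conf", "set status=0"))
    steps.append(("paas_product", "set prod_running_status=0"))
    steps.append(("zookeeper", "delete backup node and ans node"))
    return steps
```

If status in backup_conf or prod_running_status in paas_product never returns to 0, the last three steps are the ones to verify first, since they run regardless of the backup result.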
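C. Recovery task collection flow
This logic is analyzed mainly from the get_recovery.py script. It joins paas_product with the backup_recovery table, fetches the records whose backup_type is instance or table, finds the corresponding set, and then starts the collection flow for the recovery task.
1. Fetch the recovery ans node of the corresponding set from ZooKeeper. If the errcode on the node is greater than or equal to 3 (finished), parse it. The task node and its content look like this: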
{
"set_name":"set_100993015",
"recovery_type":"0",
"start_time":"2021-01-18 10:08:38",
"end_time":"2021-01-18 10:09:51",
"flag":"test",
"errcode":"5",
"history":"Start Copy Backup|Copy Backup OK|Start Apply Log|Apply Log OK|[set_100993015][10.142.90.96] Start to restore|[set_100993015][10.142.90.96]Ensure MySQL and agent closed for restore|[set_100993015][10.142.90.96] Can't repare right env for restore, error [error happen when try to kill agent cannot get pid from agentProcInfo]",
"extra_info":"[{"dbport": "22", "agconfname": "", "agentdir": "", "host": "10.142.90.96", "cnffile": "mysql.cnf", "sysuser": "teledb", "sshport": "22", "syspasswd": "Py+oNVxQcHY="}, {"user_id": 98989988, "recovery_name": "set_100993015_instance_recovery_202101181008", "target_id_list": [], "src_inst_id": "3"}]"
}
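2. If recovery_type in the node is 0, set recovery_type to instance; otherwise set it to table;
3. Update the errcode, history, start_time, end_time and other fields of the corresponding recovery record in backup_recovery;
4. If the errcode field is 3, add the record to the alarm information list;
5. Copy the backup_recovery record updated in step 3 into the backup_recovery_history table;
6. Delete the information corresponding to step 3 from the backup_history table;
7. Update the status field of the backup_conf record for the prod_inst_id to 0;
8. Update the prod_running_status field of the paas_product record for the prod_inst_id to 0;
9. Remove this set from the no-switchover set on the node, and clear the record for this recovery from the excluded_set_info table;
10. For an instance recovery, if agentdir and agentconf are not empty, restart the corresponding agent service;
11. Delete the recovery node and its corresponding ans node on zk;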
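Steps 2 and 10 can be illustrated with two small helpers. This is a sketch under assumptions: extra_info is treated as a JSON-encoded list whose first element carries the agent fields, and the agconfname key from the sample node is taken to be the "agentconf" that step 10 refers to. Neither point is confirmed by the text, and the function names are hypothetical.

```python
import json


def recovery_type_name(raw):
    """Step 2: 0 means instance recovery, anything else table recovery."""
    return "instance" if str(raw) == "0" else "table"


def needs_agent_restart(node):
    """Step 10: restart the agent only after an instance recovery with agent info."""
    if recovery_type_name(node["recovery_type"]) != "instance":
        return False
    # Assumption: the first element of extra_info describes the target host.
    db_info = json.loads(node["extra_info"])[0]
    return bool(db_info.get("agentdir")) and bool(db_info.get("agconfname"))
```

With the sample node above, both agentdir and agconfname are empty strings, so no agent restart would be triggered under these assumptions.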
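D. Binlog sync information retrieval logic
Binlog information is updated mainly by the get_syncer_result.py script, which updates the binlog_info table.
1. Find the corresponding sets with the following SQL and loop over them to update their binlog information: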
select a.prod_inst_flag, a.prod_inst_id, a.prod_inst_set_name, b.store_ip, b.store_path, c.api_port from paas_product a inner join backup_info b on a.prod_inst_id =b.prod_inst_id left join binlog_sync_thread c on b.store_ip = c.host and b.store_path = c.location where a.prod_order_status = 0
2. Fetch the binlog sync information of each set with requests.get(url), then, for every synced binlog, update the binlog_info table in a loop using replace into;
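The replace into loop in step 2 can be sketched as follows. To keep the example self-contained it uses an in-memory SQLite database, which also supports REPLACE INTO, in place of MySQL, and it omits the requests.get(url) call. The binlog_info columns and the shape of the binlog records are hypothetical, since the text does not list them.

```python
import sqlite3


def upsert_binlogs(conn, prod_inst_id, binlogs):
    """Insert or overwrite one binlog_info row per synced binlog."""
    rows = [(prod_inst_id, b["name"], b["size"]) for b in binlogs]
    # REPLACE INTO overwrites the row that has the same primary key.
    conn.executemany(
        "REPLACE INTO binlog_info (prod_inst_id, binlog_name, binlog_size) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def make_demo_db():
    """Create an in-memory table with a hypothetical binlog_info layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE binlog_info (prod_inst_id INTEGER, binlog_name TEXT, "
        "binlog_size INTEGER, PRIMARY KEY (prod_inst_id, binlog_name))"
    )
    return conn
```

Because the rows are keyed by instance and binlog name, running the loop repeatedly keeps one row per binlog and simply refreshes its values as the sync progresses.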
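E. backuprecovery/syncer heartbeat update logic
1. The heartbeats of backuprecovery and syncer are both updated by active push from the tools themselves (the backuprecovery and syncer tools): the tool sends a POST request, and the backend handles it on receipt;
2. Parse the body of the received POST request and read the component_type attribute, whose value can only be backuprecovery or syncer;
3. For backuprecovery: if user_id in the request is 0, update backuprecovery_heartbeat in the backup_conf table to the time value that was sent; if user_id is not 0, first look up all prod_inst_id values that belong to that user_id, then update backuprecovery_heartbeat in backup_conf for those prod_inst_id values;
4. For syncer: first read the flag and set_name attributes, look up the corresponding prodInstId from these two values, and then update syncer_heartbeat of that record in the backup_conf table;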
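The dispatch in steps 2 to 4 can be sketched as a function that decides which column to update and for which instances. The lookup tables passed in stand in for the database queries, and the function name and return shape are assumptions of this example.

```python
def heartbeat_targets(payload, inst_ids_by_user, inst_id_by_set):
    """Return (column, prod_inst_ids); None means no prod_inst_id filter."""
    component = payload["component_type"]
    if component == "backuprecovery":
        user_id = int(payload["user_id"])
        if user_id == 0:
            # The text mentions no prod_inst_id filter for user_id 0.
            return "backuprecovery_heartbeat", None
        return "backuprecovery_heartbeat", list(inst_ids_by_user.get(user_id, []))
    if component == "syncer":
        key = (payload["flag"], payload["set_name"])
        return "syncer_heartbeat", [inst_id_by_set[key]]
    raise ValueError("component_type must be backuprecovery or syncer")
```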
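Note: in the front-end UI under Backup & Recovery -> Cluster List, whether the backuprecovery and syncer heartbeats are shown as normal is determined mainly by whether the difference between the current time and the instance's backuprecovery_heartbeat and syncer_heartbeat values in backup_conf exceeds 5 minutes.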
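The 5-minute rule can be expressed as a one-line check. This is a minimal sketch: the function and constant names are hypothetical, and it assumes the heartbeat columns hold timestamps comparable with the current time.

```python
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(minutes=5)  # threshold stated in the note above


def heartbeat_ok(last_heartbeat, now):
    """A heartbeat is normal unless it is more than 5 minutes old."""
    return now - last_heartbeat <= HEARTBEAT_TIMEOUT
```

When the UI shows an abnormal heartbeat, comparing the stored timestamp with the database server's current time is a quick way to tell a stopped tool from a clock skew between hosts.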