文件句柄资源占用过多导致xtrabackup备份失效

详见: https://bugs.launchpad.net/percona-xtrabackup/+bug/1183322 系统句柄资占用过多(too many open file)引起的备份失效.

xtrabackup version 2.1.6 for Percona Server 5.1.70 unknown-linux-gnu (x86_64) (revision id: 702)
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /web/mysql/node3312/data
xtrabackup: using the following InnoDB configuration:
xtrabackup:   innodb_data_home_dir = ./
xtrabackup:   innodb_data_file_path = ibdata1:10M:autoextend
xtrabackup:   innodb_log_group_home_dir = ./
xtrabackup:   innodb_log_files_in_group = 2
xtrabackup:   innodb_log_file_size = 268435456
xtrabackup: using O_DIRECT
>> log scanned up to (3891806734133)
140319  4:18:03  InnoDB: Warning: allocated tablespace 10, old maximum was 9
>> log scanned up to (3891806734133)
>> log scanned up to (3891806759902)
140319  4:18:06  InnoDB: Operating system error number 24 in a file operation.
InnoDB: Error number 24 means 'Too many open files'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/operating-system-error-codes.html
InnoDB: Error: could not open single-table tablespace file
InnoDB: ./zongheng_log/click_log#P#p20131103.ibd!
InnoDB: We do not continue the crash recovery, because the table may become
InnoDB: corrupt if we cannot apply the log records in the InnoDB log to it.
InnoDB: To fix the problem and start mysqld:
InnoDB: 1) If there is a permission problem in the file and mysqld cannot
InnoDB: open the file, you should modify the permissions.
InnoDB: 2) If the table is not needed, or you can restore it from a backup,
InnoDB: then you can remove the .ibd file, and InnoDB will do a normal
InnoDB: crash recovery and ignore that table.
InnoDB: 3) If the file system or the disk is broken, and you cannot remove
InnoDB: the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
InnoDB: and force InnoDB to continue crash recovery here.
innobackupex: Error: The xtrabackup child process has died at /usr/bin/innobackupex line 2631.

此bug和 722638 相关, 作者解决了rename table的bug后而引入了FD的bug, xtrabckup没有重用表空间的FD,而是重新打开FD直到拷贝完表空间,理论上需要2*tablespaces + 10个数量的FD, Percona的open_files_limit参数值依赖系统平台,在备份阶段业务主机的FD总使用超过ulimit -n的值的时候则会报perror 24(too many open files)错误,详见: https://bugs.launchpad.net/percona-xtrabackup/+bug/1183322/comments/17

node因为分区表的关系,存在的表空间数量较大
[root@cz log]# find /web/mysql/node/data/ -name '*.ibd' | wc -l           
1074

在下一版本中作者可能会通过增加参数选项的方式禁止备份中断后产生警告的函数(如果在备份过程中存在DDL语句对InnoDB表更新操作), 见连接:https://bugs.launchpad.net/percona-xtrabackup/+bug/1222062

作者临时提供了两种方法来解决该bug引起的问题:

  1. There are 2 kinds of limits on file descriptors: a) per-user limit that can be checked and adjusted for the current session by the “ulimit -n” command (or modifying /etc/security/limits.conf to make changes persistent) b) system-wide limit that can checked and adjusted by using either /proc/sys/fs/file-max or the sysctl utility (or modifying /etc/sysctl.conf to make changes persistent).

  2. Most Linux distributions have rather strict user-level limits, but fairly high system-wide limits by default. For example, on my CentOS 5 VM I see 1024 file descriptors for a non-root user, but 207006 as the system-wide limit in /proc/sys/fs/file-max. The user limit would have to be adjusted if I had to backup more than ~500 tablespaces, and the system-wide limit would have to be adjusted if I had to backup more than ~100,000 tablespaces.

/proc/sys/fs/file-max在centos 6中默认值3245099,是一个很大的值,系统默认的user限制为1024, 作者描述的情况是否真实,我在内网机测试如下: sqlbench创建一些innodb表:

[root@cz sql-bench]# find /web/mysql/node3306/data/ -name '*.ibd' | wc -l     
2201

[root@z6 ~]# rpm -qa|grep percona
percona-xtrabackup-2.1.6-702.rhel6.x86_64  #和业务的版本一致

备份如下: (注: xtrabackup分别使用root用户和mysql用户)

innobackupex: Created backup directory /web/mysql_back/node3306/2014-03-22_12-08-25

140322 12:08:25  innobackupex: Starting ibbackup with command: xtrabackup_55  --defaults-file="/web/mysql/node3306/my.node.cnf"  --defaults-group="mysqld" --backup --suspend-at-end --target-dir=/web/mysql_back/node3306/2014-03-22_12-08-25 --tmpdir=/dev/shm
innobackupex: Waiting for ibbackup (pid=22246) to suspend
innobackupex: Suspend file '/web/mysql_back/node3306/2014-03-22_12-08-25/xtrabackup_suspended_2'

xtrabackup_55 version 2.1.6 for Percona Server 5.5.31 Linux (x86_64) (revision id: 702)
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /web/mysql/node3306/data
xtrabackup: using the following InnoDB configuration:
xtrabackup:   innodb_data_home_dir = ./
xtrabackup:   innodb_data_file_path = ibdata1:10M:autoextend
xtrabackup:   innodb_log_group_home_dir = ./
xtrabackup:   innodb_log_files_in_group = 2
xtrabackup:   innodb_log_file_size = 268435456
xtrabackup: using O_DIRECT
>> log scanned up to (29917172879)
140322 12:08:25  InnoDB: Warning: allocated tablespace 308, old maximum was 9
>> log scanned up to (29917172879)
140322 12:08:26  InnoDB: Operating system error number 24 in a file operation.
InnoDB: Error number 24 means 'Too many open files'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
InnoDB: Error: could not open single-table tablespace file
InnoDB: ./percona/bench_875.ibd!
InnoDB: We do not continue the crash recovery, because the table may become
InnoDB: corrupt if we cannot apply the log records in the InnoDB log to it.
InnoDB: To fix the problem and start mysqld:
InnoDB: 1) If there is a permission problem in the file and mysqld cannot
InnoDB: open the file, you should modify the permissions.
InnoDB: 2) If the table is not needed, or you can restore it from a backup,
InnoDB: then you can remove the .ibd file, and InnoDB will do a normal
InnoDB: crash recovery and ignore that table.
InnoDB: 3) If the file system or the disk is broken, and you cannot remove
InnoDB: the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
InnoDB: and force InnoDB to continue crash recovery here.
innobackupex: Error: The xtrabackup child process has died at /usr/bin/innobackupex line 2631.

测试,更换各用户ulimit数量,进行xtrabackup的备份,以下为更改root和mysql用户的资源进行相关备份:

       root :32768, mysql: 1024 : 备份成功
       root :10240, mysql: 1024 :备份成功
       root :4412,  mysql: 1024 : 备份成功
       root :2212,  mysql: 1024 : 备份成功
       root :2211,  mysql: 1024 : 备份成功
       root :2210,  mysql: 1024 : 重现
       root :2201,  mysql: 1024 :重现
       root :1600,  mysql: 1024 : 重现
       root :1024,  mysql: 1024 : 重现

root用户执行备份:
       root :1024, mysql: 32768 : 重现
mysql用户执行备份:
       root :1024, mysql: 32768 : 备份成功
       root :1024, mysql: 2212  : 备份成功
       root :1024, mysql: 2211  : 备份成功
       root :1024, mysql: 2210  : 重现
       root :1024, mysql: 1600  : 重现

综上,至少需要ibd表空间数+10个FD,才能备份成功,按照作者的说法至多需要2*ibds + 10个FD, 对于线上备份而言,在作者没有彻底解决该bug之前,需要保证用户可用的FD数量至少 >= ibds+10(因为测试环境中并没有其它线程连接,所有FD要比作者描述的少的多), 如果业务线备份使用root用户, 更改问题较严重的话,备份都改为mysql用户备份。