Atlas使用问题汇总

1.安装问题 源码编译Atlas新版 注:2.1版的编码问题解决了, /master/方式通过php,java等完成访问(命令行方式不能转到master主机上).

yum install -y libevent-devel lua-devel openssl-devel flex mysql-devel
yum install gcc.x86_64 gcc-c++.x86_64

glibc install:
  391  wget http://pkgs.fedoraproject.org/repo/pkgs/glib2/glib-2.32.4.tar.xz/bf84fefd9c1a5b5a7a38736f4ddd674a/glib-2.32.4.tar.xz
  393  tar xf glib-2.32.4.tar.xz 
  395  cd glib-2.32.4
  397  mkdir /usr/local/glib/
  398  ./configure --prefix=/usr/local/glib
  401  yum install libffi-devel.x86_64
  403  ./configure --prefix=/usr/local/glib
  405  make
  406  make install 

checking for GLIB... configure: error: Package requirements (glib-2.0 >= 2.32.0) were not met:
checking for GMODULE... configure: error: Package requirements (gmodule-2.0 >= 2.32.0) were not met:

[root@atlas Atlas-master]# cat bootstrap.sh

#!/bin/sh 
base=$(cd "$(dirname "$0")"; pwd)
cd $base
PKG_CONFIG_PATH=/usr/local/glib/lib/pkgconfig ./configure --with-mysql=/usr --prefix=/opt/mysql-proxy 

#make && make install

2.延迟处理 见: https://github.com/Qihoo360/Atlas/issues/8#issuecomment-38153303 使用中间件做read负载, 最核心的是解决延迟问题,应用端可以通过/master/方式强制指定从master获取信息,不过这不能从本质上解决问题,使用Atlas需要考虑到这点, 提高主机的IO性能,加大memory可以缓解延迟症状,但依旧不能避免延迟的出现,尤其是读多写少的应用。

Atlas没有做主从状态的检测, 在SQL_THREAD为No情况下,sql的执行失败(在延迟情况下不管是空结果集还是表不存在)都不会使slave在backends下线,不过可以通过管理接口来实现offline出问题的slave,添加命令到监控项可实现自动下线。比如:

mysql -h 10.3.254.106 -P 2345 -u admin -pxxxxxxxx -e 'SET OFFLINE 3'

下线功能或许可能在未来的版本实现,目前可以通过daemon程序来频繁检测slave的延迟情况, 延迟较大(如检测两次,每次都大于10s)则下线slave。

https://github.com/arstercz/Atlas_auto_setline

# atlas auto set line.
* * * * * cd /web/scripts/atlas_auto; perl atlas_auto_setline.pl --conf=db.conf --verbose --setline >>setline.log 2>&1

3.thread超时退出 MySQL为percona 5.1.34版本

mysql root@[localhost:s3306 information_schema] > select command,time,state,info from processlist order by time desc limit 10;
+---------+--------+----------------------------------+------+
| command | time   | state                            | info |
+---------+--------+----------------------------------+------+
| Connect | 111463 | Waiting for master to send event | NULL |
| Sleep   |   1473 |                                  | NULL |
| Sleep   |   1473 |                                  | NULL |
| Sleep   |   1356 |                                  | NULL |
| Sleep   |   1356 |                                  | NULL |
| Sleep   |   1342 |                                  | NULL |
| Sleep   |   1242 |                                  | NULL |
| Sleep   |   1242 |                                  | NULL |
| Sleep   |   1241 |                                  | NULL |
| Sleep   |   1241 |                                  | NULL |
+---------+--------+----------------------------------+------+
10 rows in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select count(*) from processlist;                                           
+----------+
| count(*) |
+----------+
|       76 |
+----------+
1 row in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select command,time,state,info from processlist order by time desc limit 10;
+---------+--------+----------------------------------+------+
| command | time   | state                            | info |
+---------+--------+----------------------------------+------+
| Connect | 111490 | Waiting for master to send event | NULL |
| Sleep   |   1383 |                                  | NULL |
| Sleep   |   1383 |                                  | NULL |
| Sleep   |   1369 |                                  | NULL |
| Sleep   |   1269 |                                  | NULL |
| Sleep   |   1269 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1252 |                                  | NULL |
+---------+--------+----------------------------------+------+
10 rows in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select count(*) from processlist;                                           
+----------+
| count(*) |
+----------+
|       74 |
+----------+
1 row in set (0.00 sec)

为interactive_timeout参数的值,见:

| interactive_timeout        | 1500  |
| wait_timeout               | 600   |
On thread startup, the session wait_timeout value is initialized from the global wait_timeout value or from the global interactive_timeout value, depending on the type of client (as defined by the CLIENT_INTERACTIVE connect option to mysql_real_connect()). See also interactive_timeout.

即Mysql 初始化wait_timeout的值根据CLIENT_INTERACTIVE(应用端交互类型)选取是wait_timeout还是interactive_timeout。 php应用采用mysql_real_connect函数连接(不一定严格使用该名字的函数,只要底层使用了CLIENT_INTERACTIVE类型;比如mysqli驱动连接分为mysqli::real_connect – mysqli_real_connect — Opens a connection to a mysql server和mysqli::__construct – mysqli_connect — Open a new connection to the MySQL server两种,前者连接前需要初始化一个对象,后者为单纯的新建连接,这个时候的初始值为wait_timeout),则选取interactive_timeout值作为初始值见(interactive_timeout选项): An interactive client is defined as a client that uses the CLIENT_INTERACTIVE option to mysql_real_connect()。

4.resource limit问题 atlas有自己的连接池,会吃掉很多CPU, php应用端改用短链接来连接atlas, 这时候atlas对php发送来的sql只负责验证和转发的操作,后端DB的连接由atlas自己管理,未使用的连接线程进行剔除操作(DB的wait_timeout和interactive_timeout设置为300s,超时亦退出)。

2014-04-12 20:56:29: (warning) (libevent) event_del: event has no event_base set.
2014-04-12 20:56:29: (critical) last message repeated 5 times
2014-04-12 20:56:29: (critical) network-conn-pool-lua.c.144: socket() failed: Too many open files (24)
2014-04-12 20:56:29: (warning) (libevent) event_del: event has no event_base set.
2014-04-12 20:56:30: (debug) chassis-unix-daemon.c:168: 12951 returned: 12951
2014-04-12 20:56:30: (critical) chassis-unix-daemon.c:196: [angel] PID=12951 died on signal=11 (it used 16 kBytes max) ... waiting 3min before restart
如果MySQL后端的连接数也满了, 或者atlas 连不上后端的数据库都可能会报以下错误: ``` 2014-11-13 12:21:07: (critical) network_mysqld_proto_password_scramble: assertion `20 == challenge_len' failed 2014-11-13 12:21:07: (warning) (libevent) event_del: event has no event_base set. 2014-11-13 12:21:07: (critical)  ``` 可以临时增加MySQL connection数量: echo -n "Max processes=SOFT_LIMIT:HARD_LIMIT" > /proc/`pidof mysqld`/limits

关于Too many open files错误,可能由两种情况引起: 一、php长连接连接到atlas后,每个线程占用一个FD,直到超出系统资源限制而出现too many错误; 二、php应用端发送到atlas的sql过多,大量并发的情况下,linevent维护的队列过多,每个event吃一个FD,超出系统资源限制引起too many错误;

避免too many错误,增加用户的ulimit值加大FD的使用量,可增加系统ulimit 资源到 ~/.bash_profile文件或/etc/security/limits.conf文件:

# cat .bash_profile 
# .bash_profile
...
...
export PATH
ulimit -n 16384

5. OOM问题 mysql-proxy进程占用系统资源过多, 由于系统评分机制的原因, atlas进程被系统杀掉, 比如以下错误:

May  3 09:55:31 cz kernel: Out of memory: Kill process 45330 (mysql-proxy) score 1 or sacrifice child
May  3 09:55:31 cz kernel: Killed process 45331, UID 0, (mysql-proxy) total-vm:4917192kB, anon-rss:177720kB, file-rss:40kB
May  3 09:55:31 cz kernel: mysql-proxy: page allocation failure. order:0, mode:0x201da
May  3 09:55:31 cz kernel: Pid: 45331, comm: mysql-proxy Not tainted 2.6.32-431.5.1.el6.x86_64 #1
May  3 09:55:31 cz kernel: Call Trace:
May  3 09:55:31 cz kernel: [<ffffffff8112f9d7>] ? __alloc_pages_nodemask+0x757/0x8d0
May  3 09:55:31 cz kernel: [<ffffffff81167a8a>] ? alloc_pages_current+0xaa/0x110
May  3 09:55:31 cz kernel: [<ffffffff8111fd47>] ? __page_cache_alloc+0x87/0x90
May  3 09:55:31 cz kernel: [<ffffffff8111f72e>] ? find_get_page+0x1e/0xa0
May  3 09:55:31 cz kernel: [<ffffffff81120ce7>] ? filemap_fault+0x1a7/0x500
May  3 09:55:31 cz kernel: [<ffffffff8114a074>] ? __do_fault+0x54/0x530
May  3 09:55:31 cz kernel: [<ffffffff8114a647>] ? handle_pte_fault+0xf7/0xb00
May  3 09:55:31 cz kernel: [<ffffffff81527c20>] ? thread_return+0x4e/0x76e
May  3 09:55:31 cz kernel: [<ffffffff8109f23b>] ? __remove_hrtimer+0x3b/0xb0
May  3 09:55:31 cz kernel: [<ffffffff8109f7f1>] ? lock_hrtimer_base+0x31/0x60
May  3 09:55:31 cz kernel: [<ffffffff810a047f>] ? hrtimer_try_to_cancel+0x3f/0xd0
May  3 09:55:31 cz kernel: [<ffffffff8114b27a>] ? handle_mm_fault+0x22a/0x300
May  3 09:55:31 cz kernel: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
May  3 09:55:31 cz kernel: [<ffffffff811d174e>] ? ep_poll+0x12e/0x330
May  3 09:55:31 cz kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
May  3 09:55:31 cz kernel: [<ffffffff8152d76e>] ? do_page_fault+0x3e/0xa0
May  3 09:55:31 cz kernel: [<ffffffff8152ab25>] ? page_fault+0x25/0x30

鉴于这种情况, 即便atlas做了高可用相关的架构, 在切换到另一台atlas之后也可能很快的被系统 OOM 掉, 可以临时禁止atlas进程参与系统评分, 尽量避免被系统杀掉.

for x in `pidof mysql-proxy`; do  echo '-17' > /proc/$x/oom_adj; done