一直在学习管理和维护mysql数据库服务器,上次写了一篇关系性能调优的文章,这篇是进一步做结果分析.
星期三的时候,数据库又出现问题了.反应网页打不开地图,开发过来询问原因.
1.通过监控cacti查看得到图片
分析:哇,可以看到系统负载猛增,而数据库事务处理数据异常,主要是因为查询导致出现大量的锁表.
结果:定位于mysql
2.通过top实时查看进程状态
分析:系统负载相比平常来说,已经很大了.一般为0.1左右.
进程数据正常
cpu的用户占用比导常,确认为mysql用户引起,磁盘IO读写对服务器的影响小之又小.
内存正常使用.
结果: mysql这个应用导致CPU利用高,再次指向mysql.
top - 17:22:59 up 63 days, 1:04, 3 users, load average: 0.66, 0.63, 0.72
Tasks: 140 total, 1 running, 139 sleeping, 0 stopped, 0 zombie
Cpu(s): 86.9%us, 0.1%sy, 0.0%ni, 13.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3925936k total, 1942804k used, 1983132k free, 290252k buffers
5026 mysql 20 0 3606m 1.0g 6512 S 388.0 26.6 4000:54 mysqld
top - 17:23:02 up 63 days, 1:05, 3 users, load average: 0.60, 0.62, 0.71
Tasks: 140 total, 1 running, 139 sleeping, 0 stopped, 0 zombie
Cpu(s): 92.7%us, 0.1%sy, 0.0%ni, 7.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3925936k total, 1950360k used, 1975576k free, 290252k buffers
5026 mysql 20 0 3606m 1.0g 6512 S 399.7 26.6 4001:06 mysqld
top - 17:23:05 up 63 days, 1:05, 3 users, load average: 0.60, 0.62, 0.71
Tasks: 140 total, 1 running, 139 sleeping, 0 stopped, 0 zombie
Cpu(s): 92.0%us, 0.1%sy, 0.0%ni, 7.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3925936k total, 1950360k used, 1975576k free, 290252k buffers
5026 mysql 20 0 3606m 1.0g 6512 S 399.7 26.6 4001:18 mysqld
3,通过dstat -tlpcsdny综合查看
分析: 负载: 蛮大的嘛.
进程数据: 有三个进程在同时跑,相对平时有导演.
cpu使用情况: 用户占用比率70%,怀疑是mysql用户的mysqld进程引起
缓存可是使用18M,通过分析得知是mysqld程序使用的.
磁盘读写和网络读写正常
CPU交互处理值高
结果:mysql占用缓存,是不是数据库服务器配置的缓存不够呢?导致数据频繁转换,引起CPU忙碌?
mysql可能正在运算什么,导致CPU值高.可能是语句,可能是缓存读写.
[root@testmysql1 ~]# dstat -tlpcsdny
----system---- ---load-avg--- ---procs--- ----total-cpu-usage---- ----swap--- -dsk/total- -net/total- ---system--
date/time | 1m 5m 15m |run blk new|usr sys idl wai hiq siq| used free| read writ| recv send| int csw
31-10 17:29:44|0.48 0.67 0.78|0.0 0 0.3| 1 0 98 0 0 0| 18M 5998M|8139B 25k| 0 0 | 169 157
31-10 17:29:45|0.48 0.67 0.78|3.0 0 0| 72 0 28 0 0 0| 18M 5998M| 0 0 | 60B 1226B|3678 963
31-10 17:29:46|0.52 0.67 0.78|3.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 0 | 60B 474B|3525 925
31-10 17:29:47|0.52 0.67 0.78|3.0 0 0| 76 0 24 0 0 0| 18M 5998M| 0 0 | 11k 1021B|3654 1072
31-10 17:29:48|0.52 0.67 0.78|3.0 0 0| 67 0 33 0 0 0| 18M 5998M| 0 0 | 60B 474B|3724 1103
31-10 17:29:49|0.52 0.67 0.78|3.0 0 0| 72 0 28 0 0 0| 18M 5998M| 0 0 | 60B 474B|3668 1008
31-10 17:29:50|0.52 0.67 0.78|3.0 0 0| 72 0 28 0 0 0| 18M 5998M| 0 0 | 60B 474B|3732 1174
31-10 17:29:51|0.48 0.66 0.78|3.0 0 0| 70 0 30 0 0 0| 18M 5998M| 0 0 | 60B 474B|3700 1038
31-10 17:29:52|0.48 0.66 0.78|3.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 16k| 60B 474B|3749 1151
31-10 17:29:53|0.48 0.66 0.78|3.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 0 | 60B 474B|3597 1063
31-10 17:29:54|0.48 0.66 0.78|3.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 0 | 60B 474B|3712 1091
31-10 17:29:55|0.48 0.66 0.78|3.0 0 0| 72 0 28 0 0 0| 18M 5998M| 0 0 | 60B 474B|3688 1013
31-10 17:29:56|0.44 0.65 0.77|3.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 0 | 60B 474B|3551 964
31-10 17:29:57|0.44 0.65 0.77|3.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 0 | 60B 474B|3723 1093
31-10 17:29:58|0.44 0.65 0.77|3.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 0 | 60B 582B|3678 1007
31-10 17:29:59|0.44 0.65 0.77|4.0 0 0| 71 0 29 0 0 0| 18M 5998M| 0 0 | 60B 528B|3563 989
4.iostat -dx sdb 1 用于查看磁盘的使用情况.
磁盘的读写不是很频繁呀,可能是摘取的数据量偏少吧。
sdb是用于存放数据的磁盘。
[root@testmysql1 ~]# iostat -dx sdb 1
Linux 2.6.32-220.el6.x86_64 (testmysql1.iosm.cn) 10/31/2012 _x86_64_ (4 CPU)
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.34 0.01 0.50 0.89 13.73 28.65 0.00 0.76 0.46 0.02
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5.进入到服务器看看线程情况.
分析,这是服务器内部所有线程,几乎都是在锁表查询,蓝色那条是在创建临时表.
show full processlist可以列出所有线程,且显示整条执行命令的详细内容.
结果
SQL语句一定程度上给系统带来压力,这里可以验证,都锁表并且长时间查询。
后来知道主要问题是那个蓝色框起SQL语句引起.这会在后面说.
6.因为这个数据库主要使用innodb作为主要引擎,查看一下innodb的状态信息show innodb status\G;
结果:进一步确认SQL语句对主机造成的影响,但是不排除缓存呀。下一步我们来了解数据库的具体状态。
------------
TRANSACTIONS
------------
Trx id counter 0 145979792
Purge done for trx's n:o < 0 145958397 undo n:o < 0 0
History list length 7263
#事务挺多队列的.可惜我不会看.谁可以教教我.
LIST OF TRANSACTIONS FOR EACH SESSION:
MySQL thread id 411116, query id 154797197 192.168.209.101 iscreate Table lock
SELECT bs.* FROM associate_area_basestation AS aab,BaseStation AS bs WHERE basestationId = bs.id AND areaId =1492 AND basestationtype = 'BaseStation'
---TRANSACTION 0 145958402, not started, process no 5026, OS thread id 139880173119232
MySQL thread id 411112, query id 154748149 192.168.209.101 iscreate
---TRANSACTION 0 145958387, not started, process no 5026, OS thread id 139880161138432
mysql tables in use 35, locked 0
MySQL thread id 411113, query id 154822298 192.168.209.101 iscreate Table lock
SELECT bs.* FROM associate_area_basestation AS aab,BaseStation AS bs WHERE basestationId = bs.id AND areaId =1 AND basestationtype = 'BaseStation'
---TRANSACTION 0 145958444, not started, process no 5026, OS thread id 139880125728512
MySQL thread id 411110, query id 154863380 192.168.209.101 iscreate
---TRANSACTION 0 145958453, not started, process no 5026, OS thread id 139880044050176
mysql tables in use 35, locked 0
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
7603 OS file reads, 7256222 OS file writes, 189552 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
#不会看.
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 5, seg size 7,
95 inserts, 95 merged recs, 45 merges
Hash table size 2212699, node heap has 491 buffer(s)
1156298.35 hash searches/s, 32730.13 non-hash searches/s
#不会看.
---
LOG
---
Log sequence number 2 63379809
Log flushed up to 2 63379809
Last checkpoint at 2 63379809
0 pending log writes, 0 pending chkp writes
6977921 log i/o's done, 0.00 log i/o's/second
#虽然我不会看,不过日志缓存是没有问题的,同步率高呀.
----------------------
BUFFER POOL AND MEMORY
----------------------
#这是innodb缓冲和内存的信息
Total memory allocated 1242263602; in additional pool allocated 23789312
Dictionary memory allocated 1183816
#innodb_buffer_size大约在1G左右,额外的内在也有23M吧.
Buffer pool size 65536
Free buffers 37175
Database pages 27870
Modified db pages 0
#innodb池有6万多page,空闲3万,缓存数据449M,
#而居然没有脏数据,因为都在查询呀.
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages read 25176, created 2694, written 361950
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000
#可以看到池命中率达100%,还好吧.
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
2 read views open inside InnoDB
Main thread process no. 5026, id 139880204908288, state: waiting for server activity
#存在两个视图在innodb,还好吧.
Number of rows inserted 75526, updated 6833124, deleted 24680, read 168674061904
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 1175467.27 reads/s
#这个dbs读取率吓人呀.插入,更新,删除几乎没有.
----------------------------
7.这是通过mysqlreport查看的状态信息
分析:我们已经判定是数据库进程给服务器造成性能方面的影响,可是数据库到底出了什么问题呢?
以下为分析内容。
结果:可以排除由于缓存不足给数据库带来的影响,很充足呀。所以现在所以的证据都指向sql语句了。
MySQL 5.1.41-log uptime 43 0:32:39 Wed Oct 31 16:19:07 2012
__ Key _________________________________________________________________
Buffer used 1.21M of 32.00M %Used: 3.77
Current 6.02M %Usage: 18.82
Write hit 91.19%
Read hit 100.00%
#key是专用于缓存myisam的主键信息,有充足的内存提供,读写命中率都很好。排除
__ Questions ___________________________________________________________
Total 149.93M 40.3/s
QC Hits 103.65M 27.9/s %Total: 69.13
Com_ 30.79M 8.3/s 20.53
DMS 16.13M 4.3/s 10.76
-Unknown 1.03M 0.3/s 0.69
COM_QUIT 390.79k 0.1/s 0.26
#可以看到mysql的主要操作都是有意义的,69%缓存命中,10%读写增删,
#只有20%非有用性操作,排除之前系统Com_比占多数的故障,因为有调优了。
Slow 10 s 358 0.0/s 0.00 %DMS: 0.00 Log: OFF
DMS 16.13M 4.3/s 10.76
SELECT 8.75M 2.4/s 5.84 54.25
UPDATE 7.14M 1.9/s 4.76 44.28
INSERT 211.21k 0.1/s 0.14 1.31
DELETE 26.92k 0.0/s 0.02 0.17
REPLACE 0 0/s 0.00 0.00
#再一次证明10%中的DMS有54%的读操作。
#这数据库服务器主要用于读操作。
Com_ 30.79M 8.3/s 20.53
set_option 14.90M 4.0/s 9.94
rollback 7.25M 1.9/s 4.83
commit 7.02M 1.9/s 4.68
__ SELECT and Sort _____________________________________________________
Scan 1.76M 0.5/s %SELECT: 20.16
Range 286.96k 0.1/s 3.28
Full join 132.61k 0.0/s 1.52
Range check 0 0/s 0.00
Full rng join 1.39k 0.0/s 0.02
Sort scan 55.06k 0.0/s
Sort range 54.10k 0.0/s
Sort mrg pass 352 0.0/s
__ Query Cache _________________________________________________________
Memory usage 35.49M of 128.00M %Used: 27.72
Block Fragmnt 10.13%
Hits 103.65M 27.9/s
Inserts 7.65M 2.1/s
Insrt:Prune 7.65M:1 2.1/s
Hit:Insert 13.55:1
#128M查询缓存充足,只占用27%,足矣。
__ Table Locks _________________________________________________________
Waited 239 0.0/s %Total: 0.00
Immediate 17.47M 4.7/s
__ Tables ______________________________________________________________
Open 1025 of 1600 %Cache: 64.06
Opened 58.92k 0.0/s
__ Connections _________________________________________________________
Max used 590 of 1500 %Max: 39.33
Total 411.36k 0.1/s
__ Created Temp ________________________________________________________
Disk table 80.57k 0.0/s
Table 1.84M 0.5/s Size: 16.0M
File 211 0.0/s
__ Threads _____________________________________________________________
Running 23 of 192
Cached 399 of 1024 %Hit: 99.86
Created 590 0.0/s
Slow 0 0/s
__ Aborted _____________________________________________________________
Clients 41.12k 0.0/s
Connects 6.12k 0.0/s
__ Bytes _______________________________________________________________
Sent 124.14G 33.4k/s
Received 13.01G 3.5k/s
__ InnoDB Buffer Pool __________________________________________________
Usage 443.14M of 1.00G %Used: 43.28
Read hit 100.00%
#innodb的缓存池也只用了43%,命中率为100%。
Pages
Free 37.17k %Total: 56.72
Data 27.87k 42.53 %Drty: 0.00
Misc 491 0.75
Latched 0.00
#空闲的PAG还有一半,脏page也没有,特定page也较少。
Reads 325.60G 87.6k/s
From file 6.70k 0.0/s 0.00
Ahead Rnd 14 0.0/s
Ahead Sql 292 0.0/s
Writes 44.83M 12.1/s
Flushes 362.08k 0.1/s
Wait Free 0 0/s
#说明还是读操作厉害呀,即使是写,也只是刷新而已。
#缓存得到充分使用,还有充足的空间。
__ InnoDB Lock _________________________________________________________
Waits 105825 0.0/s
Current 0
Time acquiring
Total 307271 ms
Average 2 ms
Max 3378 ms
__ InnoDB Data, Pages, Rows ____________________________________________
Data
Reads 7.60k 0.0/s
Writes 7.26M 2.0/s
fsync 189.61k 0.1/s
Pending
Reads 0
Writes 0
fsync 0
Pages
Created 2.69k 0.0/s
Read 25.18k 0.0/s
Written 362.08k 0.1/s
Rows
Deleted 24.68k 0.0/s
Inserted 75.53k 0.0/s
Read 169.00G 45.5k/s
Updated 6.84M 1.8/s
8.不好意思,现在你看到的只有两条线程在执行。 但即使只有两条语句,通过dstat iostat分析,
服务器的压力还在,嗯,是这些语句的问题了,因为调了一个钟,他们都一直存在。
我试图explain一条语句的时候,居然等了很久都没有解释出来,看来sql语句得进一步优化呀。
所以得到原来是这些开发人员编写的语句,造成服务器性能的问题,交给开发人员去解决吧。
结果:innodb引擎是基于行读锁的,当以下语句对多个表连表查询时,会进行笛卡儿积运算,消耗CPU资源,
如果再没有索引过滤,呵呵。
一定要小心,因为会产生大量的中转临时表,查询也会占用很久很久的时间。如果不小心,成了僵死线程。
以下就是一个很好的案例。
[root@testmysql1 ~]# mysql -u root -p -e "show full processlist" | grep -i SELECT
Enter password:
411436 xx 192.168.209.101:36671 iosm Query 2409 Copying to tmp table SELECT b.basestationid AS basestationid, b.bname AS basestationname,b.id AS stationid,b.sname AS stationname,u.urgencyrepairCount,r.routineCount FROM (SELECT station.id,station.name AS sname,aab.basestationId,aab.name AS bname,aab.areaId FROM ( SELECT * FROM networkresourcemanage.associate_area_basestation, networkresourcemanage.BaseStation_WLAN WHERE basestationType = 'BaseStation_WLAN' AND basestationId = id) AS aab,networkresourcemanage.station AS station WHERE aab.stationId = station.id AND aab.areaId IN (1035,1029,1034,1028,1037,1027,1036,1026,1039,1497,1038,190,1025,1024,1032,1033,1030,1040,1031,1041,1492)) AS b LEFT JOIN (SELECT COUNT(*) AS urgencyrepairCount,t1.networkResourceId AS uId,t1.stationId AS usId,t1.networkResourcetype AS uType FROM v_bizwf_taskorder t3 , urgencyrepair_workorderassnetresource t1,v_bizwf_commonworkorder t2 WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND t3.WOID = t2.WOID AND t1.woId IS NOT NULL AND t1.woId = t2.WOID GROUP BY t1.networkResourceId) AS u ON u.uId = b.basestationId AND u.uType = 'BaseStation_WLAN' LEFT JOIN (SELECT COUNT(*) routineCount,t1.networkResourceId AS rId,t1.stationId AS rsId,t1.networkResourcetype AS uType FROM v_bizwf_taskorder t3 , urgencyrepair_workorderassnetresource t1,v_bizwf_commonworkorder t2 WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND t3.WOID = t2.WOID AND t3.OPERATETIME < NOW() AND t1.woId IS NOT NULL AND t1.woId = t2.WOID GROUP BY t1.networkResourceId) AS r ON r.rId = b.basestationId AND r.uType = 'BaseStation_WLAN'
411746 xx 192.168.206.229:49627 iosm Query 102 Copying to tmp table SELECT b.basestationid AS basestationid, b.bname AS basestationname,b.id AS stationid,b.sname AS stationname,u.urgencyrepairCount,r.routineCount \n\tFROM (SELECT station.id,station.name AS sname,aab.basestationId,aab.name AS bname,aab.areaId \n\tFROM ( SELECT * FROM networkresourcemanage.associate_area_basestation, networkresourcemanage.BaseStation_WLAN WHERE basestationType = 'BaseStation_WLAN' AND basestationId = id) AS aab\n,networkresourcemanage.station AS station WHERE aab.stationId = station.id AND aab.areaId IN (1035,1029,1034,1028,1037,1027,1036,1026,1039,1497,1038,190,1025,1024,1032,1033,1030,1040,1031,1041,1492)) AS b\n\t LEFT JOIN \n\t (SELECT COUNT(*) AS urgencyrepairCount,t1.networkResourceId AS uId,t1.stationId AS usId,t1.networkResourcetype AS uType FROM v_bizwf_taskorder t3 , \n\t urgencyrepair_workorderassnetresource t1,v_bizwf_commonworkorder t2 WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND t3.WOID = t2.WOID \n\t AND t1.woId IS NOT NULL AND t1.woId = t2.WOID GROUP BY t1.networkResourceId) AS u ON u.uId = b.basestationId AND u.uType = 'BaseStation_WLAN' \n\t LEFT JOIN \n\t (SELECT COUNT(*) routineCount,t1.networkResourceId AS rId,t1.stationId AS rsId,t1.networkResourcetype AS uType FROM v_bizwf_taskorder t3 , urgencyrepair_workorderassnetresource t1\n\t ,v_bizwf_commonworkorder t2 WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND t3.WOID = t2.WOID AND t3.OPERATETIME < NOW() \n\t AND t1.woId IS NOT NULL AND t1.woId = t2.WOID GROUP BY t1.networkResourceId) AS r ON r.rId = b.basestationId AND r.uType = 'BaseStation_WLAN'
##################################################################
1.tomcat服务器的底层架构java内核是公司自己开发的,无论是增删改查操作,innodb都会产生
commit 和rollback,造成服务器很忙(Com_数值过50%),java内核在整改,这次不是主要原因。
2.之前服务器都是使用默认配置,数据库层面没有调优,以前事故频繁,现在已经解决了,这次不是主要原因。
3.开发人员处理sql语句经常会出错,忘记使用索引呀,语句臃肿,是这次事故的重要原因。
#################################################################