一直在学习管理和维护mysql数据库服务器,上次写了一篇关系性能调优的文章,这篇是进一步做结果分析.

星期三的时候,数据库又出现问题了.反应网页打不开地图,开发过来询问原因.
1.通过监控cacti查看得到图片
分析:哇,可以看到系统负载猛增,而数据库事务处理数据异常,主要是因为查询导致出现大量的锁表.
结果:定位于mysql       

 
2.通过top实时查看进程状态
分析:系统负载相比平常来说,已经很大了.一般为0.1左右.
      进程数据正常
      cpu的用户占用比导常,确认为mysql用户引起,磁盘IO读写对服务器的影响小之又小.
      内存正常使用.
结果:  mysql这个应用导致CPU利用高,再次指向mysql.
top - 17:22:59 up 63 days,  1:04,  3 users,  load average: 0.66, 0.63, 0.72
Tasks: 140 total,   1 running, 139 sleeping,   0 stopped,   0 zombie
Cpu(s):
 
86.9%us,  0.1%sy,  0.0%ni, 13.0%id,  
0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3925936k total,  1942804k used,  1983132k free,   290252k buffers
 5026 mysql     20   0 3606m 1.0g 6512 S
 
388.0
 26.6   4000:54 mysqld                                                                                                         
top - 17:23:02 up 63 days,  1:05,  3 users,  load average: 0.60, 0.62, 0.71
Tasks: 140 total,   1 running, 139 sleeping,   0 stopped,   0 zombie
Cpu(s):
 
92.7%us,  0.1%sy,  0.0%ni,  7.2%id,  
0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3925936k total,
 
 1950360k used,  1975576k free,   290252k buffers
 5026 mysql     20   0 3606m 1.0g 6512 S
 
399.7
 26.6   4001:06 mysqld                                                                                                         
top - 17:23:05 up 63 days,  1:05,  3 users,  load average: 0.60, 0.62, 0.71
Tasks: 140 total,   1 running, 139 sleeping,   0 stopped,   0 zombie
Cpu(s):
 
92.0%us,  0.1%sy,  0.0%ni,  7.9%id,  
0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3925936k total,  1950360k used,  1975576k free,   290252k buffers
 5026 mysql     20   0 3606m 1.0g 6512 S
 399.7
 26.6   4001:18 mysqld                                                                                                        
3,通过dstat -tlpcsdny综合查看
分析: 负载: 蛮大的嘛.
      进程数据: 有三个进程在同时跑,相对平时有导演.
      cpu使用情况: 用户占用比率70%,怀疑是mysql用户的mysqld进程引起
      缓存可是使用18M,通过分析得知是mysqld程序使用的.
     磁盘读写和网络读写正常
   CPU交互处理值高
结果:mysql占用缓存,是不是数据库服务器配置的缓存不够呢?导致数据频繁转换,引起CPU忙碌?
   mysql可能正在运算什么,导致CPU值高.可能是语句,可能是缓存读写.
[root@testmysql1 ~]# dstat -tlpcsdny
----system---- ---load-avg--- ---procs--- ----total-cpu-usage---- ----swap--- -dsk/total- -net/total- ---system--
  date/time   | 1m   5m  15m |run blk new|usr sys idl wai hiq siq| used  free| read  writ| recv  send| int   csw 
31-10 17:29:44|0.48 0.67 0.78|0.0   0 0.3|  1   0  98   0   0   0|  18M 5998M|8139B   25k|   0     0 | 169   157 
31-10 17:29:45|0.48 0.67 0.78|3.0   0   0| 72   0  28   0   0   0|  18M 5998M|   0     0 |  60B 1226B|3678   963 
31-10 17:29:46|0.52 0.67 0.78|3.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0     0 |  60B  474B|3525   925 
31-10 17:29:47|0.52 0.67 0.78|3.0   0   0| 76   0  24   0   0   0|  18M 5998M|   0     0 |  11k 1021B|3654  1072 
31-10 17:29:48|0.52 0.67 0.78|3.0   0   0| 67   0  33   0   0   0|  18M 5998M|   0     0 |  60B  474B|3724  1103 
31-10 17:29:49|0.52 0.67 0.78|3.0   0   0| 72   0  28   0   0   0|  18M 5998M|   0     0 |  60B  474B|3668  1008 
31-10 17:29:50|0.52 0.67 0.78|3.0   0   0| 72   0  28   0   0   0|  18M 5998M|   0     0 |  60B  474B|3732  1174 
31-10 17:29:51|0.48 0.66 0.78|3.0   0   0| 70   0  30   0   0   0|  18M 5998M|   0     0 |  60B  474B|3700  1038 
31-10 17:29:52|0.48 0.66 0.78|3.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0    16k|  60B  474B|3749  1151 
31-10 17:29:53|0.48 0.66 0.78|3.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0     0 |  60B  474B|3597  1063 
31-10 17:29:54|0.48 0.66 0.78|3.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0     0 |  60B  474B|3712  1091 
31-10 17:29:55|0.48 0.66 0.78|3.0   0   0| 72   0  28   0   0   0|  18M 5998M|   0     0 |  60B  474B|3688  1013 
31-10 17:29:56|0.44 0.65 0.77|3.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0     0 |  60B  474B|3551   964 
31-10 17:29:57|0.44 0.65 0.77|3.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0     0 |  60B  474B|3723  1093 
31-10 17:29:58|0.44 0.65 0.77|3.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0     0 |  60B  582B|3678  1007 
31-10 17:29:59|0.44 0.65 0.77|4.0   0   0| 71   0  29   0   0   0|  18M 5998M|   0     0 |  60B  528B|3563   989 
 
4.iostat -dx sdb 1 用于查看磁盘的使用情况.
磁盘的读写不是很频繁呀,可能是摘取的数据量偏少吧。
sdb是用于存放数据的磁盘。
[root@testmysql1 ~]# iostat -dx sdb 1
Linux 2.6.32-220.el6.x86_64 (testmysql1.iosm.cn) 
10/31/2012 
_x86_64_
(4 CPU)
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.34    0.01    0.50     0.89    13.73    28.65     0.00    0.76   0.46   0.02
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
 
 
5.进入到服务器看看线程情况.
分析,这是服务器内部所有线程,几乎都是在锁表查询,蓝色那条是在创建临时表.
        show full processlist可以列出所有线程,且显示整条执行命令的详细内容.
结果
        SQL语句一定程度上给系统带来压力,这里可以验证,都锁表并且长时间查询。
        后来知道主要问题是那个蓝色框起SQL语句引起.这会在后面说.
   

6.因为这个数据库主要使用innodb作为主要引擎,查看一下innodb的状态信息show innodb status\G;
结果:进一步确认SQL语句对主机造成的影响,但是不排除缓存呀。下一步我们来了解数据库的具体状态。
------------
TRANSACTIONS
------------
Trx id counter 0 145979792
Purge done for trx's n:o < 0 145958397 undo n:o < 0 0
History list length 7263
#事务挺多队列的.可惜我不会看.谁可以教教我.
LIST OF TRANSACTIONS FOR EACH SESSION:
MySQL thread id 411116, query id 154797197 192.168.209.101 iscreate Table lock
SELECT bs.* FROM associate_area_basestation AS aab,BaseStation AS bs WHERE basestationId = bs.id AND areaId =1492 AND basestationtype = 'BaseStation'
---TRANSACTION 0 145958402, not started, process no 5026, OS thread id 139880173119232
MySQL thread id 411112, query id 154748149 192.168.209.101 iscreate
---TRANSACTION 0 145958387, not started, process no 5026, OS thread id 139880161138432
mysql tables in use 35, locked 0
MySQL thread id 411113, query id 154822298 192.168.209.101 iscreate Table lock
SELECT bs.* FROM associate_area_basestation AS aab,BaseStation AS bs WHERE basestationId = bs.id AND areaId =1 AND basestationtype = 'BaseStation'
---TRANSACTION 0 145958444, not started, process no 5026, OS thread id 139880125728512
MySQL thread id 411110, query id 154863380 192.168.209.101 iscreate
---TRANSACTION 0 145958453, not started, process no 5026, OS thread id 139880044050176
mysql tables in use 35, locked 0
 
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
7603 OS file reads, 7256222 OS file writes, 189552 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
#不会看.
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 5, seg size 7,
95 inserts, 95 merged recs, 45 merges
Hash table size 2212699, node heap has 491 buffer(s)
1156298.35 hash searches/s, 32730.13 non-hash searches/s
#不会看.
---
LOG
---
Log sequence number 2 63379809
Log flushed up to   2 63379809
Last checkpoint at  2 63379809
0 pending log writes, 0 pending chkp writes
6977921 log i/o's done, 0.00 log i/o's/second
#虽然我不会看,不过日志缓存是没有问题的,同步率高呀.
----------------------
BUFFER POOL AND MEMORY
----------------------
#这是innodb缓冲和内存的信息
Total memory allocated 1242263602; in additional pool allocated 23789312
Dictionary memory allocated 1183816
#innodb_buffer_size大约在1G左右,额外的内在也有23M吧.
Buffer pool size   65536
Free buffers       37175
Database pages     27870
Modified db pages  0
#innodb池有6万多page,空闲3万,缓存数据449M,
#而居然没有脏数据,因为都在查询呀.
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages read 25176, created 2694, written 361950
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000
#可以看到池命中率达100%,还好吧.
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
2 read views open inside InnoDB
Main thread process no. 5026, id 139880204908288, state: waiting for server activity
#存在两个视图在innodb,还好吧.
Number of rows inserted 75526, updated 6833124, deleted 24680, read 168674061904
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 1175467.27 reads/s
#这个dbs读取率吓人呀.插入,更新,删除几乎没有.
----------------------------
 
7.这是通过mysqlreport查看的状态信息
分析:我们已经判定是数据库进程给服务器造成性能方面的影响,可是数据库到底出了什么问题呢?
以下为分析内容。
结果:可以排除由于缓存不足给数据库带来的影响,很充足呀。所以现在所以的证据都指向sql语句了。
MySQL 5.1.41-log         uptime 43 0:32:39      Wed Oct 31 16:19:07 2012
__ Key _________________________________________________________________
Buffer used     1.21M of  32.00M  %Used:   3.77
  Current       6.02M            %Usage:  18.82
Write hit      91.19%
Read hit      100.00%
#key是专用于缓存myisam的主键信息,有充足的内存提供,读写命中率都很好。排除
__ Questions ___________________________________________________________
Total         149.93M    40.3/s
  QC Hits     103.65M    27.9/s  %Total:  69.13
  Com_         30.79M     8.3/s           20.53
  DMS          16.13M     4.3/s           10.76
  -Unknown      1.03M     0.3/s            0.69
  COM_QUIT    390.79k     0.1/s            0.26
#可以看到mysql的主要操作都是有意义的,69%缓存命中,10%读写增删,
#只有20%非有用性操作,排除之前系统Com_比占多数的故障,因为有调优了。
Slow 10 s         358     0.0/s            0.00  %DMS:   0.00  Log: OFF
DMS            16.13M     4.3/s           10.76
  SELECT        8.75M     2.4/s            5.84         54.25
  UPDATE        7.14M     1.9/s            4.76         44.28
  INSERT      211.21k     0.1/s            0.14          1.31
  DELETE       26.92k     0.0/s            0.02          0.17
  REPLACE           0       0/s            0.00          0.00
#再一次证明10%中的DMS有54%的读操作。
#这数据库服务器主要用于读操作。
Com_           30.79M     8.3/s           20.53
  set_option   14.90M     4.0/s            9.94
  rollback      7.25M     1.9/s            4.83
  commit        7.02M     1.9/s            4.68
__ SELECT and Sort _____________________________________________________
Scan            1.76M     0.5/s %SELECT:  20.16
Range         286.96k     0.1/s            3.28
Full join     132.61k     0.0/s            1.52
Range check         0       0/s            0.00
Full rng join   1.39k     0.0/s            0.02
Sort scan      55.06k     0.0/s
Sort range     54.10k     0.0/s
Sort mrg pass     352     0.0/s
 
__ Query Cache _________________________________________________________
Memory usage   35.49M of 128.00M  %Used:  27.72
Block Fragmnt  10.13%
Hits          103.65M    27.9/s
Inserts         7.65M     2.1/s
Insrt:Prune   7.65M:1     2.1/s
Hit:Insert    13.55:1
#128M查询缓存充足,只占用27%,足矣。
__ Table Locks _________________________________________________________
Waited            239     0.0/s  %Total:   0.00
Immediate      17.47M     4.7/s
 
__ Tables ______________________________________________________________
Open             1025 of 1600    %Cache:  64.06
Opened         58.92k     0.0/s
 
__ Connections _________________________________________________________
Max used          590 of 1500      %Max:  39.33
Total         411.36k     0.1/s
 
__ Created Temp ________________________________________________________
Disk table     80.57k     0.0/s
Table           1.84M     0.5/s    Size:  16.0M
File              211     0.0/s
 
__ Threads _____________________________________________________________
Running            23 of  192
Cached            399 of 1024      %Hit:  99.86
Created           590     0.0/s
Slow                0       0/s
 
__ Aborted _____________________________________________________________
Clients        41.12k     0.0/s
Connects        6.12k     0.0/s
 
__ Bytes _______________________________________________________________
Sent          124.14G   33.4k/s
Received       13.01G    3.5k/s
 
__ InnoDB Buffer Pool __________________________________________________
Usage         443.14M of   1.00G  %Used:  43.28
Read hit      100.00%
#innodb的缓存池也只用了43%,命中率为100%。
Pages
  Free         37.17k            %Total:  56.72
  Data         27.87k                     42.53 %Drty:   0.00
  Misc            491                      0.75
  Latched                                  0.00
#空闲的PAG还有一半,脏page也没有,特定page也较少。
Reads         325.60G   87.6k/s
  From file     6.70k     0.0/s            0.00
  Ahead Rnd        14     0.0/s
  Ahead Sql       292     0.0/s
Writes         44.83M    12.1/s
Flushes       362.08k     0.1/s
Wait Free           0       0/s
#说明还是读操作厉害呀,即使是写,也只是刷新而已。
#缓存得到充分使用,还有充足的空间。
__ InnoDB Lock _________________________________________________________
Waits          105825     0.0/s
Current             0
Time acquiring
  Total        307271 ms
  Average           2 ms
  Max            3378 ms
 
__ InnoDB Data, Pages, Rows ____________________________________________
Data
  Reads         7.60k     0.0/s
  Writes        7.26M     2.0/s
  fsync       189.61k     0.1/s
  Pending
    Reads           0
    Writes          0
    fsync           0
 
Pages
  Created       2.69k     0.0/s
  Read         25.18k     0.0/s
  Written     362.08k     0.1/s
 
Rows
  Deleted      24.68k     0.0/s
  Inserted     75.53k     0.0/s
  Read        169.00G   45.5k/s
  Updated       6.84M     1.8/s
 
8.不好意思,现在你看到的只有两条线程在执行。
但即使只有两条语句,通过dstat iostat分析,
服务器的压力还在,嗯,是这些语句的问题了,因为调了一个钟,他们都一直存在。
我试图explain一条语句的时候,居然等了很久都没有解释出来,看来sql语句得进一步优化呀。
所以得到原来是这些开发人员编写的语句,造成服务器性能的问题,交给开发人员去解决吧。
结果:innodb引擎是基于行读锁的,当以下语句对多个表连表查询时,会进行笛卡儿积运算,消耗CPU资源,
如果再没有索引过滤,呵呵。
一定要小心,因为会产生大量的中转临时表,查询也会占用很久很久的时间。如果不小心,成了僵死线程。
以下就是一个很好的案例。
[root@testmysql1 ~]# mysql -u root -p  -e "show full processlist" | grep -i SELECT
Enter password: 
411436
xx
192.168.209.101:36671
iosm
Query
2409
Copying to tmp table
SELECT b.basestationid AS basestationid, b.bname AS basestationname,b.id AS stationid,b.sname AS stationname,u.urgencyrepairCount,r.routineCount FROM  (SELECT station.id,station.name AS sname,aab.basestationId,aab.name AS bname,aab.areaId FROM (  SELECT * FROM networkresourcemanage.associate_area_basestation, networkresourcemanage.BaseStation_WLAN WHERE basestationType = 'BaseStation_WLAN' AND basestationId = id)   AS aab,networkresourcemanage.station AS station WHERE aab.stationId = station.id AND aab.areaId IN (1035,1029,1034,1028,1037,1027,1036,1026,1039,1497,1038,190,1025,1024,1032,1033,1030,1040,1031,1041,1492)) AS b   LEFT JOIN  (SELECT COUNT(*) AS urgencyrepairCount,t1.networkResourceId AS uId,t1.stationId AS usId,t1.networkResourcetype AS uType FROM v_bizwf_taskorder t3 ,   urgencyrepair_workorderassnetresource t1,v_bizwf_commonworkorder t2  WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND  t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND  t3.WOID = t2.WOID  AND t1.woId IS   NOT NULL AND t1.woId = t2.WOID GROUP  BY t1.networkResourceId) AS u ON u.uId = b.basestationId AND u.uType = 'BaseStation_WLAN'  LEFT JOIN  (SELECT COUNT(*) routineCount,t1.networkResourceId AS rId,t1.stationId AS rsId,t1.networkResourcetype AS uType  FROM v_bizwf_taskorder t3 , urgencyrepair_workorderassnetresource  t1,v_bizwf_commonworkorder t2  WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND  t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND  t3.WOID = t2.WOID   AND t3.OPERATETIME < NOW()  AND t1.woId IS NOT NULL AND t1.woId = t2.WOID GROUP  BY  t1.networkResourceId) AS r ON r.rId = b.basestationId AND r.uType = 'BaseStation_WLAN'
 
411746
xx 192.168.206.229:49627
iosm
Query
102
Copying to tmp table
SELECT b.basestationid AS basestationid, b.bname AS basestationname,b.id AS stationid,b.sname AS stationname,u.urgencyrepairCount,r.routineCount \n\tFROM  (SELECT station.id,station.name AS sname,aab.basestationId,aab.name AS bname,aab.areaId \n\tFROM (  SELECT * FROM networkresourcemanage.associate_area_basestation, networkresourcemanage.BaseStation_WLAN WHERE basestationType = 'BaseStation_WLAN' AND basestationId = id)   AS aab\n,networkresourcemanage.station AS station WHERE aab.stationId = station.id AND aab.areaId IN (1035,1029,1034,1028,1037,1027,1036,1026,1039,1497,1038,190,1025,1024,1032,1033,1030,1040,1031,1041,1492)) AS b\n\t   LEFT JOIN  \n\t   (SELECT COUNT(*) AS urgencyrepairCount,t1.networkResourceId AS uId,t1.stationId AS usId,t1.networkResourcetype AS uType FROM v_bizwf_taskorder t3 ,   \n\t   urgencyrepair_workorderassnetresource t1,v_bizwf_commonworkorder t2  WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND  t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND  t3.WOID = t2.WOID  \n\t   AND t1.woId IS   NOT NULL AND t1.woId = t2.WOID GROUP  BY t1.networkResourceId) AS u ON u.uId = b.basestationId AND u.uType = 'BaseStation_WLAN'  \n\t   LEFT JOIN  \n\t   (SELECT COUNT(*) routineCount,t1.networkResourceId AS rId,t1.stationId AS rsId,t1.networkResourcetype AS uType  FROM v_bizwf_taskorder t3 , urgencyrepair_workorderassnetresource  t1\n\t   ,v_bizwf_commonworkorder t2  WHERE t3.TOID IS NOT NULL AND t3.WOID IS NOT NULL AND  t2.`STATUS`<> 3 AND t2.`STATUS`<> 1 AND  t3.WOID = t2.WOID   AND t3.OPERATETIME < NOW()  \n\t   AND t1.woId IS NOT NULL AND t1.woId = t2.WOID GROUP  BY  t1.networkResourceId) AS r ON r.rId = b.basestationId AND r.uType = 'BaseStation_WLAN'
 

##################################################################

1.tomcat服务器的底层架构java内核是公司自己开发的,无论是增删改查操作,innodb都会产生

commit 和rollback,造成服务器很忙(Com_数值过50%),java内核在整改,这次不是主要原因。

2.之前服务器都是使用默认配置,数据库层面没有调优,以前事故频繁,现在已经解决了,这次不是主要原因。

3.开发人员处理sql语句经常会出错,忘记使用索引呀,语句臃肿,是这次事故的重要原因。

#################################################################