A few days ago I had to restart one of our shared hosting servers. It failed to come back up, so the DC techs had to intervene (they probably ran a manual fsck, but I am not sure what exactly they ran). Right after that reboot, the IO performance on the server dropped dramatically.
The HDDs are set up in RAID 1 behind an LSI MegaRAID 8704ELP, Version: 1.20.
The HDDs themselves are ST3500320NS.
The RAID array does not appear to be degraded (status optimal, no media errors).
The CPU is an E5520 (quad core w. HT, 8MB cache).
The problem is that IO wait is 5 times higher than it should be (probably more) and iostat shows pretty weird data. For comparison, here are the numbers from 2 identical servers:
          rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
OK one:    62.90   67.33  76.98  53.49  2469.66   985.81     26.48      0.10   3.83   0.22   2.84
Bad one:   12.67   78.06  74.31  49.04  2318.72  1017.19     27.04      2.19  17.73   4.38  54.03
While r/s and w/s are about the same (I believe this means the two servers see a similar workload) and avgrq-sz is virtually the same, rrqm/s is much lower on the bad system, avgqu-sz is much higher (it gets to about 75 times higher under heavier load), await is also much higher (it gets to ~50 times higher under load), and so is the service time (svctm).
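To make the comparison above easier to read, here is a minimal Python sketch that computes the bad/OK ratio for each iostat column from the two sample rows quoted in this post. The names `COLUMNS`, `parse_row` and `ratios` are made up for illustration, and the numbers are only the single samples shown above, not the values seen under heavier load.

```python
# Column names and sample rows copied from the iostat output in this post.
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]
OK_ROW = "62.90 67.33 76.98 53.49 2469.66 985.81 26.48 0.10 3.83 0.22 2.84"
BAD_ROW = "12.67 78.06 74.31 49.04 2318.72 1017.19 27.04 2.19 17.73 4.38 54.03"


def parse_row(row):
    """Turn one whitespace-separated iostat row into {column: float}."""
    return dict(zip(COLUMNS, (float(value) for value in row.split())))


def ratios(ok, bad):
    """Return bad/ok for every column; about 1.0 means the servers match."""
    return {name: bad[name] / ok[name] for name in COLUMNS}


ok_stats = parse_row(OK_ROW)
bad_stats = parse_row(BAD_ROW)

# Print one line per column so the outliers stand out.
for name, ratio in ratios(ok_stats, bad_stats).items():
    print(f"{name:>9}: bad is {ratio:.2f}x the OK server")
```

Columns with a ratio near 1 (r/s, w/s, avgrq-sz) describe the workload, which looks similar on both machines; the columns that deviate strongly (rrqm/s, avgqu-sz, await, svctm, %util) are the ones I am asking about.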
Also, while kjournald is very discreet on the OK server, on the bad server two of the four kjournald processes take the top spots, even after setting the ionice class to Idle for those 2 kjournald processes.
So what makes rrqm/s go down and avgqu-sz, await and svctm go up on the bad system? Is it an HDD, is it the card itself, is it some rogue mount option? What is busting the second server?
Thanks in advance for any suggestion!
Edited by AndyB78, 06 December 2012 - 07:21 PM.