[sheepdog-users] Unexpected freeze of sheep on one node

Valerio Pachera sirio81 at gmail.com
Thu Nov 20 11:44:11 CET 2014


2014-11-20 7:30 GMT+01:00 Maxim Terletskiy <terletskiy at emu.ru>:
> I've had similar problems.
> In such cases good test will be iostat:
> iostat -dx 5 /dev/sd[a-z]

This is a very good suggestion.
My host has 2 devices: a single 2T disk and a RAID 5 array (3 disks of
500G) managed by mdadm.

I ran a write test on the RAID device while watching iostat:

dd if=/dev/zero of=deleteme2 bs=4M count=$((2048/4)) oflag=direct
60,4 MB/s
(not bad)

iostat -dx 5 /dev/sd[a-c]

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               8,03    19,80    0,67    0,61    38,40    80,82   187,30     0,02   12,47   11,41   13,63   4,63   0,59
sdc               8,00    19,86    0,65    0,73    38,67    81,56   174,01     0,01    9,94    9,39   10,43   4,27   0,59
sda               7,89    19,77    0,70    0,75    38,46    81,26   164,85     0,01    5,80    5,93    5,67   3,04   0,44

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb            2483,80  7573,40   51,20   95,00 10038,40 30672,80   556,92     1,58   10,80   20,92    5,35   4,52  66,08
sdc            2443,80  7578,80   42,60  105,40  9943,20 30736,00   549,72     1,20    8,14   18,69    3,88   3,72  55,04
sda            2461,60  7527,20   47,60  102,60 10137,60 30518,40   541,36     1,15    7,66   17,39    3,15   3,57  53,60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb            2382,60  7384,20   50,00   95,60  9832,00 29744,00   543,63     1,94   13,35   29,10    5,11   4,57  66,48
sdc            2413,20  7416,40   46,60  106,20  9940,80 30063,20   523,61     1,26    8,28   18,03    4,00   3,58  54,64
sda            2419,60  7342,80   49,60  106,20  9876,80 29768,80   508,93     1,16    7,42   16,65    3,12   3,44  53,52

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb            2607,80  7957,00   54,60  107,60 10548,80 32383,20   529,37     1,76   10,86   23,75    4,31   3,87  62,80
sdc            2586,00  8013,60   52,80  111,80 10555,20 32506,40   523,23     1,35    8,18   17,32    3,86   3,49  57,44
sda            2598,00  8008,40   53,60  115,40 10556,00 32475,20   509,24     1,27    7,47   16,46    3,29   3,34  56,48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb            2428,00  7481,00   57,20  108,40 10041,60 30409,60   488,54     1,73   10,44   22,08    4,30   3,91  64,80
sdc            2456,60  7494,60   52,20  118,20  9936,00 30476,80   474,33     1,34    7,80   17,93    3,32   3,26  55,60
sda            2425,20  7460,20   58,20  106,80  9883,20 30318,40   487,29     1,56    9,42   19,15    4,12   3,49  57,52

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb            2407,40  7340,20   52,40   99,40  9737,60 29757,60   520,36     1,95   12,81   27,30    5,17   4,43  67,28
sdc            2357,00  7323,80   50,20  103,80  9728,00 29709,60   512,18     1,39    9,14   18,76    4,49   3,60  55,44
sda            2409,00  7328,00   48,40  110,40  9828,80 29752,80   498,51     1,11    7,00   16,78    2,71   3,20  50,88

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb            2464,00  7550,00   47,00   91,00 10044,80 30460,80   587,04     1,48   10,50   20,66    5,26   4,62  63,76
sdc            2485,40  7532,00   48,80   91,80 10136,80 30493,60   577,96     1,42   10,09   19,20    5,25   4,31  60,56
sda            2470,20  7542,00   39,40  102,20 10140,00 30575,20   575,07     1,02    7,25   17,14    3,44   3,68  52,16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb            1483,20  4507,00   29,80   54,60  6152,80 18347,30   580,57     1,56   18,88   36,59    9,22   6,94  58,56
sdc            1433,40  4542,20   26,20   61,00  5838,40 18412,10   556,20     0,77    8,86   18,96    4,52   3,90  34,00
sda            1417,40  4441,20   28,60   59,80  5784,00 18002,50   538,16     0,69    7,76   16,92    3,37   3,58  31,68

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               1,00     2,80    1,00    2,00     8,00    15,60    15,73     0,06   21,60   18,40   23,20   8,53   2,56
sdc               2,80     1,00    0,60    2,40    13,60    10,00    15,73     0,06   19,73   24,00   18,67  10,13   3,04
sda               0,00     3,80    0,40    2,60     1,60    22,00    15,73     0,01    4,00    0,00    4,62   4,00   1,20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sdc               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
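
To compare the devices without eyeballing every sample, the await
columns can be averaged per device. This is only a rough sketch: the
function name avg_await is made up, and it assumes the 14-column
'iostat -dx' layout shown above (await, r_await, w_await in columns
10-12, %util in column 14), which other sysstat versions may not use.
Samples with a low %util are skipped, so the first report (averages
since boot) and the idle samples at the end do not skew the result.

```sh
# avg_await [MIN_UTIL]: read `iostat -dx` output on stdin and print, per
# device, the number of samples used and the mean await, r_await and w_await.
# Samples with %util below MIN_UTIL (default 10) are ignored.
# Assumption: 14-column layout as in this post; adjust the column numbers
# if your iostat prints different columns.
avg_await() {
    LC_ALL=C awk -v min_util="${1:-10}" '
        # Data rows have 14 columns in this layout; skip the header rows.
        NF == 14 && $1 !~ /^Device/ {
            gsub(/,/, ".")                  # decimal commas -> dots
            if ($14 + 0 < min_util + 0) next
            n[$1]++
            aw[$1] += $10; ra[$1] += $11; wa[$1] += $12
        }
        END {
            for (d in n)
                printf "%s %d %.2f %.2f %.2f\n", d, n[d], aw[d] / n[d], ra[d] / n[d], wa[d] / n[d]
        }
    ' | sort
}
```

It could be used like this while the dd test is running (12 samples,
5 seconds apart):
iostat -dx /dev/sd[a-c] 5 12 | avg_await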

sdb is clearly slower than the other two devices, and it's the only one
with a SMART issue:

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1

So it might be a good idea to replace it.
But I don't think this is enough to cause sheep to hang.
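
As far as I know, attribute 199 (UDMA_CRC_Error_Count) counts transfer
errors on the link between the disk and the controller, so a non-zero
raw value often points at the cable or connector rather than the disk
surface; what matters most is whether it keeps growing. A minimal
sketch to pull the raw value out of 'smartctl -A' output so it can be
logged over time. The helper name smart_raw is hypothetical, and it
assumes the usual 10-column attribute table with the raw value last.

```sh
# smart_raw ID: read `smartctl -A` output on stdin and print the raw value
# (10th column) of the attribute with the given ID.
# Assumption: the standard ATA attribute table layout of smartctl.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}
```

For example: smartctl -A /dev/sdb | smart_raw 199
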
Also note that the same write test is much faster without oflag=direct:
sync; dd if=/dev/zero of=deleteme3 bs=4M count=$((2048/4))
266 MB/s
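
The gap between the two results is not surprising: without
oflag=direct, dd writes into the page cache and can report a rate
before the data has reached the disks. A middle ground, assuming GNU
dd, is conv=fdatasync, which makes dd flush the file before it prints
the rate. The sketch below only builds the three command lines so
they can be compared side by side; the helper name dd_cmd and its mode
names are made up for illustration.

```sh
# dd_cmd MODE OUTFILE [SIZE_MIB] [BS_MIB]: print (do not run) a dd write test.
# MODE is one of: direct (bypass the page cache), synced (flush before
# reporting, GNU dd), buffered (plain page-cache write).
dd_cmd() {
    mode=$1; out=$2; size_mib=${3:-2048}; bs_mib=${4:-4}
    count=$((size_mib / bs_mib))        # e.g. 2048 / 4 = 512 blocks
    case "$mode" in
        direct)   extra="oflag=direct" ;;
        synced)   extra="conv=fdatasync" ;;
        buffered) extra="" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "dd if=/dev/zero of=$out bs=${bs_mib}M count=$count${extra:+ $extra}"
}
```

For example, 'dd_cmd synced deleteme4' prints a command that writes
the same 2048 MiB as the tests above but includes the flush in the
measured time.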

I'm also testing the 2T disk (sdd):
fsck.ext4 -c /dev/sdd1  (with the '-c' option, fsck also runs badblocks).

You can see the disk is fully busy but the 'await' is very low, so I
consider the device to be healthy.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0,00     0,00  480,00    0,00 122880,00     0,00   512,00     0,96    2,02    2,02    0,00   2,01  96,48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0,00     0,00  482,80    0,00 123596,80     0,00   512,00     0,96    2,00    2,00    0,00   2,00  96,40
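
As a sanity check on these numbers: avgrq-sz is reported in 512-byte
sectors, so r/s * avgrq-sz / 2 should give rkB/s, and it does for both
samples. That fits large sequential reads, and await being almost
equal to svctm suggests the requests are barely queuing. A small
sketch of the arithmetic (the helper name kb_per_s is made up; it
expects dots as the decimal separator):

```sh
# kb_per_s REQUESTS_PER_S AVG_REQUEST_SECTORS: throughput in kB/s implied by
# the request rate and the average request size (512-byte sectors).
kb_per_s() {
    LC_ALL=C awk -v iops="$1" -v sectors="$2" \
        'BEGIN { printf "%.2f\n", iops * sectors / 2 }'
}
```

'kb_per_s 480.00 512.00' gives 122880.00 and 'kb_per_s 482.80 512.00'
gives 123596.80, matching the rkB/s column.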

As a last resort, I need to test the network.
I'll report back a.s.a.p.

Thank you.

PS: running atop and looking at 'avio' (average time per I/O request) may give
an idea of a slow-responding disk.


