[sheepdog-users] Puzzled about bad performances

Wed May 20 13:23:44 CEST 2015

Hi Hitoshi, thanks for your quick reply.

Thanks to your help, I found the culprit: it is definitely the object cache.
Here are my latest settings, which give me far better results:

/usr/sbin/sheep -i host=172.16.0.101 port=7002 -y 10.1.0.101 --pidfile /var/run/sheep.pid /var/lib/sheepdog /var/lib/sheepdog/disc0,/var/lib/sheepdog/disc1,/var/lib/sheepdog/disc2

* -n: gives no performance boost (I would even say that it gives me worse results)
* Object cache is what gave me horrible results; I disabled it and now it's flying!
* The directio directive gives even worse results in combination with the object cache (I fell to a mere 9MB/sec on cold start with no cache!)
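
For reference, the options dropped relative to the earlier "ugly" command line can be listed with a small shell sketch. Both command lines are copied from this thread; the variable and function names are only illustrative.

```shell
#!/bin/sh
# Command lines copied from this thread; names below are illustrative only.
before='/usr/sbin/sheep -n -w size=26G dir=/mnt/metasheep -i host=172.16.0.101 port=7002 -y 10.1.0.101 --pidfile /var/run/sheep.pid /var/lib/sheepdog /var/lib/sheepdog/disc0,/var/lib/sheepdog/disc1,/var/lib/sheepdog/disc2'
after='/usr/sbin/sheep -i host=172.16.0.101 port=7002 -y 10.1.0.101 --pidfile /var/run/sheep.pid /var/lib/sheepdog /var/lib/sheepdog/disc0,/var/lib/sheepdog/disc1,/var/lib/sheepdog/disc2'

# Print every word of $1 that does not appear as a whole word in $2.
removed_words() {
    out=''
    for word in $1; do
        case " $2 " in
            *" $word "*) ;;              # still present in the new command line
            *) out="$out $word" ;;       # dropped from the new command line
        esac
    done
    printf '%s\n' "${out# }"
}

removed_words "$before" "$after"
# prints: -n -w size=26G dir=/mnt/metasheep
```

In other words, the only changes are the removal of -n and of the object cache option (-w); everything else is identical.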

Right now, at cold VM boot, I get 450MB/sec with hdparm; after a few passes, it caps at around 700MB/sec, which is more than enough for my needs on a per-VM basis ;)
Even on the user experience side, everything is much, much smoother now; in fact, it is just as I expected it to be.

Dropping the cache between each hdparm pass didn't change anything.
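
To compare passes numerically rather than by eye, the buffered-read figure can be pulled out of each hdparm report. This is a minimal sketch, assuming the output format of the hdparm runs quoted further down in this thread; the function name buffered_mbps is hypothetical.

```shell
# Extract the MB/sec figure from every "Timing buffered disk reads" line on stdin.
buffered_mbps() {
    awk '/Timing buffered disk reads/ { print $(NF-1) }'
}

# Sample lines copied from the hdparm runs quoted in this thread.
sample=' Timing cached reads: 13762 MB in 2.00 seconds = 6886.51 MB/sec
 Timing buffered disk reads: 90 MB in 3.03 seconds = 29.73 MB/sec
 Timing buffered disk reads: 1580 MB in 3.00 seconds = 526.62 MB/sec'

printf '%s\n' "$sample" | buffered_mbps
```

On a real VM the function would be fed directly, e.g. `hdparm -t /dev/sda | buffered_mbps` (root required).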

Now my question: why is the object cache such a pain? I expected it to provide a huge performance boost and I get exactly the opposite.
That's not a real problem for me since I didn't invest in the SSDs specifically for that (in fact, the node's OS is installed on the SSD so that I can dedicate my SAS disks to my storage cluster), but it puzzles me that performance is that bad with this feature on.
It might be a point worth investigating.

Anyway, many thanks for your help.

Best regards,

Walid

----- Original Message -----
From: "Hitoshi Mitake" <mitake.hitoshi at gmail.com>
To: "Walid Moghrabi" <walid.moghrabi at lezard-visuel.com>
Cc: "sheepdog-users" <sheepdog-users at lists.wpkg.org>
Sent: Wednesday, May 20, 2015 12:35:44
Subject: Re: [sheepdog-users] Puzzled about bad performances

At Wed, 20 May 2015 11:54:51 +0200 (CEST),
Walid Moghrabi wrote:
> 
> Hi everyone, 
> 
> I finished installing my new Sheepdog cluster based on 0.9.2 and I'm really intrigued by the bad performance I get. Maybe someone can help me tweak my settings to get better results, because I really don't see why this is so bad compared to the other clusters I have.
> I'm sorry this post is long, but I have no other way to explain clearly what's going on. Apologies for the inconvenience, and thank you if you take the time to go through all of it.
> 
> First, here are 3 different configurations; my latest is the 3rd one and the one that is puzzling me:
> 
> 
> First: "The Good" ... this one sits in between "the bad" and "the ugly" ... This is a 3-node cluster with 1 SSD drive holding Sheepdog's object cache and journal, and 1 SATA drive where one LV is dedicated to Sheepdog.
> The cluster uses a single interface for both cluster communication and data replication (1Gbps)
> * Running Sheepdog : 0.9.1 
> * Cluster format : -c 2 
> * Command line : /usr/sbin/sheep -n -w size=100G dir=/mnt/metasheep --pidfile /var/run/sheep.pid /var/lib/sheepdog -j dir=/var/lib/sheepdog/journal size=1024M 
> * Mount point : 
> /dev/vg0/sheepdog on /var/lib/sheepdog type xfs (rw,noatime,nodiratime,attr2,delaylog,noquota) <====== this is the SATA drive 
> /dev/sdb1 on /mnt/metasheep type xfs (rw,noatime,nodiratime,attr2,delaylog,noquota) <====== this is the SSD drive 
> /var/lib/sheepdog/journal --> /mnt/metasheep/journal <====== this is a symlink on a folder on the SSD drive 
> * hdparm -tT /dev/sda (SATA drive) : 
> /dev/sda: 
> Timing cached reads: 25080 MB in 2.00 seconds = 12552.03 MB/sec 
> Timing buffered disk reads: 514 MB in 3.01 seconds = 170.95 MB/sec 
> * hdparm -tT /dev/sdb (SSD drive) : 
> /dev/sdb: 
> Timing cached reads: 19100 MB in 2.00 seconds = 9556.21 MB/sec 
> Timing buffered disk reads: 2324 MB in 3.00 seconds = 774.12 MB/sec 
> 
> 
> Second: "The Bad" ... this one is very simple: a 2-node cluster with a 250GB partition on a SATA2 drive dedicated to Sheepdog.
> The cluster uses a single interface for both cluster communication and data replication (1Gbps)
> * Running Sheepdog : 0.9.2_rc0 
> * Cluster format : -c 2 
> * Command line : /usr/sbin/sheep -n --pidfile /var/run/sheep.pid /var/lib/sheepdog/ 
> * Mount point : 
> /dev/sdb1 on /var/lib/sheepdog type xfs (rw,noatime,nodiratime,attr2,delaylog,noquota) 
> * hdparm -tT /dev/sdb : 
> /dev/sdb: 
> Timing cached reads: 16488 MB in 2.00 seconds = 8251.57 MB/sec 
> Timing buffered disk reads: 424 MB in 3.01 seconds = 140.85 MB/sec 
> 
> 
> Third: "The Ugly" ... this is my latest attempt and it should normally be my most powerful setup, but I get pretty bad results, which I'll detail later on ... for now, here is the configuration.
> This is an 8-node cluster with dedicated interfaces for cluster communication (eth0) and data replication (eth1), both 1Gbps, with jumbo frames enabled on eth1 (MTU 9000).
> Each node has 1 SSD drive dedicated to Sheepdog's object cache and 3 dedicated 600GB SAS 15k drives for Sheepdog's storage. The journal is not enabled, on your recommendation (and to be honest, I had a bad journal crash during my test runs).
> Drives are handled individually in MD mode and the cluster uses erasure coding.
> * Running Sheepdog : 0.9.2_rc0 
> * Cluster format : -c 4:2 
> * Command line : /usr/sbin/sheep -n -w size=26G dir=/mnt/metasheep -i host=172.16.0.101 port=7002 -y 10.1.0.101 --pidfile /var/run/sheep.pid /var/lib/sheepdog /var/lib/sheepdog/disc0,/var/lib/sheepdog/disc1,/var/lib/sheepdog/disc2 
> * Mount point : 
> /dev/pve/metasheep on /mnt/metasheep type ext4 (rw,noatime,errors=remount-ro,barrier=0,nobh,data=writeback) <===== this is a LV on the SSD drive 
> /dev/sdb on /var/lib/sheepdog/disc0 type xfs (rw,noatime,nodiratime,attr2,delaylog,logbufs=8,logbsize=256k,noquota) <===== this is a SAS 15k drive 
> /dev/sdc on /var/lib/sheepdog/disc1 type xfs (rw,noatime,nodiratime,attr2,delaylog,logbufs=8,logbsize=256k,noquota) <===== this is a SAS 15k drive 
> /dev/sdd on /var/lib/sheepdog/disc2 type xfs (rw,noatime,nodiratime,attr2,delaylog,logbufs=8,logbsize=256k,noquota) <===== this is a SAS 15k drive 
> * hdparm -tT /dev/sd{b,c,d} (SAS drive) : 
> /dev/sd{b,c,d}: 
> Timing cached reads: 14966 MB in 2.00 seconds = 7490.33 MB/sec 
> Timing buffered disk reads: 588 MB in 3.00 seconds = 195.96 MB/sec 
> * hdparm -tT /dev/sde (SSD drive) : 
> /dev/sde: 
> Timing cached reads: 14790 MB in 2.00 seconds = 7402.49 MB/sec 
> Timing buffered disk reads: 806 MB in 3.00 seconds = 268.28 MB/sec 
> 
> 
> So, as you can see, we have 3 pretty different configurations, and "the ugly" should give the best results, but to be honest, that is not the case.
> I currently have very few virtual machines on this last cluster; every VM I tested is based on the same setup (Debian Wheezy based, virtio-scsi drive with discard and writeback cache enabled).
> I have a "virtual desktop" VM with a few tools like Eclipse, Firefox, LibreOffice, ... so big pieces of software that need decent I/O to run smoothly.
> On "the ugly", on cold start, Eclipse takes quite long to start, with heavy I/O and CPU usage, and so do Firefox and the others.
> Close the apps and re-run them a few times, and each new start is a bit faster and smoother ... to me, that looks like the object cache on the SSD drive is doing its job, but you need to keep the VM on the same node and keep it running for a while before getting any benefit from this mechanism. And this cache is pretty small (26GB in this case), so with many VMs, this won't help much for the time being.
> On the other configurations, everything is much smoother from a cold start.
> 
> I know hdparm is not a good benchmark, but I ran 3 passes on each of my configurations, within the virtual machine; here are the results:
> 
> "The Good" : 
> /dev/sda: 
> Timing cached reads: 20526 MB in 2.00 seconds = 10275.86 MB/sec 
> Timing buffered disk reads: 660 MB in 3.00 seconds = 219.85 MB/sec 
> /dev/sda: 
> Timing cached reads: 20148 MB in 2.00 seconds = 10087.56 MB/sec 
> Timing buffered disk reads: 2600 MB in 3.00 seconds = 865.85 MB/sec 
> /dev/sda: 
> Timing cached reads: 20860 MB in 2.00 seconds = 10443.10 MB/sec 
> Timing buffered disk reads: 3148 MB in 3.18 seconds = 989.06 MB/sec 
> 
> Conclusion: We can see that results get better with each pass (certainly due to the object cache mechanism), but the first pass still gives good results (even better than the underlying physical drive).
> 
> 
> "The Bad" : 
> /dev/sda: 
> Timing cached reads: 13796 MB in 2.00 seconds = 6906.54 MB/sec 
> Timing buffered disk reads: 1084 MB in 3.01 seconds = 360.02 MB/sec 
> /dev/sda: 
> Timing cached reads: 13258 MB in 2.00 seconds = 6639.34 MB/sec 
> Timing buffered disk reads: 1218 MB in 3.01 seconds = 405.30 MB/sec 
> /dev/sda: 
> Timing cached reads: 12852 MB in 2.00 seconds = 6433.39 MB/sec 
> Timing buffered disk reads: 1306 MB in 3.00 seconds = 435.15 MB/sec 
> 
> Conclusion: This is astonishing! Simple replication, bad disks, no object cache ... worst-case scenario and ... extremely good performance ... I don't understand, this can't be!
> 
> 
> "The Ugly" : 
> /dev/sda: 
> Timing cached reads: 13762 MB in 2.00 seconds = 6886.51 MB/sec 
> Timing buffered disk reads: 90 MB in 3.03 seconds = 29.73 MB/sec 
> /dev/sda: 
> Timing cached reads: 13828 MB in 2.00 seconds = 6919.47 MB/sec 
> Timing buffered disk reads: 308 MB in 3.00 seconds = 102.58 MB/sec 
> /dev/sda: 
> Timing cached reads: 14202 MB in 2.00 seconds = 7106.75 MB/sec 
> Timing buffered disk reads: 1580 MB in 3.00 seconds = 526.62 MB/sec 
> 
> Conclusion: Cold start with an empty cache is just horrible ... 30MB/sec!!! After a few runs, I guess the object cache on the SSD is doing its job: I'm up to more than 500MB/sec after 3 runs and I cap at about 750MB/sec after a few more runs, but the first pass is horrible and guess what ... I get these bad results as soon as I migrate the VM to another node, since the cache is empty then ...
> 
> 
> It really puzzles me ... I should get extremely good results on this setup and it is in fact the worst!
> What did I do wrong?
> How can I trace what's going on?
> What can I try?
> 
> I really need help because I can't put this in production with results that are far worse than what I got in a quick and dirty lab test on 2 mainstream PCs :(
> 
> Thanks in advance for any help you can provide. 

Hi Walid,

The slides below contain a performance evaluation (from page 25):
http://sheepdog.github.io/sheepdog/_static/sheepdog_COJ14.pdf
# of course the hardware is quite different from your deployment, but it should be informative

In the evaluation, 4:2 erasure-coded VDIs provided better performance
than 3-replica VDIs only in the case of large writes. Could you try
3-way replication with the ugly deployment? And could you test the
ugly deployment without object cache?
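
As background for that comparison, the raw-capacity cost of the layouts mentioned in this thread can be worked out with simple arithmetic. This sketch assumes that "-c x:y" means x data strips plus y parity strips per object; the function name overhead is hypothetical, and the numbers say nothing about throughput, only about space.

```shell
# Raw bytes stored per byte of user data.
#   replication with N copies      -> N
#   erasure coding with x:y strips -> (x + y) / x   (assumed meaning of "-c x:y")
overhead() {
    case "$1" in
        *:*) awk -v d="${1%%:*}" -v p="${1##*:}" 'BEGIN { printf "%.2f\n", (d + p) / d }' ;;
        *)   awk -v n="$1" 'BEGIN { printf "%.2f\n", n }' ;;
    esac
}

# Policies taken from the cluster formats discussed in this thread.
for policy in 2 3 4:2; do
    printf '%s -> %sx\n' "-c $policy" "$(overhead "$policy")"
done
```

Under that assumption, 4:2 stores 1.5 bytes per user byte, against 2 and 3 bytes for two and three full copies.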

In addition, I need to check one point: did you drop the page cache in
your VM between hdparm runs?
You need to drop the cache if you want to evaluate the contribution of
the object cache, because the page cache in the VM can also contribute
to read performance.

You can drop page cache with:
$ echo 3 > /proc/sys/vm/drop_caches
# http://linux-mm.org/Drop_Caches
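
Combined with hdparm, a cold-read test inside the VM could look like the sketch below. It assumes root privileges; cold_read_pass and the DROP_CACHES and HDPARM override variables are hypothetical names, added only so the function can be dry-run without touching a real system.

```shell
# Paths/commands are overridable so the function can be dry-run safely.
: "${DROP_CACHES:=/proc/sys/vm/drop_caches}"
: "${HDPARM:=hdparm}"

# Run N buffered-read passes on a device, dropping the guest page cache before each.
cold_read_pass() {
    dev=$1
    passes=${2:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        sync                          # flush dirty pages first
        echo 3 > "$DROP_CACHES"       # drop page cache, dentries and inodes
        "$HDPARM" -t "$dev"           # buffered disk read timing only
        i=$((i + 1))
    done
}

# Usage (as root, inside the VM): cold_read_pass /dev/sda 3
```

If the later passes are still much faster than the first one with the guest page cache dropped each time, the speed-up comes from somewhere below the VM, such as the object cache.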

Thanks,
Hitoshi