Thursday, April 7, 2011

Scaling-out OLTP load on Amazon EC2 revisited.

It's long been known that Galera optimistic replication and enterprise-size databases are a match made in heaven. Today we're going to get a little closer to testing this statement. We'll have a look at how Galera can scale out a Sysbench OLTP complex 60-million-row workload in EC2. This is the first proper benchmark for the 0.8 series and also the first benchmark of the MariaDB/Galera port, so I'll start modest, just to see how it goes. I chose m1.large instances with 7.8Gb of RAM for the cluster nodes and a c1.xlarge instance for the client, since I don't want the client to be a bottleneck.

For comparison I have also measured the performance of a stock standalone MariaDB 5.1.55 server. I used the standard my.cnf that comes with the MariaDB Debian package, with the following alterations:

max_connections=1024
innodb_buffer_pool_size=6G
innodb_log_file_size=512M

The Galera nodes additionally had:

innodb_flush_log_at_trx_commit=0
innodb_locks_unsafe_for_binlog=1
wsrep_slave_threads=16

That's right. One of the purposes of this study is to see how synchronous replication can serve as an alternative to on-disk durability.

The EC2 environment is known for its inconsistent performance, so in order to get reliable figures each measurement (a 20-minute sysbench run) was performed 3 times with an interval of ~2 hours between runs. Overall the main part of the benchmark ran non-stop for over 30 hours. Surprisingly, consistency was very good: most transaction rate scores had standard deviations well below 1%, and latency figures below 2%. (This essentially means that every shape you see on the charts is statistically significant.) The most inconsistent measurements came from the standalone MariaDB server, and in a short while we'll discuss possible causes of this.
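Conceptually the measurement loop was as simple as the sketch below. This is just an illustration: the host name, credentials and exact sysbench 0.4 options here are my assumptions, not the literal command line used.

# One measurement point: a 20-minute sysbench OLTP run at a fixed number of
# client threads (64 here), repeated 3 times roughly 2 hours apart so that
# EC2 performance jitter averages out.
for run in 1 2 3; do
    sysbench --test=oltp --oltp-test-mode=complex \
             --oltp-table-size=60000000 \
             --mysql-host=node1 --mysql-user=sbtest --mysql-password=sbtest \
             --num-threads=64 --max-time=1200 --max-requests=0 \
             run > oltp_64thr_run${run}.log
    sleep 7200   # leave ~2 hours between measurements
done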

60 million sysbench rows take more than 14Gb (UPDATE: in the course of benchmarking it grew to 20Gb) of table space, and with a 6Gb buffer pool this makes the benchmark quite IO-bound. I suppose this is a fairly realistic setup. My initial guess was that it should be favorable for Galera, as any heavy load should be, but IO-bound is not the same as CPU-bound, so I was not certain how exactly it would affect, for example, the block lookups done by the slave threads.
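For reference, the 60-million-row table itself is created with a one-off sysbench prepare call along these lines (again sysbench 0.4 syntax; the table engine and credentials are assumptions on my part):

# Create the 60M-row sbtest table once, before any benchmark runs.
sysbench --test=oltp --oltp-table-size=60000000 \
         --mysql-table-engine=innodb \
         --mysql-host=node1 --mysql-user=sbtest --mysql-password=sbtest \
         prepare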

Ok, enough with the chatter, here are the charts:


Three things (besides striking performance and scalability) are readily noticeable on that chart:

  1. Stock MariaDB performance is rather poor. A 2-node MariaDB/Galera cluster more than doubles it, and a 4-node cluster triples it. The standalone server also does not display the characteristic performance peak at a certain number of connections. Instead its throughput seems to be growing slowly but steadily with the number of threads, meaning that the context switch penalty is negligible in this case. Something else is severely limiting its performance.

    The answer seems to be that the load is really IO-bound and the need to flush the log after each transaction commit does not help it in any way. Well, that's the price one has to pay for not using synchronous replication when one can't afford to lose any transactions. (Yeah, I know, rather than implement synchronous replication, people have learned to live with losing data.)
  2. The unsaturated part of the 5-node curve lies below the 4-node curve. This is most easily explained by growing replication overhead. Indeed, with 7-8 connections per node there are still plenty of CPU and IO resources. However, we get fewer direct connections per node (lower efficiency) plus a 25% increase in replication latency (even lower efficiency), which requires more connections to compensate. In fact, if we look at the 4-node curve, we can see that it almost dips under the 3-node curve at 32 connections.

  3. Scalability, which took off so wonderfully, somehow hits a ceiling at around 1100 trx/sec. What is it? Well, there are several suspects:

    1. Replication overhead has finally caught up with us. See above item.
    2. Slave workload has caught up with us (see this article about multi-master arithmetics). Indeed, the sysbench OLTP load has about 25% write queries, so 1/0.25 + 1 = 5 nodes, pretty close to what we have. However, first, applying RBR events is far cheaper than processing the queries in the first place, so this estimate is too pessimistic. Second, why is the ceiling so abrupt?
    3. The slave threads cannot handle all the replication traffic from 4 peer nodes (the slave thread pool is saturated). I've been using a pool of 16 slave threads, and while in a 2-node cluster that could be more than enough, in a 5-node cluster we have 4 times more slave traffic.

    Of course all three of the above reasons combined could (and no doubt will, at some cluster size) end the scalability. However, I did not expect it so soon and so firmly. Just look at those red, orange, green and magenta curves - don't they tell you that there is space for at least one more node?

Replication overhead, while well pronounced with a small number of threads per server, is clearly negligible when saturation is reached. This can easily be seen from the latency graph, where bigger clusters actually display lower latencies due to the distribution of the workload. To put it another way, in a saturated server transaction latency is dominated by the progressive deficit of internal resources rather than by the constant replication overhead.

Slave workload saturation can be ruled out by a simple observation of CPU usage on the nodes. In the 4- and 5-node configurations idle CPU time was considerably above 50% (in the 2-node case it was ~20%). The nodes clearly hadn't run out of CPU power; something else was holding back performance. Normally that would be IO waits.
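That observation requires no special tooling; something like vmstat on each node during a run makes the split between idle time and IO wait obvious:

# Sample CPU usage every 10 seconds during a run. A high "wa" (IO wait)
# column with plenty of "id" (idle) left means the node is starved of IO,
# not of CPU.
vmstat 10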

But IO waits come in two forms:

  1. IO waits caused by buffer pool misses, which are normally dealt with by increasing concurrency: while one thread is waiting for a block, another is performing useful work.
  2. IO waits of the "cannot perform more than this many IO operations per second" kind - and that is what we see in the case of the single MariaDB server.

So, which one are we dealing with here?

In addition to the low CPU usage, another interesting observation was made there. Galera 0.8 has a status variable, 'wsrep_flow_control_paused', which shows the fraction of time replication was paused waiting for other nodes. While I could not devise a method to chart it here, it was very easy to see that this fraction grows with the number of nodes, and in the 5-node cluster it is routinely around 60-70%. Essentially this means that slave writesets cannot be applied as fast as they arrive - but not for lack of CPU power!
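For those who want to watch this variable themselves, a plain SHOW STATUS loop is enough. Note that, if I read the semantics right, wsrep_flow_control_paused is computed since the last FLUSH STATUS, so flushing between samples gives per-interval readings; the credentials here are, of course, just placeholders.

# Print the fraction of time replication was paused by flow control,
# once a minute, resetting the counter after each sample.
while true; do
    mysql -uroot -e "SHOW STATUS LIKE 'wsrep_flow_control_paused'; FLUSH STATUS;"
    sleep 60
done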

Slave thread pool saturation seems like the most likely smoking gun here, so I'm running the same benchmark in 4-, 5-, and 8-node configurations with 64 slave threads. That should put the 5-node cluster on an equal footing with 2 nodes.

Another possible trick to try is out-of-order committing (OOOC): even with several concurrent slave appliers we can encounter a transaction that takes especially long to apply and thus holds all the other slave threads back from committing. In this case OOOC would allow the other slave threads to continue their work.
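Configuration-wise, both tweaks amount to a couple of lines in my.cnf. The slave pool size is the same wsrep_slave_threads setting as before; for OOOC I'm showing repl.commit_order, the provider option name used in current Galera documentation, and I can't vouch that the 0.8-era builds spelled it the same way, so treat this as a sketch:

wsrep_slave_threads=64
# out-of-order committing; option name as per current Galera docs
wsrep_provider_options="repl.commit_order=1"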

So I've rerun the benchmark with the aforementioned tricks - but on a different set of EC2 instances (only the first node, the one used for plain MariaDB testing, was consistently present in all setups), and that yielded somewhat incomparable results... which is in a sense good, because it explains some things.

First of all, overall throughput is considerably lower than in the initial benchmark (gray curves). Well, this could be explained by not all m1.large instances in EC2 being equal, and they certainly are not. That it is not a random fluke is corroborated by the amazing consistency that, say, the 4-node curves show in this benchmark.

And so, since m1.large instances can differ so much in performance, a natural explanation for the lack of scalability when going from 4 to 5 nodes in the initial benchmark is that the 5th node simply happened to be too weak to add any extra throughput.

Moreover, it looks like we happened to have a more even set of instances here, and we see a healthy 12% increase in throughput for a 25% increase in nodes. Scalability is still there! Notice that we're talking about scaling out the sysbench OLTP complex benchmark, which nobody else has ever dared to do. And, surprise, even 8 nodes show some increase in performance. It may be quite marginal, but it still does what it is supposed to do: more nodes can handle more clients. Isn't that what "scale-out" is about?

Now, the other thing is that apparently neither expanding the slave thread pool to 64 threads nor OOOC resulted in any throughput improvement.

The first part is easy to explain. From the first chart we see that a single node is pretty much saturated at 16 connections, and that is with client-server communication latency in play. Slave threads have no client-server communication, so they saturate the server even sooner. Adding more slave threads is therefore more likely to harm performance than to improve it.

OOOC, on the other hand, sets the existing threads totally loose, and with practically no collisions between the writesets they should be able to apply them in a virtually tight loop. Still, this doesn't happen. And this last observation forces the only possible conclusion: the m1.large instance simply cannot handle more IO operations.

This can be explained further: if the load is IO-bound, and if reads and writes on average cause the same number of buffer pool misses (on the master end this probably does not hold, since reads prefetch pages that might be used by writes, but on the slave end it most probably does), then we arrive at the same old formula that places the practical scalability limit at 4-5 nodes.

Conclusions and further work.


Well, m1.large instance IO was a serious disappointment. In fact, measuring 8Gb file creation time showed that /dev/sdb there is slower than my laptop's hard drive, which is kind of hard to justify (the check itself is sketched after the list below). Several things can be tried to improve it:

  1. Use EBS volumes for data and InnoDB log storage. They are said to be faster; however, this means extra expense and complexity. I was trying to show how you can have highly-available MySQL on EC2 without resorting to non-volatile storage. Seriously, how hard is it to make the local drive fast enough?
  2. Try higher-end instances (yeah, cluster compute!). Well, as soon as we get the budget for that I'll surely give it a try, although I have little doubt that the result will be nothing short of stellar.
  3. Try an xfs/ext4 file system on /mnt instead. ext3 is about as adequate a choice as FAT16 these days.
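For what it's worth, the file-creation check mentioned above is nothing more sophisticated than dd, something along these lines (the path and file name are arbitrary; the fdatasync flag is just there to keep the page cache from flattering the result):

# Crude sequential write test: create an 8Gb file on the ephemeral disk.
dd if=/dev/zero of=/mnt/ddtest bs=1M count=8192 conv=fdatasync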


Replication overhead. Yes, it is small, but we can see it, and I'd prefer that we couldn't. The problem here is that EC2 does not support multicast and probably never will, so no luck there. Some experiments with multicast, though, have suggested that even using UDP unicast could considerably improve latencies compared to TCP. This could be our next optimization project.

As for the rest, we saw great performance, great scalability and a good reason to use Galera replication instead of innodb_flush_log_at_trx_commit=1.

Downloads


  • MariaDB-wsrep port is available from lp:codership-maria.
  • Galera wsrep provider is available from lp:galera.

To be continued with master/slave and WAN benchmarks.

Original.

2 comments:

  1. For the single node test, did you also have sync_binlog=1 and binlog writing on? (Probably yes, since you say you used the default MariaDB/debian my.cnf.) In other words, was group commit working or not? Without group commit, possibly that is what is limiting single node performance. In that case it would be interesting to see results from a single node with sync_binlog=0. Alternatively, you should use Kristian's MariaDB versions where group commit is fixed. That seems to me like a more relevant benchmark. Just showing that innodb_flush_log_at_trx_commit=1 and sync_binlog=1 together have poor performance is not news.

    Another question: What did you do to sysbench to make it distribute the load over a cluster?

  2. Did you have binlog enabled for the benchmarks? Does Galera need the binlog enabled in order to work?

    By the way what does innodb_logs_unsafe_for_binlog=1 mean?

    Thanks.
