Friday, December 11, 2009

Galera Author Interviewed by Himself

We just made a major software release, but I still don't see journalists queuing outside our office. Looks like I have to do the hard work and interview myself. In the following, I'll give myself the rough reporter treatment:

So, what are we talking about?
MySQL/Galera release 0.7 - a synchronous multi-master clustering solution for InnoDB.

Downloads? Where?
Sure, e.g. here:

But, can't you ask any longer questions?

Oh, sorry, I assumed you geek people prefer not to talk in natural language. But what is this Galera thingie good for? For whom would you recommend this release?
Practically any InnoDB user can potentially benefit from MySQL/Galera. There are no unnecessary tweaks to MySQL's behavior, and odds are good that your application will notice no difference compared with vanilla MySQL.

If high availability is what you need, Galera provides that out of the box, thanks to synchronous replication. After a commit, the data is safe on every active cluster node, simple as that.

And if you need more performance, Galera can boost your data access considerably. Note that Galera scales even write-intensive workloads. However, hot spots are poison for this replication method: if the workload contains focused hot spots, the number of write-accepting masters should be reduced.

Is it good for production, anything to worry about?
We tested this during a focused test session after the 0.7pre release, and we are quite happy with the stability. Two issues were postponed to a future maintenance release. There is an obvious issue when running DDL and DML concurrently in the cluster; that should be avoided, if it ever was in your plans.

But no matter how much we test in the laboratory, for production use it is still essential to evaluate with the real application and a test load that closely simulates production use.

How stable is it, can I go in engine room and pull out cables wildly?
Yes! The 0.7 release was designed to be fault tolerant and can recover from most expected and unexpected situations. It even tolerates ad-hoc engine room visits.

Does it support innodb plugin?
This build is based on MySQL 5.1.39, and the InnoDB plugin is in there. We have enabled the InnoDB plugin in the build and also ran some compatibility tests with it. No issues surfaced, but our testing was quite minimal; e.g. no performance testing has been run with the plugin version.
MySQL/Galera will start by default with the builtin innobase engine. There is a configuration sample in the distribution showing how the InnoDB plugin can be loaded, if you want to play with it.
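A minimal sketch of what such a configuration looks like, following the standard MySQL 5.1 plugin-loading mechanism (the library file name varies by platform; check the sample shipped in the distribution for the exact form):

```ini
[mysqld]
# Skip the builtin innobase engine so the plugin can take over
ignore-builtin-innodb
# Load the InnoDB plugin (on Linux builds the library is ha_innodb_plugin.so)
plugin-load=innodb=ha_innodb_plugin.so
default-storage-engine=InnoDB
```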

Everybody is talking of this emerging MariaDB, any plans on supporting that?
Yes, plans and even actions. The MariaDB version will be available here:

Everybody is talking of PostgreSQL, any plans on supporting that?
PostgreSQL has been on our roadmap from the very beginning. However, reality bites, and in practice MySQL development has eaten all our resources so far. We plan to get PostgreSQL development rolling in the near future, but it sure would help if some experienced PostgreSQL partner joined in this development.

Is this a cry for help, or what?

So, what's next?
Next on the schedule is maintenance release 0.7.1, ETA before the end of the year. It will mostly address issues with running DDL and DML concurrently. In general, the maintenance release cycle will be kept as short as possible.
The next major release will be 0.8, which has features for considerably faster node join operations (currently we are limited by mysqldump speed...).
MariaDB porting will also continue with added effort. One more cup of coffee, and I will promise a MariaDB port in the December time frame.

Thanks!
You are welcome

And what was it?
It was a pleasure

Friday, May 15, 2009

MySQL/Galera Release 0.6

MySQL/Galera release 0.6 shipped out today.

MySQL/Galera is a synchronous multi-master clustering solution for the InnoDB storage engine, offering uncompromised performance and, thanks to its certification-based replication model, scalability even with write-intensive workloads.

We have tested MySQL/Galera 0.6 with a number of benchmarks. Here is a summary of a sysbench OLTP benchmark run on clusters of 1-4 nodes of Amazon EC2 large instances: sysbench results. Scalability is remarkable here, and many other benchmarks show similar performance gains.

The 0.6 release adds the following new features over the earlier Demo-2 release:

  • Merged with MySQL 5.1.33
  • Full DDL replication using "total order isolation" mode
  • Workaround for Drupal issue #282555. The fix simply retries the failed auto-increment insert query
  • ...and some bug fixes to go

MySQL/Galera 0.6 is a binary Linux release (both 32- and 64-bit builds available) and is available at: Codership Downloads. This release has passed a number of feature and performance tests, e.g. with Drupal benchmarks.

You can evaluate MySQL/Galera 0.6 with minimal effort. Just install and configure MySQL/Galera on each node in your cluster. Then start the group communication daemon and all MySQL servers. The MySQL/Galera cluster is functional at this point: you can load your data into one cluster node, and the data will replicate to the whole cluster. Then start your application and connect to any node(s). You can also put a load balancer in front to balance connections between nodes. We have good experience with Galera Load Balancer (glb: Codership Downloads), but in practice any TCP-level load balancer will do.

The next Galera release will be 0.7, and it is under R&D with a deadline at the end of June. The 0.7 release will be open sourced and is functionally quite complete, offering e.g. node join capabilities for the cluster. Galera is cooking nicely at the moment.

Wednesday, April 15, 2009

Clustering Drupal

We have been testing the MySQL/Galera cluster with various benchmarks, and one exercise in our test plan is to measure clustering performance at the web application level. We picked Drupal as our first target application and composed a cluster from identical Drupal instances. Each Drupal node has a local MySQL database, and we cluster the databases with the Galera synchronous multi-master replication system. As a result, the effects of HTTP requests hitting any Drupal node are synchronously replicated to the whole cluster.

Alex wrote a detailed article about the benchmarking session. I present here just an executive summary and go directly to the final results.

Test Platform
We tested the Drupal cluster on Amazon EC2 small instances. A small instance is not particularly suitable as a web platform due to long latencies, but we got our baseline figures from this setup and decided to run more high-end tests later on.

Running the test shows a strong imbalance between Apache/Drupal and MySQL CPU usage. MySQL consumes just 5-10% of the CPU, and the rest goes to Apache. Resource-wise, it would make sense to create a separate farm of web servers and have a small MySQL cluster serving the farm. However, our test configuration has some advantages as well:
  • It is easy to set up; each node is identical
  • A local MySQL gives faster responses to Drupal
  • It is possible to fall back to one node only

For the test session, we created clusters from 1-4 Drupal instances.

The Test
For testing, we used a JMeter test plan, which runs three thread groups:
  1. Posters - create new pages in the system
  2. Commenters - read pages and add comments to the stories
  3. Browsers - just keep on reading pages in the system
The JMeter HTTP load goes through the glb load balancer to the Drupal cluster. Each HTTP request can hit any cluster node.


Final results show quite linear scalability:

Nodes  Users  Request rate (req/min)  Latency (ms)  Error rate (%)
  1     40            129                 3950           0.07
  2     80            259                 3960           0.06
  3    120            387                 3700           0.05
  4    160            514                 3490           0.12

Scaling continues linearly up to four nodes; we did not try larger cluster sizes.

Near Future
Alex promised to continue with Drupal testing and run the tests with EC2 large instances to get reasonable latencies. Results from these experiments should appear in the near future.

We will also be presenting Galera clustering at the Percona Performance Conference and can provide ad-hoc demonstrations for anybody interested there.

Thursday, February 26, 2009

Managing Auto Increments with Multi Masters

MySQL has the system variables auto_increment_increment and auto_increment_offset for managing auto-increment 'sequences' in a multi-master environment. Using these variables, it is possible to set up multi-master replication where the auto-increment sequences in each master node interleave, so no conflicts should happen in the cluster, no matter which master(s) get the INSERTs.

Logically, an auto-increment sequence is a shared resource, which would require distributed locking to deal with. However, interleaving circumvents the need to lock: it effectively splits the auto-increment sequence into several node-specific sequences, making it "not a shared resource" anymore.
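As a small illustration with plain MySQL and two masters configured by hand (the settings below are the standard MySQL variables, set manually; in Galera they are managed automatically, as described further on):

```sql
-- Node 1 of a two-node cluster claims every odd id:
SET SESSION auto_increment_increment = 2;
SET SESSION auto_increment_offset    = 1;
-- INSERTs on this node generate ids 1, 3, 5, ...

-- Node 2 claims every even id:
SET SESSION auto_increment_increment = 2;
SET SESSION auto_increment_offset    = 2;
-- INSERTs on that node generate ids 2, 4, 6, ...
-- The two sequences interleave and can never collide.
```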

auto_increment_increment and auto_increment_offset have been implemented as session variables, as opposed to global ones. We felt a bit uncomfortable with this at first, as there is an obvious risk of misconfiguration resulting in conflicts. Apparently, the idea is to let the user separate tables that are shared in the cluster from tables that are local to each node. If some dedicated session(s) touch only local tables, they can have session-specific increment and offset values set to 1. Sessions needing access to cluster-wide shared tables should use proper cluster-aware auto-increment settings.

These auto-increment controlling variables suit our Galera replication model as well. We, however, wanted to go one step further and prevent any possibility of auto-increment conflicts in the cluster. Galera runs on top of a group communication system, which has a real-time view of cluster membership. It is therefore possible to adjust the increment and offset variables on the fly, triggered by any change in cluster configuration. We implemented a cluster view handler, which the group communication calls whenever somebody joins or leaves the group. The handler code is passed the number of members in the group and the group ID of the processing node. These translate nicely to increment and offset values: the cluster size becomes auto_increment_increment, and auto_increment_offset can be calculated from the node ID.
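Conceptually, after a third node joins a cluster, the view handler's effect on one of the nodes is equivalent to the following (a sketch only; the handler applies these values internally, and the exact mapping from node ID to offset is an implementation detail):

```sql
-- Conceptual effect of the cluster view handler on one node of a 3-node cluster:
SET GLOBAL auto_increment_increment = 3;  -- cluster size
SET GLOBAL auto_increment_offset    = 2;  -- derived from this node's ID
```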

We wanted to play it even safer to avoid any possibility of conflicts, and therefore decided to make the auto-increment variables read-only for the user. And, in order to be transparent, we also added one more variable, wsrep_auto_increment_control, to define whether automatic auto-increment control is enabled. With auto-increment control enabled, the cluster takes full care of setting the increment and offset variables, which guarantees no conflicts. If auto-increment control is disabled, the system behaves like default MySQL, and the user must set the increment and offset globally or per session to suitable values.

Our implementation is based on MySQL 5.1.30, which happens to suffer from a known problem with slave-side applying. This bug is so severe that we backported the fix for it. So the current wsrep integration code is actually 5.1.30 + the 41986 patch + wsrep-related changes.

Auto-increment control, among other new features, will be present in the next Galera Demo-2 release. The release is under testing and will be out as soon as our paranoid QA manager is done with all his remaining manoeuvres.

Wednesday, February 4, 2009

Replicating Locking Sessions

We were running Drupal benchmarks to measure the performance of a Drupal/Galera cluster and were surprised to find locking sessions (LOCK TABLES ... UNLOCK TABLES) in the SQL profile. Locking sessions were originally left out of the Galera-supported feature set, but now we need to reconsider our policy a bit. Apparently, we are going to encounter more applications that were originally written for MyISAM but later migrated to InnoDB. As a rule of thumb, it seems that if an application can be configured for both MyISAM and InnoDB, it quite probably uses locking sessions as well.

Eager Replication
We have in the past implemented one pretty effective method for replicating locking sessions in a synchronous cluster. This "eager replication" method used transaction sequencing from the group communication level to order the table locks. However, the implementation eventually required a complete rewrite of thr locking (thr_lock.c), and this effort was certainly not a joy ride. The thr_lock.c module contains adults-only content.

We are now looking for a more lightweight way to support locking sessions instead. Galera is a replication system for transaction-processing applications, and anything we implement beyond that will be a hack (or an add-on feature, to translate it into sales talk). Locking sessions require up-front locking of resources, and that makes them complicated to synchronize.

Managing Read Locks
The first observation is that read and write locks can be treated differently in a Galera cluster. Read locks do not replicate anything, because they are pure read-only sessions by definition. Therefore we can leave them processing uninterrupted in the local state. The slave applier must acknowledge read locking sessions and wait for them to complete. If an application has lengthy read locking sessions, it will obviously delay cluster processing. But that is more or less the application's problem, and it could be helped with a bit of redesign on the application side.

Converting Locking Sessions to Transactions
Transactions and locking sessions have different semantics, but some applications might nevertheless work well with (write) locking sessions replaced by transactions. We can implement this quite simply at the parsing level, and no application changes are needed. For the application, this change in processing means that "locking sessions" can be aborted due to deadlocks. If deadlocks happen with the application's workload and there is no comprehensive exception handling in the application code, then this approach is no longer viable.
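For illustration, a write locking session of this shape (the accounts table is a made-up example) would, under such a conversion, execute as an ordinary InnoDB transaction:

```sql
-- MyISAM-style locking session, as the application wrote it:
--   LOCK TABLES accounts WRITE;
--   UPDATE accounts SET balance = balance - 100 WHERE id = 1;
--   UPDATE accounts SET balance = balance + 100 WHERE id = 2;
--   UNLOCK TABLES;

-- What would effectively run after conversion at the parsing level:
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- unlike UNLOCK TABLES, this can now fail with a deadlock error
```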

We will add a new MySQL option variable to define whether (write) locking sessions should be converted to transactions, and see how the Drupal benchmarks work with this method.
Uninterrupted read locks and optional write-lock-to-transaction conversion will be our first attempts at supporting locking sessions. Hopefully, we don't need to go any further down this path; otherwise, we are quite close to the eager replication model again.

Wednesday, January 28, 2009

Experimenting with Write Set Replication

During the past year, we have been developing a write set replication system for the MySQL/InnoDB engine, called Galera. Our project has now reached a milestone where we can run benchmarks to get performance metrics and also publish releases for public evaluation. In this blog, I'll give a short introduction to Galera and related projects.

Some Technology
Galera is a generic multi-master synchronous replication system that replicates transaction write sets at commit time and resolves conflicts by comparing the keys in the write sets. Replication happens through a group communication system, which (among other tasks) defines the order of committing transactions in the cluster. The write sets can carry original SQL statements or, for best performance, row-based replication events, available in MySQL 5.1 and onwards.

The Galera replication method leaves the actual SQL statement processing to happen uninterrupted, quite close to the native MySQL way. This makes client interaction with the cluster fast, and to the application, a Galera cluster looks just like any native MySQL server. The only difference is commit processing, where a certain delay is caused by synchronization with the cluster.

Galera replication has been integrated with MySQL/InnoDB 5.1.30, providing a full-fledged multi-master MySQL database cluster. We call this first version the "demo release", and it is available for download on our website.

Some Benchmarking
We have benchmarked Galera with different benchmarks (sysbench, dbt2, DOTS, osdb, sqlgen) using different load profiles to find the constraints on the feasibility of Galera replication. Our observation is that a Galera cluster provides good performance and scalability even with write-intensive workloads.

Here is one summary obtained with the dbt2 benchmark (which resembles TPC-C), run in the Amazon EC2 environment. The graph shows how a 1-4 node Galera cluster compares against a pristine MySQL 5.1.30 server.
The dbt2 load contains hot spots and is not favorable for clustering. You can see the deadlock rate growing as more cluster nodes are added. However, total performance still improves even with 4 nodes.

One Roadmap
We just released the MySQL/Galera demo release. It should be stable enough for evaluation with real applications. You can download the demo release from here: Galera demo.

Our next task will be to implement all the missing features we plan to have in the beta release. The major task there is providing a way to bring a new node into the cluster; in essence, this means implementing DB snapshot transfer for joining nodes. We expect a feature-complete version during Q2 this year.

And All Those Projects
Galera communicates with the DBMS engine through an API we call the wsrep API (wsrep as in "write set replication"). We started one open source project just for defining this API and another for implementing the API integration in the MySQL/InnoDB engine.

Here's our current project list:
  • wsrep API defines the wsrep API itself.
  • mysql patches by Codership is the open source wsrep integration in the MySQL code base.
  • openrep will be an open source implementation of a wsrep API replication system. We just started working on this; no deliverables yet.
  • galera is a wsrep API implementation, optimized for best performance.
We have investigated the Postgres source code quite a bit and wish to start a "wsrep integration patches for Postgres" project as well. But we don't have enough hands and heads to go ahead with this plan in the near future. Technically, however, Postgres integration should be within easy reach.

This is the state of Galera development in a nutshell. Feel free to visit our website; there is plenty more information available for the interested reader.