Thursday, February 26, 2009

Managing Auto Increments with Multi Masters

MySQL has system variables auto_increment_increment and auto_increment_offset for managing auto increment 'sequences' in multi master environment. Using these variables, it is possible to set up a multi master replication, where auto increment sequences in each master node interleave, and no conflicts should happen in the cluster. No matter which master(s) get the INSERTs.

Logically auto increment sequence is a shared resource, which would require distributed locking to deal with. However, auto increment sequence interleaving circumvents the need to lock, it sort of splits the auto increment sequence to several node specific sequences, making it "not a shared resource" anymore.

auto_increment_increment and auto_increment_offset have been implemented as session variables, as opposed to being global. We felt at first a bit uncomfortable with this, as there is obvious risk of misconfiguration resulting in conflicts. Apparently, the idea is to let user make separation with tables which are shared in the cluster and tables which are local to each node. If some dedicated session(s) touch only local tables, they can have session specific increment and offset values set to 1. And for sessions needing access to cluster wide shared tables, proper cluster-aware auto increment settings should be used.

These auto increment controlling variables are suitable for our Galera replication model as well. We however, wanted to go one step further and prevent any possibility of auto increment conflicts in the cluster. Galera runs on top of group communication system, which has real time view of cluster memberships. It is therefore possible to adjust increment and offset variables on the fly, triggered by any changes in cluster configuration. We implemented cluster view handler, which the group communication calls whenever somebody joins or leaves the group. The handler code is passed the number of members in the group and group ID of the processing node. These translate nicely to increment and offset values, cluster size will be the auto_increment_increment and auto_increment_offset can be calculated from the node ID.

We wanted to play even more safe to avoid any possibilities for conflicts, and therefore decided to restrict user's access to auto increment variables to read only. And, in order to be transparent, we also added one more variable: wsrep_auto_increment_control to define if the automatic auto increment controlling is enabled or not. With auto increment controlling enabled, cluster will take fully care of setting the increment and offset variables and this will guarantee no conflicts. If auto increment controlling is disabled, system will behave as default MySQL, and user must specify increment and offset globally or by session to suitable values.

Our implementation is based to MySQL 5.1.30, which happens to suffer from a known problem with slave side applying: This bug is so severe, that we backported the fix for this issue. So current wsrep integration code is actually 5.1.30 + 41986 patch + wsrep related changes.

The auto increment controlling among other new features are present in next Galera Demo-2 Release. The release is under testing and will be released as soon as our paranoid QA manager is done with all his remaining manoeuvres.

Wednesday, February 4, 2009

Replicating Locking Sessions

We were running Drupal benchmarks to measure the performance of Drupal/Galera cluster and were surprised to find locking sessions (LOCK TABLES...UNLOCK) in the SQL profile. Locking sessions were originally left out of Galera supported feature set, but now we need to re-consider our policy a bit. Apparently, we are going to encounter more applications, which were originally written for MYISAM usage, but were later migrated to INNODB. As a rule of thumb, it seems that if application can be configured to both MyISAM and INNODB usage, it quite probably uses locking sessions as well.

Eager Replication
We have in the past, implemented one pretty effective method for replicating locking sessions in synchronous cluster. This, "eager replication" method, used transaction sequencing from group communication level to order the table locks. However, the implementation required eventually complete re-write of thr locking (thr_lock.c) and this effort was sure not a joy ride. thr_lock.c module contains adult only content.

We are now looking for a more light weight way to support locking sessions, instead. Galera is a replication system for transaction processing applications and anything we implement beyond that will be a hack (or add-on feature, to translate it to sales-talk). Locking sessions require up-front locking of resources and that makes them complicated to synchronize.

Managing Read Locks
First observation is that read and write locks can be treated differently in Galera cluster. Read locks do not replicate anything, because they are pure read only sessions by definition. Therefore we can leave them processing un-interrupted in local state. Slave applier must acknowledge read locking sessions and wait for them to complete. If application has lengthy read locking sessions, it will obviously delay cluster processing. But that is more or less application's problem, and it could be helped with a bit of re-design in application side.

Converting Locking Sessions to Transactions
Transactions and locking sessions have different semantics, but some applications might, nevertheless, work well with (write) locking sessions replaced by transactions. We can implement this quite simply in parsing level and no application changes are needed for this. For the application, this change in processing, means that "locking sessions" could be aborted due to deadlocks. If deadlocks will happen with application's work load and there is no comprehensive exception management in the application code, then this approach is not viable anymore.

We will add new MySQL option variable for defining if (write) locking sessions should be converted to transactions, and try how Drupal benchmarks works with this method.
Un-interrupted read locks and optional write lock session to transaction conversion will be the first attempts in supporting locking sessions. Hopefully, we don't need to go any further in this path. Otherwise, we are quite close to eager replication model again.