Wednesday, December 16, 2009

Better replication from drizzle to cassandra



Introduction
This article describes how one of the replication appliers work in rabbitreplication, namely the HashTransformer which transforms each INSERT/UPDATE into a hashmap which is then stored in a column-family based storage, currently Cassandra. For a better overview of RabbitReplication, go check out earlier posts on the subject here: http://developian.blogspot.com


Configuration
This example replicates changes done to a table called test1 in the schema called unittests. RabbitReplication is configured to only replicate the columns id and test (yes, good example, I know...). The column id is used as a key. The following slave configuration is used for this use case:


replication_role = hashslave


rabbitmq.host = 10.100.100.50
rabbitmq.queuename = ReplicationQueue
rabbitmq.exchangename = ReplicationExchange
rabbitmq.routingkey = ReplicationRoutingKey
rabbitmq.password =
rabbitmq.username =
rabbitmq.virtualhost =
hashstore.host = localhost:9160
hashstore.type = cassandra


hashreplicator.replicate.unittests.test1 = id,test
hashreplicator.key.unittests.test1 = id
hashreplicator.keycolseparator.unittests.test1 = .


The hashreplicator rows are the interresting ones, they describe what columns to replicate, what columns are the primary key and what separator to use between the columns when the key is multi column.


Example
Replicating an insert:
drizzle> use unittests
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A


Database changed
drizzle> desc test1;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| id      | int         | NO   | PRI | NULL    |       | 
| test    | varchar(10) | YES  |     | NULL    |       | 
| ignored | varchar(10) | YES  |     | NULL    |       | 
+---------+-------------+------+-----+---------+-------+
3 rows in set (0.02 sec)


drizzle> insert into test1 (id, test) values (300, "firstins");
Query OK, 1 row affected (0 sec)


Results in the following on the Cassandra side:
cassandra> get unittests.test1['300']
  (column=test, value=firstins; timestamp=1260985298391)
  (column=id, value=300; timestamp=1260985298385)
Returned 2 rows.


Update:
drizzle> update test1 set test = "updated" where id = 300;
Query OK, 1 row affected (0 sec)
Rows matched: 1  Changed: 1  Warnings: 0


Gives this in cassandra:
cassandra> get unittests.test1['300']
  (column=test, value=updated; timestamp=1260985526210)
  (column=id, value=300; timestamp=1260985298385)
Returned 2 rows.


Note that the timestamp for the id column is not updated (only changes are updated, not entire rows).


Delete:
drizzle> delete from test1 where id = 300;
Query OK, 1 row affected (0 sec)
And in Cassandra:
cassandra> get unittests.test1['300']
Returned 0 rows.


That is it, go to http://launchpad.net/rabbitreplication to check out the code, report bugs or suggest features!

No comments: