Cassandra: a key puzzle piece in a design for failure

Photo credit: Thomas Leuthard

When building out a data center in the cloud (AWS in particular), Cassandra can play a crucial role in the design for failure.

SQL and NoSQL databases have drastically different redundancy profiles:

A NoSQL database (and I hate the term NoSQL with the passion of a billion white-hot suns) trades off strict data consistency for something called partition tolerance. The layman’s description of partition tolerance is the ability to keep serving requests when the nodes holding your data, possibly spread across geographically distinct locations, lose contact with one another. A traditional single-master relational system does not give you that. A NoSQL system like Cassandra does not give you strong consistency by default. Pick your poison.

Neither Amazon Redshift, nor RDS, nor DynamoDB offers a convenient built-in mechanism for real-time cross-region replication. Cassandra, on the other hand, does. So the question now becomes: how do we deal with relational queries?

Well, write into both. The term NoSQL is atrocious and implies some sort of zero-sum game where if you use one you cannot use the other. As it turns out, a data store like Cassandra can offer real value in a disaster recovery scenario while a relational database remains the primary querying mechanism.

Both Amazon RDS and Redshift offer multiple availability zones, typically two. Amazon’s own postmortems suggest that the probability of Amazon losing an entire region is extremely low. However, as Hurricane Sandy has shown, it is certainly possible.

One approach to cross-region redundancy is as follows:

  1. Configure a Cassandra ring that spans both the Virginia and California regions. In your primary region you can have a bigger cluster than you do in your backup region.
  2. Configure RDS as you normally would in a multi-AZ configuration.
  3. Write your data into both RDS and Cassandra. Cassandra writes are very fast, so the performance impact of this extra work is minimal.
  4. Proactively build a mechanism to restore RDS out of Cassandra.
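
Step 3 above can be sketched roughly as follows. This is a minimal illustration, not a production design: `DualWriter`, the `orders` table, and the retry queue are all hypothetical names, `sqlite3` stands in for RDS, and a plain dict stands in for a Cassandra table so the example runs without any cluster. It also assumes that a failed backup write should be queued for replay rather than fail the request, which is one policy among several.

```python
import sqlite3
from typing import Callable, Dict, List

Record = Dict[str, object]


class DualWriter:
    """Write each record to the relational primary, then to the replicated store."""

    def __init__(self,
                 write_primary: Callable[[Record], None],
                 write_replica: Callable[[Record], None]) -> None:
        self.write_primary = write_primary
        self.write_replica = write_replica
        self.pending: List[Record] = []  # replica writes that still need a retry

    def write(self, record: Record) -> None:
        # The relational store stays the source of truth: if this raises,
        # nothing is sent to the replica.
        self.write_primary(record)
        try:
            self.write_replica(record)
        except Exception:
            # Assumption: a failed backup write is queued, not surfaced.
            self.pending.append(record)

    def retry_pending(self) -> int:
        """Replay queued replica writes and return how many succeeded."""
        still_pending: List[Record] = []
        done = 0
        for record in self.pending:
            try:
                self.write_replica(record)
                done += 1
            except Exception:
                still_pending.append(record)
        self.pending = still_pending
        return done


# Stand-ins: sqlite3 plays the role of RDS, a dict plays the role of a
# Cassandra table keyed by primary key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
replica: Dict[object, Record] = {}


def write_to_sql(record: Record) -> None:
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                   (record["id"], record["total"]))


def write_to_replica(record: Record) -> None:
    replica[record["id"]] = dict(record)


writer = DualWriter(write_to_sql, write_to_replica)
writer.write({"id": 1, "total": 9.99})
```

Writing to the relational store first keeps it authoritative; the `pending` list is there so the backup copy does not silently fall behind when the second write fails.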

So, in this setup you have a real-time data backup mechanism. You may go for years without your multi-AZ RDS failing, but should a catastrophe happen and the entire region be lost, you can recover your RDS in another region in a matter of hours.
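
The restore path might look something like the sketch below. It assumes the rows that survived in the backup region can be read out as dictionaries and replayed into a fresh relational database in batches. The `restore_orders` function, the `orders` schema, and the sample rows are hypothetical; `sqlite3` again stands in for RDS, and `INSERT OR REPLACE` is SQLite syntax that would need the equivalent upsert for your actual engine.

```python
import sqlite3
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def restore_orders(rows: Iterable[Row], conn: sqlite3.Connection,
                   batch_size: int = 500) -> int:
    """Replay rows into the relational store and return how many were written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, total REAL)")
    restored = 0
    for chunk in batches(rows, batch_size):
        with conn:  # one transaction per batch
            # Upsert so that re-running the restore is harmless.
            conn.executemany(
                "INSERT OR REPLACE INTO orders (id, total) VALUES (:id, :total)",
                chunk)
        restored += len(chunk)
    return restored


# Hypothetical rows read back from the surviving region.
surviving_rows = [
    {"id": 1, "total": 9.99},
    {"id": 2, "total": 25.0},
    {"id": 3, "total": 4.5},
]
fresh_db = sqlite3.connect(":memory:")
restored_count = restore_orders(surviving_rows, fresh_db, batch_size=2)
```

Making the restore idempotent means it can be rehearsed and re-run after a partial failure, which is the point of building it proactively rather than during the outage.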

As an added bonus, you now have a convenient home for data structures and access patterns that may either be inappropriate for your SQL database or put too much load on it. If that is your plan, the way I would set this up is to have at least one compute instance backed by fast SSDs per zone in your primary region and a single smaller instance in your backup region backed by a larger, slower EBS volume.
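
One way to express that lopsided layout is through per-datacenter replication factors on the keyspace. The helper below only builds the CQL text; it is a sketch under assumptions: `keyspace_cql` and the keyspace name `app_backup` are made up, the datacenter names `us-east` and `us-west` are placeholders that depend on how your snitch names the regions, and a factor of 3 assumes three zones with one node each in the primary region.

```python
from typing import Dict


def keyspace_cql(name: str, replicas_per_dc: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    for dc, rf in replicas_per_dc.items():
        if rf < 1:
            raise ValueError(f"replication factor for {dc} must be >= 1")
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Three copies in the primary region, a single copy in the backup region.
statement = keyspace_cql("app_backup", {"us-east": 3, "us-west": 1})
print(statement)
```

The resulting statement would be run once against the cluster, for example from `cqlsh`; each region then holds the number of copies given for its datacenter name.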

Should an emergency arise, this setup will allow you to build out a new compute cluster and recover your RDS from its real-time data in a matter of hours. In the meantime, you pay lower costs by not keeping a complete replica of your primary cluster.
