Ordered Sets and Logs in Cassandra vs SQL

I’ve written before that Cassandra’s Achilles’ heel is devops:

Storage, redundancy and performance are expanded by adding more nodes. This can happen during normal business hours as long as consistency parameters are met. Same applies to node replacements.

As the number of servers grows, be prepared to hire a devops army or look for a managed solution. The DataStax offering helps, but it is still not enough. Even in the cloud we have not found a good managed solution: Cassandra.io requires you to give up Thrift and CQL, and Instaclustr, as of this moment, does not use third-generation SSD-backed instance types.

Since I have about two dozen Cassandra nodes to upgrade, plus the application to regression-test, I am not really looking forward to that exercise. So one of the things I am working on is finding out whether we actually need Cassandra at all.

Having eliminated all relational-style queries (which are not a use case for Cassandra anyway) by moving some of the data to PostgreSQL (which, by the way, in some cases outperforms Cassandra by at least a factor of two on larger data sets [1]), I am now looking at some of the more obscure data structures that may be perfect for Cassandra but do not map easily onto SQL.

Frequently Updated Ordered Sets of Tuples

In Cassandra, there is a very convenient mechanism for maintaining an ordered set. You create a column family for a collection of ordered sets, with the row key being the set identifier and the column name being the name of an element in the set. Since columns are sorted and column names are unique within a row, you get an ordered set. In Cassandra, a column family holding ordered sets of tuples looks like this:

OrderedSetsCF   ->
    SetA    ->  (X=1),(Y=2),(Z=3)
    SetB    ->  (Y=10),(Z=4)

In SQL, such a data structure would look like a table where the set identifier and the set element name are parts of the primary key, and the value of the element is a third column. In this approach the set identifier is denormalized and repeated for each set element. For large sets this can become a very big table, especially if you are using UUIDs for primary keys (a common technique with Cassandra):


SetID(PK)   ElementName(PK) ElementValue
SetA        X               1
SetA        Y               2
SetA        Z               3
SetB        Y               10
SetB        Z               4
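As a sketch, the corresponding schema might look like the following (table and column names are illustrative, and SQLite stands in for any RDBMS):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ordered_sets (
    set_id        TEXT NOT NULL,
    element_name  TEXT NOT NULL,
    element_value INTEGER,
    -- composite key keeps elements unique within a set, and the index keeps them ordered
    PRIMARY KEY (set_id, element_name)
);
""")
rows = [("SetA", "X", 1), ("SetA", "Y", 2), ("SetA", "Z", 3),
        ("SetB", "Y", 10), ("SetB", "Z", 4)]
conn.executemany("INSERT INTO ordered_sets VALUES (?, ?, ?)", rows)

# Reading a set back in element order mirrors Cassandra's sorted columns:
set_a = conn.execute(
    "SELECT element_name, element_value FROM ordered_sets "
    "WHERE set_id = 'SetA' ORDER BY element_name").fetchall()
print(set_a)  # [('X', 1), ('Y', 2), ('Z', 3)]
```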

Note that you can mitigate the problem of primary-key denormalization when your keys are much bigger than the universe of your data actually requires (i.e. UUIDs): create a separate lookup table that maps each UUID to an integer code, and then use the integer code for the primary keys in the sets table.
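A minimal sketch of that lookup-table idea, with hypothetical table names and SQLite standing in for the RDBMS:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE set_ids (
    code     INTEGER PRIMARY KEY,     -- small surrogate key used everywhere else
    set_uuid TEXT UNIQUE NOT NULL     -- the big external identifier, stored once
);
CREATE TABLE ordered_sets (
    set_code      INTEGER NOT NULL REFERENCES set_ids(code),
    element_name  TEXT NOT NULL,
    element_value INTEGER,
    PRIMARY KEY (set_code, element_name)
);
""")

def code_for(u):
    """Return the integer code for a set UUID, allocating one if needed."""
    row = conn.execute("SELECT code FROM set_ids WHERE set_uuid = ?", (u,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO set_ids (set_uuid) VALUES (?)", (u,))
    return cur.lastrowid

u = str(uuid.uuid4())
conn.execute("INSERT INTO ordered_sets VALUES (?, 'X', 1)", (code_for(u),))
# The 36-character UUID is stored once; each element row carries only a small integer.
```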

Setting the value of a tuple in Cassandra is an O(1) operation, because Cassandra writes into a commitlog and a memtable and returns immediately. If you are writing into your set very rapidly, the values may not be readable until considerably later, especially if you have a high replication factor and your reads do not always hit the same node as your writes. Eventually, over some period of time, the data becomes consistent across the cluster.

Since Cassandra is not ACID, a column does not need to be inserted before you can update it; effectively, in Cassandra a create is the same as an update. No SQL database provides a standard UPSERT operation (SQL:2003 MERGE comes closest, and vendors offer assorted non-standard variants), and there is a lot of academic debate on why.

In SQL, an insert into such a table is roughly an O(log(n)) operation (depending on the type of index used), where n is the number of (SetID, ElementName) combinations. As the number and size of such sets grows, and as the size of the keys grows, inserts become increasingly slower. Depending on the rate of your data ingestion this may or may not be much of an issue, considering that the upside of using SQL is that your data is immediately readable and you don’t have an eventual-consistency problem [2].

Since there is no UPSERT in SQL, and even if there were one it would require a read-before-write pattern, updating a value incurs a 2*O(log(n)) performance penalty: one index traversal to look up the existing row, and a second one to update it, depending on how you do this and what your RDBMS query planner does. Again, this is the price you pay for ACID.
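The read-before-write pattern described above can be sketched like this, against a hypothetical table (each statement is its own indexed lookup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ordered_sets (set_id TEXT, element_name TEXT, "
             "element_value INTEGER, PRIMARY KEY (set_id, element_name))")
conn.execute("INSERT INTO ordered_sets VALUES ('SetA', 'X', 1)")

def set_value(set_id, name, value):
    # Read-before-write: first O(log n) index lookup to see whether the row exists...
    exists = conn.execute(
        "SELECT 1 FROM ordered_sets WHERE set_id = ? AND element_name = ?",
        (set_id, name)).fetchone()
    if exists:
        # ...then a second O(log n) traversal to update it in place.
        conn.execute("UPDATE ordered_sets SET element_value = ? "
                     "WHERE set_id = ? AND element_name = ?", (value, set_id, name))
    else:
        conn.execute("INSERT INTO ordered_sets VALUES (?, ?, ?)", (set_id, name, value))

set_value("SetA", "X", 4)   # update path: two index traversals
set_value("SetA", "W", 9)   # insert path
```

In Cassandra the equivalent is a single blind write; the emulation above is exactly the extra work SQL does to stay ACID.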

Updates and deletes complicate the situation for both Cassandra and an SQL database, but differently. Cassandra uses append-only SSTable files, meaning that all writes are O(1). However, in the background it must perform compaction, and until that happens you may use more disk space than you actually need, and your reads can become orders of magnitude slower because they have to skip over obsolete versions and tombstones.

To illustrate the penalty, let’s pretend that value X in SetA above has been updated three times and value Y was deleted. Until compaction happens, what is actually stored on disk is this:

OrderedSetsCF   ->
    SetA    ->  (X=1),(Y=2),(Z=3)
            ->  (X=4) // supersedes (X=1)
            ->  (X=5) // supersedes (X=4)
            ->  (X=6) // supersedes (X=5)
            ->  {Y}   // tombstone for (Y=2)

Your set of three elements now takes up four extra units of storage until compaction happens. Furthermore, to get to the latest value of X, and to tell that Y has been deleted, Cassandra has to skip over the obsolete versions and the tombstone. So what could have been an O(1)+O(log(n)) operation (first get to SetA via the row-key hash, then get to X via the column index) becomes O(1)+O(log(n))+O(m), where n is the number of columns (elements in your set) and m is the number of times you updated X. If you just did a few hundred thousand updates to the same value over a relatively short period of time, you just created a serious bottleneck for your read operations.
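A toy model of this read path (not Cassandra’s actual storage format, just the shape of the cost): reads scan an append-only log newest-first, paying O(m) for every superseded write or tombstone along the way.

```python
# Toy append-only store: every write is appended; deletes append a tombstone.
log = []

def write(key, value):
    log.append((key, value, False))

def delete(key):
    log.append((key, None, True))   # tombstone marker

def read(key):
    skipped = 0
    # Scan newest-first; every entry we pass over before finding our key is wasted work.
    for k, v, is_tombstone in reversed(log):
        if k == key:
            return (None if is_tombstone else v), skipped
        skipped += 1
    return None, skipped

write("X", 1); write("Y", 2); write("Z", 3)
write("X", 4); write("X", 5); write("X", 6)   # three overwrites of X
delete("Y")

value, skipped = read("X")   # latest X, after skipping the tombstone for Y
```

Note how reading Z is now the worst case: the scan passes over the Y tombstone and all three obsolete versions of X before reaching it.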

A Cassandra row compaction itself requires additional storage and significant I/O capacity. Without getting into the intricacies of the mechanism, consider that Cassandra SSTables are immutable once written. So, to compact the SetA row from the example above, Cassandra scans the row, skips over the obsolete versions and tombstones, and writes the result out into a new SSTable file. Once that file is written, the original fragmented SSTables are deleted. As a result, Cassandra may temporarily double the storage it requires. Furthermore, you have little control over when compaction actually happens.
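In the same toy model, compaction rewrites the log keeping only the newest version of each key and dropping deleted keys; since the old copy can only be discarded after the new one is fully written, peak storage is old plus new.

```python
def compact(log):
    """Rewrite an append-only log, keeping only the newest entry per key."""
    latest = {}
    for key, value, is_tombstone in log:   # later entries overwrite earlier ones
        latest[key] = (key, value, is_tombstone)
    # Write a fresh "SSTable"; the fragmented original is only deleted afterwards,
    # which is why compaction temporarily needs space for both copies.
    return [entry for entry in latest.values() if not entry[2]]

fragmented = [("X", 1, False), ("Y", 2, False), ("Z", 3, False),
              ("X", 4, False), ("X", 5, False), ("X", 6, False),
              ("Y", None, True)]
compacted = compact(fragmented)
print(compacted)  # [('X', 6, False), ('Z', 3, False)]
```

Seven stored entries collapse back down to the two live values, and the Y tombstone disappears along with the value it shadowed.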

Now, SQL databases are not immune to the problem of frequently updated data. For example, PostgreSQL’s VACUUM operation does something very similar to what Cassandra does with compactions: it reclaims the space held by dead row versions. There is really no escaping that problem. SQL databases like PostgreSQL may give you better control over storage reclamation than Cassandra, though, and because of the differences between VACUUM and compaction you are not incurring the tombstone penalty on reads.

The reason one would want to use SQL for this type of structure is simple: a requirement to perform real-time analytics over the data. It is as simple as that. Are you accumulating rapidly changing data to process it later (in which case Cassandra or DynamoDB is very appropriate), or are you accumulating the same data in order to gain meaningful insights from it in real time (in which case SQL is more appropriate)?

Log-style Structures

Let’s suppose you want to store events from a multitude of remote sensors. These could be temperature sensors in different locations, market data by ticker symbol, or clicks in apps across thousands or millions of users. Suppose you only want to retain this data for 2 days.

Each event is identified by its source, time (in milliseconds), event type, event id, and some event value.

In a Cassandra column family, one would store it like this: source is your row key, time+eventType+eventId is your composite column name, and eventValue is your column value. Each column has a TTL of 2 days (expressed in seconds). It would look like this:

EventsCF    ->
    SourceA -> (0:TypeA:1, X, ttl=172800), (1:TypeA:2,Y, ttl=172800), (1:TypeB:3,X, ttl=172800)
    SourceB -> (0:TypeB:1, X, ttl=172800), (1:TypeA:2,Y, ttl=172800)

and so on and so forth. In SQL, it would look like this:


source(PK)  time(PK)    eventType(PK)   eventId(PK) eventValue
SourceA     0           TypeA           1           X   
SourceA     1           TypeA           2           Y   
SourceA     1           TypeB           3           X

and so on. You’d need a batch job to run regularly and delete rows older than 2 days.
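A sketch of both pieces, the table and the retention job, assuming SQLite and millisecond timestamps (names are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    source      TEXT NOT NULL,
    time_ms     INTEGER NOT NULL,    -- event time in milliseconds
    event_type  TEXT NOT NULL,
    event_id    INTEGER NOT NULL,
    event_value TEXT,
    PRIMARY KEY (source, time_ms, event_type, event_id)
)""")

TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000

def purge_expired(now_ms):
    """The batch job: delete everything older than the 2-day retention window."""
    conn.execute("DELETE FROM events WHERE time_ms < ?", (now_ms - TWO_DAYS_MS,))

now = int(time.time() * 1000)
conn.execute("INSERT INTO events VALUES ('SourceA', ?, 'TypeA', 1, 'X')", (now,))
conn.execute("INSERT INTO events VALUES ('SourceA', ?, 'TypeA', 2, 'Y')",
             (now - 3 * 24 * 60 * 60 * 1000,))   # three days old: past retention
purge_expired(now)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

In practice you would also index time_ms on its own (or partition the table by day), since the primary key above leads with source and the delete would otherwise scan the whole table.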

In Cassandra, all writes are O(1). In SQL, all writes are O(log(n)), where n is the number of PK combinations. As the size of your log grows, so will the time it takes to insert a row. One could mitigate this in SQL by dropping the PK and indexes, but then querying this table would become nearly impossible.

Cassandra has a concept of a TTL on values, meaning that they logically disappear when the TTL is up. However, that does not mean the disk space is reclaimed. This too suffers from the compaction problem, and until compaction happens this data structure may consume an enormous amount of disk space. Suppose you accumulate 1 million log entries per day per source. Five days later, unless compaction has happened, you are actually storing three more days of data than you require.

Retrieval of this data out of Cassandra becomes a bit tricky. If you naively assume that by reading the entire row you are only reading the last two days’ worth of data, you are wrong. Until compaction happens, Cassandra has to scan over the expired columns and tombstones, in this example three days’ worth of them! Even if you optimize a bit and use a slice query starting at two days ago, the best you will get out of Cassandra is O(log(n)), where n is the total number of log entries you made in the last five days (until compaction happens).

The disk storage problem is further exacerbated here. Since data with an expired TTL won’t actually get deleted until compaction happens, and compaction itself may temporarily double the disk storage requirement, you need to make sure you leave extra space on each node. Furthermore, this type of structure in Cassandra may create an imbalance in the cluster if the amount of data varies a lot between sources.

Cassandra is a clear winner here from the performance perspective if the goal is to collect an immense amount of data, especially if that data never expires. However, in a cloud environment like AWS I’d use Amazon’s facilities such as DynamoDB, EMR, or Redshift. Cassandra, as it grows, does become a devops nightmare: over time you may end up with dozens, or hundreds, of nodes if you never expire or delete data.


So what am I really getting at here? Well, Cassandra really is a devops nightmare. I know I am going to stir up some debate on Twitter with what I just said. I’d love nothing more than to stop using it. However, it continues to be a useful tool for some of the use cases I deal with, and for all its flaws I have not found a better option yet. As I keep saying, all I want is a Cassandra that is managed SaaS, like DynamoDB, where I don’t have to worry about devops.

  1. Yes, I know I need to provide a benchmark. In this post I wanted to spark a conversation, and if I find the time I’ll post a benchmark.
  2. This is negated by the use of read replicas, which may experience lag.
