If you didn’t read Choosing Cassandra consistency levels, I recommend to do it. It’s the most practical advice I could give to choose the most proper Cassandra consistency level for your particular use cases.
But I feel something is missing there. Some users want to be sure that every change they make is propagated to all nodes, and the queries they run are completely accurate.
Let me say, if this is your case you shouldn’t be using Cassandra. It’s not what it was designed for. The way Cassandra writes changes locally, and the way it propagates writes, are designed to maximise the thoughput and latency at the expense of diminished accuracy. I would argue that Cassandra is not even eventually consistent, because certain events prevent it from eventually having the same data in all nodes – even if you completely stop writing.
Changing the Consistency Level
That said, Cassandra allows to theoretically achieve the maximum accuracy – the keyword here being theoretically, and I can explain the reason in a separate article if there is enough interest. If you want to do it for a subset of your data and queries, it probably makes sense. Actually Cassandra allows to define Consistency Levels (CL) with great granularity:
- By default, all writes and reads use
- You can change the default reads CL for the current session.
- You can change the default writes CL for the current session.
- You can change the CL at query level.
Check your driver’s documentation to find out how to change the CL.
Consistency Levels patterns for high accuracy
Your data have a Replication Factor (RF). It can be set globally, or at keyspace level.
For our convenience, let’s define a new term: Involved Nodes (IN). This number is the number of nodes involved in a read or in a write, and depends on the Consistency Level and, sometimes, on the Replication Factor. If the CS is
ALL, IN is equal to the number of nodes in the cluster. If the CS is
QUORUM, IN is the lowest odd integer equal or higher than the RF.
Assuming that you want every relevant write to be seen by every read, which is not necessarily the case, you simply follow both these rules:
in(reads) + in(writes) > rf AND in(writes) >= ceiling(quorum / 2)
With this formula in mind, the patterns we can choose are the following, ordered from the strictest to the loosest:
- Write Everywhere (WE pattern);
- Double Quorum (2Q pattern);
- Read Everywhere (RE pattern).
ONE for writes and
ALL for reads.
- Crashes don’t delay committed writes propagation to any node that could be queries.
- If any node crashes, writes will fail.
ALLfails applications could retry the same write with
QUORUM, and maybe even a more relaxed CS if even
- In other words, using
ALLreduces Cassandra availability for writer applications; with the above trick you can increase availability, but then you give up accuracy in case of a crash.
- Higher latency and lower throughput for writes.
This pattern is generally suitable if there are more reads than writes.
QUORUM for both reads and writes.
- Crashes don’t delay committed writes propagation to the majority of nodes, and any query will involve at least one node that owns the most recent data.
- If the majority of nodes crashes, reads and writes will fail.
QUORUMfails, applications can retry with a more relaxed CL.
THREEstill guarantee some redundancy. You can do it with reads, writes or both, depending on which tradeoffs are acceptable for your use case.
- Higher but usually reasonable latency for reads and writes.
This pattern is suitable if latency and throughput are relevant but not critical for both reads and writes.
ALL CS for writes and
ONE for reads.
- This pattern violates the formula above, because there is only one IN for writes. A node loss may permanently cancel committed writes. A node crash may delay committed writes until the node restarts.
- If any node crashes, reads will fail.
ALLfails, applications may retry the same read with
QUORUM, and maybe even a more relaxed CS if even
- This means that, in case of crashes, both reads and writes give up some degree of accuracy.
- Higher latency and lower throughput for reads.
This pattern is generally more suitable if there are sensibly more writes than reads.
Paranoia won’t help
Remember the formula above: to get theoretical high accuracy, the number of IN for reads plus the number of IN for writes needs to be higher than the RF. This happens in the case of WE and RE (IN=RF+1) and 2Q (IN=RF or RF+1).
You can have IN=RF*2 by using
ALL for both reads and writes. But this won’t achieve a higher accuracy. Nor will it achieve a higher availability – actually I would say that if the application can’t tolerate a single node crash the availability is very low.
Cassandra was not designed to guarantee a high level of accuracy, but a theoretical high level of accuracy can be achieved for particular use cases. To do that, we choose a consistency level for writes and a consistency level for reads. Here we discussed the pro’s and con’s of the three patterns that I call Write Everywhere, Double Quorum, and Read Everywhere.
If you have more ideas or more considerations concerning query accuracy and performance, please drop a comment. Comments contribute to make this website great, and they are very welcome.