Choosing Cassandra consistency levels

A spot the difference quiz
With certain consistency levels, Cassandra nodes read the same data
and then play “spot the difference”

It is very important to understand Cassandra consistency levels. In this article I’ve put some quick notes for persons who are new to Cassandra, and some good practice that in my opinion it’s better to follow.

Changing the consistency level

The default consistency level is ONE, for both writes and reads.

We cannot set a different default. Not even at session level.

We can set the consistency level for each query we run. We need to do that at driver level – check the documentation of your driver to find out how to do it. Note that the CONSISTENCY statement is not CQL: it only works when manually entered in cqlsh.

More quick facts

  • Reads and writes support different consistency levels. Most of them however are shared.
  • Consistency levels behavior depends on your replication factor. If you have multiple datacentres (even if they are just logical datacentres) draw a map of your keyspaces redundancy.
  • Read Cassandra documentation carefully. You really need to understand consistency levels to avoid slow queries and failures.

Consistency levels speed

Some general facts to keep in mind:

  • Stricter consistency levels produce more accurate results but they are slower (obviously).
  • Queries that return many rows are more affected by strict consistency levels.
  • Slow or badly configured networks increase the cost of strict consistency levels.

Consistency levels accuracy

  • Consistency levels based on a quorum require a consensus from the majority of nodes. The majority of two nodes is two nodes. If you use quorums, you need an odd number of nodes. If you use local quorums (majority of nodes in a specific datacentre), you need an odd number of nodes in each datacentre.
  • Most of the times you don’t want to use local consistency levels. Cassandra is smart enough to contact nodes in the same cluster, when possible. But if it’s not possible, you have two choices: contact nodes in other datacentres, or fail. Inter-datacentre traffic can be slow, but it’s the price to pay to avoid application failures. Compare the two costs (failures and slowdowns), and opt for the lowest.

Performance and accuracy tradeoff

Ideally, one would always prefer to have the most accurate result. But this is not how Cassandra works. You can achieve that result, but this implies communication between nodes and datacentres. Generally, you do that only for some specific queries – or not at all. If we think that this is a problem for your use case, we should ask ourselves why you are not using a relational DBMS.

The default isolation level, ONE, is usually fine. The node(s) that receives the query will contact the node that owns that relevant rows, and will return the results to the client. But, other than that, there is no network traffic.

If we are considering a stricter isolation level for a certain query, weshould ask yourself the following questions:

  • Does Cassandra actually return data that is not enough accurate for our use case? If there is no such problem, we should not change the consistency level.
  • How often are Cassandra nodes unavailable, and for how much time? You should have this information and a SLO. Together, they answer the question if a strict consistency level is too risky.
  • If we are considering to change the consistency level for many queries, or queries that are executed often, or queries that return big results, or queries on secondary indexes or ALLOW FILTERING, we should do a benchmark. The result of this test and the SLO determine if the change is fine.
An electric telegraph
This electric telegraph reminds us that network communication can be slow

More specific advice

The previous tips are, in my opinion, general rules that we should always follow. However I also have practical advice for more specific cases.

Abort or retry?

Old days MS-DOS most famous message: "Abort, Retry, Fail?"

Application developers, after choosing a consistency level for a given query, should ask themselves: what should happen if the consistency level cannot be satisfied because some servers are down?

Should the query simply fail, and an error message be returned to the users, asking them to come back later? Fine.

But you can also retry the query with a more relaxed consistency level. The idea behind this is quite simple: ideally we’d like the query to be executed with, say, QUORUM; but if that’s not possible, instead of failing, we can run it with TWO or even ONE.

Sound like a good idea? Good, but these retries should not create a too high traffic between the client and Cassandra. Consider “remembering” that a certain consistency level cannot be satisfied, and don’t retry it until a certain timeout has expired. Something similar to the following pseudo-code:

set last_fail = '01-01-1970 00:00:00'

function run_query:
    if last_fail < (now() - timeout):
        set consistency = 'QUORUM'
        try query
        if query fails:
            set last_fail = now()
    else:
        set consistency = 'ONE'
        try query
        if query fails:
            return error to user

When QUORUM is too demanding

To decrease the chances that certain consistency levels cannot be satisfied because not enough nodes are running, we can add new nodes. With two caveats:

  1. The ALL consistency level will become harder to satisfy, not easier. It’s better to avoid it in this situation.
  2. The QUORUM and LOCAL_QUORUM consistency levels require that the majority of nodes is running, so adding more nodes could increase the number of nodes that must run for certain queries to be satisfied.

A solution is to avoid ALL and replace QUORUM with other consistency levels. For example, suppose you have three nodes, and some queries use QUORUM. Two nodes need to be running. But if we add one node, QUORUM will require three nodes to be running. So in this case, it is better to replace QUORUM with TWO.

Load balancing via logical datacentres

Suppose you have a limited number of queries returning many rows. Maybe they it’s OK if they are consuming resources, but you don’t want them to slow down other queries, to avoid that the application used by customers slows down.

You can achieve this by adding nodes in a different datacentre. But this implies two problems: you need to have a second datacentre but maybe you don’t, and you will have traffic between different datacentres.

The solution is to use a logical datacentre. If you use GossipingPropertyFileSnitch, you can set the datacentre’s name by editing the cassandra-rackdc.properties file. Even if all nodes run in the same physical datacentre, you can tell Cassandra that they run on different datacentres. These are logical datacentres.

Toodle pip,
Federico

Image sources: Wikimedia Commons, Wikipedia, Wikimedia Commons

Leave a Reply

Your email address will not be published. Required fields are marked *