Fixing performance issues is important. But it’s even more important to decide what a performance issue is. Startups often treat this as a simple problem: there is a performance issue when customers complain about website slowdowns, or when an employee notices that a certain page is too slow.
Well, some startups actually know that this is not a simple matter. They know that there are more complex ways to define and deal with performance problems, but they think that being simplistic just works for them, because they don’t have the resources to handle everything properly. They are usually right, or partly right, about the lack of resources. Unfortunately, they are wrong about the “just works” part.
This is not a guide on how to set good SLOs, but we’re going to discuss what an SLO is, why it’s so important for you, and give you a high-level understanding of how to do things right.
The cost of slowness
It’s interesting to search the web for the costs of a slow website. Many pages show random numbers that would barely make sense if compared with something else, but are not compared with anything: X seconds of slowness costs Y% of your earnings, as if the problem were that simple. This is just another sign that the general approach to performance problems is foggy and nebulous. I will not link any of those articles.
Let’s look at more credible sources. An article by The Social Media Monthly states that, if a page takes more than 3 seconds to load, 40% of users will abandon the site. A 2009 Google study shows that delaying the page load by 100ms to 400ms decreases searches by 0.2% to 0.6%. Amazon loses 1% of sales for every 100ms of delay. Facebook sees a 3% drop in traffic when pages slow down by 500ms.
This should give you an idea of the problem, but don’t trust these numbers too much. Depending on the services you sell and your customers’ expectations, your situation could be better or worse. If you are in the position to do so, ask your analysts to find the right numbers for your company. If your startup doesn’t have any analysts yet, devops should discuss how to find such numbers and add them to a dashboard. An approximation is much better than nothing.

In doing so, keep in mind that the numbers usually differ from page to page. A slow or unavailable sales page is usually worse than a slow or unavailable informative page; but an unavailable registration or contact page may cause some customers to leave you forever. The cost may also vary with other factors that are specific to your business. For example, some types of services are only critical near the end of each month or of the fiscal year, so customers are in a hurry in those periods.
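To make the reasoning concrete, here is a back-of-the-envelope sketch of such an approximation. The 40% abandonment figure quoted above is reused; every other number (visits, conversion rate, order value) is a hypothetical placeholder that your analysts should replace with real data.

```python
def estimated_daily_loss(daily_visits, conversion_rate, avg_order_value,
                         abandonment_rate):
    """Revenue lost per day because a share of would-be buyers
    abandon the site due to slow pages."""
    lost_conversions = daily_visits * conversion_rate * abandonment_rate
    return lost_conversions * avg_order_value

# Hypothetical shop: 10,000 visits/day, 2% convert, 50 EUR average order,
# 40% abandonment on pages slower than 3 seconds (the figure quoted above).
loss = estimated_daily_loss(10_000, 0.02, 50.0, 0.40)
print(f"Estimated loss: {loss:.0f} EUR/day")  # → Estimated loss: 4000 EUR/day
```

This is deliberately crude: a real model would use per-page numbers, since (as noted above) a slow sales page and a slow informative page do not cost the same.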
Other costs are much more difficult to quantify. But analysts could provide some hints about them – talking to those people is often extremely useful. Some of those costs are:
- Negative word of mouth. If people are angry with you, they could tell their friends, or complain on social media.
- Search engines. Google and its competitors value the speed of websites, and try not to show slow ones at the beginning of a SERP.
- Lost hiring opportunities. Candidates will not be positively impressed by a slow website. This may not be one of the main criteria, but it plays a role.
First of all, let’s define speed and availability in this context. Without this step, we cannot measure anything and our discussions will remain nebulous.
What is speed?
Users don’t care about the web server speed, the database speed, or even the overall speed of server-side processes. They care about the time that elapses between two instants: the moment their mouse produces the typical click noise (or their finger touches a link, or…), and the moment the page is on the screen, ready to be used. This is called latency. There are slightly different opinions on what ready to be used means, but we will not dig into that topic here.
It’s important to understand that measuring latency from the user’s point of view is entirely possible. It just requires some front-end script that triggers an HTTP call when the page is ready.
What is a downtime?
Downtime. Is it the situation where someone screams “The site is down!”, devops start to murmur scary things while staring at graphs and strange logs, a boss magically appears behind their shoulders, and every developer starts thinking about how to prove that their last commit is not the culprit of the imminent company collapse?
If you’re thinking “no”, you’re probably lying to yourself. But this will remain a secret between you and me!
Availability is something that can be measured, just like latency. But first, it is necessary to define what a downtime is.
- Any HTTP response to the browser whose status is not a success (2xx).
- Note that, if the response goes to another microservice and not to the browser, it may or may not have an effect on the final user.
- Still, each microservice should have its own availability and performance objectives. If your users’ actions are not logged properly, they will notice nothing, but the company will still suffer a loss.
- Lack of useful content. Note that not all content is equally important.
- If the user wants to see the product catalogue and it doesn’t show, this is almost as bad as a 500 HTTP error.
- If the users’ last login time doesn’t show, well, they probably won’t mind.
- Again: different microservices should have different objectives.
- Take retries into account. If the page doesn’t show at the first attempt but shows at the second or third, it is not a real downtime.
- Did you notice that I just lied? You should measure how many users retry and how many don’t in order to define what a downtime is.
- If you noticed the lie, you have the right mindset.
- Actually, you won’t consider the number of retries directly. You will consider time instead: define a minimum length for a downtime, for example 3 seconds or 1 minute.
- Take the number of users into account. Don’t consider something that affects a single user a downtime; that would go against math and logic. How many users must leave before the company suffers non-negligible damage? 10? 100? 5k? It really depends. Any answer that doesn’t involve analysts’ work – or, in the absence of analysts, at least 5 minutes of calculations – is probably too random to be useful.
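The rules above – a minimum duration and a minimum number of affected users – can be sketched in a few lines. This is a minimal illustration, not a monitoring product: the thresholds and the 5-second grouping gap are hypothetical, and a real system would read failures from your logs.

```python
def downtimes(failures, min_duration=3.0, min_users=10, gap=5.0):
    """failures: list of (timestamp_seconds, user_id) for failed page
    loads, sorted by timestamp. Nearby failures (within `gap` seconds)
    are grouped into incidents; an incident qualifies as a downtime
    only if it lasts at least `min_duration` seconds AND affects at
    least `min_users` distinct users. Returns (start, end) pairs."""
    incidents = []  # each item: [start, end, set_of_users]
    for ts, user in failures:
        if incidents and ts - incidents[-1][1] <= gap:
            incidents[-1][1] = ts
            incidents[-1][2].add(user)
        else:
            incidents.append([ts, ts, {user}])
    return [(start, end) for start, end, users in incidents
            if end - start >= min_duration and len(users) >= min_users]

burst = [(i * 0.4, f"user{i}") for i in range(12)]  # 12 users failing over ~4.4 s
isolated = [(100.0, "user99")]                      # one user, one failure
print(downtimes(burst + isolated))  # only the burst qualifies as a downtime
```

Note how the single-user failure is discarded by both rules at once: it is too short and affects too few users.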
After deciding what latency and downtime are, we can start to think about measuring their costs.
We will not consider damages that are too difficult to measure here, even though it’s important to be aware that they exist. Analysts – or devops, if there are no analysts – should still try to measure things like:
- How many users leave the site after how much time, on the critical pages;
- How many of them will return to use the website in the future.
Given these numbers, analysts (or devops, this time with some help from marketing) should find out the cost of slowdowns, in terms of:
- Lost revenues for the day/month/year;
- Lifetime value of permanently lost customers.
If these numbers are not available, someone from the business should at least be able to provide some guesstimates.
Aggregating data hides information
Data must be aggregated to be useful. We all agree on this. But this doesn’t simply mean that you should look at the latency average. Or at least, if you do, be aware that a lot of important information is not represented by those averages.
Averages are everywhere
Usually, the monitoring of any metric is based on averages, and latency is no exception. There is however a big problem with averages: in IT metrics, values significantly higher or lower than the average make up a big portion of the total. Don’t believe me? Just try a query like this on your system:
100 * COUNT(some_series < AVG(some_series) - some_margin OR some_series > AVG(some_series) + some_margin) / COUNT(some_series)
This (vaguely Prometheus-like) pseudo-code gives you the percentage of values that are significantly lower or higher than the average.
When considering an average, you should at least also look at the standard deviation. This is far from a mathematically precise analysis, but it gives you at least a feel for how representative your average is.
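Here is the same idea as the pseudo-code, as a runnable Python sketch. The latency series is synthetic (invented for the demo): a handful of slow requests drag the average far away from every real sample.

```python
from statistics import mean, stdev

def share_outside(values, margin):
    """Percentage of values below avg - margin or above avg + margin."""
    avg = mean(values)
    outside = sum(1 for v in values if v < avg - margin or v > avg + margin)
    return 100.0 * outside / len(values)

# A latency series (ms) with a long tail: mostly ~100 ms, two slow requests.
latencies = [100, 102, 98, 101, 99, 103, 97, 100, 950, 1200]

print(f"average: {mean(latencies):.0f} ms")  # → average: 295 ms
print(f"stddev:  {stdev(latencies):.0f} ms") # huge spread: the average is shaky
print(f"{share_outside(latencies, 50):.0f}% of samples are 50+ ms away "
      f"from the average")                   # → 100% of samples...
```

The average of 295 ms describes exactly none of the requests: every sample is at least 50 ms away from it, and the standard deviation confirms the spread.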
Don’t rely on graphs for this. What averaging does is precisely this: it hides peaks. The more you zoom out, the fewer peaks you will see in your panels.
Maximums are vital
Maximums are more significant. People tend to label them as outliers, and therefore dismiss them as useless. They are not.
A latency maximum is an interaction that actually happened in the real world, and may correspond to a frustrated user. Which probably means, a money loss, as we discussed above.
Graphing latency maximums by second would not be very useful because, as we discussed, the most important peaks would be hidden as you zoom out. Graphing them by 30-minute windows gives you a much more significant visualisation, but you will not know how often important peaks occur. You could have such a graph, plus another by minute that you occasionally inspect, zooming in enough to see the peaks.
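Aggregating by maximum instead of by average can be sketched in a few lines: bucket the samples into fixed windows and keep the worst value of each, so a spike survives aggregation instead of being averaged away. The sample data is invented for the demo.

```python
from collections import defaultdict

def max_per_window(samples, window_seconds):
    """samples: list of (timestamp_seconds, latency_ms).
    Returns {window_start: max_latency} for each window that has data."""
    buckets = defaultdict(float)
    for ts, latency in samples:
        start = int(ts // window_seconds) * window_seconds
        buckets[start] = max(buckets[start], latency)
    return dict(buckets)

# Normal traffic plus one 2.5-second spike in the second minute.
samples = [(1, 100), (2, 110), (65, 105), (70, 2500), (1805, 95)]
print(max_per_window(samples, 60))
# one bucket per minute; the 2500 ms spike stays visible in its bucket
```

An average over the same one-minute bucket would report ~1300 ms, and over a 30-minute bucket the spike would disappear almost entirely; the maximum keeps it at 2500 ms no matter how coarse the window.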
95th percentiles are wrong
The hints above try to answer the question “how significant are these maximums?”. But implementing such graphs is definitely not something that companies often do. So, what do they do instead? Well, those who care about the average problem at all often have graphs with averages of the 95th percentile. This is the approach adopted by several monitoring solutions.
Unfortunately, from a mathematical point of view, aggregating a percentile makes no sense at all. There is a great article by Baron Schwartz about this: Why percentiles don’t work the way you think.
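A tiny synthetic example shows why. Two 30-minute windows are invented below: a quiet one and a bad one. Averaging their per-window p95s reports a number that is half of the true p95 over the whole hour.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: small and predictable, good enough here."""
    s = sorted(values)
    k = max(0, math.ceil(p * len(s) / 100) - 1)
    return s[k]

window_a = [100] * 95 + [200] * 5    # a quiet window (latencies in ms)
window_b = [100] * 50 + [3000] * 50  # a bad one: half the requests took 3 s

p95_a = percentile(window_a, 95)                # 100 ms
p95_b = percentile(window_b, 95)                # 3000 ms
avg_of_p95 = (p95_a + p95_b) / 2                # 1550 ms: reassuring, and wrong
true_p95 = percentile(window_a + window_b, 95)  # 3000 ms: the real story
print(avg_of_p95, true_p95)
```

The averaged value suggests a moderately slow hour; the true p95 says that at least 5% of all requests took 3 seconds. Percentiles must be computed from the raw (or histogram-bucketed) data of the whole period, not averaged from sub-periods.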
Simplifying problems is fine. But please try not to oversimplify complex ones. It is easy to fall into this trap, and it makes your statistics wrong.
Cost of disasters
We will not discuss here the cost of non-ordinary disasters. But you need to be aware that such disasters are possible, and assume that at some point they will happen. You should have an idea of their costs, because this will tell you how much money it is acceptable to spend to reduce the risk. Or, how much money is not acceptable to save if this means that the risk is not reduced in a reasonable way.
Some of those risks are:
- Data loss (1 second, 1 minute, 1 day… all data);
- Theft or loss of strategic business information (BI statistics, etc.);
- Theft or loss of internal business information (company plans, HR data, etc.);
- Theft of users’ personal data.
Other potential disasters depend on the nature of your business. The point here is that you need a list of them. And for each of them, you need to know how likely the disaster is, and its potential cost.
The cost of disasters could be the topic of a future article, if there is interest.
Keeping a service alive in the long term occasionally requires maintenance operations that cause temporary disruptions. They should happen within so-called maintenance windows. These windows must be communicated to users in advance, and reported on the website while they are happening.
Maintenance windows should allow you to upgrade technologies and improve the architecture, for example. Assuming they are handled properly, they should not be considered incidents.
Maintenance windows could be the topic for a future article.
Rationality versus panic
Panicking when you randomly realise that something is slow (and even more so when something crashes) may “just work” for very young startups, but I believe they should abandon this approach as soon as they can. Is it easy? No, I never said that. Setting proper Service Level Objectives (SLOs) requires setting up proper business and IT monitoring. But this will allow you to answer questions like:
- How much money are we losing because of slowdowns and unavailability?
- What is the long term cost of a lost customer, and how many of them are we losing?
- How many missed sales and lost customers can we afford?
- What is the cost of technologies and activities that can be taken to reduce slowdowns and unavailability?
I think we all agree that hoping these things don’t happen is a natural behavior, but in the long term it has a dangerous cost.
However, the last two questions are vital. Once you ask them, you have found your way out of irrational panic. You are accepting that missed sales and customer losses simply happen. And that’s true: you can do your best to reduce slowdowns (and outages), but there is nothing you can do to avoid them completely.
If this sounds scary or cruel, just consider that IT is no different from other departments. I am not familiar with how sales, marketing or customer support work; but I have no doubt that they also see money losses. Sometimes it’s because of a human mistake, sometimes because the company’s processes are not perfect, sometimes because of external factors. All things they could take care of, but they don’t, because a company cannot address every single problem it has. Normally, this is no drama and no one yells at them for it. The important thing is that their overall performance is good. And the same should apply to IT.
Now that you know what you should look at, my hope is that you’ll start to think how to gather the information you need. And once you have that information, my hope is that you write a meaningful SLO, which could look like the following (probably with completely different numbers, depending on your specific business and what you can realistically achieve):
- Latency should be < 1 second;
- But it is ok to exceed the limit for up to 10 seconds;
- It is also ok to exceed the limit for a longer time if no more than 200 users are affected;
- There must be an average distance of 2 hours between 2 excesses;
- In case of a disaster, the service must be available again in no more than 30 minutes;
- In case of a disaster, up to 1 second of recent data can be lost.
Again: each microservice should have different rules.
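An SLO like the example above is only useful if you can check it mechanically. Here is a deliberately simplified sketch of the two tolerance rules (duration and affected users); the episode data and thresholds are illustrative, not a real monitoring API, and the 2-hour spacing and disaster-recovery rules are left out.

```python
LATENCY_LIMIT_MS = 1000   # the "< 1 second" objective
MAX_EXCESS_SECONDS = 10   # short excesses are tolerated
MAX_AFFECTED_USERS = 200  # long excesses are tolerated if few users saw them

def slo_violations(excess_episodes):
    """excess_episodes: list of (duration_seconds, affected_users)
    periods during which latency stayed above LATENCY_LIMIT_MS.
    An episode breaks the SLO only if it is BOTH too long and
    too widely seen; returns the offending episodes."""
    return [(d, u) for d, u in excess_episodes
            if d > MAX_EXCESS_SECONDS and u > MAX_AFFECTED_USERS]

episodes = [
    (5, 900),     # short: tolerated, however many users noticed
    (40, 150),    # long but barely noticed: tolerated
    (120, 2500),  # long AND widely seen: SLO violation
]
print(slo_violations(episodes))  # → [(120, 2500)]
```

Each microservice would run this with its own thresholds, per the rule above.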
You will have a clear objective, and people will learn what to do to comply with it. You will know the cost of SLO violations (approximately). The SLO will be periodically reviewed: if it turns out to be unrealistic, you should relax the thresholds; if it turns out to be too pessimistic, you can make it stricter. Another reason to review the SLO is that the costs of unavailability and downtime vary over time.
A great goal, right? Being more aware of the money you waste because of disservices, how to avoid that, and what disservices can be tolerated.
But we already mentioned that setting up proper monitoring and objectives is hard. A startup may simply not have the skills to do it, and the most common solution is to pretend that the problem doesn’t exist. Please, don’t do that. Companies and individual consultants are here to help. Yes, I’m writing this because I’m one of them. But this is not advertising, it’s content marketing: I’m not showing you sexy teenagers looking at graphs on a screen, I’ve shown you rational motivations.
Think about it. I, for one, agree with myself.