Introduction

In statistics and data science, an outlier is an observation that differs significantly from the rest of the dataset. Simply put, it is a value that is either too high or too low.

Meeting someone who is 2.3 m (7’5”) tall, is very rare. Experiencing a snowstorm in the middle of the summer is unexpected. When only 2 out of 100 business units generate most of the profit in your organization, you can tell something strange is going on. These are all examples of outliers.

In this post, you’ll find some scenarios with ready-to-use expressions that can help you start identifying outliers in your Qlik apps.

Is my KPI biased?

Since an outlier is a value located abnormally far from the rest of the dataset, any measure based on the distance between observations could be significantly biased. When a measure can be influenced by outliers, we call it a non-resistant measure. On the other hand, if the presence of the outlier doesn’t have any impact on the result, then the measure is categorized as resistant.

Biased KPIs

Some examples of non-resistant measures are:

  • Mean
  • Standard deviation
  • Pearson’s correlation coefficient
  • Skewness
  • Kurtosis

We use non-resistant measures on a daily basis and, most of the times, we ignore the presence of outliers. We usually don’t have time to check our data thoroughly or we simply don’t think about it. Therefore, we might be relying on incorrect values when making decisions.

Think about the average function. How many times have you used it as a KPI in your reports and applications? And how often you added information on whether or not there are outliers in the data model?

The average is the most commonly used formula when we want one number to represent the whole dataset. However, look at the charts below. Do you think that the blue dot really summarizes our data? From a different point of view, if someone told you there are 30 business units within a company with an average profit of $35M USD, would you imagine that 22 of them had negative results?

In some cases, we need to know the real value of the measure calculated above all the data including outliers, e.g. year-over-year comparisons.

Non-biased KPIs

The results of resistant measures like these are not affected by outliers:

  • Modus
  • Median
  • Spearman’s correlation coefficient (the distance in Pearson’s correlation coefficient is replaced by ranking that ignore absolute distances – it leads to change from non-resistant to the resistant measure)

It is not possible to say, in general, whether it’s better to have a dataset with or without outliers. Still, from a developer’s perspective, it’s vital to know about their presence so we can decide if we will ignore them or not based on our specific objectives.

How to identify outliers? Mathematical methods

The most commonly used method for outlier identification is a simple mathematical formula. The limit between normal and abnormal observations is derived by using an interquartile range.

In order to calculate this formula correctly, we must define the quantiles, quartiles, and interquartile range first. But don’t be frightened, it’s not as hard as it seems to be… and you can always scroll down and copy the expressions 😉

The x% quantile splits the whole dataset into 2 subsets so that x% of the values are lesser than the x quantile value (if we have dataset where 15% quantile = 123, that means that 15% of all observations are smaller than 123).

Quartiles are simply quantiles splitting the dataset into 4 subsets. Nothing more and nothing less.

  • 25% quantile = 1stquartile (Q1) – one quarter of all observations are smaller than the value
  • 50% quantile = 2ndquartile (Q2) – one half of all observations are smaller than the value = median
  • 75% quantile = 3rdquartile (Q3) – three quarters of all observations are smaller than the value = one quarter of all observations are higher the value

Interquartile range (also know as IQR) is the difference between Q3 and Q1.

A specific observation X is classified as outlier if the following is true:

  • X > Q3 + 1.5*IQR or X < Q1 – 1.5*IQR   =>   X > 2.5*Q3 – 1.5*Q1 or X < 2.5*Q1 – 1.5*Q3

A specific observation X is classified as extreme if the following is true:

  • X > Q3 + 3*IQR or X < Q1 – 3*IQR   =>   X > 4*Q3 – 3*Q1 or X < 4*Q1 – 3*Q3

Naturally, IQR isn’t the only method for outlier detection. Instead, you can use ± 2*standard deviation from average as the limits. There are also other methodologies that are completely different since they use density or classification instead of distances.

How to identify outliers in Qlik Sense? IQR methodology

We can start by generating our own dataset:

[data]:
load
   rowno() as X,
   exp(num#(rowno())) as Y
autogenerate 10;

In order to visualize our data, let’s use a combo chart with X as the dimension and only(Y) as measure:

If you want to create a table that identifies outliers and extremes with binary flags, just refer to the following formulas:

  • @IsOutlier:
if(
(Y > 2.5*fractile(total Y,0.75)-1.5*fractile(total Y,0.25))
or
(Y < 2.5*fractile(total Y,0.25)-1.5*fractile(total Y,0.75))
,1,0)
  • @IsExtreme:
if(
(Y > 4*fractile(total Y,0.75)-3*fractile(total Y,0.25))
or
(Y < 4*fractile(total Y,0.25)-3*fractile(total Y,0.75))
,1,0)

Adding background colors can help the users identify abnormal observations easily:

We can also precalculate the IQR in the script in order to use simpler expressions. However, it will cause static identification of outliers (the algorithm will not respond to the current filters). In some cases, this is exactly what we need, so it is good to keep that in mind.

By the way, a visual representation of this algorithm is available in Qlik Sense since June 2017.

Of course, since we’re working with Qlik Sense, there are a lot of different ways to do it. It’s your job to decide which one is the best fit for each use case.

Thanks for reading this article. If you have any questions or comments feel free to use the section below.

PS. You can download the QVF file we used in this post here.