Introduction

In this article I will explain how create an Qlik Sense App that runs the Machine Learning Gradient Descent Algorithm.

As you reach the end of this rather loong post, I hope you will be able to write a Qlik Sense application using only Qlik Script, that can be very useful for your business, like predicting if your customers are about the leave you, or predict if your patients are at risk of getting diabetes.

How to create Linear and Logistical Gradient Descent Algorithm in Qlik Script

 

I will show you here how easy it is to run the gradient descent algorithm on datasets with multiple columns, large amount of rows, and also how to run the algorithm when there is not a linear relation between the columns (Logistic Gradient Descent). When you have done the app yourself, you will also be able to understand how the math is working. No previous skills but Qlik Coding is required, but you might learn and understand more about the math by also reading though my LinkedIn on this very same topic. I have some practical examples such as Predicting Survival Rate on Titanic, Predict Stock Index Prices. But also some more loose reasoning about the maths behind the gradient descent algorithm.

You also need to read my previous post about Data Generator, because in this blog post I will use self-generated-data for the algorithms.

Enough talking, let us jump right into the basics of the code (to download or try out 4 variations of this code, go to the end of this article).

Create the data

As usual, in supervised learning algorithms, we split the data in “learning” and “testing”.  This is what the learning data can look like:

This data with two columns X1 and Y, was created using the following code:

DataSet:
Load
  rowno() as Row,
  1   as  X0,
  $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as  X1,
  $(initialθ0) as θ0,
  $(initialθ1) as θ1,
  if(rowno()<=$(NoOfLearningRows),'Learn','Test') as DataType
AutoGenerate($(NoOfLearningRows)+$(NoOfTestingRows));

left join (DataSet)
LOAD 
  Row,
  round( $(LinearHypothesis) - 
  if(DataType='Learn',
    ($(GaussianRandom) * $(LinearHypothesis)) 
    ,0)
    ,$(LimitMaxError))  as Y 
resident DataSet;

using the following variable definitions that must have been declared first:

LET NoOfLearningRows=500;
LET NoOfTestingRows=500;
LET RandomVariationSize = 0.5;
LET HighestInteger=2;
LET LowestInteger=-2;
LET FeaturePrecision=0.01;
LET HighestWeight=5;
LET LowestWeight=-5;
LET WeightPrecision=0.1;
SET GenerateRandomNumber  = 'round(Rand()*($2-$1)+$1,$3)';
SET LinearHypothesis = '((X0*θ0) + (X1*θ1))';
Let initialθ0=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ1=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
SET GaussianRandom = $(GenerateRandomNumber(-0.5,0.5,0.1)) * sqrt(-2*log(rand())) * cos(2* pi() * rand());

Brief code explenation

We create two sets of data in the table Dataset with two data types “Learn” & “Test”, 500 rows each so a total number of rows is 1000. (FYI I have tested the code on as large datasets as 5’000’000 rows and it works great too.)

The dataset will have 2 X-columns (X0 and X1 where X1 is a random number between -2 and 2 with a precision of 0.01) and one calculated Y column which is based on the formula defined in the variable LinearHypothesis as Y = X0*θ0 + X1*θ1. (The reason why I create X0 = 1 is for the mathematics, which you can read more about in my LinkedIn posts).

Θ0 and θ1 are constants (called Weights) that is randomly generated once in the beginning. The value of those two constants are kept secret for the Gradient Descent, and what we are basically doing is to try to train the algorithm on this data and make it figure out what those values actually were.

We are also adding a blur (gaussian blur) / “dirty data” to the Y calculation, to make it more challenging and “realistic”. The randomness of Y is only added for the training dataset. We keep the Test dataset “perfect” because this is where we are testing the result of our algorithm.

With real world data, of course, the test dataset also normally contains dirty data, but for this project we just want to learn about the algorithm and how it is working and it makes sense for us to have full control over the test result.

 

This how the data can look with both learning and testing data plotted:

The orange dots forms the shape of a perfect line (because we have no gaussian blur added to the test data), and the green dots are the training data that has some kind of randomness.

It makes sense to keep test data perfect – for verification reasons. Have a look at this result of one algorithm where training data are the blue dots, test data are represented with yellow dots, and the training algorithm result is represented as an orange line:

Here we see the same thing, but it is clear that the algorithm did not succeed to find the right weights very well, as the line does not follow the expected test data (yellow dots).

The app we coding in this blog post will have the following KPI’s and charts:

This is the result of a bad trimmed algorithm. We only reached 43% correct predictions.

Here is the result when a training session ended better:

And here with almost perfect result:

Continuing on the code, let’s look at the Gradient Descent Algorithm!

Setting up initial weights

The fundamental idea with Gradient Descent is to mathematically gradually adjust the weights in the formula to better and better fit towards the expected result. It does not know the formula but it knows when it is not correct and it knows how far away it is from being correct. So for the first round of calculations it need to start with something. Either you want to feel lucky and start with some random initial values, or you want to play it safe and just start with 0.

On the first sheet we define some variables:

LET StartAtZero='No';

or

LET StartAtZero='Yes';

and

LET HighestWeight=5;
LET LowestWeight=-5;
LET WeightPrecision=0.1;

and

LET α_max = 1;

Play around with different values on the weights and precision. You will learn something from that.

If '$(StartAtZero)' <> 'Yes' then
  Join (DataSet)
  LOAD 
      $(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision))) as θ0,
      $(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision))) as θ1
  AutoGenerate(1);
else
 Join (DataSet)
  LOAD 
      0 as θ0,
      0 as θ1
  AutoGenerate(1);
End if; 

Join (DataSet)
LOAD     
    $(α_max) as α0,
    $(α_max) as α1
AutoGenerate(1);

Code in Words:

With the table DataSet – add two columns θ0 and θ1 (inital weights) using either zero as value or a random number. Then add two more columns α0 and α1 (learning rates) and set those two the desired max value. Those 4 new columns are added to all 1000 rows in the table using this join where there is no shared key.

Now we start!

The Gradient Descent is basically a simple for  … loop.

For every iteration, the algorithm is “gradiently” getting closer and closer to it’s goal – to minimize the error.

Here is the code:

For i = 1 to MaxIterations
temp:
LOAD
  Row,θ0,θ1,α0,α1,
  ($(LinearHypothesis)-Y) * X0 as deviation_X0,
  ($(LinearHypothesis)-Y) * X1 as deviation_X1
Resident DataSet where DataType='Learn';

 

We create a new table called “temp” that is reading all “learning” rows from the DataSet table, adding a two new calculated fields called deviation_X0 and deviation_X1.

We are using the predefined variable called “LinearHypothesis“, the we defined as ((X0*θ0) + (X1*θ1)) so what we are doing in reality is calculating the new fields as  (((X0*θ0) + (X1*θ1)) – Y )* X0  as deviation_X0 and(((X0*θ0) + (X1*θ1)) – Y )* X1 as deviation_X1.

Next we summarize the deviations for X0 and X1 like this:

sum_errors:
Load DISTINCT
  0 as CrossTableDummy, 
  FirstValue(θ0) as prev_θ0, 
  FirstValue(θ1) as prev_θ1,
  FirstValue(α0) as α0,
  FirstValue(α1) as α1,
  sum(deviation_X0)/$(NoOfLearningRows) as error_θ0, 
  sum(deviation_X1)/$(NoOfLearningRows) as error_θ1 
Resident temp;

With this new table sum_errors we now have the total sum error for X0 and X1. We will use this table to calculate new weights, but we will also use it to store data to our log so that we can visualize in the charts later what is going on inside the algorithm.

Remember that we joined the table with θ0,θ1,α0 and α1, so all rows have the same value. We just pick the first rows values as they are all the same.  The creation of CrossTableDummy will be clear as we create the log later.

drop table temp; 
drop fields θ0,θ1,α0,α1 from DataSet;

We can now safely drop the temp table, as the summarized table sum_errors has all values needed to calculate new weights. Our new weights, and new values of alfa will be re-joined to the DataSet table soon, so we need to drop those too.

left join (DataSet) 
LOAD 
  prev_θ0-(α0*error_θ0) as θ0,
  prev_θ1-(α1*error_θ1) as θ1,
  (1/( pow((1/fabs(error_θ0) / $(α_max_error)) + $(i),2))) + $(α_min) as α0,
  (1/( pow((1/fabs(error_θ1) / $(α_max_error)) + $(i),2))) + $(α_min) as α1
resident sum_errors;

 

Here we calculate our new weights and we calculate a new, slightly smaller value on the two alfas, and we join them back to our orginal DataSet table.

 

Let finalθ0 = round(peek('prev_θ0',-1,'sum_errors'),$(FeaturePrecision));
Let finalθ1 = round(peek('prev_θ1',-1,'sum_errors'),$(FeaturePrecision));

Store the latest weight values. This is only for the user interface and our eager to test how well we are in the end. Values will of course be overwritten in every loop until we reach the end.

temp_error:
CrossTable(A,B,1) LOAD CrossTableDummy,error_θ0 as θ0,error_θ1 as θ1 resident sum_errors;
Concatenate(log) LOAD $(i) as log.row,dual(A,AutoNumber(A)-1) as Feature,B as ErrorValue resident temp_error;
drop table temp_error;

This is for our user interface. We must have created a log table in the start using for instance this:

log:Load * INLINE [log.row];

So that Concatenate(log) does not fail. This log row holds all current errors. You will see soon how it is getting smaller and smaller for every iteration.

temp_weights:
CrossTable(A,B,1) LOAD CrossTableDummy,prev_θ0 as θ0, prev_θ1 as θ1 resident sum_errors;
Concatenate(log) LOAD $(i) as log.row,dual(A,AutoNumber(A)-1) as Feature,B as WeightValue resident temp_weights;
drop table temp_weights;

This is the log post for the current weights.

temp_alfa:
CrossTable(A,B,1) LOAD CrossTableDummy,α0 as θ0,α1 as θ1 resident sum_errors;
Concatenate(log) LOAD $(i) as log.row,dual(A,AutoNumber(A)-1) as Feature,B as AlfaValue resident temp_alfa;
drop table temp_alfa;

And this is the log post for the current learning rate. All these three log posts are only for understanding the algorithm are not needed for the algorithm itself.

MaxError:
Load 
  rangemax(fabs(error_θ0),fabs(error_θ1)) as MaxError
Resident sum_errors;

We don’t want to continue to iterate in the for loop when the error is so small that it does not matter anymore to us. This calculation is to understand what is the current biggest error.

drop table sum_errors;

Now we can drop the sum_error table and get ready to return to start…. but first:

Let MaxError = peek('MaxError');
drop table MaxError;
if MaxError<=LimitMaxError then 
  exit for;
end if;

Quit the loop in case we have a good enough result. Variable LimitMaxError should be defined as something in the beginning. This is my suggestion:

LET LimitMaxError=$(FeaturePrecision)*$(WeightPrecision);

Because the error should be taking in consideration how much precision you have in weights and features,

next i;
Let vRequiredIterations = $(i);

And we are done! Return to start in case error is still high! When loop is over we store how many iterations was needed in the variable vRequiredIterations 

So what is the result?

To make it easy to see how well the algorithm has worked, we finish the script with this code:

left join (DataSet)
LOAD
  Row,
  $(finalθ0) as θ0,
  $(finalθ1) as θ1
resident DataSet;

left join (DataSet)
LOAD 
  Row,
  round($(LinearHypothesis),$(LimitMaxError)) as Predicted_Y
resident DataSet;

We use the last calculated weights finalθ0 and finalθ1, with the variable LinearHypothesis to calculate a “Predicted_Y” on every row in the DataSet table.

This result in the possibility to verify how this Predicted_Y compares to the calculated and expected Y we did in the beginning.

We run the script, and then let’s have a look at the app!


First Sheet: Training Log

 

On this sheet you can understand what was going on during the loop. You can see how the errors got smaller and smaller, and the two weights got closer and closer to stable values, and how the learning rates (The alfas) started at 1 and got smaller and smaller.

Second Sheet: Prediction vs Actual

On this sheet you see the actual result - how good predictions did the algorithm do.


Adding more features (more X-columns)

The sweet thing with the gradient descent algorithm is that it is very scalable. You can easily add more columns, without much changes in the code. Below you see the changes needed to go from one column (X1) to 5 columns (X1,X2,X3,X4,X5).

 

First, add the features to the LinearHypothesis variable like this:

 

From

SET LinearHypothesis = '((X0*θ0) + (X1*θ1)';

to

SET LinearHypothesis = '((X0*θ0) + (X1*θ1) + (X2*θ2) + (X3*θ3) + (X4*θ4) + (X5*θ5))';

To run the algorithm for 5 columns instead of 1, obviously, you also need to create those new columns of data in the DataSet:

More initial weights:

Let initialθ0=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ1=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ2=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ3=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ4=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ5=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));

more columns in the table:

DataSet:
Load
rowno() as Row,
1 as X0,
$(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X1,
$(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X2,
$(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X3,
$(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X4,
$(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X5,

$(initialθ0) as θ0,
$(initialθ1) as θ1,
$(initialθ2) as θ2,
$(initialθ3) as θ3,
$(initialθ4) as θ4,
$(initialθ5) as θ5,
if(rowno()<=$(NoOfLearningRows),'Learn','Test') as DataType
AutoGenerate($(NoOfLearningRows)+$(NoOfTestingRows));

and more rows of calculation in the algorithm itself:

($(LinearHypothesis)-Y) * X0 as deviation_X0,
($(LinearHypothesis)-Y) * X1 as deviation_X1,
($(LinearHypothesis)-Y) * X2 as deviation_X2,
($(LinearHypothesis)-Y) * X3 as deviation_X3,
($(LinearHypothesis)-Y) * X4 as deviation_X4,
($(LinearHypothesis)-Y) * X5 as deviation_X5

basically just add more rows for every x, every θ, and every α.

The result can be fascinating to look at:

 

 


Predicting a binary value

You can use Gradient Descent also to calculate something “binary“.

If you want to predict if something is true or false, or to what degree of probability something is true or false. For instance, you want to predict if a patient is at risk of getting diabetes based on some blood and lifestyle data, or you want to predict if a customer is about to leave you and go to your competitor based on their sales data.

You can do this with Gradient Descent, and hence, you can do this directly in Qlik Sense using only Qlik Sense Scripting.

To do this we need to wrap a non-linear-function around our linear hypothesis. Sounds complicated perhaps, but in fact it is not! I will try to explain:

Going back to our non linear, one dimensional code, above, we only need to change two line of code!

From

($(LinearHypothesis)-Y) * X0 as deviation_X0,
($(LinearHypothesis)-Y) * X1 as deviation_X1

To

((1/(1 + pow( e(),-$(LinearHypothesis) )))-Y) * X0 as deviation_X0,
((1/(1 + pow( e(),-$(LinearHypothesis) )))-Y) * X1 as deviation_X1

 

And that is it!

 

The rest of the algorithm code remains exactly the same

(If you generate your own data, you also must change how you create your DataSet, because your training and testing data also need to have a binary profile)…

This is an example of how to create a binary training sample:

As you see – still almost exactly the same as for a linear data sample:

Let initialθ0=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ1=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));

DataSet:
Load
rowno() as Row,
1 as X0,
$(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X1,
$(initialθ0) as θ0,
$(initialθ1) as θ1,
if(rowno()<=$(NoOfLearningRows),'Learn','Test') as DataType
AutoGenerate($(NoOfLearningRows)+$(NoOfTestingRows));

left join (DataSet)
LOAD 
Row,
if(match(Row,$(NewRandomNumberList)),round(rand()),
round((1/(1 + pow( e(),-$(LinearHypothesis) ))))) as Y,
if(match(Row,$(NewRandomNumberList)),'Randomized','Calculated') as RowType
resident DataSet;

The difference is mostly in the last part, where we calculate the Y based on the new “(1/(1 + pow( e()…” code, and the fact that we randomize Y sometimes instead of calculating it using a variable created called NewRandomNumberList.

The code “(1/(1 + pow( e()….” is coming from something called a “Sigmoid Function”. A sigmoid function is a “trigger function”, and it basically works like a “binary converter”. It takes any value, and return a value between 0 and 1.

The function can be plotted like this:

The Sigmoid function takes the result from the linear hypothesis, and outputs a value between 0 and 1. The more negative value from the linear hypothesis, the closer Sigmoid returns 0, and the more positive values from the hypothesis, Sigmoid will return 1.

So if we say that a customer will leave = 1, and a customer will stay = 0 – the result of the algorithm will do its best to understand what values the linear hypothesis return that give back a correct sigmoid output.

You don’t need to understand the mathematics really – just be reassured it works!

 

So now you have an app that can help you predict if your customers are going to keep buying from you, or if they are on a path of buying from someone else – perhaps even before they know it! And you see! You did not have to learn Python!

Go ahead! Call them asap, what are you waiting for?


Final words

In the beginning of this project I was not expecting that running Machine Algorithms like this to be practically useful – it was more the way for me to learn about this and understand how the algorithms work.

After running this Logistical Descent Algorithm on a data set containing 50’000 rows with 5 input columns – on my laptop – it took 3 minutes to reach close to 99.5% “correctness”, I let it iterate 400 times. Then I tested running the algorithm on 500’000 rows of test data. After 500 iterations and about 30 minutes of calculation the prediction was 99.2%. And this is just when running on my laptop using Qlik Sense Desktop. I started experimenting in running the the code on 5’000’000 rows of data. Still, the same result – very high rate of correct predictions, and in some cases I reached 100% correct predictions for 500 new test values after just 7 minutes of calculations! The is very scalable!

The different versions of Gradient Descent called Batch gradient descent, Stochastic gradient descent and Mini-batch gradient descent are all different ways to handle Gradient Descent when the data sets are really large. You basically split up the original data set in smaller sets and run the algorithm many ways in “batches” of data. The end result is usually a faster calculation (because you don’t actually end up calculating on every row) but with the negative effect of a less good prediction model.

 

But when we do the code inside Qlik Script, there is no need to do this kind of batch processing, because at the end of the day – the power of the Qlik Engine and the In-Memory-Processing we have in Qlik- it just fits perfectly! No need to invest in new Machine Learning Servers if all you need is Gradient Descent!  – You already have the power in your current Qlik Sense Platform!