## Introduction

In this article I will explain how to create a Qlik Sense app that runs the Machine Learning **Gradient Descent Algorithm**.

By the end of this rather long post, I hope you will be able to write a Qlik Sense application, **using only Qlik Script**, that can be genuinely useful for your business — for example predicting whether your customers are about to leave you, or whether your patients are at risk of getting diabetes.

**How to create Linear and Logistical Gradient Descent Algorithm in Qlik Script**

Here I will show you how easy it is to run the **gradient descent algorithm** on datasets with **multiple columns** and a **large number of rows**, and also how to run the algorithm when there is no linear relation between the columns (**Logistic Gradient Descent**). Once you have built the app yourself, you will also understand how the math works. No previous skills beyond Qlik coding are required, but you can learn and understand more about the math by also reading through my LinkedIn posts on this very same topic. There I have some practical examples, such as predicting survival rates on the Titanic and predicting stock index prices, as well as some looser reasoning about the maths behind the gradient descent algorithm.

You should also read my previous post about the Data Generator, because in this blog post I will use self-generated data for the algorithms.

Enough talking, let us jump right into the **basics of the code** (to download or try out 4 variations of this code, go to the end of this article).

### Create the data

As usual in supervised learning algorithms, we split the data into “**learning**” and “**testing**”. This is what the learning data can look like:

This data with two columns **X1** and **Y**, was created using the following code:

```
DataSet:
Load
    rowno() as Row,
    1 as X0,
    $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X1,
    $(initialθ0) as θ0,
    $(initialθ1) as θ1,
    if(rowno()<=$(NoOfLearningRows),'Learn','Test') as DataType
AutoGenerate($(NoOfLearningRows)+$(NoOfTestingRows));

left join (DataSet)
LOAD
    Row,
    round(
        $(LinearHypothesis) - if(DataType='Learn', ($(GaussianRandom) * $(LinearHypothesis)), 0)
    ,$(LimitMaxError)) as Y
resident DataSet;
```

using the following variable definitions, *which must have been declared first*:

```
LET NoOfLearningRows=500;
LET NoOfTestingRows=500;
LET RandomVariationSize = 0.5;
LET HighestInteger=2;
LET LowestInteger=-2;
LET FeaturePrecision=0.01;
LET HighestWeight=5;
LET LowestWeight=-5;
LET WeightPrecision=0.1;
SET GenerateRandomNumber = 'round(Rand()*($2-$1)+$1,$3)';
SET LinearHypothesis = '((X0*θ0) + (X1*θ1))';
Let initialθ0=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ1=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
SET GaussianRandom = $(GenerateRandomNumber(-0.5,0.5,0.1)) * sqrt(-2*log(rand())) * cos(2* pi() * rand());
```

### Brief code explanation

We create two sets of data in the table DataSet, with two data types, “**Learn**” and “**Test**”, 500 rows each, so the total number of rows is 1000. (FYI, I have tested the code on datasets as large as **5’000’000 rows** and it works great too.)

The dataset will have two X-columns (**X0** and **X1**, where **X1** is a random number between -2 and 2 with a precision of 0.01) and one calculated **Y** column, based on the formula defined in the variable *LinearHypothesis*: `Y = X0*θ0 + X1*θ1`. (The reason I create X0 = 1 is for the mathematics, which you can read more about in my LinkedIn posts.)

**θ0** and **θ1** are constants (called **weights**) that are randomly generated once at the beginning. The values of these two constants are kept secret from the Gradient Descent; what we are basically doing is training the algorithm on this data to make it figure out what those values actually were.

We also add some blur (*Gaussian blur*), or “dirty data”, to the Y calculation, to make it more challenging and “realistic”. The **randomness** of Y is only added to the training dataset. We keep the test dataset “perfect”, because this is where we test the result of our algorithm.

With real-world data, of course, the **test dataset** normally contains dirty data too, but for this project **we just want to learn about the algorithm and how it works**, so it makes sense for us to have **full control** over the test result.
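For readers who want to sanity-check the data generation outside Qlik, here is a small Python sketch of the same idea (my own illustration, not the article's script): a Box-Muller-style Gaussian “blur” that is applied only to the learning rows.

```python
import math
import random

def gaussian_noise(scale=0.5):
    """Box-Muller transform: turn two uniform random numbers into one
    normally distributed value, scaled down to act as the 'blur'."""
    u1 = 1.0 - random.random()   # avoid log(0)
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return scale * z

def generate_y(x1, theta0, theta1, is_learn):
    """Linear hypothesis Y = X0*θ0 + X1*θ1 (X0 is always 1), with noise
    subtracted only for learning rows, mirroring if(DataType='Learn', ...)."""
    y = theta0 * 1.0 + theta1 * x1
    if is_learn:
        y -= gaussian_noise() * y   # blur proportional to Y
    return y
```

Test rows come back exactly on the line, while learn rows scatter around it.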

#### This is how the data can look with both learning and testing data plotted:

The *orange dots* form the shape of a perfect line (*because no Gaussian blur is added to the test data*), and the green dots are the training data, which contain some randomness.

It makes sense to keep the test data perfect – for verification reasons. Have a look at this result of one algorithm, where the training data are the blue dots, the test data are represented by yellow dots, and the trained algorithm's result is represented as an orange line:

Here we see the same thing, but it is clear that the algorithm did not succeed in finding the right weights, as the line does not follow the expected test data (yellow dots).

The app we are coding in this blog post will have the following KPIs and charts:

This is the result of a badly tuned algorithm. We only reached 43% correct predictions.

Here is the result when a training session ended better:

And here with almost perfect result:

Continuing with the code, let’s look at the Gradient Descent Algorithm!

### Setting up initial weights

The fundamental idea of Gradient Descent is to mathematically, gradually adjust the weights in the formula so that they fit the expected result better and better. The algorithm does not know the formula, but it knows when it is not correct and how far away it is from being correct. So for the first round of calculations it needs to start with something. Either you **feel lucky** and start with some random initial values, or you **play it safe** and just start with 0.

On the first sheet we define some variables:

```
LET StartAtZero='No';
```

or

```
LET StartAtZero='Yes';
```

and

```
LET HighestWeight=5;
LET LowestWeight=-5;
LET WeightPrecision=0.1;
```

and

```
LET α_max = 1;
```

Play around with different values on the weights and precision. You will learn something from that.

```
If '$(StartAtZero)' <> 'Yes' then
    Join (DataSet)
    LOAD
        $(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision))) as θ0,
        $(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision))) as θ1
    AutoGenerate(1);
else
    Join (DataSet)
    LOAD
        0 as θ0,
        0 as θ1
    AutoGenerate(1);
End if;

Join (DataSet)
LOAD
    $(α_max) as α0,
    $(α_max) as α1
AutoGenerate(1);
```

### Code in Words:

To the table DataSet we add two columns, **θ0** and **θ1** (initial weights), using either zero or a random number as the value. Then we add two more columns, **α0** and **α1** (*learning rates*), and set those two to the desired max value. These 4 new columns are added to all 1000 rows in the table by the join, since there is no shared key.

## Now we start!

**The Gradient Descent is basically a simple for … loop.**

For every iteration, the algorithm gradually gets closer and closer to its goal: to **minimize the error.**

Here is the code:

```
For i = 1 to MaxIterations

temp:
LOAD
    Row, θ0, θ1, α0, α1,
    ($(LinearHypothesis)-Y) * X0 as deviation_X0,
    ($(LinearHypothesis)-Y) * X1 as deviation_X1
Resident DataSet
where DataType='Learn';
```

We create a new table called “temp” that reads all “Learn” rows from the DataSet table, adding two new calculated fields called **deviation_X0** and **deviation_X1**.

We are using the predefined variable *LinearHypothesis*, which we defined as `((X0*θ0) + (X1*θ1))`, so what we are really calculating is `(((X0*θ0) + (X1*θ1)) - Y) * X0 as deviation_X0` and `(((X0*θ0) + (X1*θ1)) - Y) * X1 as deviation_X1`.
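These deviation fields are exactly the partial derivatives of the (halved) mean squared error with respect to each weight. Here is a short Python check of that claim (illustrative only, not part of the Qlik app), comparing the analytic gradient with a finite-difference estimate:

```python
def hypothesis(theta, x):
    # Linear hypothesis: X0*θ0 + X1*θ1, with x = (x0, x1)
    return theta[0] * x[0] + theta[1] * x[1]

def cost(theta, rows):
    # Half mean squared error over (x, y) pairs
    m = len(rows)
    return sum((hypothesis(theta, x) - y) ** 2 for x, y in rows) / (2 * m)

def gradient(theta, rows):
    # Matches the Qlik fields: sum((h - y) * Xj) / m for each weight j
    m = len(rows)
    return [sum((hypothesis(theta, x) - y) * x[j] for x, y in rows) / m
            for j in range(2)]

rows = [((1.0, -1.0), 2.0), ((1.0, 0.5), -0.5), ((1.0, 2.0), 1.0)]
theta = [0.3, -0.7]
analytic = gradient(theta, rows)

# Finite-difference estimate of the same gradient
eps = 1e-6
numeric = []
for j in range(2):
    tp = theta[:]; tp[j] += eps
    tm = theta[:]; tm[j] -= eps
    numeric.append((cost(tp, rows) - cost(tm, rows)) / (2 * eps))
```

Both methods produce the same numbers, which is why summing the deviations gives the direction in which to move the weights.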

Next we summarize the deviations for X0 and X1 like this:

```
sum_errors:
Load DISTINCT
    0 as CrossTableDummy,
    FirstValue(θ0) as prev_θ0,
    FirstValue(θ1) as prev_θ1,
    FirstValue(α0) as α0,
    FirstValue(α1) as α1,
    sum(deviation_X0)/$(NoOfLearningRows) as error_θ0,
    sum(deviation_X1)/$(NoOfLearningRows) as error_θ1
Resident temp;
```

With this new table, **sum_errors**, we now have the averaged total error for **X0** and **X1**. We will use this table to calculate the **new weights**, but we will also use it to store data in our **log**, so that we can later visualize in the charts what is going on inside the algorithm.

Remember that we joined the table with **θ0, θ1, α0** and **α1**, so all rows have the same values; we just pick the first row's values since they are all the same. The purpose of **CrossTableDummy** will become clear when we create the log later.

```
drop table temp;
drop fields θ0,θ1,α0,α1 from DataSet;
```

We can now safely drop the temp table, as the summarized table sum_errors has all the values needed to calculate new weights. Our new weights, and new values of alpha, will be re-joined to the DataSet table soon, so we need to drop the old fields too.

```
left join (DataSet)
LOAD
    prev_θ0-(α0*error_θ0) as θ0,
    prev_θ1-(α1*error_θ1) as θ1,
    (1/( pow((1/fabs(error_θ0) / $(α_max_error)) + $(i),2))) + $(α_min) as α0,
    (1/( pow((1/fabs(error_θ1) / $(α_max_error)) + $(i),2))) + $(α_min) as α1
resident sum_errors;
```

Here we calculate our new weights and a new, slightly smaller value for the two alphas, and we join them back to our original DataSet table.
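In plain terms, the update subtracts the learning rate times the averaged error from each weight, while the learning rate itself decays with the iteration counter. A hedged Python sketch of the same schedule (the variables α_max_error and α_min are not defined in this article, so the default values below are my own placeholders):

```python
def update_weight(prev_theta, alpha, error):
    # Gradient descent step: θ := θ - α * error
    return prev_theta - alpha * error

def next_alpha(error, i, alpha_max_error=1.0, alpha_min=0.01):
    """Same shape as the Qlik expression:
    (1 / pow((1/fabs(error) / α_max_error) + i, 2)) + α_min
    alpha_max_error and alpha_min are assumed placeholder values."""
    return 1.0 / ((1.0 / abs(error) / alpha_max_error + i) ** 2) + alpha_min
```

The larger `i` gets, the smaller the step, which stabilizes the final iterations while `alpha_min` keeps the algorithm from stalling completely.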

```
Let finalθ0 = round(peek('prev_θ0',-1,'sum_errors'),$(FeaturePrecision));
Let finalθ1 = round(peek('prev_θ1',-1,'sum_errors'),$(FeaturePrecision));
```

We store the latest weight values. This is only for the user interface and our eagerness to test how well we did in the end. The values will of course be overwritten in every loop until we reach the end.

```
temp_error:
CrossTable(A,B,1)
LOAD CrossTableDummy, error_θ0 as θ0, error_θ1 as θ1
resident sum_errors;

Concatenate(log)
LOAD $(i) as log.row, dual(A,AutoNumber(A)-1) as Feature, B as ErrorValue
resident temp_error;

drop table temp_error;
```

This is for our user interface. We must have created a log table at the start, for instance like this:

```
log:
Load * INLINE [log.row];
```

so that *Concatenate(log)* does not fail. This log row holds all **current errors**. You will soon see how they get smaller and smaller with every iteration.

```
temp_weights:
CrossTable(A,B,1)
LOAD CrossTableDummy, prev_θ0 as θ0, prev_θ1 as θ1
resident sum_errors;

Concatenate(log)
LOAD $(i) as log.row, dual(A,AutoNumber(A)-1) as Feature, B as WeightValue
resident temp_weights;

drop table temp_weights;
```

This is the log post for the **current weights**.

```
temp_alfa:
CrossTable(A,B,1)
LOAD CrossTableDummy, α0 as θ0, α1 as θ1
resident sum_errors;

Concatenate(log)
LOAD $(i) as log.row, dual(A,AutoNumber(A)-1) as Feature, B as AlfaValue
resident temp_alfa;

drop table temp_alfa;
```

And this is the log post for the **current learning rate**. All three of these log posts are only for understanding the algorithm; they are not needed for the algorithm itself.

```
MaxError:
Load rangemax(fabs(error_θ0),fabs(error_θ1)) as MaxError
Resident sum_errors;
```

We don’t want to keep iterating in the for loop when the error is so small that it no longer matters to us. This calculation tells us the **current biggest error**.

```
drop table sum_errors;
```

Now we can drop the sum_errors table and get ready to return to the start… but first:

```
Let MaxError = peek('MaxError');
drop table MaxError;
```

```
if MaxError<=LimitMaxError then
    exit for;
end if;
```

**Quit the loop if we have a good enough result.** The variable LimitMaxError should be defined at the beginning. This is my suggestion:

```
LET LimitMaxError=$(FeaturePrecision)*$(WeightPrecision);
```

This is because the error limit should take into account how much precision you have in the weights and features.

```
next i;

Let vRequiredIterations = $(i);
```

**And we are done!** We return to the start if the error is still high. When the loop is over, we store how many iterations were needed in the variable *vRequiredIterations*.
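To see the whole loop in one place, here is a compact Python sketch of the same batch structure (my own illustration, not the author's script): compute the averaged deviations, step the weights, and exit when the largest error drops below the limit.

```python
def batch_gradient_descent(rows, alpha=0.1, max_iterations=10000, limit=1e-4):
    """Batch gradient descent over rows of ((x0, x1), y).
    Returns (theta, iterations used)."""
    theta = [0.0, 0.0]                       # the 'StartAtZero' choice
    m = len(rows)
    for i in range(1, max_iterations + 1):
        # Averaged deviations, as in the sum_errors table
        errors = [sum((theta[0] * x[0] + theta[1] * x[1] - y) * x[j]
                      for x, y in rows) / m
                  for j in range(2)]
        # Weight update: θ := θ - α * error
        theta = [t - alpha * e for t, e in zip(theta, errors)]
        # MaxError-style stopping criterion
        if max(abs(e) for e in errors) <= limit:
            break
    return theta, i

# Noise-free data generated with 'secret' weights θ0 = 1.5, θ1 = -2.0
data = [((1.0, x / 10.0), 1.5 - 2.0 * (x / 10.0)) for x in range(-20, 21)]
theta, iterations = batch_gradient_descent(data)
```

Run on clean data like this, the loop recovers the secret weights in a few hundred iterations at most.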

## So what is the result?

To make it easy to see **how well the algorithm has worked**, we finish the script with this code:

```
left join (DataSet)
LOAD Row, $(finalθ0) as θ0, $(finalθ1) as θ1
resident DataSet;

left join (DataSet)
LOAD Row, round($(LinearHypothesis),$(LimitMaxError)) as Predicted_Y
resident DataSet;
```

We use the last calculated weights **finalθ0** and **finalθ1**, with the variable LinearHypothesis to calculate a “**Predicted_Y**” on every row in the DataSet table.

This makes it possible to verify how this Predicted_Y compares to the calculated, expected Y we created in the beginning.
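Continuing the Python illustration, the comparison amounts to checking the prediction against the expected Y within a small tolerance and counting the matches (the tolerance here is my own choice, not taken from the article):

```python
def accuracy(rows, theta, tolerance=0.01):
    """Share of rows where the predicted Y matches the expected Y
    within a small tolerance. rows: list of ((x0, x1), y)."""
    hits = sum(1 for (x0, x1), y in rows
               if abs(theta[0] * x0 + theta[1] * x1 - y) <= tolerance)
    return hits / len(rows)

# Noise-free test rows generated with the 'secret' weights 1.5 and -2.0
test_rows = [((1.0, x / 10.0), 1.5 - 2.0 * (x / 10.0)) for x in range(-20, 21)]
```

With the exact weights every test row matches; with wrong weights the share of hits drops, which is exactly what the KPI on the sheet below reports.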

**We run the script, and then let’s have a look at the app!**

### First Sheet: Training Log

On this sheet you can see what was going on during the loop: how the **errors got smaller and smaller**, how the **two weights got closer and closer to stable values**, and how the **learning rates (the alphas) started at 1 and kept shrinking.**

### Second Sheet: Prediction vs Actual

On this sheet you see the actual result: how good the algorithm's predictions were.

## Adding more features (more X-columns)

The sweet thing about the gradient descent algorithm is that it is **very scalable**. You can **easily add more columns** without many changes in the code. Below you see the changes needed to go from one column (**X1**) to 5 columns (**X1, X2, X3, X4, X5**).

First, add the features to the LinearHypothesis variable like this:

From

```
SET LinearHypothesis = '((X0*θ0) + (X1*θ1))';
```

to

```
SET LinearHypothesis = '((X0*θ0) + (X1*θ1) + (X2*θ2) + (X3*θ3) + (X4*θ4) + (X5*θ5))';
```

To run the algorithm on 5 columns instead of 1, you obviously also need to create those new columns of data in the DataSet:

More initial weights:

```
Let initialθ0=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ1=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ2=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ3=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ4=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ5=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
```

more columns in the table:

```
DataSet:
Load
    rowno() as Row,
    1 as X0,
    $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X1,
    $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X2,
    $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X3,
    $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X4,
    $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X5,
    $(initialθ0) as θ0,
    $(initialθ1) as θ1,
    $(initialθ2) as θ2,
    $(initialθ3) as θ3,
    $(initialθ4) as θ4,
    $(initialθ5) as θ5,
    if(rowno()<=$(NoOfLearningRows),'Learn','Test') as DataType
AutoGenerate($(NoOfLearningRows)+$(NoOfTestingRows));
```

and more rows of calculation in the algorithm itself:

```
($(LinearHypothesis)-Y) * X0 as deviation_X0,
($(LinearHypothesis)-Y) * X1 as deviation_X1,
($(LinearHypothesis)-Y) * X2 as deviation_X2,
($(LinearHypothesis)-Y) * X3 as deviation_X3,
($(LinearHypothesis)-Y) * X4 as deviation_X4,
($(LinearHypothesis)-Y) * X5 as deviation_X5
```

Basically, you just add one more row for every X, every θ, and every α.
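The scaling pattern is easier to see in a generic sketch: with one weight per column, the deviation and update become the same loop for any number of features (again Python for illustration, not the Qlik code):

```python
def gradient_step(rows, theta, alpha):
    """One batch gradient descent step for any number of features.
    rows: list of (x, y) where x is a feature list with x[0] == 1."""
    m, n = len(rows), len(theta)

    def h(x):  # linear hypothesis: sum of θj * Xj
        return sum(t * xi for t, xi in zip(theta, x))

    # One averaged deviation per feature, exactly like deviation_X0..X5
    errors = [sum((h(x) - y) * x[j] for x, y in rows) / m for j in range(n)]
    return [t - alpha * e for t, e in zip(theta, errors)]
```

Going from 2 to 6 features means nothing more than passing longer feature lists and a longer `theta`.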

**The result can be fascinating to look at:**

## Predicting a binary value

You can also use Gradient Descent to calculate something “*binary*”: to predict whether something is **true** or **false**, or with **what degree of probability** it is true or false. For instance, you may want to predict if a patient is at risk of getting diabetes based on some blood and lifestyle data, or whether a customer is about to leave you for a competitor based on their sales data.

**You can do this with Gradient Descent**, and hence, you can do this directly in Qlik Sense using only Qlik Sense Scripting.

To do this we need to **wrap** a **non-linear function** around our **linear hypothesis**. That may sound complicated, but in fact it is not! Let me explain:

Going back to our linear, one-dimensional code above, we only need to change two lines of code!

From

```
($(LinearHypothesis)-Y) * X0 as deviation_X0,
($(LinearHypothesis)-Y) * X1 as deviation_X1
```

To

```
((1/(1 + pow( e(),-$(LinearHypothesis) )))-Y) * X0 as deviation_X0,
((1/(1 + pow( e(),-$(LinearHypothesis) )))-Y) * X1 as deviation_X1
```

**And that is it!**

The rest of the algorithm code remains exactly the same.
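In Python terms (again just an illustration), the change is a sigmoid wrapped around the linear hypothesis inside the deviation calculation; the weight update itself is untouched:

```python
import math

def sigmoid(z):
    # 1 / (1 + e^(-z)) -- the Qlik (1/(1 + pow(e(), -...))) expression
    return 1.0 / (1.0 + math.exp(-z))

def logistic_errors(rows, theta):
    """Averaged deviations with the sigmoid around the hypothesis.
    rows: list of ((x0, x1), y) with y in {0, 1}."""
    m = len(rows)

    def h(x):
        return sigmoid(theta[0] * x[0] + theta[1] * x[1])

    return [sum((h(x) - y) * x[j] for x, y in rows) / m for j in range(2)]
```

Feeding these errors into the same `θ := θ - α * error` update gives logistic gradient descent.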

*(If you generate your own data, you must also change **how you create your DataSet**, because your training and testing data also need to have a binary profile.)*

### This is an example of how to create a binary training sample:

As you see – still almost exactly the same as for a linear data sample:

```
Let initialθ0=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));
Let initialθ1=$(GenerateRandomNumber($(LowestWeight),$(HighestWeight),$(WeightPrecision)));

DataSet:
Load
    rowno() as Row,
    1 as X0,
    $(GenerateRandomNumber($(LowestInteger),$(HighestInteger),$(FeaturePrecision))) as X1,
    $(initialθ0) as θ0,
    $(initialθ1) as θ1,
    if(rowno()<=$(NoOfLearningRows),'Learn','Test') as DataType
AutoGenerate($(NoOfLearningRows)+$(NoOfTestingRows));

left join (DataSet)
LOAD
    Row,
    if(match(Row,$(NewRandomNumberList)), round(rand()),
       round((1/(1 + pow( e(),-$(LinearHypothesis) ))))) as Y,
    if(match(Row,$(NewRandomNumberList)),'Randomized','Calculated') as RowType
resident DataSet;
```

The difference is mostly in the last part, where we calculate **Y** using the new `(1/(1 + pow( e(),…` code, and the fact that we sometimes randomize Y instead of calculating it, controlled by a variable called *NewRandomNumberList*.

The `(1/(1 + pow( e(),…` code comes from something called the **Sigmoid function**. A sigmoid function is a “trigger function”; it basically works like a “binary converter”: it takes any value and returns a value between 0 and 1.

The function can be plotted like this:

The **Sigmoid** function takes the result of the **linear hypothesis** and outputs a value between 0 and 1. The more negative the value from the linear hypothesis, the closer the Sigmoid output gets to 0; the more positive the value, the closer it gets to 1.
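A tiny Python demonstration of that squeezing behavior (illustration only):

```python
import math

def sigmoid(z):
    # 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs squeeze toward 0, large positive toward 1,
# and 0 lands exactly in the middle.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```

Note also that the curve is symmetric: `sigmoid(z) + sigmoid(-z)` is always exactly 1.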

So if we say that a customer who will leave = 1 and a customer who will stay = 0, the algorithm will do its best to figure out which weights make the linear hypothesis return values that produce the correct sigmoid output.

You don’t need to understand the mathematics really – just be reassured it works!

So now you have an app that can help you predict whether your customers will keep buying from you, or whether they are on a path to buying from someone else – perhaps even before they know it themselves! And see, you did not have to learn Python!

Go ahead! Call them asap, what are you waiting for?

## Final words

At the beginning of this project, I was not expecting that running Machine Learning algorithms like this would be **practically useful** – it was more a way for me to learn about this and understand how the algorithms work.

After running this Logistic Gradient Descent algorithm on a dataset containing 50’000 rows with 5 input columns – on my laptop – it took 3 minutes to reach close to 99.5% “correctness”; I let it iterate 400 times. Then I tested running the algorithm on 500’000 rows of test data. After 500 iterations and about 30 minutes of calculation, the prediction accuracy was 99.2%. And this is just running on my laptop with Qlik Sense Desktop. I then experimented with running the code on 5’000’000 rows of data. Still the same result – a very high rate of correct predictions, and in some cases I reached 100% correct predictions for 500 new test values after just 7 minutes of calculations! **This is very scalable!**

The different versions of Gradient Descent, called **Batch** gradient descent, **Stochastic** gradient descent and **Mini-batch** gradient descent, are different ways of handling gradient descent when the datasets are **really large**. Batch gradient descent (what we implemented above) uses every row in each update; stochastic and mini-batch gradient descent instead update the weights using one row, or a small “batch” of rows, at a time. The end result is usually a **faster calculation** (because you don't calculate on every row for every update), but with the trade-off of a **noisier path toward the right weights.**
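For completeness, here is a hedged Python sketch of the mini-batch idea (my own illustration; the app in this article uses full-batch updates): one pass shuffles the rows and updates the weights once per small batch instead of once per full scan.

```python
import random

def minibatch_pass(rows, theta, alpha, batch_size=32):
    """One epoch of mini-batch gradient descent over rows of ((x0, x1), y):
    shuffle, then update the weights once per batch instead of once per
    full scan of the data."""
    shuffled = rows[:]
    random.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        m = len(batch)
        h = lambda x: theta[0] * x[0] + theta[1] * x[1]   # linear hypothesis
        errors = [sum((h(x) - y) * x[j] for x, y in batch) / m
                  for j in range(2)]
        theta = [t - alpha * e for t, e in zip(theta, errors)]
    return theta
```

Each individual update sees less data, so the path wobbles, but with many more (cheap) updates per pass the weights still converge on clean data.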

But when we write the code in Qlik Script, there is no real need for this kind of batch processing, because at the end of the day – with the power of the Qlik Engine and its in-memory processing – it just fits perfectly! No need to invest in new Machine Learning servers if all you need is Gradient Descent – you already have the power in your current Qlik Sense platform!

### Article Resources

If you are interested in trying or downloading gradient descent apps, you find them here: