Introduction

This blog post was supposed to be about the Gradient Descent Algorithm. But now it is not. That will be my next post. 😊

Because, in my passion to understand how various Machine Learning Algorithms are working and how to implement them in Qlik Script, I am always in need of lots of different data sets. And there are many good data set out there, free to use, but I realized that I spent to much time understanding the data rather than understanding the algorithms. I want to be able to “ignore” the data in a way, but still understand if the result of the algorithm is good or not… I wanted to understand, test, and run the algorithm but avoid spending time on understanding the data.

Dataset Generator Makes Sense!

So, to make things easier for me I decided to create an app that can generate the data I need. I created a “Dataset Generator”!

To my big surprise, I am find my self not only use Qlik Sense Script for Machine Learning, I am now also using Qlik Script to generate the data I need to run the scripts for Machine Learning. 😊And this makes Sense!

The great thing with the “data set generator” is that I can decide what kind of data I want to try the algorithm on – how many fields, minimum and maximum values, how much “randomness” and “outliners”, etc, and also very important, I have complete understanding of what is the expected “truth”, since I am also first creating the “truth”, and I can instead focus on the result of the algorithm.

How to Code it

The basic functionality behind the dataset generator is very simple. I use the Qlik Function Autogenerate() to generate as many rows I need, then some math functions and particularly the rand() function to generate some random input data.

Let’s say I want to generate some product related data like this:

Product number Weight Shipping Cost
1 100kg 19$
2 12kg 2$
etc

What I want to do later is to train a Machine Learning algorithm on this data to be able to estimate a good Shipping Cost for new items:

Product number Weight Shipping Cost
1 67kg ?

Creating the table

First, we define a variable with input values like this to make it easier to generate random number with an interval and flexible decimal precision:

SET GenerateRandomNumber = 'round(Rand()*($2-$1)+$1,$3)';

Creating a random integer between 1 and 100 is then written like this:

$(GenerateRandomNumber(1,100,1))

Then we create a table like this:

DataSet:
LOAD
  Rowno() as ProductNumber,
  $(GenerateRandomNumber(1,100,1)) as Weight
autogenerate(10);

So we get 10 rows with random integers for weight spanning between 1 and 100.

After this we need to generate the “Shipping Cost”.

To make things easy at this point, let us assume that the Shipping Cost has a direct linear relation to the products weights. The heavier a product is, the higher Shipping Cost.

So we “invent” a formula that will calculate the Shipping Cost:

Shipping Cost = 1.15 * Weight + 0.17 * Volume

The values 1.15 and 0.17 I just choose out of the air … or rather so that they give this following result for 100kg and 12kg, which is similar enough to what I wanted in the beginning:

Product number Weight Calculation Shipping Cost
1 100 kg =1.15+0.17*100 $ 18.15
2 12 kg =1.15+0.17*12 $ 3.19

So that is all great – we have a formula that can generate Shipping Cost. We join this back to our DataSet with weights:

Left join (DataSet)
LOAD
  ProductNumber,
  1.15 + 0.17 * Weight as [Shipping Cost]
Resident DataSet;

This is my result:

And if I plot this data on a chart it will look like this:

Extremely linear…. And as I am sure you are well aware; this data is far to “clean”. We do not see this simple linear relation in real life. The data is not realistic.

Shipping Costs will probably not always follow such a simple formula. Perhaps some items have extreme dimensions and requires a different kind of transportation, but we don’t have the data for that… and we don’t want to make it over complicated to create it, but somehow we need to make the data more look like real data.

One simple way to do it is to add a random factor to the formula.

We replace the previous join with this one:

Left join (DataSet)
LOAD
  ProductNumber,
  (1.15 + 0.17 * Weight) * $(GenerateRandomNumber(0.95,1.05,0.01))  as [Shipping Cost]
Resident DataSet;

This will add a random factor of +-5% to every Shipping Cost.

But we are still missing some outliers! Let us add those with some “completely” random values:

concatenate(DataSet)
LOAD
  rowno() as ProductNumber,
  $(GenerateRandomNumber(1,100,1)) as Weight,
  $(GenerateRandomNumber(10,15,0.01)) as [Shipping Price]
autogenerate(3);

If You run this code for 1000 rows instead of just 10 – this is the result:

Now I think it looks like realistic data!

As a side note – in case you want to do the opposite – remove outliers, check out the blog by Mária Šándorová‘s post about identifying and removing outliers!

Gaussian Blur Data

With this method you can create all sorts of data. Using some something called “gaussian blur” you can even create data for classification algorithms like this:

This data above is created with this code:

Set vRandA1_Max = $(GenerateRandomNumber(20,100,0.001));
Set vRandA1_Min = $(GenerateRandomNumber(20,60,0.001));
Set vRandB1_Max = $(GenerateRandomNumber(100,120,0.001));
Set vRandB1_Min = $(GenerateRandomNumber(10,30,0.001));
Set vRandA2_Max = $(GenerateRandomNumber(40,60,0.001));
Set vRandA2_Min = $(GenerateRandomNumber(70,80,0.001));
Set vRandB2_Max = $(GenerateRandomNumber(100,120,0.001));
Set vRandB2_Min = $(GenerateRandomNumber(50,70,0.001));
Set fxGaussianRandomA = sqrt(-2*log(rand())) * cos(2* pi() * rand());
Set fxGaussianRandomB = sqrt(-2*log(rand())) * cos(2* pi() * rand());

Gaussian:
LOAD
Rowno() as Bubble,
  round(($(fxGaussianRandomA) * $(vRandA1_Max)) + $(vRandA1_Min)) AS A,
  round(($(fxGaussianRandomB) * $(vRandB1_Max)) + $(vRandB1_Min)) AS B,
  $(GenerateRandomNumber(2,100,1))                                AS BubbleSize,
  1                                                               AS BubbleColor
AUTOGENERATE(500);

concatenate(Gaussian)
LOAD
  Rowno() as Bubble,
  round(($(fxGaussianRandomA) * $(vRandA2_Max)) + $(vRandA2_Min)  AS A,
  round(($(fxGaussianRandomB) * $(vRandB2_Max)) + $(vRandB2_Min)) AS B,
  $(GenerateRandomNumber(2,100,1))                                AS BubbleSize,
  2                                                               AS BubbleColor
AUTOGENERATE(400);

Seasonal Data

This concept can of course also be used to create Seasonal Sales data.

And this is how to create that data:

map_seasonal:
Mapping Load
  RowNo() as Month,
  1 + sin(RowNo())/2 as SeasonalFactor
autogenerate(12);

Seasonal:
LOAD
  Rowno() as Order,
  $(GenerateRandomNumber(1,100,1)) as Customer,
  $(GenerateRandomNumber(1,12,1)) as Month,
  $(GenerateRandomNumber(1,10000,1)) as [Sales dummy]
autogenerate(10000);

Left join (Seasonal)
LOAD
  Order,
  Customer,
  ApplyMap('map_seasonal', Month,1)  * [Sales dummy] as Sales
Resident Seasonal;

More to come

With this kind of code, you can release yourself from spending time understanding the data, and focus instead on the actual, machine learning code.

We know precisely what part of the data that will be “tricky” for the algorithm to figure out, and what we can expect as result – how good the algorithm is working. If we generate data with very little randomness, we can expect the algorithm to be very close to 100% predictions, and the more randomness we use, the harder it will be for any algorithm to predict anything.

More use cases!

What is another cool thing, is that we can also use this method to generate data in our normal BI business! How many times have you as a BI developer been asked to make a report before you have access to data? With this method, you can just generate the data you expect to have, build the report, then wait for the real data to arrive!

We can create master tables with Item numbers, customer number, sales transactions, time stamps, shipments, inventory values, etc etc.


Article Resources

If you are interested in downloading the example app, you can get it here: