Preparing Your Unstructured Data In Qlik

Based on my previous blog post about unstructured data in general, we already know that data preparation is one of the 4 main steps in unstructured data analysis. Having clean and optimized data can shorten processing time enormously and also lead to more accurate results.

Parse data, normalize words, skip the punctuation, ignore short words if they aren’t relevant. Naturally, you can do all this outside of Qlik Sense and load results in your app. However, you can also consider doing it directly in your Qlik Sense. And no, I’m not talking about the SSE implementation, I mean the Qlik script as it is.

PARSING TEXTS

If you don’t use the unstructured data connector or another approach that parses the text for you, you need to parse the text during the reload of the app. Do you know the subfield() function? I’m pretty sure, you do. But did you know its third parameter is not required? If you use subfield(‘A;B;C’,’;’) in the script, it returns you 3 rows with values A, B, and C.

subfield('A;B;C',';') used in Qlik script returns 3 rows

NORMALIZING WORDS

Keeping the difference between London, LONDON, and london isn’t important very often. Analyzing all these words separately will multiple the processing time… not to mention memory utilization during the reload of the app. If you’re now thinking “really? Is this sooo big difference?” my answer is clear: “Yes. Because you don’t develop a solution for analyzing tens of words. You don’t need a solution for it.” Once you decide to develop a tool for analyzing unstructured data you need to think about the big data right from the beginning.

Once you decide to develop a tool for analyzing unstructured data you need to think about the big data right from the beginning.

In Qlik you can use both functions, lower() or upper() for normalizing words. It’s only up to you if you prefer having LONDON or London in the result. For end-user, you can use the capitalize() function when representing data.

lower(word) = lower_letters
upper(word) = UPPER_LETTERS
capitalize(word) = 'Capitalized Words'

INGORING PUNCTUATION

In the Slovak language, we have all these types of a simple lowercase a: a, á, ä. In a world where so many people ignore the punctuation in their communication, does it make sense to analyze München and Munchen? For many use cases, it doesn’t. Sometimes, it does.

If you want to ignore the punctuation, here is a Qlik trick – mapsubstring(). Yes, there is a function that can do the applymap() magic not on a word level but on a letter level instead! It can replace a letter for a longer word, but for our use case, we will just replace a letter with a letter.

Obviously, we need to define a mapping table for this letter conversion. If you want to save time, use this magic in combination with normalizing words – you don’t need to define the mapping for both, upper and lowercase versions.

[M_letters]:
mapping
load * inline
[letter,letter_norm
à,a
á,a
â,a
ã,a
ä,a
å,a
ă,a
č,c
ď,d
è,e
é,e
ê,e
ë,e
ě,e
í,i
ĺ,l
ľ,l
ň,n
ó,o
ô,o
ö,o
ō,o
ŏ,o
ő,o
ŕ,r
ř,r
ś,s
ŝ,s
š,s
ť,t
ú,u
ü,u
ũ,u
ū,u
ŭ,u
ů,u
ű,u
ý,y
ź,z
ž,z
];
mapsubstring('M_letters','München') = 'Munchen'
mapsubstring('M_letters','guľôčka') = 'gulocka'

IGNORING SHORT OR SPECIFIC WORDS

Based on the assumption we’re developing the solution with a big data approach, think about all types of words we don’t need to analyze all the words. ‘a’, ‘and’, ’however’ can be irrelevant. Use where clause in your script as soon as possible to optimize memory utilization.

Where len(word) > 3 is an example for length restriction. If you want to ignore some specific words you can use match(), wildmatch(), or exists().

…where len(word)>3
…where not(match(word,’and’,’however’,’therefore’))
…where not(wildmatch(word,’ther*’,’?owever’))

[words_to_ignore]:
Noconcatenate
Load * inline
[word
and
or
however
];

…where not exists(word);

Combining all these approaches together you are a big step closer to analyzing unstructured data in your Qlik app in an efficient way.

Stay tuned, the analysis part will be the topic of my next blog! 😉