^[[:alnum:]._-]+@[[:alnum:].-]+$

Although the first row of this article may seem like the outcome of a keyboard rage that was overlooked by our editors, it is actually a pretty useful and logical sequence of characters. With this little fella, I could check if the email address I received is valid or if somebody tried to fool me. How many times have you thought…

Man, I really wish my favourite BI tool could regex!

Whether your answer is “Too many times!” or “Reg… what?”, this miniseries is just for you! It will explain what regular expressions are, why you should learn about them and how to implement them in Qlik Sense. We will also cover some basic use cases in terms of data analysis. Let’s dive in!

Imagine…

It’s Monday. After your morning coffee and the classic chit-chat with some colleagues about your weekend, you finally go through your unread emails and there it is:

Your password will expire in 1 day. To change your password go to…

You really should stop ignoring the reminder, the time has come. Get ready for the following days when you simply won’t login on the first try. OK, let’s do this. You already came up with perfect password: irock2, because the last one was irock1 and you (and all hackers) already know the next one will be irock3. But something is wrong. The password change is not allowed.

Your password does not satisfy the current requirements. Your password should:

1. Be at least 8 characters long
2. Contain at least 2 upper-case letters
3. Contain at least 1 lower-case letter
4. Contain at least 1 number
5. Contain at least 1 special character !, @, #, $, & or *

How do they know if the password meets all those requirements? Do they have some database of all possible passwords? Any number? Isn’t that like infinite number of options? What kind of sorcery is this? The answer is regular expressions or RegEx.


What are regular expressions?

Regular expressions are sequences of characters which define the pattern of a string. They follow a specific syntax where there are metacharacters with special meaning (e.g. *, ?) and regular characters with literal meaning. Let me explain it in Qlik language: You are probably familiar with using ?, ^ and * when searching for specific values or using the WildMatch() function.

? represents a single character, including blank.
* represents zero or more characters, including blank.
^ represents beginning of word within field value.

If I want to see all companies starting with an ‘A’, I should write ‘A*’ in the search menu.

A is the regular character and we are actually looking for the capital A in the company name.
* is the metacharacter. Letter ‘A’ can be followed by 0 or any number of any characters (Abc, Alpha1, Ateam, etc.).

Unfortunately, that is about all you can do with the native Qlik Sense features. However, if you get full access to RegEx with R or Python integration, a whole new universe opens  up right in front of you! Try this simple password evaluator which is done in Qlik Sense using R functions. Write some sequence of characters into the input box, hit Enter and see if you would meet the password requirements specified above.



RegEx decomposition

We used the stringr library and the str_detect() function in the previous example. In our next article, we will explain them in detail. For now, let’s analyze all the elements of regular expressions that we used to evaluate a password.

1. Password should be at least 8 characters long: .{8,}
. matches any character
{8,} is quantifier — matches between 8 and unlimited times

2. Password contains at least 2 upper-case letters: [A-Z].*[A-Z]
[A-Z] matches a single character in the list A-Z (case sensitive)
. matches any character
* quantifier – matches between 0 and unlimited times

3. Password contains at least 1 lower-case letter: [a-z]
[a-z] matches a single character in the list a-z (case sensitive)

4. Password contains at least 1 number: \d
\d matches a digit, equal to [0-9]

5. Password contains at least 1 special character !@#$&*: [!@#$&*]
[!@#$&*] matches a single character present in the list !, @, #, $, & and *


Hyperlinks extraction

The password use case is a common one, but you are not very likely to need it in your Qlik applications. Shall we move to something more useful? Let’s say we have a long unstructured piece of data and you’d like to extract all the hyperlinks into a table.

A valid regular expression for detecting http(s) hyperlinks is:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

In the following example I copied some text from Wikipedia. There are 3 hyperlinks hidden. With the regular expression above, I am able to extract them into a table and make them interactive. Try changing the regular expression to \d+ to extract first 10 numbers in text.



Why should I care for RegEx as a data professional?

Data is wild and messy in its natural environment and you have to tame it before you use it. Regular expressions are great for:

Data errors detection: Emails, IP addresses, phone numbers, etc. All those data have specific structure that can be checked using RegEx.
Unstructured data analysis: Maybe you need to review log files? Or scripts? RegEx is your friend!
Sensitive data detection: Addresses, emails, phone numbers, IBANs, etc. should be detected, hidden by default and shown only if necessary.


Article Resources

If you are interested in downloading the example app, you can get it here:

Please note that the app contains R scripts which will be correctly evaluated only if you have R and Qlik correctly integrated. For more details, follow this link: SSE R-plugin.

 

In the following articles, we will explain how the 2 examples embedded in this post work from more technical point of view. See you then!