How to create complex Regular Expressions (regex) conditions

When working with MessageBird products, you might need to use text matching, for example, if you are detecting a phone number, an email address, or a specific pin code.

Sometimes, basic conditions like “equals” and “starts with” are too limiting for your use case. For example, conditions like these might only match a single value, or fail to match because they are case sensitive.

This is why Flow Builder also handles Regular Expressions (regex). Regex is a syntax that lets you define search patterns for text (string) values. That may sound boring, but it is kind of like a superpower! You only need a bit of syntax to define a custom pattern for your exact use case. This document will provide some guidance and use cases to show how you can use regex to create complex conditions and superpower your flow!

There are a few different flavors of regex. In Flow Builder, we use the RE2 flavor/standard. You can find the full syntax overview for it on the RE2 GitHub page. There are also several online tools, such as Rego, where you can test your regular expressions to see if they match the intended values.

Case sensitivity

If you want to match a piece of text case-insensitively, you can use the case-insensitive flag. Since we want to apply this flag to our entire expression, we define it as a separate group at the start. The syntax for this is `(?i)`, where `i` means we want to enable the case-insensitivity flag. For example, a full expression would look like this:

(?i)(hello world)

This regex would match the text “hello world” but also “Hello world” and “hEllO WoRLd”.
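To try the flag outside Flow Builder, here is a minimal sketch. It uses Python's built-in `re` module as a stand-in for RE2 (an assumption: the two engines differ, but they treat `(?i)` the same way), and it assumes the pattern is written with a space between the words. The sample texts are made up for illustration.

```python
import re

# Pattern with the case-insensitive flag at the start,
# written with a space between the two words.
pattern = re.compile(r"(?i)(hello world)")

# Illustrative sample texts (not from any real flow).
samples = ["hello world", "Hello world", "hEllO WoRLd", "goodbye world"]

# search() reports a match when the text contains the pattern anywhere.
case_results = {text: bool(pattern.search(text)) for text in samples}

for text, matched in case_results.items():
    print(f"{text!r}: {matched}")
```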

OR logic

To match multiple values you can use a logical OR operator. The syntax for it is the pipe character `|`. An example of a logical OR expression would look like this:

hello|world

This expression would match on text containing either “hello” or “world”.

You can also apply OR expressions to subparts of your text. This can be useful when you notice users tend to make small typos when sending messages and exact matches fail. An example regex would be:

direcci(o|ó)n

This matches both “direccion” and “dirección”. We need to wrap the OR part in parentheses to group it; otherwise the OR would apply to the full regex and we’d be checking for “direccio” or “ón”.
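The effect of grouping can be checked with a small sketch. It uses Python's `re` module as a stand-in for RE2 (an assumption; for `|` and parentheses the two behave the same), and the helper name `contains` is hypothetical.

```python
import re


def contains(pattern: str, text: str) -> bool:
    """Hypothetical helper: True when the text contains a match for the pattern."""
    return re.search(pattern, text) is not None


grouped = r"direcci(o|ó)n"   # OR applies only to the o/ó inside the parentheses
ungrouped = r"direccio|ón"   # OR applies to the whole regex: "direccio" or "ón"

for text in ["direccion", "dirección", "direccio", "camión"]:
    print(text, contains(grouped, text), contains(ungrouped, text))
```

The ungrouped version also matches unrelated words such as “camión”, simply because they contain “ón”.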

Optional characters

To mark a character of the regex as optional you can use the `?` syntax. For example, a regex could look like this:

renouvell?emm?ent

This will match on “renouvellemment”, “renouvelemment”, “renouvellement” and “renouvelement”, since each of the two optional characters can be present or absent.
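A quick way to see which spellings the two optional characters allow is to run the pattern over a few candidates. This is a sketch using Python's `re` module as a stand-in for RE2 (an assumption; `?` behaves the same in both). The last candidate, with three l's, is added only to show a spelling that does not match.

```python
import re

pattern = re.compile(r"renouvell?emm?ent")

# Each "?" makes the character before it optional,
# so a second "l" and a second "m" may each be present or absent.
candidates = [
    "renouvellemment",
    "renouvelemment",
    "renouvellement",
    "renouvelement",
    "renouvelllement",  # three l's: one more than the pattern allows
]

optional_results = {word: bool(pattern.search(word)) for word in candidates}

for word, matched in optional_results.items():
    print(f"{word}: {matched}")
```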

Exact matches

The above examples will all match if the text that they are run on contains the value we’re looking for. If you want to look for a more exact match you can specify how the text has to start and end. For example, consider the following two regular expressions:

hello|world
^(hello|world)$

Both will match on “hello” and “world”, but the first one will also match on text such as “hellooo” and “worlds”. In the second regex we specify a start (“^”) and end (“$”) point. If the text does not exactly match the start and end condition, we get no match. We also need to wrap the subpart “hello|world” in parentheses to group it. If we omit this, we would check for text starting with “hello” or ending in “world”, making “hellooo” a match but “worlds” not.
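The difference between the loose, anchored, and anchored-but-ungrouped variants can be sketched as follows. Python's `re` module is used here as a stand-in for RE2 (an assumption; for these inputs `^` and `$` behave the same in both).

```python
import re

patterns = {
    "loose": r"hello|world",          # matches anywhere in the text
    "exact": r"^(hello|world)$",      # whole text must be "hello" or "world"
    "ungrouped": r"^hello|world$",    # starts with "hello" OR ends with "world"
}

texts = ["hello", "hellooo", "worlds"]

# anchor_results[name][text] is True when the pattern matches the text.
anchor_results = {
    name: {text: bool(re.search(regex, text)) for text in texts}
    for name, regex in patterns.items()
}

for name, row in anchor_results.items():
    print(name, row)
```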

Combining syntax

Of course, you can also combine the syntax to create more complex regex conditions. An example that combines case insensitivity, OR logic, and an optional character would look like this:

(?i)(hell?o|world)

The above regex would match, amongst others, on “hello”, “WORLD”, and “HeLo”.
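As a quick check of the combined pattern, here is a sketch using Python's `re` module as a stand-in for RE2 (an assumption; the constructs used here behave the same in both). The non-matching samples are made up for illustration.

```python
import re

# Case-insensitive flag + OR logic + an optional second "l".
pattern = re.compile(r"(?i)(hell?o|world)")

samples = ["hello", "WORLD", "HeLo", "help", "Hola"]

combined_results = {text: bool(pattern.search(text)) for text in samples}

for text, matched in combined_results.items():
    print(f"{text!r}: {matched}")
```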

Email address example

We can use regex syntax to do some basic email address validation. Characteristics of an email address are that it must contain the @ sign with some text in front of it as well as after it. Also, the text after the @ needs to contain a “.” for the domain. We can define the following regex based on these rules:

.+@.+\..+

The combination of “.” (any character) and “+” (one or more) means a row of one or more arbitrary characters. Then comes the @ sign, followed by another row of arbitrary characters. With “\.” we escape the period, meaning it needs to be an actual period character. We need to use “\” to escape any characters that also have meaning in the regex syntax. Lastly, we can have another row of arbitrary characters for the domain extension.

This is a very broad email validator that would also accept invalid email addresses like a@b.c. We can improve the validator a bit. For example, we could accept only valid characters instead of any character, and accept only domain extensions of 2 to 4 characters (.co or .com, for example). If we do this we get the following regex:

^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,4}$

This regex breaks down like this:

  • Start with a row of valid email address characters (basic alphanumerical in lowercase + a small set of special characters)

  • Then the @ sign

  • Then another set of valid email domain name characters (notice no “%” and “+” in the domain)

  • Followed by a period (“\.”)

  • Lastly a top-level domain of 2 to 4 alphabetic characters

Note: It is important to consider your customers here, since the above regex would not work for all valid email addresses. For example, if the address uses a different alphabet, such as Greek or Japanese, the above regex would consider it invalid. The same goes for uppercase letters, unless you add the `(?i)` flag. Also, it still accepts short but invalid top-level domains. We could replace the last part with a list of valid top-level domains, but that would make the regex too long to show here.
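To compare the broad and the stricter validator side by side, here is a minimal sketch. It uses Python's `re` module as a stand-in for RE2 (an assumption; the two engines differ, but these patterns behave the same on the inputs shown). The sample addresses are made up for illustration.

```python
import re

# Both patterns are copied from the text above.
BROAD = re.compile(r".+@.+\..+")
STRICT = re.compile(r"^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,4}$")

# Illustrative addresses, not real ones.
samples = [
    "jane.doe@example.com",     # ordinary lowercase address
    "a@b.c",                    # one-letter domain extension
    "Jane.Doe@example.com",     # uppercase letters
    "χρήστης@παράδειγμα.gr",    # Greek characters
    "no-at-sign.example.com",   # no @ sign at all
]

# email_results[address] is a (broad, strict) pair of booleans.
email_results = {
    address: (bool(BROAD.search(address)), bool(STRICT.search(address)))
    for address in samples
}

for address, (broad, strict) in email_results.items():
    print(f"{address}: broad={broad}, strict={strict}")
```

The uppercase and Greek samples illustrate the trade-off from the note: the stricter pattern rejects more invalid input, but also some addresses your customers may really use.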

Phone number example

Phone numbers are even more complex than email addresses. It is practically impossible to accurately validate international phone numbers using a regular expression, yet if we know our customers we can make a good attempt. For example, if we know we only need to recognize Dutch phone numbers in an international format, we know how they are constructed: they start with a + followed by the Dutch country code 31. After that, we get a 2- or 3-digit area code, and we end with a 6- or 7-digit subscriber number. Without whitespace, that gives something like +31201234567 (landline) or +31612345678 (mobile), which translates to the following basic regex:

^\+[0-9]{11}

It breaks down like this:

  • “^\+” means it needs to start with a plus character

  • “[0-9]{11}” means it is followed by 11 digits

Of course, this has some problems: it will not match if the “+” or country code is missing, or if there is any whitespace in between. Because there is no “$” at the end, it also accepts numbers with more than 11 digits. Some of this we could try to solve by instructing the person entering the value, but we can also improve the regex a bit.
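The limits of the basic pattern are easy to see on a few inputs. This sketch uses Python's `re` module as a stand-in for RE2 (an assumption; this pattern behaves the same in both). The 12-digit sample is made up to show that, without an end anchor, longer numbers still match.

```python
import re

BASIC_PHONE = re.compile(r"^\+[0-9]{11}")

samples = [
    "+31201234567",     # landline example from the text
    "+31612345678",     # mobile example from the text
    "31201234567",      # missing "+"
    "+31 20 1234567",   # contains whitespace
    "+312012345678",    # 12 digits: one too many
]

basic_results = {number: bool(BASIC_PHONE.search(number)) for number in samples}

for number, matched in basic_results.items():
    print(f"{number!r}: {matched}")
```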

According to Wikipedia, Dutch people commonly use the following formats for their phone numbers: 0xx xxxxxxx / 0xxx xxxxxx (geographical), 06 xxxxxxxx (mobile). Based on this we can make an educated guess about where to expect whitespace. If we also make the + optional but keep the country code required, we get the following regex:

(^\+?)(31)(\s?)(0?)([0-9]{1,3})(\s?)([0-9]{6,8}$)

All separate parts are wrapped in “()” to create groups which helps to break it down:

  • “^\+?” - optionally start with a +

  • “31” - followed by the Dutch country code 31

  • “\s?” - optionally followed by a whitespace character (such as a space)

  • “0?” - optionally followed by a 0

  • “[0-9]{1,3}” - 1 to 3 numeric characters

  • “\s?” - again an optional whitespace character

  • “[0-9]{6,8}$” - ends with 6 to 8 numeric characters
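The breakdown above can be tried out with a short sketch. It uses Python's `re` module as a stand-in for RE2 (an assumption; the pattern behaves the same in both on these inputs). The sample numbers are made up and follow the formats mentioned in the text.

```python
import re

# Pattern copied from the text above.
DUTCH_PHONE = re.compile(r"(^\+?)(31)(\s?)(0?)([0-9]{1,3})(\s?)([0-9]{6,8}$)")

samples = [
    "+31201234567",      # no whitespace
    "+31 20 1234567",    # whitespace after country code and area code
    "31 06 12345678",    # no "+", leading 0 kept
    "+31612345678",      # mobile, no whitespace
    "0612345678",        # country code missing
    "+32201234567",      # different country code
    "+31 20 123 4567",   # whitespace inside the subscriber number
]

phone_results = {number: bool(DUTCH_PHONE.search(number)) for number in samples}

for number, matched in phone_results.items():
    print(f"{number!r}: {matched}")
```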

As you can see from the above example it helps to know a bit of context before constructing your regex. Alternatively, you can keep it very broad to not mark valid phone numbers as invalid.

Also, it is important to note that the regex will not alter the value. For example, when we mark whitespace as optional, it will not be taken out when it is there.
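To illustrate that matching leaves the value untouched, the sketch below matches a number containing whitespace and then removes the whitespace in a separate step. It uses Python's `re` module as a stand-in for RE2, and the clean-up with `re.sub` is only an illustration of a separate normalization step, not something Flow Builder is confirmed to do this way.

```python
import re

DUTCH_PHONE = re.compile(r"(^\+?)(31)(\s?)(0?)([0-9]{1,3})(\s?)([0-9]{6,8}$)")

raw_value = "+31 20 1234567"
match = DUTCH_PHONE.search(raw_value)

# The matched text is the original value, whitespace included.
matched_text = match.group(0) if match else None

# Removing the whitespace is a separate step after matching.
cleaned = re.sub(r"\s", "", matched_text) if matched_text else None

print(repr(matched_text))
print(repr(cleaned))
```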

Lastly, when trying to recognize entities such as email addresses, phone numbers, or postcodes, also consider using the Entity Recognition step. This step uses machine learning to automatically detect entities in messages.
