Monday, 11 December, 2017 UTC


Summary

I still remember my doomed encounters with regular expressions back when I tried to learn them. In fact, I took pride in not using regular expressions. I always found a workaround, usually a much longer code snippet. I blamed the poor readability of regular expressions for what was really my own lack of expertise. This continued until I was ready to face the truth: regular expressions are very powerful, and they can save you a lot of time and headaches.
Fast forward a couple of years. People I worked with encountered the same problems. Some knew regular expressions, others hated them. Among those who hated regular expressions, it was quite common to actually like the syntax and grammar of their first programming language. Some developers had even taken courses on formal languages. Therefore, I made it my priority to show everyone a path towards mastering the regular expression knowledge they had disowned.

What are regular expressions?

Regular expressions, or regexes for short, come from the theory of formal languages. In theory, a regex is a finite character sequence defining a search pattern. We often use these search patterns to
  • test whether a string conforms to a search expression,
  • find the first, the last, or all matching subsequences in a string,
  • replace substrings in a string matching a regex,
  • process user input,
  • extract information from server logs, configuration files, and text files,
  • validate input in web applications and on the command line.
A typical regular expression task is matching. We will use JavaScript to illustrate the usage of regular expressions, because most of my readers have access to a browser. In the browser, you have to open the Developer Tools. In Google Chrome, you can do this by right-clicking on a website and selecting Inspect. Inside the developer tools, select the Console tab to enter and evaluate your JavaScript expressions.
Suppose there is a JavaScript regular expression /re/. This expression looks for a pattern inside a string, where there is an r character, followed by an e character. Matching is case sensitive by default, so R and r are different characters. Suppose we have two test strings, s1 and s2:
const s1 = 'Regex';
const s2 = 'regular expression';
In JavaScript, strings have a match method. This method expects a regular expression, and returns some data on the first match.
> s1.match( /re/ )
null

> s2.match( /re/ )
["re", index: 0, input: "regular expression"]
Notice that 'Regex' starts with an uppercase R, so it does not contain the lowercase substring 're'. Therefore, there are no matches.
The string 'regular expression' contains the substring 're' twice: once at position 0, and once at position 11. By default, the JavaScript regular expression engine returns only the first match, at index 0, and then stops.
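To see both occurrences rather than only the first one, one option is the g (global) flag combined with matchAll. This is a minimal sketch and not part of the original console session: it assumes a runtime that supports String.prototype.matchAll (ES2020), and the names text and allIndexes are made up for illustration.

```javascript
// Assumption: the runtime supports String.prototype.matchAll (ES2020).
const text = 'regular expression';

// The g (global) flag asks the engine to keep searching after the first match.
// matchAll returns an iterator of match objects; we keep only their positions.
const allIndexes = Array.from(text.matchAll(/re/g), (m) => m.index);

// Logs the two positions mentioned above: 0 and 11.
console.log(allIndexes);
```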
JavaScript allows us to turn the syntax around by testing the regular expression:
> /re/.test( s1 )
false

> /re/.test( s2 )
true
The return value is a simple boolean. Most of the time, we don’t need anything more, so testing the regular expression is sufficient.
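Replacing, another use from the list at the beginning of this section, follows the same first-match behavior unless we ask for more. The following is a small sketch; the variable names are hypothetical, and the g (global) flag is the only addition compared to the /re/ expression used so far.

```javascript
const sentence = 'regular expression';

// Without flags, only the first match is replaced.
const firstOnly = sentence.replace(/re/, 'RE');

// With the g (global) flag, every match is replaced.
const everyMatch = sentence.replace(/re/g, 'RE');

console.log(firstOnly);  // REgular expression
console.log(everyMatch); // REgular expREssion
```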
Each programming language has different syntax for built-in regex support. You can either learn them, or generate the corresponding regex code using an online generator such as https://regex101.com/.

Frustrations with regular expressions arise from a lack of action

According to most people, regular expressions are
  • hard to understand,
  • hard to write,
  • hard to modify,
  • hard to test exhaustively,
  • hard to debug.
As I mentioned in the introduction, lack of understanding often comes with blame. We tend to blame regular expressions for all five of these problems.
In order to figure out why this blaming exists, let’s follow the journey of a regular developer, no pun intended, with regexes. Many of us default to trial and error when it comes to playing around with something we don’t know well. With regular expressions, the task seems just too easy: we just have to create a short expression, right? Well, this point of view is often very wrong.
Trial and error often takes more time than addressing the real pain point and curing the lack of knowledge behind it. Yet, most developers do this over and over again.
This is because learning regular expressions seems to be too hard at first glance. Therefore, my mission is to show you that
  • learning regular expressions is a lot easier than you thought,
  • knowing regular expressions is fun,
  • knowing regular expressions is very beneficial in many areas of your software developer career.
My promise to you is that you can easily master regular expressions to the extent that they will do exactly what you intended them to do.

Regular Expressions are imperative

Regular expressions are widely misunderstood. People who taught you regular expressions either come from a theoretical point of view using formal languages and computer science, or they developed their understanding using trial and error.
Whenever you hear that regular expressions are declarative, run from that tutorial or blog as far as you can. A regex is an imperative language. It’s like JavaScript, except that the syntax is different. If you want to understand regexes as declarative, chances are, you will fail.
According to the theoretical definition above, regexes specify a search pattern. Although this is a true statement, it is easy to misinterpret it, because we are not specifying a declarative structure. In the real world, we specify a sequence of instructions acting like a function in an imperative programming language. We use commands, loops, we pass arguments to our regex, we may pass arguments around inside our regex, we return a result, and we may even cause side-effects.
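One way to make this reading concrete is to annotate a small regex as if it were a program. The sketch below only illustrates the analogy; it does not describe how any particular engine is implemented, and the pattern, the sample strings, and the variable names are made up for this purpose.

```javascript
// The pattern read as a sequence of instructions:
//   (a|b)  try 'a', otherwise 'b', and store the result in group 1
//   \1     reuse the stored value: the same character must follow
//   c+     loop: consume 'c' one or more times
const doubled = /(a|b)\1c+/;

// The string is the argument, the match object is the returned result.
const result = 'xbbccc'.match(doubled);

console.log(result[0]);           // bbccc  (the whole match)
console.log(result[1]);           // b      (the value stored in group 1)
console.log(doubled.test('abc')); // false  ('a' is not followed by another 'a')
```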
If you have dealt with at least one programming language in your life, chances are, you already know almost everything you need to understand regular expressions. You are just not yet proficient in this weird language describing regular expressions. As soon as you familiarize yourself with it, everything will fall into place.

The Language Family of Regular Expressions

When we talk about regular expressions, in practice, we mean a family of different dialects. Similarly to genetics, regular expressions keep evolving, and new mutations surface on a regular basis. Although the principles stay the same in most languages, every single dialect brings something different.
Standardization of regular expressions began with BRE (Basic Regular Expressions) inside the POSIX standard 1003.2. This standard is used in the editors ed and sed, as well as in the grep command.
The first major evolution of regular expressions came with the ERE (Extended Regular Expressions) syntax. This syntax is used in e.g. egrep and awk.
For completeness, we can also mention the SRE (Simple Regular Expressions) dialect, which has been deprecated in favor of BRE.
Some editors such as EMACS and VIM have their own dialects. In case of VIM, the dialect can be customized with flags, which provides even more variations. All dialects are built on top of ERE.
The regular expressions used in most programming languages are based on the PCRE (Perl Compatible Regular Expressions) dialect. Each programming language adds its own shorthands and differences. This family includes Perl itself, up to version 5.
To make matters more complicated, Perl 6 comes with a completely different set of rules for regular expressions. The Perl 6 syntax is often easier to read, but in exchange, we have to learn a different language.
As an example, let’s write a regex for matching strings that contain at least one non-numeric character.
BRE:     /[^0123456789]/
ERE:     /[^0123456789]/
EMACS:   /[^0123456789]/
VIM:     /[^0123456789]/
PCRE:    /[^0123456789]/
Perl 6:  /<-[0123456789]>/
As you can see, all dialects but Perl 6 look identical. Without getting lost in the details too much, I invite you to understand what this expression means in the first five dialects:
  • [0123456789] matches one single character from the character set.
  • ^ at the start of a character set negates the character list. This means that [^0123456789] matches any character that’s not a digit.
  • As the regular expression may match any character of our test string, a match is determined as soon as we find at least one character in our test string that’s not a digit. Therefore, 123.45 matches the regular expression, while 000 does not.
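JavaScript regex literals use the same form as the PCRE line above, so the two test strings can be checked in the console. This is a minimal sketch; the name nonDigit is hypothetical.

```javascript
// Matches as soon as one character outside 0-9 is found anywhere in the string.
const nonDigit = /[^0123456789]/;

console.log(nonDigit.test('123.45')); // true, because of the '.' character
console.log(nonDigit.test('000'));    // false, every character is a digit
```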
The Perl 6 syntax can be explained in the same way.
Let’s now write a regular expression that matches one of the characters 0, 1, or 2, using the or operator of regular expressions.
BRE:    or operator is not supported
ERE:    /0|1|2/
EMACS:  /0\|1\|2/
VIM:    /0\|1\|2/
PCRE:   /0|1|2/
Perl 6: /0|1|2/
An equivalent BRE expression would be /[012]/, using a character set. We will study character sets in detail at a later stage.
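The equivalence of the two forms can be spot-checked in JavaScript, which accepts both the or operator and character sets. The sketch below is not a proof; it only compares the two expressions on a handful of made-up sample strings, and all names in it are hypothetical.

```javascript
const withOr = /0|1|2/;   // the or operator
const withSet = /[012]/;  // the character set

// Made-up samples: some contain 0, 1, or 2, some do not.
const samples = ['0', '12', 'a2b', '3', 'abc', ''];

// For every sample, both expressions should give the same answer.
const agree = samples.every((s) => withOr.test(s) === withSet.test(s));

console.log(agree); // true
```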
As studying six dialects and their many variations would take a long time, I highly recommend that you stick to one specific dialect, and practice your skills on the one dialect you use in practice. You can come back to study other dialects later. When it comes to the PCRE dialect, different languages give you different variations. I have personally found it beneficial to build and execute regular expressions in multiple programming languages. This way, I had an easier time solidifying my regex knowledge from different angles.

Summary

We have defined a regular expression as a finite character sequence defining a search pattern. As an example, you have seen a test execution of a simple JavaScript regular expression in the console. Although the tested regular expression was very simple, we often have a very hard time constructing and understanding regular expressions. This is because regular expressions represent a compact imperative language, and therefore, they are often not intuitive to understand. To make matters more complicated, regular expressions come in multiple dialects. For example, the JavaScript syntax differs from the syntax used in Perl 6.
In the next section, we will discover how to design and run regular expressions using different programming languages.