JSFeeds: zsoltnagy.eu - ES2018 Regular Expression Updates

Monday, 14 May, 2018 UTC

ES2018 Regular Expression Updates – an Opinionated Summary

Summary

The last few weeks made me think about renaming my book, because ES6 is officially called ES2015. Of course, when I started writing ES6 in Practice, I included those two small ES2016 updates. Then I promised a bonus chapter on ES2017, including the famous async-await update.

Now we are in 2018, and even though I didn’t make a promise to include ES2018 updates in the book, some information on ES2018 will end up there.

But that’s not all. Given that I happened to launch a book on Regular Expressions containing a JavaScript Regex video bundle update, I thought about publishing an article on the ES2018 updates on Regular Expressions.

Regarding the renaming of ES6 in Practice, let it be my concern for now. If you continue reading, your concern will be some major improvements to the usability of JavaScript regular expressions.

Your concern may be whether this article is right for you. If you are a complete beginner, you may use this article to widen your understanding of what regular expressions are used for. I encourage you to read some of my other regular expression articles first to get started though. If you have at least an average knowledge of regular expressions, you can use this article to get up to date with the new capabilities in JavaScript. If you are an expert in regexes, you may disagree with my opinion about what is missing from ES2018 regexes.

Do we finally get a powerful regex engine for JavaScript?

I cannot promise you everything. With the ES2018 updates though, we have significantly closed the gap between the JavaScript regex engine and PCRE-style regex engines.

These updates are geared towards practical use cases of regexes in JavaScript:

Lookbehinds
Named Capture Groups
Dotall Flag
Unicode Escapes

Let’s see the four updates one by one.

Lookbehind assertions

A lookbehind is the mirror image of a lookahead. It looks backwards from the current position and checks whether the specified pattern matches the text immediately before that position. A lookbehind is a zero-width assertion: when it succeeds, the characters it examined are not consumed and do not become part of the match. The syntax is as follows:

Lookbehind type	PCRE syntax
positive	`(?<=pattern)`
negative	`(?<!pattern)`

One implicit lookbehind construct exists before ES2018: the \b word boundary anchor. So technically, you could use word boundaries without using lookbehinds. We will explore the role of the word boundary anchor in an example:

Example 1: Determine if the string str contains a word starting with list, lost, or lust.

Solution:

/\bl[iou]st/.test( str );

In this example, we were looking for the pattern l[iou]st in our test string. Before the l character though, the \b word boundary anchor filtered our results by requiring that no word character (a letter, a digit, or an underscore) stands directly in front of the l.
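To make the "implicit lookbehind" idea concrete, here is a minimal sketch comparing the word boundary with an explicit negative lookbehind. The sample strings and variable names are my own illustration. Because the character after the boundary is always an l, \b and (?<!\w) should behave identically in this pattern.

```javascript
// \b in front of "l" means: the previous character is not a word character.
// (?<!\w) states the same condition as an explicit negative lookbehind.
const withBoundary = /\bl[iou]st/;
const withLookbehind = /(?<!\w)l[iou]st/;

// Hypothetical sample inputs, chosen to cover both outcomes.
const samples = ['the lost key', 'a-list', 'blister', 'wish_list'];

const results = samples.map(s => [s, withBoundary.test(s), withLookbehind.test(s)]);
console.log(results);
// 'the lost key' and 'a-list' match, 'blister' and 'wish_list' do not,
// because "b" and "_" are word characters.
```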

Example 2: Let’s see some examples for positive and negative lookbehinds.

/(?<=a)b/ matches a b character in a string in case it stands right after an a character. The a character is not included in the match:

> /(?<=a)b/.exec('bb')
null

> /(?<=a)b/.exec('ab')
["b", index: 1, input: "ab", groups: undefined]

The /(?<!a)b/ negative lookbehind assertion matches the character b in a string in case the character before b is not an a:

> /(?<!a)b/.exec('ab')
null
> /(?<!a)b/.exec('bb')    
["b", index: 0, input: "bb", groups: undefined]

You can see in both solutions that no capture groups were created. In order to capture the value of a lookbehind in a capture group, you can add parentheses around the expression you want to capture:

/(?<=(a))b/.exec('ab')
(2) ["b", "a", index: 1, input: "ab", groups: undefined]

/(?<=(a))b/.exec('ab')[1]
"a"

In this expression, the resulting regex match object contains the captured a character under the index 1.

The groups property of the match result is still undefined though. This is what we will deal with in the next section.
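Before moving on, here is a slightly more practical sketch of both lookbehind types. The sample text and variable names are assumptions made for illustration only: we pull out amounts that follow a currency symbol, and amounts that do not.

```javascript
// Hypothetical input mixing prefixed and unprefixed amounts.
const text = 'Coffee $13.50, tea €2.75, refill 1.00';

// Positive lookbehind: the amount must directly follow € or $.
// The symbol is checked, but it is not part of the match.
const prefixed = text.match(/(?<=[€$])\d+\.\d\d/g);

// Negative lookbehind: the amount must not follow € or $.
// \d is excluded as well, so we cannot start in the middle of "13.50".
const unprefixed = text.match(/(?<![€$\d])\d+\.\d\d/g);

console.log(prefixed);   // [ '13.50', '2.75' ]
console.log(unprefixed); // [ '1.00' ]
```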

Named capture groups

Let’s introduce named capture groups using an exercise. Suppose we would like to retrieve the currency, the numeric price value, and the full price with currency in a string of format:

Price: €19.00

The regular expression matching this string is the following:

/^Price: [€\$]\d\d\.\d\d$/

Outside a character class, $ is a metasyntax character denoting an end of string or end of line anchor. Inside a character class such as [€\$] it is a literal character, so the backslash is optional there; I keep it to make the intent obvious. We do have to escape the dot, because it is a metasyntax character denoting one arbitrary character.

After adding some parentheses to capture the values we wanted, the regular expression looks as follows:

/^Price: (([€\$])(\d\d\.\d\d))$/

We have three capture groups to access the data:

Capture group number	Data
1	full price
2	currency symbol
3	numeric price

From a maintainability point of view, using the indices 1, 2, and 3 to refer to these capture groups is not a brilliant idea.

Imagine for instance that requirements change such that Price may be multilingual, and you have to capture the price text in the language it appears:

/^(Price|Preis): (([€\$])(\d\d\.\d\d))$/

Bingo. Capture groups 1, 2, and 3 became 2, 3, and 4 respectively. You have to rewrite all your code processing these values.

This is why getting fed up with numbered capture groups is a healthy feeling. After the rage comes out, we will learn how to use names. Our capture groups will be as follows:

Capture group name	Data
<fullPrice>	full price
<currency>	currency symbol
<numPrice>	numeric price

We will use the syntax (?<name>content) to match content in capture group name.

/^Price: (?<fullPrice>(?<currency>[€\$])(?<numPrice>\d\d\.\d\d))$/

So, in order to create a named capture group, all we need to do is write a question mark after the opening parenthesis, then the capture group name inside angle brackets, that is, between a less than and a greater than symbol.

And we are done. Let’s execute the solution.

console.table( 
    /^Price: (?<fullPrice>(?<currency>[€\$])(?<numPrice>\d\d\.\d\d))$/
        .exec('Price: $15.99')
        .groups 
)

> (index)   Value
  currency  "$"
  numPrice  "15.99"
  fullPrice "$15.99"

Although you can still refer to the captured groups with their numeric indices 1, 2, and 3, you can also use their label names if you access the groups property.
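As a minimal sketch of how the names pay off in code, the groups object can be destructured, and the names can also be used in a replacement string with the $<name> syntax. The variable names and the reordered output format are my own illustration.

```javascript
const priceRegex =
  /^Price: (?<fullPrice>(?<currency>[€$])(?<numPrice>\d\d\.\d\d))$/;

// exec returns null when nothing matches, so guard before reading groups.
const match = priceRegex.exec('Price: $15.99');
const { currency, numPrice, fullPrice } = match ? match.groups : {};
console.log(currency, numPrice, fullPrice); // $ 15.99 $15.99

// Named groups in a replacement pattern: put the symbol after the number.
const reordered = 'Price: €19.00'.replace(
  priceRegex,
  'Price: $<numPrice> $<currency>'
);
console.log(reordered); // Price: 19.00 €
```

If another group is added in front of these later, none of the code above has to change.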

Last, but not least, it is possible to formulate a backreference on the capture group name using the format \k<groupName>. I will create a backreference to this concept, no pun intended, at the end of the article, when talking about the drawbacks of ES2018 regexes.

Named capture groups make your expressions more maintainable. That’s their purpose.
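For the impatient, here is a small sketch of a named backreference, using an example of my own: matching a quoted string that has to be closed by the same quote character that opened it.

```javascript
// (?<quote>['"]) captures the opening quote,
// \k<quote> requires the very same character as the closing quote.
const quoted = /(?<quote>['"])(?<body>.*?)\k<quote>/;

const result = quoted.exec(`say "it's fine" now`);
console.log(result.groups.quote); // "
console.log(result.groups.body);  // it's fine

// An opening double quote cannot be closed by a single quote.
console.log(quoted.test(`"mismatched'`)); // false
```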

Dotall Flag

This is a very simple update. As you might know, in JavaScript regular expressions, and in many PCRE-style regex engines for that matter, the dot does not match newline characters such as \n.

/./.test('\n')
false

You can see that false is returned when we test it.

In ES2018, we can add an s flag, to make dots match the newline character:

/./s.test('\n')
true

To be exact, there are other line terminators such as the carriage return, \r, or the line separator and paragraph separator characters, which are U+2028, and U+2029 respectively. But in essence, we are fishing for line separators.

This is why the flag was named after s in single line, and not after d in dotall. With the s flag switched on, the whole string is treated as one single line.

When parsing server-side logs in Node.js, you might make use of this flag.
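Here is a minimal sketch of that log use case. The log text is made up for illustration. It also shows the [\s\S] workaround that was commonly used before the s flag existed.

```javascript
// Hypothetical multi-line log entry.
const log = 'ERROR start\n  at line 1\n  at line 2\nEND';

const withoutS = /ERROR.*END/;        // dot stops at the first \n
const withS = /ERROR.*END/s;          // dot matches line terminators too
const workaround = /ERROR[\s\S]*END/; // pre-ES2018 way to say "anything"

console.log(withoutS.test(log));   // false
console.log(withS.test(log));      // true
console.log(workaround.test(log)); // true

// The flag is reflected on the regex object.
console.log(withS.dotAll, withS.flags); // true 's'
```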

Unicode escapes

This is a documentation-heavy topic, because the documentation itself covers every small detail of this update. I will link to the documentation as a reference. The documentation details how you can match certain Unicode character groups with some expressions, without the use of any third party libraries. In this section, we will concentrate on some practical use cases of this update instead of a thorough description of every single Unicode group.

Let’s start with a demo. Suppose you would like to match Greek characters. How did we do it before ES2018 in case we decided on not using any third party libraries?

That’s right. We would have to create character sets.

/[θωερτψυιοπασδφγηςκλζχξωβνμάέήίϊΐόύϋΰώ]/u.test('λ')
true

If that was not enough, think about upper case letters too:

/[ΘΩΕΡΤΨΥΙΟΠΑΣΔΦΓΗςΚΛΖΧΞΩΒΝΜΆΈΉΊΪΐΌΎΫΰΏ]/u.test('Λ')
true

In ES2018, we have an easier notation:

/\p{Script=Greek}/u.test('π');
true

\p{Script=Greek} matches a Greek character and nothing else. This is a great semantic shorthand. Think about it. Greek is very much like English in terms of the limited number of characters. At the same time, you would have a very hard time working in Chinese or Japanese, where you would have to concatenate an endless pool of symbols. This problem is solved with Unicode escapes.
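To illustrate the Chinese and Japanese case, here is a minimal sketch using the Han and Hiragana scripts. The test strings are my own, written as escapes so the code points are unambiguous. Remember that \p{...} only works together with the u flag.

```javascript
// One or more characters from a single script, nothing else.
const isHan = /^\p{Script=Han}+$/u;
const isHiragana = /^\p{Script=Hiragana}+$/u;

const kanji = '\u6F22\u5B57';    // 漢字
const hiragana = '\u3072\u3089'; // ひら

console.log(isHan.test(kanji));         // true
console.log(isHiragana.test(hiragana)); // true
console.log(isHan.test(hiragana));      // false
```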

If I could wish for another update, I would like to mix all characters from all languages. Imagine a Japanese text written in Hiragana on geometry, containing Greek letters, as well as a Latin spelling of some mathematicians. I would prefer not forming character sets of scripts, especially if I don’t know what languages I will have to deal with.

This is where the Alphabetic unicode escape comes in:

/\p{Alphabetic}/u.test('á')
true

/\p{Alphabetic}/u.test('が')
true

/\p{Alphabetic}/u.test('6')
false

\p{Alphabetic} matches a lot more characters than \w, because with \p{Alphabetic}, we can match any alphabetical character of any language. Decimal digits are excluded.
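A minimal sketch of the mixed-language scenario: splitting a text into words regardless of script. The sample text is an assumption of mine, written with escapes for the non-ASCII letters.

```javascript
// "café π が 42": Latin with an accent, Greek, Hiragana, and a number.
const mixed = 'caf\u00e9 \u03c0 \u304c 42';

// Runs of alphabetic characters of any script; the digits are skipped.
const words = mixed.match(/\p{Alphabetic}+/gu);
console.log(words); // [ 'café', 'π', 'が' ]

// For comparison, \w stops at the accented letter.
const asciiWords = 'caf\u00e9'.match(/\w+/g);
console.log(asciiWords); // [ 'caf' ]
```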

Talking about numbers, [0-9] may often be sufficient to match a decimal digit. But sometimes, we have more fancy number characters. This is where \p{Decimal_Number} comes in:

/\d/u.test('𝟞');
false
/\p{Decimal_Number}/u.test('𝟞');
true

You can also match emojis, and many other useful groups of characters. Feel free to use the ES2018 version of regular expressions to process text in your language. Good luck with them!
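One hedged note on emojis, as a small sketch of my own: the Emoji property also covers characters such as the ASCII digits, because they can form keycap emojis. If you only want characters that are displayed as emojis by default, Emoji_Presentation is likely the better fit.

```javascript
const smiley = '\u{1F600}'; // grinning face

console.log(/\p{Emoji}/u.test(smiley));              // true
console.log(/\p{Emoji}/u.test('6'));                 // true, maybe surprisingly
console.log(/\p{Emoji_Presentation}/u.test(smiley)); // true
console.log(/\p{Emoji_Presentation}/u.test('6'));    // false
```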

What is missing

Despite the updates enriching JavaScript regular expressions, we still have a long way to go. The missing pieces in my opinion are the following:

Possessive loops and atomic groups
Maintainability updates
- Extended mode for turning off matching whitespaces
- Subroutines

Possessive Loops and Atomic Groups

For instance, I don’t get why we can’t have possessive repeat modifiers and atomic groups. These are both important optimization techniques.

Repeat modifiers such as a+ and a* have a possessive version. For instance, for these two examples, the respective possessive versions are a++ and a*+. A possessive loop works just like a regular greedy loop does, except that it never gives back characters it has already matched: when the rest of the pattern fails, the engine does not backtrack into the loop, and the match attempt at that position fails. Possessive loops work really well in PCRE-style regex engines, but not in JavaScript:

regex = /a++b/
> Uncaught SyntaxError: Invalid regular expression: /a++b/: Nothing to repeat

At the same time, it is very easy to implement possessive loops using some workarounds:

/(?=(a+))\1b/

First we define a positive lookahead greedily matching as many a characters as possible. This creates capture group 1 in the expression. Then we use a backreference to capture group 1 using \1 to match the value of the greedy a+ loop in the lookahead. Then we match b. If we backtrack, we backtrack fully. There is no way to unloop an a character from the loop inside the lookahead.

Given we don’t want to rely on capture group indices due to maintainability reasons, we can make this construct more semantic using named capture groups and named backreferences. Do you remember the backreference part of named capture groups? This is the place where you will see an example on a named backreference:

/(?=(?<possessive>a+))\k<possessive>b/.test('aaab')
true

/(?=(?<possessive>a+))\k<possessive>b/.test('b')
false
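The two tests above give the same result with a plain greedy loop, so here is a sketch of a case where the difference actually shows. The pattern and names are my own illustration: a greedy a+ can give back one a to let the rest match, while the emulated possessive loop cannot.

```javascript
// Greedy: a+ first takes "aaa", then backtracks to "aa" so that "ab" can match.
const greedy = /a+ab/;

// Emulated possessive: the lookahead captures all the a characters,
// the backreference consumes them, and nothing can be given back.
const possessive = /(?=(?<run>a+))\k<run>ab/;

console.log(greedy.test('aaab'));     // true
console.log(possessive.test('aaab')); // false
```

The second pattern can never match, because after consuming every a there is no a left for the rest of the expression. That is exactly how a++ab behaves in engines that support possessive loops.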

The same holds for atomic groups: (?>a|b) does not exist in JavaScript, but we can emulate it in the same way as we did with possessive loops.

I can illustrate atomic groups with an easy example from my personal life. I tend to lose stuff easily. So sometimes I go and look for my stuff in my house. For instance, suppose I am looking for my phone. I look at the table, my pocket, and also the kitchen.

Suppose I then check if I have an appointment. I have appointments on Tuesday. If the day does not match, I don’t need my phone to call anyone.

The following regular expression captures first me looking for my phone, then checking the day.

/(?>table|pocket|kitchen) --> Tuesday/

Suppose on Wednesday, I find my phone in the pocket, then I figure out it’s not Tuesday, so I have to backtrack. Then I go back and look for my phone again!

But once I have found my phone, it would make sense to stop looking for it! This is the problem atomic groups solve.

As a homework assignment, implement it using JavaScript constructs. Hint: the solution is very similar to the possessive loop construct.
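If you want to check your homework, here is one possible sketch, assuming the same lookahead plus backreference trick as before. The group names are my own. The second half shows a case where an atomic group changes the result, not just the amount of backtracking.

```javascript
// Emulated atomic group: once a place matched, it is never retried.
const phone = /(?=(?<place>table|pocket|kitchen))\k<place> --> Tuesday/;

console.log(phone.test('pocket --> Tuesday'));   // true
console.log(phone.test('pocket --> Wednesday')); // false

// A plain group tries "a", fails on "c", then backtracks and tries "ab".
const plain = /(?:a|ab)c/;
// The emulated atomic group commits to "a" and does not try "ab".
const atomic = /(?=(?<alt>a|ab))\k<alt>c/;

console.log(plain.test('abc'));  // true
console.log(atomic.test('abc')); // false
console.log(atomic.test('ac'));  // true
```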

Maintainability updates

There is an evident fear circulating in the industry about writing and maintaining regular expressions. This is because as the expression becomes bigger, reading the expression gets harder.

This reminds me of the word Fluessigkeitsliebhaberei in my gym. It means the act of loving fluids. In one word. Complete nonsense. German words are long. Sometimes very long.

Rindfleischetikettierungsueberwachungsaufgabenuebertragungsgesetz is another example. Many of my readers are used to reading short words. Then comes a long word like this out of the blue. Then we can throw all our speed reading experience out of the window.

Software developers are used to reading code. But code normally contains a lot of whitespace. The PCRE expression

/^(?=.*[A-Z].*[A-Z])(?=.*[!@#$&*])(?=.*[0-9].*[0-9])(?=.*[a-z].*[a-z]).{8,}$/

is just like a very long German word. It is by far not the worst regex I have seen, and it takes time for me to interpret it.

By the way, I did not come up with this regex myself. The original source comes from StackOverflow. I slightly modified the expression to allow passwords longer than 8 characters. I also changed the minimum number of lower case letters from 3 to 2.

I hope we can agree that this regex is hard to read. But what if we could add some whitespaces around it? What if we could place comments in the expression?

This is exactly what the extended mode does in regular expressions.

/(?x) 
  ^                  # Start anchor
  (?=.*[A-Z].*[A-Z]) # At least two upper case letters
  (?=.*[!@#$&*])     # AND at least one special character
  (?=.*[0-9].*[0-9]) # AND at least two digits
  (?=.*[a-z].*[a-z]) # AND at least two lower case letters
  .{8,}              # String is at least 8 characters long
  $                  # End anchor
/

In pure JavaScript regexes, we don’t have extended mode. If we want to format our expressions, we have to use a different method such as:

const regex = new RegExp(
    '^'                  + // Start anchor
    '(?=.*[A-Z].*[A-Z])' + // At least two upper case letters
    '(?=.*[!@#$&*])'     + // AND at least one special character
    '(?=.*\\d.*\\d)'     + // AND at least two digits
    '(?=.*[a-z].*[a-z])' + // AND at least two lower case letters
    '.{8,}'              + // String is at least 8 characters long
    '$'                    // End anchor    
);

Notice the double escape in \d. We are using the RegExp constructor, so if we need a backslash in the string, we have to escape it. This is not the most comfortable thing in the world.
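One way to get rid of the double escape, sketched here under the assumption that you are fine with template literals: String.raw keeps backslashes as they are, so \d can be written the same way as in a regex literal. The variable name and the test passwords are my own.

```javascript
const passwordRegex = new RegExp([
  String.raw`^`,                  // Start anchor
  String.raw`(?=.*[A-Z].*[A-Z])`, // At least two upper case letters
  String.raw`(?=.*[!@#$&*])`,     // AND at least one special character
  String.raw`(?=.*\d.*\d)`,       // AND at least two digits (single backslash)
  String.raw`(?=.*[a-z].*[a-z])`, // AND at least two lower case letters
  String.raw`.{8,}`,              // String is at least 8 characters long
  String.raw`$`,                  // End anchor
].join(''));

console.log(passwordRegex.test('ABcd12!x')); // true
console.log(passwordRegex.test('abcdefgh')); // false
console.log(passwordRegex.test('ABcd12!'));  // false, only 7 characters
```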

A good alternative is the XRegExp library, which comes with all the convenience features and more:

// The template literal keeps the line breaks that terminate each # comment,
// and the 'x' flag switches on free-spacing mode.
const regex = new XRegExp( `
    ^                   # Start anchor
    (?=.*[A-Z].*[A-Z])  # At least two upper case letters
    (?=.*[!@#$&*])      # AND at least one special character
    (?=.*\\d.*\\d)      # AND at least two digits
    (?=.*[a-z].*[a-z])  # AND at least two lower case letters
    .{8,}               # String is at least 8 characters long
    $                   # End anchor
`, 'x' );

Talking about XRegExp, another feature that’s still missing from JavaScript regexes is subroutines. Subroutines are like functions in JavaScript. If we write a big regular expression, we still have to make it one expression. It’s like writing your JavaScript code in the global scope without using functions. There are limits to the complexity of problems you can solve.

Needless to say, XRegExp comes with subroutines. Let me paste one example from the documentation of XRegExp:

var time = XRegExp.build('(?x)^ {{hours}} ({{minutes}}) $', {
  hours: XRegExp.build('{{h12}} : | {{h24}}', {
    h12: /1[0-2]|0?[1-9]/,
    h24: /2[0-3]|[01][0-9]/
  }, 'x'),
  minutes: /^[0-5][0-9]$/
});

You get the idea, right? Subroutines are powerful, semantic, and self-document your code. If we use these extra features in XRegExp, we will be able to write easily maintainable regular expressions, similarly to writing maintainable software.
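If pulling in a library is not an option, a rough approximation of the same idea is possible in plain JavaScript by gluing together the source of smaller regexes. This is only a sketch of mine, not how XRegExp.build works internally, and it has no free-spacing mode. Each part is wrapped in a non-capturing group so that its alternation stays contained.

```javascript
// Small, individually testable building blocks.
const h12 = /1[0-2]|0?[1-9]/;
const h24 = /2[0-3]|[01][0-9]/;
const minutes = /[0-5][0-9]/;

// Either 12-hour format followed by a colon, or 24-hour format without one.
const hours = new RegExp(`(?:${h12.source}):|(?:${h24.source})`);

// The full expression, with the minutes in a named capture group.
const time = new RegExp(`^(?:${hours.source})(?<minutes>${minutes.source})$`);

console.log(time.test('10:59'));                // true
console.log(time.exec('10:59').groups.minutes); // 59
console.log(time.test('2359'));                 // true
console.log(time.test('25:00'));                // false
```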

Summary

Despite the shortcomings, I hope you liked the ES2018 regex updates:

Lookbehind assertions,
Named capture groups,
Dotall (s) flag,
Unicode escapes.

If you need to overcome some of the shortcomings of JavaScript regexes, I still encourage you to use the XRegExp library. In fact, I can’t see XRegExp getting replaced by the JavaScript regex engine any time soon. At the same time, the ES2018 regex engine has become significantly better and more maintainable than before.
