Monday, 23 September, 2019 UTC


Summary

Regular expressions can help us solve many different problems. They can also be the source of our headaches. A recent Cloudfare outage happened due to a regular expression that caused CPU to spike to 100% on (…) machines worldwide. In this article, we go through situations we need to watch out for like the catastrophic backtracking. To help us understand the problem, we also define what greedy and lazy quantifiers are and why the lookahead might be helpful.
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Jamie Zawinski
xkcd.com
Diving deeper into quantifiers
The Regular Expression engines are quite complex. We can work wonders with regexp, but we need to take into account some problems that we might encounter. To do that, we need to dive deeper into how some regular expressions are executed.

Greedy quantifiers

In the previous parts of the course, we use quantifiers such as 
+
. It tells the engine to match at least one token.
const expression = /e+/;

expression.test('Hello!'); // true
expression.test('Heeeeello!'); // true
expression.test('Hllo!'); // false
Let’s tale a closer look at the second example: 
/e+/.test('Heeeeello!')
. We might wonder how many letters we match using the above expression.
Because quantifiers are greedy by default, we match as many letters as possible. We can confirm this by using the match function.
'Heeeeello!'.match(/e+/);
// ["eeeee", index: 1, input: "Heeeeello!", groups: undefined]
Another good example is the use of some HTML tags:
const string = 'Beware of <strong>greedy</strong> quantifiers!';

/<.+>/.test(string); // true
The first guess might be that it matches something like 
<strong>
. Not quite!
string.match(/<.+>/);
// ["<strong>greedy</strong>" (...) ]
 
As you can see, the greedy quantifier matched the longest possible string!

Lazy quantifiers

In this series, we also cover the 
?
 quantifier. It means to match zero or one time.
function wereFilesFound(string) {
  return /[1-9][0-9]* files? found/.test(string);
}
 
wereFilesFound('0 files found');  // false
wereFilesFound('No files found'); // false
wereFilesFound('1 file found');   // true
wereFilesFound('2 files found');  // true
The interesting thing is that by adding it to a greedy quantifier, we tell it to repeat as few times as possible and therefore make it lazy.
const string = 'Beware of <strong>greedy</strong> quantifiers!';
string.match(/<.+?>/);

// ["<strong>", (...) ]
 
Catastrophic backtracking
To understand how quantifiers can affect our performance, we need to take a closer look into a process called backtracking.
Let’s take a look at this seemingly innocent piece of code!
const expression = /^([0-9]+)*$/;
At first glance, it successfully detects a series of digits. Let’s break down how it works.
expression.test('123456789!');
  • First, the engine processes 
    [0-9]+
    . It is greedy, so it starts by attempting to match as many digits as possible. The first match is 
    123456789
    • then the engine tries to apply the * quantifier, but there are no more digits left
    • since we use the
      $
       sign, we want the string to end with our digits – this does not happen due to the
      !
       sign
  • Now the amount of matched characters in 
    [0-9]+
     decreases due to backtracking. It matches 
    12345678
    .
    • then the
      *
        quantifier applies and therefore 
      ([0-9]+)*
       results in two substrings: 
      12345678
       and 
      9
    • since no of the above substrings are at the end of the string, matching with 
      $
       fails
  • The engine keeps backtracking, by decreasing the number of digits the 
    [0-9]+
     matches
There are multiple different combinations the above process produces.
Our string ends with the
!
 sign. Because of that, the regex engine attempts to backtrack until it finds a substring of digits at the end of a provided string.
[12345678][9]!

[1234567][89]!

[1234567][8][9]!

[123456][789]!

[123456][7][89]!

[123456][78][9]!
After a lot of calculations, no match is found. It might lead to a big performance drop. If you use a really long string, the browser might hang, destroying the user experience.
The performance can sometimes be improved by changing the greedy quantifiers into lazy ones, but that is not the case in this particular example.
Lookahead
The most straightforward way to fix the above issue is to rewrite the regular expression completely. The above is not always easy and may result in quite a pain. A solution to the above problem can be the use of lookahead.
In its most basic form, it states that x matches only if it is followed by y.
const expression = /x(?=y)/;

expression.test('x'); // false
expression.test('xy'); // true
We refer to it to as positive lookahead. The negative lookahead matches x only if x is not followed by ‘y’
const expression = /x(?!y)/;

expression.test('x'); // true
expression.test('xy'); // false
The cool thing about lookahead is that it is atomic. As soon as its condition is satisfied, the engine will not backtrack to try different permutations.

Backreference

Another thing that we need to use here is a backreference.
const expression = /(a|b)(c|d)\1\2/;
Above, 
\1
 means the contents of the first capturing group, and
\2
 is the contents of the second capturing group.
expression.test('acac'); // true
expression.test('adad'); // true
expression.test('bcbc'); // true
expression.test('bdbd'); // true

expression.test('abcd'); // false
We can combine the use of the lookahead and the backreference to deal with the backtracking issue!
const expression = /^(?=([0-9]+))\1*$/
This looks quite complex; let’s break it down part by part.
  • (?=([0-9]+))
     looks for the longest string of digits, since 
    +
     is greedy
    • the engine does not backtrack looking for different combinations
  • (?=([0-9]+))\1
     the backreference states that the content found by the lookahead needs to appear in the string
Thanks to all of the above, we can safely test even very long strings without any hiccups.
const expression = /^(?=([0-9]+))\1*$/;
expression.test('5342193376141170558801674478263705216832 D:'); //false
expression.test('7558004377221767420519835955607645787848'); // true
Summary
In this article, we’ve dived deeper into quantifiers. We’ve learned that we can divide them into greedy and lazy quantifiers and that it can have an impact on the performance. We’ve also covered another issue that quantifiers can cause called catastrophic backtracking. We’ve also learned how to use lookahead to improve the performance instead off simply rewriting our expressions. With all that knowledge we can write even better code and avoid issues like the one that Cloudflare had.
The post Regex course – part four. Avoiding catastrophic backtracking using lookahead appeared first on Marcin Wanago Blog - JavaScript, both frontend and backend.