Tuesday, 17 January, 2017 UTC


Summary

Regular expressions are incredibly powerful, but you would be hard pressed to find anyone who believes they’re very intuitive. Sure, there’s that one developer you know who’s excellent at it, but most developers know just enough to be dangerous. Unfortunately, the level of danger when it comes to regular expressions and security is quite high. Misunderstanding regular expressions can ultimately end up making it easy for attackers to take your site down.
Let’s take the following regular expression as an example: js regex = /A(B|C+)+D/
If we break that down, here’s what this regular expression is accomplishing:
  • A The string must start with the letter ‘A’
  • (B|C+)+ The string must then follow the letter A with either the letter ‘B’ or some number of occurrences of the letter ‘C’ (the + matches one or more times). The + at the end of this section states that we can look for one or more matches of this section.
  • D Finally, we ensure this section of the string ends with a ‘D’
The expression would match inputs such as the following examples: ABBD ABCCCCD ABCBCCCD ACCCCCD
Generally speaking, it doesn’t take very long for a regex engine to find a match. For example, let’s see what happens when we test it against the following 30 characters long string: “ACCCCCCCCCCCCCCCCCCCCCCCCCCCCD”.
1
2
$ time node -e '/A(B|C+)+D/.test("ACCCCCCCCCCCCCCCCCCCCCCCCCCCCD")'
0.04s user 0.01s system 95% cpu 0.052 total
As expected, the string matches and matches quickly. The entire process takes around 52ms (as tested on a MacBook Air).
Now let’s see what happens when we test another 30 characters long string. This time, instead of providing a valid string, let’s provide one that is invalid by replacing the “D” at the end with an “X”: “ACCCCCCCCCCCCCCCCCCCCCCCCCCCCX”.
1
2
$ time node -e '/A(B|C+)+D/.test("ACCCCCCCCCCCCCCCCCCCCCCCCCCCCX")'
1.79s user 0.02s system 99% cpu 1.812 total
Suddenly it takes nearly two seconds to complete the test—over ten times as long as it took to test a valid string! The dramatic difference is due to the way regular expressions get evaluated. In short, they don’t like to give up.
How RegEx Engines Work
Regex engines differ, but most will work very similarly. The engine will match the first possible way to accept the current character and proceed to the next one. If it then fails to match the next one, it will backtrack and see if there was another way to digest the previous character. If it goes too far down the rabbit hole only to find out the string doesn’t match in the end, and if many characters have multiple valid regex paths, the number of backtracking steps can become very large, resulting in what is known as catastrophic backtracking.
Let’s look at how our expression runs into this problem, using a shorter string: “ACCCX”. While it seems fairly straightforward, there are still four different ways that the engine could those three C’s:
  1. CCC
  2. CC+C
  3. C+CC
  4. C+C+C.
The engine has to try each of those combinations to see if any of them potentially match against the expression. When you combine that with the other steps the engine must take, we can use RegEx 101 debugger to see the engine has to take a total of 38 steps before it can determine the string doesn’t match.
From there, the number of steps the engine must use to validate a string just continues to grow.
String Number of C’s Number of steps
ACCCX 3 38
ACCCCX 4 71
ACCCCCX 5 136
ACCCCCCCCCCCCCCX 14 65,553
By the time the string includes 14 C’s, the engine has to take over 65,000 steps just to see if the string is valid. Now imagine this process as it would apply to our original string: “ACCCCCCCCCCCCCCCCCCCCCCCCCCCCX”. You can see how this backtracking can get out of control very quickly.
The following screenshot show’s just a small section of the work the engine was doing, as reported by the RegEx101 debugger.
Given how the engine evaluates a regular expression, the reason for the huge jump in the time required to process an invalid expression becomes clearer: there is an exponential relationship between the length of the string and the number of paths the engine has to evaluate. If we add just one more character to our invalid sequence, the evaluation time nearly doubles.
1
2
$ time node -e '/A(B|C+)+D/.test("ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCX")'
3.51s user 0.06s system 95% cpu 3.732 total
Node and ReDoS
Bad actors can take advantage of the complexities of regular expression parsing by passing a long, complicated string that makes the engine take an inordinate amount of time to evaluate, resulting in what is known as a Regular Expression Denial of Service (ReDoS) attack.
That’s certainly not ideal in any environment, but it’s even worse in JavaScript environments, including Node. As we explained when we talked about timing attacks in an earlier post, Node.js (and Javascript in general) are event driven.
TThis means that if a thread is being occupied by a successful ReDos request, that request would block the entire event loop and the application will not be able to perform any other processing.
A Real-World Example
Let’s look at a recent real-world example that was found in the popular moment library. moment is used for parsing, validating, manipulating and formatting dates. Using the tool, you can specify the format you would like your date output as using the format() method:
1
2
3
moment().format('MMM Do YY');

//example output: Nov 22nd 16
When you define your format, moment uses a regular expression to verify it. In versions of moment older than 2.15.2, the regular expression looks like this:
var MONTHS_IN_FORMAT = /D[oD]?(\[[^\[\]]*\]|\s+)+MMMM?/;
The key here is the (\[[^\[\]]*\]|\s+)+ portion of the expression. Of particular note here is that \s+ is testing for the presence of one or more spaces. However, the + just outside of that group also means that the group nested in the parentheses can be found one or more times. This logic means that there are many different ways the regular expression engine can try to group these characters to find a match if there are a high number of spaces in the format. They could be lumped together either as one big group matched once, broken up as individual groups matched each on their own, or anything in between. The result is that it can take a long time to evaluate a string that is very close to being valid.
Consider this example we posted when we found the vulnerability:
1
2
3
4
var m = require("moment");
m.locale("be");
m().format("D                               MMN MMMM");
// Normal date format, other than the 31 spaces included in the string 
In this case, we pass in a 40 character long string with numerous space characters. Running this on a standard laptop showed that the expression blocks the event loop for about 20 seconds.
1
2
$ time node -e '/D[oD]?(\[[^\[\]]*\]|\s+)+MMMM?/.test("D                               MMN MMMM");'
21.24s user 0.14s system 96% cpu 22.079 total
After we disclosed the vulnerability to moment, the author put together a fix (you can test now to see if the patch applies for you) that removes the extraneous + operator from within the parantheses:
var MONTHS_IN_FORMAT = /D[oD]?(\[[^\[\]]*\]|\s)+MMMM?/;
By doing this, you do away with having two different matching groups that could each have anywhere from one to all of the spaces included in it. Because the + operator no longer follows the s, it can only match a space at a time. If you test the string again using the new regular expression, it now takes a matter of milliseconds for the engine to figure out it’s a mismatch.
1
2
$ time node -e '/D[oD]?(\[[^\[\]]*\]|\s)+MMMM?/.test("D                               MMN MMMM");'
0.05s user 0.02s system 53% cpu 0.135 total
Preventing Catastrophic Backtracking
To prevent catastrophic backtracking, you want to keep an eye out for anytime you’re using ‘+’ or ‘*’ operators within proximity to each other. If you are, chances are they’re going to end up in the type of tug-of-war that results in catastrophic backtracking. Make sure that both operators are in fact necessary. You can also consider replacing one of them with a limited set. If you recall the ReDoS vulnerability in moment, we fixed it by eliminating one of the ‘+’ operators. We could have also replaced one of them with a range, limiting the number of spaces we’d allow in a valid string:
var MONTHS_IN_FORMAT = /D[oD]?(\[[^\[\]]*\]|\s+){0,10}MMMM?/;
Regex parsing is pretty naive, but some languages have introduced advanced features that can help to avoid catastrophic backtracking. For example, let’s consider the simplified expression we started with:
1
regex = /A(B|C+)+D/

Atomic Groups

As we discussed, the expression can result in catastrophic backtracking. In the regular expression parser supported in Node.js, there isn’t a native feature that we can use to eliminate the issue. That’s not the case in some other languages. Take Ruby for example. Ruby supports something called atomic groups. Atomic groups let you tell the regex parser not to backtrack when a group has matched. You do this by prefacing a group with ?>, like so:
/A(?>B|C+)+D/
Adding ?> to the group that is looking for some number of “C’s” makes that an atomic group. Before, the regex parser has two competing operators to consider. The “C” inside the group can match any number of times, or the group itself can match any number of times. So if an invalid expression gets passed that contains a high number of “C” characters, the parser will get trapped in a battle between the two groups, repeatedly trying to see if there’s some magic combination that will make the expression valid.
With atomic grouping applied, however, the parser will lock the match of that group into place. In this case, it will match on however many “C’s” happen to be present. When the expression is invalid, it won’t be able to backtrack over that group—it’s as good as set in stone at this point. Without any need to juggle “C’s” between the two potential matching operators, an invalid expression will be identified very quickly, avoiding any catastrophic backtracking.

LookAhead

While JavaScript doesn’t support atomic groups, it does support LookAhead which can be used to accomplish the same general effect of atomic grouping.
A(?=(B|C+))\1+D
In this case, we use ?= to specify a LookAhead. This means that the regular expression will match that group only if immediately followed by the next group. By using \1, we capture the matched group, effectively creating an atomic group. At this point, the parser will treat this match just as if it were an atomic group and avoid any catastrophic backtracking.
If we test our string against the expression now using LookAhead, we’ll see that the time to invalidate the string has dropped dramatically (from 1.8 seconds without using LookAhead to 94ms with it):
1
2
$ time node -e '/A(?=(B|C+))\1+D/.test("ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCX")'
0.05s user 0.02s system 51% cpu 0.094 total
The tradeoff that you make here is complexity. You avoid the catastrophic backtracking, but you add some complexity to your expression that not only makes it more undecipherable for others on your team but makes it easier to get something wrong. If you can identify the combination of operators that could result in backtracking (in this case, the two +’s), you may be better off just reworking the expression altogether for clarity and readability.
Summary
ReDoS attacks can bring an application to a grinding halt. This is especially true in the world of Node.js where the event loop only amplifies the impact of catastrophic backtracking.
Preventing ReDoS vulnerabilities involves monitoring your regular expressions as well as testing your dependencies for any ReDos vulnerabilities.
There have been a few attempts at creating tools for automatically detecting expressions susceptible to ReDoS attacks (most notably safe-regex) but in even limited testing we found a number of false positives as well as several expressions that were determined to be safe when they were not.
Testing your dependencies is a bit more clear-cut—that’s something Snyk can do for you.
Either way, it’s worth your time to experiment. Using a tool like RegEx101’s debugger is an excellent way to play around with different expressions and better understand what steps the regular expression parsers will be taking as it tries to match your expression.
Who knows—given enough time and experimentation, maybe one day you’ll be the one swooping in to save the day with your intense knowledge of regular expressions.