Thursday, 10 May, 2018 UTC


Summary

A regular expression is converted into code executed in a virtual machine. This virtual machine runs on the virtual machine of the host language or editor.
Abstract models make us understand how regular expressions are executed. Although these models often neglect optimization techniques implemented by regex engines, they still give you valuable information on how regexes work.
Regular Expressions are Finite State Machines
The easiest way to visualize a regular expression is with a Finite State Machine. A Finite State Machine, or FSM in short, is a directed acyclic graph with a dedicated initial state and a dedicated end state. The FSM representation of a regular expression is an FSM, where the edges (arrows) represent characters we read from the input stream, and nodes represent an internal state of the regular expression.
For instance, the regular expression /ab/ can be represented with the following finite state machine:
/ab/ first matches the character a, then the character b. Imagine we are reading the text aabc, and we would like to figure out if this string matches the regular expression /ab/.
Let’s place a token on the start node.
  • After reading the first a character, we move the token through the arrow denoted by the a state. Our token is now in the intermediate state.
  • Once reading the second a, there is no arrow starting in the intermediate state, therefore we move back to the start state. In this state, we try reading a again. As there is an a arrow originating from the start state, we move to the intermediate state again.
  • Once reading the b character, we move our token from the intermediate state to the match state. Given that we have reached the match state, the string matches the regular expression.
Notice the match has been determined without reading the last character in the sequence.
Each character in the regular expression is an instruction executed in sequence. These instructions are matching tasks. Our goal is to reach the match node in any way possible. From the perspective of determining a match, it does not matter how many times we reach the match state, as long as we reach it at least once.
Backtracking happens once you fail to reach the match node and you cannot move anywhere from a node.
When we backtrack in a finite state automaton, we move backwards on the edges until we reach a node that has forward edges that we haven’t tried yet. If you have ever read a Fighting Fantasy gamebook, where you explore the Deathtrap Dungeon for instance, you might remember how you backtracked from one path of the maze to another. The same thing happens in a regular expression. As you move forward, you mark the edges you have visited. As you move back, you only try edges that you haven’t marked before.
While moving backwards on an edge, we also move back our caret pointing at the upcoming character of the input string, unreading the character on the edge.
One special form of backtracking is when we move back to the starting node. In this case, we have another move in our arsenal: moving a character forward in the input string, and attempting to start a new match.
For instance, when matching the string "abaa" with the regex /aa/, we first read the character a, then we backtrack, because we cannot move forward after the first character. After backtracking, we move the caret forward, leaving the string "baa" to match against /aa/. As we cannot match the character b against a, we move the caret forward again. We now have the string "aa" to match against /aa/. This matching will obviously succeed without any further backtracking.
In order to understand backtracking more deeply, let’s explore the differences between deterministic and nondeterministic regular expression modeling.
Deterministic and nondeterministic Regex modeling
The | (pipe) represents an or operation in most regular expression dialects. In EMACS and VIM, you have to escape the pipe yielding the \| operator. In BRE, or is not supported. In all other dialects, the or operator is a regular pipe. In this stection, we will stick to the latter notation.
/a|b/ is a regular expression that matches either an a character, or a b character.
The finite state machine representation of this regular expression looks like this:
In practice, regex implementations may simplify this automaton by representing a set of possible matches using a bitmask.
The bitmask representation simplifies the graph, especially in complex cases.
Let’s consider some more complex cases, where simplification is not obvious.
Expression: /list|lost|lust/
The most obvious construction of an automat looks like the following: we branch off for each operand of the or operator, and attempt to match the characters in sequence.
During execution, we have to attempt each branch. When one branch fails, we backtrack.
For instance, in the sequence lossless, we attempt the following steps:
  • As the first character is l, we try to match it in all three branches:
    • On the first branch, we check the second character, which is an o. As we cannot move forward in this branch, we backtrack.
    • On the second branch the second and third characters, o and s, both match the characters on the upcoming arrows. However, the fourth character is supposed to be a t instead of an s, so we backtrack.
    • On the third branch, we check the second character, which is still an o. As we needed an u to mmove forward, we backtrack.
  • The second, third, and fourth characters don’t match any of the arrows originating from the start node.
  • The fifth character is an l again. We try to match it in all three branches:
    • On the first branch, e != i, so we backtrack.
    • On the second branch, e != o, so we backtrack.
    • On the third branch, e != u, so we backtrack.
  • The rest of the characters e, s, and s don’t match any of the edges originating from the start node.
  • As there are no more letters to read, and we have not reached the match state, we return a failure.
Each time we can move in multiple directions from the same node, we take a nondeterministic action. The above automaton is a Nondeterministic Finite Automaton (NFA).
Many Regex interpreters use this nondeterministic form, without considering any compile time optimizations. Some regex interpreters go the extra mile, and convert the nondeterministic edges into deterministic ones.
First, notice each branch starts with l. We can simply use just one edge instead of the three, and delay the nondeterminism:
Second, notice each branch ends with st. We can use one edge and one state instead of the three redundant ones:
As a final step, let’s apply a bitmask instead of the three edges, where the letters are connected with or.
For now, we will denote this bitmask by [iou]. Read it this way: “choose one of the characters i, o, u“. We will learn about this language construct later.
More importantly, the resulting graph is now a DFA, or Deterministic Finite Automaton.
The nondeterministic form is a lot easier to build from the regular expression. It the length of the input is N, it takes O(N) steps to construct the tree. In exchange, we pay the price during execution, as the worst case execution time is O(2^N).
Constructing a deterministic automaton takes the magnitude of steps O(2^N). In exchange, execution costs are linear.
In practice, as long as our regular expressions are not reusable, we tend to settle for slower worst case execution time. This is because a regular expression can succeed very quickly. As soon as we find a match, execution is finished.
Therefore, even though most regex engines use optimizations, the nondeterministic form is often beneficial.
Finally, notice that our original expression was /list|lost|lust/. If we reverse engineer a regular expression from the final optimized automaton, we get /l(i|o|u)st/. This form is purely deterministic. Therefore, we can conclude that in case our regex interpreter has a tendency to execute regex matching in a nondeterministic way, we say prescius runtime by writing an optimized regular expression. Even if the interpreter makes the necessary optimizations for us, we still save automaton construction time by writing regexes in an optimized form.
This benefit is often negligible, so whenever we can choose between optimization and understandability, I would go for understandability.
If you are interested in reading more about optimization techniques, read this article.
Basic regex simplifications
We have seen that a nondeterministic automaton construction algorithm yields two completely different automata for the same regular expression. We have also seen that
  • /list|lost|lust/ resulted in a nondeterministic finite state automaton, while
  • /l(i|o|u)st/ resulted in a deterministic one.
The second form is more efficient than the first one. Generally, unless there is a semantic reason to increase the readability of the regular expression, we prefer the second form.
The same holds for constructs, when matching one branch implies that the other branch is matched.
For instance, in /lesson|less/ or /less|lesson/, matching lesson implies that we had to match less. This is because lesson starts with less. As a consequence, we can simplify the regular expression to /less/.
For instance, when executing /lesson|less/ on lessom, which is a nonsense word written most likely as a result of a typo, we first try to match against lesson, and then backtrack just to conclude that less is accepted. Although in small regular expressions this is not a big deal, but imagine a bloated regular expression like the following:
/less(a|b|c|de(f|g)|o(a|b|c|...))|less/
When checking lessom, we may waste a lot of prescious execution time in the first branch for no reason.
We may succeed in the first branch, for instance, by checking the input lessdeg. However, lessdeg also matches less, which means we did unnecessary work. Regular expressions are not designed to find the shortest match. They are designed to return the first match in the execution sequence.
A successful match is cheaper than failure
We saw in the previous sections that regular expressions tend to succeed faster than returning a failure. This is because a failure means we have to try each execution path in all possible ways.
As success is cheaper than failure, always formulate a condition for success when designing a regular expression. Failure tends to be more expensive.
Automatically generating regex FSMs
Visualizing a regular expressions helps you understand how they work and how they can be simplified. You could have a notebook with you all the time to construct a finite state machine for each regular expression you encounter. If you want to save time, you can also use some online services.
http://ivanzuzak.info/noam/webapps/fsm_simulator/ simulates the execution of a regular expression matching a given string. It not only creates automata for you, but also shows the execution of the matching algorithm character by character. Unfortunately, the language of the automaton is limited, as we can only concatenate strings, use the or operator, and use the Kleene (*) operator for indicating any number of occurrences. We can also use parentheses for grouping. Using the plus symbol instead of the pipe is slightly inconvenient, because in this special dialect, plus means alternative execution, and not the “at least once” repeat modifier. In exchance for the semantic difficulties, this software shows you three different types of simplified automata that can be used to match your strings. Our constructs in this book have been very close to the eNFA construct.
http://ivanzuzak.info/noam/webapps/regex_simplifier/ shows you how to simplify regular expressions. For instance, our expression containing the words list, lost, and lust, are simplified as follows:
Input: list+lost+lust

R1   list+lost+lust

Rule (ab+ac) => a(b+c)
R2   l(ist+ost)+lust

Rule (ab+ac) => a(b+c)
R3   l((ist+ost)+ust)

Rule (ab+cb) => (a+c)b
R4   l(((is+os)t)+ust)

Rule (a) => a
R5   l((is+os)t+ust)

Rule (ab+cb) => (a+c)b
R6   l(((is+os)+us)t)

Rule (ab+cb) => (a+c)b
R7   l((((i+o)s)+us)t)

Rule (a) => a
R8   l(((i+o)s+us)t)

Rule (ab+cb) => (a+c)b
R9   l((((i+o)+u)s)t)

Rule ab(cd) => abcd
R10  l(((i+o)+u)s)t

Rule ab(cd) => abcd
R11  l((i+o)+u)st

Rule (a+(b+c)) => a+b+c
R12  l(i+o+u)st
https://regexper.com/ constructs an execution graph out of a JavaScript regular expression. For instance, check out the representation of a complex JavaScript regular expression here. If you study the graph, you can easily reverse engineer what each metasyntax character means. If this regular expression seems intimidating for you, don’t worry, we all have trouble reading expressions like this one.
This graph representation has little to do with our finite state automata, because in the regexper graph, edges do not consume characters. In this graph, nodes consume characters, and edges can be selected in a nondeterministic way. Also notice, edges may have conditions on them such as 1..3 times, or 1+ times. The FSM representation of such edges may be completely different. This is a simplification for the sake of understandability.
https://www.debuggex.com/ uses the same format as regexper, and the interface helps you visualize how a regular expression matches a string character by character. You can use JavaScript, Python, and generic PCRE syntax.
https://regexr.com/ gives you a verbal explanation and cheat sheet for JavaScript, as well as the generic PCRE syntax.
https://regex101.com/ is another handy tool for testing is a regular expression matches a string. PCRE (PHP), JavaScript, Python, and GoLang are all usable. The tool can also generate code for C#, Java, Ruby, Rust, and Perl 5.
http://qfsm.sourceforge.net/download.html is my favorite Finite State Machine designer tool. You can visualize and simulate finite state machines conveniently.
Summary
We have learned the fundamentals of visualizing the execution of simple regular expressions with Finite State Machines. We have concluded that the execution of state machines can be deterministic or nondeterministic, depending on the way how regular expressions are simplified.
Deterministic execution can be visualized using a Deterministic Finite Automaton. Creating this automaton is expensive, but determining whether a regular expression matches a string is cheap (linear).
Nondeterministic execution can be visualized using a Nondeterministic Finite Automaton. Creating this automaton is cheap (linear), but executing whether a regular expression matches a string is expensive.
As success is always cheaper than a failure during runtime, we tend to favor nondeterministic execution to avoid losing too much time with optimization.
We concluded this article with some simple considerations for optimizing our regular expressions, and some tools that visualize, test, simplify, or debug our regular expressions.
If you like this article, check out my book Understanding Regular Expressions, or sign up below for more regex content.