JSFeeds: zsoltnagy.eu - Regular Expressions in JavaScript

Wednesday, 20 December, 2017 UTC

Regular Expressions in JavaScript

Summary

It is very easy to experiment with JavaScript regular expressions, as JavaScript is accessible in all browsers.

I will use the Chrome Developer Tools to execute regular expressions. The > symbol denotes an input. The return value and console logs are printed below the input lines. As we are experimenting in the console, I will use global variables. Obviously, in the source code, using let and const is encouraged.

Regular expressions can be constructed in two ways.

In their literal form, a regex pattern is written between slashes (/), and modifiers (flags) can be added after the trailing slash.

In their object form, we use the RegExp constructor to create regular expressions.

> new RegExp( 'xy+' )
/xy+/

The xy+ pattern matches strings that contain an x character followed by at least one y character.

JavaScript regular expressions are objects.

> typeof /xy+/
"object"

The advantage of the RegExp constructor function is that it allows runtime compilation of the expression. For instance, you could construct a regex pattern using a visual editor, assemble a string based on the visual editor, and pass it to the RegExp constructor as an argument during runtime.
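As a minimal sketch of runtime compilation, the snippet below assembles a pattern from two strings and passes it to the RegExp constructor. The helpers escapeRegExp and buildPattern are hypothetical names introduced only for this illustration; the escaping step is an assumption about what a tool assembling patterns from user input would need.

```javascript
// Hypothetical helper: escapes characters that have a special meaning in a regex pattern.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds a regex matching `prefix` followed by
// at least one occurrence of `repeated`.
function buildPattern(prefix, repeated) {
    return new RegExp(escapeRegExp(prefix) + '(?:' + escapeRegExp(repeated) + ')+');
}

const xyPlus = buildPattern('x', 'y');
console.log(xyPlus.source);          // x(?:y)+
console.log(xyPlus.test('yyxyy'));   // true

// The dot is escaped, so it only matches a literal dot.
const dotPlus = buildPattern('a', '.');
console.log(dotPlus.test('ab'));     // false
console.log(dotPlus.test('a..'));    // true
```

With a regex literal, the pattern has to be known when the code is written; the constructor form accepts any string computed while the program runs.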

RegExp methods

There are many use cases for regular expressions in JavaScript. You can identify these use cases by examining the public interface of regexes, describing the methods executable on RegExp objects.

exec executes a search returning information on a match.
test executes a search returning a boolean indicating if a match was found.
toString stringifies a regular expression.

> /xy+/.exec( 'yyxyy' );
{0: "xyy", index: 2, input: "yyxyy", length: 1}

> /xy+/.test( 'yyxyy' );
true

> /xy+/.toString()
"/xy+/"

Notice that exec returns the longest match xyy, and not the shortest match xy. This is because quantifiers such as + are greedy by default: they consume as many characters as they can while still allowing the whole pattern to match.

Notice in the above test example that from the point of view of finding a match, the xy+ search expression is equivalent to using xy. This is because any string that contains an x followed by at least one y also contains xy, and the other way around.

Therefore, in the method test, /xy+/ and /xy/ behave in the same way. In other methods, y+ finds the longest sequence of y characters.
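The difference between the two methods can be sketched as follows. The variable names are chosen for this illustration only.

```javascript
const text = 'yyxyy';

// test only reports whether a match exists, so both patterns agree.
const plusFound = /xy+/.test(text);
const singleFound = /xy/.test(text);

// exec reports what was matched, so the two patterns differ here.
const plusMatch = /xy+/.exec(text)[0];
const singleMatch = /xy/.exec(text)[0];

console.log(plusFound, singleFound);   // true true
console.log(plusMatch, singleMatch);   // xyy xy
```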

String methods accepting regular expressions

Some string methods accept regular expressions as arguments:

match executes a search in the string returning information on the first match. It’s like the exec method on RegExp objects, with the object and the argument exchanged.
search executes a search in the string returning the index of the first match. The returned index is -1 if the regex pattern cannot be found in the string.
replace executes a search in the string and replaces the first match, or every match if the regex has the global g flag.
split splits a string into substrings, using the matches of the specified regex pattern as separators.

Examples:

> s = 'xyyzxyzz';
> s.match( /xy+/ );
{ 0: "xyy", index: 0, input: "xyyzxyzz", length: 1 }

> s.search( /xy+/ );
0

> s.replace( /xy+/, 'U' )
"Uzxyzz"

> s
"xyyzxyzz"

> s.split( /xy+/ )
["", "z", "zz"]

Notice that replace does not mutate the original string, it just returns a brand new string with the replaced values.

The split method of strings is polymorphic in the sense that it accepts a string as well as a regular expression. In the latter case, the string is split at the longest possible matches.
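To make the contrast concrete, here is a small sketch that splits the same string with a string separator and with a regex separator, and also shows that replace leaves the original string untouched. The variable names are illustrative.

```javascript
const s = 'xyyzxyzz';

// A string separator is matched literally.
const byString = s.split('xy');
// A regex separator consumes the longest match of xy+ at each position.
const byRegex = s.split(/xy+/);

console.log(byString);   // [ '', 'yz', 'zz' ]
console.log(byRegex);    // [ '', 'z', 'zz' ]

// replace returns a new string; with the g flag every match is replaced.
const firstOnly = s.replace(/xy+/, 'U');
const allMatches = s.replace(/xy+/g, 'U');

console.log(firstOnly);    // Uzxyzz
console.log(allMatches);   // UzUzz
console.log(s);            // xyyzxyzz
```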

Regex modifiers

The second argument of the RegExp constructor function is the list of flags applied to the regular expression. These flags are called modifiers. For instance, for case-insensitive search, we can apply the i flag:

> x = new RegExp( 'x' )
/x/

> xX = new RegExp( 'x', 'i' )
/x/i

> x.test( 'XY' )
false

> xX.test( 'XY' )
true

The following modifiers are available in JavaScript:

i: case-insensitive matching. Upper and lower cases don’t matter.
g: global match. We attempt to find all matches instead of just returning the first match. The internal state of the regular expression stores where the last match was located, and matching is resumed where it was left in the previous match.
m: multiline match. It makes the ^ and $ characters match the beginning and the end of each line of the tested string. Lines are separated by \n or \r (the Unicode separators \u2028 and \u2029 also count).
u: unicode search. The regex pattern is treated as a sequence of Unicode code points.
y: sticky search. A match is only attempted at the position stored in the lastIndex property.
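Modifiers can be combined, and a regex object exposes them through read-only properties. The sketch below uses the standard flags, global, ignoreCase and multiline properties; the variable names are illustrative.

```javascript
// Several flags are passed as one string; their order does not matter.
const anyX = new RegExp('x+', 'ig');

console.log(anyX.flags);        // gi
console.log(anyX.global);       // true
console.log(anyX.ignoreCase);   // true
console.log(anyX.multiline);    // false

// With g and i together, all runs of x or X are found.
const runs = 'xXyX'.match(anyX);
console.log(runs);              // [ 'xX', 'X' ]
```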

Global matches

Let’s construct an example for global matching. We will find all sequences of x characters:

> regex = /x+/g;
> str = 'yxxxyxyxx';
> regex.exec( str )
{0: 'xxx', index: 1, input: 'yxxxyxyxx' }

> regex.lastIndex
4

> regex.exec( str )
{ 0: "x", index: 5, input: "yxxxyxyxx" }

> regex.exec( str )
{ 0: "xx", index: 7, input: "yxxxyxyxx" }

> regex.exec( str )
null

> str.match( /x+/ )
{0: 'xxx', index: 1, input: 'yxxxyxyxx' }

> str.match( /x+/g )
["xxx", "x", "xx"]

In the exec example, you can see that all three matches are returned one by one. After the third match, null is returned. The lastIndex property of the global regular expression stores the position where it needs to resume execution.

We have already learned that in case of string matching, the return value is similar to the first execution of regex.exec. However, in case of a global regex argument, the return value is an array containing all matches in sequential order.
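When the position of each match is needed as well, the repeated exec calls can be wrapped in a loop. This is a minimal sketch; findAll is a hypothetical helper name, and the reset of lastIndex and the guard against empty matches are precautions added for this illustration.

```javascript
// Hypothetical helper: collects the text and index of every match of a global regex.
function findAll(regex, str) {
    if (!regex.global) {
        throw new Error('findAll expects a regex with the g flag');
    }
    regex.lastIndex = 0;   // start from the beginning, whatever the earlier state was
    const results = [];
    let match;
    while ((match = regex.exec(str)) !== null) {
        results.push({ text: match[0], index: match.index });
        // An empty match would not advance lastIndex, so move on manually.
        if (match[0] === '') {
            regex.lastIndex++;
        }
    }
    return results;
}

const found = findAll(/x+/g, 'yxxxyxyxx');
console.log(found);
// [ { text: 'xxx', index: 1 }, { text: 'x', index: 5 }, { text: 'xx', index: 7 } ]
```

Without the g flag, exec would return the first match on every call and the loop would never end, which is why the helper rejects non-global regexes.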

Multiline matches

Let’s execute a multiline example. The regular expression

> xRow = /^x+$/m

matches a row of a possibly multiline string that contains only lower case x characters, and at least one of them. With the m flag, the ^ character indicates that the match has to start at the beginning of a row, and the $ character indicates that it has to end at the end of a row.

The match method returns the first row of the string, because it meets all the specified criteria.

> 'xx\nxXx\nxxxx'.match( xRow )
["xx", index: 0, input: "xx↵xXx↵xxxx"]

If we want to retrieve all matches, we have to add the global flag to the regular expression as well. This expression matches the first and the third row of the string, ignoring the second row:

> 'xx\nxXx\nxxxx'.match( /^x+$/mg )
["xx", "xxxx"]

Without the multiline flag, ^ and $ only match the start and the end of the whole string. As the newline characters in between are not equal to x, the string does not match the regular expression:

> 'xx\nxXx\nxxxx'.match( /^x+$/ )
null

As you can see, without the multiline flag, the ^ and $ characters indicate that the whole string may only contain characters specified by the pattern x+.
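The sketch below puts the three variants side by side and adds one more question that the flags alone do not answer: whether every row is valid. The variable names, and the choice to split on \n for the last check, are assumptions made for this illustration.

```javascript
const rows = 'xx\nxXx\nxxxx';

// With m, it is enough that some row consists of x characters only.
const someRowMatches = /^x+$/m.test(rows);
// Without m, the whole string has to consist of x characters only.
const wholeStringMatches = /^x+$/.test(rows);
// With m and g, match returns every row that qualifies.
const matchingRows = rows.match(/^x+$/mg);
// To require that every row qualifies, each row can be tested separately.
const allRowsValid = rows.split('\n').every(row => /^x+$/.test(row));

console.log(someRowMatches);       // true
console.log(wholeStringMatches);   // false
console.log(matchingRows);         // [ 'xx', 'xxxx' ]
console.log(allRowsValid);         // false
```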

ES6 Unicode Regular Expressions

In ES6, we can specify Unicode characters for matching. A Unicode character is treated as one character regardless of the number of bytes the character occupies:

> 'x'.codePointAt( 0 ).toString( 16 )
"78"

The hexadecimal code of the x character is 78. The corresponding Unicode escape in JavaScript is \u{78}. However, a regular expression containing this escape does not match the character itself, because without the u flag the pattern is read as the letter u repeated 78 times:

> /\u{78}/.test( 'x' )
false

This is why we need the u flag:

> /\u{78}/u.test( 'x' )
true

Another problem with Unicode characters is that their size in bytes may vary. For instance, the character "𪯍" has the Unicode value '\u{2ABCD}', which JavaScript strings store as two 16-bit code units.

Let’s construct a regular expression that checks if a string contains exactly one character. We can do this using the arbitrary character symbol (.). We specify that the string starts with this character and ends right after it, with nothing in-between: /^.$/.

Let’s test if the string '𪯍' matches this regex:

> /^.$/.test( '𪯍' )
false

The answer is no, because without the u flag the JavaScript regex engine interprets ‘𪯍’ as a sequence of two characters, one for each 16-bit code unit.

However, with the u flag, the long unicode character is recognized as one single character:

> /^.$/u.test( '𪯍' )
true
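The following sketch collects the observations of this section in one place. The character is written as the escape '\u{2ABCD}' so that the snippet does not depend on file encoding; the variable names are illustrative.

```javascript
// '\u{2ABCD}' is the character from the text above, written as an escape.
const ch = '\u{2ABCD}';

// The string holds two 16-bit code units, but only one code point.
const codeUnits = ch.length;
const codePoints = Array.from(ch).length;
console.log(codeUnits, codePoints);   // 2 1

// Without u, the dot matches one code unit, so two dots are needed.
const withoutU = /^.$/.test(ch);
const twoDots = /^..$/.test(ch);
// With u, the dot matches the whole code point.
const withU = /^.$/u.test(ch);
console.log(withoutU, twoDots, withU);   // false true true

// Without u, \u{78} is read as the letter u repeated 78 times.
const repeatedU = /\u{78}/.test('u'.repeat(78));
console.log(repeatedU);   // true
```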

Sticky matches

The y flag sets the lastIndex property of a regular expression after a match to the first character after the last matched sequence. If the last execution of the regular expression resulted in no matches, lastIndex is set to 0.

This is a mutation of the internal state of the regular expression. Always be aware of this side-effect!

When the y flag is on, the regular expression behaves as if it were anchored at lastIndex: a match has to start exactly at position lastIndex, and later positions in the string are not searched.

Example:

> regExp = /ab+/y
/ab+/y

> 'ababbabbb'.match( regExp )
{ 0: "ab", index: 0, input: "ababbabbb" }

> regExp.lastIndex
2

> 'ababbabbb'.match( regExp )
{ 0: "abb", index: 2, input: "ababbabbb" }

> regExp.lastIndex
5

> 'ababbabbb'.match( regExp )
{ 0: "abbb", index: 5, input: "ababbabbb" }

> regExp.lastIndex
9

> 'ababbabbb'.match( regExp )
null

> regExp.lastIndex
0

> 'ababbabbb'.match( regExp )
{ 0: "ab", index: 0, input: "ababbabbb" }

> regExp.lastIndex
2

// ...
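A typical use of sticky matching is reading a string token by token, where skipping over unexpected characters would be a bug. The sketch below is one possible way to do this, not a definitive implementation; tokenize is a hypothetical helper, and it tracks the position in its own variable because a failed match resets lastIndex to 0.

```javascript
// Hypothetical tokenizer: splits the input into consecutive "ab+" groups
// and rejects anything that does not fit.
function tokenize(input) {
    const token = /ab+/y;
    const tokens = [];
    let position = 0;
    while (position < input.length) {
        token.lastIndex = position;        // the match must start exactly here
        const match = token.exec(input);
        if (match === null) {
            throw new SyntaxError('Unexpected input at position ' + position);
        }
        tokens.push(match[0]);
        position = token.lastIndex;        // first character after the match
    }
    return tokens;
}

console.log(tokenize('ababbabbb'));   // [ 'ab', 'abb', 'abbb' ]

try {
    tokenize('abxab');
} catch (error) {
    console.log(error.message);       // Unexpected input at position 2
}
```

A global regex such as /ab+/g would silently skip the x in 'abxab' and report two tokens, while the sticky version stops at the first position where no token starts.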

Summary

Regular expressions in JavaScript have some features worth experimenting with. Some of these features are specific to the JavaScript regular expression virtual machine. Other features are shared with other languages.

We didn’t focus on the exact syntax of the regex patterns here, because we will learn the exact rules at a later stage. The metasyntax characters you saw in the examples, such as + (at least one), . (exactly one arbitrary character), ^ (match the start of the string), and $ (match the end of the string), act as teasers for the capabilities of regular expressions in most languages.

You have learned that regular expressions are objects in JavaScript, and they are integrated into some String methods as well.

Together, the RegExp public interface and the String methods allow testing a string, finding the first match, finding all matches, replacing substrings, and even splitting strings.

To perform some of the above-mentioned use cases, we can use modifiers such as the global g modifier, the sticky y modifier, or the multiline m modifier.

Two more modifiers make it more convenient to process strings: i makes matching insensitive to upper or lower cases, while u makes the JavaScript regex virtual machine handle Unicode characters properly.
