Regular expressions (or regexes as they're fondly known) have long been a mainstay of the Perl world. Perl is a hotbed of regex innovation, giving reluctant (non-greedy) quantifiers and zero-width assertions to the world, to name only two. These innovations have been enthusiastically embraced by other languages, and have even been incorporated nearly wholesale into Java.
Interestingly, some of the current crop of experimental additions seem to be less popular; notably the constructs that allow Perl code to be embedded in expressions. There are several reasons for that: the implementation is still rather buggy, the security-related caveats are offputting and there's a sense that code maintainability will suffer. Also, these extensions are very specific to Perl, and could never be portable to other languages.
We propose a new extension, simple but very powerful, which subsumes most current uses of the (??{...}) operator. Our extension is fully portable, and has been implemented as a modification to the PCRE library.
We add a new construct (?nnn), where nnn is a number which refers to some bracketed group which has not yet been closed in the regex. This construct simply matches anything which would be matched at that point by the group in question.
The effect of this is something like that of a recursive procedure call. It means that you can, in certain cases, emulate the effect of a regex which is infinitely long.
Currently the (??{...}) operator is often used to emulate just this sort of recursion, though it's clumsy. A typical usage is something like:
my $bal = qr{([^()]|\((??{$bal})\))*};
print "Balanced!\n" if /^$bal$/;
which matches just those strings whose parentheses are balanced. With our extension, that circumlocution can be replaced by the single regex
/^([^()]|\((?1)*\))*$/
Here we offer some simple examples, to give a taste of the power that we now have. All these examples have been tested with the implementation described below.
Any kind of nested construct can be matched. For instance, you could write a regex to check the well-formedness of XML documents. (In fact, Juerd at Perl Monks has done exactly that!) As a slightly larger example, here's an expression to match strings which are balanced with respect to each of four different kinds of bracket:
m{
^(
[^()[]{}<>]
| \( (?1)* \)
| \[ (?1)* \]
| \{ (?1)* \}
| \< (?1)* \>
)*$
}x;
Using back referencing as well, it's possible to achieve some particularly powerful effects. Here, for example, is a regex which only matches palindromes:
/^\W*(?:((.)\W*(?1)\W*\2|)|((.)\W*(?3)\W*\4|\W*.\W*))\W*$/i
That will match the strings "Satan, oscillate my metallic sonatas!" and "A man, a plan, a canal: Panama", but not "Puffins live in holes." The next example matches well-formed bracketed arithmetic expressions like -12 and (((2+2)*-3)-7).
/^(\d+|\((?1)([+*-])(?1)\)|-(?1))$/
The thoughtful reader may have wondered about the possibility of infinite recursion; and it would certainly seem that a pattern like /((?1))/ will loop forever. This sort of problem is prevented by forbidding left recursion. We stipulate that a recursive call can not be made unless the pattern that's being called has already consumed at least one character since it was last entered.
In the current implementation we detect at compile time whether an expression is potentially left recursive, and if it is we reject it. It's always possible to rewrite your expression to pass this test, so this doesn't limit the expressive power.
We would have liked to implement this extension to Perl's regular expression engine, but were deterred by the notoriously baffling implementation. We chose instead to patch Philip Hazel's infinitely more grokkable PCRE library, which is a C library providing Perl-compatible regexes. The patch (which, encouragingly, took only a day to write from a standing start) is available here.
The next version will detect left recursion at compile time.
We hope that by now you're itching to try out this new toy. Here's how.
tar zxf pcre-3.9.tar.gz
cd pcre-3.9
patch -p0 <../PATCH-0.5.pcre
./configure
make
./pcretest
PCRE version 3.9 02-Jan-2002
re> /^\W*(?:((.)\W*(?1)\W*\2|)|((.)\W*(?3)\W*\4|\W*.\W*))\W*$/i
data> Satan, oscillate my metallic sonatas!
0: Satan, oscillate my metallic sonatas!
1: <unset>
2: <unset>
3: Satan, oscillate my metallic sonatas
4: S
data> Puffins live in holes.
No match
Traditional regular expressions have a useful property called compositionality. Put in Perl terms, that means that if $s1 =~ /^$r1$/ and $s2 =~ /^$r2$/ then "$s1$s2" =~ /^$r1$r2$/. Practically, it means that you can compose your patterns from smaller building blocks. That makes it much easier to keep track of the parts of a large and complex expression.
Back references are a crime against compositionality. If you concatenate two regular expressions, then the groups in the second one will probably get renumbered and any back references will stop working. There's also a pernicious kind of action at a distance: adding a new pair of parentheses to one part of a pattern can break a totally different part, because the group numbers get out of whack. Both these problems can make back references difficult to work with.
This crime is compounded by our new construct, which also relies on numbered groups. The cure for this ill is to allow capturing groups to be named rather than numbered, and to decree that such a name remains in scope until another group is encountered using the same name. This truth was revealed to the Pythonistas some time ago.
We won't go into this too much, although the author's use of the plural to refer to himself is a sign that he has been reading too much theoretical literature recently.
The most important observation is that all the context-free languages can be represented using this extension. That includes the majority of languages you might want to match: anything that can be parsed using Parse::RecDescent, for example. That's not entirely obvious, but we think it's true. We're working on writing up the proof properly. A sketch for cognoscenti: a CFG can be thought of as a system of equations over the complete lattice of languages. The language under consideration is the (unique) least solution to the system of equations. This solution is expressible using the fixpoint operator µ, using Gauss elimination to reduce the system to a single µ-expression. And our recursive operator is an implementation of the fixpoint construct (modulo left recursion, but left recursion can be eliminated from the CFG at the start, so that's okay).
Another interesting observation is that the above proof sketch actually defines an algorithm for converting an arbitrary context free grammar into a recursive regex. It would be interesting to implement this algorithm.
This work was initially inspired by the book Rudiments of µ-Calculus (Arnold and Niwinski, Elsevier 2001) - you can think of a recursive call as being a fixpoint operator, as mentioned above. The whole algebraic approach to formal language theory has been hugely influential. Respect to its originators: John Conway, Dexter Kozen, Arto Salomaa, and many more.
Thanks to Philip Hazel for PCRE, Larry Wall for Perl, and Ilya Zakharevich for doing so much crazy work on Perl's regexes. Mad respect to Stephen Kleene, who started this whole "regex" business in the first place, back in 1956.
Robin Houston / robin@cpan.org