Parse with regular expressions

First published at Wednesday, 7 November 2007

Warning: This blog post is more then 15 years old – read and use with care.

Parse with regular expressions

As mentioned earlier, you can not parse any recursive structures using regular expressions, because they miss some essential features, like a stack, recursion, or similar.

But in this blog post I also told, that PCRE implements a superset of regular expressions, so that you are able to "parse" more languages, then just regular ones. If someone could send me some proof, which language types can be matched using regular expresions, I would be happy - maybe just implement a turing machine with PCRE regular axpressions?

Parse recursive structures

During my talk, I gave to day about regular expressions, I showed a example regular expressions which is able to validate if some BBCode document is valid, which means, it only has matching tags, and a correct tree structure of tags:

( ( [^\[\]]* \[([a-z]+)(?:=([^\]]+))?\] (?# The actual recursion ) (?>[^\[\]]* | (?R) ) \[/\2\] [^\[\]]* ) )ix

If you don't understand this intuitively, you should probably attend on of my talks on regular expressions, or just try to help yourself reading the manual section on PCRE.

So what happens, passing some example strings to this regular expressions?

'Some [b] longer [i]text[/i][/b].' => bool( true ) 'Some [b] longer [i]text[/b][/i].' => bool( false ) 'Some [b] longer [i]te [u] xt[/i][/b].' => bool( false )

As you can see, at least simple validations pass. I did not test this extensively, so you may find some problems when validating more complex documents.

Parse the braces language

In my ealier blog post about this, I mentiond the trivial recursive example, of a language, which consists of any number of opening braces, followed by the same number of clsing braces. This simple language can also be validated, and the manual even shows this as an example.


Do not use this

Why, you ask?

That you are not get what the regular expressions do at the first glance should be answer enough. This stuff is neither readable, nor maintainable or debugable. And even, if you do so, the next one reading your code won't be able to do so - or would need hours to find out.


open source at Wednesday, 5.12. 2007

This regulal expresion looks so complicated for me ;)