Do not try using regular expressions for parsing - Kore Nordmann

Do NOT try parsing with regular expressions

Some days ago somebody joined a german IRC channel where we try to help others using PHP, where I hang around for some years now, and had some problems with his regular expressions he tried to use to "parse" BBCode he wants to use in his custom application. I answered something like:

It is impossible to parse a language like BBCode with regular expressions because you only may parse regular languages using regular expressions.

The usual reaction: No reaction. No belive. Some days later nearly the same question again, from the same guy, same answer by me, same result. No, not really the same result, I decided to try to summarize the reasons for this in a small blog post.

If you are just looking for a way to "parse" HTML, I added a new blog entry, which describes how to do that: "Extracting data from HTML".

What are BBCodes?

BBCodes, as you could see from the wikipedia link mentioned above, are the same class of language like HTML. In my eyes it does not make any sense not to use (X)HTML, but a markup language which has exactly the same complexity in terms of parsing and typing of the user. BBCode is nothing more then a subset of HTML using [] insted of <>. Stupid, but OK, we now have this kind of markup.

Having a language like HTML, more or less XML, we get the same problems here. If you want to do the parsing properly you need to handle recursive tag structures, need to check for matching tags etc.

You may of course just replace the BBCode with the according HTML elements, but the you might not only end up with invalid code, but also with user provided inlined HTML which destroys the layout of your website.

The second thing you might do is just checking for a closing tag for each opened tag, or just replace BBCodes, which also have a closing tag, but you still easily find examples where it creates invalid markup, even it might not always destroy your layout, but result in undefined behaviour of the displaying browser.

[b] hello [i] [/b] world [/i] [b] ! [/b]

You may also check for this using a regular expression, but trust me, I will always find a way to bypass your checks - the next paragraph will prove this.

Parsing recursive structures with regular expressions

It is quite easy to prove, that it is not possible to properly detect and parse recursive structures using regular expressions. When you have studied computer science you of course know Chomsky hierarchies and therefore know, that regular expressions are a type 3 grammar, also called regular languages, which are equivalent to finite state machines. (So you also know, that each regular expression is quite easy to transform into a FSM).

So lets take a step further and prove, that recursive structures are not parseable. Using the pumping lemma you may not prove, that a language is a regular language, but you can prove, that a language is not a regular language.

Talking about rekursive stuff lets take one of the simplest possible examples - the braces language. You may think of this like opening and closing elements instead of opening and closing braces. Consider the laguage:

L = { (^m )^m; m >= 1 }

All words, where the number of opening braces equals the number of closing braces. Then one number n exists, so that every word x, which is element of this language, and has the length 2 * n can be stripped into:

x = u v w = (^n )^n

Where u, v and w are words fulfilling the following conditions:

The pumping lemma now says the that each word of the following definition always is element of the defined language, when L is a regular language.

u v^k w; k >= 1

With this definitions you for example get something like:

u = (^( n - i ) v = (^i w = )^n

Using for example k = 2 and i = 1, you get a word which is not element of the language we defined at the beginning:

x = (^( n + 1) )^n

The numbers of opening braces obviously do not match the number of closing braces any more, so that this word is not a member of the prior defined language. Thats all you need to prove the point.

I can match this language with PCRE!

Yes, you can. PCRE can do more then regualar expressions can do by definition. The class of languages you can detect with PCRE regular expressions is, as far as I know, not clearly defined. The number of back references is limited (as the memory is in every computer, but 99 back references are far less, then some GBs of memory for your parser ;), and so on.

Some people say, that regular expressions are never readable, even I disagree here, I would say writing a real parser using PCRE regular expression will definitely be unreadable. But I will evaluate this in further detail and write a follow up on this.

Where you can use regular expressions in your parser

When you really want to parse BBCodes write a common compiler using syntax trees etc. - or simply use a common rich text editor and strip down the resulting (X)HTML to the set you want to allow in your application. DOM and similar PHP extensions do a great job here - and they ensure valid markup.

When writing a custom parser you can - and should - use regular expressions in the tokenizer. They do a great job here. To rephrase it:

Use regular expressions to recognize words, not structures.

Using BBCode in your application

As written earlier - I personally do not see any sense in using BBCodes and even wiki markup, because their complexity equals the complexity of HTML, they are another language to learn for the user, they differ in details in each implementation, and do not reuse potential knowledge of the user about HTML. And even, if the user does not know HTML yet, why should it be harder to learn your current subset, then your special BBCode implementation? Are <> really so much harder to type then []? On german keyboards they are even a lot easier to type...

But if you really want to use such stuff, there are existing packages which help you using them in your application, so that you can reuse existing working parsers and not end up with not working regular expression experiments, and continue investing hours in your application to fix some funny edge cases. Some examples: