Do NOT try parsing with regular expressions

First published at Friday, 27 July 2007

Warning: This blog post is more then 19 years old – read and use with care.

Do NOT try parsing with regular expressions

Some days ago somebody joined a german IRC channel where we try to help others using PHP, where I hang around for some years now, and had some problems with his regular expressions he tried to use to "parse" BBCode he wants to use in his custom application. I answered something like:

It is impossible to parse a language like BBCode with regular expressions because you only may parse regular languages using regular expressions.

The usual reaction: No reaction. No belive. Some days later nearly the same question again, from the same guy, same answer by me, same result. No, not really the same result, I decided to try to summarize the reasons for this in a small blog post.

If you are just looking for a way to "parse" HTML, I added a new blog entry, which describes how to do that: "Extracting data from HTML".

What are BBCodes?

BBCodes, as you could see from the wikipedia link mentioned above, are the same class of language like HTML. In my eyes it does not make any sense not to use (X)HTML, but a markup language which has exactly the same complexity in terms of parsing and typing of the user. BBCode is nothing more then a subset of HTML using [] insted of <>. Stupid, but OK, we now have this kind of markup.

Having a language like HTML, more or less XML, we get the same problems here. If you want to do the parsing properly you need to handle recursive tag structures, need to check for matching tags etc.

You may of course just replace the BBCode with the according HTML elements, but the you might not only end up with invalid code, but also with user provided inlined HTML which destroys the layout of your website.

The second thing you might do is just checking for a closing tag for each opened tag, or just replace BBCodes, which also have a closing tag, but you still easily find examples where it creates invalid markup, even it might not always destroy your layout, but result in undefined behaviour of the displaying browser.

[b] hello [i] [/b] world [/i] [b] ! [/b]

You may also check for this using a regular expression, but trust me, I will always find a way to bypass your checks - the next paragraph will prove this.

Parsing recursive structures with regular expressions

It is quite easy to prove, that it is not possible to properly detect and parse recursive structures using regular expressions. When you have studied computer science you of course know Chomsky hierarchies and therefore know, that regular expressions are a type 3 grammar, also called regular languages, which are equivalent to finite state machines. (So you also know, that each regular expression is quite easy to transform into a FSM).

So lets take a step further and prove, that recursive structures are not parseable. Using the pumping lemma you may not prove, that a language is a regular language, but you can prove, that a language is not a regular language.

Talking about rekursive stuff lets take one of the simplest possible examples - the braces language. You may think of this like opening and closing elements instead of opening and closing braces. Consider the laguage:

L = { (^m )^m; m >= 1 }

All words, where the number of opening braces equals the number of closing braces. Then one number n exists, so that every word x, which is element of this language, and has the length 2 * n can be stripped into:

x = u v w = (^n )^n

Where u, v and w are words fulfilling the following conditions:

Length of the word uv is less or equal n
Length of the word uvw is of course two times n

The pumping lemma now says the that each word of the following definition always is element of the defined language, when L is a regular language.

u v^k w; k >= 1

With this definitions you for example get something like:

u = (^( n - i )
v = (^i
w = )^n

Using for example k = 2 and i = 1, you get a word which is not element of the language we defined at the beginning:

x = (^( n + 1) )^n

The numbers of opening braces obviously do not match the number of closing braces any more, so that this word is not a member of the prior defined language. Thats all you need to prove the point.

I can match this language with PCRE!

Yes, you can. PCRE can do more then regualar expressions can do by definition. The class of languages you can detect with PCRE regular expressions is, as far as I know, not clearly defined. The number of back references is limited (as the memory is in every computer, but 99 back references are far less, then some GBs of memory for your parser ;), and so on.

Some people say, that regular expressions are never readable, even I disagree here, I would say writing a real parser using PCRE regular expression will definitely be unreadable. But I will evaluate this in further detail and write a follow up on this.

Where you can use regular expressions in your parser

When you really want to parse BBCodes write a common compiler using syntax trees etc. - or simply use a common rich text editor and strip down the resulting (X)HTML to the set you want to allow in your application. DOM and similar PHP extensions do a great job here - and they ensure valid markup.

When writing a custom parser you can - and should - use regular expressions in the tokenizer. They do a great job here. To rephrase it:

Use regular expressions to recognize words, not structures.

Using BBCode in your application

As written earlier - I personally do not see any sense in using BBCodes and even wiki markup, because their complexity equals the complexity of HTML, they are another language to learn for the user, they differ in details in each implementation, and do not reuse potential knowledge of the user about HTML. And even, if the user does not know HTML yet, why should it be harder to learn your current subset, then your special BBCode implementation? Are <> really so much harder to type then []? On german keyboards they are even a lot easier to type...

But if you really want to use such stuff, there are existing packages which help you using them in your application, so that you can reuse existing working parsers and not end up with not working regular expression experiments, and continue investing hours in your application to fix some funny edge cases. Some examples:

Comments

Whisller at Saturday, 28.7. 2007

I exactly think like You. I don't understand why users think [b] or string is better than <b>. I'm using html purifier, I think it is good library - but now I don't have a time to read documentation and code files.

Martin Holzhauer at Saturday, 28.7. 2007

There is also an bbcode Extension[1] in Pecl in an early development state

http://pecl.php.net/package/bbcode/

kore at Saturday, 28.7. 2007

I see some sense in using foo and _bar_ for markup, because this established in the Usenet, maybe even before HTML was used, and is still used in a lot of text base applications, like Mail and IRC.

... but you just could keep those, because everyone will understand this markup even without conversion to HTML, I think.

Toby at Saturday, 28.7. 2007

Wer hat Dich denn geärgert? ;)

Christopher Hogan at Sunday, 29.7. 2007

I concur, I use a form of markdown if the client requests it in the CMS. But my software also provides a secondary option using tinyMCE for clients that want WYSIWYG editing functionality. I think wysiwyg is WAY more simple for clients. Since they TinyMCE Moxicode editor displays html pretty much the same way dreamweaver woud, the only other steps I take to ensure valid XHTML are:

use DOM to validate
using the parse tags functionality in DOM I check the content in the tags for ISO-8859 extended char sets and filter them (htmlentities function and htmlspecialchars functions won't replace them into entities correctly). I do this by using PHP 5.2 and the new filter_input/filter_var/filter_array functions
encode ampersands/ addslashes or check for / use magic quotes if I'm putting it in an RDBMS.
My version of the backend code for TinyMCE uses AJAX to update elements on parts of the page dynamically and re-export the files as regular old HTML.

Ammar Ibrahim at Sunday, 29.7. 2007

Exactly! This is also similar discussion to templating engines, where they try to simplify if statements and such. The thing they don't understand is that it makes things harder. Wiki syntax is very hard compared to HTML, I still don't understand why it was invented.

But the only advatnage of BBCode is the protection against XSS, it's very easy to strip out all HTML, and then convert BBCode to HTML. But if you allow users to write HTML, sanitizing HTML is extremely complicated.

your mother at Tuesday, 28.8. 2007

i hope you are brushing your teeth dear!

Boris at Wednesday, 29.8. 2007

Actually there is a good point for using BBCode instead of a small html subset, it is much easier to avoid CSS-attacks.

Kore at Wednesday, 29.8. 2007

@Boris: No, actually not. You can just do the same white listing approach.

With BBCode you only convert a subset, while you just allow a very specific subset with (X)Html - in most cases just do not allow any attributes and you are safe.

_p_ at Friday, 31.8. 2007

Isn't it a secure way to generate markup? strip tags and then replace bbcode with regular expressions to have the control over what we generate ((X)HTML etc.).

Kore at Saturday, 1.9. 2007

@_p_: Please actually read the blog post before commenting. This would help you to answer properly. Regular expressions are just not capable of converting BBCode to HTML.

Evert at Monday, 3.9. 2007

I use regular expressions for parsing non-valid html.. I do need some PHP around it to aid it though :)

I do think BBCode is more secure than trying to clean html. Black lists for HTML almost never are 100% waterproof, just because there are so many variations. By using BBCode, everything that wasn't handled correctly by the parser will simply be spit out as their BBCode.. e.g.: [script] is harmless. Making the same mistake with cleaning html would produce <script>.

Academically it could be just as secure, but when you consider that their might be bugs in software (and there always is) BBCode provides an extra fallback..

That doesn't make it pretty though, therefore I would still stick with HTML or a wiki-like syntax to do this stuff.

Mike Seth at Monday, 3.9. 2007

Hahahahahahaha Kore!

I've just written a post saying the same a couple of days ago! Indeed you were much more verbose this time! :)

here it is:

http://blog.mikeseth.com/index.php?/archives/1-For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS.html

dirt at Thursday, 25.10. 2007

im using a wysiwyg js to format posts by users (in place of a textarea), and its fast and already set up for bbcode, i came across this post when researching the best way to take the bbcode in the database/posting and change it to html for output...

if you are saying regex isnt the way to go, then what is the most efficient way to format user posts? i looked at patBBCode but i don't have access to install anything on the server itself and if im correct that package needs to be installed via PEAR along with patError.. correct me if i'm wrong.

allowing straight html doesnt seem smart, so i figured a selected number of bbcode ([b][i][u][code][quote][color=#ff0000][url][img]) stored in the db/post and translating it when outputting would be the way to go...

what to do what to do..

sanjuro at Thursday, 28.2. 2008

This is good to know, I was getting a headache trying to parse bbcode with regular expressions. I'm still however looking for a simple alternative, these parsers seem a bit over-the-top.

wangtang at Friday, 29.2. 2008

Well, there are actually some more arguments for BBCode you don't mention. For once, people are used to it. Of course it would be desirable that people used and knew html, but they don't. They do however know BBCode, mostly the younger ones are accustomed to using BBCode in forums. You also mention "[b] hello [i] [/b] world [/i] [b] ! [/b]" as an example. With <> instead of [], this wouldn't even be valid html-code, so there's no need to correct user input failures in the BBCode case (where I use span's anyway), at least in my book. As for parsing BBCode with regular expressions: I've seen many sites which use a very limited set of BBCode, where this argument "It is quite easy to prove, that it is not possible to properly detect and parse recursive structures using regular expressions." does only apply in 0,01% the cases. On top of that: It's easy to implement. Not perfect, but you don't have to read yourself into libraries parsing with syntax trees. And the "easy" argument dominates all others in many cases. While it surely is desirable to apply the theoretics of Informatics to applications, pragmatically you can always accomodate the methods you're using to the demands which need to be met.

In the end, wysiwyg's will sooner than later eat "normal" forms with BBCode or html, as they are comfortable to use for the end-user, and easy to implement for webdesigners. Most of the newer wysiwyg's do output valid xhtml, so that's where it will go. In the meantime, I don't see html replace BBCode in forms, and regexps are a simple way to deal with a lot of the easier task tags.

wangtang at Friday, 29.2. 2008

P.S.: I know I sound like "let's use h1 for big text and address for italic text, and a table for layouting our site". But since we're not doing markup here, I'm still on the regexp side ;)

Kristjan Siimson at Tuesday, 7.10. 2008

Thank you for this post. I was trying to do the same thing because it seems to make sense to use this kind of approach. That saved me a lot of time. But I'd like to point out that while writing [b] instead of <b> does not make life much easier, such codes can become useful when things get more complex, such as when adding flash objects.

Ab at Monday, 19.1. 2009

Good post about regEx, but shame your giving such bad advice in regards to forum/whatever-a-user-can-post-markup-in security. The main reason BB exists is because it is way more secure to disallow all HTML and have a small, totally under your own control set of Codes you can change to HTML yourself then allowing all HTML tags.

I have a feeling you know a lot about RegEx, but are rather clueless about secure programming and not trusting user-input.

Kore at Monday, 19.1. 2009

@Ab:

Please reread the text. Validating HTML against a white list of allowed elements and structures makes it possible to entirely validate user input given as HTML. There is no magic behind that.

On the other hand lots of broken, dumb BBCode-"parsers" directly convert attribute values into HTML, which can lead into script injection.

BBCode are no step for securing user input, this happens completely independent from the markup language. And you always need to validate the entire structure to properly validate input, which is impossible with regular expressions for BBCode like shown above.

David King at Tuesday, 3.3. 2009

Personally, I think that there's no real argument against HTML or BBCode parsing and security just so long as you do it right. With BBCode it's trivial to prevent inline javascript events (like adding onclick to a bold tag) because only [B] would be parsed and [B onclick="evil"] would be ignored. That leaves only [LINK] tags susceptible to attack - and decent sanitation there is a complete doddle.

For me, the reason to use BB-style over HTML is that you keep the user from thinking "ooh, HTML - lets get creative" then have them disappointed because you only support a small set of tags (STRONG, EM, U for example) leaving them thinking "bollocks, how can I edit my comment"? However when presented with BB-style they will know immediately that it is a simpler set of tags and to either use the given reference or provided toolbar

hari at Sunday, 12.7. 2009

Hi Kore, I read your arguments, but from a practical perspective, it doesn't seem to make sense to write a WHOLE parser - scanner for merely adding a few small tags for the end user. Especially for simple applications where you just need bold, italic, underline, quoting and code block.

I DON'T want to write an entire lexical parser, scanner application or use the DOM approach in a PHP script to just convert a small subset of BBCode tags to their HTML equivalents.

I DON'T want to allow pure HTML or even a stripped down sanitized version of HTML because there is no 100% guarantee that it is safe.

Can not regular expressions be used as an acceptable, but less than perfect alternative? Enforcing correct "structure" of tags is just too much to expect. I can correct wrongly nested tags from user input manually if need be. It saves far more time than a full fledged lexical analyzer-cum-parser.

Faltzer at Monday, 23.11. 2009

Late response, but I see this as worth mentioning:

If the problem with parsing is nesting and the misuse of regular expressions, then try parsing BBCode with SUIT's parser (which is stack-based, not dependent on any regular expressions).

For starters, the "Hello world" example you provided does not work on SUIT, because SUIT ignores bad nesting. By the [/i], the stack will be empty, and the [ b] ! [/ b] will be parsed correctly.

In case you want to have a look at how the parser works (it's heavily commented): http://suitframework.svn.sourceforge.net/viewvc/suitframework/suit/trunk/suit/suit.class.php?revision=33&view=markup

John @ SEO Software at Sunday, 6.12. 2009

I heard lots of people say regex is bad idea to parse html, actually it is true for complex regex one time when i was parse a page it took the regex about 15 minutes to collected matches while a simple loop did it instantly for me. however i think regex is a very good idea when it comes to simple tasks like parsing emails or so...more simple tasks.

Nikita Popov at Friday, 22.1. 2010

Sorry, but i really do disagree with what you say. I think using RegExp or not is only a question on how good you are at RegExp and a question of performance.

PHPs Implementation of regular expressions (I talk about PCRE) does support recursion.

Here's an example for a quote-BB-Code:

#\[quote]((?>[^[]|\[(?!/?quote])|(?R))+)\[/quote]#si

In my eyes it's fairly obvious what it does and I made a little performance check to: Running the RegExp one million times (preg_replace_callback is used for replacing) takes from three to ten seconds, depending on the complexity of the input stream. Obviously, hundrets of encapsulated [quote]s would need a little bit more processing time.

The above regexp could be changed to parse all BB-Codes:

#\[([a-z]+)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si

Now, this is one line of code and it does all you want. You only need to call it recursively with preg_replace_callback and do the correct replacement, depending on what is in $matches[1] (bb-code)

Kore at Friday, 22.1. 2010

@Nikita Popov: PCRE are no regular expressions any more. I covered parsing with PCRE in another blog post:

https://kore-nordmann.de/blog/parse_with_regexp.html

It is still meaningless to try to parse with PCRE, since you get no AST or proper match arrays from matching the string. You can only tell if the string is valid markup, nothing else. You can't get no proper error reporting, either.

Nikita Popov at Friday, 22.1. 2010

Wow, fast answer.

Obviously PCRE has very little in common with what the computer scientists call regular expressions. But if talking about PHP RegExp always refers to PCRE.

Sure you can parse with regex, it's only little bit slow, but on the other hand much easier to implement then some complicated parser / lexer / tokenzier stuff.

Here some example sourcecode for the bb-code [ident] (which has no meaining at all):

function doit(&$matches) {
        if(is_array($matches)) {
                if(stripos($matches[1], '[ident]'))
                        return '{IDENT}' . doit($matches[1]) . '{/IDENT}';
                return '{IDENT}' . $matches[1] . '{/IDENT}';
        }
        $regexp =  '#\[ident]((?>[^[]|\[(?!/?ident])|(?R))+)\[/ident]#si';
        return preg_replace_callback($regexp,'doit',$matches);
}

Only argument against it is performance.

In reality you maybe would use this one instead:

#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si

and would use distinguish several replacements, depending on the bb-code.

Kore at Friday, 22.1. 2010

@Nikita Popov:

a) You still don't get proper matches for recursively stacked tags, or am I missing something here? This only works for tags, which are not stacked. b) You still don't get proper error reporting, f.e. "[a] ... [b] .. [/b]" would just leave the [a] unmatched & silently ignored.

Using the input string "[foo] ... [i]Hello [b]World[/b][/i]. Hello [i]internet[/i]." with your (adapted) example code results in "[foo] ... {IDENT}i{/IDENT}. Hello {IDENT}i{/IDENT}.". And from the values received in the callback function it at least always seemed to me, that there is no way to properly handle recursive markup this way.

On the other hand, you are also calling your own regular expression matching function recursively yourself, which could already be considered a simple "parser". This example shows the problem: http://k023.de/parse.txt

The outer stuff is never properly reported
The markup error is ignored, and you cannot use any anchors like A or ^ to prevent from this.

Writing a proper lexer & parser is not really hard on the other hand, and results in far more readable code. See this presentation [1] or this code [2] as an example.

[1] /talks/09_08_22_parsing_with_php.pdf [2] http://svn.ez.no/svn/ezcomponents/trunk/Document/src/pcss/parser.php

Nikita Popov at Friday, 22.1. 2010

The code I posted above works only with the first regular expression. For the second it must be adapted.

Here some (untested) code how it could work:

function doit(&$matches) {
    if(is_array($matches)) {
        $start, $end;
    if($matches[1] == 'b') {
        $start = '<strong>';
        $end = '</strong>';
    }
    elseif($matches[1] == 'i') {
        $start = '<em>';
        $end = '</em>';
    }
    elseif($matches[1] == 'quote') {
        $start = '<blockquote>';
        $start = '</blockquote>';
    }
    $content = $matches[1];
    if(preg_match('#\[(?>b|i|quote)]#si', $content)) // for optimisation, for not applying th whole regex if no further nesting is there
        $content = doit($content);

        return $start . $content . $end;
    }

    $regexp = '#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si';
    return preg_replace_callback($regexp,'doit',$matches); // this is for initialization and recursion later
}

I haven't tested it, therefore I don't know whether it works or not. But you should get the basic idea. (I'll test later how well this works...)

This one would (or should) replace doit('[foo] ... [i]Hello [b]World[/b][/i]. Hello [i]internet[/i].') By '[foo] ... <em>'.doit('Hello [b]World[/b]').'</em>. Hello <em>internet</em>.' and '[foo] ... <em>Hello<strong>World</strong></em>. Hello <em>internet</em>.' in the end.

> a) You still don't get proper matches for recursively stacked tags, or am I > missing something here? This only works for tags, which are not stacked.

Sorry, my English isn't that good, I don't understand what you want to say with "recursively stacked tags".

> b) You still don't get proper error reporting, f.e. "[a] ... [b] .. [/b]" would just > leave the [a] unmatched & silently ignored.

Yep. This is intended behavior in my eyes.

I will have a look at your links...

Nikita Popov at Friday, 22.1. 2010

Ooops, im sorry, mixed up $matches[1] and $matches[2] in the code above.

This one works (tested):

function doit(&$matches) {
    if(is_array($matches)) {
        $start = $end = '';
        if($matches[1] == 'b') {
            $start = '<strong>';
            $end = '</strong>';
        }
        elseif($matches[1] == 'i') {
            $start = '<em>';
            $end = '</em>';
        }
        elseif($matches[1] == 'quote') {
            $start = '<blockquote>';
            $start = '</blockquote>';
        }
        $content = $matches[2];
        if(preg_match('#\[(?>b|i|quote)]#si', $content)) // for optimisation, for not applying th whole regex if no further nesting is there
            $content = doit($content);

        return $start . $content . $end;
    }

    $regexp = '#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si';
    return preg_replace_callback($regexp,'doit',$matches); // this is for
initialization and recursion later
}

Anonymous at Tuesday, 13.4. 2010

BBCode isn't entirely useless, as it allows bbcode implementers to design custom tags without forcing posters to use invalid html tags. A common example of this is the [spoiler] tag, which makes text invisible until a mouseover. If you tried to implement this on a site that allowed posters to use a subset of HTML (<b><i> etc), having users type in <spoiler> could cause malformed XML. This can be avoided, of course, by sanitizing the comment right after submission, but why would you want to confuse people by having them refer to a nonexistent HTML tag anyways?

You can, of course, debate the importance of a spoiler tag, but it's just an example. If you're going the route of custom tags, it's better to go out and say that you're non-standard from the beginning than to pretend that you're using valid HTML.

Anon Cow at Wednesday, 14.4. 2010

Worth pointing out that if you allow a whitelist of HTML rather than using BBCode you will inevitably get someone complaining that they can't use a certain tag, especially in forum/comment and other public style software. So, from a usability perspective BBCode beats HTML because it manages the users expectations of what they can and can't do.

A Rich Text style UI is obviously the better for simple formatting, although it has usability drawbacks for more complex formatting and features such as [code] or [spoiler] tags, as well as the kinds of tagging Wiki uses.

Ornela @ cheap host at Thursday, 22.7. 2010

i agree, using wysiwyg is much more simple for clients.

Akki at Sunday, 1.8. 2010

The reason you would want to use BBCode over HTML is because you don't want to force the hole standard on your users. They don't care for things like semantics, they just want that word in bold.

You also have a lot of associations that are made simple:

[img]url[/img] => <img src="url" /> [b]text[/b] => <istrong>text<i/strong> [i]text[/i] => <iem>text<i/em> For youtube: [yt]id[/yt] => (insert mountain of code here) [size=3]text[/size] => <h2>, <h3>, or some CSS thing You also have questions like, should the users insert their own p tags?

And I disagree Kore, ] is not easier to type then > (because of the SHIFT involved).

Anyway, great article on why you can't/shouldn't be using regex to parse. :)

deejayy at Friday, 6.8. 2010

Yeah, and please substitute this with html:

[flash]/upload/demo.swf[/flash]

SasQ at Wednesday, 1.9. 2010

It's fun to see people saying "I can parse HTML/BBcode with RegExps" and then pasting a code which is nothing less than a simple top-down recursive-descent parser encoded in PHP's functions/callbacks ;)

And as to the non-HTML BBcodes like [spoiler] or [flash]: You can simply allow something like this: <div class="spoiler">...</div> for special blocks. For the Flash you can use: <a class="flash">http://path.to/flash.swf</a> and substitute it with proper markup in your script. SWFObject does exactly that, but from the JS level.

SasQ at Wednesday, 1.9. 2010

@Nikita Popov: "Sorry, my English isn't that good, I don't understand what you want to say with 'recursively stacked tags'."

Something like that:

<p>These are <i><b>nested/stacked</b> tags</i>. Try that.</p>

Here the tags are nested hierarchically, one into another. You can think of it as of parentheses:

(These are ((nested/stacked) tags). Try that.)

And you can probably see that you have to check if they're nested properly. If they're not, like in this example:

<p>These are <i>badly <b>nested</i> tags</b>.</p>

a RegExp can't handle with it, because it understands only one level of nesting (Chomsky's level 3 grammar). You need a level 2 grammar parser which understands recursive, hierarchical, self-similar patterns. But it doesn't mean that you have to write a full-blown parsing engine ;) Often it could be done by a simple recursive-descent parser using recursive function calls for each non-terminal symbol of the grammar.

Brandon @ How to become a ticket broker at Friday, 12.11. 2010

Honestly, I find BBCode even more confusing. Yes, it is more intuitive, but I feel like I've already gotten so used to HTML that I get tripped up when BBcode is introduced.

Given that pages are coded in HTML and not in BBCode, my recommendation is that people learn basic HTML. It's intimidating at the start but eventually you get the hang of it.

Emma30 at Monday, 22.11. 2010

This is good to know, I was getting a headache trying to parse bbcode with regular expressions. I'm still however looking for a simple alternative, these parsers seem a bit over-the-top.

Tom Christiansen at Monday, 22.11. 2010

Your posting is inapplicable to modern pattern-matching languages. Please stop using REGULAR that way and hitting people over the head with theoretical computability concerns that simply do not apply to any modern pattern in common use today.

Even GNU egrep, originally a DFA, now matches "(.)1" with ease, by definition a non-REGULAR language. I know of no modern programming language that requires REGULAR regular expressions.

Please note that I am here using the offensively all-capital "REGULAR" to denote that highly irregular ivory-tower sense of the otherwise common English word "regular". No one uses REGULAR expressions any more (or next to no one). We all use regular expressions, which are something altogether different.

No one has REGULAR expressions. Even the most basic of regular expressions are non-REGULAR. That's because POSIX demands backref support in BREs for compliance, thereby putting the lie to the silly old REGULARity chestnut. It was time to retire it more than 30 years ago; I can't believe people still think it matters -- or applies. It does not.
In other matters, I do not know which regex dialects are restricted to 99 backrefs the way you mention. Certainly Perl and PCRE are not; you may go as high as you need to, provided you have enough memory for it. In constructed patterns, I've used up into the 100s of them, but that was simply to test the limits. In practice, you never need to go that high, because other mechanisms are available to you.

Case in point. Consider the modern regular expression commonly used for matching balanced parens: that's just plain " ((?:[^()]*+|(?0))*)", a pattern both simple and elegant. You do not need backrefs of arbitrarily high numbering, since arbitrarily many of them are accommodated through simple recursion, a key approach in any computer scientist's algorithmic arsenal.

All that aside, I do recommend that you read Russ Cox's excellent paper, "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)." Because we all derive from Henry's regex algorithm instead of Ken's, certain operations become pathologically slow, but need not do so. A dynamic, hybrid approach that at regcomp time determines whether the alternate algorithm is feasible in any given pattern seems the obvious way to approach this, and some already do.

flower shop at Monday, 10.1. 2011

Thanks for explanation..I haven´t uderstood what exactelly BBCodes are until now

FirestormX at Tuesday, 25.1. 2011

Sorry to post on such an old article - a good article by the way - but I thought this very important point got brushed off: bbcode is used for custom/predefined markup and functionality.

For example a [video] tag could be replaced by something like <div class="video"> or something, and then parsed later - but wouldn't that be practically the same as parsing [video]? In addition, it helps to keep things styled in a way that matches your website's layout. A simple example is the {quote] tag. In order for all quotes to match other quotes on your website, a casual browser would have to know your style sheets. Like: <div class="quote_header">Quote by person 1</div><div class="quote_body">test</div> As opposed to just [quote=person 1]text[/quote]

Similarly, I have a [user] tag on my own site, which parses the user name and links back to that particular user's profile page. If I parse [url=firestormx], it's very short for the poster to write, and I have full control of it in the future. If I ever change the URL for a profile page - or more likely - change where I want the url tag to direct to (say I want to change it to link to sending a user a PM) it is very simple for me to do that in the code.

I'm not arguing the logic behind what you presented for not using regular expressions for parsing bbcode, but rather I'm arguing that bbcode is useful to do things above and beyond what a user would be able to do with white-listed html.

bbcode can be used for site-specific functionality, as well as consistent styling of content across the entire website.

HGH at Friday, 18.2. 2011

Even GNU egrep, originally a DFA, now matches "(.)1" with ease, by definition a non-REGULAR language. I know of no modern programming language that requires REGULAR regular expressions.

topanky at Wednesday, 2.3. 2011

Thank you very much for this post - it was very helpful for me!

Callum Macrae at Friday, 18.3. 2011

Just read your article, and don't agree with it. I recently built a Regex parser for my framework, and even though it parses a string into a Regex ([url={URL}]{all}[/url]), it's extremely lightweight:

https://github.com/callumacrae/lynx-framework/blob/feature%2Fhelpers/lynx/helpers/text/text.php

I haven't benchmarked it yet, but it would be interesting to see how well it stands against something like the bbcode extension in a benchmark.

ridgerunner at Sunday, 3.4. 2011

I would just like to wholeheartedly agree with everything Tom Christiansen said in his excellent comment above. I recently implemented a new PHP BBCode parser for the FluxBB forum software project which uses a (non-REGULAR) recursive regex called recursively by a preg_replace_callback() function. It easily and efficiently handles nested elements and performs quite well. And the recursive design makes it a trivial matter to identify erroneous markup such as orphan tags and illegally nested tags.

Won't you please join the rest of us here in the 21st century!?

Kore at Monday, 4.4. 2011

@Tom Christiansen & ridgerunner:

Years ago I also wrote a blog post "Parse with regular expressions" [1] which is referenced in the trackbacks. There I show how you can handle recursive structures with PCRE. I am well aware of this since years.

It does not change the fact that handling recursive structures is impossible with _regular_expressions_. PCRE "regular expressions aren't regular expressions any more. But trying to implement parsers with them has major limitations, which, from my experience, result in hard-to-maintain code (even I read PCRE fluently).

I implemented multiple parsers in my live and I can only suggest to you to only use regular expressions for tokenizing and implement some of the common parser approaches for actually building the AST. For trivial languages like BBCode this is easy, efficient and very readable.

[1] https://kore-nordmann.de/blog/parse_with_regexp.html

Bryan at Thursday, 26.5. 2011

First off I should preface this by saying that I don't disagree with anything you're saying, and I don't think you're wrong at all :)

But what I do want to say is that in reference to the story at the top of the page about the guy wanting help with using regular expressions to parse bbcode, you might be misinterpreting his request.

I understand what you're saying about the computer science yes/no's of that request, but I think you're taking his request a little too literal. He is probably writing an algorithm to do the actual "parsing", and simply using regular expressions to find the patterns that that algorithm needs to find, and just wanted help with said regular expression parts of his algorithm. This is not at all impossible and is how a lot of people handle simple everyday parsing needs that they run into during normal web work, myself having successfully done it many times from bbcode to WP style shortcodes to wiki codes to proprietary mumbo-jumbo codes.

Keep in mind that you may be a little smarter that the average question asker of a PHP forum. When they say they are "writing regular expressions to parse bbcode" what they likely mean is they are "writing an algorithm to parse bbcode and it uses regular expressions as part of that algorithm." :)

Also, on the issue of the usefulness of bbcode...you incorrectly explained what bbcode is. By your definition (same as HTML but with [] instead of <>) then bbcode absolutely sounds moronic. But that's NOT what bbcode is. There are some tags where essentially it boils down to that, but the real reason for bbcode is to offer less versatility than HTML with simpler usage than HTML, and just use [] instead of <> because otherwise bbcode and HTML code would be too easy to mistake for each other.

For example in HTML to embed an image you'd use <img src="url" />. bbcode for that is [img]url[/img]. The major difference being you can make bbcode do most of it's basic functions without people understanding the concept and syntax of attributes, only the concept of start tag/end tag. As simple as attributes may seem, getting an average butcher/baker/candlestick maker to properly write them just to put an image in a forum is often an exercise in futility, yet bbcode they quite often grasp right away. The same holds true for making a url. Instead of the more versatile <a href="url">text</a> you just use [url]url[/url] which is simpler but gives up the separation of visible text from link text.

They also often act in a "shortcode" type way simplifying to one tag what would be a little more complex in HTML. Such as [youtube]id[/youtube] being used to embed a video, which profoundly more people understand than the actual youtube embed html.

MrBarry at Thursday, 14.7. 2011

I think you have some good points, but I disagree about your opinion on whether it's even worth while to use bbcode, since it has similar complexity to xml. The reason people use it is for security - it limits how users can code, and offers a filter for the code, too. This, however, can still be done with xml based languages, like html, but it requires a different set of filters, such as elimintating javascript hijacking, removing tags that would purposefully disrupt the generated page.

For instance, in a forum situation (hence the primary reason for using bbcode), if every post is wrapped inside a table, it will look something like this:

<table>
<tr>
  <td>User Name</td>
  <td> -Post- </td>
</tr>
</table/>

Now, with no security, and if people were using straight html, it would be easy to do this in the post:

"</td></tr></table>lol Caused a forum error!"

All subsequent posts would be heavily distorted.

Now, noobs would find this funny, but it's not a practical hijack - it would be better to just insert javascript, but since it would be easy to clear off <script> tags and all on*="" html attributes, this is the next best thing.

The solution, of course, is to limit the expressive language of html, so you have an exclusive list of html tags that are able to be used, let's say they are: em,u,i,ul,li,a,img,embed,object, and param, which is the typical bbcode setup (at least by the bbcode standard tags). These elements still need to be parsed for valid markup, and for clean attributes (would want to make sure all js events are nuked, along with styles, classes, and ids), and at this point, you're doing the same amount of work for bbcode. This isn't including checks to make sure embedded objects are just videos and not malicious scripts.

The difference, of course, is that bbcode already limits users' ability to add malicious scripts, and should be checking for valid markup. bbcode allows all the above-mentioned html tags, but does this in a static and safe way (all the tags have pre-configured styles and settings that cannot be changed by the end-user).

So the balance is whether you want security, expressiveness, or some sort of in-between state. BBcode offers more security than plain html, in both filtering invalid markup, but also malicious code.

Xenland at Saturday, 23.7. 2011

This is a very good article about how ridicules bbcode is in comparison to the pros and cons of HTML. However, the point of bbcode or anything else that has there own formatting for website input from the user is because its a safer, easier approach to validating that the user isn't attempting to inject javascript code that could to malicious things such as for example having a javascript code steal the session cookie(Yes this can happen with PHP sessions too!) and then attempt to act as that user gaining access to w/e sensitive data that is located on that site.

JD Bell at Tuesday, 26.7. 2011

Perhaps ECMA Script-style Regular Expressions are not capable of recognizing structure, but many advancements in various implementations do allow for parsing structure, at least in a limited sense. Jeffrey Friedl presents several examples of matching balanced constructs and .NET even has its own "balancing group definition" (http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#balancing_group_definition).

I agree, however, that true parsing is going to require greater sophistication and sensitivity to context than regular expressions can provide.

Peter Thoeny at Thursday, 15.9. 2011

I found a combination of regular expressions and recursive function calls to be very fast to parse nested structures. My blog post about this: http://twiki.org/cgi-bin/view/Blog/BlogEntry201109x3

Robbie at Friday, 14.10. 2011

First of all, thank you very much for your opinion, which I really appreciate. Of course you are right. As a developer of a forum software, it would be much easier for me to handle formating without bb-codes. BUT, there are some very good reasons doing it this way! Especially security (e.g. CSS attacks)! In my opinion there is no better way to parse bb-code with regular expressions. And yes! It takes some days to develop a regex pattern, which is good enough to be able to avoid unwanted entries and clientside mistakes.

Joseph Szymborski at Saturday, 23.6. 2012

While I agree that BBCode is not ideal, I do not agree that it is useless. It is a matter of great security. If you replace < and > with their html equivalents without exception, you're increasing security by a ton. BBCode allows you to maintain a whitelist of tags while not using < or >.

Now, Markdown achieves all this in a much more elegant way, but the learning curve for it is much higher than BBCode for those familiar with HTML.

There are definitely security and usability benefits to BBCode, but Markdown does this much more efficiently, elegantly and securely in my opinion

Anonymous at Sunday, 28.7. 2013

What everyone should realise before reading this is that the article is very old, hence outdated. I was not so fortunate.

That's not to say the message is not valuable.

BBCode is a high-level abstraction of tedious-to-write HTML. It offers the security benefits of not requiring (and allowing) users to write HTML, CSS and JavaScript in their posts.

[spoiler=Title]Spoiler content[/spoiler]

That's most commonly a button that displays its content when clicked. You wouldn't want users to write that out every time they needed it.

P at Tuesday, 23.12. 2014

Thank God for Markdown! http://ben.balter.com/2014/03/31/word-versus-markdown-more-than-mere-semantics/

eirl at Saturday, 20.2. 2016

@Whisller > I've never been capable to understand why users think [b] or string is better than <b>

String needs 2 characters while <b>string</b> needs 7 characters. I guess they don't find it better, just more comfortable.

Subscribe to updates

There are multiple ways to stay updated with new posts on my blog:

A classic RSS feed (for example in Portalific)
I'll toot about it on mastodon
All updates will go to LinkedIn, as well