The discussion on my blog post "Do not use regular expressions for parsing" turned into a discussion about BBCodes in general. I only used them as an example to demonstrate why it is impossible to parse such a language with regular expressions - and mentioned in passing that I do not see any sense in using them at all.
Just for completeness, a short explanation of what they basically are:
BBCode is a markup language very similar to HTML, using [] instead of <>, and with a different attribute notation.
[element=attribute_value] some text [/element]
When it comes to multiple attributes, the markup definitions differ in nearly every implementation, as does the list of available elements.
There are several myths about BBCodes which try to explain why it makes sense to use them.
_p_ mentions in his comment that he thinks BBCodes are more secure to use than allowing users to embed HTML in some input fields - even though he completely ignores the topic of the blog post by claiming he wants to parse the BBCodes using regular expressions, which has been shown to be impossible.
I am not a security expert, but even the basic knowledge every developer should have reveals several attack vectors.
As shown in the already mentioned blog post, it is not possible to validate BBCode using regular expressions, so it is quite simple to break your layout by omitting closing tags. This only harms the layout, but is still annoying. Another attack vector is the attributes mentioned above. A [url] element to provide links is common. The very common way to "parse" such things, which I have seen in many applications, is:
$code = preg_replace( '(\[url=([^]]+)\](.*?)\[/url\])is', '<a href="\\1">\\2</a>', $code );
The attack should now be obvious. Just use something like this in your BBCode-enhanced text and a nice new XSS is born.
Click this [url=" onclick="alert( 'XSS' );]URL![/url] Amazing stuff!
The replacement above turns the [url] element into <a href="" onclick="alert( 'XSS' );">URL!</a>. Of course, such injections could be prevented by proper escaping, but I have hardly ever seen this in any self-implemented BBCode "parsers". To summarize this:
BBCode by default is not more secure than HTML. You need value checks in both cases.
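A minimal sketch of what escaping plus a value check could look like for the [url] replacement shown above. The function name convertUrlTags and the rule "only absolute http(s) URLs become links" are my assumptions for illustration, not part of any BBCode definition, and this still is no real BBCode parser.

```php
<?php
// Hypothetical helper: converts [url=...]...[/url] with escaping and a
// value check. It reuses the pattern from the naive replacement above.
function convertUrlTags(string $code): string
{
    // Escape everything first, so neither the text nor the attribute
    // value can contain raw quotes or angle brackets any more.
    $escaped = htmlspecialchars($code, ENT_QUOTES, 'UTF-8');

    $result = preg_replace_callback(
        '(\[url=([^]]+)\](.*?)\[/url\])is',
        function (array $match): string {
            // Value check (assumed rule): only absolute http(s) URLs are
            // turned into links, so javascript: and friends are rejected.
            if (!preg_match('(^https?://)i', $match[1])) {
                return $match[2];
            }

            return '<a href="' . $match[1] . '">' . $match[2] . '</a>';
        },
        $escaped
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to
    // the escaped text in that case.
    return $result ?? $escaped;
}
```

With this, the attack string from above loses its quotes during escaping and fails the URL check, so only the link text is kept.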
The argument I hear most often for using BBCodes is that they are easier for users to learn than HTML. But I simply don't get how anyone could think that this is true... "OK, there are a lot more HTML tags than there are BBCode tags. And I do not want my users to learn all those HTML tags."
But the users do not need to learn "all those HTML tags", only the very limited subset allowed in your blog.
Somebody who is not familiar with HTML will have the same learning effort as for learning BBCodes. There is some defined syntax, and there are some tags with their special meaning. Besides the slightly more complicated notation of attributes, there is nothing different - and you don't think your users will fail because of this, do you?
On the other hand, having multiple attributes in BBCodes gets complicated and often results in notations very similar to HTML. And, as BBCodes are not defined by any real standard, there are very different implementations for this out there.
Somebody familiar with HTML only has to learn which subset he may use. This is so much simpler than learning your current BBCode definition. It is only a list of elements (and attributes of those elements) to learn - something normally accomplished in seconds.
What I do not get is why BBCodes duplicate the faults of XML, like the requirement to retype the tag name in the closing tag. If I wanted to use some completely different markup, I would reuse some established standard which is really easy to type and properly defined, like reStructuredText, which I am actually using to write my blog postings...
As described above, you need to invest some time and energy to ensure that your application is not prone to XSS when using BBCodes - the same applies to HTML. But the concept is quite easy. Some people try blacklisting, but we all know that nobody really knows which browser will interpret which funny stuff as HTML and execute some ECMAScript... this is like Don Quixote's fight against windmills.
So, as always when it is possible, whitelists are the way to go. In this case the creation of the actual whitelist is pretty easy: you define a set of allowed elements, which (for each element) contains a list of the allowed attributes - this list should of course not contain attributes like style, on*, etc., which may be used for ECMAScript execution.
Then you use some common XML parser which can work with HTML and ensures valid HTML, like DOM or tidy. Those can ensure matching elements and proper attribute definitions. Now remove or mask all elements and attributes you do not want to be used and you get some pretty safe HTML, which is not really vulnerable to injection any more. Values for class attributes or URLs may also be checked against some pattern to ensure they contain proper values. But this highly depends on the attribute you are checking.
Of course, there are some libraries which make this task easier, or spare you the implementation of something like this - even though it is quite easy to do.
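The whitelist approach described above could be sketched with PHP's DOM extension roughly like this. The whitelist contents, the names ALLOWED_ELEMENTS, filterHtml and filterChildren, and the http(s)-only check for href are assumptions I made for illustration. Disallowed elements are removed together with their content here; masking them would be the other option.

```php
<?php
// Hypothetical whitelist: element name => list of allowed attributes.
const ALLOWED_ELEMENTS = [
    'p'      => [],
    'em'     => [],
    'strong' => [],
    'ul'     => [],
    'li'     => [],
    'a'      => ['href'],
];

function filterChildren(DOMNode $parent, array $allowed): void
{
    // Copy the live node list first, since nodes get removed while looping.
    foreach (iterator_to_array($parent->childNodes, false) as $node) {
        if ($node instanceof DOMText) {
            continue; // plain text is kept, the serializer escapes it
        }

        if (!$node instanceof DOMElement || !isset($allowed[$node->nodeName])) {
            // Unknown elements, comments, etc. are dropped with their content.
            $parent->removeChild($node);
            continue;
        }

        foreach (iterator_to_array($node->attributes, false) as $attribute) {
            if (!in_array($attribute->nodeName, $allowed[$node->nodeName], true)) {
                $node->removeAttributeNode($attribute);
            }
        }

        // Value check (assumed rule): links must be absolute http(s) URLs.
        if ($node->hasAttribute('href')
            && !preg_match('(^https?://)i', $node->getAttribute('href'))) {
            $node->removeAttribute('href');
        }

        filterChildren($node, $allowed);
    }
}

function filterHtml(string $html, array $allowed = ALLOWED_ELEMENTS): string
{
    $document = new DOMDocument();

    // The parser repairs unclosed or mismatched tags; keep its warnings
    // about invalid markup out of the output.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common trick to make loadHTML() read UTF-8;
    // the wrapping div gives us a known root for the user content.
    $document->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $document->getElementsByTagName('div')->item(0);
    filterChildren($root, $allowed);

    // Serialize only the children of the wrapper, not the wrapper itself.
    $result = '';
    foreach ($root->childNodes as $child) {
        $result .= $document->saveHTML($child);
    }

    return $result;
}
```

Calling filterHtml() on user input should then only leave the listed elements and attributes behind, with unclosed tags closed by the parser.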
From the HTML Purifier website: "HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications."
Besides the already mentioned reason that many users are already used to HTML, so that they can type it nearly fluently, there are other reasons for using it.
You can easily (and optionally) use a WYSIWYG editor like TinyMCE, or similar, for the contents on your website. Because you use HTML, there is nearly no limitation here.
Are there any reasons to use BBCodes I am missing? ... besides backwards compatibility.