First published on Monday, September 3, 2007

Warning: This blog post is more than 14 years old – read and use with care.

Why are you using BBcodes?

The discussion on my blog post "Do not use regular expressions for parsing" turned into a discussion about BBCodes in general. I had only used them as an example to demonstrate why it is impossible to parse such a language with regular expressions, and mentioned in passing that I don't see any sense in using them at all.

What are BBCodes?

For completeness, a short explanation of what they are:

BBCode is a markup language very similar to HTML, using [] instead of <>, and with a different attribute notation.

[element=attribute_value] some text [/element]

When it comes to multiple attributes, the markup definitions differ in nearly every implementation, as does the list of available elements.

The myths about BBCode

There are several myths about BBCodes that are used to explain why it makes sense to use them.

BBCodes are more secure

_p_ mentions in his comment that he thinks BBCodes are more secure than allowing users to embed HTML in input fields, even though he completely ignores the topic of the blog post by claiming he wants to parse the BBCodes using regular expressions, which has been proven impossible.

I am not a security expert, but even the basic knowledge every developer should have reveals several attack vectors.

As proven in the blog post mentioned above, it is not possible to validate BBCode using regular expressions, so it is quite simple to break your layout by omitting closing tags. This only harms the layout, but is still annoying. Another attack vector is the attributes mentioned earlier. A [url] element for providing links is common. The usual way to "parse" such things, which I have seen in many applications, is:

$code = preg_replace( '(\[url=([^]]+)\](.*?)\[/url\])is', '<a href="\\1">\\2</a>', $code );

The attack should now be obvious to everybody. Just submit something like this in a BBCode-enabled field and a nice new XSS is born.

Click this [url=" onclick="alert( 'XSS' );]URL![/url] Amazing stuff!

Of course, those injections can be prevented by proper escaping, but I have almost never seen this in self-implemented BBCode "parsers". To summarize:

BBCode by default is not more secure than HTML. You need value checks in both cases.
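The escaping and value checks mentioned above could look roughly like the following. This is only a minimal sketch: the function name convertUrlTags and the rule "accept only absolute http(s) URLs" are assumptions made for illustration, and the code still relies on the regular expression from the example, so it does not validate nesting or missing closing tags. It only shows that the injection disappears once input is escaped and the attribute value is checked.

```php
<?php
// Hypothetical helper (not from the original post): converts [url=...]...[/url]
// with output escaping and a value check on the URL.
function convertUrlTags(string $code): string
{
    // Escape everything first, so raw <, >, & and quotes never reach the output.
    $escaped = htmlspecialchars($code, ENT_QUOTES, 'UTF-8');

    return preg_replace_callback(
        '(\[url=([^]]+)\](.*?)\[/url\])is',
        function (array $match): string {
            // Undo the escaping only to inspect the URL value itself.
            $url = htmlspecialchars_decode($match[1], ENT_QUOTES);

            // Assumed policy: only absolute http(s) URLs without whitespace or quotes.
            if (!preg_match('(^https?://[^\s"\'<>]+\z)i', $url)) {
                return $match[0]; // keep the already escaped BBCode as plain text
            }

            // $match[2] is the link text, which was escaped above.
            return '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
                . $match[2] . '</a>';
        },
        $escaped
    );
}
```

With this sketch, the attack string from above is left as escaped text, because its "URL" fails the value check, while a plain http:// link is turned into an anchor with an escaped href.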

BBCodes are easier to type / learn

I think this is the most common argument for using BBCodes. But I simply don't get how anyone could think that this is true... OK, there are a lot more HTML tags than there are BBCode tags, and I do not want my users to learn all those HTML tags.

But users do not need to learn "all those HTML tags", only the very limited subset allowed in your blog.

  1. Somebody who is not familiar with HTML will face the same learning effort as for BBCodes. There is some defined syntax, and there are some tags with their special meanings. Besides the slightly more complicated notation of attributes, nothing is different, and you don't think your users will fail because of this, do you?

    On the other hand, having multiple attributes in BBCodes gets complicated and often results in notations very similar to HTML. And, as BBCode is not defined by any real standard, there are very different implementations of this out there.

  2. Somebody familiar with HTML only has to learn which subset he may use. This is so much simpler than learning your current BBCode definition. It is only a list of elements (and attributes in those elements) to learn, something normally accomplished in seconds.

What I do not get is why BBCodes duplicate the faults of XML, like the requirement to retype the tag name in the closing tag. If I wanted to use some completely different markup, I would reuse an established standard which is really easy to type and properly defined, like reStructuredText, which I actually use to write my blog posts...

Allowing HTML results in XSS

As described above, you need to invest some time and energy to ensure that your application is not prone to XSS when using BBCodes. The same applies to HTML. But the concept is quite easy. Some people try blacklisting, but we all know that nobody really knows which browser will interpret which funny stuff as HTML and execute some ECMAScript... this is like Don Quixote's fight against windmills.

So, as always when it is possible, whitelists are the way to go. In this case creating the actual whitelist is pretty easy: you define a set of allowed elements which, for each element, contains a list of the allowed attributes. This list should of course not contain attributes like style, on*, etc., which may be used for ECMAScript execution.

Then you use some common XML parser which can work with HTML and ensures valid HTML, like DOM or tidy. Those can ensure matching elements and proper attribute definitions. Now remove or mask all elements and attributes you do not want to be used, and you get some pretty safe HTML which is not really vulnerable to injection any more. Values for class attributes or URLs may also be checked against some pattern to ensure they contain proper values. But this highly depends on the attribute you are checking.

Of course, there are some libraries which make this task easier, or spare you the implementation of something like this, even though it is quite easy to do.
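A minimal sketch of such a whitelist filter, using PHP's DOM extension, might look like this. The names ALLOWED_ELEMENTS, filterNode and filterHtml, the chosen set of elements, and the http(s)-only check for href are all assumptions for illustration; a real application would pick its own list and would probably prefer an audited library.

```php
<?php
// Hypothetical whitelist: element name => list of allowed attribute names.
const ALLOWED_ELEMENTS = [
    'p' => [], 'b' => [], 'i' => [], 'em' => [], 'strong' => [],
    'a' => ['href'],
];

function filterNode(DOMNode $node): void
{
    // Copy the child list first, because the loop modifies the tree.
    foreach (iterator_to_array($node->childNodes, false) as $child) {
        if ($child instanceof DOMText) {
            continue; // text nodes are escaped again on serialization
        }
        if (!$child instanceof DOMElement
            || !array_key_exists($child->nodeName, ALLOWED_ELEMENTS)) {
            $node->removeChild($child); // drop unknown elements and comments, including content
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attribute) {
            if (!in_array($attribute->nodeName, ALLOWED_ELEMENTS[$child->nodeName], true)) {
                $child->removeAttribute($attribute->nodeName);
            }
        }
        // Value check (assumed policy): links must be absolute http(s) URLs.
        if ($child->hasAttribute('href')
            && !preg_match('(^https?://)i', $child->getAttribute('href'))) {
            $child->removeAttribute('href');
        }
        filterNode($child);
    }
}

function filterHtml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    // The meta element is an encoding hint; the div is a known wrapper around the fragment.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div>'
        . $html . '</div>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    filterNode($root);

    // Serialize only what is left inside the wrapper.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Removing a disallowed element together with its content is the strict variant; masking it (emitting it as escaped text) would be the alternative mentioned above.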

  • HTML Purifier

    From their website: "HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications."

Reasons for HTML

Besides the reason already mentioned, that many users are used to HTML and can type it nearly fluently, there are other reasons for using it.

You can easily (optionally) use a WYSIWYG editor for the contents on your website like TinyMCE, or similar. Because you use HTML there is nearly no limitation here.

So, why are you actually using BBCodes?

Are there any reasons to use BBCodes I am missing? ... besides backwards compatibility.

Comments

Evert at Monday, 3.9. 2007

I do think BBCode is more secure than trying to clean html. Black lists for HTML almost never are 100% waterproof, just because there are so many variations. By using BBCode, everything that wasn't handled correctly by the parser will simply be spit out as their BBCode.. e.g.: [script] is harmless. Making the same mistake with cleaning html would produce <script>.

Academically it could be just as secure, but when you consider that there might be bugs in software (and there always are) BBCode provides an extra fallback..

That doesn't make it pretty though, therefore I would still stick with HTML or a wiki-like syntax to do this stuff.

martynas jusevicius at Monday, 3.9. 2007

Hey. I was thinking about the same -- that BBCode is no better than HTML, it's just yet another sort of markup. It is easier to deal with, I would say -- if the tags don't parse, just leave them as text, while with HTML it is not that simple. But the problem lies deeper I think -- there is no elegant way to edit and embed (the HTML code of) websites. I don't expect users to know ANY markup language. WYSIWYG is not a good solution either, because such editors produce crappy source code and are overkill in most situations.

Tobias Struckmeier at Monday, 3.9. 2007

@ Evert: You don't want blacklisting. You want whitelisting, which you can achieve with several classes which rebuild the input DOM, but only with the "safe" elements and with validated attribute values. An additional benefit of HTML input for "rich text enabled" fields is that you have the possibility of using the browser as a WYSIWYG HTML editor. Anybody is able to produce HTML code that way. And I don't see the extra fallback you speak of? People who let <script> slip through also don't escape their SQL queries ;).

Classes which provide such features are for example:

  • fDomDocument (http://fcms.de)
  • HTML_Safe (http://pear.php.net/package/HTML_Safe)
  • HTMLPurifier (http://htmlpurifier.org/)

Sure in all those there might be holes as well. But the code is better to maintain, understand and to fix than a huge amount of regular expressions.

mg at Monday, 3.9. 2007

I was asking myself why people started using MarkDown or one of the many WikiSyntax in their blogs. Next step would be the same shit in forums. As it goes for my blog i'm using a minimal set of allowed HTML-tags.

Basically there is nothing wrong with using an optional WYSIWYG editor if you wanna make it simple for the users. Just don't use BBCode. Who doesn't understand it doesn't understand it.

The argument “it's safer” doesn't count on HTML. You don't need to use blacklists but whitelists with your allowed tags and attributes instead, and a Markup cleanup like tidy.

I also ask myself why I always have to see these useless font-family and color dropdowns in WYSIWYG editors. Give the users a format dropdown! Individualism is a no-go on forums!

Just a basic set of XHTML 1.1 formatting is enough!

iamsure at Monday, 3.9. 2007

For me, it's that I'm familiar with bbcode. I know html, and I know bbcode, but when posting to a forum, or similar, bbcode is what seems most comfortable.

My blog (from blogger) supports html, and I've known html for far longer, but yet bbcode seems more comfortable in that context.

In my case, I like using bbcode, and it's that simple. It's not about security, or ease of use, or even XSS. It's just what I feel comfortable using in that context.

But beyond that, it's an interesting problem to solve, which I enjoy. Sure, I could tell users to only use html. Or plain text. Or SGML docbook. But supporting technologies that make things easier for users is appealing - even if it is truly challenging to support well.

I disagree with your assertion that it has been <b>proven</b> that you cannot implement bbcode with regex. I'd say that it's extremely challenging to do well, and to do in multiple levels of nesting. Those are not the same things. :)

Kore at Monday, 3.9. 2007

@Evert: I explicitly said that blacklisting won't work well. Use whitelisting. I wrote that in the blog post. You of course should never try blacklisting in this context.

Kore at Monday, 3.9. 2007

@iamsure:

The proof that you can't use regular expressions to validate / parse BBCodes is here: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

There is no room for discussion on that, until you either prove some of the basic axioms of common mathematics wrong, or show in which way my analytic proof is broken.

Joakim Nygård at Tuesday, 4.9. 2007

I've never liked BBCode. Every time I get the chance to influence a decision about formatted input, I suggest Markdown. Created by John Gruber (daringfireball.com) it looks like one would format a plain text file. Simple and easy to understand.

Security concerns are obviously dependent on the actual code used to parse the Markdown into html and so are not part of the syntax itself.

One advantage of using an intermediate syntax and converting it to html is that it gives you the option of outputting it in other structures - updated html/xhtml for instance. In other words, you are not stuck with whatever code was entered but can update the parser to follow new standards.

philip at Wednesday, 5.9. 2007

See Also: http://php.net/bbcode

Ronald Iwema at Wednesday, 5.9. 2007

I think the main reason I would use BB is to control which HTML elements a user can use .

kore at Wednesday, 5.9. 2007

@Ronald Iwema: Where is the difference between a list of BBCodes to transform and a list of allowed (X)Html elements? None.

Ronald Iwema at Friday, 7.9. 2007

@kore

I don't agree. Looking at implementing BBCode, u make sure all < and > are replaced by &lt; and &gt;, and then u can process the BBCodes. If u work with a whitelist of allowed HTML elements u have to do more work in implementing it correctly. Also controlling which attributes can be used is easier.

Actually I prefer the "don't let users style their comments" approach.

Void at Friday, 7.9. 2007

I'm the author of the php BBCode extension and I'm happy my (early stage) extension is mentioned here. I'm currently working on a brand new parser (I've seen many limitations in the current one).

However, i have to mention that BBCode is only a convention and that nobody (forums, blogs systems & so on) really parse it the same way, my extension was a try to make a unified approach, my "nexgen" parser is described here: http://news.php.net/php.pecl.dev/4825

I don't think BBCode is the perfect solution; however, it is in widespread use in many forums and so on, so I used it on my website, and since the parsing was a performance-critical operation, I started the extension.

However, I'll be happy to have your feedback on what is "missing" because it's still beta and many new capabilities can be added while still coding.

Just mail me suggestions xavier - the at sign - bmco -dot, yes, just dot- eu

It's funny that people still use regex (when it's not str_replace) to "parse" (in fact, it's not parsing, as parsing requires a tokenization and lexical analysis phase).

The discussion is however good, and, I agree, BBCode is not a good language; it's an error that has become widespread.

Kore at Sunday, 9.9. 2007

Thanks for the comment Void.

Writing a fast and real parser is of course very much appreciated for existing applications which are still bound to BBCodes.

Void at Sunday, 9.9. 2007

However, I think that insecurity is not inherent to any markup language (or markdown or whatever) The insecurity is in the markup change, the implementation needs to remove the input escaping and add the output escaping

(and we need to replace < and > by &lt; and &gt; but also we need to replace & by &amp;, " by &quot; and also ' to &apos;(i think)

But most users forget the three latter ones, which can lead to serious security problems :)

and my parser won't actually escape this, as it is also able to output text

Mastodont at Monday, 5.11. 2007

BBCode is not the only markup language, and I'm using ML on account of speed and simplicity. WYSIWYG editors usually demand a keyboard-mouse-keyboard move, while typing bold text or ::italic text:: or ++url text++ is straightforward.

With regard to impossibility to validate ML using regular expressions, what about:

step 1) translate ML with very simple regexps to (X)HTML

step 2) verify resulting code in HTML Purifier

Pete at Saturday, 19.1. 2008

well, BB codes could be advantageous for making comments styling simpler. I can tell you that my community does not want to change back to HTML. regards Pete

Kredyt Mieszkaniowy at Friday, 1.2. 2008

Why can't we just use one language (implementation), HTML? I don't like BBCode :( ???

Kredyt at Wednesday, 30.4. 2008

i use bbcodes, because i work with bulletin boards and its easier to handle than with html. but i agree to kredyt mieszkaniowy, that one language is better than two different:)

Michael at Wednesday, 4.6. 2008

Familiarity counts. Whether it's a botched variant of HTML isn't relevant for those users that learned their formatting from forum software. So if your users have lots of experience with posting in forums (or a pre-packaged forum plays a part in your site), using BBCode is the path of least resistance.

And I'd rather use Markdown, Textile, roff etc. than a subset of HTML. With a completely separate format, I don't have to keep in mind what tags I can use in this particular textarea...

gareth at Thursday, 12.6. 2008

bbcode is simpler to parse/validate than a subset of html, and simpler for the user, eg.

[youtube]0123456789a[/youtube]

[quote=guy]blaa[/quote]

goedkoop geld lenen at Sunday, 13.6. 2010

Use HTML Purifier, it is much much better than BB code which is dumb. I completely agree with your article

Chat at Wednesday, 30.6. 2010

you super sites. admin thanks:))))

ezmoz article at Tuesday, 24.8. 2010

there is nothing wrong with using an optional WYSIWYG-editors if you wanna make it simple for the users.

btw, Thanks for sharing.

bayanlarla sohbet at Saturday, 18.12. 2010

I think BBCodes are very useful for my vBulletin forum. For special signatures and good-looking profiles they are very necessary. BBCodes are very functional on forums

Adidas schoenen at Friday, 21.1. 2011

Great insights in the comments!

Aloha! at Sunday, 9.10. 2011

Hey Kore, anything to say about [youtube] and [quote] as gareth mentioned earlier?

gareth does make a good point here. These things are great for those non tech savvy and such. I would rather use BBcode, than to waste anyone's time learning to do [youtube] and [quote] alternative.

Bobby Jobs at Sunday, 15.7. 2012

I agree with everyone, use the HTML Purifier when using HTML. BB code is another option. I've used BB code a few times.

Mijn Schoenen Online at Thursday, 2.5. 2013

Pity that forum builders decided to start using BBCodes. Why not just use HTML code?

AnnyIngram at Friday, 27.12. 2013

Really great discussion share with us. It can make big help to the web developers to understand why they are using BBcodes in website. It's much perceptive.