Kore Nordmann - PHP / Projects / Politics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:Author: Kore Nordmann
:Date: Mon, 03 Sep 2007 11:02:40 +0200
:Revision: 4
:Copyright: CC by-sa
==========================
Why are you using BBcodes?
==========================
:Description:
The discussion on my blog post "Do not use regular expressions for parsing"
ended in a discussion about BBcodes in general. I just used them as an
example to demonstrate why it is impossible to parse such a language
with regular expressions - and mentioned in passing that I don't see
any sense in using them at all. So why use them at all?
The discussion on my blog post `"Do not use regular expressions for
parsing"`__ ended in a discussion about BBcodes in general. I just used them
as an example to demonstrate why it is impossible to parse such a language
with regular expressions - and mentioned in passing that I don't see any
sense in using them at all.
__ do_NOT_parse_using_regexp.html
What are BBCodes?
=================
For completeness, a short explanation of what they are:
BBCode is a markup language very similar to HTML, using [] instead of <>, and
with a different attribute notation. ::
[element=attribute_value] some text [/element]
When it comes to multiple attributes, the markup definitions differ between
nearly all implementations, as does the list of available elements.
The myths about BBCode
======================
There are several myths about BBCodes which try to explain why it makes sense
to use them.
BBCodes are more secure
-----------------------
`_p_ mentions in his comment`__ that he thinks BBCodes are more secure to
use than allowing embedded HTML in input fields - even though he completely
ignores the topic of the blog post by claiming he wants to parse the BBCodes
using regular expressions, which has been shown to be impossible.
__ do_NOT_parse_using_regexp.html#comment_10
I am not a security expert, but even the basic knowledge every developer
should have reveals several attack vectors.
As shown in the `already mentioned blog post`__, it is not possible to
validate BBCode using regular expressions, so it is quite simple to break
your layout by omitting closing tags. This only harms the layout, but is still
annoying. Attributes are another attack vector. A [url] element for providing
links is common, and the usual way to "parse" it, which I have seen in many
applications, is: ::
$code = preg_replace(
    '(\[url=([^]]+)\](.*?)\[/url\])is',
    '<a href="\\1">\\2</a>',
    $code
);
__ do_NOT_parse_using_regexp.html
The attack should now be obvious. Just use something like this in your
BBCode-enabled input and a nice new XSS is born. ::
Click this [url=" onclick="alert( 'XSS' );]URL![/url] Amazing stuff!
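
To make the effect concrete, here is a minimal sketch that runs the attack
string through such a replacement. The anchor markup in the replacement string
and the function name ``naiveUrlReplace`` are assumptions made for
illustration: ::

    <?php
    // Naive [url] replacement. The <a href> markup in the replacement
    // string is an assumption of this sketch.
    function naiveUrlReplace(string $code): string
    {
        return preg_replace(
            '(\[url=([^]]+)\](.*?)\[/url\])is',
            '<a href="\\1">\\2</a>',
            $code
        );
    }

    $input = "Click this [url=\" onclick=\"alert( 'XSS' );]URL![/url] Amazing stuff!";

    echo naiveUrlReplace($input), "\n";
    // Click this <a href="" onclick="alert( 'XSS' );">URL!</a> Amazing stuff!

The attribute value is closed by the injected quote, so the rest of the "URL"
ends up as an event handler on the link.
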
Of course, those injections can be prevented by proper escaping, but I have
almost never seen this in any self-implemented BBCode "parsers". To summarize:
BBCode by default is not more secure than HTML. You need value checks in
both cases.
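
As a rough sketch of what such escaping and value checks might look like for
the [url] element: the function name ``safeUrlReplace`` and the rule to accept
only ``http://`` and ``https://`` targets are assumptions for illustration.
This is still a regular expression replacement, so it does not solve the
parsing problem; it only shows the escaping step: ::

    <?php
    // Hypothetical helper: escape first, then turn [url=...] into a link
    // only if the target passes a simple scheme check.
    function safeUrlReplace(string $code): string
    {
        // Escape all user input, so no raw markup or quotes survive.
        $code = htmlspecialchars($code, ENT_QUOTES, 'UTF-8');

        return preg_replace_callback(
            '(\[url=([^]]+)\](.*?)\[/url\])is',
            function (array $match): string {
                // Value check: accept http:// and https:// targets only.
                if (!preg_match('(^https?://)i', $match[1])) {
                    return $match[0]; // leave the escaped BBCode untouched
                }

                return '<a href="' . $match[1] . '">' . $match[2] . '</a>';
            },
            $code
        );
    }

    // The attack string stays plain, escaped text.
    echo safeUrlReplace("[url=\" onclick=\"alert( 'XSS' );]URL![/url]"), "\n";
    // A regular link is converted.
    echo safeUrlReplace('[url=http://example.org/?a=1&b=2]Example[/url]'), "\n";
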
BBCodes are easier to type / learn
----------------------------------
I think this is the most common argument for using BBCodes. But I simply
don't get how anyone could think that this is true... OK, there are a lot
more HTML tags than there are BBCode tags, and I do not want my users to
learn all those HTML tags.
But the users do not need to learn "all those HTML tags", only the very
limited subset allowed in your blog.
1) Somebody who is not familiar with HTML will have the same learning effort
as for learning BBCodes. There is some defined syntax, and there are some tags
with their special meaning. Besides the slightly more complicated notation of
attributes, there is nothing different - and you don't think your users will
fail because of this, do you?
On the other hand, having multiple attributes in BBCodes gets complicated
and often results in notations very similar to HTML. And, as
BBCode is not defined by any real standard, there are very different
implementations for this out there.
2) Somebody familiar with HTML only has to learn which subset he may use. This
is so much simpler than learning about your current BBCode definition. It
is only a list of elements (and attributes in those elements) to learn,
something normally accomplished in seconds.
What I do not get is why BBCodes duplicate the faults of XML, like the
requirement to retype the tag name in the closing tag. If I wanted to
use some completely different markup, I would reuse some established standard
which is really easy to type and properly defined, like `reStructured Text`__,
which I am actually using to type my blog postings...
__ http://docutils.sourceforge.net/rst.html
Allowing HTML results in XSS
----------------------------
As described above, you need to invest some time and energy to ensure that your
application is not prone to XSS using BBCodes - the same applies to HTML. But the
concept is quite easy. Some people try blacklisting, but we all know that
nobody really knows which browser will interpret which funny stuff as HTML
and execute some ECMAScript... this is like `Don Quixote's`__ fight against
windmills.
__ http://en.wikipedia.org/wiki/Don_Quixote
So, as always when it is possible, whitelists are the way to go. In this case
the creation of the actual whitelist is pretty easy: you define a set of
allowed elements, which (for each element) contains a list of the allowed
attributes - this list should of course not contain attributes like style,
on*, etc. which may be used for ECMAScript execution.
Then you use some common XML parser which can work with HTML and ensures valid
HTML, like `DOM`__, or `tidy`__. Those can ensure matching elements and proper
attribute definitions. Now remove or mask all elements and attributes you do
not want to be used and you get some pretty safe HTML, which is not really
vulnerable to injection any more. Values for class attributes or URLs may
also be checked against some pattern to ensure they contain proper values,
but this highly depends on the attribute you are checking.
__ http://php.net/DOM
__ http://tidy.sourceforge.net/
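
The whitelist approach described above can be sketched with PHP's DOM
extension. The element list, the ``http://`` / ``https://`` check for ``href``
values, and the function names ``filterHtml`` and ``filterNode`` are
assumptions for illustration. Disallowed elements are dropped together with
their content, which is only one possible policy: ::

    <?php
    // Hypothetical whitelist: element name => list of allowed attributes.
    const ALLOWED = [
        'p'      => [],
        'em'     => [],
        'strong' => [],
        'a'      => ['href'],
    ];

    function filterNode(DOMNode $node): void
    {
        // Copy the child list first, because it is modified while iterating.
        foreach (iterator_to_array($node->childNodes) as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                continue; // plain text is escaped again on serialization
            }

            // Drop comments, unknown elements and everything else.
            if (!($child instanceof DOMElement)
                || !array_key_exists($child->nodeName, ALLOWED)) {
                $node->removeChild($child);
                continue;
            }

            foreach (iterator_to_array($child->attributes) as $attribute) {
                $keep = in_array($attribute->nodeName, ALLOWED[$child->nodeName], true);

                // Value check for URLs: http:// and https:// only.
                if ($attribute->nodeName === 'href'
                    && !preg_match('(^https?://)i', $attribute->value)) {
                    $keep = false;
                }

                if (!$keep) {
                    $child->removeAttributeNode($attribute);
                }
            }

            filterNode($child);
        }
    }

    function filterHtml(string $html): string
    {
        $previous = libxml_use_internal_errors(true); // sloppy input is expected
        $doc = new DOMDocument();
        // The XML declaration is a common trick to make loadHTML() read UTF-8.
        $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $body = $doc->getElementsByTagName('body')->item(0);
        if ($body === null) {
            return '';
        }

        filterNode($body);

        // Serialize only the children of <body>, not the whole document.
        $result = '';
        foreach ($body->childNodes as $child) {
            $result .= $doc->saveHTML($child);
        }

        return $result;
    }

    echo filterHtml(
        '<p onclick="evil()">Hi <a href="javascript:alert(1)" style="x">link</a>'
        . '<script>alert(1)</script></p>'
    ), "\n";

In this sketch the parser takes care of matching elements, and the filter only
has to decide what to keep.
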
Of course, there are some libraries which make this task easier, or spare you
the implementation of something like this - even though it is quite easy to do.
- `HTML Purifier`__
From their website: "HTML Purifier is a standards-compliant HTML filter
library written in PHP. HTML Purifier will not only remove all malicious
code (better known as XSS) with a thoroughly audited, secure yet permissive
whitelist, it will also make sure your documents are standards compliant,
something only achievable with a comprehensive knowledge of W3C's
specifications."
__ http://htmlpurifier.org/
Reasons for HTML
================
Besides the already mentioned reason that many users are already used to
HTML and can type it nearly fluently, there are other reasons for using
it.
You can easily (optionally) use a WYSIWYG editor for the contents on your
website like TinyMCE, or similar. Because you use HTML there is nearly no
limitation here.
So, why are you actually using BBCodes?
=======================================
Are there any reasons to use BBCodes I am missing? ... besides backwards
compatibility.
Comments
========
- Evert at Mon, 03 Sep 2007 17:53:24 +0200
I do think BBCode is more secure than trying to clean html. Black lists for
HTML almost never are 100% waterproof, just because there are so many
variations. By using BBCode, everything that wasn't handled correctly by the
parser will simply be spit out as their BBCode.. e.g.: [script] is harmless.
Making the same mistake with cleaning html would produce