Do NOT try parsing with regular expressions
Table of Contents
Some days ago somebody joined a german IRC channel where we try to help others using PHP, where I hang around for some years now, and had some problems with his regular expressions he tried to use to "parse" BBCode he wants to use in his custom application. I answered something like:
It is impossible to parse a language like BBCode with regular expressions because you only may parse regular languages using regular expressions.
The usual reaction: No reaction. No belive. Some days later nearly the same question again, from the same guy, same answer by me, same result. No, not really the same result, I decided to try to summarize the reasons for this in a small blog post.
If you are just looking for a way to "parse" HTML, I added a new blog entry, which describes how to do that: "Extracting data from HTML".
What are BBCodes?
BBCodes, as you could see from the wikipedia link mentioned above, are the same class of language like HTML. In my eyes it does not make any sense not to use (X)HTML, but a markup language which has exactly the same complexity in terms of parsing and typing of the user. BBCode is nothing more then a subset of HTML using [] insted of <>. Stupid, but OK, we now have this kind of markup.
Having a language like HTML, more or less XML, we get the same problems here. If you want to do the parsing properly you need to handle recursive tag structures, need to check for matching tags etc.
You may of course just replace the BBCode with the according HTML elements, but the you might not only end up with invalid code, but also with user provided inlined HTML which destroys the layout of your website.
The second thing you might do is just checking for a closing tag for each opened tag, or just replace BBCodes, which also have a closing tag, but you still easily find examples where it creates invalid markup, even it might not always destroy your layout, but result in undefined behaviour of the displaying browser.
[b] hello [i] [/b] world [/i] [b] ! [/b]You may also check for this using a regular expression, but trust me, I will always find a way to bypass your checks - the next paragraph will prove this.
Parsing recursive structures with regular expressions
It is quite easy to prove, that it is not possible to properly detect and parse recursive structures using regular expressions. When you have studied computer science you of course know Chomsky hierarchies and therefore know, that regular expressions are a type 3 grammar, also called regular languages, which are equivalent to finite state machines. (So you also know, that each regular expression is quite easy to transform into a FSM).
So lets take a step further and prove, that recursive structures are not parseable. Using the pumping lemma you may not prove, that a language is a regular language, but you can prove, that a language is not a regular language.
Talking about rekursive stuff lets take one of the simplest possible examples - the braces language. You may think of this like opening and closing elements instead of opening and closing braces. Consider the laguage:
L = { (^m )^m; m >= 1 }All words, where the number of opening braces equals the number of closing braces. Then one number n exists, so that every word x, which is element of this language, and has the length 2 * n can be stripped into:
x = u v w = (^n )^nWhere u, v and w are words fulfilling the following conditions:
Length of the word uv is less or equal n
Length of the word uvw is of course two times n
The pumping lemma now says the that each word of the following definition always is element of the defined language, when L is a regular language.
u v^k w; k >= 1With this definitions you for example get something like:
u = (^( n - i )
v = (^i
w = )^nUsing for example k = 2 and i = 1, you get a word which is not element of the language we defined at the beginning:
x = (^( n + 1) )^nThe numbers of opening braces obviously do not match the number of closing braces any more, so that this word is not a member of the prior defined language. Thats all you need to prove the point.
I can match this language with PCRE!
Yes, you can. PCRE can do more then regualar expressions can do by definition. The class of languages you can detect with PCRE regular expressions is, as far as I know, not clearly defined. The number of back references is limited (as the memory is in every computer, but 99 back references are far less, then some GBs of memory for your parser ;), and so on.
Some people say, that regular expressions are never readable, even I disagree here, I would say writing a real parser using PCRE regular expression will definitely be unreadable. But I will evaluate this in further detail and write a follow up on this.
Where you can use regular expressions in your parser
When you really want to parse BBCodes write a common compiler using syntax trees etc. - or simply use a common rich text editor and strip down the resulting (X)HTML to the set you want to allow in your application. DOM and similar PHP extensions do a great job here - and they ensure valid markup.
When writing a custom parser you can - and should - use regular expressions in the tokenizer. They do a great job here. To rephrase it:
Use regular expressions to recognize words, not structures.
Using BBCode in your application
As written earlier - I personally do not see any sense in using BBCodes and even wiki markup, because their complexity equals the complexity of HTML, they are another language to learn for the user, they differ in details in each implementation, and do not reuse potential knowledge of the user about HTML. And even, if the user does not know HTML yet, why should it be harder to learn your current subset, then your special BBCode implementation? Are <> really so much harder to type then []? On german keyboards they are even a lot easier to type...
But if you really want to use such stuff, there are existing packages which help you using them in your application, so that you can reuse existing working parsers and not end up with not working regular expression experiments, and continue investing hours in your application to fix some funny edge cases. Some examples:
If you liked this blog post, or learned something please consider using flattr to contribute back: .
Trackbacks
Comments
-
Whisller at Sat, 28 Jul 2007 02:28:16 +0200
I exactly think like You. I don't understand why users think [b] or *string* is better than <b>. I'm using html purifier, I think it is good library - but now I don't have a time to read documentation and code files.
Link to comment -
Martin Holzhauer at Sat, 28 Jul 2007 07:15:56 +0200
There is also an bbcode Extension[1] in Pecl in an early development state
Link to comment
1. http://pecl.php.net/package/bbcode/ -
kore at Sat, 28 Jul 2007 10:07:21 +0200
I see some sense in using *foo* and _bar_ for markup, because this established in the Usenet, maybe even before HTML was used, and is still used in a lot of text base applications, like Mail and IRC.
Link to comment
... but you just could keep those, because everyone will understand this markup even without conversion to HTML, I think. -
Toby at Sat, 28 Jul 2007 10:18:35 +0200
Wer hat Dich denn geärgert? ;)
Link to comment -
Christopher Hogan at Sun, 29 Jul 2007 02:48:45 +0200
I concur, I use a form of markdown if the client requests it in the CMS. But my software also provides a secondary option using tinyMCE for clients that want WYSIWYG editing functionality. I think wysiwyg is WAY more simple for clients. Since they TinyMCE Moxicode editor displays html pretty much the same way dreamweaver woud, the only other steps I take to ensure valid XHTML are:
Link to comment
1.) use DOM to validate
2.) using the parse tags functionality in DOM I check the content in the tags for ISO-8859 extended char sets and filter them (htmlentities function and htmlspecialchars functions won't replace them into entities correctly). I do this by using PHP 5.2 and the new filter_input/filter_var/filter_array functions
3.) encode ampersands/ addslashes or check for / use magic quotes
if I'm putting it in an RDBMS.
4.) My version of the backend code for TinyMCE uses AJAX to update elements on parts of the page dynamically and re-export the files as regular old HTML. -
Ammar Ibrahim at Sun, 29 Jul 2007 08:46:13 +0200
Exactly! This is also similar discussion to templating engines, where they try to simplify if statements and such. The thing they don't understand is that it makes things harder. Wiki syntax is very hard compared to HTML, I still don't understand why it was invented.
Link to comment
But the only advatnage of BBCode is the protection against XSS, it's very easy to strip out all HTML, and then convert BBCode to HTML. But if you allow users to write HTML, sanitizing HTML is extremely complicated. -
your mother at Tue, 28 Aug 2007 10:37:46 +0200
i hope you are brushing your teeth dear!
Link to comment -
Boris at Wed, 29 Aug 2007 17:01:11 +0200
Actually there is a good point for using BBCode instead of a small html subset, it is much easier to avoid CSS-attacks.
Link to comment -
Kore at Wed, 29 Aug 2007 23:53:07 +0200
@Boris: No, actually not. You can just do the same white listing approach.
Link to comment
With BBCode you only convert a subset, while you just allow a very specific subset with (X)Html - in most cases just do not allow any attributes and you are safe. -
_p_ at Fri, 31 Aug 2007 17:56:34 +0200
Isn't it a secure way to generate markup?
Link to comment
strip tags and then replace bbcode with regular expressions to
have the control over what we generate ((X)HTML etc.). -
Kore at Sat, 01 Sep 2007 11:47:56 +0200
@_p_: Please actually read the blog post before commenting. This would help you to answer properly. Regular expressions are just not capable of converting BBCode to HTML.
Link to comment -
Evert at Mon, 03 Sep 2007 17:50:50 +0200
I use regular expressions for parsing non-valid html.. I do need some PHP around it to aid it though :)
Link to comment
I do think BBCode is more secure than trying to clean html. Black lists for HTML almost never are 100% waterproof, just because there are so many variations. By using BBCode, everything that wasn't handled correctly by the parser will simply be spit out as their BBCode.. e.g.: [script] is harmless. Making the same mistake with cleaning html would produce <script>.
Academically it could be just as secure, but when you consider that their might be bugs in software (and there always is) BBCode provides an extra fallback..
That doesn't make it pretty though, therefore I would still stick with HTML or a wiki-like syntax to do this stuff. -
Mike Seth at Mon, 03 Sep 2007 23:27:59 +0200
Hahahahahahaha Kore!
Link to comment
I've just written a post saying the same a couple of days ago! Indeed you were much more verbose this time! :)
here it is:
http://blog.mikeseth.com/index.php?/archives/1-For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS.html -
dirt at Thu, 25 Oct 2007 03:55:55 +0200
im using a wysiwyg js to format posts by users (in place of a textarea), and its fast and already set up for bbcode, i came across this post when researching the best way to take the bbcode in the database/posting and change it to html for output...
Link to comment
if you are saying regex isnt the way to go, then what is the most efficient way to format user posts? i looked at patBBCode but i don't have access to install anything on the server itself and if im correct that package needs to be installed via PEAR along with patError.. correct me if i'm wrong.
allowing straight html doesnt seem smart, so i figured a selected number of bbcode ([b][i][u][code][quote][color=#ff0000][url][img]) stored in the db/post and translating it when outputting would be the way to go...
what to do what to do.. -
sanjuro at Thu, 28 Feb 2008 17:50:18 +0100
This is good to know, I was getting a headache trying to parse bbcode with regular expressions. I'm still however looking for a simple alternative, these parsers seem a bit over-the-top.
Link to comment -
wangtang at Fri, 29 Feb 2008 02:14:36 +0100
Well, there are actually some more arguments for BBCode you don't mention.
Link to comment
For once, people are used to it. Of course it would be desirable that people used and knew html, but they don't. They do however know BBCode, mostly the younger ones are accustomed to using BBCode in forums.
You also mention "[b] hello [i] [/b] world [/i] [b] ! [/b]" as an example. With <> instead of [], this wouldn't even be valid html-code, so there's no need to correct user input failures in the BBCode case (where I use span's anyway), at least in my book.
As for parsing BBCode with regular expressions: I've seen many sites which use a very limited set of BBCode, where this argument "It is quite easy to prove, that it is not possible to properly detect and parse recursive structures using regular expressions." does only apply in 0,01% the cases.
On top of that: It's easy to implement. Not perfect, but you don't have to read yourself into libraries parsing with syntax trees. And the "easy" argument dominates all others in many cases. While it surely is desirable to apply the theoretics of Informatics to applications, pragmatically you can always accomodate the methods you're using to the demands which need to be met.
In the end, wysiwyg's will sooner than later eat "normal" forms with BBCode or html, as they are comfortable to use for the end-user, and easy to implement for webdesigners. Most of the newer wysiwyg's do output valid xhtml, so that's where it will go. In the meantime, I don't see html replace BBCode in forms, and regexps are a simple way to deal with a lot of the easier task tags. -
wangtang at Fri, 29 Feb 2008 02:32:56 +0100
P.S.: I know I sound like "let's use h1 for big text and address for italic text, and a table for layouting our site". But since we're not doing markup here, I'm still on the regexp side ;)
Link to comment -
Kristjan Siimson at Tue, 07 Oct 2008 11:36:16 +0200
Thank you for this post. I was trying to do the same thing because it seems to make sense to use this kind of approach. That saved me a lot of time. But I'd like to point out that while writing [b] instead of <b> does not make life much easier, such codes can become useful when things get more complex, such as when adding flash objects.
Link to comment -
Ab at Mon, 19 Jan 2009 10:33:51 +0100
Good post about regEx, but shame your giving such bad advice in regards to forum/whatever-a-user-can-post-markup-in security. The main reason BB exists is because it is way more secure to disallow all HTML and have a small, totally under your own control set of Codes you can change to HTML yourself then allowing all HTML tags.
Link to comment
I have a feeling you know a lot about RegEx, but are rather clueless about secure programming and not trusting user-input. -
Kore at Mon, 19 Jan 2009 17:06:13 +0100
@Ab:
Link to comment
Please reread the text. Validating HTML against a white list of allowed elements and structures makes it possible to entirely validate user input given as HTML. There is no magic behind that.
On the other hand lots of broken, dumb BBCode-"parsers" directly convert attribute values into HTML, which can lead into script injection.
BBCode are *no* step for securing user input, this happens completely independent from the markup language. And you always need to validate the entire structure to properly validate input, which is impossible with regular expressions for BBCode like shown above. -
David King at Tue, 03 Mar 2009 01:07:01 +0100
Personally, I think that there's no real argument against HTML or BBCode parsing and security just so long as you do it right. With BBCode it's trivial to prevent inline javascript events (like adding onclick to a bold tag) because only [B] would be parsed and [B onclick="evil"] would be ignored. That leaves only [LINK] tags susceptible to attack - and decent sanitation there is a complete doddle.
Link to comment
For me, the reason to use BB-style over HTML is that you keep the user from thinking "ooh, HTML - lets get creative" then have them disappointed because you only support a small set of tags (STRONG, EM, U for example) leaving them thinking "bollocks, how can I edit my comment"? However when presented with BB-style they will know immediately that it is a simpler set of tags and to either use the given reference or provided toolbar -
hari at Sun, 12 Jul 2009 16:57:57 +0200
Hi Kore,
Link to comment
I read your arguments, but from a practical perspective, it doesn't seem to make sense to write a WHOLE parser - scanner for merely adding a few small tags for the end user. Especially for simple applications where you just need bold, italic, underline, quoting and code block.
I DON'T want to write an entire lexical parser, scanner application or use the DOM approach in a PHP script to just convert a small subset of BBCode tags to their HTML equivalents.
I DON'T want to allow pure HTML or even a stripped down sanitized version of HTML because there is no 100% guarantee that it is safe.
Can not regular expressions be used as an acceptable, but less than perfect alternative? Enforcing correct "structure" of tags is just too much to expect. I can correct wrongly nested tags from user input manually if need be. It saves far more time than a full fledged lexical analyzer-cum-parser. -
Faltzer at Mon, 23 Nov 2009 01:30:19 +0100
Late response, but I see this as worth mentioning:
Link to comment
If the problem with parsing is nesting and the misuse of regular expressions, then try parsing BBCode with SUIT's parser (which is stack-based, not dependent on any regular expressions).
For starters, the "Hello world" example you provided does not work on SUIT, because SUIT ignores bad nesting. By the [/i], the stack will be empty, and the [ b] ! [/ b] will be parsed correctly.
In case you want to have a look at how the parser works (it's heavily commented): http://suitframework.svn.sourceforge.net/viewvc/suitframework/suit/trunk/suit/suit.class.php?revision=33&view=markup -
John @ SEO Software at Sun, 06 Dec 2009 23:44:31 +0100
I heard lots of people say regex is bad idea to parse html, actually it is true for complex regex one time when i was parse a page it took the regex about 15 minutes to collected matches while a simple loop did it instantly for me.
Link to comment
however i think regex is a very good idea when it comes to simple tasks like parsing emails or so...more simple tasks. -
Nikita Popov at Fri, 22 Jan 2010 13:00:39 +0100
Sorry, but i really do disagree with what you say. I think using RegExp or not is only a question on how good you are at RegExp and a question of performance.
Link to comment
PHPs Implementation of regular expressions (I talk about PCRE) does support recursion.
Here's an example for a quote-BB-Code:
#\[quote]((?>[^[]|\[(?!/?quote])|(?R))+)\[/quote]#si
In my eyes it's fairly obvious what it does and I made a little performance check to: Running the RegExp one million times (preg_replace_callback is used for replacing) takes from three to ten seconds, depending on the complexity of the input stream. Obviously, hundrets of encapsulated [quote]s would need a little bit more processing time.
The above regexp could be changed to parse *all* BB-Codes:
#\[([a-z]+)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si
Now, this is *one* line of code and it does all you want. You only need to call it recursively with preg_replace_callback and do the correct replacement, depending on what is in $matches[1] (bb-code) -
Kore at Fri, 22 Jan 2010 13:46:04 +0100
@Nikita Popov: PCRE are no regular expressions any more. I covered parsing with PCRE in another blog post:
Link to comment
http://kore-nordmann.de/blog/parse_with_regexp.html
It is still meaningless to try to parse with PCRE, since you get no AST or proper match arrays from matching the string. You can only tell if the string is valid markup, nothing else. You can't get no proper error reporting, either. -
Nikita Popov at Fri, 22 Jan 2010 14:01:09 +0100
Wow, fast answer.
Link to comment
Obviously PCRE has very little in common with what the computer scientists call regular expressions. But if talking about PHP RegExp always refers to PCRE.
Sure you can parse with regex, it's only little bit slow, but on the other hand much easier to implement then some complicated parser / lexer / tokenzier stuff.
Here some example sourcecode for the bb-code [ident] (which has no meaining at all):
function doit(&$matches) {
if(is_array($matches)) {
if(stripos($matches[1], '[ident]'))
return '{IDENT}' . doit($matches[1]) . '{/IDENT}';
return '{IDENT}' . $matches[1] . '{/IDENT}';
}
$regexp = '#\[ident]((?>[^[]|\[(?!/?ident])|(?R))+)\[/ident]#si';
return preg_replace_callback($regexp,'doit',$matches);
}
Only argument against it is performance.
In reality you maybe would use this one instead:
#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si
and would use distinguish several replacements, depending on the bb-code. -
Kore at Fri, 22 Jan 2010 15:10:12 +0100
@Nikita Popov:
Link to comment
a) You still don't get proper matches for recursively stacked tags, or am I missing something here? This only works for tags, which are not stacked.
b) You still don't get proper error reporting, f.e. "[a] ... [b] .. [/b]" would just leave the [a] unmatched & silently ignored.
Using the input string "[foo] ... [i]Hello [b]World[/b][/i]. Hello [i]internet[/i]." with your (adapted) example code results in "[foo] ... {IDENT}i{/IDENT}. Hello {IDENT}i{/IDENT}.". And from the values received in the callback function it at least always seemed to me, that there is no way to properly handle recursive markup this way.
On the other hand, you are also calling your own regular expression matching function recursively yourself, which could already be considered a simple "parser". This example shows the problem: http://k023.de/parse.txt
a) The outer stuff is never properly reported
b) The markup error is ignored, and you cannot use any anchors like \A or ^ to prevent from this.
Writing a proper lexer & parser is not really hard on the other hand, and results in far more readable code. See this presentation [1] or this code [2] as an example.
[1] http://kore-nordmann.de/talks/09_08_parsing_with_php.pdf
[2] http://svn.ez.no/svn/ezcomponents/trunk/Document/src/pcss/parser.php -
Nikita Popov at Fri, 22 Jan 2010 18:04:45 +0100
The code I posted above works only with the first regular expression. For the second it must be adapted.
Link to comment
Here some (untested) code how it could work:
[code]
function doit(&$matches) {
if(is_array($matches)) {
$start, $end;
if($matches[1] == 'b') {
$start = '<strong>';
$end = '</strong>';
}
elseif($matches[1] == 'i') {
$start = '<em>';
$end = '</em>';
}
elseif($matches[1] == 'quote') {
$start = '<blockquote>';
$start = '</blockquote>';
}
$content = $matches[1];
if(preg_match('#\[(?>b|i|quote)]#si', $content)) // for optimisation, for not applying th whole regex if no further nesting is there
$content = doit($content);
return $start . $content . $end;
}
$regexp = '#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si';
return preg_replace_callback($regexp,'doit',$matches); // this is for initialization and recursion later
}
[/code]
I haven't tested it, therefore I don't know whether it works or not. But you should get the basic idea. (I'll test later how well this works...)
This one would (or should) replace
doit('[foo] ... [i]Hello [b]World[/b][/i]. Hello [i]internet[/i].')
By
'[foo] ... <em>'.doit('Hello [b]World[/b]').'</em>. Hello <em>internet</em>.'
and
'[foo] ... <em>Hello<strong>World</strong></em>. Hello <em>internet</em>.'
in the end.
> a) You still don't get proper matches for recursively stacked tags, or am I
> missing something here? This only works for tags, which are not stacked.
Sorry, my English isn't that good, I don't understand what you want to say with "recursively stacked tags".
> b) You still don't get proper error reporting, f.e. "[a] ... [b] .. [/b]" would just
> leave the [a] unmatched & silently ignored.
Yep. This is intended behavior in my eyes.
I will have a look at your links... -
Nikita Popov at Fri, 22 Jan 2010 18:21:18 +0100
Ooops, im sorry, mixed up $matches[1] and $matches[2] in the code above.
Link to comment
This one works (tested):
function doit(&$matches) {
if(is_array($matches)) {
$start = $end = '';
if($matches[1] == 'b') {
$start = '<strong>';
$end = '</strong>';
}
elseif($matches[1] == 'i') {
$start = '<em>';
$end = '</em>';
}
elseif($matches[1] == 'quote') {
$start = '<blockquote>';
$start = '</blockquote>';
}
$content = $matches[2];
if(preg_match('#\[(?>b|i|quote)]#si', $content)) // for optimisation, for not applying th whole regex if no further nesting is there
$content = doit($content);
return $start . $content . $end;
}
$regexp = '#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si';
return preg_replace_callback($regexp,'doit',$matches); // this is for initialization and recursion later
} -
Anonymous at Tue, 13 Apr 2010 05:01:14 +0200
BBCode isn't entirely useless, as it allows bbcode implementers to design custom tags without forcing posters to use invalid html tags.
Link to comment
A common example of this is the [spoiler] tag, which makes text invisible until a mouseover. If you tried to implement this on a site that allowed posters to use a subset of HTML (<b><i> etc), having users type in <spoiler> could cause malformed XML. This can be avoided, of course, by sanitizing the comment right after submission, but why would you want to confuse people by having them refer to a nonexistent HTML tag anyways?
You can, of course, debate the importance of a spoiler tag, but it's just an example. If you're going the route of custom tags, it's better to go out and say that you're non-standard from the beginning than to pretend that you're using valid HTML. -
Anon Cow at Wed, 14 Apr 2010 05:25:24 +0200
Worth pointing out that if you allow a whitelist of HTML rather than using BBCode you will inevitably get someone complaining that they can't use a certain tag, especially in forum/comment and other public style software. So, from a usability perspective BBCode beats HTML because it manages the users expectations of what they can and can't do.
Link to comment
A Rich Text style UI is obviously the better for simple formatting, although it has usability drawbacks for more complex formatting and features such as [code] or [spoiler] tags, as well as the kinds of tagging Wiki uses. -
Ornela @ cheap host at Thu, 22 Jul 2010 16:32:21 +0200
i agree, using wysiwyg is much more simple for clients.
Link to comment -
Akki at Sun, 01 Aug 2010 09:44:38 +0200
The reason you would want to use BBCode over HTML is because you don't want to force the hole standard on your users. They don't care for things like semantics, they just want that word in bold.
Link to comment
You also have a lot of associations that are made simple:
[img]url[/img] => <img src="url" />
[b]text[/b] => <istrong>text<i/strong>
[i]text[/i] => <iem>text<i/em>
For youtube: [yt]id[/yt] => (insert mountain of code here)
[size=3]text[/size] => <h2>, <h3>, or some CSS thing
You also have questions like, should the users insert their own p tags?
And I disagree Kore, ] is not easier to type then > (because of the SHIFT involved).
Anyway, great article on why you can't/shouldn't be using regex to parse. :) -
deejayy at Fri, 06 Aug 2010 14:02:50 +0200
Yeah, and please substitute this with html:
Link to comment
[flash]/upload/demo.swf[/flash] -
SasQ at Wed, 01 Sep 2010 03:53:31 +0200
It's fun to see people saying "I *can* parse HTML/BBcode with RegExps" and then pasting a code which is nothing less than a simple top-down recursive-descent parser encoded in PHP's functions/callbacks ;)
Link to comment
And as to the non-HTML BBcodes like `[spoiler]` or `[flash]`:
You can simply allow something like this: `<div class="spoiler">...</div>` for special blocks. For the Flash you can use: `<a class="flash">http://path.to/flash.swf</a>` and substitute it with proper markup in your script. SWFObject does exactly that, but from the JS level. -
SasQ at Wed, 01 Sep 2010 04:17:15 +0200
@Nikita Popov:
Link to comment
"Sorry, my English isn't that good, I don't understand what you want to say with 'recursively stacked tags'."
Something like that:
<p>These are <i><b>nested/stacked</b> tags</i>. Try that.</p>
Here the tags are nested hierarchically, one into another. You can think of it as of parentheses:
(These are ((nested/stacked) tags). Try that.)
And you can probably see that you have to check if they're nested properly. If they're not, like in this example:
<p>These are <i>badly <b>nested</i> tags</b>.</p>
a RegExp can't handle with it, because it understands only one level of nesting (Chomsky's level 3 grammar). You need a level 2 grammar parser which understands recursive, hierarchical, self-similar patterns.
But it doesn't mean that you have to write a full-blown parsing engine ;) Often it could be done by a simple recursive-descent parser using recursive function calls for each non-terminal symbol of the grammar. -
Brandon @ How to become a ticket broker at Fri, 12 Nov 2010 01:21:07 +0100
Honestly, I find BBCode even more confusing. Yes, it is more intuitive, but I feel like I've already gotten so used to HTML that I get tripped up when BBcode is introduced.
Link to comment
Given that pages are coded in HTML and not in BBCode, my recommendation is that people learn basic HTML. It's intimidating at the start but eventually you get the hang of it. -
Emma30 at Mon, 22 Nov 2010 08:33:15 +0100
This is good to know, I was getting a headache trying to parse bbcode with regular expressions. I'm still however looking for a simple alternative, these parsers seem a bit over-the-top.
Link to comment -
Tom Christiansen at Mon, 22 Nov 2010 20:24:55 +0100
Your posting is inapplicable to modern pattern-matching languages. Please stop using REGULAR that way and hitting people over the head with theoretical computability concerns that simply do not apply to any modern pattern in common use today.
Link to comment
Even GNU egrep, originally a DFA, now matches "(.)\1" with ease, by definition a non-REGULAR language. I know of no modern programming language that requires REGULAR regular expressions.
Please note that I am here using the offensively all-capital "REGULAR" to denote that highly irregular ivory-tower sense of the otherwise common English word "regular". No one uses REGULAR expressions any more (or next to no one). We all use regular expressions, which are something altogether different.
No one has REGULAR expressions. Even the most basic of regular expressions are non-REGULAR. That's because POSIX demands backref support in BREs for compliance, thereby putting the lie to the silly old REGULARity chestnut. It was time to retire it more than 30 years ago; I can't believe people still think it matters -- or applies. It does not.
In other matters, I do not know which regex dialects are restricted to 99 backrefs the way you mention. Certainly Perl and PCRE are not; you may go as high as you need to, provided you have enough memory for it. In constructed patterns, I've used up into the 100s of them, but that was simply to test the limits. In practice, you never need to go that high, because other mechanisms are available to you.
Case in point. Consider the modern regular expression commonly used for matching balanced parens: that's just plain " \((?:[^()]*+|(?0))*\)", a pattern both simple and elegant. You do not need backrefs of arbitrarily high numbering, since arbitrarily many of them are accommodated through simple recursion, a key approach in any computer scientist's algorithmic arsenal.
All that aside, I do recommend that you read Russ Cox's excellent paper, "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)." Because we all derive from Henry's regex algorithm instead of Ken's, certain operations become pathologically slow, but need not do so. A dynamic, hybrid approach that at regcomp time determines whether the alternate algorithm is feasible in any given pattern seems the obvious way to approach this, and some already do. -
flower shop at Mon, 10 Jan 2011 15:56:46 +0100
Thanks for explanation..I haven´t uderstood what exactelly BBCodes are until now
Link to comment -
FirestormX at Wed, 26 Jan 2011 00:06:16 +0100
Sorry to post on such an old article - a good article by the way - but I thought this very important point got brushed off:
Link to comment
bbcode is used for custom/predefined markup and functionality.
For example a [video] tag could be replaced by something like <div class="video"> or something, and then parsed later - but wouldn't that be practically the same as parsing [video]?
In addition, it helps to keep things styled in a way that matches your website's layout. A simple example is the {quote] tag. In order for all quotes to match other quotes on your website, a casual browser would have to know your style sheets. Like:
<div class="quote_header">Quote by person 1</div><div class="quote_body">test</div>
As opposed to just [quote=person 1]text[/quote]
Similarly, I have a [user] tag on my own site, which parses the user name and links back to that particular user's profile page. If I parse [url=firestormx], it's very short for the poster to write, and I have full control of it in the future. If I ever change the URL for a profile page - or more likely - change where I want the url tag to direct to (say I want to change it to link to sending a user a PM) it is very simple for me to do that in the code.
I'm not arguing the logic behind what you presented for not using regular expressions for parsing bbcode, but rather I'm arguing that bbcode is useful to do things above and beyond what a user would be able to do with white-listed html.
bbcode can be used for site-specific functionality, as well as consistent styling of content across the entire website. -
HGH at Fri, 18 Feb 2011 16:30:00 +0100
Even GNU egrep, originally a DFA, now matches "(.)\1" with ease, by definition a non-REGULAR language. I know of no modern programming language that requires REGULAR regular expressions.
Link to comment -
topanky at Wed, 02 Mar 2011 17:25:12 +0100
Thank you very much for this post - it was very helpful for me!
Link to comment -
Callum Macrae at Fri, 18 Mar 2011 07:27:02 +0100
Just read your article, and don't agree with it. I recently built a Regex parser for my framework, and even though it parses a string into a Regex ([url={URL}]{all}[/url]), it's *extremely* lightweight:
Link to comment
https://github.com/callumacrae/lynx-framework/blob/feature%2Fhelpers/lynx/helpers/text/text.php
I haven't benchmarked it yet, but it would be interesting to see how well it stands against something like the bbcode extension in a benchmark. -
ridgerunner at Mon, 04 Apr 2011 01:38:35 +0200
I would just like to wholeheartedly agree with everything Tom Christiansen said in his excellent comment above. I recently implemented a new PHP BBCode parser for the FluxBB forum software project which uses a (non-REGULAR) recursive regex called recursively by a preg_replace_callback() function. It easily and efficiently handles nested elements and performs quite well. And the recursive design makes it a trivial matter to identify erroneous markup such as orphan tags and illegally nested tags.
Link to comment
Won't you please join the rest of us here in the 21st century!? -
Kore at Mon, 04 Apr 2011 08:48:50 +0200
@Tom Christiansen & ridgerunner:
Link to comment
Years ago I also wrote a blog post "Parse with regular expressions" [1] which is referenced in the trackbacks. There I show how you can handle recursive structures with PCRE. I am well aware of this since years.
It does not change the fact that handling recursive structures is impossible with _regular_expressions_. PCRE "regular expressions aren't regular expressions any more. But trying to implement parsers with them has major limitations, which, from my experience, result in hard-to-maintain code (even I read PCRE fluently).
I implemented multiple parsers in my live and I can only suggest to you to only use regular expressions for tokenizing and implement some of the common parser approaches for actually building the AST. For trivial languages like BBCode this is easy, efficient and very readable.
[1] http://kore-nordmann.de/blog/parse_with_regexp.html -
Bryan at Thu, 26 May 2011 17:21:00 +0200
First off I should preface this by saying that I don't disagree with anything you're saying, and I don't think you're wrong at all :)
Link to comment
But what I do want to say is that in reference to the story at the top of the page about the guy wanting help with using regular expressions to parse bbcode, you might be misinterpreting his request.
I understand what you're saying about the computer science yes/no's of that request, but I think you're taking his request a little too literal. He is probably writing an algorithm to do the actual "parsing", and simply using regular expressions to find the patterns that that algorithm needs to find, and just wanted help with said regular expression parts of his algorithm. This is not at all impossible and is how a lot of people handle simple everyday parsing needs that they run into during normal web work, myself having successfully done it many times from bbcode to WP style shortcodes to wiki codes to proprietary mumbo-jumbo codes.
Keep in mind that you may be a little smarter that the average question asker of a PHP forum. When they say they are "writing regular expressions to parse bbcode" what they likely mean is they are "writing an algorithm to parse bbcode and it uses regular expressions as part of that algorithm." :)
Also, on the issue of the usefulness of bbcode...you incorrectly explained what bbcode is. By your definition (same as HTML but with [] instead of <>) then bbcode absolutely sounds moronic. But that's NOT what bbcode is. There are some tags where essentially it boils down to that, but the real reason for bbcode is to offer less versatility than HTML with simpler usage than HTML, and just use [] instead of <> because otherwise bbcode and HTML code would be too easy to mistake for each other.
For example in HTML to embed an image you'd use <img src="url" />. bbcode for that is [img]url[/img]. The major difference being you can make bbcode do most of it's basic functions without people understanding the concept and syntax of attributes, only the concept of start tag/end tag. As simple as attributes may seem, getting an average butcher/baker/candlestick maker to properly write them just to put an image in a forum is often an exercise in futility, yet bbcode they quite often grasp right away. The same holds true for making a url. Instead of the more versatile <a href="url">text</a> you just use [url]url[/url] which is simpler but gives up the separation of visible text from link text.
They also often act in a "shortcode" type way simplifying to one tag what would be a little more complex in HTML. Such as [youtube]id[/youtube] being used to embed a video, which profoundly more people understand than the actual youtube embed html. -
MrBarry at Thu, 14 Jul 2011 17:57:56 +0200
I think you have some good points, but I disagree about your opinion on whether it's even worth while to use bbcode, since it has similar complexity to xml.
Link to comment
The reason people use it is for security - it limits how users can code, and offers a filter for the code, too. This, however, can still be done with xml based languages, like html, but it requires a different set of filters, such as elimintating javascript hijacking, removing tags that would purposefully disrupt the generated page.
For instance, in a forum situation (hence the primary reason for using bbcode), if every post is wrapped inside a table, it will look something like this:
<table>
<tr>
<td>User Name</td>
<td> -Post- </td>
</tr>
</table/>
Now, with no security, and if people were using straight html, it would be easy to do this in the post:
"</td></tr></table>lol Caused a forum error!"
All subsequent posts would be heavily distorted.
Now, noobs would find this funny, but it's not a practical hijack - it would be better to just insert javascript, but since it would be easy to clear off <script> tags and all on*="" html attributes, this is the next best thing.
The solution, of course, is to limit the expressive language of html, so you have an exclusive list of html tags that are able to be used, let's say they are:
em,u,i,ul,li,a,img,embed,object, and param, which is the typical bbcode setup (at least by the bbcode standard tags). These elements still need to be parsed for valid markup, and for clean attributes (would want to make sure all js events are nuked, along with styles, classes, and ids), and at this point, you're doing the same amount of work for bbcode. This isn't including checks to make sure embedded objects are just videos and not malicious scripts.
The difference, of course, is that bbcode already limits users' ability to add malicious scripts, and should be checking for valid markup. bbcode allows all the above-mentioned html tags, but does this in a static and safe way (all the tags have pre-configured styles and settings that cannot be changed by the end-user).
So the balance is whether you want security, expressiveness, or some sort of in-between state. BBcode offers more security than plain html, in both filtering invalid markup, but also malicious code. -
Xenland at Sat, 23 Jul 2011 23:14:18 +0200
This is a very good article about how ridicules bbcode is in comparison to the pros and cons of HTML. However, the point of bbcode or anything else that has there own formatting for website input from the user is because its a safer, easier approach to validating that the user isn't attempting to inject javascript code that could to malicious things such as for example having a javascript code steal the session cookie(Yes this can happen with PHP sessions too!) and then attempt to act as that user gaining access to w/e sensitive data that is located on that site.
Link to comment -
JD Bell at Tue, 26 Jul 2011 14:58:21 +0200
Perhaps ECMA Script-style Regular Expressions are not capable of recognizing structure, but many advancements in various implementations do allow for parsing structure, at least in a limited sense. Jeffrey Friedl presents several examples of matching balanced constructs and .NET even has its own "balancing group definition" (http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#balancing_group_definition).
Link to comment
I agree, however, that true parsing is going to require greater sophistication and sensitivity to context than regular expressions can provide. -
Peter Thoeny at Thu, 15 Sep 2011 20:49:11 +0200
I found a combination of regular expressions and recursive function calls to be very fast to parse nested structures. My blog post about this: http://twiki.org/cgi-bin/view/Blog/BlogEntry201109x3
Link to comment -
Robbie at Fri, 14 Oct 2011 11:33:13 +0200
First of all, thank you very much for your opinion, which I really appreciate. Of course you are right. As a developer of a forum software, it would be much easier for me to handle formating without bb-codes. BUT, there are some very good reasons doing it this way! Especially security (e.g. CSS attacks)! In my opinion there is no better way to parse bb-code with regular expressions. And yes! It takes some days to develop a regex pattern, which is good enough to be able to avoid unwanted entries and clientside mistakes.
Link to comment -
Joseph Szymborski at Sat, 23 Jun 2012 20:02:28 +0200
While I agree that BBCode is not ideal, I do not agree that it is useless. It is a matter of great security. If you replace < and > with their html equivalents without exception, you're increasing security by a ton. BBCode allows you to maintain a whitelist of tags while not using < or >.
Link to comment
Now, Markdown achieves all this in a much more elegant way, but the learning curve for it is much higher than BBCode for those familiar with HTML.
There are definitely security and usability benefits to BBCode, but Markdown does this much more efficiently, elegantly and securely in my opinion -
vzcglviooz at Fri, 24 May 2013 18:55:24 +0200
I am not clear if this is the correct area to put this, but I have a heartfelt concern.
Link to comment
What did you do to lose weight fast?. What was your workout routine?. I need to lose weight and I want to do it fast.. Like intense workouts. But no equiptment besides a treadmill.. Flatter stomach,. just get the fat off of all around my body. fast.. how long did it take you, how much did you lose etc?.
Fields with bold names are mandatory.
Extracting data from HTML on Sun, 24 May 2009 12:55:49 +0200 in Kore Nordmann - PHP / Projects / Politics
A lot of people try to scrape content from HTML - the first approach always
seem to be regular expressions, which are incapable of parsing HTML - which
I proved earlier, already. So, how to do it properly with PHP?
Parse with regular expressions on Sun, 24 May 2009 12:59:22 +0200 in Kore Nordmann - PHP / Projects / Politics
With recursive patterns in PCRE you can actually match recursive structures,
even you should not try this. A regular expression to validate BBCode
documents is included in the blog post.