Do NOT try parsing with regular expressions
Some days ago somebody joined a German IRC channel where we try to help others using PHP, and where I have been hanging around for some years now. He had problems with the regular expressions he was using to "parse" BBCode for his custom application. I answered something like:
It is impossible to parse a language like BBCode with regular expressions because you only may parse regular languages using regular expressions.
The usual reaction: no reaction. No belief. Some days later nearly the same question came again, from the same guy, with the same answer from me and the same result. Well, not quite the same result: I decided to summarize the reasons for this in a small blog post.
If you are just looking for a way to "parse" HTML, I added a new blog entry, which describes how to do that: "Extracting data from HTML".
What are BBCodes?
BBCodes, as you can see from the Wikipedia link mentioned above, belong to the same class of languages as HTML. In my eyes it does not make any sense to replace (X)HTML with a markup language which has exactly the same complexity, both for parsing and for the user typing it. BBCode is nothing more than a subset of HTML using [] instead of <>. Stupid, but OK, we now have this kind of markup.
Having a language like HTML, more or less XML, we get the same problems here. If you want to do the parsing properly you need to handle recursive tag structures, need to check for matching tags etc.
You may of course just replace the BBCode with the corresponding HTML elements, but then you might not only end up with invalid code, but also with user-provided inline HTML which destroys the layout of your website.
The second thing you might do is check for a closing tag for each opened tag, or only replace BBCodes which also have a closing tag. But you can still easily find examples where this creates invalid markup, which might not always destroy your layout, but results in undefined behaviour in the displaying browser.
[b] hello [i] [/b] world [/i] [b] ! [/b]
You may also check for this using a regular expression, but trust me, I will always find a way to bypass your checks - the next section will prove this.
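To make the problem concrete, here is a minimal sketch of the "replace only tags which also have a closing tag" approach, applied to the example string. The function name pairwiseReplace and the restriction to [b] and [i] are assumptions made for this illustration, not code from any real BBCode package.

```php
<?php
// Replace each BBCode pair on its own, without looking at the overall
// structure. pairwiseReplace() is a hypothetical helper for this sketch.
function pairwiseReplace(string $text): string
{
    // Neutralize user-provided HTML before generating our own tags.
    $text = htmlspecialchars($text);

    foreach (['b', 'i'] as $tag) {
        // Non-greedy match from an opening tag to the next closing tag.
        $pattern = '#\[' . $tag . '\](.*?)\[/' . $tag . '\]#s';
        $text = preg_replace($pattern, '<' . $tag . '>$1</' . $tag . '>', $text);
    }

    return $text;
}

echo pairwiseReplace('[b] hello [i] [/b] world [/i] [b] ! [/b]'), "\n";
// prints: <b> hello <i> </b> world </i> <b> ! </b>
```

Every tag found its partner, so the check "passes", but the resulting elements overlap instead of nesting, which is exactly the invalid markup described above.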
Parsing recursive structures with regular expressions
It is quite easy to prove that it is not possible to properly detect and parse recursive structures using regular expressions. If you have studied computer science you know the Chomsky hierarchy and therefore know that regular expressions correspond to type 3 grammars, which describe the regular languages and are equivalent to finite state machines. (So you also know that each regular expression is quite easy to transform into an FSM.)
So let's take a step further and prove that recursive structures are not parseable. The pumping lemma cannot be used to prove that a language is a regular language, but it can be used to prove that a language is not a regular language.
Talking about recursive stuff, let's take one of the simplest possible examples - the braces language. You may think of this like opening and closing elements instead of opening and closing braces. Consider the language:
L = { (^m )^m; m >= 1 }
All words that consist of m opening braces followed by m closing braces. If L were a regular language, the pumping lemma would guarantee that a number n exists, so that every word x of this language with a length of at least n can be split into three parts. Take the word with the length 2 * n:
x = u v w = (^n )^n
where u, v and w are words fulfilling the following conditions:
Length of the word uv is less than or equal to n
Length of the word v is at least 1
Length of the word uvw is of course two times n
The pumping lemma now says that every word of the following form is always an element of the language, if L is a regular language:
u v^k w; k >= 0
Since uv is at most n characters long, u and v consist of opening braces only. With these definitions you get, for some j >= 0 and i >= 1 with j + i <= n:
u = (^j
v = (^i
w = (^( n - j - i ) )^n
Using for example k = 2, you get a word which is not an element of the language we defined at the beginning:
x = (^( n + i ) )^n
The number of opening braces obviously does not match the number of closing braces any more, so this word is not a member of the previously defined language. That contradicts the pumping lemma, so L cannot be regular. That's all you need to prove the point.
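The pumping argument can also be tried out for a concrete n. The following sketch is only an illustration, not a proof: it builds u v^k w for a split where u consists of j and v of i opening braces (the index j and both function names are assumptions introduced here; j = n - i is the case where v ends exactly at the last opening brace) and checks membership by plain counting.

```php
<?php
// Membership test for L = { (^m )^m; m >= 1 }, done by counting, which is
// precisely what a finite state machine cannot do for unbounded m.
function inBracesLanguage(string $word): bool
{
    $m = strspn($word, '(');   // number of leading opening braces
    return $m >= 1 && substr($word, $m) === str_repeat(')', $m);
}

// Build u v^k w for the split u = (^j, v = (^i, w = (^(n-j-i) )^n.
function pump(int $n, int $j, int $i, int $k): string
{
    $u = str_repeat('(', $j);
    $v = str_repeat('(', $i);
    $w = str_repeat('(', $n - $j - $i) . str_repeat(')', $n);
    return $u . str_repeat($v, $k) . $w;
}

var_dump(inBracesLanguage(pump(3, 2, 1, 1))); // k = 1 is the original word: bool(true)
var_dump(inBracesLanguage(pump(3, 2, 1, 2))); // pumped once more: bool(false)
```

Whatever values are chosen for j and i (with i >= 1), every k other than 1 changes only the number of opening braces, so the pumped word falls out of the language.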
I can match this language with PCRE!
Yes, you can. PCRE can do more than regular expressions can do by definition. The class of languages you can detect with PCRE regular expressions is, as far as I know, not clearly defined. The number of back references is limited (as is the memory in every computer, but 99 back references are far less than some GBs of memory for your parser ;), and so on.
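As a small sketch of what "more than regular" means here: PCRE's recursive subpattern call (?1) lets a pattern refer to itself, which is enough to recognize the braces language from the previous section. The function name isBracesWord is an assumption for this example.

```php
<?php
// (?1) calls capture group 1 recursively, so the group can contain itself:
// one opening brace, optionally a nested copy of the group, one closing brace.
function isBracesWord(string $word): bool
{
    // \A and \z anchor the match to the whole string.
    // preg_match() returns false on error, which is treated as "no match".
    return preg_match('/\A(\((?1)?\))\z/', $word) === 1;
}

var_dump(isBracesWord('((()))')); // bool(true)
var_dump(isBracesWord('(()'));    // bool(false)
```

Note that this only answers yes or no for the whole string. It does not produce a tree of the nested structure, which is what a parser is needed for.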
Some people say that regular expressions are never readable. Even though I disagree here, I would say that a real parser written using PCRE regular expressions will definitely be unreadable. But I will evaluate this in further detail and write a follow-up on this.
Where you can use regular expressions in your parser
When you really want to parse BBCodes, write a conventional compiler-style parser using syntax trees etc. - or simply use a common rich text editor and strip down the resulting (X)HTML to the set you want to allow in your application. DOM and similar PHP extensions do a great job here - and they ensure valid markup.
When writing a custom parser you can - and should - use regular expressions in the tokenizer. They do a great job here. To rephrase it:
Use regular expressions to recognize words, not structures.
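A minimal sketch of that division of labour, assuming a tiny BBCode dialect with only [b] and [i]: a regular expression splits the input into tokens, and a stack keeps track of the structure. The function names tokenize and toHtml are made up for this example and the sketch is far from a complete BBCode implementation.

```php
<?php
// Tokenizer: a regular expression recognizes the "words" (tags and text).
function tokenize(string $input): array
{
    return preg_split(
        '#(\[/?(?:b|i)\])#',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Parser: a stack of open tags checks the structure, which the regular
// expression alone cannot do.
function toHtml(string $input): string
{
    $stack = [];
    $html = '';

    foreach (tokenize($input) as $token) {
        if (preg_match('#^\[(b|i)\]$#', $token, $m)) {
            $stack[] = $m[1];
            $html .= "<{$m[1]}>";
        } elseif (preg_match('#^\[/(b|i)\]$#', $token, $m)) {
            // The closing tag must match the most recently opened tag.
            if (end($stack) !== $m[1]) {
                throw new InvalidArgumentException("Unexpected [/{$m[1]}]");
            }
            array_pop($stack);
            $html .= "</{$m[1]}>";
        } else {
            $html .= htmlspecialchars($token);
        }
    }

    if ($stack !== []) {
        throw new InvalidArgumentException('Unclosed [' . end($stack) . ']');
    }

    return $html;
}

echo toHtml('[b]Hello [i]World[/i][/b]'), "\n";
// prints: <b>Hello <i>World</i></b>
```

For a badly nested input like "[b] hello [i] [/b] world [/i]", this sketch throws an exception at the first misplaced closing tag instead of emitting invalid markup. Rejecting the input is only one possible design choice; a real implementation might repair the nesting or output the offending tag as plain text.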
Using BBCode in your application
As written earlier, I personally do not see any sense in using BBCodes or even wiki markup: their complexity equals the complexity of HTML, they are another language for the user to learn, they differ in details in each implementation, and they do not reuse potential knowledge of the user about HTML. And even if the user does not know HTML yet, why should it be harder to learn your chosen HTML subset than your special BBCode implementation? Are <> really so much harder to type than []? On German keyboards they are even a lot easier to type...
But if you really want to use such stuff, there are existing packages which help you use it in your application, so that you can reuse existing, working parsers instead of ending up with non-working regular expression experiments and investing hours in fixing funny edge cases. Some examples:
Trackbacks
Comments
-
Whisller at Sat, 28 Jul 2007 02:28:16 +0200
I exactly think like You. I don't understand why users think [b] or *string* is better than <b>. I'm using html purifier, I think it is good library - but now I don't have a time to read documentation and code files.
-
Martin Holzhauer at Sat, 28 Jul 2007 07:15:56 +0200
There is also an bbcode Extension[1] in Pecl in an early development state
1. http://pecl.php.net/package/bbcode/ -
kore at Sat, 28 Jul 2007 10:07:21 +0200
I see some sense in using *foo* and _bar_ for markup, because this established in the Usenet, maybe even before HTML was used, and is still used in a lot of text base applications, like Mail and IRC.
... but you just could keep those, because everyone will understand this markup even without conversion to HTML, I think. -
Toby at Sat, 28 Jul 2007 10:18:35 +0200
Who has been annoying you? ;)
-
Christopher Hogan at Sun, 29 Jul 2007 02:48:45 +0200
I concur, I use a form of markdown if the client requests it in the CMS. But my software also provides a secondary option using tinyMCE for clients that want WYSIWYG editing functionality. I think wysiwyg is WAY more simple for clients. Since the TinyMCE Moxicode editor displays html pretty much the same way dreamweaver would, the only other steps I take to ensure valid XHTML are:
1.) use DOM to validate
2.) using the parse tags functionality in DOM I check the content in the tags for ISO-8859 extended char sets and filter them (htmlentities function and htmlspecialchars functions won't replace them into entities correctly). I do this by using PHP 5.2 and the new filter_input/filter_var/filter_array functions
3.) encode ampersands/ addslashes or check for / use magic quotes
if I'm putting it in an RDBMS.
4.) My version of the backend code for TinyMCE uses AJAX to update elements on parts of the page dynamically and re-export the files as regular old HTML. -
Ammar Ibrahim at Sun, 29 Jul 2007 08:46:13 +0200
Exactly! This is also similar discussion to templating engines, where they try to simplify if statements and such. The thing they don't understand is that it makes things harder. Wiki syntax is very hard compared to HTML, I still don't understand why it was invented.
But the only advantage of BBCode is the protection against XSS, it's very easy to strip out all HTML, and then convert BBCode to HTML. But if you allow users to write HTML, sanitizing HTML is extremely complicated. -
your mother at Tue, 28 Aug 2007 10:37:46 +0200
i hope you are brushing your teeth dear!
-
Boris at Wed, 29 Aug 2007 17:01:11 +0200
Actually there is a good point for using BBCode instead of a small html subset, it is much easier to avoid CSS-attacks.
-
Kore at Wed, 29 Aug 2007 23:53:07 +0200
@Boris: No, actually not. You can just do the same white listing approach.
With BBCode you only convert a subset, while you just allow a very specific subset with (X)Html - in most cases just do not allow any attributes and you are safe. -
_p_ at Fri, 31 Aug 2007 17:56:34 +0200
Isn't it a secure way to generate markup?
strip tags and then replace bbcode with regular expressions to
have the control over what we generate ((X)HTML etc.). -
Kore at Sat, 01 Sep 2007 11:47:56 +0200
@_p_: Please actually read the blog post before commenting. This would help you to answer properly. Regular expressions are just not capable of converting BBCode to HTML.
-
Evert at Mon, 03 Sep 2007 17:50:50 +0200
I use regular expressions for parsing non-valid html.. I do need some PHP around it to aid it though :)
I do think BBCode is more secure than trying to clean html. Black lists for HTML almost never are 100% waterproof, just because there are so many variations. By using BBCode, everything that wasn't handled correctly by the parser will simply be spit out as their BBCode.. e.g.: [script] is harmless. Making the same mistake with cleaning html would produce <script>.
Academically it could be just as secure, but when you consider that there might be bugs in software (and there always are) BBCode provides an extra fallback..
That doesn't make it pretty though, therefore I would still stick with HTML or a wiki-like syntax to do this stuff. -
Mike Seth at Mon, 03 Sep 2007 23:27:59 +0200
Hahahahahahaha Kore!
I've just written a post saying the same a couple of days ago! Indeed you were much more verbose this time! :)
here it is:
http://blog.mikeseth.com/index.php?/archives/1-For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS.html -
dirt at Thu, 25 Oct 2007 03:55:55 +0200
im using a wysiwyg js to format posts by users (in place of a textarea), and its fast and already set up for bbcode, i came across this post when researching the best way to take the bbcode in the database/posting and change it to html for output...
if you are saying regex isnt the way to go, then what is the most efficient way to format user posts? i looked at patBBCode but i don't have access to install anything on the server itself and if im correct that package needs to be installed via PEAR along with patError.. correct me if i'm wrong.
allowing straight html doesnt seem smart, so i figured a selected number of bbcode ([b][i][u][code][quote][color=#ff0000][url][img]) stored in the db/post and translating it when outputting would be the way to go...
what to do what to do.. -
sanjuro at Thu, 28 Feb 2008 17:50:18 +0100
This is good to know, I was getting a headache trying to parse bbcode with regular expressions. I'm still however looking for a simple alternative, these parsers seem a bit over-the-top.
-
wangtang at Fri, 29 Feb 2008 02:14:36 +0100
Well, there are actually some more arguments for BBCode you don't mention.
For once, people are used to it. Of course it would be desirable that people used and knew html, but they don't. They do however know BBCode, mostly the younger ones are accustomed to using BBCode in forums.
You also mention "[b] hello [i] [/b] world [/i] [b] ! [/b]" as an example. With <> instead of [], this wouldn't even be valid html-code, so there's no need to correct user input failures in the BBCode case (where I use span's anyway), at least in my book.
As for parsing BBCode with regular expressions: I've seen many sites which use a very limited set of BBCode, where this argument "It is quite easy to prove, that it is not possible to properly detect and parse recursive structures using regular expressions." does only apply in 0,01% the cases.
On top of that: It's easy to implement. Not perfect, but you don't have to read yourself into libraries parsing with syntax trees. And the "easy" argument dominates all others in many cases. While it surely is desirable to apply the theory of computer science to applications, pragmatically you can always accommodate the methods you're using to the demands which need to be met.
In the end, wysiwyg's will sooner than later eat "normal" forms with BBCode or html, as they are comfortable to use for the end-user, and easy to implement for webdesigners. Most of the newer wysiwyg's do output valid xhtml, so that's where it will go. In the meantime, I don't see html replace BBCode in forms, and regexps are a simple way to deal with a lot of the easier task tags. -
wangtang at Fri, 29 Feb 2008 02:32:56 +0100
P.S.: I know I sound like "let's use h1 for big text and address for italic text, and a table for layouting our site". But since we're not doing markup here, I'm still on the regexp side ;)
-
Kristjan Siimson at Tue, 07 Oct 2008 11:36:16 +0200
Thank you for this post. I was trying to do the same thing because it seems to make sense to use this kind of approach. That saved me a lot of time. But I'd like to point out that while writing [b] instead of <b> does not make life much easier, such codes can become useful when things get more complex, such as when adding flash objects.
-
Ab at Mon, 19 Jan 2009 10:33:51 +0100
Good post about regEx, but shame you're giving such bad advice in regards to forum/whatever-a-user-can-post-markup-in security. The main reason BB exists is because it is way more secure to disallow all HTML and have a small, totally under your own control set of Codes you can change to HTML yourself than allowing all HTML tags.
I have a feeling you know a lot about RegEx, but are rather clueless about secure programming and not trusting user-input. -
Kore at Mon, 19 Jan 2009 17:06:13 +0100
@Ab:
Please reread the text. Validating HTML against a white list of allowed elements and structures makes it possible to entirely validate user input given as HTML. There is no magic behind that.
On the other hand lots of broken, dumb BBCode-"parsers" directly convert attribute values into HTML, which can lead into script injection.
BBCode are *no* step for securing user input, this happens completely independent from the markup language. And you always need to validate the entire structure to properly validate input, which is impossible with regular expressions for BBCode like shown above. -
David King at Tue, 03 Mar 2009 01:07:01 +0100
Personally, I think that there's no real argument against HTML or BBCode parsing and security just so long as you do it right. With BBCode it's trivial to prevent inline javascript events (like adding onclick to a bold tag) because only [B] would be parsed and [B onclick="evil"] would be ignored. That leaves only [LINK] tags susceptible to attack - and decent sanitation there is a complete doddle.
For me, the reason to use BB-style over HTML is that you keep the user from thinking "ooh, HTML - lets get creative" then have them disappointed because you only support a small set of tags (STRONG, EM, U for example) leaving them thinking "bollocks, how can I edit my comment"? However when presented with BB-style they will know immediately that it is a simpler set of tags and to either use the given reference or provided toolbar -
hari at Sun, 12 Jul 2009 16:57:57 +0200
Hi Kore,
I read your arguments, but from a practical perspective, it doesn't seem to make sense to write a WHOLE parser - scanner for merely adding a few small tags for the end user. Especially for simple applications where you just need bold, italic, underline, quoting and code block.
I DON'T want to write an entire lexical parser, scanner application or use the DOM approach in a PHP script to just convert a small subset of BBCode tags to their HTML equivalents.
I DON'T want to allow pure HTML or even a stripped down sanitized version of HTML because there is no 100% guarantee that it is safe.
Can not regular expressions be used as an acceptable, but less than perfect alternative? Enforcing correct "structure" of tags is just too much to expect. I can correct wrongly nested tags from user input manually if need be. It saves far more time than a full fledged lexical analyzer-cum-parser. -
Faltzer at Mon, 23 Nov 2009 01:30:19 +0100
Late response, but I see this as worth mentioning:
If the problem with parsing is nesting and the misuse of regular expressions, then try parsing BBCode with SUIT's parser (which is stack-based, not dependent on any regular expressions).
For starters, the "Hello world" example you provided does not work on SUIT, because SUIT ignores bad nesting. By the [/i], the stack will be empty, and the [ b] ! [/ b] will be parsed correctly.
In case you want to have a look at how the parser works (it's heavily commented): http://suitframework.svn.sourceforge.net/viewvc/suitframework/suit/trunk/suit/suit.class.php?revision=33&view=markup -
John @ SEO Software at Sun, 06 Dec 2009 23:44:31 +0100
I heard lots of people say regex is bad idea to parse html, actually it is true for complex regex one time when i was parse a page it took the regex about 15 minutes to collected matches while a simple loop did it instantly for me.
however i think regex is a very good idea when it comes to simple tasks like parsing emails or so...more simple tasks. -
Nikita Popov at Fri, 22 Jan 2010 13:00:39 +0100
Sorry, but i really do disagree with what you say. I think using RegExp or not is only a question on how good you are at RegExp and a question of performance.
PHPs Implementation of regular expressions (I talk about PCRE) does support recursion.
Here's an example for a quote-BB-Code:
#\[quote]((?>[^[]|\[(?!/?quote])|(?R))+)\[/quote]#si
In my eyes it's fairly obvious what it does and I made a little performance check too: Running the RegExp one million times (preg_replace_callback is used for replacing) takes from three to ten seconds, depending on the complexity of the input stream. Obviously, hundreds of encapsulated [quote]s would need a little bit more processing time.
The above regexp could be changed to parse *all* BB-Codes:
#\[([a-z]+)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si
Now, this is *one* line of code and it does all you want. You only need to call it recursively with preg_replace_callback and do the correct replacement, depending on what is in $matches[1] (bb-code) -
Kore at Fri, 22 Jan 2010 13:46:04 +0100
@Nikita Popov: PCRE are no regular expressions any more. I covered parsing with PCRE in another blog post:
http://kore-nordmann.de/blog/parse_with_regexp.html
It is still meaningless to try to parse with PCRE, since you get no AST or proper match arrays from matching the string. You can only tell if the string is valid markup, nothing else. You can't get no proper error reporting, either. -
Nikita Popov at Fri, 22 Jan 2010 14:01:09 +0100
Wow, fast answer.
Obviously PCRE has very little in common with what the computer scientists call regular expressions. But if talking about PHP RegExp always refers to PCRE.
Sure you can parse with regex, it's only a little bit slow, but on the other hand much easier to implement than some complicated parser / lexer / tokenizer stuff.
Here is some example source code for the bb-code [ident] (which has no meaning at all):
function doit($matches) {
    // Called by preg_replace_callback() with the match array
    if (is_array($matches)) {
        // stripos() returns 0 for a hit at the start, so compare against false
        if (stripos($matches[1], '[ident]') !== false)
            return '{IDENT}' . doit($matches[1]) . '{/IDENT}';
        return '{IDENT}' . $matches[1] . '{/IDENT}';
    }
    // Called with a plain string: initial call and recursion
    $regexp = '#\[ident]((?>[^[]|\[(?!/?ident])|(?R))+)\[/ident]#si';
    return preg_replace_callback($regexp, 'doit', $matches);
}
Only argument against it is performance.
In reality you maybe would use this one instead:
#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si
and would use distinguish several replacements, depending on the bb-code. -
Kore at Fri, 22 Jan 2010 15:10:12 +0100
@Nikita Popov:
a) You still don't get proper matches for recursively stacked tags, or am I missing something here? This only works for tags, which are not stacked.
b) You still don't get proper error reporting, f.e. "[a] ... [b] .. [/b]" would just leave the [a] unmatched & silently ignored.
Using the input string "[foo] ... [i]Hello [b]World[/b][/i]. Hello [i]internet[/i]." with your (adapted) example code results in "[foo] ... {IDENT}i{/IDENT}. Hello {IDENT}i{/IDENT}.". And from the values received in the callback function it at least always seemed to me, that there is no way to properly handle recursive markup this way.
On the other hand, you are also calling your own regular expression matching function recursively yourself, which could already be considered a simple "parser". This example shows the problem: http://k023.de/parse.txt
a) The outer stuff is never properly reported
b) The markup error is ignored, and you cannot use any anchors like \A or ^ to prevent from this.
Writing a proper lexer & parser is not really hard on the other hand, and results in far more readable code. See this presentation [1] or this code [2] as an example.
[1] http://kore-nordmann.de/talks/09_08_parsing_with_php.pdf
[2] http://svn.ez.no/svn/ezcomponents/trunk/Document/src/pcss/parser.php -
Nikita Popov at Fri, 22 Jan 2010 18:04:45 +0100
The code I posted above works only with the first regular expression. For the second it must be adapted.
Here some (untested) code how it could work:
[code]
function doit($matches) {
    if (is_array($matches)) {
        $start = $end = '';
        if ($matches[1] == 'b') {
            $start = '<strong>';
            $end = '</strong>';
        }
        elseif ($matches[1] == 'i') {
            $start = '<em>';
            $end = '</em>';
        }
        elseif ($matches[1] == 'quote') {
            $start = '<blockquote>';
            $end = '</blockquote>';
        }
        $content = $matches[1]; // mixed up with $matches[2], see the correction in the next comment
        if (preg_match('#\[(?>b|i|quote)]#si', $content)) // optimisation: do not apply the whole regex if there is no further nesting
            $content = doit($content);
        return $start . $content . $end;
    }
    $regexp = '#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si';
    return preg_replace_callback($regexp, 'doit', $matches); // this is for initialization and recursion later
}
[/code]
I haven't tested it, therefore I don't know whether it works or not. But you should get the basic idea. (I'll test later how well this works...)
This one would (or should) replace
doit('[foo] ... [i]Hello [b]World[/b][/i]. Hello [i]internet[/i].')
By
'[foo] ... <em>'.doit('Hello [b]World[/b]').'</em>. Hello <em>internet</em>.'
and
'[foo] ... <em>Hello<strong>World</strong></em>. Hello <em>internet</em>.'
in the end.
> a) You still don't get proper matches for recursively stacked tags, or am I
> missing something here? This only works for tags, which are not stacked.
Sorry, my English isn't that good, I don't understand what you want to say with "recursively stacked tags".
> b) You still don't get proper error reporting, f.e. "[a] ... [b] .. [/b]" would just
> leave the [a] unmatched & silently ignored.
Yep. This is intended behavior in my eyes.
I will have a look at your links... -
Nikita Popov at Fri, 22 Jan 2010 18:21:18 +0100
Ooops, im sorry, mixed up $matches[1] and $matches[2] in the code above.
This one works (tested):
function doit($matches) {
    // Called by preg_replace_callback() with the match array
    if (is_array($matches)) {
        $start = $end = '';
        if ($matches[1] == 'b') {
            $start = '<strong>';
            $end = '</strong>';
        }
        elseif ($matches[1] == 'i') {
            $start = '<em>';
            $end = '</em>';
        }
        elseif ($matches[1] == 'quote') {
            $start = '<blockquote>';
            $end = '</blockquote>';
        }
        $content = $matches[2];
        if (preg_match('#\[(?>b|i|quote)]#si', $content)) // optimisation: do not apply the whole regex if there is no further nesting
            $content = doit($content);
        return $start . $content . $end;
    }
    $regexp = '#\[(b|i|quote)]((?>[^[]|\[(?!/?\1])|(?R))+)\[/\1]#si';
    return preg_replace_callback($regexp, 'doit', $matches); // this is for initialization and recursion later
}
-
Anonymous at Tue, 13 Apr 2010 05:01:14 +0200
BBCode isn't entirely useless, as it allows bbcode implementers to design custom tags without forcing posters to use invalid html tags.
A common example of this is the [spoiler] tag, which makes text invisible until a mouseover. If you tried to implement this on a site that allowed posters to use a subset of HTML (<b><i> etc), having users type in <spoiler> could cause malformed XML. This can be avoided, of course, by sanitizing the comment right after submission, but why would you want to confuse people by having them refer to a nonexistent HTML tag anyways?
You can, of course, debate the importance of a spoiler tag, but it's just an example. If you're going the route of custom tags, it's better to go out and say that you're non-standard from the beginning than to pretend that you're using valid HTML. -
Anon Cow at Wed, 14 Apr 2010 05:25:24 +0200
Worth pointing out that if you allow a whitelist of HTML rather than using BBCode you will inevitably get someone complaining that they can't use a certain tag, especially in forum/comment and other public style software. So, from a usability perspective BBCode beats HTML because it manages the users expectations of what they can and can't do.
A Rich Text style UI is obviously the better for simple formatting, although it has usability drawbacks for more complex formatting and features such as [code] or [spoiler] tags, as well as the kinds of tagging Wiki uses. -
Ornela @ cheap host at Thu, 22 Jul 2010 16:32:21 +0200
i agree, using wysiwyg is much more simple for clients.
-
Akki at Sun, 01 Aug 2010 09:44:38 +0200
The reason you would want to use BBCode over HTML is because you don't want to force the whole standard on your users. They don't care for things like semantics, they just want that word in bold.
You also have a lot of associations that are made simple:
[img]url[/img] => <img src="url" />
[b]text[/b] => <strong>text</strong>
[i]text[/i] => <em>text</em>
For youtube: [yt]id[/yt] => (insert mountain of code here)
[size=3]text[/size] => <h2>, <h3>, or some CSS thing
You also have questions like, should the users insert their own p tags?
And I disagree Kore, ] is not easier to type then > (because of the SHIFT involved).
Anyway, great article on why you can't/shouldn't be using regex to parse. :) -
deejayy at Fri, 06 Aug 2010 14:02:50 +0200
Yeah, and please substitute this with html:
[flash]/upload/demo.swf[/flash]
-
SasQ at Wed, 01 Sep 2010 03:53:31 +0200
It's fun to see people saying "I *can* parse HTML/BBcode with RegExps" and then pasting a code which is nothing less than a simple top-down recursive-descent parser encoded in PHP's functions/callbacks ;)
And as to the non-HTML BBcodes like `[spoiler]` or `[flash]`:
You can simply allow something like this: `<div class="spoiler">...</div>` for special blocks. For the Flash you can use: `<a class="flash">http://path.to/flash.swf</a>` and substitute it with proper markup in your script. SWFObject does exactly that, but from the JS level. -
SasQ at Wed, 01 Sep 2010 04:17:15 +0200
@Nikita Popov:
"Sorry, my English isn't that good, I don't understand what you want to say with 'recursively stacked tags'."
Something like that:
<p>These are <i><b>nested/stacked</b> tags</i>. Try that.</p>
Here the tags are nested hierarchically, one into another. You can think of it as of parentheses:
(These are ((nested/stacked) tags). Try that.)
And you can probably see that you have to check if they're nested properly. If they're not, like in this example:
<p>These are <i>badly <b>nested</i> tags</b>.</p>
a RegExp can't handle it, because it understands only a fixed depth of nesting (Chomsky's level 3 grammar). You need a level 2 grammar parser which understands recursive, hierarchical, self-similar patterns.
But it doesn't mean that you have to write a full-blown parsing engine ;) Often it could be done by a simple recursive-descent parser using recursive function calls for each non-terminal symbol of the grammar.
Extracting data from HTML on Sun, 24 May 2009 12:55:49 +0200 in Kore Nordmann - PHP / Projects / Politics
A lot of people try to scrape content from HTML - the first approach always
seems to be regular expressions, which are incapable of parsing HTML, as
I proved earlier already. So, how to do it properly with PHP?
Parse with regular expressions on Sun, 24 May 2009 12:59:22 +0200 in Kore Nordmann - PHP / Projects / Politics
With recursive patterns in PCRE you can actually match recursive structures,
even though you should not try this. A regular expression to validate BBCode
documents is included in the blog post.