Kore Nordmann - PHP / Projects / Politics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :Author: Kore Nordmann :Date: Sat, 14 Feb 2009 20:55:45 +0100 :Revision: 6 :Copyright: CC by-sa =========================================== Do NOT try parsing with regular expressions =========================================== :Description: Often people try to use regular expression to parse markup like BBCodes or HTML. Why this will never work, or can work, and why BBCode suck in general. As a reference to reduce the amount of explanations on this topic for the future. .. contents:: Table of Contents :depth: 3 Some days ago somebody joined a german IRC__ channel where we try to help others using PHP, where I hang around for some years now, and had some problems with his regular expressions he tried to use to "parse" BBCode__ he wants to use in his custom application. I answered something like: __ irc://euirc.net/#php __ http://en.wikipedia.org/wiki/BBCode It is impossible to parse a language like BBCode with regular expressions because you only may parse regular languages using regular expressions. The usual reaction: No reaction. No belive. Some days later nearly the same question again, from the same guy, same answer by me, same result. No, not really the same result, I decided to try to summarize the reasons for this in a small blog post. .. note:: If you are just looking for a way to "parse" HTML, I added a new blog entry, which describes how to do that: `"Extracting data from HTML"`__. __ http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html What are BBCodes? ================= BBCodes, as you could see from the wikipedia link mentioned above, are the same class of language like HTML. In my eyes it does not make any sense not to use (X)HTML, but a markup language which has exactly the same complexity in terms of parsing and typing of the user. BBCode is nothing more then a subset of HTML using [] insted of <>. Stupid, but OK, we now have this kind of markup. Having a language like HTML, more or less XML, we get the same problems here. If you want to do the parsing properly you need to handle recursive tag structures, need to check for matching tags etc. You may of course just replace the BBCode with the according HTML elements, but the you might not only end up with invalid code, but also with user provided inlined HTML which destroys the layout of your website. The second thing you might do is just checking for a closing tag for each opened tag, or just replace BBCodes, which also have a closing tag, but you still easily find examples where it creates invalid markup, even it might not always destroy your layout, but result in undefined behaviour of the displaying browser. :: [b] hello [i] [/b] world [/i] [b] ! [/b] You may also check for this using a regular expression, but trust me, I will *always* find a way to bypass your checks - the next paragraph will prove this. Parsing recursive structures with regular expressions ===================================================== It is quite easy to *prove*, that it is not possible to properly detect and parse recursive structures using regular expressions. When you have studied computer science you of course know `Chomsky hierarchies`__ and therefore know, that regular expressions are a type 3 grammar, also called regular languages, which are equivalent to `finite state machines`__. (So you also know, that each regular expression is quite easy to transform into a FSM). __ http://en.wikipedia.org/wiki/Chomsky_hierarchy __ http://en.wikipedia.org/wiki/Finite_state_machine So lets take a step further and prove, that recursive structures are not parseable. Using the `pumping lemma`__ you may not prove, that a language is a regular language, but you can prove, that a language is not a regular language. __ http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages Talking about rekursive stuff lets take one of the simplest possible examples - the braces language. You may think of this like opening and closing elements instead of opening and closing braces. Consider the laguage: :: L = { (^m )^m; m >= 1 } All words, where the number of opening braces equals the number of closing braces. Then one number n exists, so that every word x, which is element of this language, and has the length 2 * n can be stripped into: :: x = u v w = (^n )^n Where u, v and w are words fulfilling the following conditions: - Length of the word uv is less or equal n - Length of the word uvw is of course two times n The pumping lemma now says the that each word of the following definition always is element of the defined language, when L is a regular language. :: u v^k w; k >= 1 With this definitions you for example get something like: :: u = (^( n - i ) v = (^i w = )^n Using for example k = 2 and i = 1, you get a word which is not element of the language we defined at the beginning: :: x = (^( n + 1) )^n The numbers of opening braces obviously do not match the number of closing braces any more, so that this word is not a member of the prior defined language. Thats all you need to prove the point. I can match this language with PCRE! ==================================== Yes, you can. `PCRE`__ can do more then regualar expressions can do by definition. The class of languages you can detect with PCRE regular expressions is, as far as I know, not clearly defined. The number of back references is limited (as the memory is in every computer, but 99 back references are far less, then some GBs of memory for your parser ;), and so on. __ http://php.net/pcre Some people say, that regular expressions are never readable, even I disagree here, I would say writing a real parser using PCRE regular expression will definitely be unreadable. But I will evaluate this in further detail and write a follow up on this. Where you can use regular expressions in your parser ==================================================== When you really want to parse BBCodes write a common `compiler`__ using `syntax trees`__ etc. - or simply use a common rich text editor and strip down the resulting (X)HTML to the set you want to allow in your application. `DOM`__ and similar PHP extensions do a great job here - and they ensure valid markup. __ http://en.wikipedia.org/wiki/Compiler __ http://en.wikipedia.org/wiki/Syntax_tree __ http://php.net/DOM When writing a custom parser you can - and should - use regular expressions in the tokenizer. They do a great job here. To rephrase it: Use regular expressions to recognize words, not structures. Using BBCode in your application ================================ As written earlier - I personally do not see any sense in using BBCodes and even wiki markup, because their complexity equals the complexity of HTML, they are another language to learn for the user, they differ in details in each implementation, and do not reuse potential knowledge of the user about HTML. And even, if the user does not know HTML yet, why should it be harder to learn your current subset, then your special BBCode implementation? Are <> really so much harder to type then []? On german keyboards they are even a lot easier to type... But if you really want to use such stuff, there are existing packages which help you using them in your application, so that you can reuse existing working parsers and not end up with not working regular expression experiments, and continue investing hours in your application to fix some funny edge cases. Some examples: - `patBBCode`__ - `PEAR::Text_Wiki_BBCode`__ __ http://www.php-tools.net/site.php?file=patBBCode/overview.xml __ http://pear.php.net/package/Text_Wiki_BBCode Trackbacks ========== - Extracting data from HTML on Sun, 24 May 2009 12:55:49 +0200 in Kore Nordmann - PHP / Projects / Politics A lot of people try to scrape content from HTML - the first approach always seem to be regular expressions, which are incapable of parsing HTML - which I proved earlier, already. So, how to do it properly with PHP? - Parse with regular expressions on Sun, 24 May 2009 12:59:22 +0200 in Kore Nordmann - PHP / Projects / Politics With recursive patterns in PCRE you can actually match recursive structures, even you should not try this. A regular expression to validate BBCode documents is included in the blog post. Comments ======== - Whisller at Sat, 28 Jul 2007 02:28:16 +0200 I exactly think like You. I don't understand why users think [b] or *string* is better than . I'm using html purifier, I think it is good library - but now I don't have a time to read documentation and code files. - Martin Holzhauer at Sat, 28 Jul 2007 07:15:56 +0200 There is also an bbcode Extension[1] in Pecl in an early development state 1. http://pecl.php.net/package/bbcode/ - kore at Sat, 28 Jul 2007 10:07:21 +0200 I see some sense in using *foo* and _bar_ for markup, because this established in the Usenet, maybe even before HTML was used, and is still used in a lot of text base applications, like Mail and IRC. ... but you just could keep those, because everyone will understand this markup even without conversion to HTML, I think. - Toby at Sat, 28 Jul 2007 10:18:35 +0200 Wer hat Dich denn geärgert? ;) - Christopher Hogan at Sun, 29 Jul 2007 02:48:45 +0200 I concur, I use a form of markdown if the client requests it in the CMS. But my software also provides a secondary option using tinyMCE for clients that want WYSIWYG editing functionality. I think wysiwyg is WAY more simple for clients. Since they TinyMCE Moxicode editor displays html pretty much the same way dreamweaver woud, the only other steps I take to ensure valid XHTML are: 1.) use DOM to validate 2.) using the parse tags functionality in DOM I check the content in the tags for ISO-8859 extended char sets and filter them (htmlentities function and htmlspecialchars functions won't replace them into entities correctly). I do this by using PHP 5.2 and the new filter_input/filter_var/filter_array functions 3.) encode ampersands/ addslashes or check for / use magic quotes if I'm putting it in an RDBMS. 4.) My version of the backend code for TinyMCE uses AJAX to update elements on parts of the page dynamically and re-export the files as regular old HTML. - Ammar Ibrahim at Sun, 29 Jul 2007 08:46:13 +0200 Exactly! This is also similar discussion to templating engines, where they try to simplify if statements and such. The thing they don't understand is that it makes things harder. Wiki syntax is very hard compared to HTML, I still don't understand why it was invented. But the only advatnage of BBCode is the protection against XSS, it's very easy to strip out all HTML, and then convert BBCode to HTML. But if you allow users to write HTML, sanitizing HTML is extremely complicated. - your mother at Tue, 28 Aug 2007 10:37:46 +0200 i hope you are brushing your teeth dear! - Boris at Wed, 29 Aug 2007 17:01:11 +0200 Actually there is a good point for using BBCode instead of a small html subset, it is much easier to avoid CSS-attacks. - Kore at Wed, 29 Aug 2007 23:53:07 +0200 @Boris: No, actually not. You can just do the same white listing approach. With BBCode you only convert a subset, while you just allow a very specific subset with (X)Html - in most cases just do not allow any attributes and you are safe. - _p_ at Fri, 31 Aug 2007 17:56:34 +0200 Isn't it a secure way to generate markup? strip tags and then replace bbcode with regular expressions to have the control over what we generate ((X)HTML etc.). - Kore at Sat, 01 Sep 2007 11:47:56 +0200 @_p_: Please actually read the blog post before commenting. This would help you to answer properly. Regular expressions are just not capable of converting BBCode to HTML. - Evert at Mon, 03 Sep 2007 17:50:50 +0200 I use regular expressions for parsing non-valid html.. I do need some PHP around it to aid it though :) I do think BBCode is more secure than trying to clean html. Black lists for HTML almost never are 100% waterproof, just because there are so many variations. By using BBCode, everything that wasn't handled correctly by the parser will simply be spit out as their BBCode.. e.g.: [script] is harmless. Making the same mistake with cleaning html would produce