Kore Nordmann - PHP / Projects / Politics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author:
  Kore Nordmann
:Date:
  Mon, 16 Mar 2009 22:35:56 +0100
:Revision:
  3
:Copyright:
  CC by-sa

======================
The document component
======================

:Description:
  Conversion between different document formats can be really hard. The
  document component can help you there, and already implements support
  for several markup languages...

Conversion between different document markup languages is not only common
in Content Management Systems; everyday applications like wikis or forums
do it all the time, too. In most cases just one language is used in the
user interface, like some wiki dialect or BBCode, which is then rendered to
HTML. But at some point you might also want to integrate technical
documentation written in Docbook, or render PDF documents for printing.

The `document component`__, from the `eZ Components`__ project, knows about
different markup languages and can convert between all of them, while the
conversion process itself stays highly configurable. The currently known
languages are:

Docbook
    An XML-based document markup language, developed since the early 1990s.
    It implements the most complete document markup.

ReST
    ReStructured Text is a text-based format with quite complete markup. It
    is especially nice to read when viewing the plain text files, but can
    still be converted to lots of different output formats.

HTML
    We all know this. But HTML has issues when considered as a document
    markup language, because most documents are framed by layout,
    navigation, etc. The document component implements a filter stack with
    some heuristics for efficient HTML conversion.

Wiki
    There are hundreds of wiki markup flavours out there. The document
    component can currently read three flavours (Creole, Dokuwiki and
    Confluence) and write one (Creole). Most wiki dialects lack support for
    common markup, like footnotes, citations or even inner page links.

eZ XML
    The XML-based markup language used internally by eZ Publish.

To make the power of these conversions visible, I quickly hacked up a
`small application`__, which allows you to enter text in a textarea,
validate the input and convert it between all those markup languages. Have
fun playing with that.

__ http://ezcomponents.org/docs/tutorials/Document
__ http://ezcomponents.org/
__ http://k023.de/document/

The code
========

The code that does all the conversion is just this, plus some additional
error handling::

    // Map the format names used by the application to document classes.
    $classes = array(
        'rst'     => 'ezcDocumentRst',
        'docbook' => 'ezcDocumentDocbook',
        'creole'  => 'ezcDocumentWiki',
        'xhtml'   => 'ezcDocumentXhtml',
        'ezxml'   => 'ezcDocumentEzXml',
    );

    // $from, $to and $text are assumed to hold the submitted source
    // format, target format and input text.
    $sourceClass = $classes[$from];
    $source = new $sourceClass();
    $source->options->errorReporting = E_PARSE;
    $source->loadString( $text );

    $destinationClass = $classes[$to];
    $destination = new $destinationClass();
    $destination->options->errorReporting = E_PARSE;
    // Convert through Docbook, the intermediate format.
    $destination->createFromDocbook( $source->getAsDocbook() );

    echo $destination;

It first instantiates the document class representing the input format and
loads the passed text. After that, the destination format class is created
and its content is set from the converted source document.

The document component uses Docbook as an intermediate format, because it
covers all markup used by any of the other languages. If required,
conversion shortcuts can be implemented, for example for direct ReST to
HTML conversion.

More examples and starting points for extending the document component with
custom markup or custom conversions are offered in the tutorial__.

__ http://ezcomponents.org/docs/tutorials/Document

The future
==========

Currently we are working on adding support for ODF reading and writing, to
make the Open Document Format an equal member in the set of formats.
Additionally, we are implementing user-customizable PDF rendering for all
documents in this release cycle.
This will use different backends, like `pecl/haru`__, for the actual PDF
creation and focus on styling and proper text rendering. `The design
document`__ for this is, as always, available in the `eZ Components SVN`__.

You are welcome to contribute support for more markup formats, like
additional wiki dialects, BBCode or Markdown. Contact us on IRC__ or using
the `mailing list`__.

__ http://pecl.php.net/package/haru
__ http://svn.ez.no/svn/ezcomponents/trunk/Document/design/pdf_design.txt
__ http://svn.ez.no/svn/ezcomponents/trunk/
__ http://ezcomponents.org/support/irc
__ http://ezcomponents.org/support/mailinglist

Comments
========

- harald at Mon, 16 Mar 2009 20:31:13 +0100

  this is slightly off-topic, but: does anyone know of any ReST
  parser/formatter written in php? i had a look at eZ Components a few days
  ago -- but they seem to just use the python document-tools for this?

- Kore at Mon, 16 Mar 2009 20:41:39 +0100

  No, "they" / we / I do not use docutils for this. I implemented a ReST
  parser in PHP. The Horde project did the same. But writing a ReST parser
  is actually not trivial, since ReST cannot even be specified using a
  context-free grammar, so common parser approaches fail here.

  I would say that the ReST parser in the document component is now quite
  complete and relatively bug-free. At least we have lots of tests for it,
  and are happy to improve it any time you find something broken. I also do
  not know of any items in the ReST specification that are not implemented,
  except for some of the default directives, which you can always plug in
  yourself.

- peter at Mon, 16 Mar 2009 22:29:27 +0100

  You always write "the document component". What document component?

- Kore at Mon, 16 Mar 2009 22:35:23 +0100

  @peter: Sorry if that was unclear - the document component from the eZ
  Components project. I added some more links to clarify that.

- harald at Tue, 17 Mar 2009 13:05:29 +0100

  many thanks -- i think i have to have a closer look at eZ Components! the
  document component seems to be very useful, too.

- Non_E at Sat, 25 Apr 2009 12:38:29 +0200

  Hello, this is a really good idea. I think I will watch its progress,
  because format conversion is really necessary. I just do not like the
  "new $sourceClass" method of creating instances. I think that it would be
  wise to force some mandatory interface. That's why I prefer another
  approach: "new ezcDocument(ezcDocumentEngine $engine)". The class and
  interface names were chosen arbitrarily for this example.

- Rick at Sun, 03 Oct 2010 10:50:00 +0200

  it's so good to know that ReST is supported. I have never been in the
  situation to convert document formats myself, but this will surely be of
  help. will recommend to ppl who want to do conversions. thanks.