Extracting data from HTML - Kore Nordmann

A lot of people try to scrape content from HTML. The first approach is almost always regular expressions, which are incapable of parsing HTML - as I have shown before. So, how do you do it properly with PHP?

This is quite trivial and intuitive to do - simpler than writing regular expressions is for most people. PHP ships with the fantastic DOM extension, which builds on top of libxml2 and can work not only with XML, but also with HTML. Let's take a look at a snippet for extracting all links from a website:

<?php

$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();

$html = new DOMDocument();
$html->loadHtmlFile( 'http://kore-nordmann.de/blog.html' );

$xpath = new DOMXPath( $html );
$links = $xpath->query( '//a' );
foreach ( $links as $link )
{
    echo $link->getAttribute( 'href' ), "\n";
}

libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );

?>

That's all, and it even works for websites which do not pass validators but throw a lot of errors. The call libxml_use_internal_errors( true ) tells libxml not to expose DOM warnings and errors through PHP's error reporting system, but to store them internally. All validation errors can be requested later using the libxml_get_errors() function. We also clear any errors that occurred before, so we start with a clean list.
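To see what libxml collects internally, here is a minimal sketch: it parses some deliberately broken markup (made up for illustration) and then walks the list returned by libxml_get_errors(), where each entry is a LibXMLError object with line, column and message properties.

```php
<?php

// Minimal sketch: inspect the errors libxml collected while parsing.
// The broken markup below is made up for illustration.
$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();

$html = new DOMDocument();
$html->loadHTML( '<p><b>bold<i>nested</p><span>' );

foreach ( libxml_get_errors() as $error )
{
    // Each entry is a LibXMLError object.
    printf( "Line %d: %s", $error->line, $error->message );
}

libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );
```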

Then you can just load the HTML file, using the special methods loadHtml() or loadHtmlFile(). You should normally use the latter, because the encoding detection works far better this way. These methods can handle and correct a lot of common mistakes in HTML markup and will return a clean XML representation.
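If you already have the markup in a string - fetched with some HTTP client, for example - loadHTML() works the same way. A tiny sketch, with made-up markup:

```php
<?php

// Sketch: parse HTML from a string instead of a file or URL.
// The markup is made up for illustration.
$html = new DOMDocument();
$html->loadHTML( '<html><head><title>Example</title></head><body><p>Hi</p></body></html>' );

// The corrected markup gives us a proper DOM tree to work with.
echo $html->getElementsByTagName( 'title' )->item( 0 )->textContent, "\n";
```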

After that you can just query the contents of the HTML document using XPath (Toby and Jakob just published a good and complete guide to XPath). In most cases you don't even need any complex XPath queries - a simple "//a" or similar will do the job.
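When you do want something slightly more specific, plain XPath 1.0 functions usually suffice. A sketch, with made-up markup, that selects only anchors whose href starts with "http":

```php
<?php

// Sketch: a slightly more specific XPath query - only anchors whose
// href starts with "http". The markup is made up for illustration.
$html = new DOMDocument();
$html->loadHTML(
    '<body>
        <a href="http://example.com/">external</a>
        <a href="/blog.html">internal</a>
    </body>'
);

$xpath = new DOMXPath( $html );
$links = $xpath->query( '//a[starts-with( @href, "http" )]' );

foreach ( $links as $link )
{
    echo $link->getAttribute( 'href' ), "\n";
}
```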

The returned DOMNodeList has implemented Traversable for several versions now, so you can just iterate over it using foreach and do further processing. And in the end you should always reset the libxml error reporting to its original state, to not unintentionally mess with other parts of the application.
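Putting it all together, the whole recipe fits into one small function that restores the libxml error state before returning the collected links (the function name is made up):

```php
<?php

// Sketch: the whole recipe wrapped in a function, so the libxml error
// state is always restored before returning. The name is made up.
function extractLinks( $file )
{
    $oldSetting = libxml_use_internal_errors( true );
    libxml_clear_errors();

    $html = new DOMDocument();
    $html->loadHtmlFile( $file );

    $links = array();
    $xpath = new DOMXPath( $html );
    foreach ( $xpath->query( '//a' ) as $link )
    {
        $links[] = $link->getAttribute( 'href' );
    }

    // Restore the original error handling before handing back the result.
    libxml_clear_errors();
    libxml_use_internal_errors( $oldSetting );

    return $links;
}
```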

This example just echoes a list of all links in the parsed HTML document - with no complex regular expressions (which won't do the right thing anyway - never - that's proven - believe me), but just a trivial XPath query.

Remember that not all websites allow content scraping. Scraping might violate their terms of service, and you can easily violate somebody's copyright by embedding scraped contents in your application. Use with caution.