Extracting data from HTML
A lot of people try to scrape content from HTML. The first approach always seems to be regular expressions, which are incapable of parsing HTML, as I proved earlier. So, how do you do it properly with PHP?
This is quite trivial and intuitive to do - simpler than writing regular expressions is for most people. PHP has this fantastic DOM extension, which builds on top of libxml2 and can work not only with XML, but also with HTML. Let's take a look at a snippet for extracting all links from a website:
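<?php
// Collect libxml errors internally instead of emitting PHP warnings
$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();

// Parse the (possibly invalid) HTML document
$html = new DOMDocument();
$html->loadHtmlFile( 'http://kore-nordmann.de/blog.html' );

// Select all anchor elements and print their href attributes
$xpath = new DOMXPath( $html );
$links = $xpath->query( '//a' );
foreach ( $links as $link )
{
    echo $link->getAttribute( 'href' ), "\n";
}

// Restore the previous libxml error handling
libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );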
?>
That's all, and it even works for websites that do not pass validators but throw a lot of errors. The function libxml_use_internal_errors( true ) tells libxml not to expose DOM warnings and errors through PHP's error reporting system, but to store them internally. All validation errors can be requested later using the libxml_get_errors() function. We also clear any errors that occurred earlier, so we start with a clean list in case something happened before.
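To make the error handling a bit more tangible, here is a minimal sketch of how the internally stored errors could be inspected. The sample markup is made up for illustration, and which errors libxml actually reports (if any) depends on the installed libxml2 version, so treat the output as informational only:

```php
<?php
// Hypothetical sample markup with deliberate mistakes (unclosed <b>, stray </div>).
$markup = '<html><body><p>Broken <b>markup</div></body></html>';

$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();

$html = new DOMDocument();
$html->loadHTML( $markup );

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ( $errors as $error )
{
    printf( "Line %d: %s\n", $error->line, trim( $error->message ) );
}

// Restore the previous state, as in the snippet above.
libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );
```

The document is still parsed and usable, no matter how many entries end up in the list.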
Then you can just load the HTML file, using the special methods loadHtml() or loadHtmlFile(). You should normally use the latter, because the encoding detection works far better this way. These methods can handle and correct a lot of common mistakes made in HTML markup and will return a clean XML representation.
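As a rough illustration of that clean-up, the following sketch feeds a deliberately sloppy string to loadHtml() and prints the repaired tree. The markup is invented for this example, and since it is loaded from a string, the encoding caveat mentioned above applies:

```php
<?php
// Hypothetical fragment: no <html>/<body> wrapper, and the <p> tags are never closed.
$markup = '<p>First paragraph<p>Second paragraph with a <a href="/blog.html">link</a>';

$oldSetting = libxml_use_internal_errors( true );

$html = new DOMDocument();
$html->loadHTML( $markup );

libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );

// saveXML() serializes the repaired tree as XML, including the implied
// <html> and <body> elements and the closed <p> tags.
echo $html->saveXML();
```

The two unclosed paragraphs end up as two sibling p elements, which is what makes the later XPath queries reliable.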
After that you can just query the contents of the HTML document using XPath (Toby and Jakob just published a good and complete guide to XPath). In most cases you don't even need complex XPath queries; a simple "//a" or similar will do the job.
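If "//a" is too broad, a slightly more specific query is usually enough. The sketch below assumes a page with a div carrying the id "content"; both the markup and the id are made up for illustration:

```php
<?php
// Hypothetical page: navigation links, content links and an anchor without href.
$markup = '<html><body>'
        . '<div id="nav"><a href="/index.html">Home</a></div>'
        . '<div id="content">'
        . '<a href="/post-1.html">Post 1</a>'
        . '<a name="top">No href here</a>'
        . '</div>'
        . '</body></html>';

$oldSetting = libxml_use_internal_errors( true );

$html = new DOMDocument();
$html->loadHTML( $markup );
$xpath = new DOMXPath( $html );

// Only anchors inside the content div which actually carry a href attribute.
$links = $xpath->query( '//div[@id="content"]//a[@href]' );

$hrefs = array();
foreach ( $links as $link )
{
    $hrefs[] = $link->getAttribute( 'href' );
}

libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );

echo implode( "\n", $hrefs ), "\n";
```

Here only "/post-1.html" is selected; the navigation link and the anchor without a href attribute are skipped.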
The returned DOMNodeList has implemented Traversable for several versions now, so you can just iterate over it using foreach and do further processing. And in the end you should always reset the libxml error reporting to its original state, so you do not unintentionally mess with other parts of the application.
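One way to make sure the error reporting really is reset, even if something throws in between, is to wrap the whole thing in a function with try / finally. This is only a sketch: extractLinks() is a hypothetical helper name, not part of the DOM extension, and finally requires PHP 5.5 or later:

```php
<?php
/**
 * Hypothetical helper: returns the href values of all links in an HTML string.
 */
function extractLinks( $markup )
{
    $oldSetting = libxml_use_internal_errors( true );

    try
    {
        $html = new DOMDocument();
        $html->loadHTML( $markup );

        $xpath = new DOMXPath( $html );
        $hrefs = array();
        foreach ( $xpath->query( '//a[@href]' ) as $link )
        {
            $hrefs[] = $link->getAttribute( 'href' );
        }

        return $hrefs;
    }
    finally
    {
        // Runs on return and on exceptions, so the caller's setting is restored.
        libxml_clear_errors();
        libxml_use_internal_errors( $oldSetting );
    }
}

$found = extractLinks( '<a href="/a.html">A</a><a href="/b.html">B</a>' );
echo implode( "\n", $found ), "\n";
```

The calling code does not need to know anything about libxml error handling this way.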
This example just echoes a list of all links in the parsed HTML document, with no complex regular expressions (which won't always do the right thing - never - that's proven - believe me), but just a trivial XPath query.
Remember that not all websites allow content scraping. It might violate their terms of service, or you can easily violate somebody's copyright by embedding scraped content in your application. Use with caution.
Trackbacks
Comments
-
pb at Sat, 14 Feb 2009 14:50:02 +0100
You should check phpQuery ( http://code.google.com/p/phpquery/ ) which is a great helper for solving more complex extraction tasks.
-
Permana Jayanta at Sat, 14 Feb 2009 16:05:01 +0100
Nice info, thanks ...
-
Diogo at Sat, 14 Feb 2009 16:08:19 +0100
Good tip!
And it's awesome to have somewhere to point programmers to when we're asked about parsing HTML with regular expressions. -
dodger at Sun, 15 Feb 2009 10:19:34 +0100
Kore,
I am so sorry but this is __VERY__ wrong. Your approach is ok for amateur scraping but in no way for professional, high-quality scraping.
Reasons:
1. It's slow. Fucking slow. If you scrape Millions of pages per day - useless.
2. Not robust. It does not survive changes in HTML, whereas Regex do have at least a slight chance that they still work
3. You might not believe it, but there is a lot of __REALLY__ wrong HTML out there which has a tendency to confuse the DOM parsing
We tried. Actually this is the first thing that seems obvious, and I can tell you this is the wrong way. There is only one worse option - try to fix the HTML before parsing with the DOM (htmlTidy etc.)
:) -
Thomas Koch at Sun, 15 Feb 2009 11:53:13 +0100
Hi Kore,
could you explain to me in which way loadHtmlFile is better than loadHtml with regard to encoding detection? I suppose loadHtmlFile also tries to get the encoding information from the HTTP header, which is not available to loadHtml?
Can I get the same result by manually passing the value of the content-encoding header to DOMDocument->encoding?
I can not use loadHtmlFile in my application since loading and parsing is done by two independent processes. -
kore at Sun, 15 Feb 2009 13:39:25 +0100
@dodger:
We are talking about different applications here - your use case is "slightly" different compared to what most people do when it comes to content scraping. ;-)
But let's argue in detail:
1) That might be true - but it is completely irrelevant, unless you really want to scrape millions of sites. For "normal" extraction libxml is just fast enough.
2) It shouldn't be less robust. Actually with XPath2, XPath is a Turing complete language, which gives you MUCH more power than regular expressions have - which also makes it possible to scrape content based on structural aspects of the website, which regular expressions are not even able to express at all. And you can do exactly the same you can do with regular expressions as well (already with XPath 1) - scrape contents based on loose aspects of the markup.
3) This might be true - but I am talking about extracting data from HTML - sometimes the markup used cannot be considered HTML anymore, and then it might really confuse DOM too much. Even though it really handles a *whole*lot* of broken markup very well. :-)
I know you tried and switched - but still - DOM + XPath is always the way to start with. This is especially true for beginners, who are the target audience of this blog post. You might want to switch to regular expressions, if argued well (like you did for your use case). But don't get the beginners started with that, and always remember that each regular expression needs to be handcrafted for every site (even for something trivial like link extraction), since they just CANNOT parse HTML / XML properly (as proven before). -
kore at Sun, 15 Feb 2009 13:41:53 +0100
@Thomas Koch:
I think that is exactly what it does - but I did not have a look at the actual source code. It is just what I experienced. Not sure if setting that property helps - just try it... -
dodger at Sun, 15 Feb 2009 16:08:17 +0100
Ok point taken Kore :)
When will you visit Munich again ? Let's meet for a beer :)
-
sky at Sat, 21 Feb 2009 17:09:13 +0100
Thanks... your code is very simple :)
it's easy to understand... -
Cezary Tomczak at Sun, 21 Jun 2009 20:10:53 +0200
@Kore:
When I copy the source code from your page, all newlines are gone in my editor, which turns it into one line of unreadable spaghetti code ;) (Editplus , XP)
Why would a beginner prefer writing 11 lines of code instead of 3?
$content = file_get_contents('http://kore-nordmann.de/blog.html');
preg_match_all('#<a[^<>]+href\s*=\s*[\'"]([^\'"]*)[\'"]#i', $content, $matches);
foreach ($matches[1] as $url) { echo $url.'<br>'; }
Btw. your solution missed 1 link, which regexp found: "http://mozilla.org/firefox" in the head section inside cdata.
Xpath = 61 links
Regexp = 62 links
Regexp > Xpath
Simple math ;) -
kore at Sun, 21 Jun 2009 22:29:56 +0200
@Cezary Tomczak: Is this a satire?
Referenced in the blog post is the proof that it is not possible to parse recursive structures with regular expressions. There is no point in arguing about that. Regular expressions might, of course, work in some cases for such examples, but won't work in every case.
And you actually provided an example where they do *not* work. Check the XML specification and you will notice that stuff in a CDATA section should not be considered markup, and therefore the "link" in there is a false positive of your regular expression.
Besides that, your regular expression has several other flaws, which could easily be spotted by crafting HTML for those - but there is no point in that, since there already is a mathematical proof for that. -
Cezary Tomczak at Mon, 22 Jun 2009 12:22:51 +0200
Okay, I'm not an expert, but referring to your example, regexp works just fine.
XML specification, sure, theory is good, but the Firefox link that regexp found is a real link, visible to IE users, and it is an important link I think!
The snippet you provided as an example was "to extract all links from a website".
I would define it another way (as it works): "to extract all links from a website, except links that are visible to IE users, or except links that are visible in browsers with JavaScript enabled, maybe except some of the links from Ajax-based websites, which are probably rare and useless nowadays".
;) -
kore at Mon, 22 Jun 2009 17:32:18 +0200
@Cezary Tomczak: *sigh* Standards for markup languages are there for a reason, they define how to process contents. If you want something else and find just "any link in some binary stuff" you of course do not need a parser for a markup language, but something to tokenize the binary stuff in a way you want - regular expressions work fine in tokenizers, of course.
Regular expressions fail in far more cases when "parsing" HTML, by design - as shown by the cited proof. Just because you found one (wrong) example where *you* think they behave more like you want, it does *not* make them better by any means.
To answer your (wrong) suggestion: Regular expressions won't find any links fetched using ECMAScript (JavaScript, Ajax, ..) either. You need a full ECMAScript engine for that, which would use DOM again to traverse the markup tree - notice anything here? -
Cezary Tomczak at Mon, 22 Jun 2009 20:48:18 +0200
Okay, I got it, just teasing you ;)
--
Regular Expressions Fanatic -
james at Fri, 17 Jul 2009 17:13:23 +0200
Interesting approach to inspect clean HTML.
However, you should use a SAX parser if you need performance, like XMLReader for PHP.
For really bad HTML, BeautifulSoup for Python is the tool you need.
Also, don't forget about Perl for data processing, that's what the language was made for.
Thanks for posting, XPath is interesting indeed!
--
"Choose the right tool" -
Matbaa at Sat, 24 Oct 2009 10:05:21 +0200
Very good posting. But libxml2 is not available if you use shared hosting.
Do not try using regular expressions for parsing on Sun, 24 May 2009 13:00:32 +0200 in Kore Nordmann - PHP / Projects / Politics
Often people try to use regular expressions to parse markup like
BBCodes or HTML. Why this will never work, or can work, and why BBCodes suck
in general. As a reference to reduce the amount of explanations on this
topic for the future.
Extracting data from HTML documents with Zend_Dom_Query on Tue, 12 Jan 2010 15:41:58 +0100 in sorgalla.com
When faced with the problem of extracting data from HTML documents, regular expressions are often the first thing that comes to mind. Kore Nordmann, however, has proven that it is not possible to parse HTML with regular expressions and recommends instead...