Extracting data from HTML

First published at Saturday, 14 February 2009

Warning: This blog post is more than 16 years old – read and use with care.

A lot of people try to scrape content from HTML. The first approach always seems to be regular expressions, which are incapable of parsing HTML, as I proved earlier. So, how do you do it properly with PHP?

This is quite trivial and intuitive to do, and simpler than writing regular expressions is for most people. PHP has a fantastic DOM extension, which builds on top of libxml2 and can work not only with XML, but also with HTML. Let's take a look at a snippet for extracting all links from a website:

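<?php
// Collect libxml warnings internally instead of emitting PHP warnings.
$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();

// Parse the page; malformed markup is repaired where possible.
$html = new DOMDocument();
$html->loadHtmlFile( 'https://kore-nordmann.de/blog.html' );

// Select every <a> element in the document.
$xpath = new DOMXPath( $html );
$links = $xpath->query( '//a' );

foreach ( $links as $link )
{
    echo $link->getAttribute( 'href' ), "\n";
}

// Restore the previous libxml error handling.
libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );
?>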

That's all, and it even works for websites that do not pass validators and throw a lot of errors. The function libxml_use_internal_errors( true ) tells libxml not to expose DOM warnings and errors through PHP's error reporting system, but to store them internally. All validation errors can be requested later using the libxml_get_errors() function. We also clear any errors that occurred earlier, so we get a clean new list in case something happened before.
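To look at what libxml stored, the collected errors can be read back before they are cleared. The following is a minimal sketch, assuming a small, deliberately broken markup string invented for this illustration; which errors libxml reports for it, if any, depends on the installed libxml2 version.

```php
<?php
// Hypothetical, deliberately broken markup used only for this illustration.
$markup = '<p>Unclosed paragraph<div><a href="/a">A</a></span>';

$oldSetting = libxml_use_internal_errors(true);
libxml_clear_errors();

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Each entry is a LibXMLError object carrying level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The link is still reachable, even though the markup was not valid.
echo $doc->getElementsByTagName('a')->length, " link(s) found\n";

// Leave libxml in the state we found it in.
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);
```

Logging these messages instead of printing them is usually enough to spot pages whose markup is too broken to trust.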

Then you can just load the HTML file, using the special methods loadHtml() or loadHtmlFile(). You should normally use the latter, because the encoding detection works far better this way. These methods can handle and correct a lot of common mistakes made in HTML markup and will return a clean XML representation.
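If the markup is already available as a string and loadHtml() has to be used, one approach that tends to work is to make sure the markup declares its own encoding. This is a sketch under that assumption, with an invented UTF-8 document; it is not a statement about how loadHtmlFile() detects encodings internally.

```php
<?php
// Hypothetical UTF-8 markup that declares its encoding in a meta element.
$markup = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body><p>Grüße</p></body></html>';

$oldSetting = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

// With the declaration in place the umlauts should survive parsing.
$paragraph = $doc->getElementsByTagName('p')->item(0);
echo $paragraph->textContent, "\n";

// saveXML() serializes the repaired tree as XML.
echo $doc->saveXML($doc->documentElement), "\n";

libxml_clear_errors();
libxml_use_internal_errors($oldSetting);
```

Without such a declaration libxml has to guess the encoding, which is where garbled non-ASCII characters usually come from.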

After that you can just query the contents of the HTML document using XPath (Toby and Jakob just published a good and complete guide to XPath). In most cases you don't even need complex XPath queries; a simple "//a" or similar will do the job.

The returned DOMNodeList has implemented Traversable for several versions now, so you can just iterate over it using foreach and do further processing. In the end you should always reset the libxml error reporting to its original state, so you do not unintentionally mess with other parts of the application.
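A few slightly more specific queries, as a rough sketch: the markup and the id "first" are made up for this example, and only XPath 1.0 features are used.

```php
<?php
// Hypothetical markup for illustration only.
$markup = '<html><body>'
    . '<div id="first"><a href="/one">One</a> <a href="http://example.org/two">Two</a></div>'
    . '<div id="second"><a name="anchor">No href</a></div>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// All <a> elements that actually carry a href attribute.
$withHref = $xpath->query('//a[@href]');

// Only the links inside the element with id "first".
$inFirst = $xpath->query('//*[@id="first"]//a');

// The href attribute nodes whose value starts with "http".
$absolute = $xpath->query('//a[starts-with(@href, "http")]/@href');

echo $withHref->length, " links with href\n";
echo $inFirst->length, " links inside #first\n";
foreach ($absolute as $attribute) {
    echo $attribute->value, "\n";
}
```

Selecting by id or by attribute this way keeps working when unrelated parts of the page change, as long as the structure being queried stays the same.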

This example just echoes a list of all links in the parsed HTML document, with no complex regular expressions (which will never reliably do the right thing; that's proven, believe me), but just a trivial XPath query.

Remember that not all websites allow content scraping. It might violate their terms of service, or you can easily violate somebody's copyright by embedding scraped content in your application. Use with caution.
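To make the difference concrete, here is a small comparison on invented markup in which one link is commented out. The pattern is only an assumed example of a typical first attempt, not a recommended expression.

```php
<?php
// Hypothetical markup: one real link and one link that is commented out.
$markup = '<html><body>'
    . '<a href="/current">Current</a>'
    . '<!-- <a href="/removed">Removed</a> -->'
    . '</body></html>';

// Naive pattern, assumed here as a typical first attempt.
preg_match_all('#<a[^>]+href\s*=\s*["\']([^"\']*)["\']#i', $markup, $matches);
$regexLinks = $matches[1];

// The DOM parser treats the comment as a comment, not as markup.
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$domLinks = [];
foreach ($xpath->query('//a[@href]') as $link) {
    $domLinks[] = $link->getAttribute('href');
}

echo 'regex: ', implode(', ', $regexLinks), "\n"; // regex: /current, /removed
echo 'DOM:   ', implode(', ', $domLinks), "\n";   // DOM:   /current
```

The regular expression reports the commented-out link as well, because it matches text without knowing which parts of the document are markup.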

Comments

pb at Saturday, 14.2. 2009

You should check phpQuery ( http://code.google.com/p/phpquery/ ) which is a great helper for solving more complex extraction tasks.

Permana Jayanta at Saturday, 14.2. 2009

Nice info, thanks ...

Diogo at Saturday, 14.2. 2009

Good tip!

And it's awesome to have somewhere to point programmers to when we're asked about parsing HTML with regular expressions.

dodger at Sunday, 15.2. 2009

Kore,

I am so sorry but this is __VERY__ wrong. Your approach is OK for amateur scraping but in no way for professional, high-quality scraping.

Reasons:

1. It's slow. Fucking slow. If you scrape millions of pages per day, it is useless.
2. It's not robust. It does not survive changes in the HTML, whereas regexes have at least a slight chance of still working.
3. You might not believe it, but there is a lot of __REALLY__ wrong HTML out there, which has a tendency to confuse the DOM parsing.

We tried. Actually this is the first thing that seems obvious, and I can tell you it is the wrong way. There is only one worse option: trying to fix the HTML before parsing it with the DOM (htmlTidy etc.).

Thomas Koch at Sunday, 15.2. 2009

Hi Kore,

could you explain to me in which way loadHtmlFile is better than loadHtml with regard to encoding detection? I suppose loadHtmlFile also tries to get the encoding information from the HTTP header, which is not available to loadHtml? Can I get the same result by manually passing the value of the content-encoding header to DOMDocument->encoding? I cannot use loadHtmlFile in my application since loading and parsing are done by two independent processes.

kore at Sunday, 15.2. 2009

@dodger:

We are talking about different applications here - your use case is "slightly" different compared to what most people do when it comes to content scraping. ;-)

But let's argue in detail: 1) That might be true, but it is completely irrelevant unless you really want to scrape millions of sites. For "normal" extraction libxml is just fast enough.

2) It shouldn't be less robust. Actually, with XPath 2, XPath is a Turing-complete language, which gives you MUCH more power than regular expressions have. That also makes it possible to scrape content based on structural aspects of the website, which regular expressions are not able to express at all. And you can do exactly the same things you can do with regular expressions as well (already with XPath 1): scrape contents based on loose aspects of the markup.

3) This might be true, but I am talking about extracting data from HTML. Sometimes the markup used cannot be considered HTML anymore, and then it might really confuse DOM too much, even though it handles a whole lot of broken markup very well. :-)

I know you tried and switched, but still: DOM + XPath is always the way to start. That is especially true for beginners, who are the target audience of this blog post. You might want to switch to regular expressions if there are good arguments for it (like the ones you gave for your use case). But don't get the beginners started with that, and always remember that each regular expression needs to be handcrafted for every site (even for something trivial like link extraction), since regular expressions just CANNOT parse HTML / XML properly (as proven before).

kore at Sunday, 15.2. 2009

@Thomas Koch:

I think that is exactly what it does - but I did not have a look at the actual source code. It is just what I experienced. Not sure if setting that property helps - just try it...

dodger at Sunday, 15.2. 2009

Ok point taken Kore :)

When will you visit Munich again ? Let's meet for a beer :)

sky at Saturday, 21.2. 2009

Thanks... your code is very simple :) it's easy to understand...

Cezary Tomczak at Sunday, 21.6. 2009

@Kore:

When I copy the source code from your page, all new lines are gone in my editor, and makes it a one-line unreadable spaghetti code ;) (Editplus , XP)

Why would a beginner prefer writing 11 lines of code instead of 3?

$content = file_get_contents('https://kore-nordmann.de/blog.html');
preg_match_all('#<a[^<>]+href\s*=\s*[\'"]([^\'"]*)[\'"]#i', $content, $matches);
foreach ($matches[1] as $url) { echo $url.'<br>'; }

Btw. your solution missed 1 link, which regexp found: "http://mozilla.org/firefox" in the head section inside cdata.

Xpath = 61 links
Regexp = 62 links

Regexp > Xpath

Simple math ;)

kore at Sunday, 21.6. 2009

@Cezary Tomczak: Is this a satire?

Referenced in the blog post is the proof that it is not possible to parse recursive structures with regular expressions. There is no point in arguing about that. Regular expressions might, of course, work in some cases for such examples, but won't work in every case.

And with your example you actually provided a case where they do not work. Check the XML specification and you will notice that content in a CDATA section should not be considered markup, and therefore the "link" in there is a false positive of your regular expression.

Besides that, your regular expression has several other flaws, which could easily be exposed by crafting HTML for them. But there is no point in that, since there already is a mathematical proof.

Cezary Tomczak at Monday, 22.6. 2009

Okay, I'm not an expert, but referring to your example, regexp works just fine.

XML specification, sure, theory is good, but the Firefox link that the regexp found is a real link, visible to IE users, and it is an important link, I think!

The snippet you provided as an example was "to extract all links from a website".

I would define it another way (as it works): "to extract all links from a website, except links that are visible to IE users, or except links that are visible in browsers with JavaScript enabled, maybe except some of the links from ajax-based websites which are probably rare and useless nowadays".

;)

kore at Monday, 22.6. 2009

@Cezary Tomczak: Sigh. Standards for markup languages are there for a reason: they define how to process contents. If you want something else and want to find just "any link in some binary stuff", you of course do not need a parser for a markup language, but something that tokenizes the binary stuff the way you want. Regular expressions work fine in tokenizers, of course.

Regular expressions fail in far more cases when "parsing" HTML, by design, as shown by the cited proof. Just because you found one (wrong) example where you think they behave more like you want, that does not make them better by any means.

To answer your (wrong) suggestion: Regular expressions won't find any links fetched using ECMAScript (JavaScript, Ajax, ..) either. You need a full ECMAScript engine for that, which would use DOM again to traverse the markup tree - notice anything here?

Cezary Tomczak at Monday, 22.6. 2009

Okay, I got it, just teasing with you ;) -- Regular Expressions Fanatic

james at Friday, 17.7. 2009

Interesting approach to inspect clean HTML.

However, you should use a SAX parser if you need performance, like XMLReader for PHP.

For really bad HTML, Beautiful Soup for Python is the tool you need.

Also, don't forget about Perl for data processing, that's what the language was made for.

Thanks for posting, XPath is interesting indeed!

-- "Choose the right tool"

Matbaa at Saturday, 24.10. 2009

Very good posting. But libxml2 is not available if you use shared hosting.

Mega at Thursday, 18.3. 2010

it is really working. Thank you so much for sharing.

Doris Nguyen at Wednesday, 1.9. 2010

If I want to get only a part of a website, for example the element with ID='first', how can I do that?

Matbaa at Friday, 3.9. 2010

Thank you for sharing tesafen has come across a very nice blog

Kartvizit at Thursday, 30.9. 2010

Thank you very much. However, as usual was a useful sharing. I wish these posts and more.

anand at Saturday, 18.12. 2010

NICE VERY NICE...I LOVE IT

JulesR at Thursday, 3.2. 2011

Thanks a lot Kore. This is great and I do agree with the approach, but hey! xpath is honestly a pain if you need to extract something in minutes.

For those --even programmers-- who want to grab data without a line of code, there are pretty good tools around. I'm not even talking about Yahoo Pipes, which is great but kind of complex to set up... The most powerful among the easy ones is by far OutWit Hub - http://www.outwit.com. Scrapers (with regexps if you want), macros, jobs... In most projects, I manage with just the hub instead of PHP. YPipes is also a fantastic technology if you have some time. In terms of speed the fast scraping mode in OutWit is really amazing. It won't work if you need to log in or if the page is dynamically built in JavaScript, but for most cases, I really recommend it. So once again: this post is great and I totally agree with it as well as with your replies, but the pride of making nice lines of code often pushes us to reinvent what already exists... Thanks for your posts. J

Nina at Friday, 31.8. 2012

Helped me a lot to run some xpath-queries, thanks kore!

Garet Claborn at Thursday, 14.11. 2013

Whoever is saying that regex will be faster is off their rocker. Regex is a slow, lazy way to parse in any situation. In fact, that's what regex literally is and why it was made.

Different libraries implementing proper character parsing certainly are not all made alike, but you should never use regex to parse anything except front-end tasks, if you must. Sadly, regex has many fanboys to defend it - and that's about the long and short of its only saving grace.

If you are parsing millions of pages using regex, baka, you're inherently looping back and forth, in and out of scopes. Each time you find the end of a structure your application will have to back up in the file and start from the beginning.

No way you will ever have a single-pass parser with regex. With the DOM extension, you can easily target better performing libraries as needed, since it is at least structured correctly. I don't have any comment about libxml2's algorithm, since it is not always single pass either.

If you want to do something right, get a single-pass, multithread per file, SAX parser. Otherwise, what are you doing complaining about performance? Use clean logic first.
